KR101355496B1 - Scheduling mechanism of a hierarchical processor including multiple parallel clusters

Info

Publication number: KR101355496B1
Application number: KR1020087007583A
Authority: KR
Inventors: Andrew F. Glew
Original assignee: The Invention Science Fund I, LLC
Priority date: 2005-08-29
Filing date: 2006-08-28
Publication date: 2014-01-28
Also published as: WO2007027671A2; KR20080043378A; GB0805594D0; GB2444455A; WO2007027671A3

Abstract

Various embodiments of a hierarchical processor are described.

Description

SCHEDULING MECHANISM OF A HIERARCHICAL PROCESSOR INCLUDING MULTIPLE PARALLEL CLUSTERS

The present invention relates to a hierarchical processor.

Many different microprocessors are available, and they may use a variety of microarchitectures.

FIG. 1 is a block diagram illustrating an instruction pipeline of a processor 100 according to an example embodiment.

FIG. 2 is a block diagram illustrating a multi-level instruction scheduler according to an example embodiment.

FIG. 3 is a block diagram illustrating a multi-level instruction scheduler according to an example embodiment.

FIG. 4 is a block diagram illustrating an example system.

FIG. 5 is a diagram illustrating an example embodiment in which the level 2 scheduler 126 is coupled in parallel with the level 1 schedulers.

FIG. 6 is a block diagram in which the mapper may also be coupled directly to level 1, according to another example embodiment.

FIG. 7 is a block diagram illustrating a multi-level register file according to an example embodiment.

FIG. 8 is a block diagram illustrating a multi-level register file according to an example embodiment.

FIG. 9 is a diagram illustrating an example embodiment of a bypass network.

FIG. 10 is a block diagram illustrating a bypass network according to an example embodiment.

FIG. 11 is a block diagram illustrating the use of an inter-cluster bypass mechanism, or of a level 2 register file, to provide inter-cluster communication according to an example embodiment.

FIG. 12 is a diagram illustrating a store buffer according to an example embodiment.

FIG. 13 is a block diagram illustrating data paths between the store buffers of various clusters.

FIG. 14 is a block diagram of a processor illustrating an example use of a trace log.

FIG. 15 is a block diagram of a multi-core processor 1500 according to an example embodiment.

I. General Description of an Example Processor Microarchitecture

Referring to the drawings, in which like numerals indicate like elements, FIG. 1 is a block diagram illustrating an instruction pipeline of a processor 100 according to an example embodiment. According to an example embodiment, processor 100 may be hierarchical, or may include one or more stages that are multi-level. In an example embodiment, one or more pipeline stages may be grouped into a cluster (or execution cluster). Processor 100 may include multiple parallel clusters, for example with one or more stages replicated in each cluster to provide parallel processing paths.

Referring to FIG. 1, the instruction pipeline of processor 100 may include a number of pipeline stages. One or more of the pipeline stages may include multiple structures or may be multi-level. Processor 100 may include an instruction fetch unit (not shown) to fetch instructions, and an instruction pointer (IP) 112 to provide the address of the next instruction to be decoded. Processor 100 may include one or more branch predictors to predict whether a branch will be taken, such as a level 1 branch predictor (BP1) 114, a level 2 branch predictor (BP2) 122, and a branch predictor queue (BPQ) 127. Processor 100 may also include one or more instruction caches to cache or store instructions, such as a level 1 instruction cache (I$1) 116 and a level 2 instruction cache (I$2) 124. An instruction decoder 118 may decode architectural instructions into one or more micro-operations (uops). As those skilled in the art will appreciate, the terms uop and instruction are used interchangeably here, because some microprocessors (e.g., recent Pentium processors) translate instructions into a simpler form (i.e., uops), while others (e.g., PowerPC) require no such translation. The concepts disclosed here apply equally to either approach; the only difference is the presence of a more complex decode stage.

Processor 100 may include a mapper (or register renamer), such as a level 1 mapper (M1) 120 and/or a level 2 mapper (M2) 150, to map architectural (or virtual) registers to physical registers. In general, one or more instruction schedulers may schedule uops for execution, for example when the operands for an instruction are ready and the appropriate execution resources are available. According to an example embodiment, the scheduler may be a single scheduler, or may include a multi-level scheduler (or multiple schedulers), such as a level 2 scheduler (S2) 126 and one or more level 1 schedulers (S1) 132.

According to an example embodiment, processor 100 may include one or more clusters in parallel, each cluster including one or more pipeline stages. In an example embodiment, the pipeline stages of a cluster may be replicated or duplicated for each of the multiple clusters to provide parallel processing paths. In the example processor shown in FIG. 1, processor 100 may include one or more clusters 130, such as clusters 130A, 130B, and 130C. Although three clusters are shown in the example processor of FIG. 1, any number of clusters may be used, and the clusters may be heterogeneous.

Referring to FIG. 1, cluster 130A may include a level 1 scheduler 132A, a level 1 register file (RF1) 134A, an operand capture array (OC) 135A to capture operands and provide them to the execution units, one or more execution units 136A to execute uops (or other types of instructions), a level 1 store buffer (SB1) 138A to store data to be written to memory, a level 1 data cache (D$1) 140A to cache or store data, and a level 1 instruction window (IW1) 142A that may assist in the early stages of uop retirement. The other clusters 130B and 130C may likewise include one or more stages. For example, cluster 130B may include one or more of the following: a level 1 scheduler 132B, a level 1 register file 134B, an operand capture array 135B, an execution unit 136B, a level 1 store buffer 138B, a level 1 data cache 140B, and a level 1 instruction window 142B. Similarly, cluster 130C may include, for example, one or more of the following: a level 1 scheduler 132C, a level 1 register file 134C, an operand capture array 135C, an execution unit 136C, a level 1 store buffer 138C, a level 1 data cache 140C, and a level 1 instruction window 142C.

Each cluster 130 (e.g., any of 130A, 130B, or 130C) may include the stages shown in FIG. 1, may include a different set of stages, or may include only a subset of the stages shown for the clusters 130 in FIG. 1. For example, in one embodiment, cluster 130A may include a level 1 scheduler 132A, a level 1 register file 134A, an execution unit 136A, and a level 1 data cache 140A. In that case, cluster 130A may or may not include stages such as the operand capture array 135A, the level 1 store buffer 138A, and the level 1 instruction window 142A. In another example embodiment, cluster 130A may include a level 1 scheduler 132A, a level 1 register file 134A, and an execution unit 136A. Many other combinations may be used for a cluster 130.

Thus, the stages or structures provided within each cluster may be considered per-cluster structures. For example, one or more of a level 1 scheduler (S1) 132, a level 1 register file (RF1) 134, an operand capture array (OC) 135, an execution unit 136, a level 1 store buffer (SB1) 138, a level 1 data cache (D$1) 140, and a level 1 instruction window (IW1) 142 may be provided in each cluster (that is, provided on a per-cluster basis).

In addition, one or more stages (or structures) provided within a cluster 130 may be part of a multi-level structure, in which a first level of the structure (level 1) is provided on a per-cluster basis and a second level of the structure (level 2) is provided for multiple clusters or for all clusters (as an inter-cluster structure). For example, a multi-level scheduler may be provided that includes a level 1 scheduler (S1) 132A, 132B, or 132C (provided on a per-cluster basis) and an inter-cluster level 2 scheduler (S2) 126 provided for multiple (or all) clusters.

In addition, a multi-level register file may include, for example, level 1 register files (RF1) 134A, 134B, 134C provided per cluster, and an inter-cluster level 2 register file (RF2) 152. A multi-level store buffer may include, for example, level 1 store buffers (SB1) 138A, 138B, 138C provided per cluster, and an inter-cluster second-level (L2) store buffer (SB2) 154 provided for multiple or all clusters. Level 2 register file 152 may store execution results for instructions, which may be used as operands for other instructions. Level 2 register file 152 may also include a level 2 instruction window that may handle the retirement of instructions.

A multi-level data cache may include per-cluster first-level (L1) data caches (D$1) 140A, 140B, 140C and an inter-cluster level 2 data cache (D$2) 156. A multi-level instruction window may include per-cluster first-level (L1) instruction windows (IW1) 142A, 142B, 142C and an inter-cluster level 2 instruction window (IW2), which may be provided, for example, as part of level 2 register file 152.

For example, the use of multi-level stages allows smaller and/or faster structures to be provided within the clusters, where they can be closer to the execution units 136, while larger and perhaps slower structures are provided for the stages used by multiple (or all) clusters. This multi-level arrangement allows certain time-sensitive tasks to be placed in the smaller or faster structures located near the execution units, improving processing or execution speed, while other tasks are assigned to the larger structures that may be common to multiple clusters.

In addition, according to an example embodiment, and as noted above, the branch predictor, instruction cache, and mapper stages may also be multi-level. Each may include both a per-cluster structure and an inter-cluster structure (not shown in FIG. 1), or may include, for example, multiple inter-cluster structures (e.g., as shown in the example of FIG. 1).

Example features and operation of the stages of the example processor 100 of FIG. 1 will now be described in more detail. Instruction pointer (IP) 112 may identify or point to the location in memory from which the next instruction may be fetched. According to an example embodiment, level 1 branch predictor 114 may predict whether a branch instruction is present at that location and whether the branch will be taken, and may write the address of the branch instruction and the prediction to branch predictor queue 127. Level 2 branch predictor 122 may read the predictions from branch predictor queue 127 and verify them. In an example embodiment, level 1 branch predictor 114 may be a relatively fast branch predictor, while level 2 branch predictor 122 may be larger and slower than predictor 114 but more accurate. Branch predictors 114, 122 may verify or check the accuracy of their predictions based on execution results received, for example, via line 125. Branch predictors 114, 122 may be any type of branch predictor.
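The interaction between a fast level 1 predictor, the branch predictor queue (BPQ) 127, and a slower level 2 predictor can be illustrated with a small software model. The following is only a sketch under stated assumptions: the class `TwoLevelPredictor`, the 1-bit table standing in for BP1, and the 2-bit saturating counters standing in for BP2 are hypothetical choices made for illustration; the text above states that the predictors may be of any type.

```python
from collections import deque


class TwoLevelPredictor:
    """Toy model (hypothetical): L1 guesses are queued and later checked by L2."""

    def __init__(self):
        self.l1_table = {}     # branch address -> last observed direction (1 bit)
        self.l2_counters = {}  # branch address -> 2-bit saturating counter (0..3)
        self.bpq = deque()     # queue of (address, L1 prediction), oldest first

    def predict_l1(self, addr):
        """Fast prediction; the address and prediction are written to the queue."""
        taken = self.l1_table.get(addr, False)
        self.bpq.append((addr, taken))
        return taken

    def verify_l2(self):
        """Read the oldest queued prediction and compare it with the L2 prediction."""
        addr, l1_pred = self.bpq.popleft()
        l2_pred = self.l2_counters.get(addr, 1) >= 2
        return addr, l1_pred, l2_pred, l1_pred == l2_pred

    def update(self, addr, taken):
        """Train both levels with the actual outcome reported by execution."""
        self.l1_table[addr] = taken
        counter = self.l2_counters.get(addr, 1)
        self.l2_counters[addr] = min(3, counter + 1) if taken else max(0, counter - 1)
```

In this sketch, a branch that is usually taken but was not taken once makes the 1-bit level 1 table flip to "not taken", while the level 2 counter still predicts "taken", so `verify_l2` reports a disagreement that a real design could use to redirect fetch.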

Processor 100 may also include one or more instruction caches to cache instructions. For example, instructions may initially be stored or cached in level 1 instruction cache 116 and then written to level 2 instruction cache 124, and so on. A least recently used (LRU) algorithm or another caching algorithm may be used to manage the instructions stored in instruction caches 116, 124. Instruction caches 116 and/or 124 may be any type of instruction cache, such as a cache for architectural instructions, a decoded instruction cache (or micro-operation cache), a trace cache, and the like. Instruction decoder (D) 118 may be coupled to instruction caches 116 and/or 124 to decode architectural instructions into one or more micro-operations (uops).

Resources may be allocated for each decoded uop (e.g., by level 1 mapper 120, or by another structure or stage that may or may not be shown in FIG. 1). This allocation may include, for example, allocating, for each uop, an entry in the level 2 register file to store the execution result of the uop. The entry in level 2 register file 152 for a uop may also include a field indicating the state of the uop. The different states of a uop that may be tracked in its entry in register file 152 may include, for example: the uop is scheduled for execution; the uop is executing; the uop has completed execution and its result has been written back to the register file entry; the uop is ready for retirement; and the uop is being retired. This allocation may be performed by an allocator stage (not shown; it may be provided, for example, immediately before mapper 120), or by another stage such as level 1 mapper 120.
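The per-uop entry described above, holding both a result and a state field, can be sketched as a small data structure. This is an illustrative assumption rather than the disclosed implementation: the names `UopState`, `RF2Entry`, and `Level2RegisterFile`, and the first-free allocation order, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class UopState(Enum):
    """The uop states listed in the text, tracked in the register file entry."""
    SCHEDULED = auto()
    EXECUTING = auto()
    COMPLETED = auto()        # result written back to the entry
    READY_TO_RETIRE = auto()
    RETIRING = auto()


@dataclass
class RF2Entry:
    state: UopState = UopState.SCHEDULED
    result: Optional[int] = None

    def write_back(self, value: int) -> None:
        """Record the execution result and mark the uop as completed."""
        self.result = value
        self.state = UopState.COMPLETED


class Level2RegisterFile:
    """Hypothetical allocator: one entry per in-flight uop."""

    def __init__(self, size: int):
        self.entries: List[Optional[RF2Entry]] = [None] * size

    def allocate(self) -> Optional[int]:
        """Return the index of a newly allocated entry, or None if the file is full."""
        for index, entry in enumerate(self.entries):
            if entry is None:
                self.entries[index] = RF2Entry()
                return index
        return None  # no free entry; a real pipeline would stall here

    def release(self, index: int) -> None:
        """Free the entry once the uop has retired."""
        self.entries[index] = None
```

Keeping the state field in the same entry as the result is one way a level 2 register file could also serve as a level 2 instruction window, as the text suggests.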

The mapper (or register renamer) in processor 100 may be a single structure or may be multi-level. According to an example embodiment, processor 100 may include a limited set of architectural registers (e.g., eax, ebx, ...) that are visible to or accessible by the programmer. Processor 100 may include a larger set of physical registers, shown as the level 2 register file (some of which may be cached by the level 1 register file 134 and/or the operand capture array 135). A uop may include multiple fields, for example fields specifying two source operands and a destination operand. Each of these operands or fields may reference one of the architectural registers. According to an example embodiment, level 1 mapper 120 may associate each uop field that references an architectural register with a register in level 2 register file 152. Level 1 mapper 120 may store or maintain a map, or register alias table (RAT), that records the mapping of architectural registers to physical registers (e.g., registers in level 2 register file 152).

When a new uop is received at level 1 mapper 120, a physical register in level 2 register file 152 may be allocated for the uop's execution result, and the uop's register operands may be mapped to point to the appropriate physical registers in level 2 register file 152. An updated map is generated, and older maps, representing earlier states of the architectural-to-physical register mapping (e.g., earlier in the uop stream), may also be stored in level 1 mapper 120 or moved to level 2 mapper 150.
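The renaming step just described can be made concrete with a short sketch. It assumes a dictionary-based register alias table, a simple free list of physical registers, and full-map snapshots for the older maps; the class name `Level1Mapper` and its `rename` method are hypothetical and are not taken from the disclosure.

```python
class Level1Mapper:
    """Minimal register-renaming sketch with a register alias table (RAT)."""

    def __init__(self, arch_regs, num_physical):
        # Initially, architectural register i maps to physical register i.
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))
        # Older maps; a multi-level design might move these to a level 2 mapper.
        self.old_maps = []

    def rename(self, dest, srcs):
        """Map source operands through the current RAT, then allocate a new
        physical register for the destination and update the RAT."""
        phys_srcs = [self.rat[s] for s in srcs]  # sources see the current map
        if not self.free:
            raise RuntimeError("no free physical register")
        self.old_maps.append(dict(self.rat))     # snapshot of the previous state
        phys_dest = self.free.pop(0)
        self.rat[dest] = phys_dest
        return phys_dest, phys_srcs
```

For example, with architectural registers `eax` and `ebx` and four physical registers, renaming `eax = eax op ebx` returns destination 2 with sources `[0, 1]`, and a following uop that reads `eax` is mapped to physical register 2.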

According to an example embodiment, processor 100 may accommodate a single thread, or may accommodate multiple threads (multithreading). A thread may comprise a basic unit of programming. Threads and clusters 130 may be related. Multiple parallel threads may share (or run on) a cluster. A single thread may run on multiple clusters. In addition, processor 100 may implement a policy of per-cluster thread affinity; that is, where possible, processor 100 may assign one thread per cluster, although this is not required. A thread may migrate from one cluster to another, and a first thread may spawn (or fork) a second thread, which may, for example, be placed on a separate cluster.

According to an example embodiment, a single instruction scheduler may be used. According to another embodiment, a multi-level scheduler may be used, such as a combination of an inter-cluster level 2 scheduler (S2) 126 and a per-cluster level 1 scheduler (S1) 132 (e.g., scheduler 132A for cluster 130A, scheduler 132B for cluster 130B, and scheduler 132C for cluster 130C).

Level 2 scheduler 126 may perform several tasks. Scheduler 126 may implement a policy that assigns threads or individual uops to clusters according to specific criteria. For example, level 2 scheduler 126 may assign a first thread to cluster 130A, a second thread to cluster 130B, and a third thread to cluster 130C. Alternatively, the scheduler may implement a load-balancing policy, in which it assigns uops so as to roughly balance the uop load across the available clusters, for example to provide greater processing throughput or to use the available processing resources more efficiently. The level 2 scheduler may also send each uop to a selected cluster (a selected level 1 scheduler) based on a policy such as load balancing or thread affinity, or on some other policy. For example, each uop may include a thread ID identifying the thread with which the uop is associated. Level 2 scheduler 126 may send each uop to a cluster based on the thread ID of the uop (e.g., assigning one thread per cluster).

As another example, if a first thread assigned to a first cluster spawns a second thread, scheduler 126 may assign the second thread to a second cluster. Thereafter, uops associated with the spawned thread may be sent by scheduler 126 to the second cluster, while uops associated with the original thread may, for example, continue to be sent to the first cluster.
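The cluster-assignment policies described for level 2 scheduler 126 (one thread per cluster where possible, rough load balancing, and placing a spawned thread on a different cluster) can be combined in a short sketch. The class `ClusterAssigner`, its methods, and the use of a per-cluster uop count as the load metric are assumptions made for illustration only.

```python
class ClusterAssigner:
    """Hypothetical model of how a level 2 scheduler might pick a cluster per uop."""

    def __init__(self, num_clusters):
        self.load = [0] * num_clusters  # uops sent to each cluster so far
        self.thread_to_cluster = {}     # thread ID -> assigned cluster index

    def _least_loaded(self, exclude=frozenset()):
        candidates = [c for c in range(len(self.load)) if c not in exclude]
        if not candidates:              # e.g., a single-cluster processor
            candidates = list(range(len(self.load)))
        return min(candidates, key=lambda c: self.load[c])

    def route(self, thread_id):
        """Route one uop: uops of the same thread go to the thread's cluster;
        a thread seen for the first time is placed on the least-loaded cluster."""
        if thread_id not in self.thread_to_cluster:
            self.thread_to_cluster[thread_id] = self._least_loaded()
        cluster = self.thread_to_cluster[thread_id]
        self.load[cluster] += 1
        return cluster

    def spawn(self, parent_id, child_id):
        """Assign a spawned thread to a cluster other than its parent's, if any."""
        parent_cluster = self.thread_to_cluster.get(parent_id)
        child_cluster = self._least_loaded(exclude={parent_cluster})
        self.thread_to_cluster[child_id] = child_cluster
        return child_cluster
```

With three clusters, routing uops of threads 1 and 2 places them on clusters 0 and 1; if thread 1 then spawns thread 3, the sketch places thread 3 on cluster 2, and subsequent uops of thread 1 continue to go to cluster 0.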

In an example embodiment, level 2 scheduler 126 may store, for each uop, operand status information indicating when each source operand of the uop is available and ready. Level 2 scheduler 126 may send a uop to a level 1 scheduler after the source operands of the uop are available, or it may send the uop to a level 1 scheduler speculatively, before the operands are ready. In an example embodiment, the level 2 scheduler may send uops to the level 1 schedulers in groups, such as groups of three uops, four uops, five uops, and so on. A group of uops sent by scheduler 126 to a selected cluster may include uops that form a dependency chain within the group. For example, when level 2 scheduler 126 detects that one or more source operands of a first uop are now ready, level 2 scheduler 126 may send the first uop to a level 1 scheduler together with one or more additional uops that depend on the first uop, depend on the same operand that was detected as ready, or are otherwise related to the first uop. These are only examples of features and operations that level 2 scheduler 126 may perform, and the disclosure is not limited to them.
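One way to form such a group is to walk the pending uops in program order and collect those that read the newly ready operand or a result produced by a uop already in the group. The sketch below assumes that selection rule; the `Uop` record, the function `select_group`, and the default group size of four are hypothetical and serve only to illustrate grouping by dependency chain.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Uop:
    name: str
    dest: int               # physical destination register
    srcs: Tuple[int, ...]   # physical source registers


def select_group(pending: List[Uop], ready_reg: int, max_group: int = 4) -> List[Uop]:
    """Collect, in program order, uops that read ready_reg or that depend
    (directly or transitively) on a uop already placed in the group."""
    group: List[Uop] = []
    produced = set()  # registers that uops already in the group will write
    for uop in pending:
        if len(group) == max_group:
            break
        if ready_reg in uop.srcs or produced & set(uop.srcs):
            group.append(uop)
            produced.add(uop.dest)
    return group
```

Given pending uops `a` (reads register 1), `b` (reads `a`'s result), `c` (unrelated), `d` (reads `b`'s result), and `e` (reads register 1), a call with `ready_reg=1` and `max_group=3` selects `a`, `b`, and `d`, leaving `c` and `e` for later.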

Each level 1 scheduler 132 (e.g., 132A, 132B, 132C) may receive uops from level 2 scheduler 126. Each level 1 scheduler 132 may also maintain, for each uop it receives, operand status information indicating when each source operand of the uop is available and ready. In an example embodiment, each level 1 scheduler 132 may schedule or dispatch each individual uop for execution, for example when the execution resources (e.g., the required execution unit 136) are available and the operands of the uop are ready. Alternatively, level 1 scheduler 132 may speculatively dispatch a uop to execution unit 136 for execution even when its source operands are not yet ready.
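The non-speculative dispatch rule (issue a uop when its operands are ready and a suitable execution unit is free) can be sketched as follows. The class `Level1Scheduler`, the oldest-first issue order, and the assumption that every execution unit becomes free again each cycle are simplifications introduced here for illustration; they are not stated in the text.

```python
class Level1Scheduler:
    """Hypothetical per-cluster scheduler: wake up operands, then dispatch."""

    def __init__(self, units):
        self.units = dict(units)  # unit kind -> number of units, e.g. {"alu": 2}
        self.waiting = []         # (uop name, unit kind, set of unready source regs)

    def accept(self, name, kind, srcs, ready_regs):
        """Receive a uop from the level 2 scheduler and record unready operands."""
        self.waiting.append((name, kind, set(srcs) - set(ready_regs)))

    def wakeup(self, reg):
        """Mark register reg as ready for every waiting uop."""
        for _, _, unready in self.waiting:
            unready.discard(reg)

    def dispatch(self):
        """One cycle: issue, oldest first, each uop whose operands are all ready
        and for which an execution unit of the required kind is still free."""
        free = dict(self.units)   # assumption: all units are free every cycle
        issued, still_waiting = [], []
        for name, kind, unready in self.waiting:
            if not unready and free.get(kind, 0) > 0:
                free[kind] -= 1
                issued.append(name)
            else:
                still_waiting.append((name, kind, unready))
        self.waiting = still_waiting
        return issued
```

A speculative variant, as the text also allows, would relax the `not unready` check and replay the uop if an operand turned out not to be available.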

Each cluster 130 may include an execution unit (X) 136 (e.g., execution unit 136A for cluster 130A, execution unit 136B for cluster 130B, and execution unit 136C for cluster 130C). Although any number and arrangement of execution units may be used, each execution unit 136 may include, for example, two arithmetic logic unit (ALU) execution units and two memory execution units. The memory execution units may include, for example, a memory store (memory data write) execution unit to perform memory stores and a memory load (memory read) execution unit to perform memory loads.

A multi-level store buffer may be used, which may include, for example, an inter-cluster (or shared) level 2 store buffer (SB2) 154 and per-cluster level 1 store buffers (SB1) 138 (e.g., store buffer 138A for cluster 130A, store buffer 138B for cluster 130B, and store buffer 138C for cluster 130C). The level 2 store buffer 154 may allow a thread to be spread across multiple clusters, for example through thread migration. When a uop is a memory store instruction, an entry may be allocated in the per-cluster level 1 store buffer (SB1) 138 of the selected cluster (e.g., by the level 2 scheduler 126 or the level 1 scheduler 132) to hold the data to be written to memory. According to an exemplary embodiment, the store value may initially be written to the associated level 1 store buffer (e.g., store buffer 138A for a store instruction in cluster 130A). When space is available in the level 2 store buffer, the store value may be written from the level 1 store buffer 138 to the level 2 store buffer 154, as part of a write-through or other cache coherence algorithm, to keep the data in the level 1 store buffer 138 and the level 2 store buffer 154 consistent. An algorithm such as least recently used (LRU), or another algorithm, may be used by the level 1 store buffer 138 and the level 2 store buffer 154 to manage the storage of data in the store buffers. When the store operation (memory write) has completed and the store uop has retired, the data in the store buffer may be deleted and the associated entry in the level 1 store buffer 138 may be reallocated to another memory store uop. According to an exemplary embodiment, the level 1 store buffer 138 may be a smaller and faster store buffer, while the level 2 (shared) store buffer may be larger than the level 1 store buffer 138 and perhaps not as fast.
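The allocate, write-through, and retire sequence described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the class name `TwoLevelStoreBuffer`, its methods, and the capacities are hypothetical and do not come from the disclosure, and the LRU (or other) replacement policy is not modeled.

```python
class TwoLevelStoreBuffer:
    """Toy model: per-cluster SB1 buffers in front of one shared SB2 (names are hypothetical)."""

    def __init__(self, clusters, l1_capacity, l2_capacity):
        self.l1 = {cluster: {} for cluster in clusters}  # per-cluster level 1 store buffers
        self.l1_capacity = l1_capacity
        self.l2 = {}                                     # shared level 2 store buffer
        self.l2_capacity = l2_capacity

    def store(self, cluster, uop_id, address, value):
        """Allocate an SB1 entry for a store uop; write through to SB2 if it has space."""
        sb1 = self.l1[cluster]
        if len(sb1) >= self.l1_capacity:
            return False  # no free SB1 entry, so the store uop would have to wait
        sb1[uop_id] = (address, value)
        if len(self.l2) < self.l2_capacity:
            self.l2[uop_id] = (cluster, address, value)
        return True

    def retire(self, cluster, uop_id):
        """Store completed and uop retired: free the entries for reallocation."""
        self.l1[cluster].pop(uop_id, None)
        self.l2.pop(uop_id, None)
```

In this sketch a store that finds the shared buffer full simply stays in the level 1 buffer; a real design would presumably retry the write-through later.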

According to an exemplary embodiment, a multi-level data cache may be used, such as a level 2 data cache (D$2) 156 shared by multiple (or all) clusters and a per-cluster level 1 data cache 140 (e.g., data cache 140A for cluster 130A, data cache 140B for cluster 130B, and data cache 140C for cluster 130C). The level 1 data cache 140 may be, for example, smaller and faster than the level 2 data cache 156. Data received from memory by the processor 100, e.g., in response to a memory load operation (memory read), is shown by line 162A (cluster A memory load), line 162B (cluster B memory load), and line 162C (cluster C memory load). Data received in response to a memory load operation (via line 162) may be input to the level 1 data cache 140 for the associated cluster and may then be written to, for example, the level 2 data cache 156. Data from a memory load operation may also be input to an execution unit 136 (such as the memory load execution unit) of the associated cluster.

According to an exemplary embodiment, a single register file may be used. In another embodiment, a multi-level register file may be used. For example, a multi-level register file may include an inter-cluster (shared by multiple or all clusters) level 2 register file (RF2) 152 and one or more per-cluster register files (such as level 1 register file (RF1) 134). The level 2 register file 152 may include many physical (alias) registers that store execution results. A register in the level 2 register file may be allocated for each uop to store the execution result of that uop. A per-cluster register file, such as the level 1 register file (RF1) 134, may be provided for each cluster (e.g., register file 134A for cluster 130A, register file 134B for cluster 130B, and register file 134C for cluster 130C). In an exemplary embodiment, the level 2 register file and the per-cluster level 1 register files 134 may provide a two-level register file. In such a case, the level 1 register file 134 may store operand values and provide them to the execution unit 136, including immediate literal values (from the instruction) and register values obtained through various mechanisms, including long register values that may be read ahead of time, bypassing results that are just being written. The level 1 register file 134 may operate to store recently written values, may be time-indexed, and may use a capture CAM (content addressable memory) that is associatively indexed by the number of the physical register written.

According to an exemplary embodiment, a multi-level register file using, for example, three levels may be used, which may include, for example, the inter-cluster level 2 register file 152, the per-cluster level 1 register files 134, and per-cluster operand capture arrays 135 (including operand capture array 135A for cluster 130A, operand capture array 135B for cluster 130B, and operand capture array 135C for cluster 130C). In this exemplary embodiment, each operand capture array may store operand values and provide them to the execution unit 136, including immediate literal values (from the instruction) and register values obtained through various mechanisms, including long register values that may be read ahead of time, bypassing results that are just being written, and may operate to store recently written values. The operand capture array 135 may provide a relatively small and fast cache for storing operand values and providing them to the execution unit 136. These register values may also be cached or stored in the level 1 register file 134 of the same cluster (as the operand capture array 135) as well as in the level 2 register file 152.

As shown in FIG. 1, execution results output from the execution unit 136 may be input, via line 160A (cluster 130A), line 160B (cluster 130B), and line 160C (cluster 130C), to the operand capture array 135 and the level 1 register file of the associated cluster. These values may also be written to the level 2 register file, either directly or, for example, as a write-through from the level 1 register file 134 to the level 2 register file when they are written to the level 1 register file.

Alternatively, execution results output from the execution unit 136 may be written to the level 2 register file 152 and then sent from the level 2 register file 152 to the level 1 register file 134, which may update the value in a register (if the incoming value matches a register being stored). The execution results may also be input to the operand capture array 135 of the associated cluster, to be stored in the operand capture array if the operand capture array is looking for that result as an operand for another instruction or uop. The name of the register may be provided to the level 2 scheduler 126 and/or the level 1 scheduler of the associated cluster so that the scheduler receives updated information on which operands are ready (e.g., to allow instruction scheduling decisions to be made).

According to an exemplary embodiment, a single-level instruction window (or retirement stage) may be used, or a multi-level instruction window (or retirement stage) may be used. The instruction window may generally be responsible for handling the retirement of uops. In a multi-level instruction window, for example, a (per-cluster) level 1 instruction window (IW1) may be provided for each cluster (instruction window 142A for cluster 130A, instruction window 142B for cluster 130B, and instruction window 142C for cluster 130C). The level 1 instruction window 142A may perform initial services on retirement of a uop. A shared level 2 instruction window (which may be provided as part of the level 2 register file 152) may complete the retirement process for uops from all clusters, according to an exemplary embodiment.

II. Additional Examples of Some Multi-Level Structures and Other Details

A. Example Multi-Level Instruction Scheduler

According to an exemplary embodiment, an instruction scheduler may maintain or develop a set of candidate instructions in an instruction window and determine when each instruction (or uop) should be executed, although an instruction scheduler may perform many functions in many different ways. According to an exemplary embodiment, the instruction scheduler may be divided into two structures: a smaller (and therefore typically faster) instruction scheduler that may typically be closer to the execution units, and a larger (and therefore typically slower) instruction scheduler that is typically farther from the execution units. These may be referred to as the level 1 (L1) and level 2 (L2) instruction schedulers (IS), although the concept generalizes to more levels of hierarchy.

FIG. 2 is a block diagram illustrating a multi-level instruction scheduler according to an exemplary embodiment. Referring to FIG. 2, a level 2 instruction scheduler 226 is coupled to multiple clusters (or execution clusters), including clusters 230A and 230B. Although only two clusters are shown in this example, any number of clusters may be used. Each cluster may include a level 1 scheduler and one or more execution units. For example, cluster 230A may include a level 1 scheduler 232A and an execution unit 236A, while cluster 230B may include a level 1 scheduler 232B and an execution unit 236B.

In an exemplary embodiment, the level 1 and level 2 schedulers may include comparison circuits (or "pickers") or timing wheel circuits. For example, a picker may include content addressable memory (CAM) ports. A picker has multiple entries (e.g., one entry per uop) and multiple CAM ports to detect when an operand for an instruction (or the register value of one of the physical registers) becomes available (e.g., a new value for a register operand returned from an execution unit). For example, if an instruction scheduler has 32 entries, each entry has, for example, two inputs (source operands), and four execution unit results are produced per clock cycle (one result from each execution unit in the cluster), the instruction scheduler may include 256 comparison circuits (32 entries × 2 source operands × 4 results) and four CAM ports corresponding to the four execution units. Each instruction or uop in the scheduler may identify two source operands (or input physical registers). During each clock cycle, the comparison circuits for each instruction may check for new result data that matches one of the inputs of one of the pending instructions. In this way, the scheduler may keep track of when the source operands become ready for the many different instructions or uops waiting to be scheduled for execution. The scheduler may dispatch or send an instruction or uop for execution, for example, when the source operands for the instruction are ready and an execution resource is available.
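As a rough software illustration of the picker described above, the following Python sketch reproduces the comparator arithmetic and models the per-cycle tag comparison. The names `comparator_count` and `Picker` and the string register tags are assumptions made for this example only; real hardware performs all comparisons in parallel rather than in a loop.

```python
def comparator_count(entries, sources_per_entry, results_per_cycle):
    """Every source of every entry is compared against every result tag each cycle."""
    return entries * sources_per_entry * results_per_cycle


class Picker:
    """Toy wakeup/select model (hypothetical structure, not the patented circuit)."""

    def __init__(self):
        self.waiting = {}  # uop id -> set of physical registers still pending

    def insert(self, uop, sources, ready_regs=()):
        """Add a uop, ignoring sources that are already available."""
        self.waiting[uop] = set(sources) - set(ready_regs)

    def broadcast(self, result_regs):
        """One clock cycle: match each pending source against the new result tags."""
        for pending in self.waiting.values():
            pending.difference_update(result_regs)

    def pick(self, free_units):
        """Dispatch up to free_units uops whose source operands are all ready."""
        ready = [uop for uop, pending in self.waiting.items() if not pending]
        ready = ready[:free_units]
        for uop in ready:
            del self.waiting[uop]
        return ready


# 32 entries, 2 source operands each, 4 results per cycle
total_comparators = comparator_count(32, 2, 4)
```

With the figures given in the text, `total_comparators` evaluates to 256.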

A timing wheel circuit may control which instructions or uops are executed using a different mechanism. In an exemplary timing wheel circuit, instructions may be placed in a list, positioned on the list based on when they are expected to be ready to execute (instructions may be scheduled speculatively). Thus, with a timing wheel circuit, it may be possible to schedule an instruction that is not yet ready to execute but is expected to be ready in the future. The instruction buffer for the timing wheel may be, for example, a circular buffer, in which an instruction that is not yet ready to execute when its execution time arrives may automatically be executed in the future, after the wheel has completed one rotation. Picker and timing wheel circuits are just two types of circuits that a scheduler may use to schedule instructions for execution; many other techniques may be used.
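The circular-buffer behavior of a timing wheel can be sketched as follows. This is a minimal model under assumptions of our own: the class name `TimingWheel`, the `is_ready` callback, and the convention that a delay of 0 means "examined on the next tick" are all hypothetical choices for illustration.

```python
class TimingWheel:
    """Toy timing wheel: a circular buffer of slots, one slot examined per tick."""

    def __init__(self, slots):
        self.slots = [[] for _ in range(slots)]
        self.now = 0  # index of the slot examined on the next tick

    def schedule(self, uop, delay):
        """Place a uop in the slot that will be examined after `delay` further ticks."""
        if not 0 <= delay < len(self.slots):
            raise ValueError("delay must fit within one rotation of the wheel")
        self.slots[(self.now + delay) % len(self.slots)].append(uop)

    def tick(self, is_ready):
        """Issue the ready uops in the current slot; the rest wait one full rotation."""
        issued, deferred = [], []
        for uop in self.slots[self.now]:
            (issued if is_ready(uop) else deferred).append(uop)
        self.slots[self.now] = deferred  # revisited when the wheel comes around again
        self.now = (self.now + 1) % len(self.slots)
        return issued
```

A uop that is not ready when its slot comes up is simply left in place, so it is reconsidered automatically once the wheel has completed one rotation, which mirrors the behavior the paragraph above describes.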

According to an exemplary embodiment, the level 1 and level 2 schedulers may each use a picker circuit, a timing wheel circuit, or both. For example, the level 1 scheduler may include a picker circuit ahead of a timing wheel. The level 2 scheduler 126, on the other hand, may include a timing wheel ahead of a picker circuit. In addition, a scheduler may include a shortcut circuit: for example, if there are no entries in the level 1 scheduler's timing wheel and a new instruction is input to the level 1 scheduler, the picker circuit may be bypassed. Likewise, the level 2 scheduler may be bypassed if space is available in the level 1 scheduler.

FIG. 3 is a block diagram illustrating a multi-level instruction scheduler according to an exemplary embodiment. The level 2 scheduler may include a timing wheel circuit 302 and a picker circuit 304. Each execution cluster may include a picker circuit and a timing wheel circuit. For example, a first cluster may include a picker circuit 306 and a timing wheel circuit 312, a second cluster may include a picker circuit 308 and a timing wheel circuit 314, and a third cluster may include a picker circuit 310 and a timing wheel circuit 316. Moreover, the clusters may be heterogeneous, some having only a picker, others only a timing wheel, and others both.

In an exemplary embodiment, the level 1 scheduler 132 may have four CAM ports, one for each of the two integer ALUs and two load ports of a standard execution unit cluster. The level 1 scheduler may use a picker circuit that prompts or indicates to the scheduler when all operands for a uop are ready, or are expected to be ready. The level 1 scheduler 132 may, for example, dispatch one uop at a time to the execution unit 136 (or one uop per execution unit per clock cycle, where the execution unit 136 may include four execution units).

In an exemplary embodiment, the level 2 scheduler 126 may include 16 partitions of 64 entries each. Each entry may contain four uops. Each entry may have three CAM ports. Each entry may specify the logic function of its input operands that must be satisfied, such as all ready (S1&S2&S3), any ready (S1|S2|S3), or any number of other logic functions such as (S1&S2|S3). When the logic function is satisfied, the entry is treated as ready, indicating to the level 2 scheduler that operands of the uops are available and that the uops can be dispatched for execution.
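The per-entry readiness functions can be illustrated with a short Python sketch. The function names are hypothetical, and reading (S1&S2|S3) as (S1 AND S2) OR S3 is an assumption based on the usual precedence of those operators. The capacity figure is simple arithmetic on the example numbers in the text.

```python
def all_ready(s1, s2, s3):
    return s1 and s2 and s3


def any_ready(s1, s2, s3):
    return s1 or s2 or s3


def s1_and_s2_or_s3(s1, s2, s3):
    # Assumed reading of (S1&S2|S3): AND binds tighter than OR.
    return (s1 and s2) or s3


def entry_ready(function, sources, ready_regs):
    """Evaluate an entry's logic function over the three monitored source registers."""
    flags = [reg in ready_regs for reg in sources]
    return bool(function(*flags))


# 16 partitions x 64 entries x 4 uops per entry, using the example figures above
L2_CAPACITY_UOPS = 16 * 64 * 4
```

Under these example figures the level 2 scheduler would hold up to 4096 uops while each entry monitors only three tags, which suggests why grouping uops per entry keeps the CAM cost down.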

Alternatively, the level 2 scheduler 126 (FIG. 1) may speculatively send a group of instructions to a selected level 1 scheduler before all of their operands (inputs) are ready. For example, if an instruction has two inputs (source operands), the uop may be sent from the level 2 scheduler to the level 1 scheduler whenever either input (or source operand) becomes available. This can be done because the level 1 instruction scheduler can still track the dependencies.

Multiple instructions (uops) may be grouped together before they are placed in the level 2 instruction scheduler. The uops or instructions in such a group may be related through a dependency chain, may be unrelated, or may be selected without regard to dependency (e.g., in original program order). The level 2 scheduler may then send the whole group to a level 1 scheduler whenever any input becomes available, since this may indicate that some of the instructions within the same basic block (or group) have begun executing, and that the remaining instructions are therefore good candidates to begin executing soon. This is just one example.

Thus, according to an exemplary embodiment, the level 2 scheduler 126 may perform coarse scheduling, while the level 1 scheduler 132 may perform precise (or more precise) scheduling or dispatch of uops. For example, the level 2 scheduler 126 may schedule groups of uops, while the level 1 scheduler 132 may schedule the execution of individual uops. In an exemplary embodiment, the level 2 scheduler 126 may dispatch or send a group of uops (or instructions) to a level 1 scheduler. The group of uops dispatched or sent to the level 1 scheduler may be a dependency chain of uops (e.g., the uops in the group have some type of dependency relationship). The group of uops may be sent to the appropriate level 1 scheduler when only one (or some) of the uops in the group is ready to execute, or when some of the operands for at least one uop in the group are ready (with the remaining uops being sent speculatively). For example, the level 2 scheduler may send a group of four uops when only one of the three operands or inputs for one of the uops is ready, or when one of the uops is ready to execute (e.g., all operands of one uop are ready).
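The two example trigger conditions for sending a group from the level 2 scheduler to a level 1 scheduler can be expressed as simple predicates. This is a sketch only: representing a group as a list of source-register tuples, and the function names, are assumptions made for illustration.

```python
def any_operand_ready(group, ready_regs):
    """True if at least one source operand of any uop in the group is available."""
    return any(src in ready_regs for sources in group for src in sources)


def any_uop_ready(group, ready_regs):
    """True if at least one uop in the group has all of its source operands available."""
    return any(all(src in ready_regs for src in sources) for sources in group)


# Hypothetical group of four uops, each listed by its source registers
example_group = [("p1", "p2", "p3"), ("p4",), ("p5", "p6"), ("p7",)]
```

The first predicate is the more speculative policy: the whole group moves as soon as a single operand arrives, and the level 1 scheduler is left to track the remaining dependencies.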

In this manner, the circuitry for the level 2 scheduler may be reduced or simplified, since fewer CAM ports may be required for the level 2 scheduler.

The scheduling groups of uops to be dispatched by the level 2 scheduler 126 may be built or grouped, for example, by the mapper 120. FIG. 4 is a block diagram illustrating an exemplary system in which a uop group builder 402 may build groups of uops to be dispatched or sent as a group to one of the level 1 schedulers, e.g., when certain conditions are met. For example, a group of four uops may be sent to the selected level 1 scheduler when one operand of the group is available, or when one uop (or instruction) of the group is ready to execute.

According to an exemplary embodiment, the level 2 scheduler may be bypassed if the appropriate level 1 scheduler is not full. FIG. 5 shows an exemplary embodiment in which the level 2 scheduler 126 is coupled to the mapper 120 in parallel with the level 1 scheduler 132. This may allow uops or instructions to be input directly to both levels of the scheduler, and may make it easier to bypass the level 2 scheduler if the level 1 scheduler is not yet full. FIG. 6 is a block diagram in which the mapper may also be coupled directly to level 1, according to another exemplary embodiment.

According to an exemplary embodiment, a large level 2 instruction scheduler 126 may be shared among multiple level 1 schedulers. This may allow clusters of execution units to be built and allows the level 2 scheduler space to be shared efficiently among them. Such a design is shown in FIG. 2, for example, for a system with two clusters. The approach generalizes to any number of hardware clusters and associated level 1 instruction schedulers. An exemplary goal of having private level 1 and shared level 2 structures in this design, whether they are ordinary caches or instruction schedulers, may be to allow the smaller structures to be fast and close to where they are needed (such as near the execution units), while allowing the larger structures to be shared efficiently. If one cluster needs a very large set of instructions in its active instruction window, the shared level 2 structure may dynamically allocate more entries to it. If the load is roughly equal across all clusters, the level 2 structure may be shared equally. And, according to an exemplary embodiment, the number of entries allocated to each cluster in a level 2 structure (such as the level 2 instruction scheduler) may change over time to reflect the changing dynamics and needs of the executing program.

According to an exemplary embodiment, the level 2 instruction scheduler 126 may be physically partitioned. Each partition within the level 2 scheduler may be assigned to serve a single (or a different) level 1 instruction scheduler, and each level 1 instruction scheduler may be associated with multiple level 2 scheduler partitions. This assignment may change dynamically, and thus the partition size may be regarded as the granularity of resource allocation within the level 2 instruction scheduler for multiple clusters. An advantage of this approach is that it greatly reduces the number of CAM ports needed for the L2 instruction scheduler. Each picker (or comparison) circuit may typically observe (or receive data from) the output of each execution unit. If a level 2 scheduler physical partition can hold instructions for multiple level 1 clusters, then, according to an exemplary embodiment, it would typically have a port matching the output of each execution unit from each cluster. By associating each level 2 partition with one cluster rather than N (for an N-cluster machine), the number of such ports may be reduced from N*M to M (for clusters with M execution units).
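The port reduction from N*M to M is simple arithmetic, shown here as a small Python sketch. The function name and the choice of three clusters with four execution units each are assumptions for illustration (the four-unit cluster matches the example execution unit described earlier).

```python
def l2_partition_cam_ports(clusters, units_per_cluster, one_cluster_per_partition):
    """CAM ports a level 2 partition needs to watch execution unit outputs."""
    if one_cluster_per_partition:
        return units_per_cluster            # M: only the associated cluster's units
    return clusters * units_per_cluster     # N*M: every unit in every cluster


shared_ports = l2_partition_cam_ports(3, 4, one_cluster_per_partition=False)
dedicated_ports = l2_partition_cam_ports(3, 4, one_cluster_per_partition=True)
```

For this assumed configuration the partition would need 12 ports when it may hold uops for any cluster, and 4 ports when it is tied to a single cluster.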

According to an exemplary embodiment, the schedulers in processor 100 (e.g., the level 1 and/or level 2 schedulers) may sometimes schedule uops for execution speculatively. That is, a scheduler may sometimes schedule a uop for execution before all conditions necessary for correct execution have been met (e.g., not all inputs or source operands are ready yet, but they are expected to be ready soon). In such cases, the expectation or hope is that all conditions necessary for correct execution will have been met by the time the uop actually executes. If the conditions necessary for correct execution were not met when the uop executed, the uop is reissued (re-executed) for execution, which is often referred to as replay. Exemplary causes of replay include cache misses, dependency violations, unexpected resource constraints, and so on.

According to an exemplary embodiment, processor 100 may include a recovery scheduler and may replay uops. Replays may be scheduled using the original scheduler, but operations waiting on a long-latency replay may be moved out of the critical level 1 scheduler into an auxiliary structure. Moreover, there may be scheduler circuitry that cancels replay storms, a so-called anti-scheduler.

The replay storm anti-scheduler may catch up with the wavefront of operations that were scheduled but turn out to need replay, by ensuring that the cancel message travels faster than the original dataflow latency. First, anti-scheduling operations may have the lowest (or a lower) latency: e.g., memory operations have the same 1 (or 0.5) cycle latency as ALU operations. However, this is not enough to guarantee catch-up: some degree of transitive closure is necessary. Computing the full transitive closure in a bitmap scheduler is straightforward. In a tag-based scheduler, transitive closure is more expensive. Thus, the anti-scheduler may be, for example, a larger and slower bitmap scheduler. Operations may remain in this replay storm anti-scheduler until it is safe to replay them.

According to an exemplary embodiment, the replay scheduler (which replays operations waiting on long-latency events such as cache misses) and/or the replay storm anti-scheduler functions may be placed in the level 2 scheduler and shared among the clusters.

B. Example Hierarchical Register File

Read-after-schedule reads the register file after an operation is dispatched from the scheduler; capture (an operand capture array) reads old values from the physical register file as the operation is placed in the scheduler, and "captures" new values as they are written back. Read-after-schedule may require a large number of ports on the physical register file; operand capture may require fewer.

According to an example embodiment, register file port reduction may be important because, even for a read-after-schedule microarchitecture, it is not necessary to read the entire register file.

Read-after-schedule still has some applicability: the level 1 register file (RF1) may be read before operands are placed in the scheduler, passing values to an operand storage array that is indexed at dispatch by operation number (level 1 scheduler entry number).

In many microprocessors, the register file tends to be large and slow and to consume considerable power. The two main factors in sizing a register file are the number of entries, denoted R, and the number of ports, denoted P. A classic microprocessor that can execute only a single instruction per clock (i.e., not superscalar) needs two read ports and one write port to support an instruction such as "add r1, r2, r3". A simple approximation is that a superscalar processor that can issue N instructions per clock needs 3*N ports. In some cases, as the degree of concurrent execution increases, that is, as N increases, both the number of ports and the number of physical registers may increase. In some cases, for example, the physical silicon area of the register file may grow as R*P², the delay as P*R^(1/2), and the energy as R*P². These relationships are rules of thumb or estimates used for explanation, and the present disclosure is not limited to them.
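To make these rules of thumb concrete, the following sketch evaluates them for two hypothetical design points. The entry counts (32 and 128) and the function names are assumptions chosen only for illustration; the results are relative units, not physical measurements.

```python
import math


def ports_needed(issue_width):
    """Rule of thumb: two read ports plus one write port per issued instruction."""
    return 3 * issue_width


def relative_cost(entries, ports):
    """Relative area, delay, and energy using R*P^2, P*R^(1/2), and R*P^2."""
    return {
        "area": entries * ports ** 2,
        "delay": ports * math.sqrt(entries),
        "energy": entries * ports ** 2,
    }


# Hypothetical scalar design: N = 1 with 32 entries.
scalar = relative_cost(entries=32, ports=ports_needed(1))
# Hypothetical 4-wide superscalar design: N = 4 with 128 entries.
wide = relative_cost(entries=128, ports=ports_needed(4))

area_ratio = wide["area"] / scalar["area"]
delay_ratio = wide["delay"] / scalar["delay"]
```

Under these assumed sizes, quadrupling both the issue width and the entry count multiplies the estimated area by 64 and the estimated delay by 8, which is the kind of growth the hierarchical register file is meant to avoid.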

One example technique for reducing these negative effects of the register file is to use a register file cache. For example, the idea may be to build a smaller cache memory (a small cache register file) with all the ports needed to supply operand bandwidth to the execution units, backed by a larger register file with fewer ports. For example, register accesses may be sent to the register cache, which is usually managed using some approximation of LRU, and cache misses are sent to the larger main register file for refill. Consequently, while the main register file is addressed directly as a RAM structure using the physical register number as the index, the cache register file may use a CAM to determine whether it currently holds the value associated with a particular physical register. As long as enough accesses are satisfied by the fully ported cache register file, there is little or no negative performance impact.
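A minimal software sketch of such a register file cache follows. It is an assumption-laden model, not the disclosed circuit: the class name is hypothetical, a Python list stands in for the RAM-indexed main register file, and a dictionary lookup keyed by physical register number stands in for the CAM match. Exact LRU is used where hardware would typically use an approximation.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Small LRU cache in front of a larger main register file."""

    def __init__(self, main_rf, capacity):
        self.main_rf = main_rf        # list indexed by physical register number
        self.capacity = capacity
        self.lines = OrderedDict()    # preg -> value, ordered oldest to newest use
        self.misses = 0

    def read(self, preg):
        if preg in self.lines:                # "CAM" hit
            self.lines.move_to_end(preg)      # mark as most recently used
            return self.lines[preg]
        self.misses += 1                      # miss: refill from the main file
        value = self.main_rf[preg]
        self.lines[preg] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        return value


main_rf = [10 * i for i in range(8)]
cache = RegisterFileCache(main_rf, capacity=2)
for preg in (1, 2, 1, 3):
    cache.read(preg)
```

After this access sequence the cache holds physical registers 1 and 3 and has taken three misses; register 2 was evicted because register 1 was touched more recently.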

According to another example embodiment, another technique that may be used is a bypass cache (bypass register file). In this example, the term bypass may refer to the process of sending new data results produced by an execution unit directly to the instructions waiting for them, rather than writing them to the register file and then having the dependent instructions read them from the file. For example, a bypass cache may hold the last few such values and supply them directly to new instructions entering the scheduler. In at least some cases, this approach can reduce the total number of data writes to the large main register file, and it can improve performance by supplying data earlier than would otherwise be possible.

According to another example embodiment, additional techniques may be used to obtain operands for instructions in the processor 100. First, operands may be read from the register file when the instruction is sent to an execution unit; alternatively, operands may be captured in some new structure as they are produced and then read from that structure when the instruction is sent to an execution unit.

According to an example embodiment, a three-level register file may be used. The following example description of the multi-level register file is given for one cluster (on a per-cluster basis) and may be replicated for each cluster. The level 2 register file (RF2) 152 may have a single entry (e.g., a register) for each physical register in the microarchitecture. This register file may be addressed by physical register number and indexed, for example, as an ordinary RAM (random access memory), so its circuitry may be simpler than CAM circuitry. The complexity of the level 2 (or main) register file may arise, for example, from two sources. First, the level 2 (or main) register file 152 may be relatively large, for example 80 entries and possibly more. Second, because the execution units may have high operand bandwidth requirements, it may be desirable for the level 2 (or main) register file 152 to have a relatively large number of ports if it supplies operand values directly. A hierarchical register file can instead provide smaller structures, with fewer registers and a large number of ports, and place them closer to the execution units, where the data bandwidth is actually needed.

FIG. 7 is a block diagram illustrating a multi-level register file according to an example embodiment. The structures closest to the execution units may have the greatest effect on performance. The operand capture array (OC) 135, which may be considered a level 0 register file (RF0), may supply operands directly to the execution units 136 (within a cluster). In an example embodiment, each entry of the level 1 instruction scheduler (S1, FIG. 1) 132 has a corresponding entry in the OC 135 of the corresponding cluster, and the OC 135 entry for a uop may have the same index value as the uop in the level 1 scheduler, so the operand capture array 135 may be accessed as a fast RAM, for example, when an instruction moves to an execution unit. When an instruction enters the level 1 scheduler 132, operand data, if available, is written to the corresponding operand capture entry for the instruction within the same (or corresponding) cluster (this may be done on a per-cluster basis). Each entry in the OC 135 may also have a set of CAMs used to capture operand data that was not ready when the instruction entered the level 1 instruction scheduler because it was still being produced by an instruction that had not yet completed execution. When an instruction completes execution, the execution unit 136 may provide the new data result along with the number of the physical register in which the result should be stored. For each execution unit, the OC 135 compares the physical register number being written against the physical register numbers of its unsatisfied inputs. If there is a match, the new data value is captured in the operand capture array 135. Thus, for example, there may be two types of write ports on the operand capture array 135: a set indexed like a RAM using the level 1 scheduler entry number, used when an instruction enters the level 1 scheduler (of the same cluster as the OC), and a set addressed using a CAM and the physical register number provided by the execution unit. The size of the OC 135 may be a design parameter determined, for example, based on the size of the level 1 instruction scheduler 132 in the cluster.
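The two kinds of write port on the operand capture array can be sketched in software as follows. This is an illustrative model under stated assumptions: the class and method names are hypothetical, `None` is used to mark an operand that is not yet ready, and the CAM match is modelled as a linear search over waiting operands.

```python
class OperandCaptureArray:
    """One entry per level 1 scheduler slot; each source is [preg, value, ready]."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries

    def allocate(self, s1_index, sources):
        """RAM-style write, indexed by scheduler entry number.

        sources: list of (preg, value); value is None if not yet produced.
        """
        self.entries[s1_index] = [
            [preg, value, value is not None] for preg, value in sources
        ]

    def writeback(self, preg, value):
        """CAM-style write: capture a result into every waiting operand that matches."""
        for entry in self.entries:
            if entry is None:
                continue
            for src in entry:
                if not src[2] and src[0] == preg:
                    src[1], src[2] = value, True

    def is_ready(self, s1_index):
        return all(src[2] for src in self.entries[s1_index])

    def dispatch(self, s1_index):
        """RAM-style read at dispatch; frees the entry."""
        if not self.is_ready(s1_index):
            raise ValueError("operands not yet captured")
        values = [src[1] for src in self.entries[s1_index]]
        self.entries[s1_index] = None
        return values


oc = OperandCaptureArray(num_entries=4)
oc.allocate(2, [(5, 7), (9, None)])   # preg 5 is ready, preg 9 is still in flight
oc.writeback(9, 35)                   # execution unit produces preg 9
operands = oc.dispatch(2)
```

The entry index (2 here) plays the role of the level 1 scheduler entry number, so dispatch needs no search, while writeback needs a match on physical register number.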

In an example embodiment, the register file cache (level 1 register file 134 (RF1)) may be accessed when an instruction enters the level 1 scheduler (for this cluster) and is allocated an entry in the operand capture array 135. This may occur before the instruction is scheduled, and thus ahead of the time the data is needed by the execution unit. This approach has the benefit of detecting cache misses ahead of when the data is actually needed, and the benefit of removing cache refill circuitry from the critical path used to move operands from the operand capture array 135 to the execution units 136. According to an example embodiment, the operand capture array 135 may thus operate as a bypass cache. The operand capture array 135 and the level 1 register file 134 may be combined into a single register file (e.g., RF1), but this is not required and is merely a design choice.

The level 2 register file (RF2, 152), the main register file, may be used, for example, to provide backing storage and may hold all or nearly all register values. However, it only needs to provide enough read bandwidth to satisfy the expected number of RF1 cache misses. Thus, the number of read ports may be reduced in an example embodiment. Moreover, data writes may be buffered (because dependent instructions are being satisfied from the operand capture array or the level 1 register file) in order to size the number of write ports for the expected steady-state bandwidth rather than the worst-case bandwidth.

According to an example embodiment, the large level 2 register file (RF2, 152) may be shared among multiple clusters, while each cluster has a dedicated OC 135 and level 1 register file (RF1, 134). In this approach, it may be beneficial for the level 1 register file 134 to have a high hit rate; otherwise, performance may degrade. One benefit of this sharing is that it allows a thread (e.g., a running program) to be moved from one cluster to another, with register file values copied from RF2 to RF1 only when they are actually needed. This can help facilitate transparent thread migration from one cluster to another.

Further details and example embodiments of the multi-level register file are now described. According to an example embodiment, the multi-level register file may include (per cluster) a level 1 register file (RF1) 134 and a bypass cache, which may be referred to as the operand acquisition subsystem. According to an embodiment, RF1 may be read before scheduling, and the operand capture array (OC) 135 may be read after scheduling. According to an example embodiment, the level 1 register file (RF1) may be read before an operation is placed in the S1 scheduler. Values read from RF1 may be passed to the operand capture array (OC) 135 (in the same cluster). The operand capture array may be read after the operation is dispatched from the S1 scheduler; it is indexed by S1 entry number.

FIG. 8 is a block diagram illustrating a multi-level register file according to an example embodiment. The level 1 register file (RF1) may be CAM-indexed by physical register number on reads. RF1 may "miss" to the main physical register file, sending a request to RF2.

Multiple RF1 miss requests may be combined; for example, two instructions requesting the same register may use only a single read access port to RF2, making more efficient use of available resources.
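Combining miss requests can be sketched as grouping outstanding requests by physical register number, so that each distinct register costs one RF2 read. The function name and the (scheduler entry, register) tuple format are assumptions made for illustration.

```python
def merge_miss_requests(requests):
    """Group RF1 miss requests by physical register number.

    requests: list of (s1_index, preg) pairs.
    Returns {preg: [s1_index, ...]}; each key needs only one RF2 read.
    """
    merged = {}
    for s1_index, preg in requests:
        merged.setdefault(preg, []).append(s1_index)
    return merged


# Scheduler entries 0 and 3 both miss on preg 17; entry 4 misses on preg 22.
merged = merge_miss_requests([(0, 17), (3, 17), (4, 22)])
rf2_reads = len(merged)
```

Three misses are served here with two RF2 reads, and the single result for preg 17 is delivered to both waiting entries.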

Those skilled in the art will appreciate that RF1 may be managed using any of a number of well-known replacement policies, for example least recently used (LRU), pseudo-LRU, and random.

A portion of RF1 may be organized as a first-in, first-out (FIFO) memory, which may be known as a bypass cache. For example, an N-entry bypass cache holds the last N values produced by the execution units, allowing these values to be supplied to later instructions entering the scheduler and thus bypassing the RF1/RF2 access mechanism described above.
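An N-entry FIFO bypass cache of this kind can be modelled with a bounded queue. This is a sketch only: the class name is hypothetical, and returning `None` on a miss is an assumed convention standing in for falling back to the RF1/RF2 path.

```python
from collections import deque


class BypassCache:
    """Holds the last N (preg, value) results produced by the execution units."""

    def __init__(self, n):
        self.fifo = deque(maxlen=n)   # the oldest result drops out automatically

    def record(self, preg, value):
        """Written in time order as results are produced."""
        self.fifo.append((preg, value))

    def lookup(self, preg):
        """Searched by physical register number, newest result first."""
        for stored_preg, value in reversed(self.fifo):
            if stored_preg == preg:
                return value
        return None                   # not present: use the normal RF1/RF2 path


by_cache = BypassCache(n=2)
by_cache.record(1, 100)
by_cache.record(2, 200)
by_cache.record(3, 300)   # pushes out the result for preg 1
```

Writes are ordered by time while reads match on register number, which mirrors the distinction the text draws between RAM-style and CAM-style access.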

A level 1 register file (RF1) miss request does not delay operation issue: instead, the operation is placed in the S1 scheduler, the CAMs are enabled for its operand capture entry, and the RF1 miss request is scheduled. When the RF1 miss request completes, the data is used to update the level 1 scheduler and the operand capture array (OC) 135 (in the same cluster), performing wakeup exactly as in the normal case.

RF1 fills may use a write port, shown in the diagram as RAM-indexed by RF1 number. For simplicity, this port may be eliminated and merged with the RF1 execution unit writeback ports, which may be CAM-indexed by physical register number (preg#). In FIG. 8, execution unit writebacks may be sent to RF1, CAM-indexed by preg#. An RF1 entry may be pre-allocated when a value is expected to be produced by an execution unit, ensuring that the RF1 writeback CAM port (i.e., the one connected to the execution unit) matches on a register entry. The bypass cache (BY$) is written through a time-indexed RAM port but read by preg# CAM; RF$ is read and written as described above.

According to an example embodiment, RF1 (the level 1 register file) may be a cache of the main register file (the level 2 register file). Such a cache (the level 1 register file) may be at least partly CAM-indexed (that is, CAM-indexed or tag-matched) and may unexpectedly take a "miss". For read-after-schedule, it is possible to have an array that is read after scheduling and is RAM-indexed. The contents of this post-schedule array are checked before the operation is placed in the scheduler, which may use CAMs; however, the post-schedule array cannot miss when it is read after scheduling. Here, the post-schedule array may be a non-cache (RF1). The structure read before scheduling is in effect a dynamic cache of RF2, although the pre-schedule structure described in this paragraph does not, according to an example embodiment, store data values.

Another read-after-schedule arrangement may use CAMs to access the post-schedule array. In this scheme, the post-schedule array may be a cache (RF$, a register file cache) that can dynamically miss. (It is also possible to build post-schedule CAM ports but manage them so that dynamic misses do not occur.) In a multi-level scheduler, there may be pre-S1 (level 1 scheduler) and post-S1 register file mechanisms. An operand acquisition microarchitecture with a pre-schedule structure (RF1$, the level 1 register file cache) and a post-schedule operand capture structure (the operand capture array 135) may have a number of advantages, for example.

Placing data in the pre-schedule structure allows larger RF1$ mechanisms (a larger LRU cache, a larger bypass cache (BY$)) to be used without unduly complicating the post-schedule array. These pre-schedule (RF1$) mechanisms may have few ports, whereas the post-schedule array requires full porting.

Placing data in the pre-schedule structure allows alternatives such as an active register file to be used. For example, branch misprediction recovery may recover register values as well as the map.

The post-schedule OC (operand capture array) may need only N entries, where N is the number of entries in the level 1 scheduler (S1). It may need only one RAM port per execution unit dispatch port, whereas other post-schedule structures may use, for example, one port per operand per execution unit dispatch port.

The main cost of the post-schedule OC (operand capture) is the CAMs on the writeback ports. These may be converted to RAMs by combining pre-allocation of writes to the post-schedule level 1 register file (RF1) with the generation of new requests.

Mechanisms that rely on evanescent bypassing to reduce register file ports may exhibit positive feedback that reduces performance: if an operation is delayed, it may miss the opportunity to pick its values off the bypass network and be delayed further, which increases the chance that subsequent operations are delayed. A fully ported microarchitecture has no such positive feedback, at the cost of area.

According to an example embodiment, the processor 100 may save area by not having full ports on the full physical register file/instruction window. Most of the physical register file (or its entries) may have only 1 or 2 ports. A multi-level register file architecture may be used and may address possible positive feedback issues, for example as follows (these are examples only, and the disclosure is not limited to them).

The pre-scheduler RF1 may cache miss, but usually does not block later operations. The blocked operation is sent to S1, where it waits to capture its missing operand when the fill is written back. As a result, an RF1 miss does not delay later independent instructions. Moreover, data values are allocated to specific locations in the OC and are (typically) not evicted until the instruction associated with them has executed.

Another example, which may involve the least area because it can be managed so as not to need more entries, is a data-full pre-scheduler RF1$ with a non-cache post-scheduler RF1. Many of the extra entries in a post-scheduler RF1 may be attributable to the different register file cache functions: LRU RF1$, BY$, and so on. When these are moved to the pre-scheduler RF1, a post-scheduler RF1 entry is typically kept until the corresponding uop has been completely written back. OC CAMs typically eliminate this consideration. According to an example embodiment, the operand capture array may have a CAM port for every operand associated with every uop; for example, x86 may have two source CAM ports plus a non-CAM port for immediate values extracted directly from the instruction.

CAMs are needed to allow an operation to send its result directly to waiting operations that did not yet exist when the first operation began executing. These CAMs may be eliminated by having an operation write to a single location using RAM indexing. When writing to the post-scheduler level 1 register file (RF1), the use of RAM indexing means that this location must be reserved until the writeback completes.

Many CAMs may go unused or be wasted, because many operations have a literal immediate constant as an operand and may additionally have at least one operand that is available early, possibly at the time the operation is placed in the level 1 scheduler (S1). These immediate and early-available operands do not need CAM ports for execution unit writeback: they may be placed in a separate array, or in an array that is the same except that it has no CAM ports.

However, according to an example embodiment, to improve performance, one or more dynamic inputs may be captured on the fly, usually in cases where there is no guarantee that an operand will be picked up on the bypass path. To enable a varying number of dynamic and static operands, at least in some cases, some of the advantages of the OC (operand capture) array may be forgone. For example, each operand may be indexed independently instead of through a single access indexed by scheduler number.

In such an approach, two different types of post-scheduler RF1 arrays may be implemented, both indexed by operand number at the time the instruction is dispatched to an execution unit. The first post-scheduler RF1 array may be CAM-indexed based on execution unit outputs. The second post-scheduler RF1 array, which holds static operands, is not CAM-indexed based on execution unit outputs. In many cases, the design tradeoff typically favors increasing the number of CAM ports, because static operands can almost always be stored in CAM entries, but the converse does not hold.

However, the decoder-per-operand approach may have the advantage of further reducing ports by supporting instructions with many input operands (e.g., FMAC (floating-point multiply accumulate)) without supporting them on all ports. In addition, if a timing wheel scheduler is used inside S1, it may be possible to guarantee that values are picked up on the bypass path.

Values may be passed from one cluster to another, supporting process migration and forking (e.g., one thread spawning another thread). A dedicated inter-cluster bypass network may also be provided. If there is no dedicated bypass network, RF1 (register file 1) misses may be sent to the physical register file shared among clusters (e.g., the level 2 register file). If the physical register file (PRF) (e.g., the level 2 register file) has the register value, it responds; otherwise, the level 2 register file (PRF) tracks which cluster will produce the value. If the value is ready but has not been written to the level 2 register file, the PRF may send a request to the owning cluster and then send a response to the requester. If the value is not yet ready, the PRF may send a request to the owning cluster, which eventually expedites the write-through of the requested value. For example, inter-cluster communication may occur through the shared PRF (physical register file, such as the level 2 register file), which may implement a directory that tracks which clusters produce and request values. Such a protocol may work well whether values are written through to the physical register file (PRF) (such as the level 2 register file) immediately or at a later time.
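The three cases of this directory-style protocol (value already in the PRF, value ready in the owning cluster, value not yet produced) can be sketched as below. Everything here is a simplified assumption for illustration: the class names, the `declare` and `request` methods, and the status strings are hypothetical, and real hardware would exchange messages rather than share Python dictionaries.

```python
class Cluster:
    def __init__(self):
        self.local = {}        # preg -> value produced but not yet written through
        self.expedite = set()  # pregs whose write-through should be prioritized


class SharedPRF:
    """Level 2 register file with a directory of producing clusters."""

    def __init__(self, clusters):
        self.clusters = clusters
        self.values = {}       # preg -> value already written through
        self.owner = {}        # preg -> index of the cluster that produces it

    def declare(self, preg, cluster_id):
        """Record which cluster will produce preg."""
        self.owner[preg] = cluster_id

    def request(self, preg):
        """Handle an RF1 miss from some cluster; returns (status, value)."""
        if preg in self.values:
            return ("prf", self.values[preg])          # PRF answers directly
        owner = self.clusters[self.owner[preg]]
        if preg in owner.local:
            # Ready in the owner but not written through: fetch, then respond.
            self.values[preg] = owner.local[preg]
            return ("forwarded", self.values[preg])
        # Not ready: ask the owner to expedite the eventual write-through.
        owner.expedite.add(preg)
        return ("pending", None)


clusters = [Cluster(), Cluster()]
prf = SharedPRF(clusters)
prf.declare(7, cluster_id=0)
first = prf.request(7)        # value not produced yet
clusters[0].local[7] = 42     # cluster 0 finishes the producing operation
second = prf.request(7)       # ready in the owner, forwarded through the PRF
third = prf.request(7)        # now held by the PRF itself
```

The sketch omits how a pending requester is later notified; in the text that is covered by the owning cluster's expedited write-through.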

C. Example Instruction Window

Additional description and embodiments of the level 2 register file are now described. According to an example embodiment, the level 1 register file for each cluster may be a write-through structure. That is, execution unit result writes are sent back within the cluster and are also written through to, for example, the level 2 register file (e.g., the PRF). This can generate a fairly high volume of write-through traffic: typically approximately 3 clusters and rro execution ports provide 12 writes per cycle.

For example, the level 2 register file (RF2) may be organized as a randomly allocated, mapped register file. Alternatively, the level 2 register file may be organized in the same style as the Intel Pentium P6 ROB/RRF (RRF stands for real register file, Intel's term), which copies data on demand. According to an example embodiment, the RF2/PRF may include the randomly allocated register file array described above, together with a reorder buffer (ROB) and a register allocation table (RAT), and this may also function as a map delta list; the ROB, however, need not contain data and may provide only pointers to RF2/PRF registers that can be used to update the map.

In one configuration, the RF2/PRF is heavily banked to support high sustained write-through bandwidth, with each bank having only a small number of write and read ports. Buffers allow write-through operations to be scheduled so as to avoid bank conflicts. According to an example embodiment, the level 2 register file (RF2) may include a full-width write port for every execution unit of every cluster. The return path may be narrower; perhaps only a single return path from RF2 (the level 2 register file) to RF1 (the level 1 register file) is provided.

In some configurations or applications, design considerations may favor reducing the bandwidth of the physical register file (e.g., the level 2 register file, RF2). Such considerations may include (1) reducing hardware complexity, such as the number of ports, or (2) reducing power, even when the full hardware bandwidth is available.

According to an example embodiment, the processor 100 may rely on a level 1 register file (RF1) for each cluster to provide improved physical register file (PRF) read bandwidth and latency. It is therefore useful to discuss PRF write bandwidth considerations: execution results are written from the execution units 136 to the level 2 register file (RF2), and from the level 1 register file (RF1) to the level 2 register file (RF2).

According to some aspects, the architecture may be configured to reduce RF2 bandwidth. For example, write-through to the RF2/PRF may be delayed until the value is known not to lie in a restart wavefront.

In a basic configuration, PRF (RF2) registers may be allocated in blocks sized according to the largest data value expected to be supported (e.g., 128 bits). Multiple smaller registers (e.g., 64-bit, 32-bit) are allocated within a block as they pass through the allocation (mapper) pipestage. Buffering between cluster execution unit writeback and the PRF (RF2) allows multiple small writes within a 128-bit block to be gathered together. These buffers are large enough that they can be managed by stalling cluster (S1) scheduling. In one aspect, the PRF (RF2) can therefore be regarded as a segmented sequential structure with very small segments.

In a more complex example, an alternative RF1/RF2 arrangement may be implemented. For example, RF1 may accumulate sequentially adjacent blocks of registers, which may then be written through to RF2. This is more likely to be implemented in configurations where RF2 is allocated sequentially.

In another example, values that are overwritten may not be written through. In an example design compatible with this approach, the instruction window is divided into blocks, or batches, and only values at the end of one batch that are read by another batch are written through. This approach would typically apply when RF2 is allocated sequentially, although it may also apply where the IW contains batches rather than an entry for each operation.

According to an example embodiment, each cluster may be made free-standing. That is, each cluster includes its own retirement logic (e.g., from IW2) and its own level 2 register file. In this way, each execution cluster 130 may be made independent.

D. Example Pipeline and Replay

As described above, the processor 100 may replay uops before all conditions or inputs for a uop have been received. According to an example embodiment, replay may be implemented using a multilevel replay mechanism. For example, a first replay mechanism may be used only for rare events for which replaying everything in the pipeline is acceptable, while a second replay mechanism goes through the (recovery) scheduler and replays only dependent operations. The processor 100 may use age-based scheduling where possible to avoid deadlock or livelock caused by replay. In addition, a replay storm anti-scheduler may be employed that traverses the dataflow graph faster than the wavefront of incorrect execution caused by an event such as a cache miss. This may prevent unnecessary work, such as a single replay causing all subsequent operations to replay.

There may be several different forms of operation writeback that can be used to wake up dependent operations or uops (that is, cause them to be dispatched for execution), including, but not limited to: an indication that data is known to be available; an indication that data is believed to be available although it has not yet been validated by error-correction hardware or full cache availability; an indication that previous data is known to be invalid (i.e., poisoned); and an indication that a previous writeback has completed safely.

According to an example embodiment, a replay predictor may determine whether an available dependent operation should be scheduled with data that is not yet replay-safe or should wait until the data is replay-safe.

E. Bypass Example

According to an example embodiment, a bypass network may be used. For example, a bypass network may be latency-homogeneous or latency-heterogeneous, and/or bandwidth-homogeneous or bandwidth-heterogeneous. In many cases, it may not be desirable to bypass every execution unit to every other execution unit in the same cycle, even with heterogeneous latency, and it may likewise not be desirable to be able to bypass the full bandwidth of every execution unit to every other execution unit.

FIG. 9 shows an example of a fully latency-homogeneous bypass network containing four execution units (two ALUs and two memory units). This bypass circuit may be divided into two bypass clusters, with additional latency between the clusters; FIG. 10 shows this circuit. Such a configuration may be latency-heterogeneous but, because it is fully connected, it is bandwidth-homogeneous: any execution unit may send its results to any other execution unit at full bandwidth.

In general, a system does not bypass every execution unit to every other execution unit within the same cycle, because the hardware cost would be excessive. In one aspect, this means that the bypass network is heterogeneous in both latency and bandwidth.

A bandwidth-heterogeneous bypass network may in some cases require storage or buffering to handle intervals in which more results needing inter-cluster bypassing are produced than there are wires available. Eventually, backpressure may stall the production of such inter-cluster bypasses. According to an example embodiment, careful scheduling may eliminate the need for such buffering altogether, but may delay the bypassing of results within a cluster as well as between clusters.

Rather than creating dedicated storage or buffering for inter-cluster bypassing, the processor 100 may use the existing physical register file mechanism. With this technique, there may be no explicit or separate mechanism for inter-cluster bypassing; instead, inter-cluster communication may occur through a physical register file shared between the clusters, such as the level 2 register file. FIG. 11 shows the use of a level 2 register file to provide an inter-cluster bypass mechanism or inter-cluster communication, according to an example embodiment.

In an example embodiment, the level 2 register file may have a single port for both reads and writes, for example in the actual array cells. Banking may be used to provide pseudo-multiporting for both reads and writes.
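The pseudo-multiporting described above can be illustrated with a small sketch. It is not taken from the specification: the bank count, the low-order-bit interleaving in `bank_of`, and the first-come arbitration in `arbitrate` are all assumptions chosen for illustration.

```python
NUM_BANKS = 4  # assumed bank count; the text does not give one


def bank_of(reg: int) -> int:
    """Map a register index to a bank by low-order bits (assumed interleaving)."""
    return reg % NUM_BANKS


def arbitrate(requests):
    """Grant at most one access per single-ported bank in one cycle.

    requests: register indices that want a read or write this cycle.
    Returns (granted, deferred); deferred requests would retry next cycle.
    """
    busy = set()
    granted, deferred = [], []
    for reg in requests:
        bank = bank_of(reg)
        if bank in busy:
            deferred.append(reg)   # bank conflict: the single port is taken
        else:
            busy.add(bank)
            granted.append(reg)
    return granted, deferred
```

With four banks, up to four accesses proceed in one cycle as long as they fall in different banks, which is how single-ported cells can appear multiported.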

A single set of wires returns the read data values to each cluster. In an embodiment, the same data return path used for inter-cluster communication of register values may also be used to return memory values read from a data cache shared between the clusters (this path may also be used for inter-cluster store buffer forwarding).

The circuit shown in FIG. 11 may have arbitration logic to schedule the limited number of physical ports among a potentially large number of register wires and to buffer conflicts.

In an example embodiment, the shared RF (e.g., the level 2 register file) may track which cluster produces a value and which clusters request it, and may, for example, proceed as follows.

a. If the value is present in the shared register file (RF), return it.

b. If the value is not present in the shared register file (RF), send a request to the producing cluster. This request may cause the value to be written to the shared register file (RF) as soon as it is ready, or immediately if it has already been produced. The shared register file may then reply to the requesting cluster.
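Steps a and b above can be sketched as a small directory model. This is only an illustration under stated assumptions: the class names `Cluster` and `SharedPRF`, the `waiting` list, and the synchronous method calls are hypothetical and stand in for hardware request and reply messages.

```python
class Cluster:
    """Hypothetical cluster holding locally produced register values."""

    def __init__(self, name):
        self.name = name
        self.local = {}  # reg -> value produced but perhaps not yet written through

    def write_through(self, reg, prf):
        """Push a ready local value into the shared PRF; True if it was ready."""
        if reg in self.local:
            prf.values[reg] = self.local[reg]
            return True
        return False


class SharedPRF:
    """Shared register file with a producer/requester directory (assumed layout)."""

    def __init__(self):
        self.values = {}    # reg -> value already written through
        self.producer = {}  # reg -> Cluster expected to produce it
        self.waiting = {}   # reg -> names of clusters still waiting for it

    def read(self, reg, requester):
        # Step a: the value is present, so return it.
        if reg in self.values:
            return self.values[reg]
        # Step b: ask the producing cluster to write the value through.
        owner = self.producer[reg]
        if owner.write_through(reg, self):
            return self.values[reg]
        # Not ready yet: record the requester so a later reply can be sent.
        self.waiting.setdefault(reg, []).append(requester)
        return None
```

A read that returns `None` models a request left pending until the producer finishes; a real design would reply to the recorded requesters when the value is written through.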

According to an example embodiment, the inter-cluster bypass protocol/mechanism may be used even in the absence of a per-cluster register file (e.g., a level 1 register file).

F. Example of Segmented Sequential Storage

In some cases, multithreading may pose problems for sequential data structures. Non-multithreaded sequential data structures may be allocated as, for example, circular queues. Multithreading may sometimes require replication of these circular queues. Replication of fixed-size circular queues may be limiting in some cases because of their fixed or constant size.

According to an example embodiment, storage, memory, or other resources may be allocated in segments or chunks. This technique is referred to herein as segmented sequential storage. For example, a portion of memory (or another resource) may be divided into segments. Objects (e.g., threads, clusters) may be allocated one or more segments or chunks of memory, for example sequentially.

According to an example embodiment, the segmented sequential approach may include dividing a very large buffer into segments. Allocation may be sequential within a segment. Segments may be allocated discontiguously, so that resources can be changed dynamically, which provides considerable flexibility.

In an example embodiment, segments may be randomly allocated from within a heap and may be linked to each other using pointers stored in the segments themselves or in an auxiliary data structure. Each segment or memory chunk to be allocated may be of a preset (discrete) size, or the segment size may vary dynamically. Segments of memory (or another resource) may be allocated (dynamically) on demand, or allocated as needed to an object, e.g., a thread or cluster.
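A minimal software sketch of segmented sequential storage follows. It assumes a preset segment size and keeps the link pointers in an auxiliary array; the names `SegmentedStore`, `SEGMENT_SIZE`, `append`, and `entries` are hypothetical, and a hardware implementation would differ.

```python
SEGMENT_SIZE = 4  # assumed preset segment size, in entries


class SegmentedStore:
    """Heap of fixed-size segments, chained per owner through stored pointers."""

    def __init__(self, num_segments):
        self.data = [[None] * SEGMENT_SIZE for _ in range(num_segments)]
        self.prev = [None] * num_segments  # auxiliary pointers to predecessor segments
        self.free = list(range(num_segments))
        self.tail = {}                     # owner -> (segment id, entries used)

    def append(self, owner, value):
        """Allocate sequentially within a segment; take a new segment on demand."""
        seg, used = self.tail.get(owner, (None, SEGMENT_SIZE))
        if used == SEGMENT_SIZE:           # no segment yet, or current one is full
            if not self.free:
                raise MemoryError("no free segments")
            new = self.free.pop()          # any free segment will do
            self.prev[new] = seg           # link back to the predecessor
            seg, used = new, 0
        self.data[seg][used] = value
        self.tail[owner] = (seg, used + 1)

    def entries(self, owner):
        """Return the owner's entries, oldest first, by walking the chain."""
        if owner not in self.tail:
            return []
        seg, used = self.tail[owner]
        out = []
        while seg is not None:
            out = self.data[seg][:used] + out
            seg, used = self.prev[seg], SEGMENT_SIZE
        return out
```

Because owners draw segments from one shared heap, a busy thread can hold many segments while an idle one holds none, which a replicated fixed-size circular queue cannot do.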

According to an example embodiment, hardware (or a hardware block within the processor) may be used to manage the segmented sequential storage. Segments of memory may be allocated to threads or clusters, and additional circuitry may be provided to determine whether the next storage operation will fill (overflow) the current segment or consume all of its data (underflow). A stored pointer linking a segment to its predecessor may be used to find the correct address within the heap. Additional segments may be allocated automatically or at the request of an object (e.g., a thread, cluster, or other object).

For example, an executing program may perform a store operation into the level 1 store buffer (SB1). Some time later, the store may be copied to the level 2 store buffer (SB2). The hardware circuitry managing this copy may store into SB2, or may allocate a free segment and link it into SB2 for the current thread (or current object). In an embodiment, the executing program need not manage this process and typically has neither the ability to do so nor the ability to observe the result (e.g., because the store buffer SB2 is typically invisible to the executing program). Furthermore, entries in the segments of segmented sequential storage need not be simple memory storage locations. For example, each entry in SB2 may be associated with an address comparator that performs a CAM function.

In some uses there may be no intra-segment computation. For example, a trace log may be ordinary RAM. Sequential allocation within a segment may be used to permit parallel, high-bandwidth reads. Segments should be large enough to meet the bandwidth target. Randomly allocated segments may be chained together using pointers. The segment length may be sufficient to hide the latency of dereferencing the next segment in the chain.

In other uses, there may be intra-segment computation. In some applications, segments are given tags or CAMs, allowing the randomly allocated segments to be placed in dynamic order. Alternatively, each segment's computation may return a candidate, and the segment tags may be used to reorder these candidates and obtain the desired entry.

In still other uses, timestamps or IDs may be compared against the position of an entry in the segmented sequential data structure. In this case, segments may be allocated discontiguously, but in a circular manner. For example, according to an example embodiment, with a simple circular structure and a single wrap bit, a new segment allocated to a thread can be used only if it is at or above the newest and at or below the oldest in circular order. If it is at or above the oldest and at or below the newest, it cannot be used immediately but must wait until the oldest advances. Multiple wrap bits allow faster reuse, but the constraint remains (short of allocating enough wrap bits, twice the size of the index, to eliminate the problem entirely).

G. Example of Hierarchical Store Buffers

A store buffer is typically located between the microprocessor and the memory subsystem. A store need not complete promptly unless a dependent operation arises. In this respect a store differs from a memory load: if an instruction A loads a value from memory and an instruction B uses that value in its evaluation, delaying the execution of A may create a problem. If, on the other hand, A stores a value to memory, no explicit instruction depends on the completion of this operation. It may therefore be advantageous to set store operations aside and give priority to load operations.

One problem with this approach is that instruction B may depend on the value stored by an earlier instruction A through an implicit relationship, that is, both refer to the same memory location. This relationship may not be apparent until the program executes; indeed, the dependence may exist in only some, not all, executions of A and B, depending on other data values. If A stores a value to memory and a subsequent instruction B needs to load that value, but A is still held in the store buffer and has not yet finished writing the value to memory, the hardware causes B to take the value from the store buffer rather than from memory. The value in memory is considered stale at that point (and stale data may produce errors or problems in the program).

One way to address this problem is to give each entry in the store buffer a matching CAM. A store buffer entry has two elements: the address to be written and the data to be written. The entries may maintain the order in which they were written to the store buffer, so that the oldest entry is the next to be written to memory. The address comparator for each entry may compare its address against the address of any new load operation. If the address to be loaded matches one in the store buffer, the value in the store buffer is forwarded to the load operation and the memory load is cancelled. If more than one address in the store buffer matches the address to be loaded, the newest matching entry, i.e., the one that entered the store buffer most recently, is used. The circuit may be somewhat more complicated because a store buffer entry may hold a large chunk of data to be written (usually the processor word size, e.g., 32 bits) while the architecture supports smaller writes (e.g., bytes); in that case the address matcher may be augmented with valid bits indicating which smaller chunks within each store buffer entry hold good data. Finally, depending on other factors in the microarchitecture design, an entry may be cancelled from the store buffer before it is written if a newer entry writes to exactly the same location. One problem that can arise is that, in some cases, a relatively large store buffer may be used to support a large instruction window, which in some cases may slow the processor clock cycle.
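The forwarding behavior described above can be sketched in software as follows. The sketch assumes whole-word entries (the per-byte valid bits are omitted) and replaces the parallel comparators with a newest-to-oldest scan; the class name `StoreBuffer` and its methods are hypothetical.

```python
from collections import deque


class StoreBuffer:
    """FIFO store buffer; each entry is (address, data), oldest on the left."""

    def __init__(self):
        self.entries = deque()

    def store(self, addr, data):
        """Queue a store instead of writing memory immediately."""
        self.entries.append((addr, data))

    def load(self, addr, memory):
        """Forward from the newest matching entry, else read memory."""
        # Software stand-in for the per-entry comparators: scan newest first.
        for entry_addr, data in reversed(self.entries):
            if entry_addr == addr:
                return data        # forwarded; the memory load is not needed
        return memory.get(addr)    # no match in the buffer: use memory

    def drain_one(self, memory):
        """Write the oldest entry to memory, preserving store order."""
        if self.entries:
            addr, data = self.entries.popleft()
            memory[addr] = data
```

For instance, after two queued stores to the same address, a load sees the second value even though memory still holds the original one, which is exactly the stale-data case the comparators guard against.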

Therefore, according to an example embodiment, a multilevel store buffer may be provided that includes a relatively small, fast buffer, e.g., a small store buffer located close to the execution units (e.g., a level 1 store buffer, SB1), and a large store buffer that supports a large instruction window (e.g., a level 2 store buffer, SB2). This approach may be used for single-cluster processors as well as for multithreaded and multicluster processors operating on a single thread.

According to an example embodiment, each level 1 store buffer, SB1, may be a randomly allocated structure; that is, it need not be allocated in FIFO fashion as in a conventional store buffer. A conventional store buffer tracks age by order in the buffer, whereas a randomly allocated SB1 may allocate entries without regard to their position in the store buffer, and SB1 entries may store age information explicitly. These values may be called the def (definition) and kill times. The instruction scheduler may keep track of scheduling time. When a store is written into the level 1 store buffer (SB1), the current time is used as its def time and its kill time is left undefined. If another entry in SB1 matches the same address as this new store and has an undefined kill time, its kill time is filled in. The circuit is somewhat more complicated because multiple entries may have the same address but differ in which valid bytes were written, and the overlap with the new write (i.e., byte overlap) can be complicated. When a subsequent load is performed at time X, the match is completed by comparing its address against any SB1 entry that has the same address and no kill time. Finally, according to an example embodiment, entries may be evicted from SB1 and moved (or copied) to SB2 in order of their def times, so that stores remain in proper logical order.

According to an example embodiment, the cluster store buffer (SB1) (level 1 store buffer) may be a randomly (LRU) allocated range CAM structure: every entry in the structure may be characterized by the interval over which it is valid, i.e., a [Def, Kill] interval, in addition to its address. A load matches an SB1 entry if the addresses match and the load's timestamp lies within the [Def, Kill] interval.
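The def/kill matching can be sketched as below. Several points are assumptions rather than details from the text: the interval is treated as half-open, [def, kill), so that a load at the exact kill time sees the newer store; stores are assumed to arrive in increasing time order; and whole-address matching stands in for the byte-mask logic. The class name `SB1` and its fields are hypothetical.

```python
class SB1:
    """Randomly allocated store buffer whose entries carry explicit ages."""

    def __init__(self):
        self.entries = []  # position in this list carries no age information

    def store(self, addr, data, now):
        """Record a store at time `now`, closing any open interval at addr."""
        for entry in self.entries:
            if entry["addr"] == addr and entry["kill"] is None:
                entry["kill"] = now  # the older store is killed by this one
        self.entries.append({"addr": addr, "data": data, "def": now, "kill": None})

    def load(self, addr, t):
        """Return data from the entry valid at time t, or None on a miss."""
        for entry in self.entries:
            if entry["addr"] != addr:
                continue
            still_live = entry["kill"] is None or t < entry["kill"]
            if entry["def"] <= t and still_live:
                return entry["data"]
        return None
```

Because each entry carries its own interval, a load scheduled late but logically early can still pick the store that was current at its own timestamp, independent of where entries sit in the buffer.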

According to an example embodiment, the level 1 store buffer (SB1) may have a data width of, for example, 64 bits (128 bits, e.g., 4 x 32 bits, or other data widths may also be used). A bitmask may indicate not only (1) which bytes were written by the associated store but also (2) which of the unwritten bytes hold valid data. Missing bytes may be supplied for partial writes that do not occupy the full 64-bit width. New store data presented to SB1 may be CAMed against the level 1 store buffer (SB1) entries and may also update the missing bytes of matching stores.

FIG. 12 shows a store buffer according to an example embodiment. The level 2 store buffer (SB2) operates as a FIFO, so its comparators can be relatively simple. Each entry may contain def/kill data, but this data is not matched against the address and execution time associated with a load instruction. Instead, SB2 may be divided into physical partitions or segments 1202 (e.g., segmented sequential storage) (FIG. 12), with each partition holding the def time of the oldest entry it contains. Entries within an SB2 partition are thus matched against the address of a load instruction in the FIFO order of the partition. For the newest of the matching entries within each partition, the partition returns the [Def, Kill] interval together with the data stored in the entry. Thereafter, according to an embodiment, selector logic 1204 selects the oldest of the matching entries.

If the store buffer has multiple live partitions, all live partitions may be searched when a load occurs. However, because newer partitions are known to be unable to provide a match, the search may begin with the partition covering the logical scheduling time that matches the load time, together with the next-oldest partition. If a match is found in these two partitions, the data is supplied to the load instruction. If neither of these two partitions matches, all older partitions are searched next and the newest match is returned to the load instruction. In many cases no partition matches, so the load misses in SB2 and is released to memory for the actual data.
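The two-phase partition search can be sketched as follows. The data layout is assumed for illustration only: `partitions` is a list ordered oldest to newest, each partition is a dict with a `def` time and a FIFO list of `(def_time, addr, data)` tuples, and the function name `search_sb2` is hypothetical.

```python
def search_sb2(partitions, addr, load_time):
    """Return forwarded data for a load, or None if the load misses in SB2."""
    # Partitions whose oldest entry is newer than the load cannot match.
    eligible = [p for p in partitions if p["def"] <= load_time]

    def newest_match(partition):
        # Within a partition, FIFO order gives age, so scan newest first.
        for def_time, entry_addr, data in reversed(partition["entries"]):
            if entry_addr == addr and def_time <= load_time:
                return (data,)
        return None

    # Phase 1: the partition covering the load time and the next-oldest one.
    # Phase 2: every older partition, so the first hit is still the newest match.
    first, rest = eligible[-2:], eligible[:-2]
    for group in (first, rest):
        for partition in reversed(group):
            hit = newest_match(partition)
            if hit is not None:
                return hit[0]
    return None  # miss: the load would go to memory for the data
```

Searching the two most likely partitions first keeps the common case short, and the full scan of older partitions runs only when that first probe fails.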

According to an example embodiment, less precise matching may be used where it suits some forms of speculative multithreading. In effect, a thread may execute on the assumption that certain data elements are not changed by concurrently executing threads. The assumption is checked later, when the speculative thread either fails or is ready to become non-speculative. According to an example embodiment, the hardware may support the case in which a data value available to a speculative thread has changed and the change was not detected (or could not be detected) in the interim, because the data can be checked at the end. In practice, for at least some systems, matches in the store buffer (SB) may be relatively rare and SB1 is accessed before SB2, so a speculative thread may cut short (or abandon) the SB2 search before determining that there is no match. The same hardware that runs speculative threads, performing a subsequent check to detect use of stale data by the speculative thread, may be used.

Also, according to an example embodiment, entries may be maintained with each partition or segment in the level 2 store buffer (SB2), and these entries may be used to hold values that were satisfied by stores held in later partitions. For example, a load may miss in SB1, miss in the SB2 partition holding the stores concurrent with the load and in the next-youngest partition, and then hit an older partition. These data values are then cached in the concurrent SB2 partition.

Whether multiple threads or multiple clusters are used, the SB2 partitions may be managed as segmented sequential storage, with each partition being one segment. For example, each partition has links to the next-younger and next-older partitions within the same thread.

Also, according to an embodiment, a process such as a microcode routine may be used to examine the sequence of entries in an SB2 partition (or segment) chain and to build new chains partitioned by store address. If an original single SB2 chain is split into N chains of equal length, the number of partitions expected to be searched for each new load address is reduced to approximately 1/N. Because most load addresses miss in the store buffer, this 1/N reduction is often realized for most memory loads.
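The 1/N effect can be illustrated with a small sketch. It assumes, purely for illustration, that chains are selected by cache-line index modulo N and that each segment holds a fixed number of stores; the function names and the 64-byte line size are hypothetical and not taken from the description.

```python
def build_chains(store_addrs, n_chains, seg_size, line_bytes=64):
    """Split a sequence of store addresses into n_chains chains of segments.

    Each chain is a list of segments; each segment is a list of addresses.
    The chain is picked by (address // line_bytes) % n_chains, an assumed
    partitioning rule.
    """
    chains = [[] for _ in range(n_chains)]
    for addr in store_addrs:
        chain = chains[(addr // line_bytes) % n_chains]
        if not chain or len(chain[-1]) == seg_size:
            chain.append([])  # start a new segment when the last one is full
        chain[-1].append(addr)
    return chains


def segments_to_probe(chains, load_addr, line_bytes=64):
    """Worst-case number of segments a missing load must probe."""
    return len(chains[(load_addr // line_bytes) % len(chains)])
```

With 256 stores spread evenly over the address space and 32 stores per segment, a load that misses probes 8 segments in a single chain but only 2 when the chain is split four ways.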

Also, when a store moves from SB1 to SB2, it need not be removed from SB1. In this way SB1 can act as a filter, serving common matches and reducing the bandwidth demanded of SB2. Once an SB1 entry has been passed to SB2 it should be marked as copied, so that it is not copied there again in the future. The SB1 structure may simply be managed as an LRU cache, with entries surviving longer as long as they keep matching subsequent load addresses. If an execution cluster supports multithreading, SB1 entries may be separated by thread ID to distinguish them.
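A minimal software sketch of SB1 acting as an LRU filter in front of SB2 follows. The class name `SB1Filter`, the use of a Python list as a stand-in for SB2, and the policy of copying an uncopied victim to SB2 on eviction are assumptions made for illustration only.

```python
from collections import OrderedDict


class SB1Filter:
    def __init__(self, capacity, sb2):
        self.capacity = capacity
        self.sb2 = sb2                # hypothetical stand-in for SB2: a list
        self.entries = OrderedDict()  # (thread_id, addr) -> [data, copied]

    def _copy_to_sb2(self, key, entry):
        # Copy an entry to SB2 at most once, then mark it as copied.
        if not entry[1]:
            self.sb2.append((key[0], key[1], entry[0]))
            entry[1] = True

    def store(self, thread_id, addr, data):
        key = (thread_id, addr)
        old = self.entries.get(key)
        if old is not None:
            self._copy_to_sb2(key, old)  # do not lose an older, uncopied store
        self.entries[key] = [data, False]
        self.entries.move_to_end(key)    # most recently used
        while len(self.entries) > self.capacity:
            victim_key, victim = self.entries.popitem(last=False)
            self._copy_to_sb2(victim_key, victim)

    def load(self, thread_id, addr):
        key = (thread_id, addr)
        entry = self.entries.get(key)
        if entry is None:
            return None                  # SB1 miss: the load goes on to SB2
        self.entries.move_to_end(key)    # a hit keeps the entry alive longer
        return entry[0]

    def drain(self):
        """Copy all uncopied entries to SB2 without removing them from SB1."""
        copied = 0
        for key, entry in self.entries.items():
            if not entry[1]:
                self._copy_to_sb2(key, entry)
                copied += 1
        return copied
```

Keying entries by `(thread_id, addr)` keeps the threads of a multithreaded cluster from matching each other's stores.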

Further details and embodiments of the multilevel store buffer are described below. The level 1 store buffer (SB1) may include one or more content-addressable memories (CAMs). The level 2 store buffer may include, for example, multiple partitions or segments of a classic store buffer, organized so that search time is minimized as the window grows.

According to an example embodiment, SB2 (the level 2 store buffer), which is the shared inter-cluster store buffer, may be a segmented sequential data structure allocated per thread. Each segment may perform store buffer forwarding calculations, such as finding the newest store that is older than a given load. Each segment may return a candidate store together with a [Def, Kill] interval. In an example embodiment, SB2 does not actually store the [Def, Kill] intervals, nor does it CAM them globally. That is, the returned interval may indicate "valid at end of segment" as its Kill time. By comparing all of the candidate stores returned by the segments, the single store to forward can be determined.

An SB2 segment performs the appropriate updates of missing bytes as stores are placed in it. However, a store may not be exposed to the SB2 segments of all threads, so there may be no CAM updates between segments. An SB1 entry may hold data that is missing from an SB2 entry, and may also be used to update the SB2 entry; that is, SB1 is updated lazily.

An incoming load may be satisfied from SB1 (the level 1 store buffer) if it hits there. Loads that miss in SB1 are sent to the cache (SB2) and begin probing SB2. The load's timestamp is known, so the segment containing the load and its immediate predecessor are probed immediately. If the load is satisfied there, the data is obtained immediately. If it is not, however, all SB2 segments (partitions) between the load and the oldest instruction of that thread need to be probed. This may be treated as a scheduling and prediction problem. All such segments may be checked for the least speculative thread. For speculative SpMT threads, however, it is possible not to probe particular segments, because verification re-execution is performed and ultimately carries out all necessary probes close to retirement (when fewer are needed).

In an example embodiment, each segment or partition may contain approximately 32 stores. A few additional entries, for example 4 entries, may be allocated to hold "live-ins", so that a load can be satisfied directly from the segment containing them even if the store is considerably older. These "live-in" entries may be managed according to an LRU (least recently used) caching algorithm.

All stores may be allocated in a single chain of segments. This permits address-unknown comparison as well as address matching. (It is assumed that Multi-Scalar has a store-load dependency predictor.) Segments need not be full. For example, during eager execution, one thread continues to use the original segment while the other thread uses a new segment. If the other path turns out to be the correct one, the original SB2 segment is erased after the fork point.

In one embodiment, because all stores may be allocated in a single chain of segments (partitions), stores to very different addresses may be held in the same store buffer. Copying some of the stores out of a segment into a new segment restricted to a particular address range can reduce the number of buffers that must be probed for a given load. A level 2 store buffer (SB2) segment may have a base address/mask pair indicating its valid address range. In one embodiment, the store buffer may be partitioned by address range.
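The base address/mask check can be sketched as follows. The segment names, the example ranges, and the rule that an address belongs to a segment when `(addr & mask) == base` are illustrative assumptions; the description does not fix an encoding.

```python
def in_range(addr, base, mask):
    # An address is in range when its masked bits equal the base.
    return (addr & mask) == base


def segments_for_load(segments, addr):
    """Return the names of the address-range segments a load must probe.

    `segments` is a list of (name, base, mask) tuples (hypothetical layout).
    """
    return [name for name, base, mask in segments if in_range(addr, base, mask)]


# Example: two segments, each restricted to one 4 KiB-aligned region.
EXAMPLE_SEGMENTS = [
    ("seg_a", 0x1000, 0xF000),
    ("seg_b", 0x7000, 0xF000),
]
```

A load to `0x1234` would probe only `seg_a`, and a load to `0x3000` would probe neither range-restricted segment.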

SB2 segments or partitions may be chained together, pointing to their parent in the main chain as well as to summary and address-range partitions. Similarly, with SpMT (speculative multithreading) and eager threads, two SB2 chains point to the same ancestor. Reclaiming an SB2 segment may involve updating such links. In one embodiment there may be no garbage collection; because no delay for modification is needed, a segment can be reclaimed whenever all threads using it are ready to retire.

In an example embodiment, the store buffer microarchitecture may use timestamps, in particular for the SB1 [Def, Kill] interval CAM. In SB2, sequential allocation and cross-links mean that timestamps are implicit, or rather, that SB1 timestamps can be reconstructed whenever needed. If SB1 timestamps need to be renumbered because branch mispredictions occasionally occur in SpMT, SB2 can be treated as authoritative, so the entire SB1 can be discarded (except for stores waiting to be sent to SB2).

The implicit SB2 ordering and the explicit SB1 timestamps allow multithreaded SB1 timestamps to be simplified in some cases. For example, skip-ahead threads need not project their own timestamps; from the point of view of SB1, they may have completely separate thread IDs, and may also adopt the bit masks used for eager execution.

Eager threads may adopt bit masks in a known manner. This allows SB1 entries from before the fork to be shared by all descendant threads. When these bits run out, however, forking need not stop: the forked thread can be assigned a new SB1 thread ID, and only the opportunity to share SB1 entries is lost.
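One common way to realize such bit masks is a path tag: each fork consumes one tag bit, and an entry is visible to a thread when the entry's tag is a prefix of the thread's path. The sketch below assumes this scheme for illustration; the description says only that bit masks are adopted in a known manner, so the `PathTag` type, `fork`, and `visible` are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class PathTag(NamedTuple):
    value: int  # direction taken at each fork that has been assigned a bit
    mask: int   # which bits are meaningful for this thread or entry


def fork(tag: PathTag, width: int) -> Optional[Tuple[PathTag, PathTag]]:
    """Split a thread into two eager paths, or return None if bits ran out."""
    for bit in range(width):
        b = 1 << bit
        if not tag.mask & b:
            return (PathTag(tag.value, tag.mask | b),
                    PathTag(tag.value | b, tag.mask | b))
    return None  # caller would fall back to a fresh SB1 thread ID


def visible(entry: PathTag, thread: PathTag) -> bool:
    """An entry is visible if it was written on the thread's own path."""
    return ((entry.mask & ~thread.mask) == 0
            and (thread.value & entry.mask) == entry.value)
```

An entry written before the fork carries an empty mask and so matches both children, while an entry written by one child is hidden from its sibling.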

According to one embodiment, stores terminate in the shared level 2 store buffer (SB2). Because SB2 is closer to the L2 data cache than to L1, store commit is performed to L2, with invalidation or update of L1 where necessary; this may be regarded as inverse write-through.

In an example embodiment, a store-load dependency predictor predicts whether a load should receive its data from a store. The associated store buffer offsets are interpreted by the segmented sequential SB2; they typically do not apply to the range-CAM SB1. Once the load and store addresses are known, it is simple to determine whether the load and the store actually match. It is still necessary, however, to verify that no intervening newer store also matches. Store-load forwarding prediction may constrain or limit the store buffer forwarding that is required, but cannot eliminate it entirely. In one embodiment it also remains necessary to verify that the prediction was correct. As described above, such verification may involve probing some segments of SB2.

If the store-load forwarding predictor is very good, verification may be deferred. As a load approaches retirement, fewer SB2 segments need to be probed. Such store-load forwarding prediction reduces the complexity of SB1. The [Def, Kill] CAM, if present at all, may not be used very often. Alternatively, SB1 may be CAMed on the SB2 index, so that the associated prediction can access it.

H. Examples of Multithreading

According to an example embodiment, the multilevel instruction pipeline of processor 100 (e.g., FIG. 1) may support multithreading, such as implicit multithreading, explicit multithreading, and other kinds of multithreading. In one embodiment, multiple threads may run on the same out-of-order execution core. In some cases, however, this arrangement may cause pipeline thrashing and contention. To improve efficiency, therefore, processor 100 may have a multilevel pipeline in which one or more structures are replicated across multiple execution clusters. As noted above, there may be, but need not be, one thread per cluster. There may be multiple threads per cluster, and a thread may spawn (or fork to create) a new thread, which may run on the same cluster as its parent thread or on a different cluster. Threads may be statically bound to clusters, or may be created dynamically and assigned to other clusters.

According to an example embodiment, processor 100 may virtualize threads to allow a large number of threads, storing their state in data structures held in user memory. Virtual user threads may be context-switched from these data structures by hardware and microcode, or time-multiplexed onto a smaller number of hardware thread contexts.

According to an example embodiment, the multi-cluster multithreaded microarchitecture may provide explicit multithreading, in which threads may be created at boot time and may, but need not, each run on a single cluster; this is merely a simple example. Explicit multithreading may refer, for example, to a processor on which the programmer may explicitly specify parallel computation. Static explicit multithreading (SEMT) may, according to an example embodiment, refer to the case in which logical CPUs or logical processors are exposed to the operating system (OS) at boot time and the OS manages each of them as an independent CPU. Dynamic explicit multithreading (DEMT) may allow a user to create threads via a Fork instruction. The OS may, but need not, be aware of such threads.

Forking (e.g., a thread spawning a new thread) may involve, for example, obtaining a new instruction pointer (IP) and obtaining a new register context (for the new thread). One mechanism that can provide this is to read the IP and other register values from a memory data structure. In one embodiment, the full architectural state is provided to the new thread, the only difference being a status code that indicates whether the thread is the parent or the child.

In an example embodiment, because the parent and child threads reside in the same memory space, parent stores committed before the fork should be visible to the child. In some cases, however, it may not be appropriate to forward stores that follow the fork point between the parent and child threads, because doing so would produce architectural behavior that differs from emulating this architecture on separate CPUs under a processor-consistent memory ordering model.

According to another example embodiment, the parent and child threads may run on different clusters. A child thread may be created on a separate cluster by sending the desired IP. Most registers may be handled in hardware rather than in software, or some registers (or none at all) may be transferred between clusters along with the IP, and the store buffer may be drained so that its contents are visible to both the parent and child threads before the child executes. Inter-cluster transfers may occur over data paths similar to those used for inter-cluster memory traffic.

Store buffer coherence between clusters is practical and can often eliminate the largest fork delay. To maintain this coherence, the contents of the entire store buffer may be eagerly pushed out of the parent cluster and all pulled into the child cluster, or only what is needed may be pulled in lazily, on demand from the child cluster. As the amount of speculative execution increases, lazy pulling on demand becomes advantageous.

The data paths between the store buffers of different clusters are shown in FIG. 13. Once there are data paths between clusters to keep the store buffers coherent (shared, roughly, with inter-cluster cache coherence), fork instruction microcode can pass register values by means of explicit pseudo-stores. This amounts to pushing register values into the store buffer mechanism, which may in turn involve pulling them in on demand.

According to an example embodiment, one or more explicit threads may run on any cluster. If the clusters themselves are multithreaded, dynamic inter-cluster thread migration may be used, for example for load balancing. Dynamic inter-cluster thread migration may also use an efficient inter-cluster data value transfer mechanism for both store buffer and register values. For example:

A relatively small number of clusters may be used: 2, 3, or 4. DEMT and IMT workloads often need more threads than that, approaching 16 threads.

It is often desirable to run more than one thread on the same cluster, for Switch-on-Event Multithreading (SoEMT). However, if two threads start out sharing the same cluster using SoEMT and later stop taking cache misses, inter-cluster migration is desirable for load balancing.

Frequent communication between parent and child is expected at the time of the fork; afterwards, little (or no) communication is expected except through immutable (non-store-buffer) memory.

This may involve temporarily running the child thread on the same cluster as the parent thread and migrating it only later. This also applies well to IMT (implicit multithreading)/SpMT (speculative multithreading), because thread migration may be tolerable for aggressively speculative threads.

In another example embodiment, a datascalar approach may be used. In this approach, the child thread may run on both the parent's old cluster and the new cluster. Datascalar transfers can then push values from the old cluster to the new cluster. After a while, the child thread on the parent's old cluster is terminated, and the remaining child thread runs only on the new cluster. From then on, the child thread relies on lazy, on-demand pulls to obtain values that were not pushed by the datascalar thread.

In a variation, for example as a way to benefit from multiple clusters without SpMT, batching of instructions may be used: a batch of instructions (e.g., 1000 instructions) is executed on one cluster, and the next 1000 instructions are then executed on a second cluster. This approach puts the emphasis on thread migration and may also involve inter-cluster forwarding. If the batches cycle among clusters, the result becomes very similar to SMT clustering.

In an example embodiment, both IMT and DEMT may use clone forks, in which the parent and child both run on the same cluster and migrate later.

To the explicit multithreading microarchitecture described above, a number of features may (optionally) be added for implicit, speculative, skip-ahead, and eager multithreading.

One example is a thread predictor (TP), used to predict which thread should be activated to run next.

Mechanisms for benefiting from speculative execution are as follows. One such mechanism may involve taking advantage of speculation to prefetch data into the cache. In an example embodiment, a trace-log (TL) mechanism may be used. The TL records which data values read by a thread were previously produced by other threads (i.e., live-in values), and which values written by the thread may be read by subsequent threads (i.e., live-out values). The TL may employ parallel verification to facilitate re-execution.
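The live-in/live-out bookkeeping can be sketched in a few lines. This is a simplified software model under stated assumptions: memory is modeled as a dictionary, the `TraceLog` class and its methods are hypothetical names, and verification checks the live-in values one by one rather than in parallel.

```python
class TraceLog:
    def __init__(self):
        self.live_in = {}   # location -> value read from outside the thread
        self.live_out = {}  # location -> last value written by the thread

    def read(self, loc, memory):
        # A read of the thread's own earlier write is not a live-in.
        if loc in self.live_out:
            return self.live_out[loc]
        value = memory[loc]
        self.live_in.setdefault(loc, value)  # record the first observed value
        return value

    def write(self, loc, value):
        self.live_out[loc] = value

    def verify(self, memory):
        """True if every recorded live-in still holds, so results are reusable."""
        return all(memory.get(loc) == v for loc, v in self.live_in.items())
```

If `verify` succeeds, the recorded live-out values can be applied without re-running the speculative work; if it fails, the thread's results depend on stale inputs.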

Store buffer tag bits may also be used to support eager forwarding.

FIG. 14 shows a trace log 1402 connected via line 1404 to the mapper (M) and the instruction cache.

Per-cluster trace logs are not shown, to emphasize that non-speculative threads cannot commit results to the trace log (unless it is used as a non-speculative block instruction reuse buffer). In other embodiments, the trace log may be a large, shared, segmented sequential memory structure with few ports.

Instructions fetched from the trace log 1402 for verification re-execution may be minimally decoded and sent directly to the mapper/renamer (M).

As described above, there are many techniques that may be used to support multithreading. Many of those described may relate to explicit multithreading.

A number of techniques may also be used to support implicit multithreading, including eager execution and speculative/skip-ahead multithreading (SpMT/SkMT).

Eager execution requires the presence of a forking mechanism. The forking of maps, the adoption of tag bits, and so on are well known. When an eager fork is resolved, the wrong path is simply discarded and arrangements are made to recover its resources.

Eager threads need not be forked immediately. As for SpMT threads, a potential fork position may simply be recorded as an offset into the map delta list. Later, if it is decided to fork the thread, the map can be reconstructed at the fork position and then cloned (or, in practice, cloned and then rolled along the delta list to the fork position). This permits deferred eager forking. Eager forking is simpler than SpMT, because no forwarding between threads is needed at all, only forwarding from the pre-fork path to the post-fork paths.
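Deferred forking from a map delta list can be sketched as follows. The representation is assumed for illustration: the register map is a dictionary from architectural to physical register names, each delta records `(arch, old_phys, new_phys)`, and `map_at` clones the current map and undoes deltas back to the recorded offset.

```python
def rename(reg_map, deltas, arch, new_phys):
    """Apply one rename and log it in the delta list."""
    deltas.append((arch, reg_map[arch], new_phys))
    reg_map[arch] = new_phys


def map_at(current_map, deltas, offset):
    """Rebuild the register map as it was when len(deltas) == offset.

    Clones the current map first, then walks the delta list backwards,
    so the running thread's own map is left untouched.
    """
    forked = dict(current_map)
    for arch, old_phys, _new_phys in reversed(deltas[offset:]):
        forked[arch] = old_phys
    return forked
```

Recording only `offset = len(deltas)` at a potential fork site is cheap; the cost of rebuilding the map is paid only if the fork is actually taken.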

The skip-ahead form of speculative multithreading, which is itself a form of implicit multithreading, may also be used to improve single-thread performance.

IMT/SpMT/SkMT may be more complex than eager threading, because it may involve communication from less speculative threads to more speculative threads. In an example embodiment, this may be accomplished through the trace log, in which the results of instructions may be recorded. When a less speculative thread runs into a more speculative thread, it joins by fetching the operations and results stored in the trace log, verifying that they are the correct instructions and that they produce the same results. Parallel verification can, in one embodiment, typically be faster than the original execution, provided the recorded results are correct.

If data values are incorrect but the instructions were fetched correctly, re-execution suffices as long as the errors are sufficiently sparse. If they are too dense, or if the instruction streams diverge, trace-log re-execution is stopped, although in some embodiments the thread should be prepared to rejoin later.

The trace-log may record the results of one or more (or even all) instructions. In other embodiments, only branch directions may be recorded.

In other embodiments, a multilevel trace-log may be provided. For example, the results of (for example, all) instructions are recorded, but they are also batched hierarchically, so that live-ins are recorded and verified per block. This may be used to verify replay at coarse granularity.

Trace-log start points may be recorded in a table that is hash-indexed by SpMT history. Potential join instructions, such as a "return", look up the SpMT history and probe for a trace-log start point. If one is found, trace-log verification replay may begin.
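One way to picture such a table is a small direct-mapped array with tags. The sketch below is an assumption-laden illustration: the hash function, the table size, and the names `TraceLogStartTable`, `record`, and `lookup` are all hypothetical.

```python
class TraceLogStartTable:
    """Direct-mapped table of trace-log start points (hypothetical layout)."""

    def __init__(self, index_bits=6):
        self.mask = (1 << index_bits) - 1
        self.slots = [None] * (1 << index_bits)

    def _index(self, join_ip, spmt_history):
        # Mix the instruction pointer with the SpMT history; the multiplier
        # is an arbitrary odd constant chosen only for illustration.
        return (join_ip ^ (spmt_history * 0x9E3779B1)) & self.mask

    def record(self, join_ip, spmt_history, log_position):
        slot = self._index(join_ip, spmt_history)
        self.slots[slot] = (join_ip, spmt_history, log_position)

    def lookup(self, join_ip, spmt_history):
        # A potential join instruction probes the table; the stored tag
        # guards against two different sites hashing to the same slot.
        entry = self.slots[self._index(join_ip, spmt_history)]
        if entry is not None and entry[:2] == (join_ip, spmt_history):
            return entry[2]
        return None
```

A hit returns the position in the trace-log at which verification replay could start; a miss means the join instruction simply executes normally.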

According to an example embodiment, a fork predictor may be used. Fork sites may be indexed by an IP hash, possibly combined with branch predictor history, and may possibly provide a speculation depth. According to an exemplary embodiment, the fork predictor may provide the Von Neumann identifier (VNID) of the last dependency; for example, the thread may be forked once that point (VNID) has been passed. The VNID may also record, for example, whether the last dependency was itself speculative when a misspeculation was detected.

Absent such a last dependency, the fork predictor may record, for example, how far execution was able to proceed, expressed as a reduced form of the number of instructions whose trace-log verification replay succeeded. If that number is too small, the fork may not be made.
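The predictor behavior described above can be sketched as a small table indexed by a hash of the instruction pointer and branch history. The sketch is hypothetical throughout: the "reduced form" of the instruction count is modeled here as its bit length (a logarithmic bucket), and the names `ForkPredictor`, `update`, `should_fork`, and `min_bucket` are assumptions for illustration.

```python
class ForkPredictor:
    """Table of compressed run lengths per fork site (hypothetical model)."""

    def __init__(self, index_bits=8, min_bucket=4):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)
        self.min_bucket = min_bucket

    def _index(self, ip, branch_history):
        # Fork sites are indexed by an IP hash mixed with branch history.
        return (ip ^ branch_history) & self.mask

    def update(self, ip, branch_history, verified_instructions):
        # Store a reduced form of the count: its bit length, capped at 15
        # so that an entry fits in four bits.
        bucket = min(verified_instructions.bit_length(), 15)
        self.table[self._index(ip, branch_history)] = bucket

    def should_fork(self, ip, branch_history):
        # Too few successfully replayed instructions: do not fork.
        return self.table[self._index(ip, branch_history)] >= self.min_bucket
```

With `min_bucket=4`, a site is considered worth forking only if at least eight instructions were verified the last time it was tried.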

The fork predictor may record a priori information, such as the number of instructions between a CALL and its return, the number of branch mispredictions, the number of cache misses, and pipeline stalls.

I. Examples of Caches and Memory

Conventional instructions may be cached, for example, in an instruction cache. Sometimes these instructions are assembled into blocks called traces, which may typically be cached in a trace cache. A trace cache in a processor may include, for example, an instruction cache that stores dynamic instruction sequences after they have been fetched and executed, so that those instructions can be followed at subsequent times without returning to the regular instruction cache or to memory for the same instruction sequence. An advantage of a trace cache is that it may reduce the fetch bandwidth required for pipeline operation.

In addition, some instructions may be very complex and may be decoded or interpreted into, for example, five or more uops, and thus may not be decoded by the standard instruction decoder. Instead, these complex instructions may be sent to a micro instruction sequencer (MIS) for decoding or interpretation. The MIS may have a microcode ROM containing the series of micro-ops (uops) associated with each complex architectural instruction. A series of one or more uops is generated by the decoder when a complex architectural instruction is decoded or interpreted, and this series of uops may be placed in a microcode cache. According to an exemplary embodiment, the microcode for the MIS (for example, code that may contain a series of uops for one or more complex instructions) may be cached together with either trace cache entries (for example, in the trace cache) or conventional instructions or uops (for example, in the conventional instruction cache). Allowing MIS microcode to be cached dynamically in either the instruction cache or the trace cache makes more efficient use of the overall cache memory, for example by dynamically allocating cache storage to whichever kind of instruction yields the greatest benefit at a given time.

Accordingly, multi-level microcode (for example, microcode for the MIS) may be provided, including an MIS microcode ROM that stores the series of uops associated with each complex architectural instruction, and a level 1 cache that can cache the MIS microcode for at least some of the complex instructions. The level 1 (L1) cache for MIS microcode may be a separate microcode cache, or may be the trace cache and/or the L1 instruction cache. As described above, in an exemplary embodiment, the MIS microcode for some complex instructions is dynamically stored or allocated in either the trace cache or the instruction cache, depending on the space available in those caches or on other criteria.

According to an exemplary embodiment, the branch predictor may be provided with an instruction cache (I$), a trace cache (T$), and a microcode cache (UC). A BP2 branch predictor may be shared among the I$, T$, and UC (microcode) branch predictors. It may be desirable for dedicated BP1 predictors to be tightly coupled to (or associated with each of) the I$, T$, and UC. Multiple branch predictor queues (BPQs) may also be used, for example BPQs at BP1 → I$ (between the level 1 branch predictor and the instruction cache), BP1 → T$ (between BP1 and the trace cache), and BP1 → UC (between the level 1 branch predictor and the microcode cache). The dedicated BP1 predictors may be specialized.

In another embodiment, a level 2 branch predictor (BP2) or a level 1 branch predictor (BP1) may be shared between the instruction cache (I$) and the trace cache (T$), for example because instruction fetch switches between them. Sharing the UC BP with the I$ and T$ may be different, because UC fetches are typically nested within normal instruction fetch. Such sharing may be achieved by introducing a new thread for the microcode embedded in the longer instruction flow. The history of the UC (microcode) BP (branch predictor) may be initialized from the global fetch BP history at the start of a microcode flow.

According to an exemplary embodiment, the shared inter-cluster memory data structures, namely the level 2 (L2) memory cache (M$2 or D$2, 156 in FIG. 1), the L2 store buffer (SB2, 154 in FIG. 1), and the L2 register file (RF2/PRF/IW, 152 in FIG. 1), may use an inter-cluster directory. A shared inter-cluster structure may maintain, for each entry, a directory of the clusters that hold the value: a directory value for each cache line of the level 2 cache (M$2 or D$2), a directory value for each buffer entry of the level 2 store buffer (SB2), and a directory value for each physical register of the RF2/PRF.

The directory value in each entry may indicate, for example: whether the L2 copy is valid; which cluster holds the value (that is, which cluster must be intercepted to obtain it); which clusters have requested the value (that is, which clusters must be sent a reply when the value is delivered to L2); and whether an intercept request has already been sent, to suit write-back or write-later schemes.
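The fields listed above map naturally onto a small record per entry. The following is a minimal sketch, assuming one directory entry per L2 cache line, store buffer entry, or physical register; the field names and the two helper functions (`request`, `deliver_to_l2`) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DirectoryEntry:
    l2_valid: bool = True                  # is the L2 copy valid?
    owner: Optional[int] = None            # cluster holding the value, if any
    requesters: Set[int] = field(default_factory=set)  # clusters awaiting a reply
    intercept_sent: bool = False           # has an intercept already been issued?


def request(entry, cluster):
    """A cluster asks L2 for the value; returns the action L2 would take."""
    if entry.l2_valid:
        return "hit_l2"
    entry.requesters.add(cluster)
    if entry.intercept_sent:
        return "wait"                      # an intercept is already in flight
    entry.intercept_sent = True
    return f"intercept_cluster_{entry.owner}"


def deliver_to_l2(entry):
    """The owning cluster has written the value back to L2."""
    waiting = sorted(entry.requesters)     # clusters that must be sent a reply
    entry.l2_valid = True
    entry.owner = None
    entry.requesters.clear()
    entry.intercept_sent = False
    return waiting
```

The `intercept_sent` flag is what prevents a second requester from triggering a duplicate intercept of the owning cluster.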

A conventional directory-based MESI protocol may be used for the M$2/D$2. For register and store buffer entries, an update protocol may be used, for example one that is the same as or similar to the protocol used for the M$2/D$2. The directory for the M$2/D$2 may also be used to initiate memory cache probes and invalidations (for example, directed at the correct structure or cluster).

According to one embodiment, the cluster caches and other structures (for example, D$1, RP1, OC, SB1, RP1, S1, and X) may each be a single cache or structure that serves all clusters and is divided into partitions for each cluster or thread. For example, there may be a single data cache (D$1) with three partitions, one cache partition per cluster; a single level 1 register file (RF1) with three partitions, one per cluster; and a single operand capture array (OC) with three partitions, one OC partition per cluster. IW1 may include a single window (IW1) partitioned per cluster. A single level 1 store buffer (SB1) may likewise be divided into three SB1 partitions, one per cluster. Partitioning a single caching array in this way has some advantages: for example, when fewer threads are running, or when a cluster is not in use (for example, only one thread running on one cluster), the unused partitions can be reused.
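The reuse of idle partitions can be sketched as a simple assignment policy. This is an illustration only: it assumes partition `i` nominally belongs to cluster `i`, that cluster ids lie in `range(num_partitions)`, and that idle partitions are handed out round-robin; the function name `assign_partitions` is hypothetical.

```python
def assign_partitions(num_partitions, active_clusters):
    """Map each active cluster to the partitions of a shared array it may use."""
    active = sorted(set(active_clusters))
    if not active:
        return {}
    # Every active cluster keeps its own partition.
    assignment = {cluster: [cluster] for cluster in active}
    # Partitions of inactive clusters are idle and can be reused.
    idle = [p for p in range(num_partitions) if p not in assignment]
    for i, partition in enumerate(idle):
        assignment[active[i % len(active)]].append(partition)
    return assignment
```

With three partitions and a single thread running on cluster 1, that cluster ends up with the whole array; with all three clusters active, each keeps exactly its own partition.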

However, according to another embodiment of the invention, the cluster caches and/or other per-cluster structures (for example, D$1, RP1, OC, SB1, RP1, S1, and X) may be separate structures (that is, not merely partitions of a single cache or structure). For example, there may be three separate OC arrays, one per cluster; three separate D$1 arrays, one per cluster; three separate level 1 store buffers (SB1), one per cluster; three separate level 1 register files (RF1), one per cluster; three separate level 1 schedulers (rather than one scheduler divided into three partitions); and so on. Using separate caches or structures may have several advantages, described here for caches (similar advantages may apply to the other per-cluster structures). First, separate arrays or caches are typically smaller and therefore faster. Second, the clusters can be as independent as possible and well laid out, each containing a scheduler, execution units, and a cache (and possibly other per-cluster structures). If the cluster caches are merely partitions of a single array, changing the number of clusters may be more difficult.

J. Additional Examples for Multicore Processors

According to an example embodiment, a processor may have multiple processor cores. FIG. 15 is a block diagram of a multi-core processor 1500 according to an exemplary embodiment. The example processor 1500 may include processor core 0 and core 1, but may have any number of cores. The multi-core processor of FIG. 15 may also include a multi-level cache hierarchy such as the following.

L1: There are typically several first-level caches, for example an instruction cache (I$), data caches (D$) (one D$ per cluster), and possibly other "widget" caches, such as caches for floating-point or vector data.

L2: According to an exemplary embodiment, the processor 1500 may include a single L2 core coherency point 1510 (shown for core 0) between the CPU core and the outside of the CPU. It is natural to form this coherency point, and it is also natural to attach an L2 cache at this point. Such an L2 is per CPU core. The connections to the instruction cache (I$), the data cache (D$), the L2$, and so on may meet at this point 1510.

L3: A cache shared between clusters. There may also be a multi-core coherency point 1512, where all of the CPU cores are tied together and through which they may communicate with off-chip devices or structures, for example a level 3 cache (L3$).

In addition, the I$ and D$ of each CPU core may each have their own path to the outside world (to and from off-chip structures or devices). An arrangement in which several cores share a single I$ path while each has a separate D$ may also be used. However, the "single coherency point per CPU core" model may have advantages: with this model, CPU cores having only one unified cache, processor cores that may have L1 I$ and D$, and advanced microarchitectures using more specialized kinds of caches can each perform better than a simple CPU core having no cache. The cache structure of a CPU core may also be hidden from the outside world, which, according to an exemplary embodiment, makes heterogeneous multicore systems possible.

This arrangement allows any given cache level to be set to zero, that is, omitted. For example, if there is no unified per-CPU-core cache, the result is effectively a single L2 shared among all cores. Alternatively, if there is no cache shared among all cores, the result is effectively two or more CPU cores with completely separate caches. This enables several different configuration options, which is advantageous for a company that wants to make the most of a given microarchitecture by selling it in multiple configurations. However, this is merely another embodiment, and the invention is not limited thereto.

According to an exemplary embodiment, each process managed by the operating system (OS) may have a data structure in user virtual memory that represents its explicit threads. Because it may represent the threads that are ready to run for the process, this structure may be referred to as the "process run queue". The OS may be aware of processes, with one OS process running on each logical processor (that is, each logical processor known to the OS). Although many instructions may be used for this purpose, many other instructions may be employed as well.

According to an exemplary embodiment, the processor 1500 may include multiple cores (a multi-core processor), in that it may include multiple copies of the same CPU core, for example on the same die. In an embodiment, a multi-core processor (for example, processor 1500) may run multiple threads per CPU core.

A multicluster CPU core may be divided into clusters that communicate more within a cluster than between clusters. In particular, a cluster containing one copy each of a scheduler, execution units, a data cache, and a store buffer is particularly well suited to running one thread per cluster, although the invention is not limited thereto. According to an embodiment, as shown in FIG. 15, the processor may be multi-core, multithreaded, and multicluster (for example, multiple clusters per core).

According to an embodiment, if there are N CPU cores on one chip and M threads may run on each core, then M*N threads, or logical processors, may run per chip.

Some workloads may benefit from a large number of logical processors (cores). For example, M = 4 threads per core and N = 8 cores per chip give M*N = 32 threads per chip.

Power may be a second reason. Two independent threads running on two entirely independent cores may perform better than the same two independent threads running on the same core. However, although peak performance may be lower for a multithreaded, multicluster CPU core, the power/performance ratio may be better for a multicluster solution than for a multicore solution. A multicore chip has 2X the power consumption, both static and dynamic. A multicluster multithreaded core may replicate only the out-of-order core, which is roughly 1/8 of the core on some chips (according to an embodiment). Two clusters therefore require 12.5% more area and hence 12.5% more leakage; allowing for the extra routing, this rounds up to 15%. Leakage for the rest of the core stays the same. Dynamic power may approximately double, but the power/performance ratio is still likely to improve.
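The comparison above can be written out as a rough calculation. The sketch below simply restates the passage's own estimates (2X static and dynamic power for two cores; about 15% extra leakage and roughly doubled dynamic power for two clusters) relative to a single-thread, single-core baseline. The function name and the assumption that leakage scales with area are illustrative only.

```python
def relative_power(static, dynamic, design):
    """Estimate total power for two threads, given single-core baseline figures."""
    if design == "multicore":
        # A second full core doubles both static (leakage) and dynamic power.
        return 2.0 * static + 2.0 * dynamic
    if design == "multicluster":
        # A second cluster adds about 1/8 of a core in area, so roughly 12.5%
        # more leakage, rounded up to 15% to allow for routing; dynamic power
        # is assumed to roughly double.
        return 1.15 * static + 2.0 * dynamic
    raise ValueError(f"unknown design: {design}")
```

For example, with equal static and dynamic baseline power of 1.0 each, the two-core design comes to about 4.0 and the two-cluster design to about 3.15, so the saving comes entirely from leakage.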

Power management considerations may amplify this. It will be easier to power off a second core completely than to power off a cluster within a core. This may make both multicore and multithreading attractive. In an embodiment, for some workloads that use two threads, neither thread fully uses a CPU core, so it may be better to power off the second core and run both threads on the same core.

According to an exemplary embodiment, one example of the benefit of multithreaded, multicluster, and multicore processors relates to microarchitectural techniques that fork new threads (speculative skip-ahead multithreading, eager multithreading, and explicit user-level instruction set extensions).

For some period of time, the pre-fork code must forward values to the post-fork code. This is easier on the same CPU core, and easier still on the same cluster within a CPU core, where the bypass network and store buffer are shared. Consequently, according to an embodiment, long-lived independent threads should be migrated to other clusters and to other CPU cores.

While features of the described implementations have been illustrated as disclosed herein, many modifications, substitutions, changes, and equivalents will be apparent to those skilled in the art. Accordingly, it is to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the various embodiments.

Claims

A hierarchical microprocessor comprising:

a plurality of second-level instruction pipeline elements, the second-level instruction pipeline elements including a second-level store buffer; and

a plurality of execution clusters, each execution cluster being coupled to each of the second-level instruction pipeline elements and including a plurality of first-level instruction pipeline elements, each of the first-level instruction pipeline elements corresponding to a respective second-level instruction pipeline element, and one or more instruction execution units coupled to each of the first-level instruction pipeline elements,

wherein the microprocessor is configured to execute multiple execution threads using the plurality of first-level instruction pipeline elements, the plurality of second-level instruction pipeline elements, and the plurality of execution clusters, to virtualize one or more of the multiple execution threads to produce respective virtual threads, and to execute the virtual threads in a time-multiplexed manner,

wherein each of the plurality of first-level instruction pipeline elements includes a store buffer including one or more entries capable of storing age information including at least one of a definition time or a kill time, and

wherein the microprocessor is further configured to store a definition time reflecting the current time at which each entry is written to a first-level store buffer, and to store a kill time reflecting when another entry in the first-level store buffer, having the same address as the respective entry, is next written.

The hierarchical microprocessor of claim 1, wherein one or more execution threads of the multiple execution threads are coupled to a respective execution cluster.

The hierarchical microprocessor of claim 1, wherein one or more execution threads of the multiple execution threads are dynamically allocated and migrated from a respective first execution cluster to a respective second execution cluster.

The hierarchical microprocessor of claim 1, wherein at least one execution thread of the multiple execution threads is spawned from another execution thread.

The hierarchical microprocessor of claim 1, wherein the second-level instruction pipeline elements include a register file structure, and the first-level instruction pipeline elements each include a register file structure.

The hierarchical microprocessor of claim 1, wherein the first-level instruction pipeline elements include a first-level instruction scheduler and a first-level register file.

The hierarchical microprocessor of claim 1, wherein the second-level instruction pipeline elements include a second-level instruction scheduler and a second-level register file.

The hierarchical microprocessor of claim 1, wherein the second-level instruction pipeline elements include a second-level register file, the first-level instruction pipeline elements include a plurality of first-level register files, and the hierarchical microprocessor further comprises a plurality of third-level register files, each coupled to the second-level register file and to a respective first-level register file.

The hierarchical microprocessor of claim 1, wherein the plurality of first-level instruction pipeline elements are each included in a respective execution cluster.

계층 마이크로프로세서에서 명령을 실행하는 방법으로서,A method of executing instructions on a layered microprocessor,

복수의 2 레벨 명령 파이프라인 요소를 이용하여 실행을 위해 상기 명령을 획득하는 단계;Obtaining the instruction for execution using a plurality of two level instruction pipeline elements;

상기 복수의 2 레벨 명령 파이프라인 요소에 제1 피연산자 상태 정보를 저장하는 단계;Storing first operand state information in the plurality of two level instruction pipeline elements;

dispatching the instructions, based on the first operand state information, to a plurality of level 1 instruction pipeline elements included in respective execution clusters of the microprocessor;

storing second operand state information for the instructions in the plurality of level 1 instruction pipeline elements;

dispatching the instructions, based on the second operand state information, to a plurality of respective execution units included in the execution clusters;

executing, in the respective execution units, one or more of the instructions, the instructions including multiple execution threads;

creating one or more virtual execution threads from each execution thread of the multiple execution threads;

executing the virtual execution threads in a time-multiplexed manner;

storing, in one or more entries of multiple store buffers, age information including a definition time, wherein a store buffer of the multiple store buffers is included in each of the plurality of level 1 instruction pipeline elements, and the definition time reflects the current time at which the respective entry is written into a given level 1 store buffer; and

storing, in one or more entries of the multiple store buffers, age information including a kill time, wherein the kill time reflects when another entry in the given level 1 store buffer, having the same address as the respective entry, is next written,

the method of executing instructions in a hierarchical microprocessor comprising the foregoing steps.
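For illustration only, the time-multiplexed execution of virtual execution threads recited above can be sketched as a round-robin interleaving. This is a minimal sketch under assumptions: the claim does not specify a scheduling order, and the names `virtual_thread` and `time_multiplex`, as well as the operation labels, are hypothetical.

```python
from collections import deque
from typing import Iterator


def virtual_thread(name: str, ops: list[str]) -> Iterator[str]:
    """Hypothetical virtual thread: yields one tagged operation per time slot."""
    for op in ops:
        yield f"{name}:{op}"


def time_multiplex(threads: list[Iterator[str]]) -> list[str]:
    """Round-robin (an assumed policy): each slot runs one operation
    from the next virtual thread that still has work."""
    ready = deque(threads)
    trace: list[str] = []
    while ready:
        vt = ready.popleft()
        try:
            trace.append(next(vt))
        except StopIteration:
            continue  # this virtual thread is finished; do not requeue it
        ready.append(vt)
    return trace


# One execution thread split (hypothetically) into two virtual threads.
trace = time_multiplex([
    virtual_thread("v0", ["load", "add"]),
    virtual_thread("v1", ["mul"]),
])
```

Here the two virtual threads share the same execution resources, alternating slot by slot until each runs out of operations.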

The method of claim 10, further comprising associating at least one execution thread with a particular execution cluster, wherein instructions of the at least one execution thread are dispatched to the particular execution cluster.

The method of claim 10, further comprising:

dynamically allocating the execution threads to respective execution clusters; and

moving execution of at least one execution thread from a first execution cluster to a second execution cluster.

The method of claim 12, wherein the step of dynamically allocating the execution threads and the step of moving the at least one execution thread are performed according to a load balancing policy.
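As a hedged illustration of dynamic allocation and migration under a load balancing policy, the sketch below assigns each thread to the least-loaded cluster and moves one thread when the load difference reaches two. The policy, the threshold, and the names `allocate` and `migrate_one` are assumptions for illustration; the claims do not define a specific policy.

```python
def allocate(threads: list[str], n_clusters: int) -> dict[int, list[str]]:
    """Assign each thread to the currently least-loaded cluster (assumed policy)."""
    clusters: dict[int, list[str]] = {c: [] for c in range(n_clusters)}
    for t in threads:
        target = min(clusters, key=lambda c: len(clusters[c]))
        clusters[target].append(t)
    return clusters


def migrate_one(clusters: dict[int, list[str]]) -> bool:
    """Move one thread from the busiest to the idlest cluster if that
    reduces the imbalance; return True if a thread was moved."""
    src = max(clusters, key=lambda c: len(clusters[c]))
    dst = min(clusters, key=lambda c: len(clusters[c]))
    if len(clusters[src]) - len(clusters[dst]) < 2:
        return False  # already balanced under this hypothetical threshold
    clusters[dst].append(clusters[src].pop())
    return True


clusters = allocate(["t0", "t1", "t2", "t3"], 2)
clusters[1].clear()            # suppose the threads on cluster 1 finish
moved = migrate_one(clusters)  # a thread moves from cluster 0 to cluster 1
```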

The method of claim 10,

causing at least one execution thread to originate from another execution thread

A hierarchical microprocessor, comprising:

a plurality of level 2 instruction pipeline elements; and

wherein the level 2 instruction pipeline elements and the level 1 instruction pipeline elements each include an instruction scheduler, a store buffer, and a register file structure,

the level 1 store buffer of each of the level 1 instruction pipeline elements includes one or more entries capable of storing age information including at least one of a definition time or a kill time, and

the microprocessor is further configured to store a definition time reflecting the current time at which each respective entry is written to a given level 1 store buffer, and to store a kill time reflecting when another entry in the given level 1 store buffer, having the same address as the respective entry, is next written.
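The age information described above can be illustrated with a small software model of a level 1 store buffer. This is a sketch under stated assumptions, not the claimed hardware: the logical clock `now`, the class and field names, and the `value_at` lookup (one possible use of the age information) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    address: int
    data: int
    define_time: int                 # time at which this entry was written
    kill_time: Optional[int] = None  # time of the next write to the same address


class Level1StoreBuffer:
    def __init__(self) -> None:
        self.entries: list[Entry] = []
        self.now = 0  # hypothetical logical clock, advanced on each write

    def write(self, address: int, data: int) -> Entry:
        self.now += 1
        # A new write to the same address "kills" the older live entry.
        for e in self.entries:
            if e.address == address and e.kill_time is None:
                e.kill_time = self.now
        entry = Entry(address, data, define_time=self.now)
        self.entries.append(entry)
        return entry

    def value_at(self, address: int, time: int) -> Optional[int]:
        """Return the data live at `time`: defined at or before it, not yet killed."""
        for e in self.entries:
            live = e.define_time <= time and (e.kill_time is None or time < e.kill_time)
            if e.address == address and live:
                return e.data
        return None


sb = Level1StoreBuffer()
first = sb.write(0x40, 1)
sb.write(0x80, 2)
second = sb.write(0x40, 3)  # same address: sets the kill time of `first`
```

With both times recorded, each entry carries the interval during which its data was the most recent value for its address.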

The hierarchical microprocessor of claim 15, further comprising a plurality of level 3 register files coupled to the level 2 register file and to each level 1 register file.

obtaining, using a plurality of level 2 instruction pipeline elements, the instructions including multiple execution threads for execution;

dynamically allocating execution threads to respective execution clusters;

executing one or more of the instructions in the respective execution units;

moving execution of at least one execution thread from a first execution cluster to a second execution cluster;

The hierarchical microprocessor of claim 15, wherein the level 1 store buffer is for storing data written to a memory subsystem, and the level 2 store buffer resides between the level 1 store buffer and the memory subsystem.

The hierarchical microprocessor of claim 18, wherein data is stored in the level 1 store buffer in response to a store operation for the memory subsystem, the store operation being a micro-operation comprising a memory store instruction.

The hierarchical microprocessor of claim 15, wherein at least a portion of the level 1 store buffer includes entries separated by thread ID to accommodate multithreading.
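One way to picture entries separated by thread ID is a buffer whose lookups only see stores from the same thread. The following is a minimal sketch under that assumption; the class name, the method names, and the dictionary-based layout are hypothetical and do not describe the claimed circuit.

```python
from collections import defaultdict
from typing import Optional


class ThreadPartitionedStoreBuffer:
    """Hypothetical model: entries are kept per thread ID, so a lookup for
    one thread never matches a pending store from another thread."""

    def __init__(self) -> None:
        # thread_id -> {address: data}
        self._by_thread: defaultdict[int, dict[int, int]] = defaultdict(dict)

    def store(self, thread_id: int, address: int, data: int) -> None:
        self._by_thread[thread_id][address] = data

    def lookup(self, thread_id: int, address: int) -> Optional[int]:
        # .get avoids creating an empty partition for an unseen thread ID
        return self._by_thread.get(thread_id, {}).get(address)


buf = ThreadPartitionedStoreBuffer()
buf.store(thread_id=0, address=0x40, data=7)
buf.store(thread_id=1, address=0x40, data=9)
```

Both threads store to the same address, yet each one observes only its own pending data.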