CN113853584A - Variable delay instructions - Google Patents


Info

Publication number
CN113853584A
CN113853584A (application number CN202080037631.4A)
Authority
CN
China
Prior art keywords
instruction
execution
pipeline
data
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080037631.4A
Other languages
Chinese (zh)
Inventor
T. D. Anderson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Texas Instruments Inc
Original Assignee
Texas Instruments Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 16/384,328 (US11210098B2)
Application filed by Texas Instruments Inc filed Critical Texas Instruments Inc
Publication of CN113853584A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 — Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 — Instruction issuing, e.g. dynamic instruction scheduling or out-of-order instruction execution
    • G06F 9/3001 — Arithmetic instructions
    • G06F 9/30116 — Shadow registers, e.g. coupled registers, not forming part of the register space
    • G06F 9/3854 — Instruction completion, e.g. retiring, committing or graduating
    • G06F 9/3858 — Result writeback, i.e. updating the architectural state or memory
    • G06F 9/3863 — Recovery (e.g. branch miss-prediction, exception handling) using multiple copies of the architectural state, e.g. shadow registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Techniques for executing instructions by a processor include: receiving a first instruction for execution; determining a first latency value based on an expected amount of time needed to execute the first instruction; storing the first latency value in a write-back queue (1306); starting execution of the first instruction on an instruction execution pipeline; adjusting the latency value based on an amount of time elapsed since starting execution of the first instruction; and outputting a first result of the first instruction based on the latency value. The techniques further include: receiving a second instruction; determining that the second instruction is a variable latency instruction; storing a ready value in the write-back queue (1306) indicating that a second result of the second instruction is not ready; starting execution of the second instruction on the instruction execution pipeline; updating the ready value to indicate that the second result is ready; and outputting the second result.

Description

Variable delay instructions
Background
A Digital Signal Processor (DSP) is optimized for processing data streams that may be derived from various input signals, such as sensor data, video streams, voice channels, radar signals, biomedical signals, and the like. A digital signal processor operating on real-time data may receive an input data stream, perform a filtering function (e.g., encoding or decoding) on the data stream, and output a transformed data stream. The system is referred to as real-time because the application fails if the transformed data stream is not available for output at the scheduled time. Video coding may utilize predictable, but non-sequential, input data patterns. Many applications require memory accesses to load data registers in a data register file and then supply data from the data registers to functional units that perform the data processing.
One or more DSP processing cores may be combined with various peripheral circuits, memory blocks, etc., on a single Integrated Circuit (IC) die to form a system on a chip (SoC). The advent of SoC architectures for embedded systems has created many challenges for software development systems for developing and debugging software applications executing on these architectures. These systems may include multiple interconnected processors that share the use of on-chip and off-chip memory. Processors may include some combination of an instruction cache (ICache) and a data cache (DCache) to improve processing. Furthermore, multiple processors (with memory shared between them) may be incorporated into a single embedded system. Processors may physically share the same memory without accessing data or executing code located in the same memory location, or they may use portions of the shared memory as a common shared memory.
In early microprocessors, instruction execution was "atomic" in the sense that the processor fetched an instruction and executed it completely before fetching another instruction and executing it. Modern microprocessors typically execute instructions in steps, rather than atomically. This series of steps is referred to as an "instruction execution pipeline," or simply a "pipeline." A pipeline may include several stages, including the steps of reading an instruction from memory, decoding the instruction, reading the values to be operated on, performing the operation, and writing the result to some type of storage. This is referred to as "pipelining" because the processor can execute several instructions at different stages simultaneously, i.e., "in the pipeline." In this mode of operation, the processor may fetch an instruction while it decodes the previous instruction, while it reads the input values of an earlier instruction, and so on. By overlapping the execution of instructions, the processor increases the rate at which it can execute instructions.
One consequence of pipelining is that an instruction in the "read input" stage may require a value produced by an earlier instruction whose "write" has not yet occurred. This situation is usually handled in one of two ways: either the processor detects these conditions and inserts an appropriate stall in the pipeline, or the programmer arranges the instructions so that this never happens, by scheduling dependent instructions far enough apart that the condition does not occur. The former solution is generally referred to as a "protected" pipeline, and the latter as an "unprotected" pipeline. Many modern general-purpose architectures implement "protected" pipelines.
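For illustration, the protected-pipeline rule can be sketched in a few lines of C. This model is not taken from the patent: the pending-write list, register encoding, and function names are assumptions made for the example. It captures only the core rule that reading a register with a pending write stalls until the write completes.

    #include <stdio.h>

    /* Hypothetical in-flight write: destination register and cycles
     * until its result is actually written back. */
    typedef struct {
        int dest_reg;
        int cycles_until_write;
    } InFlightWrite;

    /* Protected-pipeline rule: stall a new instruction until every
     * pending write to one of its source registers has completed. */
    static int stalls_needed(const InFlightWrite *pending, int n,
                             int src_reg)
    {
        int stalls = 0;
        for (int i = 0; i < n; i++) {
            if (pending[i].dest_reg == src_reg &&
                pending[i].cycles_until_write > stalls)
                stalls = pending[i].cycles_until_write;
        }
        return stalls;
    }

    int main(void)
    {
        /* A 4-cycle multiply writing register 2 is one cycle into
         * execution, so its write is still 3 cycles away. */
        InFlightWrite pending[] = { { 2, 3 } };
        /* An add that reads register 2 arrives now: protected
         * hardware inserts 3 stall cycles. */
        printf("stall cycles: %d\n", stalls_needed(pending, 1, 2));
        return 0;
    }

An unprotected pipeline omits this check entirely and relies on the programmer or compiler to schedule the equivalent number of NOPs or independent instructions.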
Protected pipelines have the advantage that they allow CPU designers to deepen the pipeline in subsequent generations of processors, while still executing legacy code correctly. However, a protected pipeline may require a large amount of logic to detect situations where a delay should be inserted in the pipeline.
Unprotected pipelines have the advantage that they require little or no hardware control mechanism to produce correct program results when executing instructions that take more than one CPU cycle to execute in the pipeline. The programmer or compiler is responsible for scheduling instructions so that an instruction completes before a subsequent instruction requires its result. Unprotected pipelines allow the use of "multiple-assignment" code, where multiple writes to a particular register can be in flight in the pipeline at the same time. This is a low-cost, low-complexity alternative to register renaming, or, in processors without register renaming, to providing enough architectural registers to accommodate all in-flight computations, and it is useful for high-performance, low-power Digital Signal Processing (DSP) applications.
Existing processors are typically designed to have either protected or unprotected behavior.
Disclosure of Invention
This description relates generally to the field of DSPs. More particularly, but not by way of limitation, aspects of the present description relate to a method for executing a plurality of instructions by a processor. One such method includes receiving a first instruction for execution on an instruction execution pipeline. The method also includes determining a first latency value based on an expected amount of time required to execute the first instruction. The method further includes storing the first latency value in a write-back queue, the write-back queue storing information associated with instruction execution. The method also includes starting execution of the first instruction on the instruction execution pipeline. The method further includes adjusting the latency value based on an amount of time elapsed since execution of the first instruction was started. The method also includes outputting a first result of the first instruction based on the latency value. The method further includes receiving a second instruction for execution on the instruction execution pipeline. The method also includes determining that the second instruction is a variable latency instruction. The method further includes storing, in the write-back queue, a ready value indicating that a second result of the second instruction is not ready. The method also includes starting execution of the second instruction on the instruction execution pipeline. The method further includes updating the ready value to indicate that the second result is ready based on a determination that execution of the second instruction is complete. The method also includes outputting the second result.
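As a rough illustration of the two flows summarized above, the following C sketch models a single write-back queue entry. The field and function names are invented, and real hardware would implement this as registers and control logic rather than software. A fixed-latency instruction counts its latency value down each cycle, while a variable latency instruction waits on a ready flag set when execution completes.

    #include <stdbool.h>
    #include <stdio.h>

    /* One write-back queue entry (hypothetical layout). */
    typedef struct {
        bool variable_latency; /* true: use 'ready'; false: 'latency' */
        int  latency;          /* cycles until result may be output   */
        bool ready;            /* set by the unit on completion       */
    } WbqEntry;

    /* Called once per cycle; returns true when the result can be
     * written to the architectural register file. */
    static bool wbq_tick(WbqEntry *e)
    {
        if (e->variable_latency)
            return e->ready;          /* wait for completion signal */
        if (e->latency > 0)
            e->latency--;             /* adjust for elapsed time    */
        return e->latency == 0;
    }

    int main(void)
    {
        WbqEntry mult   = { .variable_latency = false, .latency = 4 };
        WbqEntry div_op = { .variable_latency = true,  .ready = false };
        for (int cycle = 1; cycle <= 5; cycle++) {
            if (cycle == 3)
                div_op.ready = true;  /* variable op finishes here */
            printf("cycle %d: fixed %s, variable %s\n", cycle,
                   wbq_tick(&mult)   ? "ready" : "pending",
                   wbq_tick(&div_op) ? "ready" : "pending");
        }
        return 0;
    }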
Another aspect of the present description relates to a processor including an instruction execution pipeline having a plurality of pipeline stages. The processor also includes pipeline circuitry configured to receive a first instruction for execution on the instruction execution pipeline. The pipeline circuitry is further configured to determine a first latency value based on an expected amount of time required to execute the first instruction. The pipeline circuitry is also configured to store the first latency value in a write-back queue that stores information associated with instruction execution. The pipeline circuitry is further configured to begin execution of the first instruction on the instruction execution pipeline. The pipeline circuitry is also configured to adjust the latency value based on an amount of time elapsed since execution of the first instruction was started. The pipeline circuitry is further configured to output a first result of the first instruction based on the latency value. The pipeline circuitry is also configured to receive a second instruction for execution on the instruction execution pipeline. The pipeline circuitry is further configured to determine that the second instruction is a variable latency instruction. The pipeline circuitry is also configured to store, in the write-back queue, a ready value indicating that a second result of the second instruction is not ready. The pipeline circuitry is further configured to begin execution of the second instruction on the instruction execution pipeline. The pipeline circuitry is also configured to update the ready value to indicate that the second result is ready based on a determination that execution of the second instruction is complete. The pipeline circuitry is further configured to output the second result.
Another aspect of the description relates to a processing system that includes a memory and a processor. The processor includes an instruction execution pipeline having a plurality of pipeline stages. The processor further includes pipeline circuitry configured to receive a first instruction for execution on the instruction execution pipeline. The pipeline circuitry is also configured to determine a first latency value based on an expected amount of time required to execute the first instruction. The pipeline circuitry is further configured to store the first latency value in a write-back queue that stores information associated with instruction execution. The pipeline circuitry is also configured to begin execution of the first instruction on the instruction execution pipeline. The pipeline circuitry is further configured to adjust the latency value based on an amount of time elapsed since execution of the first instruction was started. The pipeline circuitry is also configured to output a first result of the first instruction based on the latency value. The pipeline circuitry is further configured to receive a second instruction for execution on the instruction execution pipeline. The pipeline circuitry is also configured to determine that the second instruction is a variable latency instruction. The pipeline circuitry is further configured to store, in the write-back queue, a ready value indicating that a second result of the second instruction is not ready. The pipeline circuitry is also configured to begin execution of the second instruction on the instruction execution pipeline. The pipeline circuitry is further configured to update the ready value to indicate that the second result is ready based on a determination that execution of the second instruction is complete. The pipeline circuitry is also configured to output the second result.
Drawings
For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
FIG. 1 illustrates an example processor having multiple data paths.
FIG. 2 illustrates details of functional units and register files of an example processor.
FIG. 3 illustrates a global scalar register file of an example processor.
FIGS. 4-6 illustrate local register files of an example processor.
FIG. 7 illustrates pipeline stages of an example processor.
FIG. 8 is a circuit diagram illustrating an example functional unit and capture queue within a data path, according to some aspects.
FIG. 9 illustrates an example functional unit, capture queue, and scoreboard complex, according to some aspects.
FIG. 10 illustrates an example capture queue register bit field, according to some aspects.
FIG. 11 is a timing diagram of an example capture queue, according to some aspects.
FIG. 12 illustrates an example write-back queue, according to some aspects.
FIG. 13 illustrates an example circuit for variable latency lifetime tracking, according to some aspects.
FIG. 14 is a flow diagram illustrating a technique for executing a plurality of instructions by a processor, according to some aspects.
Detailed Description
A Digital Signal Processor (DSP) is optimized for processing data streams that may be derived from various input signals, such as sensor data, video streams, voice channels, radar signals, biomedical signals, and the like. Memory bandwidth and scheduling are issues of concern for digital signal processors operating on real-time data. An example DSP processing core including a streaming engine to improve processing efficiency and data scheduling will be described below.
One or more DSP processing cores may be combined with various peripheral circuits, memory blocks, etc., on a single Integrated Circuit (IC) die to form a system on a chip (SoC). See, for example, "66AK2Hx Multicore Keystone™ DSP+ARM® System-on-Chip," 2013, which is incorporated herein by reference.
Various embodiments of the processing cores within a given series may have different numbers of instruction pipeline stages depending on the particular technology and cost/performance tradeoff. The embodiments described herein are representative and include multiple pipeline stages.
FIG. 1 illustrates an example processor 100 including dual scalar/vector data paths 115, 116. The processor 100 includes a separate level one instruction cache (L1I) 121 and level one data cache (L1D) 123. The processor 100 includes an L2 combined instruction/data cache (L2) 130 that holds both instructions and data. FIG. 1 illustrates the connection between the L1I cache 121 and the L2 combined instruction/data cache 130 through a 512-bit bus 142. FIG. 1 illustrates the connection between the L1D cache 123 and the L2 combined instruction/data cache 130 through a 512-bit bus 145. In this example of the processor 100, the L2 combined instruction/data cache 130 stores both instructions to back up the L1I cache 121 and data to back up the L1D cache 123. In this example, the L2 combined instruction/data cache 130 is further connected to higher level caches and/or main memory using known or later developed memory system techniques not illustrated in FIG. 1. In various examples, the L1I cache 121, L1D cache 123, and L2 cache 130 may be implemented in different sizes; in this example, the L1I cache 121 and L1D cache 123 are each 32 kilobytes, and the L2 cache 130 is 1024 kilobytes. In this example, the central processing unit core 110, the L1I cache 121, the L1D cache 123, and the L2 combined instruction/data cache 130 are formed on a single integrated circuit. This single integrated circuit optionally includes other circuitry.
The central processing unit core 110 fetches instructions from the L1I cache 121 as controlled by the instruction fetch unit 111. The instruction fetch unit 111 determines the next instructions to execute and recalls a fetch-packet-sized set of such instructions. The nature and size of fetch packets are described in further detail below. Instructions are fetched directly from the L1I cache 121 on a cache hit (if these instructions are stored in the L1I cache 121). On a cache miss (the specified instruction fetch packet is not stored in the L1I cache 121), the instructions are sought in the L2 combined cache 130. In this example, the size of a cache line in the L1I cache 121 equals the size of a fetch packet, 512 bits. The memory locations of these instructions are either a hit or a miss in the L2 combined cache 130. Hits are serviced from the L2 combined cache 130. Misses are serviced from a higher level cache (not illustrated) or from main memory (not illustrated). In this example, the requested instruction is supplied simultaneously to both the L1I cache 121 and the central processing unit core 110 to speed up usage.
In this example, the central processing unit core 110 includes a plurality of functional units to perform instruction-specified data processing tasks. The instruction dispatch unit 112 determines the target functional unit of each fetched instruction. In this example, the central processing unit 110 operates as a Very Long Instruction Word (VLIW) processor capable of operating on multiple instructions in corresponding functional units simultaneously. Typically, a compiler organizes instructions into execute packets that are executed together. The instruction dispatch unit 112 directs each instruction to its target functional unit. The functional unit assigned to an instruction is completely specified by the instruction produced by the compiler. The hardware of the central processing unit core 110 does not participate in this functional unit assignment. In this example, the instruction dispatch unit 112 may operate on multiple instructions in parallel. The number of such parallel instructions is set by the size of the execute packet. This is described further below.
The instruction decode unit 113 decodes each instruction in the current execute packet. The decoding includes identifying the functional unit that executes the instruction, identifying, from among the possible register files, the registers that supply data for the corresponding data processing operation, and identifying the register destination of the result of the corresponding data processing operation. As described below, an instruction may include one constant field in place of one register-number operand field. The result of this decoding is signals for controlling the target functional unit to perform the data processing operation specified by the corresponding instruction on the specified data.
The central processing unit core 110 includes a control register 114. The control register 114 stores information for control of the functional unit in the scalar datapath side a115 and the vector datapath side B116. This information may include mode information or the like.
Decoded instructions from instruction decode unit 113 and information stored in the control register 114 are supplied to scalar datapath side A 115 and vector datapath side B 116. As a result, functional units within scalar datapath side A 115 and vector datapath side B 116 perform instruction-specified data processing operations on instruction-specified data and store the results in instruction-specified data registers. Each of scalar datapath side A 115 and vector datapath side B 116 includes a plurality of functional units operating in parallel. These are described in further detail below in conjunction with FIG. 2. Between scalar datapath side A 115 and vector datapath side B 116 there is a datapath 117 that allows data exchange.
The central processing unit core 110 includes further modules that are not instruction based. The emulation unit 118 allows determination of the machine state of the central processing unit core 110 in response to instructions. This capability can be used for algorithm development. The interrupt/exception unit 119 enables the central processing unit core 110 to respond to external asynchronous events (interrupts) and to respond to attempts to perform improper operations (exceptions).
The processor 100 includes a streaming engine 125. The streaming engine 125 supplies two data streams from predetermined addresses, typically cached in the L2 combined cache 130, to the register files of vector datapath side B 116 of the central processing unit core 110. This provides controlled data movement from memory (e.g., as cached in the L2 combined cache 130) directly to the functional unit operand inputs.
FIG. 1 illustrates example data widths of the buses between various components. The L1I cache 121 supplies instructions to the instruction fetch unit 111 via bus 141. In this example, bus 141 is a 512-bit bus. Bus 141 is unidirectional, from the L1I cache 121 to the central processing unit 110. The L2 combined cache 130 supplies instructions to the L1I cache 121 via bus 142. In this example, bus 142 is a 512-bit bus. Bus 142 is unidirectional, from the L2 combined cache 130 to the L1I cache 121.
The L1D cache 123 exchanges data with the register files in scalar datapath side A 115 via bus 143. In this example, bus 143 is a 64-bit bus. The L1D cache 123 exchanges data with the register files in vector datapath side B 116 via bus 144. In this example, bus 144 is a 512-bit bus. Buses 143 and 144 are illustrated as bidirectional, supporting both central processing unit 110 data reads and data writes. The L1D cache 123 exchanges data with the L2 combined cache 130 via bus 145. In this example, bus 145 is a 512-bit bus. Bus 145 is illustrated as bidirectional, supporting cache service for both data reads and data writes by the central processing unit 110.
Processor data requests are fetched directly from the L1D cache 123 on a cache hit (if the requested data is stored in the L1D cache 123). On a cache miss (the specified data is not stored in the L1D cache 123), the data is sought in the L2 combined cache 130. The memory location of the requested data is either a hit or a miss in the L2 combined cache 130. Hits are serviced from the L2 combined cache 130. Misses are serviced from another level of cache (not illustrated) or from main memory (not illustrated). The requested data may be supplied simultaneously to both the L1D cache 123 and the central processing unit core 110 to speed up usage.
The L2 combined cache 130 supplies data of a first data stream to the streaming engine 125 via bus 146. In this example, bus 146 is a 512-bit bus. The streaming engine 125 supplies data of this first data stream to the functional units of vector datapath side B 116 via bus 147. In this example, bus 147 is a 512-bit bus. The L2 combined cache 130 supplies data of a second data stream to the streaming engine 125 via bus 148. In this example, bus 148 is a 512-bit bus. The streaming engine 125 supplies data of this second data stream to the functional units of vector datapath side B 116 via bus 149, which in this example is a 512-bit bus. According to this example, buses 146, 147, 148, and 149 are unidirectional, from the L2 combined cache 130 to the streaming engine 125 and to vector datapath side B 116.
FIG. 2 illustrates further details of the functional units and register files within scalar datapath side A 115 and vector datapath side B 116. Scalar datapath side A 115 includes the L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226. Scalar datapath side A 115 includes the global scalar register file 211, the L1/S1 local register file 212, the M1/N1 local register file 213, and the D1/D2 local register file 214. Vector datapath side B 116 includes the L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246. Vector datapath side B 116 includes the global vector register file 231, the L2/S2 local register file 232, the M2/N2/C local register file 233, and the predicate register file 234. There are limits on which functional units may read from or write to which register files. These are described below.
Scalar datapath side A 115 includes the L1 unit 221. The L1 unit 221 generally accepts two 64-bit operands and produces one 64-bit result. The two operands are each recalled from registers specified by the instruction in either the global scalar register file 211 or the L1/S1 local register file 212. The L1 unit 221 may perform the following instruction-selected operations: 64-bit addition/subtraction operations; 32-bit min/max operations; 8-bit Single Instruction Multiple Data (SIMD) instructions, such as sums of absolute values, minimum and maximum determination, circular min/max operations, and various move operations between register files. The result may be written into an instruction-specified register of the global scalar register file 211, the L1/S1 local register file 212, the M1/N1 local register file 213, or the D1/D2 local register file 214.
Scalar datapath side A 115 includes the S1 unit 222. The S1 unit 222 generally accepts two 64-bit operands and produces one 64-bit result. The two operands are each recalled from registers specified by the instruction in either the global scalar register file 211 or the L1/S1 local register file 212. In this example, the S1 unit 222 performs the same types of operations as the L1 unit 221. In another example, there may be slight variations between the data processing operations supported by the L1 unit 221 and the S1 unit 222. The result may be written into an instruction-specified register of the global scalar register file 211, the L1/S1 local register file 212, the M1/N1 local register file 213, or the D1/D2 local register file 214.
Scalar datapath side A 115 includes the M1 unit 223. The M1 unit 223 generally accepts two 64-bit operands and produces one 64-bit result. The two operands are each recalled from registers specified by the instruction in either the global scalar register file 211 or the M1/N1 local register file 213. In this example, the M1 unit 223 performs the following instruction-selected operations: 8-bit multiply operations; complex dot product operations; 32-bit bit count operations; complex conjugate multiply operations; and bitwise logical operations, moves, adds, and subtracts. The result may be written into an instruction-specified register of the global scalar register file 211, the L1/S1 local register file 212, the M1/N1 local register file 213, or the D1/D2 local register file 214.
Scalar datapath side A 115 includes the N1 unit 224. The N1 unit 224 generally accepts two 64-bit operands and produces one 64-bit result. The two operands are each recalled from registers specified by the instruction in either the global scalar register file 211 or the M1/N1 local register file 213. In this example, the N1 unit 224 performs the same types of operations as the M1 unit 223. There may be certain double operations (referred to as dual-issued instructions) that employ both the M1 unit 223 and the N1 unit 224 together. The result may be written into an instruction-specified register of the global scalar register file 211, the L1/S1 local register file 212, the M1/N1 local register file 213, or the D1/D2 local register file 214.
Scalar datapath side A 115 includes the D1 unit 225 and the D2 unit 226. The D1 unit 225 and the D2 unit 226 generally each accept two 64-bit operands and each produce one 64-bit result. The D1 unit 225 and the D2 unit 226 generally perform address calculations and the corresponding load and store operations. The D1 unit 225 is used for 64-bit scalar loads and stores. The D2 unit 226 is used for 512-bit vector loads and stores. In this example, the D1 unit 225 and the D2 unit 226 also perform: swapping, packing, and unpacking of load and store data; 64-bit SIMD arithmetic operations; and 64-bit bitwise logical operations. The D1/D2 local register file 214 will generally store the base and offset addresses used in address calculations for the corresponding loads and stores. The two operands are each recalled from registers specified by the instruction in either the global scalar register file 211 or the D1/D2 local register file 214. The calculated result may be written into an instruction-specified register of the global scalar register file 211, the L1/S1 local register file 212, the M1/N1 local register file 213, or the D1/D2 local register file 214.
Vector datapath side B 116 includes the L2 unit 241. The L2 unit 241 generally accepts two 512-bit operands and produces one 512-bit result. The two operands are each recalled from registers specified by the instruction in the global vector register file 231, the L2/S2 local register file 232, or the predicate register file 234. In this example, the L2 unit 241 executes instructions similar to those of the L1 unit 221, except on wider 512-bit data. The result may be written into an instruction-specified register of the global vector register file 231, the L2/S2 local register file 232, the M2/N2/C local register file 233, or the predicate register file 234.
Vector datapath side B 116 includes the S2 unit 242. The S2 unit 242 generally accepts two 512-bit operands and produces one 512-bit result. The two operands are each recalled from registers specified by the instruction in the global vector register file 231, the L2/S2 local register file 232, or the predicate register file 234. In this example, the S2 unit 242 executes instructions similar to those of the S1 unit 222. The result may be written into an instruction-specified register of the global vector register file 231, the L2/S2 local register file 232, the M2/N2/C local register file 233, or the predicate register file 234.
Vector datapath side B 116 includes the M2 unit 243. The M2 unit 243 generally accepts two 512-bit operands and produces one 512-bit result. The two operands are each recalled from registers specified by the instruction in the global vector register file 231 or the M2/N2/C local register file 233. In this example, the M2 unit 243 executes instructions similar to those of the M1 unit 223, except on wider 512-bit data. The result may be written into an instruction-specified register of the global vector register file 231, the L2/S2 local register file 232, or the M2/N2/C local register file 233.
Vector datapath side B 116 includes the N2 unit 244. The N2 unit 244 generally accepts two 512-bit operands and produces one 512-bit result. The two operands are each recalled from registers specified by the instruction in the global vector register file 231 or the M2/N2/C local register file 233. In this example, the N2 unit 244 performs the same types of operations as the M2 unit 243. There may be certain double operations (referred to as dual-issued instructions) that employ both the M2 unit 243 and the N2 unit 244 together. The result may be written into an instruction-specified register of the global vector register file 231, the L2/S2 local register file 232, or the M2/N2/C local register file 233.
Vector datapath side B 116 includes the correlation (C) unit 245. The C unit 245 generally accepts two 512-bit operands and produces one 512-bit result. The two operands are each recalled from registers specified by the instruction in the global vector register file 231 or the M2/N2/C local register file 233.
Vector datapath side B 116 includes the P unit 246. The vector predicate (P) unit 246 performs basic logic operations on registers of the local predicate register file 234. The P unit 246 has direct access to read from and write to the predicate register file 234.
FIG. 3 illustrates the global scalar register file 211. There are 16 independent 64-bit wide scalar registers designated A0 through A15. Each register of the global scalar register file 211 can be read or written as 64 bits of scalar data. All scalar datapath side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read from or write to the global scalar register file 211. The global scalar register file 211 can be read as 32 bits or as 64 bits and can only be written as 64 bits. The executing instruction determines the read data size. The vector datapath side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read from the global scalar register file 211 via cross path 117, under the restrictions detailed below.
FIG. 4 illustrates the D1/D2 local register file 214. There are 16 independent 64-bit wide scalar registers designated D0 through D15. Each register of the D1/D2 local register file 214 can be read or written as 64 bits of scalar data. All scalar datapath side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the D1/D2 local register file 214. Only the D1 unit 225 and the D2 unit 226 can read from the D1/D2 local scalar register file 214. It is expected that the data stored in the D1/D2 local scalar register file 214 will include the base and offset addresses used in address calculations.
FIG. 5 illustrates the L1/S1 local register file 212. In this example, the L1/S1 local register file 212 includes eight independent 64-bit wide scalar registers designated AL0 through AL7. In this example, the instruction encoding permits the L1/S1 local register file 212 to include up to 16 registers; only eight registers are implemented, to reduce circuit size and complexity. Each register of the L1/S1 local register file 212 can be read or written as 64 bits of scalar data. All scalar datapath side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the L1/S1 local scalar register file 212. Only the L1 unit 221 and the S1 unit 222 can read from the L1/S1 local scalar register file 212.
FIG. 6 illustrates the M1/N1 local register file 213. In this example, eight independent 64-bit wide scalar registers designated AM0 through AM7 are implemented. In this example, the instruction encoding permits the M1/N1 local register file 213 to include up to 16 registers; only eight registers are implemented, to reduce circuit size and complexity. Each register of the M1/N1 local register file 213 can be read or written as 64 bits of scalar data. All scalar datapath side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the M1/N1 local scalar register file 213. Only the M1 unit 223 and the N1 unit 224 can read from the M1/N1 local scalar register file 213.
FIG. 7 illustrates the following pipeline stages: program fetch stage 710, dispatch and decode stage 720, and execute stage 730. The program fetch stage 710 includes three stages for all instructions. The dispatch and decode stage 720 includes three stages for all instructions. The execute stage 730 includes one to four stages, depending on the instruction.
The fetch stage 710 includes a program address generation (PG) stage 711, a program access (PA) stage 712, and a program receive (PR) stage 713. During the program address generation stage 711, the program address is generated in the processor and a read request is sent to the memory controller for the L1I cache. During the program access stage 712, the L1I cache processes the request, accesses the data in its memory, and sends a fetch packet to the processor boundary. During the program receive stage 713, the processor registers the fetch packet.
The processor core 110 (FIG. 1) and the L1I cache 121 pipeline (FIG. 1) are decoupled from each other. Fetch packet returns from the L1I cache may take a different number of clock cycles depending on external circumstances, such as whether there is a hit in the L1I cache 121 or whether there is a hit in the L2 combined cache 130. Thus, the program access stage 712 may take several clock cycles instead of one clock cycle as in the other stages.
The instructions executing in parallel constitute an execute packet. In this example, an execute packet may contain up to sixteen 32-bit wide slots for sixteen instructions. No two instructions in an execute packet may use the same functional unit. A slot is one of five types: 1) a self-contained instruction executed on one of the functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, D2 unit 226, L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) of the processor core 110; 2) a unitless instruction, such as a NOP (no operation) instruction or a multiple-NOP instruction; 3) a branch instruction; 4) a constant field extension; and 5) a conditional code extension. Some of these slot types are described below.
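The constraint that no two instructions in an execute packet use the same functional unit can be checked with a simple bitmask, as in this illustrative C sketch; the unit encoding here is an assumption for the example, not the actual instruction format.

    #include <stdbool.h>

    /* Hypothetical identifiers for the twelve functional units. */
    enum Unit { L1U, S1U, M1U, N1U, D1U, D2U,
                L2U, S2U, M2U, N2U, CU, PU };

    /* Returns false if two instructions in the packet target the
     * same functional unit. */
    static bool packet_is_legal(const enum Unit *units, int n_insns)
    {
        unsigned seen = 0;
        for (int i = 0; i < n_insns; i++) {
            unsigned bit = 1u << units[i];
            if (seen & bit)
                return false;   /* same unit used twice */
            seen |= bit;
        }
        return true;
    }

For example, a packet { L1U, M1U, M1U } would be rejected, while { L1U, M1U, N1U } would be accepted.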
The dispatch and decode stage 720 (FIG. 7) includes an instruction dispatch to appropriate execution unit (DS) stage 721; an instruction pre-decode (DC1) stage 722; and an instruction decode, operand read (DC2) stage 723. During the instruction dispatch to appropriate execution unit stage 721, fetch packets are split into execute packets and assigned to the appropriate functional units. During the instruction pre-decode stage 722, the source registers, destination registers, and associated paths are decoded for the execution of the instructions in the functional units. During the instruction decode, operand read stage 723, more detailed unit decoding is done, and operands are read from the register files.
The execute stage 730 includes execute (E1-E5) stages 731-735. Different types of instructions require different numbers of these stages to complete their execution. These stages of the pipeline play an important role in understanding the device state at processor cycle boundaries.
During the E1 stage 731, the conditions of instructions are evaluated and operands are operated on. As illustrated in FIG. 7, the E1 stage 731 may receive operands from a stream buffer 741 and a register file, shown schematically as 742. For load and store instructions, address generation is performed and address modifications are written to the register file. For branch instructions, the branch fetch packet in the PG stage is affected. As illustrated in FIG. 7, load and store instructions access memory, shown here schematically as memory 751. For single-cycle instructions, the result is written to the destination register file, provided that any condition of the instruction evaluated to true. If the condition evaluates to false, the instruction does not write any result and has no pipeline operation after the E1 stage 731.
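A minimal sketch of this conditional-write behavior, with invented names:

    #include <stdbool.h>

    /* E1 conditional write-back: a predicated instruction writes its
     * result only if its condition evaluated true; otherwise it has
     * no architectural effect and no pipeline activity after E1. */
    static void e1_commit(bool condition_true, int result, int *dest_reg)
    {
        if (condition_true)
            *dest_reg = result;
    }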
During the E2 stage 732, load instructions send the address to memory. Store instructions send the address and data to memory. Single-cycle instructions that saturate results set the Saturation (SAT) bit in the control status register (CSR) if saturation occurs. For 2-cycle instructions, the result is written to the destination register file.
During the E3 stage 733, data memory accesses are performed. Any multiply instruction that saturates results sets the SAT bit in the control status register (CSR) if saturation occurs. For 3-cycle instructions, the result is written to the destination register file.
During the E4 stage 734, the load instruction brings the data to the processor boundary. For a 4-cycle instruction, the result is written to the destination register file.
During the E5 stage 735, load instructions write data into a register. This is illustrated schematically in FIG. 7 with an input from memory 751 to the E5 stage 735.
As described above, the processor 100 may operate in both a protected mode and an unprotected mode. In some cases, pipeline protection may be enabled or disabled by setting a processor bit. For example, protection may be controlled by setting a bit in a control register (e.g., a task status register). In some cases, the instructions may be used to set a protection mode, such as PROT or UNPROT.
The unprotected mode, or exposed pipeline mode, is the common VLIW operating mode. The unprotected mode may require the programmer or compiler to know the latency of each instruction and to insert NOPs or other instructions between dependent instructions to ensure correctness. For example, a first instruction, MPY32 A0, A1, A2 (multiply), may be received by the processor. This instruction takes four processor cycles to execute and output to the A2 register. If the programmer or compiler wishes to use the output of the MPY32 instruction in a second instruction, e.g., ADD A2, A8, A8 (accumulate), then three NOP instructions are inserted by the programmer or compiler to obtain correct behavior. However, unexpected events (such as processing interrupts or cache misses) may make the number of NOP instructions inserted by the programmer or compiler inaccurate.
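The scheduling rule in this example can be made concrete with a toy C model. The encoding and the model itself are illustrative assumptions; a real compiler would fill delay slots with useful independent instructions rather than NOPs where possible. For a 4-cycle MPY32 followed by a dependent ADD, the model emits three NOPs.

    #include <stdio.h>

    /* Toy instruction entry: latency, register written (or -1), and
     * the dependent source register read (or -1). Only the dependent
     * source is modeled; encodings are invented for illustration. */
    typedef struct {
        const char *mnemonic;
        int latency, writes, reads;
    } Insn;

    int main(void)
    {
        /* MPY32 A0, A1, A2 (4 cycles) followed by ADD A2, A8, A8. */
        Insn prog[] = {
            { "MPY32 A0, A1, A2", 4, 2, -1 },
            { "ADD   A2, A8, A8", 1, 8,  2 },
        };
        int ready_cycle[16] = {0};  /* cycle when each reg is valid */
        int cycle = 0;
        for (int i = 0; i < 2; i++) {
            /* In unprotected mode the compiler inserts NOPs until
             * the source register's producer has written back. */
            while (prog[i].reads >= 0 &&
                   cycle < ready_cycle[prog[i].reads]) {
                printf("  NOP\n");
                cycle++;
            }
            printf("  %s\n", prog[i].mnemonic);
            if (prog[i].writes >= 0)
                ready_cycle[prog[i].writes] = cycle + prog[i].latency;
            cycle++;
        }
        return 0;
    }

Running this prints the MPY32, three NOPs, and then the ADD, matching the example above.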
In the protected or unexposed pipeline mode, the pipeline conforms to a sequential operation model in which dependent instructions are guaranteed to be correct regardless of how many cycles it takes to complete the instruction. For instructions that take multiple cycles to complete, if a subsequent instruction attempts to read the destination of the first instruction within the first instruction's delay slot, the CPU pipeline will automatically insert a NOP cycle until the instruction that will write to the register has completed. In the above example, if the processor 100 receives an MPY32 instruction, followed by an ADD instruction, in protected mode, the processor 100 will automatically insert three NOP cycles between these instructions.
Pipeline hazards may exist in certain processors, such as multi-stage pipelined processors capable of processing multiple instructions in the pipeline. An unresolved pipeline hazard is typically a situation in which a processor may produce an undesirable or unexpected result. There are different types of pipeline hazards; two such types are data hazards and structural hazards. A data hazard is typically a situation in which an instruction executing in the pipeline depends on data from a previous instruction. If not handled, data hazards may lead to race conditions. Typical data hazards include read-after-write and write-after-write hazards. Examples of data hazards include, but are not limited to, an instruction attempting to access the result of a previous instruction that is still being processed.
Structural hazards often occur due to the structure of the processor's datapath. Some processors may be limited in the manner in which writes can be performed. In one such example, a single functional unit may be able to perform only a single write to the output register file per clock cycle, yet two instructions may attempt to output their results in the same cycle. Thus, when a first instruction that takes two clock cycles to complete is executed on a functional unit, followed by a second instruction that takes one clock cycle to complete on the same functional unit, both instructions will complete and attempt to write to the output register file in the same cycle.
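A sketch of that write-port conflict check, under the single-write-port assumption stated above (names invented for the example):

    #include <stdbool.h>

    /* Cycles from now at which each in-flight op on one functional
     * unit will write the output register file; the unit has a
     * single write port. */
    static bool write_port_conflict(const int *writeback_in,
                                    int n_inflight,
                                    int new_op_latency)
    {
        for (int i = 0; i < n_inflight; i++)
            if (writeback_in[i] == new_op_latency)
                return true;    /* two results in the same cycle */
        return false;
    }

    /* Example: a 2-cycle instruction issued last cycle writes back
     * in 1 cycle; issuing a 1-cycle instruction now would also write
     * back in 1 cycle -> conflict, so the hardware must stall or
     * capture one of the results. */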
In some processors, when executing in protected mode, all functional units, from instruction fetch into the E1 stage, and the entire execution pipeline may stall when a pipeline dependency is found. Since all functional units are stalled, no unit is allowed to proceed until the pipeline conflict has been resolved. To help address data hazard situations, enable fast processor mode switching, address load/store latencies in both protected and unprotected modes, and enable recoverable interrupts in unprotected mode, a capture queue may be used. The capture queue structure helps to save the state of the pipeline registers and then write back the saved state, e.g., to continue execution or to output to a general-purpose register file. The capture queue may be used to detect hazard conditions, generate appropriate stalls, and load and unload capture queue registers to help resolve pipeline hazards in protected mode. Although interrupts and exceptions are different concepts, they may be handled in a similar manner by the processor, and the terms may be used interchangeably in this description.
In some cases, the capture queue may also be used in conjunction with a processor executing in unprotected mode or a processor having an unprotected pipeline. For example, a capture queue may be used to help enable recoverable interrupts. As an example, the processor may receive a four-cycle MPY32 instruction that outputs to the A2 register in four cycles. When the processor is in unprotected mode, the executing code may then issue a shift (SHR) instruction on the data currently in A2. The SHR instruction is then followed by a move (MV) instruction that moves the shifted data in A2 to the A3 register. A NOP may then be inserted, after which the result of the MPY32 instruction is output to A2. An ADD instruction may then be executed using the data from A2. If an interrupt is received after the SHR instruction but before the MV instruction, an undesirable result may occur. For example, the transfer to the interrupt handler may ensure that all pending writes in the pipeline are completed before execution of the interrupt handler begins, to avoid results from the interrupt handler corrupting the program. Thus, the interrupt handler will allow the MPY32 instruction to complete and output to A2. After the interrupt handler returns, the result of the MPY32 is restored to A2 and the next instruction, MV, executes. However, the MV instruction will now execute on the result of the MPY32, rather than on the result of the SHR instruction.
FIG. 8 is a circuit diagram 800 illustrating an example functional unit and capture queue within a datapath, according to aspects of the present description. Although shown in the context of a scalar datapath, in some cases a capture queue may be utilized with both scalar and vector datapaths. According to certain aspects, the capture queues include a scoreboard 802 (which includes hazard detection logic), local unit capture queues 804A-804E (collectively 804), and a central capture queue 806. The scoreboard 802 includes a write-back queue comprising a set of registers, and the scoreboard 802 is coupled to a set of associated functional units 808A-808E (collectively 808) and to the central capture queue 806. The functional units 808 may each be associated with a respective local unit capture queue 804.
According to certain aspects, the capture queue helps enable recoverable interrupts in a pipelined processor. As described above, a processor pipeline may include multiple stages, each performing a discrete step to process an instruction, and multiple instructions may be in flight at different stages of the pipeline. Stalling and flushing the entire pipeline to handle an interrupt is relatively inefficient. Furthermore, the interrupt handler instructions themselves execute through the processor pipeline stages, so clearing the entire pipeline does not change the number of cycles required for the interrupt handler instructions to traverse the processor pipeline. Instead of discarding partially executed instructions, execution of these instructions may continue to completion, with the results stored to the capture queue structure. For example, a four-cycle MPY32 instruction may be received, followed by a multi-cycle load (LDD) instruction in the next processor cycle. When the LDD instruction is received, the MPY32 instruction is in the E2 stage. During processing in E1, the LDD instruction causes a cache miss, resulting in an exception. The LDD instruction is then discarded and the exception handler is loaded. However, the MPY32 instruction may continue executing to completion during the E2-E4 stages, and the results of the MPY32 instruction are stored in the capture queue. In some cases, the results of the MPY32 instruction may be stored in the central capture queue 806, as the instructions of the exception handler may require the local capture queues 804. While the MPY32 instruction continues, the instructions of the exception handler may also execute in the pipeline. Once the exception handler is complete, the LDD instruction may be reissued to E1 for execution, and the results of the MPY32 instruction restored to the local capture queue 804 for output to the output register.
According to some aspects, the progress of instructions that take more than one execution cycle may be tracked to help ensure that information is written in the correct location and at the correct time. For example, the scoreboard 802 may include a write-back queue. In some cases, the write-back queue may be a set of registers used to store information associated with executing instructions. The write-back queue slot associated with an executing instruction may be associated with a particular slot in the local capture queue 804 and include a pointer to that slot. The information in the write-back queue may include a lifetime tracking value, which tracks to which local capture queue the corresponding instruction should be written back, and a latency value, which tracks when the result of the instruction should be ready for output.
In unprotected mode, when an instruction enters the E1 stage, the lifetime tracking value is set to the expected number of cycles required by the functional unit 808 to process the instruction. The lifetime tracking value may be adjusted, e.g., decremented, for each clock cycle in which the processor is not stalled, and is paused whenever the pipeline is stalled. The scoreboard helps enable interrupt/event recovery by tracking where the values from the local unit capture queues 804 should be restored. When the lifetime tracking value equals 0, the result of the instruction is ready to be written back to the output register.
If an instruction is interrupted before its lifetime tracking value has reached zero, the instruction result and its corresponding lifetime tracking value may be saved in order to maintain correct execution when returning from the interrupt. For example, upon receiving an interrupt, the scoreboard 802 may halt the pipeline and any portion of the MPY32 instruction that has executed, and the state of the pipeline stages may be saved to the local unit capture queue 804, and then to the central capture queue 806. The corresponding lifetime tracking value in the write-back queue may also be saved. The interrupt may then be processed and, after the interrupt is handled, any results and state related to the MPY32 instruction held in the local unit capture queue 804 or the central capture queue 806 may be restored. Processing of the MPY32 may then resume based on the restored lifetime tracking value.
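The save/restore sequence can be sketched as follows. This is a simplified software model with assumed structure layouts and queue size; it shows only that results and their remaining lifetime values are preserved across the interrupt so write-back still happens at the right time, to the right place.

    #include <string.h>

    enum { WBQ_SLOTS = 8 };           /* size is an assumption */

    typedef struct {
        int lifetime;                 /* cycles until write-back */
        int result;                   /* completed result, if any */
        int valid;
    } SavedEntry;

    typedef struct { SavedEntry saved[WBQ_SLOTS]; } CentralCaptureQueue;

    /* On interrupt: after in-flight instructions drain, save results
     * and their remaining lifetime values to the central queue. */
    static void save_on_interrupt(const SavedEntry *wbq,
                                  CentralCaptureQueue *ccq)
    {
        memcpy(ccq->saved, wbq, sizeof ccq->saved);
    }

    /* On return: restore the entries so each saved result is still
     * written back at the correct time, to the correct register. */
    static void restore_after_interrupt(SavedEntry *wbq,
                                        const CentralCaptureQueue *ccq)
    {
        memcpy(wbq, ccq->saved, sizeof ccq->saved);
    }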
In some cases, an instruction in the first stage of execution (e.g., the E1 stage) will not be resumed in the first stage. Rather, the instruction may be reloaded into the first stage and run when processing resumes. For example, in some cases, the pipeline may receive two instructions at a time as a dual instruction. In such cases, the results of the two instructions may be output in the same cycle. As a more specific example, SUB and LDD commands may be issued together as a dual instruction. Both commands enter the E1 stage and are processed. The LDD command may experience a page fault and raise a page fault exception when attempting to access a memory address to output the contents of that memory address. Since the SUB command is a single-cycle command, the result of the SUB command is ready to be output at the end of the E1 stage. This output may be saved to the central capture queue 806 because, in some cases, the E1 stage may not have an associated local capture queue. Execution then proceeds to the exception handler. After the exception handler completes, execution returns to the main process. Since the first execution of the LDD command caused an exception, the LDD command needs to be re-executed to obtain the desired result. Then, when the SUB and LDD dual instruction is reloaded into E1 and re-executed, the result of the SUB command stored in the central capture queue 806 may be discarded. In some cases, a two-cycle command may be issued as part of a dual instruction, such as with an LDD command. The multi-cycle command may then proceed to E2 and be allocated a write-back queue entry before the exception handler executes. Generally, a write-back queue entry exists whenever there is an entry stored in a local capture queue. The multi-cycle command may also be rolled back into E1 and re-executed with the LDD instruction. However, rolling back the execution state may require tracking more instruction results than the number of pipeline stages. In some cases, the number of registers in the write-back queue may exceed the number of pipeline stages, to handle boundary cases around instructions that have advanced out of the E1 stage to the E2 stage and generated output, but will be rolled back to the E1 stage.
In some cases, if execution of a multi-cycle instruction has already begun when the interrupt is received, e.g., if the MPY32 instruction is in the E2-E4 stages, the multi-cycle instruction may be executed to completion and the results stored in the central capture queue 806 via the local unit capture queue 804. After the interrupt is handled, the stored results from the multi-cycle instruction are restored from the central capture queue 806 to the local unit capture queue 804 for output.
In some cases, the local unit capture queue 804 and the central capture queue 806 may be omitted, and instead, a save memory or register may be used to enable interrupt handling in the unprotected mode. In this case, if an interrupt is received after execution of the instruction has begun (e.g., in the E2-E4 stages), the instruction may be executed to completion and the results stored in save memory. After the interrupt is handled, the stored result is then written to the output register. If an interrupt is received before execution of the instruction begins (e.g., at stage E1), the instruction is reloaded after the interrupt is handled. If the instruction passes the E1 stage and moves to E2, a local unit capture queue may be allocated for the instruction.
According to certain aspects, the information in the write-back queue may also include a delay value to help track the age of the associated instruction. The delay value may be initialized based on the expected number of processor cycles required to execute the associated instruction. The delay value may be adjusted, such as by decrementing it, every clock cycle, regardless of whether the pipeline is stalled. If there is no pipeline stall, the lifetime tracking value and the delay value expire at the same time, and the result of the instruction may be written to the output register file. As described above, if the pipeline stalls, adjustment of the lifetime tracking value associated with the instruction may be suspended. However, once the instruction has passed through the E1 stage, execution of the instruction continues until the instruction completes. In this case, the delay value will reach its expiration value (e.g., zero) before the lifetime tracking value reaches its expiration value (e.g., zero), and the result of the instruction may be captured in the local unit capture queue. Once the output has been captured by the local unit capture queue, the write back queue entry continues to track the output until the lifetime tracking value reaches its expiration value. When the lifetime tracking value reaches its expiration value and the pipeline is not stalled, the output may be transferred from the local unit capture queue into the output register file specified by the instruction.
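The divergence between the two counters under a stall can be sketched as follows. This is a software model under stated assumptions: the names (inflight_t, cycle) and the three-cycle example are illustrative, not part of this description.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  life;      /* freezes while the pipeline is stalled   */
    int  latency;   /* always counts down once execution began */
    bool captured;  /* result parked in the local capture queue */
} inflight_t;

static void cycle(inflight_t *in, bool stalled)
{
    if (in->latency > 0)
        in->latency--;                 /* execution keeps going */
    if (!stalled && in->life > 0)
        in->life--;

    if (in->latency == 0 && in->life > 0 && !in->captured) {
        in->captured = true;           /* park result in capture queue */
        printf("result captured; waiting %d more cycles\n", in->life);
    }
    if (in->life == 0)
        printf(in->captured ? "drain capture queue to register file\n"
                            : "write result directly to register file\n");
}

int main(void)
{
    inflight_t in = { .life = 3, .latency = 3, .captured = false };
    cycle(&in, false);  /* both 3 -> 2 */
    cycle(&in, true);   /* stall: latency 2 -> 1, life stays 2 */
    cycle(&in, true);   /* stall: latency 1 -> 0, result captured */
    cycle(&in, false);  /* life 2 -> 1 */
    cycle(&in, false);  /* life 1 -> 0: drain capture queue */
    return 0;
}
```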
According to certain aspects, the scoreboard 802 may also track memory load operations that have not yet completed, such as operations delayed by unexpected events in the memory system (e.g., cache misses). In some cases, the scoreboard 802 may track up to a fixed number (e.g., up to eight) of outstanding loads before the pipeline must be halted to wait for L1D read data to be returned. There are at least three general cases in which the pipeline may stall due to a memory load condition. First, in protected mode, if the destination of a load instruction is read as an operand of a subsequent instruction before the L1D data cache can return the data, the pipeline stalls until the L1D data cache returns the data. Second, in either protected or unprotected mode, if the destination of the load instruction is read as an operand of an instruction and the L1D data cache indicates that it will not have the data within the 4-cycle L1D cache latency, the pipeline will stall. Third, if the processor has sent eight load instructions and data has not been returned for any of them, the pipeline will stall when it encounters the next load instruction, provided it has not already stalled for one of the reasons described above.
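These three stall conditions can be expressed as a single predicate, sketched below. The structure names are assumptions for illustration; only the 4-cycle latency constant and the limit of eight outstanding loads mirror the text.

```c
#include <stdbool.h>

#define MAX_OUTSTANDING_LOADS 8
#define L1D_LATENCY 4

typedef struct {
    bool protected_mode;
    int  outstanding_loads;        /* loads with no data returned yet */
} cpu_state_t;

typedef struct {
    bool is_load;
    bool reads_pending_load_dest;  /* operand is a pending load's destination */
    int  cycles_since_load_issue;  /* age of the pending load */
    bool l1d_signals_miss;         /* L1D says it will miss the 4-cycle latency */
} next_insn_t;

static bool must_stall(const cpu_state_t *cpu, const next_insn_t *in)
{
    /* Case 1: protected mode, consumer arrives before L1D returns data. */
    if (cpu->protected_mode && in->reads_pending_load_dest &&
        in->cycles_since_load_issue < L1D_LATENCY)
        return true;

    /* Case 2: either mode, consumer arrives and L1D has signalled it
     * cannot meet the 4-cycle latency (e.g. a cache miss). */
    if (in->reads_pending_load_dest && in->l1d_signals_miss)
        return true;

    /* Case 3: a ninth load while eight loads are still outstanding. */
    if (in->is_load && cpu->outstanding_loads >= MAX_OUTSTANDING_LOADS)
        return true;

    return false;
}

int main(void)
{
    cpu_state_t cpu = { .protected_mode = true, .outstanding_loads = 1 };
    next_insn_t in  = { .reads_pending_load_dest = true,
                        .cycles_since_load_issue = 2 };
    return must_stall(&cpu, &in) ? 0 : 1;  /* stalls under case 1 */
}
```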
Using the scoreboard 802 to track memory load behavior helps allow the processor to accept data returns from the memory system in any order. In some cases, the processor may be configured to send a transaction identifier (ID) with each load instruction, and the L1D may return the corresponding transaction ID with the load data. The scoreboard 802 may also allow the compiler to schedule load instructions earlier and hide the L1D cache miss penalty when the compiler knows it has enough other work for the processor. In addition, the scoreboard 802 allows the L1D data cache to support miss-under-miss (and hit-under-miss) behavior, resulting in possible performance improvements for code with a mix of loads that may miss (e.g., large database entry lookups) and loads that may hit (e.g., stack accesses).
According to certain aspects, the central capture queue 806 may hold the contents of the local unit capture queue 804, for example, when an interrupt or exception event occurs. The central capture queue 806 may include one or more save registers Q0-Q4 to delay updates to processor registers when one or more problems are detected with an instruction write-back that should occur as the instruction leaves the E1 execution stage. For example, during execution of a load or store instruction, a page fault may be detected by a micro-translation lookaside buffer (μTLB) as part of branching to a page fault handling service. Typically, the μTLB translates load/store instruction addresses to physical mappings. In the event that the virtual-to-physical address mapping for a particular memory access instruction cannot be found, the μTLB triggers a page fault in the E1 execution stage. The load/store instruction is then placed into the central capture queue. In protected mode, all instructions preceding the load/store instruction that caused the page fault complete normally. In the unprotected mode, if instructions preceding the load/store instruction that caused the page fault have not reached their normal write-back cycle before the page fault is detected by the μTLB, the results of such instructions are saved in the central capture queue or the local unit capture queue for output to the register file after the page fault is resolved. After the page fault is resolved, for example after the correct page translation entry is located, the load/store instruction is reloaded and execution resumes using the correct page translation entry. In some cases, a correctable problem with the execution of an instruction may be detected when the instruction is at the E2 stage of execution. If the processor determines that the instruction has a correctable problem in the E2 stage, the register file updates delayed in the central capture queue 806 are returned to the local unit capture queue so that they can be saved when the processor transfers execution to the exception handler.
According to certain aspects, in protected mode, when an interrupt is received, the instructions already in the pipeline complete normal execution, the interrupt is then handled, and the remaining instructions are executed afterwards. Interrupt handling is straightforward since the program expects the processor to insert delays as needed. In the unprotected mode, where instruction scheduling is handled by the application or the compiler itself, an attempt to interrupt execution in the middle of application code would likely break the instruction schedule. Local unit capture queues may be used to help solve such potential scheduling problems. When operating in the unprotected mode and an interrupt is received, the pipeline contents of the functional unit handling the interrupt may be written to the local unit capture queue. The interrupt is then processed, and the pipeline is resumed after the interrupt is handled. The capture queue may thus be used for data hazards and load/store out-of-order completion, as well as for handling interrupts. In some cases, the pipeline may operate in an unprotected mode before an interrupt is received but be switched to a protected mode after the interrupt is received and before control is returned to the application code, for example, by the interrupt handler. Since the local unit capture queue may be used while operating in protected mode, the pipeline data in the local unit capture queue may be offloaded, for example, to memory. This memory space may be space in cache memory, such as in an L1, L2, or L3 cache, or in an on-die static random access memory (SRAM).
When it is known that the local unit capture queue needs to be unloaded, for example when an interrupt is received while operating in the unprotected mode, the processor may facilitate pre-writing a block of memory allocated for the contents of the local unit capture queue in the memory space. Each executing task in the processor is associated with an event context save pointer (ECSP) that points to this memory space. When the original task is interrupted by an interrupt or another higher priority task, the state of the pipeline registers of the functional unit is saved to the local unit capture queue and then copied to the block of memory pointed to by ECSP-A. The functional unit then starts executing the interrupting task and rewrites the ECSP to the ECSP-B associated with the interrupting task. During execution of the interrupting task, the interrupting task uses the local unit capture queue. When the interrupting task completes, the original task is reloaded and the ECSP is rewritten back to ECSP-A. Based on ECSP-A, the saved state is copied from the block of memory to the local unit capture queue and then to the pipeline registers. Execution of the original task resumes at the location where the previous execution stopped.
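A software sketch of this ECSP handoff follows. The block layout, the unit_ctx_t type, and the two save blocks are assumptions for illustration; only the save, repoint, and restore sequence mirrors the text.

```c
#include <string.h>
#include <stdio.h>

#define CQ_WORDS 16

typedef struct {
    unsigned cq[CQ_WORDS];   /* saved local unit capture queue contents */
} unit_ctx_t;

static unit_ctx_t *ecsp;     /* ECSP: where the current task saves state */

/* On interrupt: spill the capture queue to the block the current ECSP
 * (ECSP-A) points to, then repoint ECSP at the interrupting task's
 * block (ECSP-B). */
static void enter_interrupt(unit_ctx_t *ecsp_b, const unsigned live_cq[])
{
    memcpy(ecsp->cq, live_cq, sizeof ecsp->cq);  /* save via ECSP-A */
    ecsp = ecsp_b;                               /* switch to ECSP-B */
}

/* On return: repoint ECSP at the original block and reload the
 * pipeline's capture queue from it. */
static void leave_interrupt(unit_ctx_t *ecsp_a, unsigned live_cq[])
{
    ecsp = ecsp_a;
    memcpy(live_cq, ecsp->cq, sizeof ecsp->cq);  /* restore via ECSP-A */
}

int main(void)
{
    static unit_ctx_t block_a, block_b;  /* pre-allocated save blocks */
    unsigned live[CQ_WORDS] = { 0xdead, 0xbeef };

    ecsp = &block_a;
    enter_interrupt(&block_b, live);   /* original task preempted */
    /* ... interrupting task runs, using the capture queue freely ... */
    leave_interrupt(&block_a, live);   /* original task resumes */
    printf("restored: %x %x\n", live[0], live[1]);
    return 0;
}
```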
FIG. 9 illustrates an example functional unit, capture queue, and scoreboard complex 900 in accordance with aspects of the present description. As shown, the functional unit 902 of the complex 900 corresponds to the .M functional unit 808B from FIG. 8. The functional unit 902 includes four pipeline stages; other functional units may include more or fewer pipeline stages. Each pipeline stage of the functional unit 902 takes one clock cycle to complete, and each instruction may take a different number of cycles to process. For example, a first instruction may take two cycles to complete, so its output comes from the E2 pipeline stage. Each functional unit may generate a single write to the output register file 914 per clock cycle via the result bus 916. The local unit capture queue 904 helps track the pipeline register contents of the corresponding functional unit. Typically, there may be one local unit capture queue 904 per functional unit. Each pipeline stage that may generate a result (here E1, E2, and E4) may be coupled to one or more MUXs 906A-906C and capture queue registers 908A-908C of the local unit capture queue 904. Connecting the pipeline stages to multiple capture queue registers helps handle long series of instructions. For example, there may be a series of instructions in the pipeline that will all attempt to write to the output register in the same clock cycle, such as a four-cycle instruction followed by a three-cycle, a two-cycle, and a one-cycle instruction. In this case, the four-cycle instruction is written to the output register, and the results of the three-cycle, two-cycle, and one-cycle instructions are stored in the capture queue registers 908A-908C.
The local unit capture queue 904 may operate in conjunction with a scoreboard 910. The scoreboard 910 is coupled to the MUXs 906A-906C along with the central capture queue 918, and clock gating enables the capture queue registers 908A-908C via the bus 912. The scoreboard 910 maintains a set of registers that help track whether a functional unit is working on producing a result. If a functional unit is working on producing the write back value for a given register, the bit corresponding to that register is high. Every functional unit's scoreboard tracks register write back results, and these are ORed together at the top level to consolidate all register usage per cycle. The scoreboard 910 may then make a set of comparisons. In some cases, the scoreboard 910 may compare each read operand of each functional unit to detect a potential read-after-write hazard in protected mode. For example, if the .N src1 operand is register A1, and A1 will be written back by the .M unit in two cycles, the scoreboard detects that an instruction operand is attempting to read A1 and stalls that instruction at the E1 stage until the corresponding bit is set low. The scoreboard bits may also be compared to the write address of each unit to detect write-after-write hazards in protected mode. For example, another functional unit (e.g., .L) may be writing to A1 while A1 will be written back by the .M functional unit in three cycles (two cycles after the .L functional unit has committed to generating its A1 write back value). The hazard logic then loads the local capture queue of the .L functional unit until .M completes its write to A1 and the corresponding bit is set low. The local capture queue of the .L functional unit then unloads the A1 value from its storage and places it on the output of .L.
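The per-cycle hazard comparison can be sketched as follows: each unit's pending write-back registers are ORed into one busy vector, and read or write addresses of issuing instructions are checked against it. All names here are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_UNITS 4   /* e.g. .L, .S, .M, .N */

/* One bit per architectural register: set while some unit is still
 * working on the write-back value for that register. */
static uint64_t unit_pending[NUM_UNITS];

static uint64_t busy_vector(void)
{
    uint64_t busy = 0;
    for (int u = 0; u < NUM_UNITS; u++)
        busy |= unit_pending[u];       /* top-level OR across all units */
    return busy;
}

/* Read-after-write: stall the reader in E1 until the bit clears. */
static bool raw_hazard(int src_reg)
{
    return (busy_vector() >> src_reg) & 1;
}

/* Write-after-write: the second writer's result is parked in its
 * local capture queue until the earlier write completes. */
static bool waw_hazard(int dst_reg)
{
    return (busy_vector() >> dst_reg) & 1;
}

int main(void)
{
    unit_pending[2] = 1u << 1;         /* .M will write A1 in a few cycles */
    printf("read A1 now? %s\n",  raw_hazard(1) ? "stall" : "go");
    printf("write A1 now? %s\n", waw_hazard(1) ? "hold in capture queue" : "go");
    return 0;
}
```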
FIG. 10 illustrates an example capture queue register bit field 1000, according to some aspects. The fields shown, their order, and their sizes may vary; the fields as illustrated in FIG. 10 are one example. Bit fields 1002 and 1004 illustrate two example data formats for a capture queue and a write back queue. According to certain aspects, the information in block 1006 is stored in the write-back queue and the information in block 1008 is stored in the capture queue. In this example of the write-back queue, V indicates whether the entry is valid, DV indicates whether the write updates the master register file, PV indicates whether the write updates the predicate register file, RFNUM encodes which register file is being written, RFADDR encodes the register file address, RFPADDR encodes the predicate register file address, and LIFE encodes the lifetime tracking value. For the capture queue, FP STATE represents the predicate file state and DATA represents the stored data.
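One possible packing of these fields is sketched below as C bit-field structs. The field widths are assumptions chosen for illustration; this description does not fix them.

```c
#include <stdint.h>

/* Write-back queue entry fields (block 1006). Widths are illustrative. */
typedef struct {
    unsigned v       : 1;  /* entry valid */
    unsigned dv      : 1;  /* write updates the master register file */
    unsigned pv      : 1;  /* write updates the predicate register file */
    unsigned rfnum   : 3;  /* which register file is written */
    unsigned rfaddr  : 6;  /* register file address */
    unsigned rfpaddr : 3;  /* predicate register file address */
    unsigned life    : 6;  /* lifetime tracking value */
} wbq_fields_t;

/* Capture queue entry fields (block 1008). */
typedef struct {
    uint8_t  fp_state;     /* predicate file state */
    uint64_t data;         /* the stored result data */
} cq_fields_t;

int main(void) { return 0; }
```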
As illustrated in FIG. 11, the capture queue structure also facilitates fast mode switching between protected and unprotected modes, and vice versa. Previously, some processors could switch between, for example, an unprotected mode and a protected mode, but would typically stall instructions after the switch command until all in-flight instructions completed. The capture queue helps allow switching from unprotected mode to protected mode, and vice versa, without clearing the pipeline or, in some cases, even pausing. For example, when switching from the unprotected mode to the protected mode, the lifetime tracking value of any instruction already in the pipeline may be set to less than 0, e.g., -1, meaning that the corresponding instruction should have already been committed to the register file. As described above, the hazard logic associated with the protected mode then becomes effective. In cycle 1 of FIG. 11, the processor pipeline executing the illustrated instructions executes in the unprotected mode. At cycle 4, the PROT command is executed in E1 and the pipeline is switched to protected mode. The lifetime tracking value for the MPY32 instruction is then set to -1. The lifetime tracking value of the ADD command is set to the value normally associated with ADD commands because execution of that command has not yet begun. If the ADD command utilizes the A0 register to which the MPY32 command outputs, execution of the ADD command continues as described above. In the event that the ADD command does not utilize the same registers as the MPY32 command, the ADD command can be executed immediately after the PROT command without stalling the pipeline.
FIG. 12 illustrates an example circuit for lifetime tracking 1200, according to some aspects. After the DC2 stage reads an instruction and passes it from the DC2 register 1202 to the E1 stage, a write back queue 1210 entry is allocated for the instruction. A counter 1204 keeps track of which write back queue slot WBQ 0 through WBQ 4 should be used next and generates a pointer to the next write back queue slot; this pointer is decoded by decoder 1206, which converts the pointer to an address on the write back queue bus. The write back queue slots are allocated in round-robin order: first write back queue slot WBQ 0 is allocated, then WBQ 1, and so on to the last write back queue slot, here WBQ 4. After the last write back queue slot is allocated, allocation wraps back to the first write back queue slot, so the next write back queue slot to be allocated is WBQ 0. Allocating the write back queue slots in round-robin order helps ensure that if there are multiple write backs outstanding at the same time, the allocation of write back queue slots occurs in a deterministic manner. The write back queue slots may be associated with corresponding local unit capture queue slots CQ 0 through CQ 3 via the unit scheduler 1208.
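The round-robin allocation of WBQ 0 through WBQ 4 reduces to a small modular counter, sketched here under the assumption of the five-slot queue named above; the function names are illustrative.

```c
#include <stdio.h>

#define NUM_WBQ_SLOTS 5

static int next_slot;   /* plays the role of counter 1204 */

/* Return the slot to use for the instruction leaving DC2 for E1,
 * then advance the counter in round-robin order. */
static int allocate_wbq_slot(void)
{
    int slot = next_slot;
    next_slot = (next_slot + 1) % NUM_WBQ_SLOTS;  /* wrap after WBQ 4 */
    return slot;
}

int main(void)
{
    for (int i = 0; i < 7; i++)
        printf("instruction %d -> WBQ %d\n", i, allocate_wbq_slot());
    /* prints WBQ 0..4, then wraps to WBQ 0, WBQ 1 */
    return 0;
}
```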
A local unit capture queue slot associated with the instruction may be allocated to the write back queue slot entry on a lowest-available-entry basis. In some cases, the local unit capture queue slot number may be determined when the instruction passes from the DC2 stage to the E1 stage and held while the instruction is in the E1 stage. When the instruction passes from the E1 stage to the E2 stage, the local unit capture queue slot number may be written into the local unit capture queue number field of the associated write back queue entry. The local unit capture queue numbers held in the write back queue and any local unit capture queue slot number assigned in the E1 stage may be combined to construct a vector of all currently used local unit capture queue slots. The next local unit capture queue slot to be used is the lowest-numbered local unit capture queue slot currently unused.
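The lowest-available selection can be sketched as a bit-vector scan, shown below for the four slots CQ 0 through CQ 3 named earlier; the names and vector encoding are assumptions for illustration.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_CQ_SLOTS 4   /* CQ 0 .. CQ 3 */

/* Combine in-use slot numbers from the write-back queue and from any
 * instruction in E1 into one bit vector, then pick the lowest clear bit. */
static int next_cq_slot(uint8_t wbq_used, uint8_t e1_used)
{
    uint8_t used = wbq_used | e1_used;       /* all currently used slots */
    for (int i = 0; i < NUM_CQ_SLOTS; i++)   /* lowest-numbered free slot */
        if (!((used >> i) & 1))
            return i;
    return -1;                               /* no slot free */
}

int main(void)
{
    /* CQ 0 held by a write-back queue entry, CQ 1 by an E1 instruction */
    printf("next slot: CQ %d\n", next_cq_slot(0x1, 0x2));  /* -> CQ 2 */
    return 0;
}
```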
According to certain aspects, certain commands may not return results within a fixed number of cycles. These commands may be referred to as variable latency commands. Examples of variable latency instructions include LOAD commands as well as DIV and MOD commands. In some cases, variable latency commands may be divided into two different types: the first type is a memory operation (e.g., a LOAD command), and the second type is a command in which the amount of time required to complete the command varies based on the operands the command operates on (e.g., a divide or modulo command). The first type of variable latency command is typically used to retrieve data from a memory system. In some cases, the memory system may return results in any order and/or size, and the time at which data is returned may vary due to, for example, cache misses, memory bank conflicts, cache maintenance operations, and the like. Similarly, the second type of variable latency command may take up to 64 cycles to complete, but the exact number of cycles may vary, for example, based on the values of the divisor and dividend. Lifetime tracking for variable latency instructions may be handled by modifying the write back queue.
In some cases, the first type of variable latency command may be handled in a manner similar to other multi-cycle instructions, with certain modifications. First, the mapping between the write back queue slots and the local unit capture queue slots may be changed to a one-to-one mapping, rather than having more write back queue slots than local unit capture queue slots. Furthermore, instead of using a circular buffer to select which write back queue slot to use next, the next write back queue slot may be selected based on the lowest available entry with an available local unit capture queue. Upon issuing the LOAD command, the selected write back queue slot and local unit capture queue slot numbers may be passed to the memory system as a command ID (CID). Then, when the memory system returns a portion of the requested data from the LOAD command, the portion may be returned with the associated CID, and the returned portion may be assembled in the appropriate part of the write-back queue. The memory system may also return an indication (e.g., RLAST) that the associated portion of the requested data is the final portion to be returned. Upon receiving this indication, the LOAD data may be output from the write-back buffer.
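The CID-tagged return path can be sketched as follows: pieces of a load may arrive in any order, each piece's CID selects the write-back queue entry to assemble into, and RLAST marks the final piece. The structure names, line size, and offsets are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_SLOTS  4
#define LINE_BYTES 16

typedef struct {
    bool    active;
    uint8_t data[LINE_BYTES];   /* load data assembled piece by piece */
} load_slot_t;

static load_slot_t wbq[NUM_SLOTS];

/* Called for each return beat from the memory system. */
static void on_return(int cid, int offset, const uint8_t *bytes,
                      int nbytes, bool rlast)
{
    memcpy(&wbq[cid].data[offset], bytes, nbytes);  /* place this piece */
    if (rlast) {                                    /* final piece arrived */
        printf("CID %d complete: output from write-back buffer\n", cid);
        wbq[cid].active = false;                    /* slot may be reused */
    }
}

int main(void)
{
    uint8_t hi[8] = {8,9,10,11,12,13,14,15}, lo[8] = {0,1,2,3,4,5,6,7};
    wbq[1].active = true;              /* CID 1 was issued with a LOAD */
    on_return(1, 8, hi, 8, false);     /* later half arrives first */
    on_return(1, 0, lo, 8, true);      /* first half arrives, RLAST set */
    return 0;
}
```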
FIG. 13 illustrates an example circuit for variable latency lifetime tracking 1300, in accordance with some aspects. In some cases, a ready counter and a dedicated write back queue slot 1302 in the write back queue 1306, together with a dedicated local unit capture queue slot 1304 in the local unit capture queue 1308, may be used to handle variable latency commands of the second type. If the dedicated write back queue slot 1302 is occupied or active, any new second-type command may be stalled in the E1 stage until the current variable latency command has completed. In some cases, variable latency commands are not pipelined, and a functional unit executes a single variable latency command at a time. In some cases, in the unprotected mode, the second type of command is defined to take zero or one cycle to complete. Since the functional unit is occupied processing the variable latency command for several cycles, the next instruction may stall and execute after the second-type command completes. According to certain aspects, where variable latency commands are executed using the dedicated write-back queue slot 1302 and the dedicated local unit capture queue slot 1304, certain functional units may be configured to support the second type of command while other functional units do not.
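The dedicated-slot rule can be sketched as below: a new divide- or modulo-style command stalls in E1 while the single dedicated slot is still active. The names and the 64-cycle worst case (from the text above) frame the sketch; everything else is an assumption.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool active;        /* dedicated slot 1302 occupied */
    int  ready_count;   /* counts down as the divide proceeds */
} dedicated_slot_t;

static dedicated_slot_t div_slot;

static bool try_issue_div(void)
{
    if (div_slot.active)
        return false;             /* stall in E1 until current op done */
    div_slot.active = true;
    div_slot.ready_count = 64;    /* worst case; actual count varies */
    return true;
}

static void cycle(void)
{
    if (div_slot.active && --div_slot.ready_count == 0)
        div_slot.active = false;  /* result ready; slot freed */
}

int main(void)
{
    printf("issue #1: %s\n", try_issue_div() ? "ok" : "stall");
    printf("issue #2: %s\n", try_issue_div() ? "ok" : "stall");
    for (int i = 0; i < 64; i++) cycle();
    printf("issue #2 retry: %s\n", try_issue_div() ? "ok" : "stall");
    return 0;
}
```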
FIG. 14 is a flow diagram 1400 illustrating a technique for executing a plurality of instructions by a processor, according to some aspects. At block 1402, a first instruction is received for execution on an instruction execution pipeline; for example, a non-variable-latency instruction may be received by the processor for execution. At block 1404, a first delay value may be determined based on an expected amount of time required to execute the first instruction. For example, an instruction may be associated with an expected number of processor cycles needed to execute it, and a delay value may be assigned based on this expected number of processor cycles. At block 1406, the first delay value is stored in a write back queue associated with the first instruction. The write back queue stores information associated with instruction execution; for example, the write back queue may be a set of processor registers that store information associated with executing instructions, and it may be associated with a local unit capture queue slot. At block 1408, execution of the first instruction on the instruction execution pipeline may begin. At block 1410, the delay value may be adjusted based on the amount of time that has elapsed since execution of the first instruction started; for example, the delay value associated with an executing instruction may be adjusted each processor clock cycle. At block 1412, a first result of the first instruction may be output based on the delay value. At block 1414, a second instruction is received for execution on the instruction execution pipeline. At block 1416, the second instruction is determined to be a variable latency instruction; examples of variable latency instructions include, but are not limited to, memory operations, divisions, and modulo operations. At block 1418, a ready indication is stored in the write back queue indicating that a second result of the second instruction is not ready. As an example, the delay value may be replaced with a data ready indicator indicating whether the second instruction has completed execution. At block 1420, execution of the second instruction begins on the instruction execution pipeline. If a third variable latency instruction is received for execution on the instruction execution pipeline while the second instruction is being executed, the third instruction will stall based on the data ready indicator. If a third instruction that is not a variable latency instruction is received for execution on the instruction execution pipeline while the second instruction is being executed, but the third instruction utilizes a memory location that the second instruction will use, the third instruction will also stall based on the data ready indicator. At block 1422, based on a determination that execution of the second instruction is complete, the ready value is updated to indicate that the second result is ready. For example, a signal may be received with a portion of data requested from memory indicating that the portion is the last portion of the requested data, and the data ready indication in the write back queue may be updated based on that signal; similarly, after a divide or modulo operation completes, the data ready indication in the write back queue may be updated. At block 1424, the second result is output; for example, the result may be made available in the appropriate output register. Execution of the stalled third instruction may then begin.
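The overall flow can be condensed into a compact sketch: fixed-latency instructions carry a delay value, while variable-latency instructions carry a ready flag instead, and dependents stall on that flag. The names and the step/start split are assumptions for illustration; the block numbers in comments refer to FIG. 14.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool variable;   /* variable-latency instruction? (block 1416) */
    int  delay;      /* fixed case: expected cycles (block 1404)   */
    bool ready;      /* variable case: result ready? (block 1418)  */
} wb_entry_t;

static void start(wb_entry_t *e, bool variable, int expected_cycles)
{
    e->variable = variable;
    e->delay = variable ? 0 : expected_cycles;
    e->ready = false;
}

/* Advance one cycle; returns true when the result may be output. */
static bool step(wb_entry_t *e, bool completion_signal)
{
    if (e->variable) {
        if (completion_signal)         /* e.g. RLAST, or divide done */
            e->ready = true;           /* block 1422 */
        return e->ready;               /* output when ready (1424)   */
    }
    if (e->delay > 0)
        e->delay--;                    /* block 1410 */
    return e->delay == 0;              /* output when expired (1412) */
}

int main(void)
{
    wb_entry_t add, load;
    start(&add, false, 1);             /* single-cycle instruction */
    start(&load, true, 0);             /* variable-latency LOAD    */

    printf("add done? %d\n",  step(&add, false));   /* 1 */
    printf("load done? %d\n", step(&load, false));  /* 0: dependent stalls */
    printf("load done? %d\n", step(&load, true));   /* 1: RLAST received   */
    return 0;
}
```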
In this description, the term "coupled" means either indirect or direct wired or wireless connection. Thus, if a first device is coupled to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections. The recitation "based on" means "based at least in part on". Thus, if X is based on Y, X may be a function of Y and any number of other factors.
The above discussion is meant to be illustrative of the principles and various embodiments of the present description. Numerous variations and modifications will become apparent to those skilled in the art once the above description is fully appreciated. It is intended that the following claims cover all such variations and modifications.
In the drawings, like elements are denoted by like reference numerals for consistency.
Modifications are possible in the described embodiments, and other embodiments are possible within the scope of the claims.

Claims (20)

1. A method for executing a plurality of instructions by a processor, the method comprising:
receiving a first instruction for execution on an instruction execution pipeline;
determining a first delay value based on an expected amount of time required to execute the first instruction;
storing the first delay value in a write-back queue, the write-back queue storing information associated with instruction execution;
initiating execution of the first instruction on the instruction execution pipeline;
adjusting the delay value based on an amount of time elapsed since execution of the first instruction was initiated;
outputting a first result of the first instruction based on the delay value;
receiving a second instruction for execution on the instruction execution pipeline;
determining that the second instruction is a variable latency instruction;
storing, in the write back queue, a ready value indicating that a second result of the second instruction is not ready;
initiating execution of the second instruction on the instruction execution pipeline;
updating the ready value to indicate that the second result is ready based on a determination that execution of the second instruction is complete; and
outputting the second result.
2. The method of claim 1, wherein the second instruction comprises a memory operation.
3. The method of claim 2, wherein the memory operation comprises loading data from memory, and further comprising:
receiving one or more portions of the data from a memory;
assembling the one or more portions of the data in the write-back queue when the one or more portions are received; and
outputting the data after assembling the data from the one or more portions of the data.
4. The method of claim 1, wherein the second instruction comprises a divide or modulo instruction.
5. The method of claim 4, wherein the ready value is stored in a dedicated write-back queue.
6. The method of claim 1, wherein the expected amount of time is based on a number of processor cycles in which a respective instruction may complete.
7. The method of claim 1, further comprising:
receiving a third instruction for execution on the instruction execution pipeline before execution of the second instruction has completed; and
suspending execution of the third instruction until execution of the second instruction has completed.
8. The method of claim 7, wherein suspending the execution of the third instruction is based on a data ready indicator.
9. A processor, comprising:
an instruction execution pipeline having a plurality of pipeline stages;
pipeline circuitry configured to:
receiving a first instruction for execution on an instruction execution pipeline;
determining a first delay value based on an expected amount of time required to execute the first instruction;
storing the first delay value in a write-back queue, the write-back queue storing information associated with instruction execution;
initiating execution of the first instruction on the instruction execution pipeline;
adjusting the delay value based on an amount of time elapsed since execution of the first instruction was initiated;
outputting a first result of the first instruction based on the delay value;
receiving a second instruction for execution on the instruction execution pipeline;
determining that the second instruction is a variable latency instruction;
storing, in the write back queue, a ready value indicating that a second result of the second instruction is not ready;
initiating execution of the second instruction on the instruction execution pipeline;
updating the ready value to indicate that the second result is ready based on a determination that execution of the second instruction is complete; and
outputting the second result.
10. The processor of claim 9, wherein the second instruction comprises a memory operation.
11. The processor of claim 10, wherein the memory operation comprises loading data from memory, and wherein the pipeline circuitry is further configured to:
receiving one or more portions of the data from a memory;
assembling the one or more portions of the data in the write-back queue when the one or more portions are received; and
outputting the data after assembling the data from the one or more portions of the data.
12. The processor of claim 9, wherein the second instruction comprises a divide or modulo instruction.
13. The processor of claim 12, wherein the ready value is stored in a dedicated write-back queue.
14. The processor of claim 9, wherein the expected amount of time is based on a number of processor cycles in which a respective instruction may complete.
15. The processor of claim 9, wherein the pipeline circuitry is further configured to:
receiving a third instruction for execution on the instruction execution pipeline before execution of the second instruction has completed; and
suspending execution of the third instruction until execution of the second instruction has completed.
16. A processing system, comprising:
a memory;
a processor, comprising:
an instruction execution pipeline having a plurality of pipeline stages;
pipeline circuitry configured to:
receiving a first instruction for execution on an instruction execution pipeline;
determining a first delay value based on an expected amount of time required to execute the first instruction;
storing the first delay value in a write-back queue, the write-back queue storing information associated with instruction execution;
initiating execution of the first instruction on the instruction execution pipeline;
adjusting the delay value based on an amount of time elapsed since execution of the first instruction was initiated;
outputting a first result of the first instruction based on the delay value;
receiving a second instruction for execution on the instruction execution pipeline;
determining that the second instruction is a variable latency instruction;
storing, in the write back queue, a ready value indicating that a second result of the second instruction is not ready;
initiating execution of the second instruction on the instruction execution pipeline;
updating the ready value to indicate that the second result is ready based on a determination that execution of the second instruction is complete; and
outputting the second result.
17. The processing system of claim 16, wherein the second instruction comprises a memory operation.
18. The processing system of claim 17, wherein the memory operation comprises loading data from memory, and wherein the pipeline circuitry is further configured to:
receiving one or more portions of the data from a memory;
assembling the one or more portions of the data in the write-back queue when the one or more portions are received; and
outputting the data after assembling the data from the one or more portions of the data.
19. The processing system of claim 16, wherein the second instruction comprises a divide or modulo instruction.
20. The processing system of claim 19, wherein the ready value is stored in a dedicated write-back queue.
CN202080037631.4A 2019-04-15 2020-04-15 Variable delay instructions Pending CN113853584A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16/384,328 2019-04-15
US16/384,328 US11210098B2 (en) 2013-07-15 2019-04-15 Variable latency instructions
PCT/US2020/028178 WO2020214624A1 (en) 2019-04-15 2020-04-15 Variable latency instructions

Publications (1)

Publication Number Publication Date
CN113853584A true CN113853584A (en) 2021-12-28

Family

ID=72837575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080037631.4A Pending CN113853584A (en) 2019-04-15 2020-04-15 Variable delay instructions

Country Status (2)

Country Link
CN (1) CN113853584A (en)
WO (1) WO2020214624A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210098B2 (en) * 2013-07-15 2021-12-28 Texas Instruments Incorporated Variable latency instructions

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6195748B1 (en) * 1997-11-26 2001-02-27 Compaq Computer Corporation Apparatus for sampling instruction execution information in a processor pipeline
GB2447907B (en) * 2007-03-26 2009-02-18 Imagination Tech Ltd Processing long-latency instructions in a pipelined processor
US8688964B2 (en) * 2009-07-20 2014-04-01 Microchip Technology Incorporated Programmable exception processing latency

Also Published As

Publication number Publication date
WO2020214624A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
JP5357017B2 (en) Fast and inexpensive store-load contention scheduling and transfer mechanism
US7051190B2 (en) Intra-instruction fusion
US20220113966A1 (en) Variable latency instructions
US11693661B2 (en) Mechanism for interrupting and resuming execution on an unprotected pipeline processor
US20010042188A1 (en) Multiple-thread processor for threaded software applications
US20210294639A1 (en) Entering protected pipeline mode without annulling pending instructions
EP2269134A1 (en) System and method of selectively committing a result of an executed instruction
US20040205326A1 (en) Early predicate evaluation to reduce power in very long instruction word processors employing predicate execution
US20240036876A1 (en) Pipeline protection for cpus with save and restore of intermediate results
US6341348B1 (en) Software branch prediction filtering for a microprocessor
US20210326136A1 (en) Entering protected pipeline mode with clearing
JP2001209535A (en) Command scheduling device for processors
CN113535236A (en) Method and apparatus for instruction set architecture based and automated load tracing
CN113853584A (en) Variable delay instructions
US20080141252A1 (en) Cascaded Delayed Execution Pipeline
US6988121B1 (en) Efficient implementation of multiprecision arithmetic
US10901747B2 (en) Unified store buffer
CN113568663A (en) Code prefetch instruction
US11449336B2 (en) Method of storing register data elements to interleave with data elements of a different register, a processor thereof, and a system thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination