GB2454816A - Method for executing a load instruction in a pipeline processor, putting the data in the target address into a buffer then loading the requested data. - Google Patents


Info

Publication number
GB2454816A
GB2454816A (application GB0822115A; granted as GB2454816B)
Authority
GB
United Kingdom
Prior art keywords
pipeline
value
location
execution unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0822115A
Other versions
GB2454816B (en)
GB0822115D0 (en)
Inventor
Son Dao Trong
Juergen Haess
David Shane Hutton
Michael Klein
John Gilbert Rell Jr
Eric Mark Schwarz
Kevin Chung-Lung Shum
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of GB0822115D0 publication Critical patent/GB0822115D0/en
Publication of GB2454816A publication Critical patent/GB2454816A/en
Application granted granted Critical
Publication of GB2454816B publication Critical patent/GB2454816B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842Speculative instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3861Recovery, e.g. branch miss-prediction, exception handling

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Disclosed is a method and system for operating the execution unit of a computer, the execution unit having a pipeline-based execution flow during which load instructions are processed. The load instructions having the function of loading data from a storage means into a predetermined location within the pipeline, preferably a register-implemented pipeline. The method has the steps of, when a load instruction occurs in the pipeline, reading (610) the current value of the target location, and buffering (620) the current target value at a predetermined location within said pipeline. Next, the value of the source location is loaded (610) and stored (620) at the target location, the pipeline is executed according to its execution flow, using the loaded value for computing purposes. If an event (630) indicating that the loaded value is not correct occurs, (660) the buffered original value may be used instead of the loaded value. The execution unit may be a floating point unit with the reading and/or loading of the data being done using a multiply-add data path.

Description

DESCRIPTION
Out of Order Execution of Floating Point Loads with Integrated Refresh Mechanism
1. BACKGROUND OF THE INVENTION
1.1. FIELD OF THE INVENTION
The present invention relates to computer processor technology, more particularly it relates to a method and system for operating an execution unit of an electronic computer system, wherein the execution unit comprises a pipeline-based execution flow during which amongst other instructions also load instructions are processed having the function of loading data from a storage means into a predetermined location within the pipeline, preferably a register-implemented pipeline.
1.2. DESCRIPTION AND DISADVANTAGES OF PRIOR ART
In earlier prior art processors, the processing of instructions is normally done "in-order" in the following steps:
1. Instruction fetch.
2. If input operands are available (in registers, for instance), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from memory), the processor stalls until they are available.
3. The instruction is executed by the appropriate functional unit.
4. The functional unit writes the results back to the register file.
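The four steps above may be sketched, purely for illustration, as a small Python model of in-order execution with the stall of step 2. This is a hypothetical sketch; all names (`run_in_order`, `ready_at`, the tuple format) are assumptions and not part of the prior art description:

```python
def run_in_order(program, regfile, ready_at):
    """Execute instructions strictly in program order.

    program:  list of (dest, src, op) tuples
    ready_at: dict mapping a register to the cycle its value arrives
    """
    clock = 0
    for dest, src, op in program:
        # Step 2: stall until the input operand is available.
        if clock < ready_at.get(src, 0):
            clock = ready_at[src]          # the processor stalls
        # Step 3: execute in the appropriate functional unit.
        result = op(regfile[src])
        clock += 1
        # Step 4: write the result back to the register file.
        regfile[dest] = result
        ready_at[dest] = clock
    return regfile, clock
```

With a dependent pair of instructions and a slowly arriving operand, the model reproduces the stall that the in-order scheme cannot avoid.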
Figure 1 illustrates such an in-order load instruction in a critical case of a read instruction 10 being dependent on a preceding instruction's 12 result, by schematically showing the scheme of storage locations of a shift register based, prior art execution pipeline in a "time-line" way, wherein the cycles (columns) are assumed to increase from left to right, and the pipeline depth extends from top to bottom over a couple of rows. As is apparent from the drawing, the instruction 10, which needs to read data from a storage location in the Floating Point Register (FPR) pipeline, disadvantageously has to wait for a time gap of 7 cycles (end of cycle 0 to begin of cycle 8) until the preceding instruction 12 has written its computed data to that storage location.
A more recent prior art "out-of-order" paradigm breaks up the processing of instructions into the following steps:
1. Instruction fetch.
2. Instruction dispatch to an instruction queue (also called instruction buffer or reservation stations).
3. The instruction waits in the queue until its input operands are available. The instruction is then allowed to leave the queue before earlier, older instructions.
4. The instruction is issued to the appropriate functional unit and executed by that unit.
5. The results are queued.
6. Only after all older instructions have had their results written back to the register file is this result written back to the register file. This is called the graduation or retire stage.
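The six steps above may likewise be sketched as a toy Python model: instructions issue from a queue as soon as their operands are ready, possibly younger-before-older, and retire to the register file strictly in program order. This is an illustrative sketch under stated assumptions (the `late` parameter stands in for a slow memory load; all names are invented, and the loop assumes every operand eventually arrives):

```python
def run_out_of_order(program, regfile, late=None):
    """program: list of (dest, srcs, op). Returns (regfile, issue_order)."""
    values = dict(regfile)            # registers plus forwarded results
    pending = list(enumerate(program))
    done, issue_order = {}, []
    while pending:
        issued = False
        for i, (seq, (dest, srcs, op)) in enumerate(pending):
            # Step 3: leave the queue as soon as operands are available,
            # possibly before earlier, older instructions.
            if all(s in values for s in srcs):
                values[dest] = op(*[values[s] for s in srcs])
                done[seq] = (dest, values[dest])
                issue_order.append(seq)
                pending.pop(i)
                issued = True
                break
        if not issued and late:
            values.update(late)       # a pending load from memory completes
            late = None
    # Step 6: retire results in program order (graduation).
    for seq in sorted(done):
        dest, val = done[seq]
        regfile[dest] = val
    return regfile, issue_order
```

In the example below, the older instruction waits on a slow operand, so the younger, independent one issues first; the issue order differs from program order while the final register state is still correct.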
The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation is not available. In the outline above, the out-of-order processor avoids the stall that the in-order processor incurs at step 2 when an instruction is not completely ready to be processed due to missing data.
Out-of-order processors fill the 7 free "slots" of such a gap in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order; in the processor they are handled in data order, the order in which the data (the operands) become available in the processor's registers.
Prior art pipelined out-of-order microprocessors use speculative execution to reduce the cost of conditional branch instructions.
When a conditional branch instruction is encountered, the processor guesses which way the branch is most likely to go (this is called branch prediction) and immediately starts executing instructions from that point. If the guess later proves to be incorrect, all computation past the branch point is discarded. The early execution is relatively cheap because the pipeline stages involved would otherwise lie dormant until the next instruction was known. However, wasted instructions consume CPU cycles that could otherwise have delivered performance, and on a laptop those cycles drain the battery. There is always a penalty for a mispredicted branch. This holds all the more true for the large-depth pipelines of floating point execution units (FPUs), where the penalty is quite high, because the computation of a result requires a relatively high amount of computational power. To partially get rid of these drawbacks, out-of-order loads are introduced.
Figure 2 illustrates such an out-of-order load instruction by way of the scheme of figure 1: the load instruction 12 writes a speculative result into the Floating Point Register (FPR) already in the second cycle. The subsequent instruction reads that data immediately from that register location and processes this data.
In cycle 9, the computed result of the instruction 12 is written to a recovery unit, see arrow 14.
With particular focus on the present invention: in order to increase the performance of execution units having a relatively large pipeline depth, such as a floating point execution unit, it is important to start any computation as soon as all the input operands are ready. For operands which are the result of a previous instruction, this is done in prior art through a forwarding path originating near the end of the pipeline.
However, if this previous instruction is a load instruction, then its result is known long before the end of the pipeline and it will unnecessarily slow down the floating point unit if it waits for the load to reach the end of the pipeline before its data are forwarded to the following instruction.
The standard solution to this problem is to implement a register pipeline and additional forwarding paths from all stages of the load instruction. The problem with this solution is that it takes up a huge number of registers, wiring resources and control logic.
Another attempt to solve the problem is to execute the load instruction out-of-order with respect to other arithmetic instructions. This means that the load instruction writes its data to the register file as early as possible, so upcoming instructions can directly load their input operand from the FPR instead of reading them from a forwarding path, thus greatly lessening the amount of wiring that is needed.
This solution works fine as long as no instructions are killed due to wrongly predicted branches or rejected due to cache misses.
However, if the load instruction is killed or rejected after it wrote its result to the register file, a wrong value exists in the FPR.
This needs to be fixed by calling some refresh mechanism from a recovery unit (RU) which restores the original content of the FPR. The RU keeps copies of all floating point registers.
These copies are updated in-order, so it is possible to reconstruct the correct value of a register.
Such a solution to this problem is sketched in figure 3 for a prior art IBM Power6 server, illustrating a killed out-of-order load instruction including an external, fictive refresh mechanism 30. This straightforward approach may implement a register pipeline and additional forwarding paths from all stages of the load instruction, requiring a huge number of registers, wiring resources and control logic. But it could also manage cache rejects and branches, which happen quite often and which, in the absence of such additional hardware, would be the cause for stalling the whole processor until any refresh has finished. Such stalling is a major performance drawback. So, either one tolerates the disadvantage of a huge amount of registers, wiring resources and control logic, or one tolerates the performance decrease imposed by a stalling processor. Until now, no way is known either to find a tolerable compromise between the two or to find a solution which avoids both disadvantages.
1.3. OBJECTIVES OF THE INVENTION
The objective of the present invention is thus to improve the performance of load instruction processing with a tolerably small amount of hardware.
2. SUMMARY AND ADVANTAGES OF THE INVENTION
This objective of the invention is achieved by the features stated in the enclosed independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims. Reference should now be made to the appended claims.
The inventive method to solve the above problem is based on the idea to execute the load instruction out-of-order with respect to other arithmetic instructions. This means that the load instruction writes its data to the Floating Point Register (FPR) as early as possible, so upcoming instructions can directly load their input operand from the FPR instead of reading them from a forwarding path requiring enormous wiring, control logic and registers. By avoiding this, the inventive method greatly reduces the amount of wiring that is needed.
The problem of using wrongly loaded values due to wrongly predicted branches could be fixed by calling some refresh mechanism from a recovery unit (RU) which restores the original content of the FPR. The RU keeps copies of all floating point registers. These copies are updated in-order, so it is always possible to reconstruct the correct value of a register. However, this would require quite a few additional cycles to complete.
With the present invention, the fix can be done within the floating point pipeline itself by an internal re-order mechanism, which is implemented not only by writing the FPR out-of-order but also by taking the original source value of the load instruction all the way down the pipeline and using this value to overwrite the value that was written earlier (out-of-order) to the FPR. This eliminates the need for an additional recovery process.
Thus, with respect to the wording used in the appended claims, a method and a respective system for operating an execution unit of an electronic computer system are disclosed, wherein the execution unit comprises a pipeline-based execution flow during which load instructions are processed amongst other instructions, having the function of loading data from a source location of a storage means into a predetermined target location within the pipeline, wherein the method is characterized by the steps of:
a) reading the current (original) value of the target location, and buffering this current target value at a predetermined location within the pipeline,
b) loading the value of the source location and storing the loaded value at the target location (early write into FPR),
c) executing the pipeline according to its execution flow, using the loaded value for computing purposes,
d) on occurrence of an event indicating that the loaded value is not correct (for example in case of a mispredicted branch, a data error, a cache miss, etc.), deciding to use the previously buffered original target value instead of the loaded value.
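Steps a) through d) can be condensed into a purely illustrative Python sketch of a single load. All identifiers (`execute_load`, `fpr`, `memory`) are assumptions introduced for illustration and do not appear in the claims:

```python
def execute_load(fpr, memory, target, source, killed):
    """Model one load instruction with the buffer-and-restore mechanism."""
    buffered = fpr[target]          # step a): read and buffer the old value
    fpr[target] = memory[source]    # step b): early write into the FPR
    # step c): dependent instructions may now read fpr[target] speculatively
    if killed:                      # step d): a kill/reject event arrives
        fpr[target] = buffered      # restore the original value in-pipeline
    return fpr
```

If the kill event never arrives, the early-written load value simply stays in place; if it does, the buffered original value overwrites it without involving any external recovery process.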
When the execution unit is a floating point execution unit, the resulting advantage is that the performance gain is remarkably high, as the pipeline depth is relatively large.
Further advantageously, step a), step b), or both of them can be implemented in a floating point execution unit by using a multiply-add data path, usually implemented in prior art for the calculation of floating point operands.
So, according to the inventive method, in case of a load instruction, not only is the source operand read which is supposed to be written to the floating point register, but also the old value acting as the target for the load instruction in the FPR. This is preferably done in a floating point unit having a multiply-add path, without adding additional data paths and without needing more wiring resources, just by using these multiply-add data paths.
If, in the event of a kill or a reject, the inventive method discovers that a wrong value has been written to the FPR, it controls the data paths such that the original FPR value will be taken all the way down the pipeline and then will be stored back into the FPR using the standard FPR write data paths. In this case, no check-pointing of the data is done in the recovery unit.
The treatment of write-after-write hazards is preferably done as follows: If an instruction writes to a storage location x of the FPR and it is followed by a load instruction which also writes to x, the write of the first instruction to the FPR has to be suppressed since it would be before the load instruction in an in-order unit and thus should not write the FPR after the early load has updated it.
However, if the load instruction has been killed or rejected, the previous instruction is switched back to write its result, since otherwise the data would be lost. Also, the check-pointing of the previous instruction must not be suppressed.
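The write-after-write rule of the two preceding paragraphs may be sketched as follows; this is an illustrative model of the final value at FPR location x, with all names being assumptions made for the example:

```python
def fpr_value_at_x(older_result, old_value, load_value, load_killed):
    """Final value of FPR location x after an older in-order write
    followed (in program order) by an early out-of-order load to x."""
    fpr_x = load_value            # early out-of-order write of the load
    suppress_older = True         # the load is younger in program order,
                                  # so the older write must be suppressed
    if load_killed:
        fpr_x = old_value         # inventive restore of the original value
        suppress_older = False    # re-enable the older instruction's write
    if not suppress_older:
        fpr_x = older_result      # the older instruction writes after all
    return fpr_x
```

When the load survives, its value stands and the older write stays suppressed; when the load is killed, the restore and the re-enabled older write together leave x exactly as an in-order machine would.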
3. BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and is not limited by the shape of the figures of the drawings, in which:
Figure 1 illustrates an in-order load instruction by schematically showing the scheme of storage locations of a shift register based, prior art execution pipeline in a "time-line" way, wherein the cycles (columns) are assumed to increase from left to right, and the pipeline has a depth of ten stages (rows);
Figure 2 illustrates an out-of-order load instruction by way of the scheme of figure 1;
Figure 3 illustrates a killed out-of-order load instruction by way of the scheme of figure 1, including an external, fictive refresh mechanism;
Figure 4 illustrates a killed out-of-order load instruction by way of the scheme of figure 1, including an internal, inventive refresh mechanism;
Figure 5 depicts a circuit diagram illustrating the inventive method when implemented in the multiply-add circuit of a prior art floating point execution unit; and
Figure 6 illustrates the control flow of the most important steps of a preferred embodiment of the inventive method.
4. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
With general reference to the figures and with special reference now to figure 4, a killed out-of-order load instruction is sketched, wherein the kill is handled by an inventive refresh mechanism internal to the pipeline logic, in contrast to the external mechanism 30 of figure 3. Data is read in the first cycle at step 40, and a new result is written to the FPR immediately in the next cycle, see arrow 42 as depicted in figure 4. But according to the inventive method, both the old data, i.e. the original data at the load target location, and the new data, i.e. the source data for the load target location, are read at step 40, see circle 41. In the cases of cache misses and wrongly predicted branches, illustrated by circle 43, the load instruction is killed simply by writing the old data back to the FPR, see circle 45, thus effecting a kind of "undoing" of the effects of reading the (wrong) new data.
With reference to figure 5, a preferred embodiment of the inventive method is described as applied to a prior art floating point execution unit. In figure 5, a circuit diagram is given including an embodiment of the inventive internal refresh mechanism when implemented in the multiply-add circuit 50 of a prior art floating point execution unit.
Such a prior art out-of-order floating point execution unit is in turn described in more detail in "Binary floating-point unit design: The fused multiply-add dataflow", chapter 1, section 3.
It processes binary floating point instructions by using multiply-add instructions.
With particular reference now to the details of figure 5 the multiply-add circuit 50 comprises three read ports 54, 56, 58.
Port 54 is connected to an alignment unit 53, and ports 56, 58 are connected to the multiplier unit.
According to the invention, a multiplexer 52 with select lines 51A, 51B (described further below) is provided at the outputs of the alignment unit 53 and the multiplier unit 55, connecting either the alignment result operand or the multiplier result operand to an adder unit, which is in turn not depicted in the drawing. A write connection is provided from the read ports 56 and 58 in order to enable writing to the floating point register, as indicated by reference sign 59.
Next, a preferred embodiment of the inventive method will be described wherein the step of reading the current value of the target location and buffering the current target value at some predetermined location within the pipeline, as well as the steps of loading the value of the source location and storing the loaded value at the target location are implemented by using the multiply-add data path of figure 5, used in prior art for calculation purposes of floating point operands.
Since there are already multiple read ports in the register file to load all three operands A, B and C of a multiply-add instruction (A * C) + B, the source operand which is supposed to be stored in the FPR at a location "x" is read as operand A, and the old target value of FPR location "x" as operand B. Operand A is then used to write its value at FPR location "x" out of order. Operand C is given as the constant "1".
At this point, it is still unknown whether the load instruction will later be killed or not. So it is yet unknown whether the inventive method needs the A operand at the end of the pipeline for check-pointing (if not killed), or the B operand for restoring the original value (if the load is killed).
The multiply-add mechanism "abused" according to this preferred inventive feature works by first multiplying operand A with operand C. A multiplication by "1" leaves operand A unchanged, which is desired.
In the meantime, the operand B, read by read port 54, is only shifted in the alignment circuit 53 to the correct position to be added to the product A*C later on. Since it is known, when a load instruction comes into the pipeline, that an addition is actually not desired, respective control signals are generated by the inventive method which force the operand B to remain unchanged, while operand A is likewise passed unchanged through the multiplier by the multiplication with the neutral "1".
Respective control lines are depicted with reference numbers 62 and 64.
With this implementation, operands A and B remain unchanged in the inventively used data paths during a few cycles (the depth of the multiplier). These cycles are enough to wait for the kill decision coming from outside the depicted unit via the select lines 51A, 51B: if the load instruction is not killed, the adder circuit is controlled to ignore the B operand, passing down operand A unchanged. In case of a kill, A is ignored instead, and B is passed down. If the load instruction is not killed, the RU is written with the new value of the load result, but the FPR is not written, since it was written earlier. If the load is killed, the RU is not written, but the FPR is written with operand B to recover the original value of the FPR before the load.
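The "abuse" of the multiply-add path described above may be summarised in a purely illustrative sketch: operand A travels through the multiplier unchanged (multiplied by the neutral constant 1), operand B is held in the alignment path, and the late kill decision selects which of the two leaves the pipeline and which unit is written. The function name and return convention are assumptions for the example:

```python
def fused_multiply_add_load(a_load_value, b_old_value, killed):
    """Model the load travelling the multiply-add data path.

    Returns (pipeline result, write-FPR?, write-RU?)."""
    C = 1.0                        # neutral multiplier: A * 1 == A
    product = a_load_value * C     # A passes the multiplier unchanged
    aligned_b = b_old_value        # B is held unchanged in the aligner
    if killed:
        # the adder ignores A; B restores the FPR, the RU is not written
        return aligned_b, True, False
    # the adder ignores B; A is check-pointed in the RU, the FPR was
    # already written early and is not written again
    return product, False, True
```

The select lines 51A, 51B of figure 5 correspond here to the `killed` flag arriving a few cycles after the operands entered the pipeline.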
In addition to the circuit and control described before, which are basically restricted to units comprising a multiply-add circuit, the basic steps of the control flow of the inventive method are depicted in figure 6 and described as follows. The control flow of figure 6 may also be implemented in a way different from that of figure 5, and may be used also in execution units other than FPUs. The first step 610 comprises reading both the source operand, which is supposed to be written to the FPR, and the old value of the target FPR.
The next step 620 is then to keep both values stored separately until the "cache reject" or "branch wrong" signals arrive.
Then a decision 630 is made: if the load instruction is not killed (left branch from decision 630), operand A is taken for use in the pipeline, and the data of the load instruction are passed to the RU to check-point the result of the load, step 650.
Otherwise, if the load instruction is killed, the next step is to take operand B and write the old (former) value back to the FPR; the RU is not written to, step 660.
Step 640 implements a mechanism to deal with write-after-write hazards: this basically means performing the step of suppressing an FPR write if an FPU instruction is followed by a load instruction overwriting the same FPR. A further step is then performed to re-enable the FPR write if this following load instruction is killed.
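The control flow of figure 6 (steps 610 through 660) can be sketched, for illustration only, as a small function recording which unit is written in each branch of decision 630; all identifiers are assumptions introduced for the example:

```python
def load_control_flow(source_value, old_fpr_value, killed):
    """Model steps 610-660: returns (value written to RU or None,
    value written back to FPR or None)."""
    a, b = source_value, old_fpr_value   # step 610: read both values
    # step 620: both values stay buffered until the "cache reject" or
    # "branch wrong" signal resolves the kill decision
    ru_write, fpr_write = None, None
    if not killed:                       # decision 630, left branch
        ru_write = a                     # step 650: check-point the load
    else:                                # decision 630, right branch
        fpr_write = b                    # step 660: restore the old value
    return ru_write, fpr_write
```

Exactly one of the two units is written in either branch, matching the mutually exclusive RU/FPR writes described for steps 650 and 660.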
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. The circuit as described above is part of the design for an integrated circuit chip. The chip design is created in a graphical computer programming language, and stored in a computer storage medium (such as a disk, tape, physical hard drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly. The stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication of photolithographic masks, which typically include multiple copies of the chip design in question that are to be formed on a wafer.
The photolithographic masks are utilized to define areas of the wafer (and/or the layers thereon) to be etched or otherwise processed.

Claims (4)

CLAIMS
1. A method for operating an execution unit of an electronic computer system, wherein the execution unit comprises a pipeline-based execution flow during which load instructions are processed amongst other instructions, having the function of loading data from a source location of a storage means into a predetermined target location within said pipeline, characterized by the steps of: in case of a load instruction occurring in the pipeline: a) reading (610) the current value of said target location, and buffering (620) said current target value at a predetermined location within said pipeline, b) loading (610) the value of the source location and storing (620) the loaded value at said target location, c) executing the pipeline according to its execution flow, using the loaded value for computing purposes, d) on occurrence (630) of an event indicating that the loaded value is not correct, deciding to use (660) said buffered value instead of the loaded value.
2. The method according to claim 1, wherein said execution unit is a floating point execution unit, and step a) is done by using a multiply-add data path implemented for calculation purposes of floating point operands.
3. The method according to claim 1, wherein said execution unit is a floating point execution unit, and step b) is done by using a multiply-add data path implemented for calculation purposes of floating point operands.
4. An electronic data processing system having an execution unit, implementing a pipeline-based execution flow during which load instructions are processed amongst other instructions, having the function of loading data from a source location of a storage means into a predetermined target location within said pipeline, characterized by a functional component (52, 56) performing the steps of: in case of a load instruction occurring in the pipeline: a) reading (610) the current value of said target location, and buffering (620) said current target value at a predetermined location within said pipeline, b) loading (610) the value of the source location and storing (620) the loaded value at said target location, c) executing the pipeline according to its execution flow, using the loaded value for computing purposes, d) on occurrence (630) of an event indicating that the loaded value is not correct, deciding to use (660) said buffered value instead of the loaded value.
GB0822115A 2008-01-15 2008-12-04 Out of order execution of floating point loads with integrated refresh mechanism Active GB2454816B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP08100459 2008-01-15

Publications (3)

Publication Number Publication Date
GB0822115D0 GB0822115D0 (en) 2009-01-07
GB2454816A true GB2454816A (en) 2009-05-20
GB2454816B GB2454816B (en) 2012-02-22

Family

ID=40262631

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0822115A Active GB2454816B (en) 2008-01-15 2008-12-04 Out of order execution of floating point loads with integrated refresh mechanism

Country Status (1)

Country Link
GB (1) GB2454816B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5951676A (en) * 1997-07-30 1999-09-14 Integrated Device Technology, Inc. Apparatus and method for direct loading of offset register during pointer load operation
US20030110366A1 (en) * 2001-12-12 2003-06-12 Intel Corporation Run-ahead program execution with value prediction

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5951676A (en) * 1997-07-30 1999-09-14 Integrated Device Technology, Inc. Apparatus and method for direct loading of offset register during pointer load operation
US20030110366A1 (en) * 2001-12-12 2003-06-12 Intel Corporation Run-ahead program execution with value prediction

Also Published As

Publication number Publication date
GB2454816B (en) 2012-02-22
GB0822115D0 (en) 2009-01-07

Similar Documents

Publication Publication Date Title
US6862677B1 (en) System and method for eliminating write back to register using dead field indicator
JP5313279B2 (en) Non-aligned memory access prediction
US7594096B2 (en) Load lookahead prefetch for microprocessors
US8688963B2 (en) Checkpoint allocation in a speculative processor
US5619664A (en) Processor with architecture for improved pipelining of arithmetic instructions by forwarding redundant intermediate data forms
MX2009001748A (en) Method and apparatus for executing processor instructions based on a dynamically alterable delay.
US20040139299A1 (en) Operand forwarding in a superscalar processor
EP1562107B1 (en) Apparatus and method for performing early correction of conditional branch instruction mispredictions
US5898864A (en) Method and system for executing a context-altering instruction without performing a context-synchronization operation within high-performance processors
US7185182B2 (en) Pipelined microprocessor, apparatus, and method for generating early instruction results
US8977837B2 (en) Apparatus and method for early issue and recovery for a conditional load instruction having multiple outcomes
US7779234B2 (en) System and method for implementing a hardware-supported thread assist under load lookahead mechanism for a microprocessor
US20210311742A1 (en) An apparatus and method for predicting source operand values and optimized processing of instructions
Torng et al. Interrupt handling for out-of-order execution processors
US6983359B2 (en) Processor and method for pre-fetching out-of-order instructions
US20070088935A1 (en) Method and apparatus for delaying a load miss flush until issuing the dependent instruction
US6851044B1 (en) System and method for eliminating write backs with buffer for exception processing
Shum et al. Design and microarchitecture of the IBM System z10 microprocessor
GB2454816A (en) Method for executing a load instruction in a pipeline processor, putting the data in the target address into a buffer then loading the requested data.
US8966230B2 (en) Dynamic selection of execution stage
US7783692B1 (en) Fast flag generation
US6718460B1 (en) Mechanism for error handling in a computer system
US7100024B2 (en) Pipelined microprocessor, apparatus, and method for generating early status flags
US7991816B2 (en) Inverting data on result bus to prepare for instruction in the next cycle for high frequency execution units
JP3795449B2 (en) Method for realizing processor by separating control flow code and microprocessor using the same

Legal Events

Date Code Title Description
746 Register noted 'licences of right' (sect. 46/1977)

Effective date: 20130107