WO2002039272A1

WO2002039272A1 - Method and apparatus for reducing branch latency

Info

Publication number: WO2002039272A1
Application number: PCT/US2001/049653
Authority: WO
Inventors: John L. Redford
Original assignee: Chipwrights Design, Inc.
Priority date: 2000-11-10
Filing date: 2001-11-09
Publication date: 2002-05-16
Also published as: TW559733B; AU2002227451A1; WO2002039272A9

Abstract

A method and apparatus for reducing latency in execution of branch instructions are provided. A branch instruction includes an opcode portion (122) and an address portion (128) that includes a displacement (124) and a code (126) that identifies a block in the instruction memory (22) in which the branch target instruction is located. During the fetch cycle in which the branch instruction is fetched, the displacement portion (124) of the branch instruction is reinserted into the address register (20) as the address of the next instruction to be fetched. The code (126) is used to ensure that the address register (20) is pointing to the correct block. As a result, during the next instruction fetch cycle, the target instruction is fetched for execution. Hence, the branch processing latency found in prior systems in which the next fetch cycle is skipped while the branch target address is computed, such as by adding an offset to the program counter (12) value, is eliminated.

Description

METHOD AND APPARATUS FOR REDUCING BRANCH LATENCY

Background of the Invention

Processing systems typically execute program instructions in multiple stages or cycles. During a fetch cycle, a memory address at which the instruction to be executed is stored is read from a program counter and written into an address register. The address in the address register is then used to access the instruction memory location, and the instruction is fetched from the instruction memory and loaded into an instruction register.

In general, the instruction is a multiple-bit word. It typically includes a multiple-bit opcode which identifies the instruction and memory address information. The memory address information can include, for example, a memory value which defines a location in memory at which an operand for the instruction is stored or a location at which the result of the instruction is to be stored when the instruction is complete.

After the instruction is fetched and loaded into the instruction register, a decode cycle is implemented. During the decode cycle, the instruction opcode is decoded to identify the instruction to be executed and to determine the processing steps required.

Depending on the instruction, the memory address information in the instruction can then be used to retrieve an operand, if one is required. When the decoding is complete, the instruction can be executed.

In the normal sequential flow of instruction execution, after an instruction is fetched, the program counter is incremented to point to the next location in the instruction memory. During the next fetch cycle, the next instruction in the program is then fetched for execution.

One very common type of instruction is a branch instruction. Branch instructions are used to alter the normal sequential flow of instruction execution. Branches are used, for example, to control instruction loops. When the last instruction in a loop is reached, if the condition that would terminate the loop is not satisfied, program flow must return to the top of the loop. In this case, a branch instruction is used to load the address of the top of the loop into the address register, in place of the address that would be obtained by incrementing the program counter. Branch instructions are also used to route program execution to separate procedure modules by loading address information pointing to the start of the procedure into the address register.

As is typical of most instructions, a branch instruction includes an opcode and address information. The address information typically takes the form of an offset value which defines the number of addresses that the program execution will jump in taking the branch. The offset value is typically a signed number which is added to the present instruction address during the decode cycle. Following this decode cycle, the sum is loaded into the address register such that the first instruction of the branch can be fetched during the next fetch cycle.
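By way of illustration only, the conventional target computation described above can be sketched as follows. The function names `sign_extend` and `conventional_target`, the 16-bit offset width and the 32-bit address width are assumptions chosen for this example; the description does not fix them for prior systems.

```python
ADDR_BITS = 32    # assumed address width
OFFSET_BITS = 16  # assumed width of the signed offset field


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def conventional_target(pc, offset_field):
    """Branch target = present instruction address + signed offset.

    This addition is the step that occupies part of the decode cycle.
    """
    return (pc + sign_extend(offset_field, OFFSET_BITS)) & ((1 << ADDR_BITS) - 1)


print(hex(conventional_target(0x0001_0010, 0xFFF0)))  # offset of -16
print(hex(conventional_target(0x0001_0010, 0x0008)))  # offset of +8
```

The point of the sketch is that the target depends on a full-width addition involving the program counter value, which is the source of the delay discussed in the following paragraphs.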

The efficiency of program execution can be enhanced by pipelining instruction execution. In pipelining, a first instruction is fetched during a first fetch cycle. Next, the decode cycle for the first instruction is performed simultaneously with the fetch cycle for the next instruction in the program. That is, while the first instruction is being decoded, the next instruction is being fetched. This approach can, in general, significantly increase program execution speed and efficiency.

In the case of branch instructions, the gains in efficiency realized by pipelining are compromised by the delay involved in computing the branch target address. During the decoding of the branch instruction, the offset value must be added to the present program counter address value. This addition requires a significant portion of the decode cycle.

Accordingly, the fetch cycle for the next instruction cannot be performed simultaneously, since the next instruction address has not yet been determined and loaded into the address register. Instead, the fetch cycle for the next instruction, i.e., the branch target instruction, cannot begin until after the decode cycle is complete. The condition described above is commonly referred to as branch latency. It would be desirable to eliminate this latency such that the address of the first instruction after the branch, i.e., the branch target instruction, can be loaded into the address register soon enough after the branch instruction is fetched such that the branch target instruction can be fetched during the next fetch cycle.

Summary of the Invention

The present invention is directed to an approach to eliminating this branch latency condition. In accordance with the invention, there is provided a method and apparatus for processing a branch instruction. The branch instruction includes an opcode portion containing an opcode and an address portion containing address information related to an address of a branch target instruction to which execution of a program is to branch in response to the branch instruction. The address portion of the branch instruction includes a memory block identifying portion and a displacement portion, the memory block identifying portion identifying a block in the memory to which the execution is to branch in response to the branch instruction. Execution branches to the branch target instruction using the displacement portion of the address portion of the branch instruction as an address within the memory block identified by the memory block identifying portion of the address portion of the branch instruction.

The memory block identifying portion of the address portion of the branch instruction identifies the block in memory to which execution is to branch in response to the branch instruction, in the event that the branch is taken. In one embodiment, the block can be one of three possible blocks. One of the blocks can be the block that contains the branch instruction, in which case the branch target instruction is in the same block as the branch instruction. The other blocks can be the immediately preceding block or the immediately following block.

In one embodiment, the block identifier includes at least two bits capable of defining at least four codes used to identify blocks. One of the codes identifies the same block as the branch instruction. A second of the codes identifies the immediately following block, and a third of the codes identifies the immediately preceding block. A fourth code can be used to identify a branch within the same block in a particular direction.

The first code can then be used to identify a branch within the same block in the opposite direction. Therefore, in one example of this configuration, the first and second codes, e.g., 00 and 01, can be used to identify a forward branch, the former within the block and the latter into the next block. The third and fourth codes, e.g., 10 and 11, can be used to identify a backward branch, the former into the preceding block and the latter within the present block.

In one embodiment, the block identifying portion can be used in branch prediction, i.e., to predict whether the branch will be taken. In one embodiment, if the branch is backward, then it is predicted that the branch will be taken. If the branch is forward, then it is predicted that the branch will not be taken. Hence, in the illustration set forth above, if the first bit of the block identifying code is a 1, then a backward branch is called for, and it is predicted that the branch will be taken. On the other hand, if the first bit is a 0, then a forward branch is called for, and it is predicted that the branch will not be taken.

The approach of the invention substantially reduces or eliminates aspects of branch latency found in the prior art. The branch target address is generated directly from the displacement address information in the branch instruction without performing any time-consuming arithmetic operations such as adding an address offset to the program counter value. The branch target address is applied directly back to the address register as a hardware function in an effectively immediate fashion as part of the fetch cycle for the branch instruction. As a result, the next cycle can be used to fetch the branch target instruction, resulting in no loss of cycles. This is in contrast to prior approaches in which the next fetch cycle had to be skipped because of the delay involved in computing the branch target address during the decode cycle. The invention therefore provides significantly improved efficiency over the prior art in the processing of branch instructions.

Brief Description of the Drawings

The foregoing and other objects, features and advantages of the invention will be apparent from the more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration.

FIG. 2 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration in which a branch instruction is processed.

FIG. 3 contains a schematic diagram which illustrates the format of a conventional branch instruction.

FIG. 4 is a schematic functional block diagram which illustrates execution of instructions in a conventional configuration.

FIG. 5 is a schematic functional block diagram which illustrates execution of instructions in a configuration according to the invention which solves the branch latency problem in the configuration of FIG. 4.

FIG. 6 is a schematic block diagram illustrating the addresses and locations of a portion of an instruction memory.

FIG. 7 is a schematic diagram illustrating a format for a branch instruction word in accordance with the invention.

FIG. 8 is a schematic functional block diagram which illustrates execution of instructions in a configuration according to the invention in which the branch instruction identifies the block of instruction memory to which the execution is to branch.

FIG. 9 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during execution of program instructions in accordance with the method and apparatus for reducing branch latency of the present invention.

Detailed Description of Preferred Embodiments of the Invention

FIG. 1 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration. As shown in FIG. 1, during a first cycle, a first instruction is fetched from the instruction memory location addressed by the value PC stored in the program counter. During the next cycle, the instruction PC is decoded while the instruction identified by the incremented program counter value PC + 1 is fetched. During the third cycle, the instruction PC + 1 is decoded while the next instruction, identified by the program counter value PC + 2, is fetched. This process continues as instructions are fetched and decoded in the sequence controlled by incrementing the program counter during each cycle. Because of the pipelining configuration, instructions are processed efficiently, with one instruction being decoded and the next instruction being fetched simultaneously. In general, no instruction cycles are wasted.

FIG. 2 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration in which a branch instruction is processed. As shown in the timing chart, a first instruction identified by the program counter value PC is fetched during the first cycle. In this case, the instruction is a branch instruction. The format of a branch instruction is illustrated schematically in FIG. 3. As shown in FIG. 3, the branch instruction includes an opcode portion and a memory displacement portion. The opcode defines the type of branch instruction, e.g., the conditions under which the branch is to be taken. The displacement portion defines the address to which program flow will proceed if the branch is taken. Typically, the displacement is a signed number which is added to the address of the branch instruction, i.e., the present value PC of the program counter. Accordingly, during the next cycle, the address that is to be loaded into the program counter to execute the branch is computed by adding the present address to the displacement, i.e., PC + Disp. Because the addition takes considerable time to complete, the address of the instruction at the beginning of the branch, i.e., the branch target instruction address, is not loaded into the program counter until late in the second cycle. As a result, the branch target instruction, referred to by the address value PC + Disp, is not fetched until the third cycle. Thus, a cycle is lost while the branch target instruction address is computed. This condition is commonly referred to as branch latency.
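The timing relationships of FIG. 1 and FIG. 2 can be tabulated with a short sketch such as the one below. It is a simplified two-stage model written for illustration only: the function names `fetch_cycles` and `chart`, the `stall` parameter, and the assumption that each instruction is decoded in the cycle after its fetch are not taken from the drawings.

```python
def fetch_cycles(labels, branches, stall):
    """Return {label: fetch cycle}.

    After each label listed in `branches`, the next fetch is delayed by
    `stall` extra cycles, modelling the branch latency.
    """
    cycles = {}
    cycle = 1
    for label in labels:
        cycles[label] = cycle
        cycle += 1 + (stall if label in branches else 0)
    return cycles


def chart(labels, branches=(), stall=0):
    """Return rows of (cycle, instruction fetched, instruction decoded)."""
    fetched = {c: label for label, c in fetch_cycles(labels, branches, stall).items()}
    last = max(fetched) + 1  # one more cycle to decode the final instruction
    return [(c, fetched.get(c, "-"), fetched.get(c - 1, "-")) for c in range(1, last + 1)]


# Sequential flow as in FIG. 1: a fetch in every cycle.
for row in chart(["PC", "PC+1", "PC+2"]):
    print(row)

# Conventional branch as in FIG. 2: the target is not fetched until cycle 3.
for row in chart(["PC", "PC+Disp"], branches={"PC"}, stall=1):
    print(row)
```

Calling `chart` with `stall=0` for the branch case models a branch whose target is fetched in the cycle immediately following the branch fetch.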

FIG. 4 is a schematic functional block diagram which illustrates execution of instructions in a conventional configuration. As shown in FIG. 4, a processing system 10 for executing program instructions includes a program counter 12 which generates the addresses of the instructions to be executed in sequence. In this description, it will be assumed that instruction addresses are 32 bits long. It will be understood that other address sizes can be used and that the invention is applicable to other address sizes. The address is read from the program counter 12 and is routed to an incrementing module 14 and a summing module 26. The summer 26 adds the program counter value to a displacement routed from the instruction register 24 and applies the result to one input of a multiplexer (MUX) 16. The address is also incremented by the incrementing module 14, and the incremented result is applied to another input of the MUX 16. One of the addresses applied to the MUX 16 is selected via the MUX select input S by a branch prediction module 18. If the branch selector 18 determines that a branch is to be taken, then the MUX input from the summer 26 is selected such that a branch target instruction address, generated in the summer 26 by adding the displacement to the present program counter value, is loaded into the address register 20. Otherwise, the incremented address is selected such that it is loaded into the address register 20.

The address loaded into the address register 20 is applied to the instruction memory 22 to access the next instruction to be executed. That instruction is read from the memory 22 into an instruction register 24. The instruction can then be passed on for decoding and further processing. The displacement portion of the instruction, if any, can be routed as shown back to the summer 26 to compute an address to which program flow will jump, such as, for example, when the fetched instruction is a branch instruction. As described above, this approach introduces branch latency because of the time involved in performing the sum operation 26.

FIG. 5 is a schematic functional block diagram which illustrates a solution to the branch latency problem described above. In FIG. 5, instead of adding the displacement portion of a branch instruction to the program counter value, the displacement is extracted directly from the instruction in the instruction memory 22 and is inserted directly into the address at the input to the MUX 16 as a replacement for its least significant bits (LSBs), in this particular illustration the sixteen LSBs, labeled 15:0. This is done as soon as the branch instruction is fetched from its memory location in the instruction memory 22, before the fetch cycle for the branch instruction terminates and the next fetch cycle begins. As a result, during the next succeeding cycle, the next instruction, which is the branch target instruction, can be fetched because its address is already present in the address register 20 before the succeeding fetch cycle begins. Hence, the branch instruction and the branch target instruction can be fetched in successive fetch cycles, with no loss of cycles. The branch latency described above is eliminated.
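The bit replacement of FIG. 5 can be modelled in software as a mask-and-merge, with no addition involved. The sketch below is illustrative only; the name `fig5_target`, the placement of the displacement in the low sixteen bits of the instruction word, and the example opcode bits are assumptions.

```python
DISP_BITS = 16                       # the sixteen LSBs, 15:0
DISP_MASK = (1 << DISP_BITS) - 1
ADDR_MASK = 0xFFFF_FFFF              # 32-bit addresses, as assumed in the description


def fig5_target(pc, instruction):
    """Keep bits 31:16 of `pc` and replace bits 15:0 with the displacement.

    `instruction` is the fetched branch instruction word; its low sixteen
    bits are assumed to hold the displacement.
    """
    upper = pc & ADDR_MASK & ~DISP_MASK
    return upper | (instruction & DISP_MASK)


# Branch fetched from 0x0002FFF0 with displacement 0x0040 (opcode bits are arbitrary).
print(hex(fig5_target(0x0002_FFF0, 0xABCD_0040)))
```

In hardware this corresponds to wiring rather than computation, which is why the target address can be available before the fetch cycle of the branch instruction ends.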

One particular drawback to this approach is illustrated in FIG. 6, which is a schematic block diagram illustrating the addresses and locations of a portion of a typical instruction memory 22. The memory 22 can be defined to be made up of multiple blocks 102, 104, 106. As shown in this particular illustrative example, each block has a group of locations with addresses ranging from 0000₁₆ to FFFF₁₆. Hence, the sixteen LSBs of each memory address define a location within a particular block consisting of 2¹⁶ locations.

During execution of a program, at any given time, the program counter is accessing an instruction stored at one of the locations in one of the blocks, block 104, for example.

When a branch instruction is encountered, in accordance with the approach described above, the 16-bit displacement portion of the instruction is placed in the next address in its 16 LSB positions. Execution then continues from one of the 2¹⁶ locations within block 104. A drawback to this situation is derived from the fact that the location to which the branch is made must be within the same block. Because of this, the size of a possible jump may be severely limited, depending on the current value in the program counter, i.e., the location from which the branch is taken. For example, if the program is currently executing near the end of a block when it encounters a branch instruction, a forward branch can only be made a small distance, i.e., a small number of locations. Likewise, if the program is currently executing near the beginning of a block, backward branches are extremely limited in possible distance. This situation places a constraint on the programming flexibility of the system.
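The constraint can be made concrete with a small worked example. The sketch below assumes blocks of 2¹⁶ locations as in FIG. 6; the function name `same_block_reach` and the example addresses are hypothetical.

```python
BLOCK_SIZE = 1 << 16  # 2**16 locations per block, as in FIG. 6


def same_block_reach(pc):
    """Return (max backward distance, max forward distance) within pc's block."""
    offset = pc % BLOCK_SIZE
    return offset, BLOCK_SIZE - 1 - offset


print(same_block_reach(0x0002_FFF0))  # near the end of a block
print(same_block_reach(0x0002_0004))  # near the beginning of a block
```

For the first address only 15 locations lie ahead in the block, and for the second only 4 locations lie behind, which illustrates how the reachable distance depends on where the branch instruction happens to sit.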

To solve this problem, in one embodiment, the invention uses a portion of the displacement portion of a branch instruction to identify a block in instruction memory to which the branch should be made. FIG. 7 is a schematic diagram illustrating a format for a branch instruction word in accordance with the invention. The example of FIG. 7 uses a 32-bit instruction with a 16-bit displacement field. It will be understood that the invention is applicable to other sizes. Referring to FIG. 7, the instruction format 120 includes an opcode field 122 including bits 16 to 31 and an address or displacement field 128 including bits 0 to 15. The address field 128 is further divided into an address value field 124 including bits 0 to 13 and a block field 126 including bits 14 and 15. The two-bit block field defines whether the branch should take place to the same block as the present block (referred to as PC), to the immediately preceding block (referred to as PC - 1) or to the immediately succeeding block (referred to as PC + 1). The address value field 124 defines the address within the identified block from which the branch target instruction should be fetched.
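The field layout of FIG. 7 can be expressed as shifts and masks. The following is a minimal sketch assuming the 32-bit example format given above; the names `BranchFields` and `decode_branch` and the sample instruction word are hypothetical.

```python
from typing import NamedTuple


class BranchFields(NamedTuple):
    opcode: int        # opcode field 122, bits 31:16
    block: int         # block field 126, bits 15:14
    displacement: int  # address value field 124, bits 13:0


def decode_branch(word):
    """Split a 32-bit branch instruction word into its FIG. 7 fields."""
    return BranchFields(
        opcode=(word >> 16) & 0xFFFF,
        block=(word >> 14) & 0x3,
        displacement=word & 0x3FFF,
    )


fields = decode_branch(0x1234_8ABC)  # arbitrary example word
print(hex(fields.opcode), bin(fields.block), hex(fields.displacement))
```

Here the low sixteen bits 0x8ABC split into a block field of binary 10 and a 14-bit address value of 0x0ABC.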

Hence, in one embodiment, in this illustration, the block field 126 includes at least two bits capable of defining at least four codes used to identify blocks. One of the codes identifies the same block as the branch instruction. A second of the codes identifies the immediately following block, and a third of the codes identifies the immediately preceding block. A fourth code can be used to identify a branch within the same block in a particular direction. The first code can then be used to identify a branch within the same block in the opposite direction. Therefore, in one example of this configuration, the first and second codes, e.g., 00 and 01, can be used to identify a forward branch, the former within the block and the latter into the next block. The third and fourth codes, e.g., 10 and 11, can be used to identify a backward branch, the former into the preceding block and the latter within the present block. Hence, a 0 bit in position 15 can indicate a forward branch, and a 1 in position 15 can indicate a backward branch.

FIG. 8 is a schematic functional block diagram which illustrates execution of instructions in a configuration according to the invention in which the branch instruction identifies the block of instruction memory to which the execution is to branch. In this configuration, the 14 LSBs 13:0 are routed from the instruction memory 22 to three inputs of a four-input MUX 216. The remaining 18 bits 31:14 are taken from the program counter 12 and are combined with the 14 LSBs from the instruction memory 22. The 18 bits 31:14 are routed through an incrementing module 220, a decrementing module 222 and a direct path 223, and the resulting routed bits are combined with the 14 LSBs from the instruction memory 22 at the inputs to the MUX 216. The incrementing module 220 is used to generate an address that is used when the branch is to the next block in memory; the decrementing module 222 is used to generate an address that is used when the branch is to the immediately preceding block in memory; and the direct path 223 is used to generate an address when the branch is within the present block. The fourth input to the MUX 216 receives bits 31:0 directly from the program counter and is used where normal sequential program execution is being used.

The branch prediction module 18 is used to select which address is to be loaded into the address register 20. If no branch is to be taken, then bits 31:0 from the incrementing module 14 are selected. If a branch is to be taken into the next block, then the address that includes bits 31:14 from the incrementing module 220 is selected. If a branch is to be taken into the previous block, then the address that includes bits 31:14 from the decrementing module 222 is selected. If a branch is to be taken within the present block, then the address that includes bits 31:14 from the direct path 223 is selected.
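The address generation of FIG. 8 can be modelled as follows. This sketch is illustrative only: it uses the example code assignment 00, 01, 10, 11 given above, treats bits 31:14 as an 18-bit block number, and picks one candidate in software, whereas the description has the three candidates formed in parallel and selected by the MUX 216. The names `BLOCK_DELTA` and `fig8_target` and the sample values are hypothetical.

```python
DISP_BITS = 14
DISP_MASK = (1 << DISP_BITS) - 1
UPPER_MASK = (1 << 18) - 1  # bits 31:14 of a 32-bit address

# Example code assignment from the description (one possible configuration):
# 00 forward within block, 01 next block, 10 preceding block, 11 backward within block.
BLOCK_DELTA = {0b00: 0, 0b01: +1, 0b10: -1, 0b11: 0}


def fig8_target(pc, instruction):
    """Combine adjusted bits 31:14 of `pc` with bits 13:0 of the instruction."""
    block_code = (instruction >> DISP_BITS) & 0x3
    displacement = instruction & DISP_MASK
    # Direct path, incrementing module or decrementing module on the upper bits.
    upper = ((pc >> DISP_BITS) + BLOCK_DELTA[block_code]) & UPPER_MASK
    return (upper << DISP_BITS) | displacement


pc = 0x0000_BFF8  # a location in block 2 (blocks of 2**14 locations)
print(hex(fig8_target(pc, 0x1234_4010)))  # code 01: into the next block
print(hex(fig8_target(pc, 0x1234_BFF0)))  # code 10: into the preceding block
print(hex(fig8_target(pc, 0x1234_C100)))  # code 11: backward within the block
```

The increment or decrement acts only on the upper bits of the program counter value and does not involve the displacement, so no displacement-dependent addition is needed to form the target.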

The block identifying code added to the branch instruction in accordance with the invention can be used as an aid in branch prediction. As indicated above, a 0 bit in position 15 can indicate a forward branch, and a 1 in position 15 can indicate a backward branch. One common example approach to branch prediction is to take backward branches and not to take forward branches. Therefore, in accordance with the invention, if a 0 is in position 15, then it is predicted that the branch will not be taken, and if a 1 is in position 15, then it is predicted that the branch will be taken.

Hence, in accordance with the embodiment of the invention shown in FIG. 8, the possible range of addresses to which a branch can be taken is increased over that of the embodiment of FIG. 5. In the approach of FIG. 8, the displacement portion of the address value can only specify 2¹⁴ possible addresses within a block, as opposed to the 2¹⁶ possible addresses of the approach of FIG. 5. However, those 2¹⁴ addresses can be specified in each of three possible memory blocks, in this example. Therefore, a large increase in possible branch distance and resulting programming flexibility are realized.
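Both points can be illustrated with a short sketch: the static prediction read from bit 15, and the branch distance available near a block edge under the two approaches. The names `predict_taken` and `reach` are hypothetical, the example code assignment from the description is assumed, and the ends of the instruction memory are ignored for simplicity.

```python
def predict_taken(instruction):
    """Static prediction from bit 15: 1 = backward branch = predicted taken."""
    return bool((instruction >> 15) & 1)


def reach(pc, disp_bits, neighbours):
    """Return (max backward distance, max forward distance) from `pc`.

    neighbours=False: targets limited to pc's own block (FIG. 5 style).
    neighbours=True: preceding and following blocks also reachable (FIG. 8 style).
    """
    size = 1 << disp_bits
    offset = pc % size
    extra = size if neighbours else 0
    return offset + extra, (size - 1 - offset) + extra


print(predict_taken(0x1234_C100))  # block code 11, backward
print(predict_taken(0x1234_4010))  # block code 01, forward

pc = 0x0002_FFFF  # last location of a block under either block size
print(reach(pc, 16, neighbours=False))  # FIG. 5 style: no forward reach at all
print(reach(pc, 14, neighbours=True))   # FIG. 8 style: a full block forward
```

In this model the benefit shows up in the worst case: a branch at the very end of a block has no forward reach under the FIG. 5 approach, whereas under the FIG. 8 approach at least 2¹⁴ locations remain reachable in either direction.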

FIG. 9 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during execution of program instructions in accordance with the improvements of the present invention. As shown in FIG. 9, in accordance with the invention, a branch instruction and its associated branch target instruction can be fetched in successive cycles. The branch latency found in other conventional approaches is eliminated.

It is noted that the invention can be implemented using an approach different than that described above. For example, the invention can be implemented immediately before execution commences as the instruction cache memory is loaded with instructions for execution, rather than altering the instructions themselves individually as the program is compiled or linked. In this latter approach, the block identifying field, which is the two-bit field of the exemplary embodiment described above, is added to the appropriate instruction cache memory locations as the instructions are loaded.
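One way such a field might be derived at load time is sketched below. This is a hypothetical illustration, not a procedure given in the description: it assumes the loader knows the absolute addresses of the branch instruction and of its target, uses the 14-bit displacement and the example code assignment 00, 01, 10, 11, and the name `encode_address_field` is invented for the example.

```python
DISP_BITS = 14
DISP_MASK = (1 << DISP_BITS) - 1


def encode_address_field(branch_addr, target_addr):
    """Return a 16-bit address field: block code in bits 15:14, displacement in 13:0.

    Raises ValueError when the target lies outside the present, preceding
    and following blocks.
    """
    delta = (target_addr >> DISP_BITS) - (branch_addr >> DISP_BITS)
    if delta == 0:
        # Same block: 00 for a forward branch, 11 for a backward branch.
        code = 0b00 if target_addr > branch_addr else 0b11
    elif delta == 1:
        code = 0b01  # immediately following block
    elif delta == -1:
        code = 0b10  # immediately preceding block
    else:
        raise ValueError("target is outside the three reachable blocks")
    return (code << DISP_BITS) | (target_addr & DISP_MASK)


print(hex(encode_address_field(0xBFF8, 0xC010)))  # into the next block
print(hex(encode_address_field(0xBFF8, 0x8100)))  # backward within the block
```

A target outside the three reachable blocks cannot be encoded in this format, so the sketch raises an error rather than guessing how such a branch would be handled.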

The embodiments described herein refer, for example, to 32-bit instructions, 16-bit address values with branch instructions, and a two-bit block identifying value. It will be understood that these numbers of bits may be different without departing from the scope of the invention.

While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

What is claimed is:

Claims

1. A method of processing a branch instruction, the branch instruction being one of a plurality of instructions in a program stored in a block of an instruction memory, the method comprising: providing the branch instruction with an opcode portion containing an opcode and an address portion containing address information related to an address of a branch target instruction to which execution of a program is to branch in response to the branch instruction; providing the address portion of the branch instruction with a memory block identifying portion and a displacement portion, the memory block identifying portion identifying a block in the memory to which the execution is to branch in response to the branch instruction; and branching to the branch target instruction using the displacement portion of the address portion of the branch instruction as an address within the memory block identified by the memory block identifying portion of the address portion of the branch instruction.

2. The method of claim 1 wherein branching to the branch target instruction comprises generating a branch target address using the displacement portion of the address portion of the branch instruction and the memory block identifying portion of the address portion of the branch instruction.

3. The method of claim 2 wherein the branch target address is generated during a fetch cycle of the branch instruction.

4. The method of claim 1 wherein the memory block identifying portion of the address portion of the branch instruction identifies the block to which execution is to branch as being one of the memory block preceding the memory block of the branch instruction, the memory block following the memory block of the branch instruction and the memory block of the branch instruction.

5. The method of claim 1 further comprising predicting whether the branch is to be taken using the memory block identifying portion of the address portion of the branch instruction.

6. The method of claim 1 wherein the memory block identifying portion of the address portion of the branch instruction defines at least four codes used to predict whether the branch will be taken.

7. The method of claim 6 wherein the four codes include a first pair of codes for forward branches and a second pair of codes for backward branches.

8. The method of claim 7 wherein the first pair of codes includes a code for a forward branch to a next memory block and a code for a forward branch within the memory block of the branch instruction.

9. The method of claim 7 wherein the second pair of codes includes a code for a backward branch to a preceding memory block and a code for a backward branch within the memory block of the branch instruction.

10. An apparatus for processing a branch instruction, the branch instruction being one of a plurality of instructions in a program, the apparatus comprising: an instruction memory for storing instructions, the branch instruction being stored in one of a plurality of blocks of the instruction memory, the branch instruction having an opcode portion containing an opcode and an address portion containing address information related to an address of a branch target instruction to which execution of a program is to branch in response to the branch instruction, and the address portion of the branch instruction having a memory block identifying portion and a displacement portion, the memory block identifying portion identifying a block in the memory to which the execution is to branch in response to the branch instruction; and a processor for executing instructions stored in the instruction memory, the processor causing execution of the instructions to branch to the branch target instruction using the displacement portion of the address portion of the branch instruction as an address within the memory block identified by the memory block identifying portion of the address portion of the branch instruction.

11. The apparatus of claim 10 wherein the processor generates a branch target address using the displacement portion of the address portion of the branch instruction and the memory block identifying portion of the address portion of the branch instruction.

12. The apparatus of claim 11 wherein the branch target address is generated during a fetch cycle of the branch instruction.

13. The apparatus of claim 10 wherein the memory block identifying portion of the address portion of the branch instruction identifies the block to which execution is to branch as being one of the memory block preceding the memory block of the branch instruction, the memory block following the memory block of the branch instruction and the memory block of the branch instruction.

14. The apparatus of claim 10 wherein the processor predicts whether the branch is to be taken using the memory block identifying portion of the address portion of the branch instruction.

15. The apparatus of claim 10 wherein the memory block identifying portion of the address portion of the branch instruction defines at least four codes used to predict whether the branch will be taken.

16. The apparatus of claim 15 wherein the four codes include a first pair of codes for forward branches and a second pair of codes for backward branches.

17. The apparatus of claim 16 wherein the first pair of codes includes a code for a forward branch to a next memory block and a code for a forward branch within the memory block of the branch instruction.

18. The apparatus of claim 16 wherein the second pair of codes includes a code for a backward branch to a preceding memory block and a code for a backward branch within the memory block of the branch instruction.