EP0186668A1

EP0186668A1 - Three word instruction pipeline

Info

Publication number: EP0186668A1
Application number: EP19850902321
Authority: EP
Inventors: Douglas B. Macgregor; Robert R. Thompson; David S. Mothersole; Mark W. Bluhm
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 1984-06-27
Filing date: 1985-04-22
Publication date: 1986-07-09
Also published as: WO1986000435A1

Abstract

Pipelined data processing systems with deep pipelines perform poorly because they consume excessive access time during normal operation, and the problem is compounded when a branch operation requires the pipeline to be refilled. To reduce these drawbacks, a data processor that uses a three-word instruction pipeline composed of registers (32, 34/36 and 38/40) provides early decoding, by decoders (18, 20, 22, 24 and 26), of instructions from a memory (12 and 14). Through the operation of its execution unit (10), the data processor requires only two bus access cycles to fill the pipeline. Two of the three words required to fill the pipeline are fetched during a single bus access cycle, and all the registers that make up the pipeline are reset simultaneously when a branch operation requires the pipeline to be refilled.

Description

THREE WORD INSTRUCTION PIPELINE

BACKGROUND ART

Generally, a data processor performs a series of operations upon digital information in an execution unit according to a stored program of instructions. These instructions are often termed "macroinstructions" in order to avoid confusion with the microinstructions contained in the control store of the data processor. Each macroinstruction indicates to the data processor a particular operation to be performed. In addition, most macroinstructions specify the address of one or more operands upon which the operation will be performed. There are several ways in which these operands may be specified. In some cases, the operand is already contained in a register within the data processor execution unit. In other cases, however, the operand is stored in memory external to the data processor. Occasionally, the operand is located in the memory location immediately following the one from which the current macroinstruction was obtained (so-called immediate addressing). In other cases, the operand is stored at a location in memory which is referenced by one of the data processor registers (so-called effective addressing).

Thus, in order to execute a macroinstruction, the data processor must typically perform a series of microinstructions for computing the address and acquiring each of the operands and perform another series of microinstructions for performing the operation specified by the macroinstruction upon the acquired operands. Each macroinstruction is accessed from memory and decoded to derive the control signals which select the microinstruction sequence or sequences to perform the operation required by the macroinstruction. Some of the microinstruction sequences are directed to calculation of effective addresses of operands if the operands are not immediately accessible in a register.

In a complex data processor, many functions must be specified by the macroinstruction, and as a result the macroinstruction may exceed the width of the instruction decode mechanism. In such cases a single macroinstruction may be several words long. To expedite processing of such macroinstructions, a pipelined architecture may be employed in which one macroinstruction (or a segment of a long macroinstruction) may be executing while another macroinstruction (or a second segment of the macroinstruction) is being decoded. Such a pipelined architecture is disclosed in U.S. Patent No. 4,342,078, Tredennick et al.

In the Tredennick patent a 16-bit instruction path was provided and a series of registers IR, IRC and IRD were employed to store, decode, and execute successive macroinstructions (or segments of the same macroinstruction).

The Tredennick patent, moreover, discussed the use of microaddress calls at various levels to perform register-to-register instructions or to perform the necessary calculations (through microinstruction routines) to derive effective addresses from certain macroinstruction fields. The processor therein had a 16-bit bus as well as a 16-bit data path, so each instruction fetch produced one 16-bit macroinstruction or segment thereof. Where a pipelined instruction path is used, early decoding can speed instruction execution, but changes in the instruction flow, for example on a branch instruction, can be costly because the pipe must be refilled. In the device noted above, refilling the pipe required two successive bus accesses. If a three-deep pipe were used, three bus accesses would be required before execution could resume. This penalty is severe because approximately twenty-five percent of executed instructions change the instruction flow. The still earlier decoding opportunity offered by a three-deep pipe would thus be overshadowed by the penalties of instruction flow changes.
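To put a rough number on the refill penalty, the sketch below multiplies the approximately twenty-five percent flow-change rate quoted above by the bus accesses needed per refill. It assumes, as a simplification not stated in the text, that every flow change forces a complete refill and that a 16-bit bus delivers one pipe word per access; the function name is hypothetical.

```python
def expected_refill_accesses(refill_accesses: int, flow_change_rate: float = 0.25) -> float:
    """Average bus accesses per executed instruction spent refilling the pipe.

    Simplifying assumption: every change in instruction flow forces a full
    refill costing `refill_accesses` bus accesses.
    """
    return flow_change_rate * refill_accesses


# 16-bit bus: one instruction word per access, so an n-deep pipe needs n accesses.
two_deep_16bit = expected_refill_accesses(2)    # two-deep pipe (Tredennick-style)
three_deep_16bit = expected_refill_accesses(3)  # naive three-deep pipe
# Three-deep pipe filled in two 32-bit accesses, as the invention proposes.
three_deep_32bit = expected_refill_accesses(2)

print(two_deep_16bit, three_deep_16bit, three_deep_32bit)
```

Under these assumptions the naive three-deep pipe spends 50% more bus accesses on refills than the two-deep one (0.75 versus 0.5 per instruction), while filling three words in two accesses brings the cost back to 0.5.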

BRIEF SUMMARY OF THE INVENTION

It is an object of the present invention to provide a data processor with a deep pipeline to accelerate decoding of instructions.

It is a further object of the present invention to avoid the time penalties associated with instruction flow changes in a pipelined data processor.

It is a still further object of the present invention to provide an improved data processor with pipelined instruction flow.

These and other objects and advantages of the present invention are accomplished by providing a data processor with a pipe n words deep, wherein the n-word pipe may be filled in n-1 bus accesses. This is accomplished by providing a bus 2n words wide coupled between the memory and the pipe for carrying instructions to the execution unit, and a means for recognizing whether the first word to be fetched is on an even or an odd boundary in memory.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows a block diagram of the execution unit of a data processor according to the invention.

Figure 2 shows a more detailed block diagram of a portion of the execution unit of Figure 1.

Figure 3 shows a block diagram of the essential elements of a three-level instruction pipeline with control elements in accordance with the instant invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

Figure 1 shows a block diagram of the execution unit of a data processor according to the invention, with its associated control mechanisms. Specifically, the execution unit is shown at 10, and has a P section, an A section, and a D section. An external address bus and data bus are shown connected to the execution unit, as are a plurality of control lines emanating from a nano store 12. The nano store and a micro store 14 are both addressed by the sequence controller 16. The sequence controller is provided information from a plurality of decoders, 18, 20, 22, 24 and 26 and from a branch control 28.

The instructions to be executed by the execution unit are stored in data ram 30, which is an internal instruction cache, or in external memory, as will be discussed. The tag ram 31 provides addresses to the cache as a function of the real address generated by the execution unit. The instructions are fetched from data ram 30 to a register IRB 32, which is the first register of the instruction pipeline. From IRB the instructions flow to registers IRC and IRC2, 34 and 36, respectively. From IRC, the instructions flow to IR 38 and IRD 40. While in IRC, IRC2, and IRD, the instructions are decoded and executed. The decoding takes place in the decoders 18, 20, 22, 24, and 26.

Instructions specify operand locations in one of three ways. The first way is by register specification. If the operand location is a register, the number of the register is given in a register field of the instruction. The second way is by effective address specification, which can designate an operand location by using one of the Effective Address modes. The third way, used by a variety of instructions, is by implicit reference to special registers.

Instructions are at least one word in length. The first word of the instruction is called the operation word. The length of the instruction is implicitly determined by fields in the operation word, and in some cases also in the following words. The operation word also specifies the operation to be performed. Any remaining words may further specify the operation or the operands. These words are either extensions to the effective address modes specified in the operation word, or immediate operands which are part of the instruction. The general format of an instruction is thus:

The instant processor separates memory references into two classes. This creates two address spaces, each with a complete logical address range. The first class is program references, which means the reference is to the object text of the program being executed.

In particular, all instruction fetches are made from the program space.

The other class is data references. Operand reads are from the data space with the exception of immediate operands, which are embedded in the instruction stream and therefore come from program space. All operand writes are to data space.
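The two-class rule can be summarized as a lookup. This is a minimal sketch with hypothetical enum and function names; it classifies references by kind only and does not model how the processor signals the address space on the bus.

```python
from enum import Enum


class Space(Enum):
    PROGRAM = "program"
    DATA = "data"


class Ref(Enum):
    INSTRUCTION_FETCH = "instruction fetch"
    IMMEDIATE_OPERAND_READ = "immediate operand read"
    OPERAND_READ = "operand read"
    OPERAND_WRITE = "operand write"


def address_space(ref: Ref) -> Space:
    """Classify a memory reference following the two-class rule above."""
    if ref in (Ref.INSTRUCTION_FETCH, Ref.IMMEDIATE_OPERAND_READ):
        # Instruction fetches and immediate operands come from the instruction stream.
        return Space.PROGRAM
    # Every other operand read, and every operand write, goes to data space.
    return Space.DATA


for ref in Ref:
    print(f"{ref.value}: {address_space(ref).value}")
```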

The operation of the execution unit of the instant processor is not unlike that of the processor described in the Tredennick patent referenced above, in the sense that an internal clock divides the apparent machine cycle time into four time periods, T1, T2, T3, and T4. Each set of T times is referred to as a "box," since the contents of one microinstruction box are executed in such a period (in some cases where such execution is not possible, for example because excessive bus accesses were required, the clocks delay the resumption of the sequence until all necessary actions are taken). An example of a microinstruction box and the key for interpreting it is given in Appendix 1 hereto. In the Tredennick patent, depending upon the instruction type, one of a number of microaddressing sequences could occur. For example, for an instruction in which a register-to-register operation was to be performed, only one microaddress call, an A1 call, was necessary, since all the pertinent information was specified in a single instruction word and there was no need to calculate effective addresses. The actual execution of the instruction took longer than one box, however: decoding began immediately when the instruction was called, but the decoding and subsequent operations, even register operations, could not be completed within four internal clock cycles. For other types of instructions, the A1 call would provide only a portion of the information necessary to execute the instruction, because effective addresses needed to be calculated. In such a case, the A1 call would reference an effective address routine or an immediate routine. This would result in an A2 and/or an A3 call to secure the necessary information to complete the instruction.

Similarly in the instant invention one or more calls may be made depending upon the type of instruction. Since in the instant machine the pipe is deeper than in the machine of the Tredennick patent, decoding can begin earlier and the entire microinstruction can be completed within one box of four microcycles. As noted before, in order to execute instructions where an address must be calculated, additional calls must be made. Also to handle special situations such as coprocessor interactions, additional microaddress calls must be generated. Following is a description of the call levels available, not all of which are pertinent to the instant invention:

A1 Initial instruction decode entry

A2/3 Secondary entry for, e.g., address calculations

A4 Save and restore

A5 Used for initial coprocessor operation decode

A6 Used as secondary entry for coprocessor operation decode

A7 Conditional A2 call

Of these, the A1 and A2/3 calls are most pertinent to the invention. The A1 calls are decoded by the A1 decoder PLA 20 of Figure 1 which decodes the value in IRC register 34 to provide an address to the first microword to be executed by an instruction. IRC will be loaded, as will be discussed further, during T2 of one microinstruction, and by T1 of the next, the result of the decode is available.

The A2/A3 calls are decoded by the A2/A3 decoder PLAs which decode the value in IRD to provide an address to the first microword of any additional functions associated with an instruction if they should be necessary. IRD is loaded during T1 of one microinstruction, and the result of the decode is available at T1 of the next.

As previously noted, the execution unit 10 of Figure 1 comprises three main sections: a P section, an A section, and a D section. The P section is used to calculate instruction stream pointers to facilitate fast access into the cache. The A section calculates operand addresses and is used for some data manipulation. The D section is the primary location for data manipulation.

Figure 2 shows a block diagram of the P section of the execution unit, which handles all instruction stream fetching. It maintains the pointers into the instruction stream as well as the program counter associated with an instruction. Instruction accesses are always reads from the program space, which are made through the cache.

In Figure 2, a register AOBPT 42, the address output buffer for the P section, generally points to the next word in the instruction stream to be fetched. It is connected, as are most of the other registers in the P section, between the address bus 44 and the data bus 46. The AOBP register 48 is essentially a copy of the AOBPT register, and bit 1 of AOBP is used to determine whether the next prefetch instruction boundary is odd or even.

The AU 50 is a 32 bit arithmetic unit used for calculating addresses for the P section. A constant register KD is associated with the AU.

The register PC 52 is the program counter for the instruction stream, and points at the word following the first word of the instruction being executed. PC is loaded from register TP2 54, which in turn is maintained by register TP1 56. Bit 1 of TP1 can be set by the microcode associated with the macroinstruction bus controller so that TP1 is corrected throughout the intermediate stage of an operation. This mechanism allows the processor to prefetch three instruction words in two cache or bus accesses. Since the bus is 32 bits wide and instruction words are 16 bits wide, two 32-bit accesses always yield at least three usable instruction words, which is the number required to fill the three-deep pipe. Thus, regardless of instruction boundary alignment, two words may be accessed in one microinstruction "box" of four microcycles, and one word in the adjacent box. Depending on the instruction alignment, the two-word fetch is done in the first or the second box. Since the processor can determine the alignment of the instruction in memory from the address of the first of the three word accesses, it knows whether the single-word access occurred on the first or the second access and can control the update of the pipe accordingly.

Figure 3 shows a block diagram of a three-register-deep instruction pipeline, most of the elements of which are also shown in Figure 1. Where common elements are shown, consistent numbering is used. As previously noted, the pipeline is three registers deep, the three levels being IRB (the register in the pipe most distant from the instruction register), the register pair IRC/IRC2, and the register pair IR/IRD.
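The alignment rule described above for the P section (two 32-bit accesses always supply three 16-bit instruction words, with bit 1 of the fetch address deciding whether the two-word access comes first or second) can be sketched as follows. The function name is hypothetical, and big-endian word order within a longword is assumed, so a word at an address with bit 1 set occupies the low half.

```python
def fetch_plan(start_addr: int) -> list[tuple[int, int]]:
    """Return (longword_address, words_used) for the two accesses that fill a
    three-word pipe starting at byte address `start_addr`.

    Assumptions: 16-bit instruction words, 32-bit (longword) accesses, and a
    word-aligned start address; bit 1 of the address selects the alignment.
    """
    if start_addr & 1:
        raise ValueError("instruction words must be word aligned")
    first_long = start_addr & ~0b11  # longword containing the first word
    if (start_addr >> 1) & 1 == 0:
        # Even boundary: both halves of the first longword are useful;
        # the second access supplies the third word.
        return [(first_long, 2), (first_long + 4, 1)]
    # Odd boundary: only the low half of the first longword is useful;
    # the second access supplies two words.
    return [(first_long, 1), (first_long + 4, 2)]


print(fetch_plan(0x1000))
print(fetch_plan(0x1002))
```

Either way exactly two accesses are made and three words are delivered; only the position of the single-word access changes.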

The IRB register 32 usually contains the most recent word read from the data ram or cache memory 30. A cache is a very high speed memory, of limited size, which contains the currently executing instruction stream. If a branch or other instruction stream change requires data not in the cache, an external bus cycle must be run to fetch the required data. The cache communicates with the IRB via cache holding register CHRL/CHRH 58. The IRB register is controlled by a field (PIPEOP) within a microcontrol word if an operation associated with the pipe is involved. IRB can also communicate with the execution unit buses by way of the IML register 60.

The IRC register 34 either contains the first extension word of an instruction or the next instruction to be executed. As an extension word, various fields of the word are used to perform branches or to be used as register pointers. As the next instruction to be executed, decoding is immediately begun in order to keep the processor flowing. To accomplish this, all of the initial entry instruction decoding (A1 calls) is done off this register. IRC is also controlled by the PIPEOP field of a microinstruction.

At the beginning of an instruction cycle (A1 boundary), the IRC2 register 36 contains the same information as the IRC register. The primary uses of the IRC2 register are to aid in coprocessor operations, which form no part of the instant invention, and to allow multiple-word instructions to be handled more easily. Typically, a multiple-word instruction contains control information which must be used throughout the execution time of that instruction. This causes difficulties if words beyond the second word must be used in the middle of an instruction, as in the case of an effective address evaluation in which there are some number of extension words. Thus, for normal instructions, IRC2 contains the second word of the instruction, if such is present. IRC2 is also used as one of the sources of the A5 and A6 PLA decoders 24 and 26, respectively, to support coprocessor operation. IRC2 is controlled by PIPEOP and by an A1 call. During a microinstruction in which an A1 call is made, the contents of IRC are transferred into IRC2.

The IR register 38 is used to hold a prefetched instruction for later transfer to IRD 40. Since it is desirable to make any needed instruction accesses as early as possible, IR is loaded with the next instruction as soon as possible. This is not always possible, but where it is, the next instruction word is stored in IR rather than IRD, since the word in IRD is still being used for residual control. The IR register is controlled by PIPEOP. A transfer from IR to IRD may be driven by an A1 call.

The IRD register 40 contains the first word of the instruction being executed. The instruction is loaded into IRD as soon as possible in the last microinstruction of the previous instruction, and it resides there until the last microinstruction of its execution, when the next instruction is loaded. The A2/A3 PLAs and most of the residual control PLAs are fed by this register. IRD is also controlled by PIPEOP and by an A1 call. The transfer from IR or IRC into IRD is either controlled by PIPEOP or inherent in an A1 call (as is the IRC-to-IRC2 update).

The operation of the pipe is as follows. The external data bus and the cache are both 32 bits wide, so an instruction fetch always produces 32 bits. An access from either the bus or the cache is first loaded into the cache holding register 58, which, as can be seen from Figure 3, is divided into high and low segments. The accessed word is moved from the cache holding register to register IRB 32, which is a 16-bit register (as are the other pipe registers, IRC, IRC2, IR, and IRD). In the overall flow of the pipe, instructions always flow from IRB to IRC to IR. IRD and IRC2 always remain static, since they contain the first and second words of the instruction, respectively. As the prefetch continues, the contents of IRB are put into IRC and the newly prefetched word is placed in IRB. If at that time a new instruction is to begin execution, i.e. an A1 call has caused the transfer of the instruction in register IR to IRD, IRC is copied into IRC2 so that IRC2 always contains the second word of the instruction. If a new instruction is not beginning execution, IRC is not copied into IRC2, but proceeds to IR as the pipe is advanced.
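The flow just described can be modelled in a few lines. This is a simplified, hypothetical sketch: words are plain labels, the A1 call and the pipe advance are treated as separate steps, and no decoding is modelled. It shows IRD and IRC2 holding the first two instruction words while later words keep moving from IRB through IRC to IR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipeModel:
    """Simplified, hypothetical model of the 16-bit pipe registers."""

    irb: Optional[str] = None
    irc: Optional[str] = None
    ir: Optional[str] = None
    irc2: Optional[str] = None  # static: second word of the executing instruction
    ird: Optional[str] = None   # static: first word of the executing instruction

    def a1_call(self) -> None:
        """A new instruction begins: IR -> IRD and IRC -> IRC2."""
        self.ird = self.ir
        self.irc2 = self.irc

    def advance(self, prefetched: str) -> None:
        """Advance the pipe: IRC -> IR, IRB -> IRC, new word -> IRB."""
        self.ir = self.irc
        self.irc = self.irb
        self.irb = prefetched


pipe = PipeModel()
for word in ("op", "word2", "ext1"):
    pipe.advance(word)
pipe.a1_call()  # IRD now holds "op", IRC2 holds "word2"

# Extension words keep streaming through IRB -> IRC -> IR ...
for word in ("ext2", "next_op"):
    pipe.advance(word)

# ... while IRD and IRC2 still hold the executing instruction's first two words.
print(pipe)
```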

The static registers IRC2 and IRD provide most of the decodes, with IRD providing the first-word decodes for the A2/A3 calls and IRC2 providing the second-word decodes and register selections. A number of the second-word decodes relate to A5 and A6 calls and are not relevant to this invention, except insofar as they occur in the pipe. Note that the A1 calls were made earlier, at the time when the instruction in IRC was copied into IRC2; those calls were made, however, from the instruction in the IRC register. The residual decoding, including the selection of the ALU, condition code operations, etc., must continue past the time when the early microaddress calls were made. This residual decoding takes place from the IRD register.

An example of the decoding operation will be given using the unsigned divide operation, the format of which is as follows:

This is an example of a two word instruction in which bit 10 of the second word indicates the size of the dividend, bits 14-12 and 2-0 indicate the location of the operands. The dividend may in this case be 64 bits or 32 bits, and the divisor 32 bits. The effective address specified by bits 5-0 of the first word may have multiple words, so the first word may be needed until the end of the instruction. The second word is loaded into IRC2 and the first word is loaded into IRD. IRD tells the ALU what kind of operations to perform. During the effective address evaluation (A1 call), elements will be stepped up the pipe through IRB, IRC, and IR. There may be, for example, five words of effective address extension that are necessary to find the effective address. But the first and second words of the instruction must remain the same so that when the A2/A3 call to actually do the divide operation occurs, the register pointers out of the second word can be used to store the results.
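The field extraction for this two-word instruction can be sketched as below. The bit positions for the register pointers and the dividend size follow the text and the div.l operation-word pattern listed in Appendix I; splitting bits 5-0 into a 3-bit mode and a 3-bit register, and reading bit 10 = 1 as a 64-bit dividend, are assumptions based on the usual 68000-family encoding. The helper name and the example words are hypothetical.

```python
def decode_divl(op_word: int, ext_word: int) -> dict:
    """Split the two words of a long divide into the fields named in the text.

    Assumed: bits 5-3 / 2-0 of the operation word are EA mode / register,
    and bit 10 of the second word set means a 64-bit dividend.
    """
    # div.l operation-word pattern from the appendix: 0100 110 001 xxx xxx
    if op_word >> 6 != 0b0100110001:
        raise ValueError("not a long divide operation word")
    return {
        "ea_mode": (op_word >> 3) & 0b111,      # bits 5-3 of the first word
        "ea_reg": op_word & 0b111,              # bits 2-0 of the first word
        "reg_14_12": (ext_word >> 12) & 0b111,  # register pointer, second word
        "reg_2_0": ext_word & 0b111,            # register pointer, second word
        "dividend_bits": 64 if (ext_word >> 10) & 1 else 32,  # bit 10
    }


# Hypothetical example: EA mode 5, EA register 0,
# register pointers 2 and 3, bit 10 set (64-bit dividend).
print(decode_divl(0x4C68, 0x2403))
```

Because IRD and IRC2 stay static, both words remain available to a decoder like this until the A2/A3 call that performs the divide.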

While the information used to calculate the effective address is stepped through the pipe, it is not actually used by the pipe. As these words are required to calculate the effective address, they may be accessed by the execution unit through register IML 60. The fact that they may appear in IR is irrelevant, as they are never used there. The only reason for the existence of IR is to act as a staging area to hold the next instruction prior to loading IRD. If the next instruction were always prefetched as the last operation of a microinstruction, there would be no need for IR. But it is frequently useful to prefetch that instruction earlier, so IR is used to hold it pending the residual decoding of the instruction in IRD.

The actual microinstruction sequences used to control the pipeline are shown in Appendix 1, together with the flow diagrams of the microinstructions.

Specifically, beginning on page 23 of Appendix 1 is a description of the pipe operations based upon the contents of the PIPEOP field of the microinstructions. If the boundary of the next instruction word to be fetched is even, which is determined by bit 1 of the AOBP register as discussed earlier, two instruction words are fetched on the first access, and an EV3Fa operation is performed in which the cache holding register low is transferred to IRB, and the cache holding register high is transferred to register IRC via the IRB bus and registers IMH and IML. In this case, the second access will fetch only one instruction word, so a TUD operation is performed. The TUD operation loads the cache holding register low into the IRB register, and the IRB contents are loaded into the IRC register via the IRB bus, the IMH register, and the IML register. At the same time, the previous contents of the IRC register are loaded into IR.

If the first instruction to be fetched is on an odd boundary, only one word will be fetched on the first access, and two words on the second access. In this case an OD3F operation is first performed in which the cache holding register low is loaded into IRC via the IRB bus. This is followed by an EV3Fb operation in which the cache holding register low is loaded into the IRB register, the cache holding register high is loaded into the IRC register via the IRB bus, the IMH register, and the IML register. At the same time, the previous contents of the IRC register are advanced to the IR register.
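Both PIPEOP sequences end with the same pipe contents, which is what lets the processor fill three words in two accesses regardless of alignment. The sketch below replays the transfers at the word level under stated assumptions: big-endian halves in the cache holding register, the IMH/IML and IRB-bus routing omitted, and the single word of a TUD access passed in directly rather than taken from a particular half (the text says TUD loads it from the cache holding register low). The function name is hypothetical.

```python
from typing import Optional


def refill(w0: str, w1: str, w2: str, odd_start: bool) -> dict:
    """Replay the PIPEOP sequence for a three-word refill (simplified sketch).

    w0, w1, w2 are the instruction words in program order. Assumes big-endian
    halves: the lower-addressed word of a longword sits in the high half (CHRH).
    """
    pipe: dict[str, Optional[str]] = {"IRB": None, "IRC": None, "IR": None}
    if not odd_start:
        # Access 1, EV3Fa: CHRH (w0) -> IRC, CHRL (w1) -> IRB.
        pipe["IRC"], pipe["IRB"] = w0, w1
        # Access 2, TUD: the single fetched word (w2) -> IRB,
        # old IRB -> IRC, old IRC -> IR (the right side is evaluated first).
        pipe["IR"], pipe["IRC"], pipe["IRB"] = pipe["IRC"], pipe["IRB"], w2
    else:
        # Access 1, OD3F: only CHRL (w0) is wanted; it goes to IRC.
        pipe["IRC"] = w0
        # Access 2, EV3Fb: CHRH (w1) -> IRC, CHRL (w2) -> IRB, old IRC -> IR.
        pipe["IR"], pipe["IRC"], pipe["IRB"] = pipe["IRC"], w1, w2
    return pipe


even = refill("w0", "w1", "w2", odd_start=False)
odd = refill("w0", "w1", "w2", odd_start=True)
print(even == odd, even)  # True {'IRB': 'w2', 'IRC': 'w1', 'IR': 'w0'}
```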

APPENDIX I

MICROINSTRUCTION LISTING

ORIGIN: if shared, co-ordinate of origin; if origin, # of boxes sharing with this box

DATA ACCESS INFORMATION:

R/W:
. - no access
<w> - write
<> - read
SPC - special signal
EXL - latch exception

TIME:
X - no timing associated
T1 - write to aob in T1
T3 - write to aob in T3
T0 - aob written before T1

TYPE

.,<>,<w> on R/W:
. - normal access
UNK - program/data access
CNORM - conditional normal
CUNK - conditional prog/data
AS - alternate address space
CPU1 - cpu access - different bus error
CPU2 - cpu access - normal bus error
RMC - read-modify-write access

SPC on R/W

RST1 - restore stage 1
RST2 - restore stage 2
HALT - halt pin active
RSET - reset pin active
SYNC - synchronize machine

EXL on R/W

BERR - bus error
PRIV - privilege viol.
AEUR - address error
TRAC - trace
LINA - line a
TRAP - trap
LINF - line f
COP protocol viol.
ILL - illegal
FORE - format error
DVBZ - divide by zero
INT - interrupt 1st stack
BDCK - bad check
INT2 - interrupt 2nd stack
TRPV - trap on overflow
NOEX - no exception

MICRO SEQUENCER INFORMATION:

DB - direct branch - next microaddress in microword

BC - conditional branch

A1 - use the A1 PLA sample interrupts and trace

A1A - use the A1 PLA sample interrupts, do not sample trace

A1B - use the A1 PLA do not sample interrupts or trace

A2 - use the A2 PLA

A7 - functional conditional branch (DB or A2 PLA)

A4 - use the A4 latch as next micro address

A5 - use the A5 PLA

A6 - use the A6 PLA

SIZE:
size = byte: nano-specified constant value
size = word: nano-specified constant value
size = long: nano-specified constant value
size = ircsz: irc[11] = 0/1 => word/long
size = irsz: ird decode of the instruction size (byte/word/long). Need to have file specifying residual control
size = ssize: shifter control generates a size value. The latch in which this value is held has the following encoding:

000 = byte

001 = word

010 = 3-byte

011 = long

100 = 5-byte *** must act as long sized
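A decoder for this latch might look like the following. This is a hypothetical sketch: the byte counts follow the size names (and agree with the byte = 1 ... long = 4 constants listed under CONSTANTS in this appendix), the 5-byte code is treated as long as the listing requires, and encodings not listed are rejected.

```python
# Hypothetical decode of the 3-bit shifter size latch listed above.
SIZE_LATCH = {
    0b000: ("byte", 1),
    0b001: ("word", 2),
    0b010: ("3-byte", 3),
    0b011: ("long", 4),
    0b100: ("5-byte", 4),  # must act as long-sized
}


def effective_size_bytes(code: int) -> int:
    """Operand width in bytes that the datapath should use for a latch code."""
    try:
        return SIZE_LATCH[code][1]
    except KeyError:
        raise ValueError(f"size latch encoding not listed: {code:03b}") from None


for code, (name, _) in SIZE_LATCH.items():
    print(f"{code:03b} {name}: {effective_size_bytes(code)} bytes")
```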

RXS - RX SUBSTITUTIONS:

RX is a general register pointer. It is used to point at either special-purpose registers or user registers. RX is generally used to translate a register pointer field within an instruction into the control required to select the appropriate register.

rx = rz2d/rxd: conditionally substitute rz2d; use rz2d and force rx[3] = 0 for mul.l 0100 110 000 xxx xxx and div.l 0100 110 001 xxx xxx

rx = rx: ird[11:9] muxed onto rx[2:0]; rx[3] = 0 (data reg.) unless residual points rxa, then rx[3] = 1 (residual defined in opmap)

rx = rz2: irc2[15:12] muxed onto rx[3:0]; rx[3] is forced to 0 by residual control for div.l 0100 110 001 xxx xxx and bit field reg 1110 1xx 111 xxx xxx

rx = rp: rx[3:0] = ar[3:0]. The value in the ar latch must be inverted before going onto the rx bus for movem rl,-(ry) 0100 100 01x 100 xxx

rx = rz: irc[15:12] muxed onto rx[3:0] (cannot use residual control)

rx = ro2: rx[2:0] = irc2[8:6]; rx[3] = 0 (data reg.). Used in Bit Field, always data reg.

rx = car: points @ cache address register

rx = vbr: points @ vector base register

rx = vat1: points @ vat1

rx = dt: points @ dt

rx = crp: rx[3:0] = ar[3:0]. The value in ar points at a control register (i.e. not an element of the user-visible register array)

rx = usp: rx[3:0] = F; force effect of psws to be negated (0)

rx = sp: rx[2:0] = F; if psws = 0 then address usp; if psws = 1 & pswm = 0 then isp; if psws = 1 & pswm = 1 then msp
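The rx = sp and rx = usp rules reduce to a small selection function. This is a direct transcription into a hypothetical helper, with psws and pswm taken as the S (supervisor) and M (master) status bits and `force_user` standing in for the rx = usp case.

```python
def stack_pointer(psws: int, pswm: int, force_user: bool = False) -> str:
    """Resolve rx = sp (or rx = usp when force_user is set) to a physical stack pointer."""
    if force_user or psws == 0:
        # rx = usp forces the effect of psws to be negated (0).
        return "usp"
    # Supervisor state: the M bit chooses between master and interrupt stack.
    return "msp" if pswm else "isp"


for s in (0, 1):
    for m in (0, 1):
        print(f"psws={s} pswm={m}: {stack_pointer(s, m)}")
```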

RYS - RY SUBSTITUTIONS:

ry = ry: ird[2:0] muxed onto ry[2:0]; ry[3] = 1 (addr reg.) unless residual points ryd, then ry[3] = 0 (residual defined in opmap)

ry = ry/dbin, ry/dob: a conditional substitution of dbin or dob for the normal ry selection (which includes the residual substitutions like dt). The substitution is made based on residual control defined in opmap (about 2 ird lines), which selects dbin/dob and inhibits all action to ry (or the residually defined ry). Depending on the direction to/from the rails, dbin or dob is selected: if the transfer is to the rails, dbin is substituted, while if the transfer is from the rails, dob is substituted.

Special case: IRD = 0100 0xx 0ss 000 xxx (clr, neg, negx, not), where if driven onto the a-bus it will also drive onto the d-bus.

ry = rw2: irc2[3:0] muxed onto ry[3:0]
use rw2: movem ea,rl 0100 110 01x xxx xxx; div.l 0100 110 001 xxx xxx; bfield 1110 xxx xxx xxx xxx; cop 1111 xxx xxx xxx xxx
do not allow register to be written to: div.w 1000 xxx x11 xxx xxx
force ry[3] = 0: div.l 0100 110 001 xxx xxx; bfield 1110 1xx x11 xxx xxx

ry = rw2/dt: conditionally substitute rw2 or dt; use rw2 and force ry[3] = 0 for mul.l 0100 110 000 xxx xxx and irc2[10] = 1, and for div.l 0100 110 001 xxx xxx and irc2[10] = 1

ry = vdt1: points @ virtual data temporary

ry = vat2: points @ virtual address temporary 2

ry = dty: points @ dt

AU - ARITHMETIC UNIT OPERATIONS:

0- ASDEC add/sub: add/sub based on residual control; sub if ird = xxxx xxx xxx 100 xxx

1- ASXS add/sub: add/sub based on residual (use alu add/sub). Do not extend db entry. Add if ird = 1101 xxx xxx xxx xxx (add) or 0101 xxx 0xx xxx xxx (addq)

2- SUB sub subtract AB from DB

3- DIV add/sub do add if aut[31] = 1, sub if aut[31] = 0; take db (part rem) shift by 1 shifting in alut[31] then do the add/sub.

4- NIL

6- SUBZX sub zero extend DB according to size then sub AB

8- ADDX8 add sign extend DB 8 -> 32 bits then add to AB

9- ADDX6 add sign extend DB 16 -> 32 bits then add to AB

10- ADD add add AB to DB

11- MULT add shift DB by 2 then add constant; sign/zero extend based on residual and previous aluop: muls = always sxtd; mulu = sxtd when sub in previous aluop

12- ADDXS add sign extend DB based on size then add to AB

13- ADDSE add sign extend DB based on size, then shift the extended result by 0, 1, 2 or 3 bits depending upon irc[10:9]. Finally add this to AB

14- ADDZX add zero extend DB according to size then add to AB

15- ADDSZ add zero extend DB according to size, shift by 2, then add

CONSTANTS:

0,1: 1 is selected by (div * allzero) + (mult * alu carry = 0)

1,2,3,4: selected by size: byte = 1, word = 2, 3-byte = 3, long = 4. If (Rx=SP or Ry=SP) and (Ry=Ry or Rx=Rx) and (Rx or Ry is a source and destination) and (au constant = 1,2,3,4) and (size = byte), then constant = 2 rather than 1.
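A simplified sketch of the size-based constant selection follows. It collapses the multi-term SP condition above into a single assumed flag meaning that SP is both the source and the destination of the address update; the function and flag names are hypothetical. The byte-size exception keeps the stack pointer word aligned.

```python
BYTES_PER_SIZE = {"byte": 1, "word": 2, "3-byte": 3, "long": 4}


def au_constant(size: str, sp_is_source_and_destination: bool) -> int:
    """AU increment/decrement constant selected by operand size.

    Simplification: the listed multi-term SP condition is reduced to one flag.
    """
    constant = BYTES_PER_SIZE[size]
    if size == "byte" and sp_is_source_and_destination:
        # Byte-sized stack updates step by 2 so SP stays word aligned.
        constant = 2
    return constant


print(au_constant("byte", False), au_constant("byte", True), au_constant("long", True))
```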

ALU - ARITHMETIC AND LOGIC UNIT OPERATIONS: col0 = x, nil; col1 = and; col2 = alu1, div, mult, or; col3 = alu2, sub

op     function        cin
add    db + ab         0
addx   db + ab         x
add1   db + ab         1
and    ab ^ db         -
chg    ab xor k=-1     -
clr    ab ^ k=0        -
eor    ab xor db       -
not    ^~ab v db       -
or     ab v db         -
set    ab v k=-1       -
sub    db + ab         1
subx   db + ab         x
mult   (db shifted by 2) add/sub (ab shifted by 0, 1, 2 (if 0 then add/sub 0)); control for add/sub and shift amount comes from regb; don't assert atrue for mult; cin = 0
div    build part. quot and advance part. remain.1: ab (pr.1:pq) shifted by 1, add0, value shifted in = au carry (quot bit); cin = 0; must assert atrue for div

The condition codes are updated during late T3 based upon the data in alut and/or rega. These registers can be written to during T3. In the case of rega, there are times when the value to be tested is the result of an insertion from regb.

CC CONDITION CODE UPDATE CONTROL:

standard:
n = alut msb (by size)
z = alut=0 (by size)

non-standard:
add     c = cout; v = vout
addx.1  c = cout; z = pswz ^ locz; v = vout
bcd1    c = cout
bcd2    c = cout v pswc; z = pswz ^ locz
bfld1   n = shiftend; z = allzero
bfld2   z = pswz ^ allzero
bit     z = allzero
div     v = au carry out
mul1    n = (shiftend ^ irc2[10]) v (alut[31] ^ ^~irc2[10])
        z = (alut=0 ^ shift allzero ^ irc2[10]) v (alut=0 ^ ^~irc2[10])
        v = ^~irc2[10] ^ ((irc2[11] ^ (^~allzero ^ ^~alut[31]) v (^~allone ^ alut[31])) v (^~irc2[11] ^ ^~allzero))
rotat   c = shiftend = (sc=0 -> 0; sc<>0 -> end)
rox.1   c = shiftend = (sc=0 -> pswx; sc<>0 -> end)
        ! can do this in two steps as knz0c where c=pswx and cnz0c where c=shiftend (not with share row with shift)
rox.3   v = shift overflow = ((^~allzero ^ sc>sz) v (^~(allzero v allone) ^ sc<=sz))
        ! can simplify this if we don't share rows but it will cost another box
sub.1   c = ^~cout; v = vout
sub.2   c = ^~cout; v = vout
subx.1  c = ^~cout; z = pswz ^ locz; v = vout
subx.2  c = ^~cout
subx.3  c = ^~cout v pswc; z = pswz ^ locz
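The add and subtract rows can be checked against a software model of the flags. This sketch assumes subtraction is performed as db + ^~ab + 1, so the carry flag is the inverted carry out (a borrow), and it models the addx/subx Z rule, where ^ in the table means AND. Note that in the Python code `^` is XOR, not the table's AND. The function names are hypothetical.

```python
def add_flags(db: int, ab: int, bits: int, cin: int = 0):
    """Return (result, n, z, v, c) for db + ab + cin at the given size."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    d, a = db & mask, ab & mask
    total = d + a + cin
    result = total & mask
    cout = total >> bits                               # carry out of the MSB (0 or 1)
    vout = int(((d ^ result) & (a ^ result) & sign) != 0)  # both inputs differ in sign from result
    n = int((result & sign) != 0)                      # n = msb by size
    z = int(result == 0)                               # z = result = 0 by size
    return result, n, z, vout, cout


def sub_flags(db: int, ab: int, bits: int):
    """db - ab computed as db + ^~ab + 1; the carry flag is ^~cout."""
    mask = (1 << bits) - 1
    result, n, z, v, cout = add_flags(db, ~ab & mask, bits, cin=1)
    return result, n, z, v, 1 - cout


def extended_z(pswz: int, locz: int) -> int:
    """addx/subx Z (z = pswz ^ locz with ^ = AND): can be cleared but never set."""
    return pswz & locz
```

For example, a byte add of 0x7F + 0x01 gives 0x80 with N = 1 and V = 1, and a byte subtract of 0x00 - 0x01 gives 0xFF with C = 1 (a borrow).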

The meaning and source of the signals used to set the condition codes are listed below:

allzero = every bit in the rega field = 0, where the field starts at the bit pointed to by start and ends at (and includes) the bit pointed to by end. (see shift control)

allone = every bit in the rega field = 1, where the field starts at the bit pointed to by start and ends at (and includes) the bit pointed to by end. (see shift control)

shiftend = the bit in rega pointed to by end = 1. (see shift control)

locz = all alut bits for the applicable size = 0.
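The field-based signals can be modeled with a mask built from the start and end pointers. This sketch assumes a 32-bit rega and assumes that the field runs upward from start and wraps past bit 31 when end is numerically below start; the source does not state the wrap rule, so treat it as an assumption. The function names are hypothetical.

```python
WIDTH = 32
MASK32 = (1 << WIDTH) - 1


def field_mask(start: int, end: int) -> int:
    """Mask for the field from bit `start` up to and including bit `end`.

    Assumption: the field wraps past bit 31 when end < start.
    """
    length = ((end - start) % WIDTH) + 1
    ones = (1 << length) - 1
    return ((ones << start) | (ones >> (WIDTH - start))) & MASK32


def field_signals(rega: int, start: int, end: int) -> dict:
    """allzero, allone and shiftend for the field of rega selected by start/end."""
    mask = field_mask(start, end)
    field = rega & mask
    return {
        "allzero": field == 0,
        "allone": field == mask,
        "shiftend": bool((rega >> end) & 1),
    }


def locz(alut: int, bits: int) -> bool:
    """locz: all alut bits for the applicable size are 0."""
    return (alut & ((1 << bits) - 1)) == 0
```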

SHFTO - SHIFTER OPERATIONS:

ror: the value in rega is rotated right by the value in the shift count register into regb.

sxtd: the value in rega defined by the start and end registers is sign extended to fill the undefined bits, and that value is rotated right by the value in the shift count register. The result is in regb.

xxtd: the value in rega defined by the start and end registers is PSWX extended to fill the undefined bits, and that value is rotated right by the value in the shift count register. The result is in regb.

zxtd: the value in rega defined by the start and end registers is zero extended to fill the undefined bits, and that value is rotated right by the value in the shift count register. The result is in regb.

ins: the value in regb is rotated left by the value in the shift count register and then inserted into the field defined by the start and end registers in rega. Bits in rega that are not defined by start and end are not modified.

boffs: provides the byte offset in regb. If irc2[11]=1, the offset is contained in RO, so rega should be sign extended from rega to regb using start, end and shift count values of 3, 31 and 3 respectively. If irc2[11]=0, the offset is contained in the immediate field and should be loaded from irc2[10:6] or, probably more conveniently, from osr[4:0]. This value, however, should be shifted by 3 bits such that osr[4:3] are loaded onto regb[1:0] with zero extension of the remaining bits.

offs: provides the offset in regb. If irc2[11]=1, the offset is contained in RO, and DB>REGB should be allowed to take place. If irc2[11]=0, the offset is contained in the immediate field, and osr[4:0] should be loaded onto regb[4:0] with zero extension of the remaining bits.
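The shifter operations can be expressed as "mask the field, fill the other bits, rotate". The sketch below assumes 32-bit registers and the same upward, wrapping field convention as a hypothetical `field_mask` helper. It also shows why start = 3, end = 31, shift count = 3 turns a register bit offset into a byte offset: the result is an arithmetic shift right by 3.

```python
MASK32 = 0xFFFFFFFF


def field_mask(start: int, end: int) -> int:
    """Bits start..end inclusive, counted upward and wrapping past bit 31 (assumed)."""
    length = ((end - start) % 32) + 1
    ones = (1 << length) - 1
    return ((ones << start) | (ones >> (32 - start))) & MASK32


def ror32(value: int, count: int) -> int:
    count %= 32
    return ((value >> count) | (value << (32 - count))) & MASK32


def rol32(value: int, count: int) -> int:
    return ror32(value, (32 - count % 32) % 32)


def extract(rega: int, start: int, end: int, sc: int, fill: int) -> int:
    """Keep the field, set every undefined bit to `fill` (0 or 1), rotate right by sc."""
    mask = field_mask(start, end)
    filled = (rega & mask) | ((~mask & MASK32) if fill else 0)
    return ror32(filled, sc)


def sxtd(rega: int, start: int, end: int, sc: int) -> int:
    return extract(rega, start, end, sc, (rega >> end) & 1)  # sign = bit at end


def zxtd(rega: int, start: int, end: int, sc: int) -> int:
    return extract(rega, start, end, sc, 0)


def xxtd(rega: int, start: int, end: int, sc: int, pswx: int) -> int:
    return extract(rega, start, end, sc, pswx)


def ins(rega: int, regb: int, start: int, end: int, sc: int) -> int:
    """Rotate regb left by sc and insert it into the rega field; other bits unchanged."""
    mask = field_mask(start, end)
    return (rega & ~mask & MASK32) | (rol32(regb, sc) & mask)


def boffs_from_register(offset: int) -> int:
    """Register offset: sxtd with start, end, shift count = 3, 31, 3."""
    return sxtd(offset & MASK32, 3, 31, 3)


def boffs_from_immediate(osr: int) -> int:
    """Immediate offset: osr[4:3] onto regb[1:0], remaining bits zero."""
    return (osr >> 3) & 0b11
```

A register bit offset of -9 (0xFFFFFFF7) therefore becomes the byte offset -2 (0xFFFFFFFE), matching floor(-9 / 8).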

SHFTC - SHIFTER CONTROL:

*1* loaded based on ird[5] - if ird[5] = 0 then wr value comes from BC bus else value is loaded from regc.

FTU - FIELD TRANSLATION UNIT OPERATIONS:

3- LDCR load the control register from regb. The register is selected by the value in ar[1:0]; this can be gated onto the rx bus.

4- DPSW load the psw with the value in regb. Either the ccr or the psw is loaded depending upon size. If size is byte, then only load the ccr portion.

14- CLRFP clear the f-trace pending latch. (fpend2 only)

17- LDSH2 load the contents of the shifter control registers from regb. These include wr, osr, count.

19- LDSWB load the internal bus register from regb. This is composed of bus controller state information which must be accessed by the user in fault situations.

21- LDSWI load the first word of sswi (internal status word) from regb. This is composed of tpend, fpend1, fpend2, ar latch.

23- LDSH1 load the contents of the shifter control registers from regb. These include st, en, sc.

25- LDUPC load micro pc into A4 from regb and check validity of rev #.

26- LDPER load per with the value on the a-bus (should be a T3 load). ab>per

28- LDARL load the ar latch from regb. May be able to share with ldswi or ldswj.

29- 0PSWM clear the psw master bit.

33- RPER load output of per into ar latch and onto bc bus. There are two operations which use this function, MOVEM and BFFFO. MOVEM requires the least significant bit of the lower word (16 bits only) that is a one to be encoded and latched into the AR latch and onto the BC BUS (inverted) so that it can be used to point at a register. If no bits are one, then the end signal, which is routed to the branch pla, should be active. After doing the encoding, that least significant one bit should be cleared.
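The RPER behaviour for MOVEM is an encode-lowest-one-then-clear loop. The sketch below shows the resulting register order, assuming the usual 68000 mask layout for non-predecrement modes (bit 0 = D0 through bit 15 = A7). The names `movem_sequence` and `REGISTER_NAMES` are hypothetical, and the inverted BC bus value is not modeled.

```python
REGISTER_NAMES = [f"D{i}" for i in range(8)] + [f"A{i}" for i in range(8)]


def movem_sequence(mask16: int) -> list[str]:
    """Registers in the order the encode-and-clear loop would visit them."""
    per = mask16 & 0xFFFF                         # lower word only
    order = []
    while per:                                    # per == 0 corresponds to the 'end' signal
        lowest = (per & -per).bit_length() - 1    # index of least significant one bit
        order.append(REGISTER_NAMES[lowest])
        per &= per - 1                            # clear that bit for the next pass
    return order
```

A mask of 0x8005 visits D0, D2 and then A7, after which the end condition is reached.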

For BFFFO it is necessary to find the most significant bit of a long word that is a one. This value is encoded into 6 bits, where the most significant bit is the 32-bit all zero signal. Thus the following bits would yield the corresponding encoding:

most sig bit set    per out    onto bc bus
31                  0 11111    1110 0000
16                  0 10000    1110 1111
0                   0 00000    1111 1111
NONE                1 11111    0000 0000

The output is then gated onto the BC bus (inverted, as in the MOVEM case), where it is sign extended to an 8-bit value. It does not hurt anything in the BFFFO case to load the other latch (i.e. BFFFO can load the AR latch).
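The BFFFO encoding can be reproduced in software. The rule that the BC bus carries the 6-bit per output inverted and then sign extended to 8 bits is inferred from the rows of the encoding table rather than stated outright, so treat it as an assumption; the function name `bfffo_encode` is hypothetical.

```python
def bfffo_encode(value: int) -> tuple[int, int]:
    """Return (per_out, bc_bus) for the BFFFO priority encoder.

    per_out is 6 bits: bit 5 is the 32-bit all-zero signal and bits 4:0 give
    the position of the most significant one (11111 when no bit is set).
    bc_bus is per_out inverted and sign extended from 6 to 8 bits.
    """
    value &= 0xFFFFFFFF
    per_out = 0b111111 if value == 0 else value.bit_length() - 1
    inverted = ~per_out & 0x3F
    bc_bus = (inverted | 0xC0) if (inverted & 0x20) else inverted
    return per_out, bc_bus
```

Each row of the table is reproduced: bit 31 gives (011111, 1110 0000), and an all-zero input gives (111111, 0000 0000).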

For BFFFO it does not matter if a bit is cleared.

34- STCR store the control register in regb. The register is selected by the value in ar[1:0]; this can be gated onto the rx bus.

37- STPSW store the psw or the ccr in regb based on size. If size is byte, then store the ccr only, with bits 8-15 as zeros.

38- 0PEND store the psw in regb, then set the supervisor bit and clear the trace bit in the psw. Tpend and Fpend are cleared. The whole psw is stored in regb.

39- 1PSWS store the psw in regb, then set the supervisor bit and clear both trace bits in the psw. The whole psw is stored in regb.

40- STINST store IRD decoded information onto the BC bus and into regb. This data can be latched from the BC bus into other latches (i.e. wr & osr) by other control.

41- STIRD store the ird in regb.

43- STINL store the new interrupt level in pswi and regb. The three bits are loaded into the corresponding pswi bits. The same three bits are loaded onto bc bus [3:1] with bc bus [31:4] = 1 and [0] = 1, which is loaded into regb. Clear IPEND the following T1.

44- STV# store the format & vector number associated with the exception in regb.

47- STCRC store the contents of the CRC register in regb. Latch A4 with microaddress.

48- STSH2 store the contents of the shifter control registers into regb. These include wr, osr, count. Store high portion of shift control.

50- STSWB store the internal bus register in regb. This is composed of bus controller state information which must be accessed by the user in fault situations.

52- STSWI store sswi (internal status word) in regb. The sswi is composed of tpend, ar latch, fpend1, fpend2.

54- STSH1 store the contents of the shifter control registers into regb. These include st, en, sc.

56- STUPC store the micro pc in regb.

62- NONE

63- STPER store the per onto the a-bus (should be a T1 transfer). per>ab

PC - PC SECTION OPERATIONS:

AOBP[1]        0       1
31 - 3PFI      EV3FI   OD3FI
30 - 3PFF      TPF     EV3FI

0- NF aobpt>db>sas tp2>ab>sas

1- TPF aobpt>db>tp1 aobpt>db>aup>aobp*,aobpt +2>aup tp1>tp2 tp2>ab>sas

2- PCR tp2>ab>a-sect (if ry=pc then connect pc and address section) aobpt>db>sas

3- PCRF aobpt>db>tp1 aobpt>db>aup>aobp*,aobpt +2>aup tp1>tp2 tp2>ab>a-sect (if ry=pc then connect pc and address section)

4- JMP1 tp2>db>a-sect a-sect>ab>aobpt

5- BOB aobpt>db>tp1 tp1>tp2 tp2>ab>sas

- EV3FI aobpt>db>tp1* aobpt>db>aup>aobpt +4>aup tp2>ab>sas

- OD3FI aobpt>db>aup>aobpt,tp2

+2>aup tp2>ab>sas

7- TRAP tp2>db>a-sect pc>ab>sas

8- TRAP2 tp2>ab>a-sect aobpt>db>sas

9- JMP2 a-sect>ab>aobpt aobpt>db>sas

10- PCOUT pc>ab>a-sect aobpt>db>sas

11- NPC Conditional update based on cc=t/f tp2>db>aup,a-sect a-sect>ab>aup>aobpt

12- LDTP2 a-sect>ab>tp2 aobpt>db>sas

13- SAVE1 pad>aobp aobpt>db>sas tp2>ab>sas

15- SAVE2 aobp>db>tp1 tp2>ab>sas

14- FIX aobpt>db>tp1 tp2>ab>aobpt tp1>tp2

16- LDPC tp2>pc aobpt>db>sas tp2>ab>sas
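The AOBP[1] column decides whether a three-word refill starts with a two-word access (EV3FI advances by +4) or a one-word access (OD3FI advances by +2). The sketch below lists the bus accesses for either case. It assumes a 32-bit bus that returns aligned long words in big-endian word order; the function name `refill_accesses` is hypothetical. In both cases the pipeline fills in two accesses.

```python
def refill_accesses(pc: int) -> list[tuple[int, list[int]]]:
    """Bus accesses needed to fill a three-word pipeline starting at word address pc.

    Each entry is (long-word bus address, word addresses actually used).
    """
    if pc % 2:
        raise ValueError("instruction words are word aligned")
    if (pc & 0b10) == 0:
        # AOBP[1] = 0 (even long word): first access returns two words, then one.
        return [(pc, [pc, pc + 2]), (pc + 4, [pc + 4])]
    # AOBP[1] = 1 (odd word): first access uses one word, the second returns two.
    return [(pc - 2, [pc]), (pc + 2, [pc + 2, pc + 4])]
```

For pc = 0x1002, the first access reads the long word at 0x1000 but keeps only the word at 0x1002, and the second access supplies 0x1004 and 0x1006.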

PIPE - PIPE OPERATIONS:

Description of bit encodings: [6] = use irc, [5] = change of flow, [4] = fetch instruction, [3:0] = pipe control functionality as previously defined.

AOBP[1]              0       1
0 1 1 3 - 3UDI       EV3Fa   OD3F
1 0 1 7 - 3UDF       TUD     EV3Fb

- EV3Fa chrl>irb chrh>pb>imh,iml,irc change of flow fetch instr

- EV3Fb chrl>irb chrh>pb>imh,iml,irc irc>ir ! implies use irc use pipe fetch instr

- OD3F chrl>pb>irc
! force miss regardless of whether odd or even change of flow fetch instr

0 0 0 0 - NUD x

1 0 0 0 - UPIPE use pipe

0 0 1 1 - FIX2 Always transfer irb up pipe to irc, im and ir; irb needs to be replaced, so do access and transfer chr to irb. chr>irb irb>pb>imh,iml,irc
! force miss regardless of whether odd or even change of flow, fetch instr

db>ird else load irb from d-bus. irb>pb>imh,iml,irc change of flow fetch instr

0 0 2 - IRAD ira>db

0 0 4 - IRTOD ir>ird

0 0 1 5 - FIX1 chr>irb If irc needs to be replaced, do access and transfer chr to irb, else no activity.
! force miss regardless of whether odd or even change of flow fetch instr

1 0 0 6 - 2TOC irc2>irc irc>ir use pipe

0 0 0 8 - CLRA clear irc2[14] ira>ab zxtd 8 -> 32

0 0 0 9 - STIRA db>>ira ira>pb>irc2

0 0 0 11 - ATOC db>>ira ira>pb>irc

0 0 1 13 - EUD chr>irb irb>pb>imh,iml fetch instr

1 0 0 14 - CTOD irc>ir,ird irb>irc use pipe

1 0 1 15 - TUD chr>irb irb>pb>imh,iml,irc irc>ir use pipe fetch instr

0 1 1 15 - TOAD chr>irb irb>pb>imh,iml,irc irc>ir change of flow fetch instr
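The pipe control field and the basic "advance the pipe" transfer can be sketched as follows. The decoder splits the 7-bit code into the documented groups (bit 6 = use irc, bit 5 = change of flow, bit 4 = fetch instruction, bits 3:0 = function). The `InstructionPipe` class is a hypothetical model of the TUD transfers chr>irb, irb>irc and irc>ir only; the imh/iml copies and the ird decode latch are omitted.

```python
from dataclasses import dataclass
from typing import Optional


def decode_pipe_control(code: int) -> dict:
    """Split a 7-bit pipe control field into its documented bit groups."""
    return {
        "use_irc": bool((code >> 6) & 1),            # [6]
        "change_of_flow": bool((code >> 5) & 1),     # [5]
        "fetch_instruction": bool((code >> 4) & 1),  # [4]
        "function": code & 0xF,                      # [3:0]
    }


@dataclass
class InstructionPipe:
    irb: Optional[int] = None  # newest word, loaded from chr
    irc: Optional[int] = None
    ir: Optional[int] = None   # oldest word

    def tud(self, chr_word: int) -> None:
        """TUD: irc>ir, irb>irc, chr>irb (imh/iml copies omitted)."""
        self.ir = self.irc
        self.irc = self.irb
        self.irb = chr_word
```

Encoded as 1 0 1 15, TUD is 0b1011111. Each call to `tud` moves every word one stage toward ir and loads a new word into irb.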

Claims

1. In a data processor comprising: a first instruction register; a second instruction register; a third instruction register; and instruction execution control means for selectively sequencing instruction words from said first instruction register through said second instruction register to said third instruction register, the instruction execution control means being responsive to a predetermined instruction word in said second instruction register to control the evaluation of an effective address of an operand, and to said predetermined instruction word in said third instruction register to control a selected operation upon said operand; the improvement comprising: a fourth instruction register interposed between said second and third instruction registers; whereby said instruction control means will selectively sequence said instruction words from said first instruction register through said second instruction register and said fourth instruction register to said third instruction register, thereby initiating the instruction decode earlier in said selective sequencing of said instruction words.

2. In the data processor of claim 1 wherein said instruction execution control means are responsive to said predetermined instruction word in said second instruction register to control selected residual operations upon said operand, the improvement further comprising: a fifth instruction register; and means in said instruction execution control means for selectively sequencing said instruction words into said fifth means, said instruction execution control means being responsive to said predetermined instruction word in said fifth instruction register to control said selected residual operations upon said operand.

3. In the data processor of claim 1 further comprising: communication control means for retrieving each of said instruction words from a storage device at the request of said instruction execution control means; and wherein said instruction execution control means requests said communication control means to retrieve said instruction words in a selected sequence in synchronization with the selective sequencing of said instruction words through said instruction registers; the improvement further comprising: means in said instruction execution control means responsive to a second predetermined one of said instruction words in said third instruction register for providing a control signal indicating that the three (3) instruction words in said first, second and fourth instruction registers must be replaced with three (3) other instruction words; and means in said communication control means responsive to said control signal for retrieving said three (3) other instruction words from said storage device, at least two (2) of said instruction words being retrieved using a single access to said storage device.

4. In a data processor comprising an execution unit and a pipeline for providing data segments representing instruction words from a memory to the execution unit, the pipeline comprising: a first register for holding, during execution of an instruction, a first portion of an instruction; a second register for holding, during execution of an instruction, a second portion of an instruction; a third register for holding, prior to execution of an instruction, a portion of an instruction; decoding means coupled to the first, second and third registers for decoding the contents of the first, second, and third registers, such that effective address evaluation is begun while an instruction portion is in the second register, completed while the instruction portion is in the first register, and simultaneously therewith the operation of the instruction is performed.

5. In a data processor having an execution unit, a memory, and a bus of two words in width coupled to the memory for carrying instructions from the memory to the execution unit, an instruction pipeline of one word width between the bus and the execution unit and having a depth of three registers; means for fetching two words from the memory on one access thereof and for fetching one word from the memory on another access thereof; and means for determining the order in which the one and the another access occur.

6. A means for determining the order of fetches as set forth in claim 5 comprising a flag bit in the execution unit, wherein the flag bit is set or reset by an access controller.