CN101178645A

CN101178645A - Parallel floating-point multiply-add unit

Info

Publication number: CN101178645A
Application number: CNA2007101799736A
Authority: CN
Inventors: 李兆麟; 李恭琼
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2007-12-20
Filing date: 2007-12-20
Publication date: 2008-05-14
Anticipated expiration: 2027-12-20
Also published as: CN100570552C

Abstract

The invention relates to a parallel floating-point fused multiply-add unit that simplifies comparable designs, performs the multiply-add operation A+B+C×D (with A greater than or equal to B), and also delivers the result of C×D, using a three-stage pipeline. In the first stage, A and B are shifted and aligned, and C×D is Booth-encoded and its partial products are partially compressed. In the second stage, the aligned A and B and the partially compressed C×D are compressed by a 4:2 CSA, after which leading-zero anticipation, sign prediction, half-addition and the normalization shift are performed. In the third stage, the final addition and rounding of A+B+C×D are performed and its exponent is computed, and the mantissa and exponent of C×D are computed from the outputs of the first stage. The invention achieves instruction-level parallelism: it can complete an add instruction and a multiply instruction at the same time, and it can also accelerate two consecutive instructions with a data dependency.

Description

A parallel floating-point multiply-add unit

Technical field

The present invention relates to the design of floating-point units, and in particular to a high-speed floating-point multiply-add unit for high-performance floating-point computation.

Background technology

Published data show that almost 50% of floating-point multiply instructions are immediately followed by a floating-point add or subtract. The fused floating-point multiply-add operation A+B×C has therefore become a basic operation in scientific computing and multimedia applications. Because this operation occurs so frequently in application programs, implementing it with a fused multiply-add unit (abbreviated MAF unit) has become a good choice for modern high-performance commercial processors. This implementation has two main advantages: (1) only one rounding is needed instead of two; (2) sharing some component modules reduces circuit delay and hardware cost. A multiply-add (MAF) instruction needs three operands and performs A+(B×C). When operand A is set to 0 the instruction executes a multiplication, and when operand B or C is set to 1 it executes an addition. In most existing processors, the floating-point multiply-add operation is generally implemented in the following steps (see reference 1, Floating-Point Multiply-Add-Fused with Reduced Latency; the block diagram of the implementation is shown in Fig. 1):

1. The multiplicand C is first Booth-encoded, and B×C is then computed by a compression tree built from carry-save adders (CSAs), yielding two partial products. While the multiplication is carried out, operand A is inverted and alignment-shifted. The signs of A and B×C may be the same or opposite. If they are opposite, A and B×C undergo an effective subtraction: the two's complement of A is needed for the addition, so A must be inverted. If they are the same, an effective addition is performed and A need not be inverted. In what follows, the output of the inverter is denoted A_Inv whether or not the inversion actually takes place.

In the IEEE-754 standard, the mantissa of a single-precision operand has 24 bits. With 2 extra rounding bits added, the most significant bit of A_Inv lies at most 26 bits to the left of the most significant bit of the B×C result, or at most 48 bits to its right; that is, the offset ranges from 26 bits to the left to 48 bits to the right. To simplify the shifting, multiply-add designs normalize the shift of A to a right shift only. A_Inv is therefore initially placed 26 bits to the left of B×C, and the number of bits by which A_Inv is shifted right during alignment is 27-(exp(A)-(exp(B)+exp(C)-127)), where exp(A), exp(B) and exp(C) are the exponents of operands A, B and C respectively.

2. The aligned A_Inv and the two partial products from the B×C compression are compressed by a 3:2 carry-save adder (CSA) into two partial products; the +1 required to complete the two's complement of A is handled at the same time.

3. The two partial products obtained in step 2 are used for leading-zero anticipation (LZA, leading zero anticipator), which gives the number of bits by which the addition result must be shifted left for normalization. The sign of the final result is determined at the same time.

4. In parallel with leading-zero anticipation and sign prediction, a half-add operation is performed and part of the addition is completed. The half-add operation ensures that rounding can later be done correctly. Because sign prediction takes longer than the half-add operation, part of the final addition can be completed in the time left over.

5. The addition result is shifted left for normalization by the number of bits predicted by the LZA. If the sign prediction in step 3 determines that the final result is negative, the complemented form (produced in step 4) of the partial products obtained in step 2 is selected for the normalization shift.

6. Final addition and rounding.
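The alignment shift amount quoted in step 1 can be evaluated with a small sketch. It assumes biased single-precision exponents (bias 127); the function name alignment_shift is hypothetical and is not taken from reference 1.

```python
BIAS = 127  # IEEE 754 single-precision exponent bias


def alignment_shift(exp_a: int, exp_b: int, exp_c: int) -> int:
    """Right-shift amount of A_Inv: 27 - (exp(A) - (exp(B) + exp(C) - 127))."""
    exp_product = exp_b + exp_c - BIAS  # biased exponent of B x C
    return 27 - (exp_a - exp_product)


# When A and B x C have the same exponent, the formula gives a shift of 27.
print(alignment_shift(127, 127, 127))
# A larger exponent of A means a smaller right shift.
print(alignment_shift(130, 127, 127))
```

The sketch only evaluates the formula; it does not model the inversion of A or the 3:2 compression that follows.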

The prior art shown in Fig. 1 has the following deficiencies:

(1) It cannot process an add instruction (A+B) and a multiply instruction (C×D) at the same time; two cycles are needed to complete these two instructions. Analysis of a number of applications shows that executing add and multiply instructions simultaneously would significantly improve the execution efficiency of the instruction stream.

(2) When two consecutive instructions have a data dependency, the pipeline is forced to wait two cycles (in a three-stage pipeline implementation), and in practice data dependencies are very common.

These deficiencies cannot be solved by using a separate adder unit and multiplier unit. First, this increases hardware cost. Second, a multiply-add instruction must be split into two instructions, which lowers execution efficiency and, because rounding is performed twice, reduces precision. Finally, this scheme cannot accelerate instructions that have a data dependency. Using a multiply-add unit together with an adder unit remedies part of the deficiencies of the prior art shown in Fig. 1, but the increase in hardware cost is too large, and this solution is equally unable to help instructions with a data dependency.

Compared with the prior art shown in Fig. 1, the present invention implements an operation of the form A+B+C×D, called the parallel multiply-add operation, and has the following advantages:

(1) It can process an add instruction (A+B) and a multiply instruction (C×D) at the same time, achieving instruction-level parallelism between add and multiply instructions.

(2) When two adjacent instructions have one of the following three data dependencies, they can be processed as a single instruction:

a) first instruction: E=A+B, second instruction: F=E+C

b) first instruction: E=A+B, second instruction: F=E+C×D

c) first instruction: E=A+C×D, second instruction: F=E+B

(3) Each time a parallel multiply-add instruction completes, the result of the multiplication C×D is also available, and the rounding mode of the multiplication can be specified separately.

Summary of the invention

The object of the present invention is to design a high-speed, high-performance, fully pipelined single-precision parallel floating-point multiply-add unit that improves the parallelism and execution efficiency of floating-point instructions while keeping hardware cost low.

The present invention provides a single-precision parallel floating-point multiply-add unit implemented as a three-stage pipeline. It performs the multiply-add operation A+B+(C×D), with A≥B, has a throughput of one instruction per cycle, and produces the result of C×D at the same time. The unit comprises:

First pipeline stage: composed of an exponent and sign processing unit (1), a first 74-bit alignment shifter (2), a second 74-bit alignment shifter (3), a sticky-bit calculator (4), a first bitwise inverter (5), a second bitwise inverter (6), a 3:2 carry-save adder (CSA) (7), a radix-4 Booth encoder (8), a partial-product compression tree (9) built from 3:2 carry-save adders, a 24-bit adder (10) and a selector (11), wherein:

The exponent and sign processing unit (1) computes, from the exponents and signs of operands A, B, C and D: the exponent exp of A+B+(C×D); the exponent exp_CD of C×D; whether the operation is an effective subtraction, sub; the tentative sign, sign, of A+B+(C×D); the sign sign_CD of C×D; the shift amounts mv_A and mv_B used to align A and B relative to C×D; and whether A and B must be bitwise inverted after alignment, sub_A and sub_B. Bitwise inversion inverts every bit, turning 1 into 0 and 0 into 1:

exp_CD = exp_C + exp_D,

sub = sign_A ⊕ sign_C ⊕ sign_D,

sign = sign_CD = sign_C ⊕ sign_D,

sub_A = sign_A ⊕ sign_C ⊕ sign_D,

sub_B = sign_B ⊕ sign_C ⊕ sign_D,

where sign_A, sign_B, sign_C and sign_D are the signs of operands A, B, C and D respectively, and exp_A, exp_B, exp_C and exp_D are the exponents of operands A, B, C and D respectively. According to the IEEE 754 standard, the sign of a single-precision floating-point number is its most significant bit and the exponent occupies its 2nd to 9th bits; ⊕ denotes the XOR operation;
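The sign-derived control signals above reduce to a few XOR operations. The following is a minimal sketch, assuming each sign is given as 0 (positive) or 1 (negative); the function name sign_controls and the dictionary layout are illustrative choices, not part of the described unit.

```python
def sign_controls(sign_a: int, sign_b: int, sign_c: int, sign_d: int) -> dict:
    """Derive sign_CD, sign, sub, sub_A and sub_B from the four operand signs."""
    sign_cd = sign_c ^ sign_d          # sign of the product C x D
    return {
        "sign_CD": sign_cd,
        "sign": sign_cd,               # tentative sign of A + B + C x D
        "sub": sign_a ^ sign_cd,       # effective subtraction flag
        "sub_A": sign_a ^ sign_cd,     # A must be inverted after alignment
        "sub_B": sign_b ^ sign_cd,     # B must be inverted after alignment
    }


# A is negative, B, C and D are positive: only A is inverted.
print(sign_controls(1, 0, 0, 0))
```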

When exp_CD-exp_A ≤ -51 and sign_A ≠ sign_B,

exp = exp_A,

mv_A = 0,

mv_B = exp-exp_B,

When exp_CD-exp_A ≤ -51 and sign_A = sign_B,

exp = exp_A+1,

mv_A = 1,

mv_B = exp-exp_B,

When -27 > exp_CD-exp_A > -51,

exp = exp_CD+51,

mv_A = exp-exp_A,

mv_B = exp-exp_B,

When 23 ≥ exp_CD-exp_A ≥ -27,

exp = exp_CD+27,

mv_A = exp-exp_A,

mv_B = exp-exp_B,

When exp_CD-exp_A > 23,

exp = exp_CD+1,

mv_A = exp-exp_A,

mv_B = exp-exp_B,
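The five exponent cases above can be written as one selection function. This is a sketch under the assumption that all exponents are plain integers; the name align_case and the tuple return value are hypothetical.

```python
def align_case(exp_a: int, exp_b: int, exp_cd: int, sign_a: int, sign_b: int):
    """Return (exp, mv_A, mv_B) following the five exponent-difference cases."""
    diff = exp_cd - exp_a
    if diff <= -51:
        if sign_a != sign_b:
            exp, mv_a = exp_a, 0
        else:
            # One extra bit of headroom when A and B have the same sign.
            exp, mv_a = exp_a + 1, 1
    elif diff < -27:            # -27 > diff > -51
        exp = exp_cd + 51
        mv_a = exp - exp_a
    elif diff <= 23:            # 23 >= diff >= -27
        exp = exp_cd + 27
        mv_a = exp - exp_a
    else:                       # diff > 23
        exp = exp_cd + 1
        mv_a = exp - exp_a
    mv_b = exp - exp_b
    return exp, mv_a, mv_b


print(align_case(exp_a=100, exp_b=98, exp_cd=100, sign_a=0, sign_b=0))
```

In every branch mv_A equals exp-exp_A, so the first two cases are simply the situations in which A itself fixes the reference exponent.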

The first 74-bit shifter (2) shifts the mantissa man_A of A to the right by the mv_A value obtained in the exponent and sign processing unit (1). According to the IEEE 754 standard, the stored mantissa of a single-precision floating-point number occupies its 10th to 32nd bits; a 1 is prepended as the most significant bit when the number is normalized, otherwise a 0 is prepended, and denormalized numbers are treated as 0. The shifted output is denoted align_A,

align_A = man_A >> mv_A,

where >> denotes a right shift;

The second 74-bit shifter (3) shifts the mantissa man_B of B to the right by the mv_B value obtained in the exponent and sign processing unit (1); the shifted output is denoted align_B,

align_B = man_B >> mv_B;

The sticky-bit calculation unit (4) computes the sticky bit st1_B from the shift result of the second 74-bit shifter (3) and the sub_B computed in the exponent and sign processing unit (1). When mv_B > 74: if sub_B=0 and the part of man_B shifted out of the 74-bit datapath is all 0, or sub_B=1 and the part of man_B shifted out of the 74-bit datapath is all 1, then st1_B=0; otherwise st1_B=1;

The first bitwise inverter (5) bitwise inverts align_A, the output of the first 74-bit shifter (2), if the sign bit of A differs from the sign bit of C×D; otherwise it passes align_A through unchanged. The output of the first bitwise inverter (5) is denoted inv_A;

The second bitwise inverter (6) inverts every bit of align_B, the output of the second 74-bit shifter (3), if the sign bit of B differs from the sign bit of C×D; otherwise it passes align_B through unchanged. The output of the second bitwise inverter (6) is denoted inv_B;

The outputs inv_A and inv_B of the first bitwise inverter (5) and the second bitwise inverter (6), together with the sub_A obtained in the exponent and sign processing unit (1), are fed into the 3:2 CSA (7) for one compression step, giving sum_AB and carry_AB, where

sum_AB = inv_A ^ inv_B ^ sub_A,

carry_AB = ((inv_A & inv_B) | (inv_A & sub_A) | (inv_B & sub_A)) << 1,

and the result of ANDing sub_B with st1_B is placed in the lowest bit of carry_AB,

carry_AB[73] = sub_B & st1_B,

Here ^, & and | denote bitwise XOR, bitwise AND and bitwise OR respectively, and << denotes a left shift;

The radix-4 Booth encoder (8) encodes the mantissa of C, and the encoded result is multiplied by the mantissa of D to give 13 partial products. These 13 partial products are fed into the 3:2 carry-save (CSA) compression tree (9), that is, a tree built from 3:2 CSAs. Each CSA compresses 3 inputs into 2 outputs. With inputs x, y, z and outputs s, c, the compression can be expressed as:

s = x ^ y ^ z,

c = ((x & y) | (x & z) | (y & z)) << 1,

Cascading 5 levels of 3:2 CSAs forms the 3:2 CSA tree, which compresses the 13 partial products into 2, denoted sum_CD and carry_CD respectively;
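The 3:2 compression and the depth of the tree can be checked with a short sketch. It assumes non-negative, unbounded Python integers as partial products (sign handling of Booth partial products and the fixed word width are ignored), and groups the words greedily in threes at each level; the names csa_3to2 and compress_tree are illustrative.

```python
def csa_3to2(x: int, y: int, z: int):
    """3:2 carry-save adder: s + c equals x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def compress_tree(words):
    """Reduce a list of words to two, counting CSA levels and CSA instances."""
    levels = 0
    csa_count = 0
    while len(words) > 2:
        nxt = []
        full = len(words) // 3 * 3
        for i in range(0, full, 3):
            nxt.extend(csa_3to2(*words[i:i + 3]))
            csa_count += 1
        nxt.extend(words[full:])   # leftover words pass to the next level
        words = nxt
        levels += 1
    return words[0], words[1], levels, csa_count


partials = list(range(1, 14))      # 13 stand-in partial products
s, c, levels, count = compress_tree(partials)
print(s + c == sum(partials), levels, count)
```

With 13 inputs the word count goes 13, 9, 6, 4, 3, 2, which gives the 5 levels stated above.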

The low 24 bits of sum_CD and carry_CD are fed into the 24-bit adder (10). The addition result is summarized in two outputs, st1_CD and st1_CD_3MSB. st1_CD records whether the low 24 bits of the addition result are all zero: if they are, st1_CD=0, otherwise st1_CD=1. st1_CD_3MSB records the top three bits of the 25-bit addition result;

The selector (11) chooses either st1_B or st1_CD as its output st1, according to the exponent range computed in the exponent and sign processing unit (1):

when -27 > exp_CD-exp_A > -51, st1=st1_CD; in all other cases st1=st1_B;

Second pipeline stage: composed of a 4:2 CSA (12), a 74-bit leading-zero anticipation module (13), a first 74-bit half adder (14), a second 74-bit half adder (15), a third 74-bit half adder (16), sign prediction logic (17), a selector (18), a 74-bit shifter (19) and an AND gate (20), wherein:

The 4:2 CSA (12), equivalent to two cascaded 3:2 CSAs, compresses the four inputs sum_AB, carry_AB, sum_CD and carry_CD into two: sum and carry. sum_CD and carry_CD are shifted according to the exponent range computed in the exponent and sign processing unit (1) before they enter the CSA: when exp_CD-exp_A < -27, the upper 24 bits of sum_CD and carry_CD are used as inputs; when 23 ≥ exp_CD-exp_A ≥ -27, sum_CD and carry_CD are used as they are; in the remaining case (exp_CD-exp_A > 23), sum_CD and carry_CD are shifted left by 26 bits and then used as inputs. After compression, the most significant bit of the st1_CD_3MSB obtained in (10) is stored in the lowest bit of carry;
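The statement that a 4:2 CSA is equivalent to two cascaded 3:2 CSAs can be illustrated as follows. This is a sketch on unbounded non-negative integers; the pre-shifting of sum_CD and carry_CD and the 74-bit width are not modelled, and the function names are assumptions.

```python
def csa_3to2(x: int, y: int, z: int):
    """3:2 carry-save adder: s + c equals x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def csa_4to2(a: int, b: int, c: int, d: int):
    """4:2 compression built from two cascaded 3:2 CSAs."""
    s1, c1 = csa_3to2(a, b, c)      # first level compresses three inputs
    return csa_3to2(s1, c1, d)      # second level absorbs the fourth input


sum_word, carry_word = csa_4to2(0x1234, 0x0F0F, 0x00FF, 0x7001)
print(sum_word + carry_word == 0x1234 + 0x0F0F + 0x00FF + 0x7001)
```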

The 74-bit leading-zero anticipation module (13) predicts the number of leading zeros in the sum of the outputs sum and carry of (12). The number of leading zeros is the number of bits from the most significant bit up to the first non-zero bit. If the sum of sum and carry is negative, the number of leading ones is predicted instead, that is, the number of bits from the most significant bit up to the first bit that is not 1. The prediction works as follows:

Each bit is examined together with its left and right neighbours to decide which bit may be the most significant one, and a prediction bit f_i is defined,

T = sum ⊕ carry, G = sum & carry, Z = \overline{sum} & \overline{carry},

f_{0} = \overline{T_{0}} T_{1}

f_{i} = T_{i-1} (G_{i} \overline{Z_{i+1}} + Z_{i} \overline{G_{i+1}}) + \overline{T_{i-1}} (Z_{i} \overline{Z_{i+1}} + G_{i} \overline{G_{i+1}}), i > 0

where sum and carry are the two outputs of (12), \overline{sum} denotes the bitwise inversion of sum, and T_i, G_i, Z_i denote the i-th bits of T, G and Z respectively. If f_i=1 and f_j=0 (j=0, 1, ..., i-1), then the leading-zero number (LZN) is i;
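The prediction bits f_i can be evaluated directly in software. The sketch below rests on several assumptions that the text does not state: index 0 is the most significant bit, Z is taken as the bitwise NOR of sum and carry, and positions beyond the least significant bit are treated as zeros (T=0, G=0, Z=1). The name lza_predict is hypothetical.

```python
def lza_predict(sum_word: int, carry_word: int, width: int) -> int:
    """Return the first index i (0 = MSB) with f_i = 1, or width if none."""
    mask = (1 << width) - 1
    t = (sum_word ^ carry_word) & mask
    g = (sum_word & carry_word) & mask
    z = ~(sum_word | carry_word) & mask

    def bit(word: int, i: int, beyond: int) -> int:
        # Index 0 is the MSB; positions past the LSB return `beyond`.
        if i >= width:
            return beyond
        return (word >> (width - 1 - i)) & 1

    for i in range(width):
        if i == 0:
            f = (1 - bit(t, 0, 0)) & bit(t, 1, 0)
        else:
            t_prev = bit(t, i - 1, 0)
            g_i, z_i = bit(g, i, 0), bit(z, i, 1)
            g_n, z_n = bit(g, i + 1, 0), bit(z, i + 1, 1)
            after_t = (g_i & (1 - z_n)) | (z_i & (1 - g_n))
            after_not_t = (z_i & (1 - z_n)) | (g_i & (1 - g_n))
            f = (t_prev & after_t) | ((1 - t_prev) & after_not_t)
        if f:
            return i
    return width


# 8-bit example: sum = 00010000, carry = 0 has three leading zeros.
print(lza_predict(0b00010000, 0, 8))
```

For this input the sketch returns 2 while the exact count of leading zeros is 3. A prediction that can be one position short of the exact count is consistent with the 1-bit correction that the final add/round unit is described as applying.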

If a half adder has inputs x, y and outputs s, c, its operation can be expressed as:

s = x ^ y,

c = (x & y) << 1,

The first half adder (14) processes the sum and carry output by the 4:2 CSA (12) according to this principle, giving the outputs sum_HApos and carry_HApos;

The bitwise inverses of sum and carry are the inputs of the second half adder (15), whose outputs are sum_HAinv and carry_HAinv; the lowest bit of carry_HAinv is set to 1;

sum_HAinv and carry_HAinv, obtained from the inverted inputs, are the inputs of the third half adder (16), whose outputs are sum_HAcom and carry_HAcom; the lowest bit of carry_HAcom is set to 1, so that sum_HAcom + carry_HAcom is equivalent to the complement form of sum + carry;
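The way two half-adder stages produce the complement of sum + carry can be checked numerically. The sketch assumes one reading of the text: the second half adder takes the inverted sum and carry, the third takes the second's outputs unchanged, and each stage sets the vacant lowest carry bit to 1. An 8-bit width is used only to keep the example small; the function names are illustrative.

```python
def half_adder(x: int, y: int, width: int):
    """Bitwise half adder: s + c equals x + y modulo 2**width."""
    mask = (1 << width) - 1
    s = (x ^ y) & mask
    c = ((x & y) << 1) & mask      # lowest bit of c is always 0
    return s, c


def complement_pair(sum_word: int, carry_word: int, width: int):
    """Return (sum_HAcom, carry_HAcom) whose sum is -(sum + carry) mod 2**width."""
    mask = (1 << width) - 1
    s_inv, c_inv = half_adder(~sum_word & mask, ~carry_word & mask, width)
    c_inv |= 1                     # first +1 goes into the vacant lowest bit
    s_com, c_com = half_adder(s_inv, c_inv, width)
    c_com |= 1                     # second +1 goes into the vacant lowest bit
    return s_com, c_com


s_com, c_com = complement_pair(0x35, 0x12, 8)
print(hex((s_com + c_com) & 0xFF), hex(-(0x35 + 0x12) & 0xFF))
```

Inverting both words contributes -sum-1 and -carry-1, so exactly two injected ones are needed to reach -(sum+carry), which is why each of the two stages sets a carry bit.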

The sign prediction module (17) predicts the sign by checking whether a carry is produced out of the most significant bit of sum+carry. If a carry is produced, the addition result is negative and the output signal complement is set to 1; otherwise complement=0;

The selector (18) selects, according to the sign prediction result, one of the pairs sum_HApos, carry_HApos and sum_HAcom, carry_HAcom as its output, denoted sum_HA, carry_HA:

when complement=0, sum_HA=sum_HApos, carry_HA=carry_HApos,

when complement=1, sum_HA=sum_HAcom, carry_HA=carry_HAcom;

The 74-bit shifter (19) shifts the output of the selector (18) to the left according to the leading-zero anticipation result; the shift amount is LZN, and the shifted outputs are denoted sum_Nor and carry_Nor;

The AND gate (20) ANDs the output complement of the sign prediction module (17) with the output sign of the exponent and sign processing unit (1), giving the sign of A+B+C×D;

Third pipeline stage: composed of the exponent calculation unit (21) of A+B+C×D, the final add/round unit (22) of A+B+C×D, the exponent correction unit (23) of C×D, and the final add/round unit (24) of C×D, wherein:

The exponent calculation unit (21) of A+B+C×D computes the exponent of A+B+C×D from the exp obtained in the exponent and sign processing unit (1), the LZN obtained in the 74-bit leading-zero anticipation module (13), and whether a 1-bit left shift occurred in the final add/round unit (22) of A+B+C×D. If no 1-bit left shift occurred in the final add/round unit (22), the exponent of A+B+C×D is exp-LZN; otherwise a 1-bit correction is needed and the final exponent of A+B+C×D is exp-LZN-1;

In the final add/round unit (22) of A+B+C×D, the outputs sum_Nor and carry_Nor of the 74-bit shifter (19) are first added, and the result is denoted ABCD_added,

ABCD_added = sum_Nor + carry_Nor,

Rounding is then performed according to ABCD_added, the st1 obtained in the selector (11), and the rounding mode. The rounding modes are: round to nearest (RN), round toward positive infinity (RP), round toward negative infinity (RM) and round toward zero (RZ). In practice, these four rounding modes can be reduced to three: RN, RI, RZ;

RZ(x) = ⌊x⌋

where ⌈x⌉ and ⌊x⌋ denote rounding x up and rounding x down respectively;

For negative numbers, rounding mode RP is equivalent to RI and RM is equivalent to RN; for positive numbers, RP is equivalent to RN and RM is equivalent to RI;

The sticky bit st2 is computed first: if the most significant bit of ABCD_added is 1, then st2=|ABCD_added[25:74], otherwise st2=|ABCD_added[26:74], where the prefix | denotes the OR of all bits in the range. The overall sticky bit st is formed from the two parts st1 and st2:

st = st1 | st2

Then, from st, ABCD_added and the rounding mode RI, RN or RZ, two temporary values of the rounding result are computed, denoted rounding_result_tmp1 and rounding_result_tmp2. rounding_result_tmp1 is computed as follows:

When RI=1,

if st=1 or ABCD_added[24]=1,

rounding_result_tmp1 = ABCD_added[0:23]+1;

otherwise,

rounding_result_tmp1 = ABCD_added[0:23];

When RI=0 and RN=1,

if ABCD_added[24]=0,

rounding_result_tmp1 = ABCD_added[0:23];

otherwise, if st=1,

rounding_result_tmp1 = ABCD_added[0:23]+1;

if ABCD_added[23]=1,

rounding_result_tmp1 = ABCD_added[0:23]+1;

otherwise,

rounding_result_tmp1 = ABCD_added[0:23];

When RI=0 and RN=0,

rounding_result_tmp1 = ABCD_added[0:23];

rounding_result_tmp2 is computed as follows:

When RI=1,

if st=1 or ABCD_added[25]=1,

rounding_result_tmp2 = ABCD_added[1:24]+1;

otherwise,

rounding_result_tmp2 = ABCD_added[1:24];

When RI=0 and RN=1,

if ABCD_added[25]=0,

rounding_result_tmp2 = ABCD_added[1:24];

otherwise, if st=1,

rounding_result_tmp2 = ABCD_added[1:24]+1;

if ABCD_added[24]=1,

rounding_result_tmp2 = ABCD_added[1:24]+1;

otherwise,

rounding_result_tmp2 = ABCD_added[1:24];

When RI=0 and RN=0,

rounding_result_tmp2 = ABCD_added[1:24];

Finally, one of rounding_result_tmp1 and rounding_result_tmp2 is chosen as the final mantissa of A+B+C×D, and whether the exponent in (21) needs a 1-bit correction is determined from the most significant bit of ABCD_added and the most significant bit of rounding_result_tmp1:

If the most significant bit of rounding_result_tmp1 is 1 and the most significant bit of ABCD_added is 0, or if the most significant bit of ABCD_added is 1, rounding_result_tmp1 is chosen as the final result and no 1-bit correction is needed in (21); otherwise rounding_result_tmp2 is chosen as the final result and a 1-bit correction is needed in (21);
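The two rounding candidates and the final selection can be sketched in software. The sketch assumes that ABCD_added is held as a 74-bit integer with index 0 as the most significant bit, that the sticky bit st is supplied by the caller, and that the RN tie case rounds up only when the lowest kept bit is 1; mantissa overflow after the increment is not handled. All function names are hypothetical.

```python
WIDTH = 74  # datapath width used in the text


def bits(word: int, first: int, last: int) -> int:
    """Extract bits first..last (index 0 is the MSB) as an integer."""
    return (word >> (WIDTH - 1 - last)) & ((1 << (last - first + 1)) - 1)


def rounding_candidate(word: int, offset: int, st: int, mode: str) -> int:
    """offset=0 gives rounding_result_tmp1, offset=1 gives rounding_result_tmp2."""
    mant = bits(word, offset, offset + 23)          # 24 kept bits
    guard = bits(word, offset + 24, offset + 24)    # first discarded bit
    if mode == "RI":
        return mant + 1 if (st or guard) else mant
    if mode == "RN":
        if guard == 0:
            return mant
        if st:
            return mant + 1
        return mant + 1 if (mant & 1) else mant     # tie: look at lowest kept bit
    return mant                                     # RZ truncates


def select_mantissa(word: int, st: int, mode: str):
    """Return (mantissa, needs_one_bit_exponent_correction)."""
    tmp1 = rounding_candidate(word, 0, st, mode)
    tmp2 = rounding_candidate(word, 1, st, mode)
    msb_added = bits(word, 0, 0)
    msb_tmp1 = (tmp1 >> 23) & 1
    if msb_added == 1 or msb_tmp1 == 1:
        return tmp1, False
    return tmp2, True


tie_word = (0x800001 << 50) | (1 << 49)             # odd mantissa, guard bit set
print(select_mantissa(tie_word, st=0, mode="RN"))
```

Computing both candidates before the most significant bit is known mirrors the text: the choice between them, and the matching exponent correction, is made last.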

The exponent correction unit (23) of C×D decides, according to whether a 1-bit left shift was performed in the final add/round unit (24) of C×D, whether the exp_CD output by the exponent and sign processing unit (1) must be corrected before it is used as the final exponent of C×D. If the final add/round unit (24) of C×D determines that a correction is needed, the final exponent of C×D is exp_CD-1; otherwise the final exponent of C×D is exp_CD;

The final add/round unit (24) of C×D computes the mantissa of C×D, and determines whether a 1-bit correction is needed, from the upper 24 bits of the sum_CD and carry_CD obtained in the partial-product compression tree (9) built from 3:2 carry-save adders and from the st1_CD and st1_CD_3MSB obtained in (10);

First, the upper 24 bits of sum_CD and carry_CD are added to the most significant bit of st1_CD_3MSB to give CD_added:

CD_added = sum_CD[0:23]+carry_CD[0:23]+st1_CD_3MSB[0],

The mantissa of C×D is then computed by a method similar to that of the final add/round unit (22) of A+B+C×D. Two temporary values, rounding_result_CD_tmp1 and rounding_result_CD_tmp2, are computed first,

rounding_result_CD_tmp1 is computed as follows:

If RI=1,

if st1_CD=1 or st1_CD_3MSB[1]=1,

rounding_result_CD_tmp1 = CD_added+1;

otherwise,

rounding_result_CD_tmp1 = CD_added;

If RI=0 and RN=1,

if st1_CD_3MSB[1]=0,

rounding_result_CD_tmp1 = CD_added;

otherwise, if st=1,

rounding_result_CD_tmp1 = CD_added+1;

if st1_CD_3MSB[1]=1 and CD_added[23]=1,

rounding_result_CD_tmp1 = CD_added+1;

otherwise,

rounding_result_CD_tmp1 = CD_added;

If RI=0 and RN=0,

rounding_result_CD_tmp1 = CD_added;

rounding_result_CD_tmp2 is computed as follows:

If RI=1,

if st1_CD=1 or st1_CD_3MSB[2]=1,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]}+1;

otherwise,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

If RI=0 and RN=1,

if st1_CD_3MSB[2]=0,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

otherwise, if st=1,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]}+1;

if st1_CD_3MSB[2]=1 and st1_CD_3MSB[1]=1,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]}+1;

otherwise,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

If RI=0 and RN=0,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

If the most significant bit of rounding_result_CD_tmp1 is 1 and the most significant bit of CD_added is 0, or if the most significant bit of CD_added is 1, rounding_result_CD_tmp1 is chosen as the final result of C×D and no 1-bit correction is needed in the exponent correction unit (23) of C×D; otherwise rounding_result_CD_tmp2 is chosen as the final result of C×D and a 1-bit correction is needed in the exponent correction unit (23) of C×D.
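The split of the C×D addition between the 24-bit adder (10) and the final add/round unit (24) can be illustrated with a sketch. It assumes sum_CD and carry_CD are 48-bit non-negative integers and models only the carry hand-off and st1_CD, not the rounding; the name split_add is hypothetical.

```python
LOW_BITS = 24  # width handled by the 24-bit adder (10)


def split_add(sum_cd: int, carry_cd: int):
    """Return (CD_added, st1_CD) using a low-half add plus a carry into the high half."""
    low_mask = (1 << LOW_BITS) - 1
    low = (sum_cd & low_mask) + (carry_cd & low_mask)   # 25-bit result of adder (10)
    carry_in = low >> LOW_BITS                          # top bit, i.e. st1_CD_3MSB[0]
    st1_cd = int((low & low_mask) != 0)                 # sticky bit of the low half
    cd_added = (sum_cd >> LOW_BITS) + (carry_cd >> LOW_BITS) + carry_in
    return cd_added, st1_cd


sum_cd, carry_cd = 0xABCDEF123456, 0x0123456789AB
cd_added, st1_cd = split_add(sum_cd, carry_cd)
print(cd_added == (sum_cd + carry_cd) >> LOW_BITS, st1_cd)
```

Because the low half is added one stage earlier, the third stage only needs a 24-bit addition with a single carry-in to obtain the upper part of the product.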

The present invention is implemented as a three-stage pipeline in Verilog HDL and, after verification, synthesized with a 0.18-micron standard cell library. Timing analysis of the synthesis result shows a maximum delay of 2.89 ns. Evaluation with SPEC 2000 shows that the invention achieves a performance improvement of about 20% over an ordinary multiply-add unit.

Description of drawings

Fig. 1 is a block diagram of an existing multiply-add unit described in reference 1, Floating-Point Multiply-Add-Fused with Reduced Latency;

Fig. 2 is a block diagram of the single-precision parallel floating-point multiply-add unit of the present invention, implemented as a three-stage pipeline;

Fig. 3a is a schematic diagram of the placement of A in the datapath relative to C×D after alignment shifting when exp_CD-exp_A ≤ -51 and sign_A ≠ sign_B;

Fig. 3b is a schematic diagram of the placement of A in the datapath relative to C×D after alignment shifting when exp_CD-exp_A ≤ -51 and sign_A = sign_B;

Fig. 3c is a schematic diagram of the placement of A in the datapath relative to C×D after alignment shifting when -27 > exp_CD-exp_A > -51;

Fig. 3d is a schematic diagram of the placement of A in the datapath relative to C×D after alignment shifting when 23 ≥ exp_CD-exp_A ≥ -27;

Fig. 3e is a schematic diagram of the placement of A in the datapath relative to C×D after alignment shifting when exp_CD-exp_A > 23;

Fig. 4 is a block diagram of the implementation of the multiplier compression tree, which consists of 11 49-bit CSAs.

Embodiment

The present invention is described in further detail below with reference to the drawings and a specific embodiment.

The present invention is implemented as a three-stage pipeline in Verilog HDL and, after verification, synthesized with a 0.18-micron standard cell library.

The single-precision parallel floating-point unit of the present invention is divided in time into three pipeline stages. The whole operation is described below with reference to Fig. 2. In this embodiment, A+B+C×D again denotes one parallel multiply-add operation, and B is less than or equal to A, which is arranged in advance by the compiler.

First pipeline stage: alignment shifting of A and B, and Booth encoding and partial-product compression of C×D.

The radix-4 Booth encoder 8 encodes the mantissa of C, and the encoded result is multiplied by the mantissa of D to give 13 partial products. The 13 encoded partial products are fed into the 3:2 carry-save (CSA) compression tree 9, whose structure is shown in detail in Fig. 4. In Fig. 4 the inputs x, y, z of each unit module are the three 49-bit numbers to be compressed, and the outputs S and C are the 49-bit sum word and carry word after compression. Their logical relationship is:

S = x ^ y ^ z,

C = ((x & y) | (x & z) | (y & z)) << 1,

Here ^, & and | denote bitwise XOR, bitwise AND and bitwise OR respectively, and << denotes a left shift.

The inputs in1 to in13 of Fig. 4 are the 13 partial products obtained after Booth encoding, and the outputs are the sum word and carry word obtained after compression, that is, the outputs sum_CD and carry_CD of 9 in Fig. 2. The whole compression tree consists of 11 49-bit CSAs; compressing 13 partial products into two requires a 5-level CSA tree.
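Why a 24-bit mantissa yields 13 partial products can be seen from a radix-4 Booth recoding sketch. It assumes the mantissa of C is an unsigned 24-bit integer and uses the textbook radix-4 recoding, in which each digit lies in the range -2 to 2; the patent does not spell out its encoder, so this is an illustration rather than the described circuit, and the function name is hypothetical.

```python
def booth_radix4_digits(c: int, width: int = 24):
    """Recode an unsigned integer into radix-4 Booth digits, least significant first."""
    ndigits = width // 2 + 1        # one extra digit because c is unsigned
    padded = c << 1                 # appends the implicit bit b[-1] = 0
    digits = []
    for i in range(ndigits):
        triple = (padded >> (2 * i)) & 0b111
        b_low = triple & 1          # b[2i-1]
        b_mid = (triple >> 1) & 1   # b[2i]
        b_high = (triple >> 2) & 1  # b[2i+1]
        digits.append(b_low + b_mid - 2 * b_high)
    return digits


mantissa_c = 0xFFFFFF
digits = booth_radix4_digits(mantissa_c)
print(len(digits), sum(d * 4 ** i for i, d in enumerate(digits)) == mantissa_c)
```

Each digit selects 0, ±D or ±2D as one partial product, so 13 digits give the 13 partial products that enter the compression tree.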

The alignment shifting and inversion of A and B are executed in parallel with the Booth encoding and partial-product compression of C×D. If the sign of A or B differs from the sign of C×D, its two's complement is needed. The two's complement of a number is obtained by inverting it and adding one. The required +1 can be supplied through the vacant lowest bit of the 3:2 CSA carry word. A_Inv denotes the output obtained after the mantissa of A has been aligned and inverted (no inversion is needed if the sign bit of A is the same as the sign bit of C×D).

In an ordinary multiply-add unit (here A+C×D denotes the ordinary multiply-add operation), A is generally aligned as follows: it is initially placed starting 26 bits to the left of the most significant bit of C×D and is then shifted right according to the exponent difference. Two vacant bits lie between the initial position of A and the most significant bit of C×D; their purpose is to guarantee correct rounding when A is much larger than C×D. In EMAF there are two addends, so a new alignment strategy is required. The present invention distinguishes five cases according to the exponents of A, C and D and applies a different alignment strategy to each. The five cases are:

1) exp_CD-exp_A ≤ -51 and sign_A ≠ sign_B

2) exp_CD-exp_A ≤ -51 and sign_A = sign_B

3) -27 > exp_CD-exp_A > -51

4) 23 ≥ exp_CD-exp_A ≥ -27

5) exp_CD-exp_A > 23

where sign_A, sign_B, sign_C and sign_D are the signs of operands A, B, C and D respectively, and exp_A, exp_B, exp_C and exp_D are the exponents of operands A, B, C and D respectively. According to the IEEE 754 standard, the sign of a single-precision floating-point number is its most significant bit and the exponent occupies its 2nd to 9th bits.

Data channel under the various situations, and A displacement alignment back is with respect to the display case of C * D in data channel as shown in Figure 3.Do not provide B among Fig. 3 the putting of data channel, this is because B does not exert an influence to the form of data passage, and only is earlier its most significant digit from data channel to be begun to deposit, and poor according to its index with C * D then, relative C * D is shifted.

When exp_CD-exp_A≤-51 and sign_A ≠ sign_B, the formation of data channel is shown in Fig. 3 (a), A is far longer than C * D, A is put since the most significant digit of the data channel of 74 bits, high 24 bits of C * D place on low 24 of 74 Bit data passages, and its low 24 bits place outside the data channel.If the index difference of B and A is smaller or equal to 24, then after the B displacement alignment, its lowest order will be on the left side of C * D most significant digit, and C * D will not influence net result fully, except rounding off; If the index difference of B and A is greater than 24, then after the B displacement alignment, its most significant digit will be on the right of A lowest order, and B and C * D will not influence the result of final A+B+C * D.Summing up above-mentioned two kinds of situations can find, the part that B and C * D shift out data channel does not in this case all have influence to the result of final A+B+C * D, does not just need to have considered yet.

When exp_CD-exp_A ≤ -51 and sign_A = sign_B, the data path is organized as shown in Fig. 3(b). A is placed starting at the second bit of the data path, which prevents the final result of A+B+C×D from overflowing the data path; everything else is similar to the previous case.

When -27 > exp_CD-exp_A > -51, the data path is organized as shown in Fig. 3(c). C×D is placed in the data path as in the previous two cases. Before shifting, A is placed in the uppermost 24 bits of the 74-bit data path and is then shifted according to the exponent difference between A and C×D. Because this exponent difference lies between -51 and -27, the least significant bit of A after shifting lies to the left of the most significant bit of C×D. When the shift amount of B is greater than 50, both B and C×D are partly outside the data path, but the shifted B and C×D then lie to the right of the least significant bit of A and do not affect the final result of A+B+C×D, so the parts of B and C×D that are shifted out of the data path need not be considered together. Note that in this case adding the lower 24 bits of the two partial products of the compressed C×D may generate a carry, and this carry must be taken into account.

When 23 ≥ exp_CD-exp_A ≥ -27, the data path is organized as shown in Fig. 3(d). C×D is placed in the lower 48 bits of the data path. Before shifting, A is placed in the uppermost 24 bits of the 74-bit data path and is then shifted according to the exponent difference between A and C×D. Because this exponent difference lies between -27 and 23, A after shifting may lie anywhere in the data path but cannot be shifted out of it.

When exp_CD-exp_A > 23, the data path is organized as shown in Fig. 3(e). C×D is placed in the upper 48 bits of the data path. Before shifting, A is placed in the uppermost 24 bits of the 74-bit data path and is then shifted according to the exponent difference between A and C×D. Because this exponent difference is greater than 23, A after shifting may lie anywhere to the right of bit 25 of the data path and may even be shifted out of it. When A is shifted out of the data path, its most significant bit lies to the right of the least significant bit of C×D; because B is less than or equal to A, B is then also much smaller than C×D, and neither A nor B affects the final result.

The five cases above lead to the following observations:

1) A is never effectively shifted out of the data path. That is, when A is shifted out of the data path (possible only in the case shown in Fig. 3(e)), it does not affect the final result of A+B+C×D. Complementing A is therefore greatly simplified: when sign_A ≠ sign_C ⊕ sign_D (in which case sub_A=1), it suffices to add 1 at the least significant bit of the data path. Here ⊕ denotes XOR.

2) B may be shifted out of the data path in every case. Only when sign_B ≠ sign_C ⊕ sign_D (in which case sub_B=1) and the bits of B shifted out of the data path are all 0 (in which case st1_B=0) must 1 be added at the least significant bit of the data path to complete the complement of B.

3) After shifting, B and C×D may both be partly outside the data path, but in that situation B and C×D do not affect the data path, so there is no need to consider whether adding the parts of B and C×D outside the data path generates a carry.

4) When -27 > exp_CD-exp_A > -51, adding the lower 24 bits of the two partial products of the compressed C×D may generate a carry, and this carry must be taken into account.

The additions of 1 needed to complement A and B are performed by unit 7 of Fig. 2. Because two such additions are required, a 3:2 CSA is introduced here: the 1 required to complement A is supplied as an input of the CSA, and the 1 required to complement B uses the vacant least significant bit of the carry word output by the CSA. Because the delay of multiplier encoding and partial-product compression is greater than that of aligning A and B, this CSA does not lengthen the critical path.
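The way a single 3:2 CSA can absorb both additions of 1 can be illustrated with the following sketch. It is a simplified model, not the patented circuit: the names `csa32` and `combine_a_b` are hypothetical, Python integers number the least significant bit as bit 0 (the patent numbers it 73), and the sticky condition on B (st1_B) is ignored, so the 1 for B is added whenever sub_b is set.

```python
WIDTH = 74
MASK = (1 << WIDTH) - 1

def csa32(x, y, z):
    """3:2 carry-save adder: s + c equals x + y + z modulo 2**WIDTH."""
    s = x ^ y ^ z
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK
    return s, c

def combine_a_b(align_a, align_b, sub_a, sub_b):
    """Hypothetical model of unit 7: invert operands that are subtracted,
    feed the +1 for A in as the third CSA input, and place the +1 for B
    in the carry word's least significant bit, which the left shift
    always leaves vacant."""
    inv_a = (align_a ^ MASK) if sub_a else align_a
    inv_b = (align_b ^ MASK) if sub_b else align_b
    sum_ab, carry_ab = csa32(inv_a, inv_b, sub_a)
    carry_ab |= sub_b  # bit 0 of carry_ab is 0 before this line
    return sum_ab, carry_ab
```

Adding the two outputs modulo 2**74 gives ±A ± B in two's complement, so no separate incrementer is needed for either operand.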

When -27 > exp_CD-exp_A > -51, the carry out of the lower 24 bits of the two partial products of the compressed C×D affects the final result. This carry is brought into the data path by placing st1_CD_3MSB[0] in the least significant bit of the carry word of the 4:2 CSA in Fig. 2, where st1_CD_3MSB is the upper three bits of the 25-bit result obtained by adding the lower 24 bits of the two partial products sum_CD and carry_CD of the compressed C×D.
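A minimal sketch of how the lower 24 bits of the two partial products could be summarized is given below. The function name `low_half_summary` is hypothetical, and the sketch assumes that index 0 of st1_CD_3MSB denotes its most significant bit, i.e. the carry out of the 24-bit addition.

```python
LOW_BITS = 24
LOW_MASK = (1 << LOW_BITS) - 1

def low_half_summary(sum_cd, carry_cd):
    """Add the lower 24 bits of sum_CD and carry_CD (25-bit result).

    Returns (st1_cd, st1_cd_3msb, carry_out), where
      st1_cd      is 1 if the lower 24 bits of the result are not all zero,
      st1_cd_3msb is the upper three bits of the 25-bit result,
      carry_out   is the top bit of st1_cd_3msb (assumed to be index 0).
    """
    total = (sum_cd & LOW_MASK) + (carry_cd & LOW_MASK)
    st1_cd = 1 if (total & LOW_MASK) != 0 else 0
    st1_cd_3msb = (total >> (LOW_BITS - 2)) & 0b111  # bits 24, 23, 22
    carry_out = (st1_cd_3msb >> 2) & 1
    return st1_cd, st1_cd_3msb, carry_out
```

In the case -27 > exp_CD-exp_A > -51, carry_out is the bit that would be placed in the least significant position of the 4:2 CSA carry word.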

Second pipeline stage: the aligned A and B and the compressed partial products of C×D are compressed by a 4:2 CSA, after which leading-zero prediction, sign prediction, the half-add operation and the normalization shift are carried out.

The previous pipeline stage produces the results sum_AB and carry_AB of aligning A and B (the outputs of unit 7 in Fig. 2) and the two partial products sum_CD and carry_CD of the compressed C×D. A 4:2 CSA first compresses these four inputs into two, denoted sum and carry. sum and carry are then fed to the leading-zero prediction unit 13, which computes the number of leading zeros (denoted LZN).
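The 4:2 compression step can be modelled in a few lines. The claims describe the 4:2 CSA as equivalent to two cascaded 3:2 CSAs, and the sketch below follows that description; the names `csa32` and `csa42` are hypothetical, and the conditional pre-shifting of sum_CD and carry_CD is left out.

```python
WIDTH = 74
MASK = (1 << WIDTH) - 1

def csa32(x, y, z):
    """3:2 carry-save adder: s + c equals x + y + z modulo 2**WIDTH."""
    s = x ^ y ^ z
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK
    return s, c

def csa42(sum_ab, carry_ab, sum_cd, carry_cd):
    """4:2 compression built from two cascaded 3:2 stages."""
    s1, c1 = csa32(sum_ab, carry_ab, sum_cd)
    return csa32(s1, c1, carry_cd)
```

The two outputs still represent the same value as the four inputs modulo 2**74; only one carry-propagate addition remains, and it is deferred to the third stage.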

If sum and carry were simply normalized and then added, a negative result would still have to be complemented after the addition, which adds delay. To avoid this delay, the sign of sum+carry is determined during leading-zero prediction; if sum+carry < 0, the complemented forms of sum and carry are selected for the subsequent processing, such as the normalization shift, the final addition and rounding. The additions of 1 required to complement sum and carry are implemented using the vacant least significant bits of the carry words of half adders 15 and 16.
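One way to see why vacant carry bits of half adders are enough to negate a carry-save pair is the sketch below. It assumes, purely for illustration, that the two half adders are cascaded and that each contributes one of the two required additions of 1; the text does not spell out this wiring, and the names `half_adder` and `negate_carry_save` are hypothetical.

```python
WIDTH = 74
MASK = (1 << WIDTH) - 1

def half_adder(x, y):
    """Bitwise half adder: s + c equals x + y modulo 2**WIDTH,
    and bit 0 of c is always 0."""
    s = x ^ y
    c = ((x & y) << 1) & MASK
    return s, c

def negate_carry_save(sum_, carry):
    """Return (s, c) with s + c == -(sum_ + carry) modulo 2**WIDTH.

    -(sum + carry) = (~sum + 1) + (~carry + 1), so two +1 terms are
    needed; each half-adder pass frees bit 0 of its carry word for one.
    """
    s1, c1 = half_adder(sum_ ^ MASK, carry ^ MASK)
    c1 |= 1  # first +1 in the vacant least significant bit
    s2, c2 = half_adder(s1, c1)
    c2 |= 1  # second +1
    return s2, c2
```

Because the negated pair is prepared in parallel with sign prediction, the selector only has to pick one of two ready-made pairs.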

The 74-bit shifter 19 applies the normalization shift to the output of selector 18 according to the LZN computed in 13; its outputs are denoted sum_Nor and carry_Nor.

Third pipeline stage: using sum_Nor and carry_Nor output by the second pipeline stage, the final addition and rounding are performed and the exponent of A+B+C×D is computed. At the same time, the mantissa and exponent of C×D are computed from the output of the first pipeline stage.

In 22, sum_Nor and carry_Nor are first added, and the result is denoted ABCD_added. Then, according to the rounding mode, the results obtained with the 25th bit and with the 26th bit as the rounding bit are computed and denoted rounding_result_tmp1 and rounding_result_tmp2 respectively. If the most significant bit of rounding_result_tmp1 is 1 and the most significant bit of ABCD_added is 0, or if the most significant bit of ABCD_added is 1, rounding_result_tmp1 is chosen as the final result; otherwise rounding_result_tmp2 is chosen as the final result.

In 21, the exponent of A+B+C×D is computed from the provisional exponent of the data path computed in 1 and the normalization shift amount computed in 13, and is then corrected according to the result of 22: if rounding_result_tmp2 is chosen as the final result, the exponent is decreased by 1.
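The selection between the two rounding candidates and the matching exponent correction can be sketched together. The function name `select_result` is hypothetical, and the base exponent is assumed to be the provisional exponent minus LZN; the text only says that both quantities are used, so that formula is an assumption.

```python
def select_result(abcd_msb, tmp1_msb, tmp1, tmp2, exp_tmp, lzn):
    """Pick the final mantissa and exponent.

    abcd_msb : most significant bit of ABCD_added
    tmp1_msb : most significant bit of rounding_result_tmp1
    exp_tmp  : provisional data path exponent (from unit 1)
    lzn      : normalization shift amount (from unit 13)
    """
    # Condition from the text: (tmp1_msb == 1 and abcd_msb == 0) or abcd_msb == 1
    use_tmp1 = (abcd_msb == 1) or (tmp1_msb == 1 and abcd_msb == 0)
    exponent = exp_tmp - lzn  # assumed combination of the two inputs
    if use_tmp1:
        return tmp1, exponent
    return tmp2, exponent - 1  # one-bit correction when tmp2 is chosen
```

The one-bit correction covers the case in which the predicted number of leading zeros is off by one position.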

Unit 24 computes the mantissa of C×D by a method similar to the one used in 22 to compute the mantissa of A+B+C×D; likewise, unit 23 corrects the exponent of C×D according to the result computed in 24.

Claims

1. A parallel floating-point multiply-add unit implementing the multiply-add operation A+B+(C×D), A ≥ B, characterized in that the floating-point multiply-add unit comprises a three-stage pipeline, has a throughput of one instruction per cycle, and can simultaneously produce the result of C×D; the floating-point multiply-add unit comprises:

exp_CD = exp_C + exp_D,

sub = sign_A ⊕ sign_C ⊕ sign_D,

sign = sign_CD = sign_C ⊕ sign_D,

sub_A = sign_A ⊕ sign_C ⊕ sign_D,

sub_B = sign_B ⊕ sign_C ⊕ sign_D,

where sign_A, sign_B, sign_C and sign_D are the signs of operands A, B, C and D, and exp_A, exp_B, exp_C and exp_D are their exponents; according to the IEEE 754 standard, the sign of a single-precision floating-point number is its most significant bit, and the exponent occupies the 2nd to 9th bits; ⊕ is the XOR operation;

When exp_CD-exp_A ≤ -51 and sign_A ≠ sign_B,

exp = exp_A,

mv_A = 0,

mv_B = exp-exp_B,

When exp_CD-exp_A ≤ -51 and sign_A = sign_B,

exp = exp_A+1,

mv_A = 1,

mv_B = exp-exp_B,

When -27 > exp_CD-exp_A > -51,

exp = exp_CD+51,

mv_A = exp-exp_A,

mv_B = exp-exp_B,

When 23 ≥ exp_CD-exp_A ≥ -27,

exp = exp_CD+27,

mv_A = exp-exp_A,

mv_B = exp-exp_B,

When exp_CD-exp_A > 23,

exp = exp_CD+1,

mv_A = exp-exp_A,

mv_B = exp-exp_B,

align_A = man_A >> mv_A,

where >> denotes a right shift;

align_B = man_B >> mv_B;
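The per-case formulas for exp, mv_A and mv_B listed above can be collected into one function, sketched here under stated assumptions: the name `datapath_exponent_and_shifts` is hypothetical and the exponents are treated as plain integers.

```python
def datapath_exponent_and_shifts(exp_a, exp_b, exp_cd, sign_a, sign_b):
    """Return (exp, mv_A, mv_B) following the five cases in the text."""
    d = exp_cd - exp_a
    if d <= -51:
        if sign_a != sign_b:
            exp, mv_a = exp_a, 0
        else:
            exp, mv_a = exp_a + 1, 1  # one guard position against overflow
    else:
        if d < -27:
            exp = exp_cd + 51
        elif d <= 23:
            exp = exp_cd + 27
        else:
            exp = exp_cd + 1
        mv_a = exp - exp_a
    mv_b = exp - exp_b
    return exp, mv_a, mv_b
```

The aligned operands then follow as man_A >> mv_A and man_B >> mv_B. In the third case, for instance, mv_A equals (exp_CD - exp_A) + 51, which stays between 1 and 23.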

sum_AB = inv_A^inv_B^sub_A,

carry_AB = ((inv_A&inv_B)|(inv_A&sub_A)|(inv_B&sub_A)) << 1,

carry_AB[73] = sub_B&st1_B,

s = x^y^z,

c = ((x&y)|(x&z)|(y&z)) << 1,

The lower 24 bits of sum_CD and carry_CD are sent to the 24-bit adder (10), and the addition result is summarized in two outputs, st1_CD and st1_CD_3MSB: st1_CD records whether the lower 24 bits of the addition result are all zero; if they are all zero, st1_CD=0, otherwise st1_CD=1; st1_CD_3MSB records the upper three bits of the 25-bit addition result; selector (11) chooses one of st1_B and st1_CD as the output st1 according to the exponent range computed in the exponent and sign processing unit (1),

When -27 > exp_CD-exp_A > -51, st1 = st1_CD; in all other cases st1 = st1_B;

Second pipeline stage: composed of a 4:2 CSA (12), a 74-bit leading-zero prediction module (13), three 74-bit half adders (14), (15) and (16), sign prediction logic (17), a selector (18), a 74-bit shifter (19) and an AND gate (20); wherein the 4:2 CSA (12), which is equivalent to two cascaded 3:2 CSAs, compresses the four inputs sum_AB, carry_AB, sum_CD and carry_CD into two, sum and carry; sum_CD and carry_CD are shifted according to the exponent range computed in the exponent and sign processing unit (1) before being input to the CSA: when exp_CD-exp_A < -27, the upper 24 bits of sum_CD and carry_CD are used as input; when 23 ≥ exp_CD-exp_A ≥ -27, sum_CD and carry_CD are used as input directly; in the remaining case (exp_CD-exp_A > 23), sum_CD and carry_CD shifted left by 26 bits are used as input; after compression, the most significant bit of st1_CD_3MSB obtained in 10 is placed in the least significant bit of carry;

The 74-bit leading-zero prediction module (13) determines the number of leading zeros in the result of adding the outputs sum and carry of 12; the number of leading zeros is the number of bits from the most significant bit up to the first non-zero bit; if the result of adding sum and carry is negative, what is determined here is the number of leading ones, i.e. the number of bits from the most significant bit up to the first bit that is not 1; the method is as follows: by examining each bit position together with its left and right neighbours, it is determined which position may be the most significant bit, and a prediction bit f_i is defined,

T = sum ⊕ carry, G = sum & carry,

f_{0} = \overline{T_{0}} T_{1}

f_{i} = T_{i-1} (G_{i} \overline{Z_{i+1}} + Z_{i} \overline{G_{i+1}}) + \overline{T_{i-1}} (Z_{i} \overline{Z_{i+1}} + G_{i} \overline{G_{i+1}}), i > 0

where sum and carry are the two outputs of the 4:2 CSA (12),

s = x^y,

c = (x&y) << 1,

When complement=0, sum_HA = sum_HApos and carry_HA = carry_HApos,

When complement=1, sum_HA = sum_HAcom and carry_HA = carry_HAcom,

ABCD_added = sum_HAnor + carry_HAnor,

RZ(x)＝x

Here

Represent respectively to round up and round downwards with  x ;

st＝st1|st2，

Then, according to st, ABCD_added and the rounding mode RI, RN or RZ, two provisional rounding results are computed, denoted rounding_result_tmp1 and rounding_result_tmp2 respectively. rounding_result_tmp1 is computed as follows: when RI=1,

If st=1 or ABCD_added[24]=1,

rounding_result_tmp1 = ABCD_added[0:23]+1;

Otherwise,

rounding_result_tmp1 = ABCD_added[0:23];

When RI=0 and RN=1,

If ABCD_added[24]=0,

rounding_result_tmp1 = ABCD_added[0:23];

Otherwise, when st=1,

rounding_result_tmp1 = ABCD_added[0:23]+1;

Otherwise, when ABCD_added[23]=1,

rounding_result_tmp1 = ABCD_added[0:23]+1;

Otherwise,

rounding_result_tmp1 = ABCD_added[0:23];

When RI=0 and RN=0,

rounding_result_tmp1 = ABCD_added[0:23];

rounding_result_tmp2 is computed as follows:

When RI=1,

If st=1 or ABCD_added[25]=1,

rounding_result_tmp2 = ABCD_added[1:24]+1;

Otherwise,

rounding_result_tmp2 = ABCD_added[1:24];

When RI=0 and RN=1,

If ABCD_added[25]=0,

rounding_result_tmp2 = ABCD_added[1:24];

Otherwise, when st=1,

rounding_result_tmp2 = ABCD_added[1:24]+1;

Otherwise, when ABCD_added[24]=1,

rounding_result_tmp2 = ABCD_added[1:24]+1;

Otherwise,

rounding_result_tmp2 = ABCD_added[1:24];

When RI=0 and RN=0,

rounding_result_tmp2 = ABCD_added[1:24];
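The two rule sets for rounding_result_tmp1 and rounding_result_tmp2 differ only by a one-bit offset, which the following sketch makes explicit. It is an illustration only: the name `round_candidate` and the string mode labels are hypothetical, bit index 0 is taken to be the most significant bit of ABCD_added, and the result may need 25 bits when the increment overflows.

```python
def round_candidate(added, width, offset, st, mode):
    """Compute one provisional rounding result.

    added  : integer holding ABCD_added, `width` bits wide
    offset : 0 for rounding_result_tmp1 (bits [0:23], round bit 24),
             1 for rounding_result_tmp2 (bits [1:24], round bit 25)
    st     : sticky bit
    mode   : "RI", "RN" or "RZ" (RZ corresponds to RI=0 and RN=0)
    """
    def bit(i):
        # Bit i counted from the most significant end.
        return (added >> (width - 1 - i)) & 1

    mantissa = 0
    for i in range(offset, offset + 24):
        mantissa = (mantissa << 1) | bit(i)
    round_bit = bit(offset + 24)
    lsb = bit(offset + 23)

    if mode == "RI":
        increment = st == 1 or round_bit == 1
    elif mode == "RN":
        # Round to nearest, ties to even.
        increment = round_bit == 1 and (st == 1 or lsb == 1)
    else:
        increment = False  # truncate
    return mantissa + 1 if increment else mantissa
```

Computing both candidates in parallel allows the final choice to be made after the addition, once the position of the leading bit is known.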

CD_added = sum_CD[0:23] + carry_CD[0:23] + st1_CD_3MSB[0],

rounding_result_CD_tmp1 is computed as follows:

If RI=1,

If st1_CD=1 or st1_CD_3MSB[1]=1,

rounding_result_CD_tmp1 = CD_added+1;

Otherwise,

rounding_result_CD_tmp1 = CD_added;

If RI=0 and RN=1,

If st1_CD_3MSB[1]=0,

rounding_result_CD_tmp1 = CD_added;

Otherwise, if st=1,

rounding_result_CD_tmp1 = CD_added+1;

If st1_CD_3MSB[1]=1 and CD_added[23]=1,

rounding_result_CD_tmp1 = CD_added+1;

Otherwise,

rounding_result_CD_tmp1 = CD_added;

If RI=0 and RN=0,

rounding_result_CD_tmp1 = CD_added;

rounding_result_CD_tmp2 is computed as follows:

If RI=1,

If st1_CD=1 or st1_CD_3MSB[2]=1,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]}+1;

Otherwise,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

If RI=0 and RN=1,

If st1_CD_3MSB[2]=0,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

Otherwise, if st=1,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]}+1;

If st1_CD_3MSB[2]=1 and st1_CD_3MSB[1]=1,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]}+1;

Otherwise,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};

If RI=0 and RN=0,

rounding_result_CD_tmp2 = {CD_added[1:23], st1_CD_3MSB[1]};