CN101178645A - Paralleling floating point multiplication addition unit - Google Patents

Info

Publication number
CN101178645A
Authority
CN
China
Prior art keywords
result
exp
rounding
carry
sum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101799736A
Other languages
Chinese (zh)
Other versions
CN100570552C (en
Inventor
李兆麟
李恭琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2007101799736A priority Critical patent/CN100570552C/en
Publication of CN101178645A publication Critical patent/CN101178645A/en
Application granted granted Critical
Publication of CN100570552C publication Critical patent/CN100570552C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to a parallel floating-point fused multiply-add unit that simplifies comparable designs, performs the multiply-add operation A+B+C×D (with A ≥ B), and also delivers the result of C×D. It is organized as a three-stage pipeline. In the first stage, A and B are shifted and aligned, and C×D is Booth-encoded and its partial products are compressed. In the second stage, the aligned A and B and the compressed partial products of C×D are compressed by a 4:2 CSA, after which leading-zero anticipation, sign prediction, half-add operations and the normalization shift are performed. In the third stage, the final addition and rounding of A+B+C×D are performed and its exponent is computed, while the mantissa and exponent of C×D are computed from the output of the first stage. The invention achieves instruction-level parallelism: it completes an add instruction and a multiply instruction at the same time, and it can also accelerate two consecutive data-dependent instructions.

Description

A parallel floating-point multiply-add unit
Technical field
The present invention relates to floating-point unit design. It is a high-speed floating-point multiply-add unit for high-performance floating-point computation.
Background art
Published data show that almost 50% of floating-point multiply instructions are immediately followed by a floating-point add or subtract. The fused multiply-add operation A+B×C has therefore become a basic operation in scientific computing and multimedia applications. Because this operation occurs so frequently in application programs, implementing it with a fused multiply-add unit (MAF unit for short) has become a good choice for modern high-performance commercial processors. This implementation has two main advantages: (1) only one rounding is needed instead of two; (2) sharing some component modules reduces circuit delay and hardware cost. A multiply-add (MAF) instruction needs three operands and performs A+(B×C). For example, setting operand A to 0 turns the multiply-add instruction into a multiply instruction, and setting operand B or C to 1 turns it into an add instruction. In most existing processors, the floating-point multiply-add operation is generally implemented in the following steps (see reference 1, "Floating-Point Multiply-Add-Fused with Reduced Latency"; a block diagram of the implementation is shown in Fig. 1):
1. First, the multiplicand C is Booth-encoded, and a compression tree built from carry-save adders (CSA) then computes B×C, yielding two partial products. While the multiplication is carried out, operand A is inverted and alignment-shifted. The signs of A and B×C may be the same or opposite. If they are opposite, A and B×C undergo an effective subtraction: the two's complement of A must be added, so A is inverted. If they are the same, an effective addition is performed and A is not inverted. In what follows, A after passing through the inverter is written A_Inv, whether or not the inversion was actually applied.
In the IEEE 754 standard, the mantissa of a single-precision operand has 24 bits. With 2 extra rounding bits added, the most significant bit of A_Inv lies at most 26 bits to the left of the most significant bit of the B×C result, or at most 48 bits to its right; that is, the shift range is [-26, 48]. To simplify the shifting, the alignment of A is always a right shift in multiply-add designs. A_Inv is therefore initially placed 26 bits to the left of B×C, and the number of bits by which A_Inv is shifted right during alignment is 27-(exp(A)-(exp(B)+exp(C)-127)), where exp(A), exp(B) and exp(C) are the exponents of operands A, B and C.
2. The aligned A_Inv and the compressed partial products of B×C are compressed by a 3:2 carry-save adder (CSA), yielding two partial products; the 1 that must be added to form the two's complement of A is handled at the same time.
3. The two partial products obtained in step 2 are used for leading-zero anticipation (LZA, leading zero anticipator), which gives the number of bits by which the sum must be shifted left for normalization. The sign of the final result is determined at the same time.
4. While leading-zero anticipation and sign prediction are in progress, the half-add operation is performed and part of the addition is completed. The half-add operation ensures that the later rounding is correct. Because sign prediction takes longer than the half-add operation, part of the final addition can be completed in the remaining time.
5. The addition result is shifted left for normalization by the amount predicted by the LZA. If the sign prediction in step 3 judges the final result to be negative, the complemented form (produced in step 4) of the partial products obtained in step 2 is selected for the normalization shift.
6. The final addition and rounding are performed.
The prior art shown in Fig. 1 has the following shortcomings:
(1) An add instruction (A+B) and a multiply instruction (C×D) cannot be processed at the same time; two cycles are needed to complete them. Analysis of a number of application examples shows that executing add and multiply instructions simultaneously would significantly improve the execution efficiency of the instruction stream;
(2) When two consecutive instructions have a data dependency, the pipeline is forced to wait two cycles (with a three-stage pipeline), and in practice data dependencies are very common.
These shortcomings cannot be solved by using a separate adder unit and multiplier unit. First, hardware cost would increase. Second, a multiply-add instruction would have to be split into two instructions, which lowers its execution efficiency, and because rounding is performed twice, precision is reduced. Finally, this scheme cannot accelerate data-dependent instructions. Using one multiply-add unit together with one adder unit remedies part of the shortcomings of the prior art of Fig. 1, but the increase in hardware cost is too large, and this solution is equally powerless for data-dependent instructions.
Compared with the prior art of Fig. 1, the present invention implements an operation of the form A+B+C×D, called the parallel multiply-add operation, with the following advantages:
(1) an add instruction (A+B) and a multiply instruction (C×D) can be processed at the same time, achieving instruction-level parallelism between add and multiply instructions;
(2) when two adjacent instructions have one of the following three data dependencies, they can be processed as a single instruction:
a) first instruction: E=A+B, second instruction: F=E+C
b) first instruction: E=A+B, second instruction: F=E+C×D
c) first instruction: E=A+C×D, second instruction: F=E+B
(3) every completed parallel multiply-add instruction also yields the result of the multiplication C×D, and the rounding mode of the multiplication can be specified separately.
Summary of the invention
The object of the invention is to design a high-speed, high-performance, fully pipelined single-precision parallel floating-point multiply-add unit that improves the parallelism and execution efficiency of floating-point instructions while keeping hardware cost low.
The invention provides a single-precision parallel floating-point multiply-add unit implemented as a three-stage pipeline. It performs the multiply-add operation A+B+(C×D), with A ≥ B, has a throughput of one instruction per cycle, and produces the result of C×D at the same time. The unit comprises:
First pipeline stage: consists of an exponent and sign processing unit (1), a first 74-bit alignment shifter (2), a second 74-bit alignment shifter (3), a sticky-bit calculator (4), a first bitwise inverter (5), a second bitwise inverter (6), a 3:2 carry-save adder CSA (7), a radix-4 Booth encoder (8), a partial-product compression tree (9) built from 3:2 carry-save adders (CSA), a 24-bit adder (10), and a selector (11); wherein,
The exponent and sign processing unit (1) uses the exponents and signs of operands A, B, C and D to compute the exponent exp of A+B+(C×D), the exponent exp_CD of C×D, the flag sub indicating an effective subtraction, the provisional sign sign of A+B+(C×D), and the sign sign_CD of C×D. It also determines the shift amounts mv_A and mv_B by which A and B are aligned relative to C×D, and the flags sub_A and sub_B indicating whether A and B must be bitwise inverted after alignment. Bitwise inversion inverts every bit: each 1 becomes 0 and each 0 becomes 1;
exp_CD=exp_C+exp_D,
sub=sign_A⊕sign_C⊕sign_D,
sign=sign_CD=sign_C⊕sign_D,
sub_A=sign_A⊕sign_C⊕sign_D,
sub_B=sign_B⊕sign_C⊕sign_D,
where sign_A, sign_B, sign_C and sign_D are the signs of operands A, B, C and D, and exp_A, exp_B, exp_C and exp_D are the exponents of operands A, B, C and D. According to the IEEE 754 standard, the sign of a single-precision floating-point number is its most significant bit and the exponent is its 2nd to 9th bits; ⊕ denotes the XOR operation;
When exp_CD-exp_A≤-51 and sign_A ≠ sign_B,
exp=exp_A,
mv_A=0,
mv_B=exp-exp_B,
When exp_CD-exp_A≤-51 and sign_A=sign_B,
exp=exp_A+1,
mv_A=1,
mv_B=exp-exp_B,
When -27>exp_CD-exp_A>-51,
exp=exp_CD+51,
mv_A=exp-exp_A,
mv_B=exp-exp_B,
When 23≥exp_CD-exp_A≥-27,
exp=exp_CD+27,
mv_A=exp-exp_A,
mv_B=exp-exp_B,
When exp_CD-exp_A>23,
exp=exp_CD+1,
mv_A=exp-exp_A,
mv_B=exp-exp_B,
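The five cases above can be sketched in software as follows. This is a minimal illustrative model, not the circuit of the invention; the function name `alignment_parameters` and the use of plain Python integers for exponents are assumptions made for the example. It relies on the observation that mv_A=exp-exp_A also holds in the first two cases (0 and 1 respectively).

```python
def alignment_parameters(exp_A, exp_B, exp_CD, sign_A, sign_B):
    """Return (exp, mv_A, mv_B) for the five exponent-difference cases.

    Hypothetical helper: exponents are plain integers, signs are 0 or 1.
    """
    d = exp_CD - exp_A
    if d <= -51:
        # A dominates C*D. When A and B have the same sign, exp is one
        # larger, so A is shifted right by one position.
        exp = exp_A if sign_A != sign_B else exp_A + 1
    elif d < -27:          # -51 < d < -27
        exp = exp_CD + 51
    elif d <= 23:          # -27 <= d <= 23
        exp = exp_CD + 27
    else:                  # d > 23
        exp = exp_CD + 1
    # In every case the shift amounts are the distance from exp.
    return exp, exp - exp_A, exp - exp_B


if __name__ == "__main__":
    print(alignment_parameters(100, 90, 100, 0, 1))
```

Calling the function with one exponent set per case is a quick way to see how the common exponent exp moves from exp_A toward exp_CD as C×D grows relative to A.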
The first 74-bit shifter (2) shifts the mantissa man_A of A to the right by the mv_A value obtained in the exponent and sign processing unit (1). According to the IEEE 754 standard, the mantissa of a single-precision floating-point number is its 10th to 32nd bits; a 1 is prepended as the most significant bit when the number is normalized, otherwise a 0 is prepended, and denormalized numbers are treated as 0. The shifted output is denoted align_A,
align_A=man_A>>mv_A,
where >> denotes a right shift;
The second 74-bit shifter (3) shifts the mantissa man_B of B to the right by the mv_B value obtained in the exponent and sign processing unit (1); the shifted output is denoted align_B,
align_B=man_B>>mv_B;
The sticky-bit calculation unit (4) computes the sticky bit st1_B from the shift result of the second 74-bit shifter (3) and the sub_B computed in the exponent and sign processing unit (1). When mv_B>74: if sub_B=0 and the part of man_B shifted out of the 74-bit data path is all 0, or sub_B=1 and the part of man_B shifted out of the 74-bit data path is all 1, then st1_B=0; otherwise st1_B=1;
The first bitwise inverter (5): if the sign bit of A differs from the sign bit of C×D, it bitwise inverts the output align_A of the first 74-bit shifter (2); otherwise it passes align_A through unchanged. Its output is denoted inv_A;
The second bitwise inverter (6): if the sign bit of B differs from the sign bit of C×D, it inverts every bit of the output align_B of the second 74-bit shifter (3) (that is, a bitwise inversion); otherwise it passes align_B through unchanged. Its output is denoted inv_B;
The outputs inv_A and inv_B of the first bitwise inverter (5) and the second bitwise inverter (6), together with the sub_A obtained in the exponent and sign processing unit (1), are fed into the 3:2 CSA (7) for one compression, yielding sum_AB and carry_AB, where
sum_AB=inv_A^inv_B^sub_A,
carry_AB=((inv_A&inv_B)|(inv_A&sub_A)|(inv_B&sub_A))<<1,
The result of sub_B AND st1_B is placed in the least significant bit of carry_AB,
carry_AB[73]=sub_B&st1_B,
Here ^, & and | denote bitwise XOR, bitwise AND and bitwise OR, and << denotes a left shift;
The radix-4 Booth encoder (8) encodes the mantissa of C; the encoded result is multiplied by the mantissa of D to obtain 13 partial products, which are fed into the 3:2 carry-save CSA compression tree (9). A 3:2 CSA tree is a tree built from 3:2 CSAs, each of which compresses 3 inputs into 2 outputs. With inputs x, y, z and outputs s, c, the compression can be expressed as:
s=x^y^z,
c=((x&y)|(x&z)|(y&z))<<1,
Cascading 5 levels of 3:2 CSAs forms the 3:2 CSA tree, which compresses the 13 partial products into 2, denoted sum_CD and carry_CD;
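As an illustration of why a 24-bit mantissa yields 13 partial products, the following sketch recodes an unsigned 24-bit multiplier into radix-4 Booth digits. It is a software model under stated assumptions: the function names are hypothetical, the recoding shown is the textbook radix-4 scheme (digit = y[2k-1] + y[2k] - 2·y[2k+1]), and the patent's actual encoder circuit may differ in detail.

```python
def booth_radix4_digits(multiplier, width=24):
    """Recode an unsigned `width`-bit multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2}. An unsigned 24-bit operand needs
    width // 2 + 1 = 13 digits, because the top digit absorbs the last bit.
    """
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for k in range(width // 2 + 1):
        b0 = (multiplier >> (2 * k)) & 1
        b1 = (multiplier >> (2 * k + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def partial_products(c_mantissa, d_mantissa):
    """One partial product per Booth digit of C, each weighted by 4**k."""
    return [digit * d_mantissa * 4 ** k
            for k, digit in enumerate(booth_radix4_digits(c_mantissa))]


if __name__ == "__main__":
    C, D = 0xC00001, 0xABCDEF
    pps = partial_products(C, D)
    print(len(pps), sum(pps) == C * D)
```

In hardware the negative digits are realized by inversion plus a carry-in rather than by signed arithmetic; the signed integers here only keep the model short.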
The low 24 bits of sum_CD and carry_CD are sent to the 24-bit adder (10). The addition result is summarized in two outputs, st1_CD and st1_CD_3MSB. st1_CD records whether the low 24 bits of the sum are all zero: if so, st1_CD=0, otherwise st1_CD=1. st1_CD_3MSB records the upper three bits of the 25-bit sum;
The selector (11) chooses one of st1_B and st1_CD as the output st1 according to the exponent range computed in the exponent and sign processing unit (1):
when -27>exp_CD-exp_A>-51, st1=st1_CD; in all other cases st1=st1_B;
Second pipeline stage: consists of a 4:2 CSA (12), a 74-bit leading-zero anticipation module (13), a first 74-bit half adder (14), a second 74-bit half adder (15), a third 74-bit half adder (16), sign prediction logic (17), a selector (18), a 74-bit shifter (19), and an AND gate (20); wherein,
The 4:2 CSA (12) is equivalent to two cascaded 3:2 CSAs. It compresses the four inputs sum_AB, carry_AB, sum_CD and carry_CD into two: sum and carry. sum_CD and carry_CD are shifted according to the exponent range computed in the exponent and sign processing unit (1) before entering the CSA: when exp_CD-exp_A<-27, the upper 24 bits of sum_CD and carry_CD are used as input; when 23≥exp_CD-exp_A≥-27, sum_CD and carry_CD are used as they are; otherwise (exp_CD-exp_A>23), sum_CD and carry_CD shifted left by 26 bits are used. After compression, the most significant bit of the st1_CD_3MSB obtained in (10) is stored in the least significant bit of carry;
The 74-bit leading-zero anticipation module (13) determines the number of leading zeros in the sum of the outputs sum and carry of (12). The leading-zero count is the number of bits from the most significant bit to the first nonzero bit. If sum+carry is negative, the module instead determines the number of leading ones, that is, the number of bits from the most significant bit to the first bit that is not 1. The method is:
The position of the most significant bit is determined by examining each bit together with its left and right neighbors. Define a prediction bit f_i,
$T = sum \oplus carry$, $G = sum \,\&\, carry$, $Z = \overline{sum} \,\&\, \overline{carry}$,
$f_0 = \overline{T_0}\, T_1$,
$f_i = T_{i-1}\,(G_i \overline{Z}_{i+1} + Z_i \overline{G}_{i+1}) + \overline{T}_{i-1}\,(Z_i \overline{Z}_{i+1} + G_i \overline{G}_{i+1}),\quad i > 0$,
where sum and carry are the two outputs of (12), $\overline{sum}$ denotes the bitwise complement of sum, and T_i, G_i, Z_i denote bit i of T, G, Z respectively. If f_i=1 and f_j=0 (j=0,1,…,i-1), the leading-zero number (LZN) is i;
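For checking a predictor of this kind, it helps to have an exact reference for the quantity being anticipated. The sketch below is such a reference model, not the anticipation logic itself: it actually adds sum and carry and then counts leading zeros (or leading ones when the top bit is set). The function name and the default 74-bit width are assumptions for the example.

```python
def leading_count(sum_, carry, width=74):
    """Exact count of leading zeros (or leading ones for a negative value)
    of (sum_ + carry) taken modulo 2**width.

    Reference model only: the LZA module predicts this count from sum_ and
    carry without performing the full addition.
    """
    value = (sum_ + carry) & ((1 << width) - 1)
    bits = format(value, "0{}b".format(width))
    lead = bits[0]  # '0' for a non-negative value, '1' for a negative one
    return len(bits) - len(bits.lstrip(lead))


if __name__ == "__main__":
    print(leading_count(3, 5, width=8))        # 3 + 5 = 0b00001000
    print(leading_count(0xF0, 0x0E, width=8))  # 0xFE  = 0b11111110
```

A predictor built from the f_i indicator can then be compared against this reference over random inputs.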
For a half adder with inputs x, y and outputs s, c, the operation can be expressed as:
s=x^y,
c=(x&y)<<1,
The first half adder (14) processes, according to the above principle, the sum and carry output by the 4:2 CSA (12) into the outputs sum_HApos and carry_HApos;
The bitwise inverses of sum and carry are the inputs of the second half adder (15), whose outputs are sum_HAinv and carry_HAinv; the least significant bit of carry_HAinv is set to 1;
sum_HAinv and carry_HAinv, obtained after the bitwise inversion, are the inputs of the third half adder (16), whose outputs are sum_HAcom and carry_HAcom; the least significant bit of carry_HAcom is set to 1, so that sum_HAcom+carry_HAcom is equivalent to the two's complement of sum+carry;
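The claim that the two cascaded half adders produce the two's complement of sum+carry can be checked with a short model. The sketch below assumes a 74-bit data path with arithmetic modulo 2^74 and assumes the third half adder takes sum_HAinv and carry_HAinv directly as inputs; the function names are hypothetical. Each half adder leaves the least significant bit of its carry output free, so setting that bit to 1 adds 1, and the two injected ones supply the +2 needed because ~sum + ~carry = -(sum+carry) - 2.

```python
W = 74                  # assumed data-path width
MASK = (1 << W) - 1


def half_add(x, y):
    """Half adder on W-bit words: s = x ^ y, c = (x & y) << 1."""
    return x ^ y, ((x & y) << 1) & MASK


def complement_pair(sum_, carry):
    """Return (sum_HAcom, carry_HAcom) following half adders (15) and (16)."""
    # Second half adder (15): inputs are the bitwise inverses.
    s_inv, c_inv = half_add(~sum_ & MASK, ~carry & MASK)
    c_inv |= 1          # free LSB of the carry word: first +1
    # Third half adder (16): inputs are the outputs of (15).
    s_com, c_com = half_add(s_inv, c_inv)
    c_com |= 1          # second +1
    return s_com, c_com


if __name__ == "__main__":
    s, c = 0x123456789ABCDEF, 0x0FEDCBA987654320
    s_com, c_com = complement_pair(s, c)
    print((s_com + c_com) & MASK == (-(s + c)) & MASK)
```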
The sign prediction module (17) predicts the sign by judging whether a carry is produced out of the most significant bit of sum+carry. If a carry is produced, the sum is negative and the output signal complement is set to 1; otherwise complement=0;
The selector (18) selects, according to the sign prediction, one pair from (sum_HApos, carry_HApos) and (sum_HAcom, carry_HAcom) as its output, denoted sum_HA, carry_HA:
When complement=0, sum_HA=sum_HApos, carry_HA=carry_HApos,
When complement=1, sum_HA=sum_HAcom, carry_HA=carry_HAcom;
The 74-bit shifter (19) shifts the output of the selector (18) to the left by LZN bits according to the leading-zero anticipation result; the shifted outputs are denoted sum_Nor and carry_Nor;
The AND gate (20) ANDs the output complement of the sign prediction module (17) with the output sign of the exponent and sign processing unit (1) to obtain the sign of A+B+C×D;
Third pipeline stage: consists of the exponent calculation unit (21) of A+B+C×D, the final add/round unit (22) of A+B+C×D, the exponent correction unit (23) of C×D, and the final add/round unit (24) of C×D; wherein,
The exponent calculation unit (21) of A+B+C×D computes the exponent of A+B+C×D from the exp obtained in the exponent and sign processing unit (1), the LZN obtained in the 74-bit leading-zero anticipation module (13), and whether a 1-bit left shift occurs in the final add/round unit (22) of A+B+C×D. If no 1-bit left shift occurs in unit (22), the exponent of A+B+C×D is exp-LZN; otherwise a 1-bit correction is needed and the final exponent of A+B+C×D is exp-LZN-1;
In the final add/round unit (22) of A+B+C×D, the outputs sum_Nor and carry_Nor of the 74-bit shifter (19) are first added; the result is denoted ABCD_added,
ABCD_added=sum_Nor+carry_Nor
Rounding is then performed according to ABCD_added, the st1 obtained in the selector (11), and the rounding mode. The rounding modes are: round to nearest (RN), round toward positive infinity (RP), round toward negative infinity (RM), and round toward zero (RZ). In practice, these four rounding modes reduce to three: RN, RI, RZ;
[Equation images S2007101799736D00081 and S2007101799736D00082: definitions of the RN and RI rounding modes]
RZ(x)=x
Here ⌈x⌉ and ⌊x⌋ denote rounding up and rounding down, respectively;
For a negative number, rounding mode RP is equivalent to RZ and RM is equivalent to RI; for a positive number, RP is equivalent to RI and RM is equivalent to RZ;
First compute the sticky bit st2: if the most significant bit of ABCD_added is 1, st2=|ABCD_added[25:74], otherwise st2=|ABCD_added[26:74], where the prefix | denotes the OR of all bits in the range. The overall sticky bit st is formed from st1 and st2:
st=st1|st2
Then, according to st, ABCD_added and the rounding mode RI, RN or RZ, two temporary values of the rounding result are computed, denoted rounding_result_tmp1 and rounding_result_tmp2. rounding_result_tmp1 is computed as follows:
When RI=1:
    if st=1 or ABCD_added[24]=1,
        rounding_result_tmp1=ABCD_added[0:23]+1;
    otherwise,
        rounding_result_tmp1=ABCD_added[0:23];
When RI=0 and RN=1:
    if ABCD_added[24]=0,
        rounding_result_tmp1=ABCD_added[0:23];
    otherwise, if st=1,
        rounding_result_tmp1=ABCD_added[0:23]+1;
    otherwise, if ABCD_added[23]=1,
        rounding_result_tmp1=ABCD_added[0:23]+1;
    otherwise,
        rounding_result_tmp1=ABCD_added[0:23];
When RI=0 and RN=0:
    rounding_result_tmp1=ABCD_added[0:23];
rounding_result_tmp2 is computed as follows:
When RI=1:
    if st=1 or ABCD_added[25]=1,
        rounding_result_tmp2=ABCD_added[1:24]+1;
    otherwise,
        rounding_result_tmp2=ABCD_added[1:24];
When RI=0 and RN=1:
    if ABCD_added[25]=0,
        rounding_result_tmp2=ABCD_added[1:24];
    otherwise, if st=1,
        rounding_result_tmp2=ABCD_added[1:24]+1;
    otherwise, if ABCD_added[24]=1,
        rounding_result_tmp2=ABCD_added[1:24]+1;
    otherwise,
        rounding_result_tmp2=ABCD_added[1:24];
When RI=0 and RN=0:
    rounding_result_tmp2=ABCD_added[1:24];
Finally, one of rounding_result_tmp1 and rounding_result_tmp2 is chosen as the mantissa of the final A+B+C×D, and the most significant bits of ABCD_added and rounding_result_tmp1 determine whether the exponent in (21) needs a 1-bit correction:
If the most significant bit of rounding_result_tmp1 is 1 and the most significant bit of ABCD_added is 0, or if the most significant bit of ABCD_added is 1, rounding_result_tmp1 is chosen as the final result and no 1-bit correction is needed in (21); otherwise rounding_result_tmp2 is chosen as the final result and a 1-bit correction is needed in (21);
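The rounding and selection rules for A+B+C×D can be modeled compactly. The sketch below is illustrative only and makes several assumptions: ABCD_added is passed as a bit string with index 0 as the most significant bit (matching the [0:23] notation), the rounding mode is given as one of the strings "RI", "RN", "RZ", and the function names are hypothetical. Possible overflow of a temporary value past 24 bits is not handled.

```python
def _round(kept, guard, lsb, st, mode):
    """Increment decision shared by both temporary results."""
    if mode == "RI":
        inc = st or guard
    elif mode == "RN":
        # Round to nearest: round up above the halfway point (guard and
        # sticky), and on an exact tie only when the kept LSB is 1.
        inc = guard and (st or lsb)
    else:  # "RZ": truncate
        inc = False
    return kept + int(inc)


def final_round(bits, st1, mode):
    """Return (mantissa, needs_exponent_correction) for ABCD_added.

    `bits` is ABCD_added as a string of '0'/'1', MSB first (assumed format).
    """
    msb = bits[0] == "1"
    st2 = "1" in (bits[25:] if msb else bits[26:])
    st = st1 or st2
    tmp1 = _round(int(bits[0:24], 2), bits[24] == "1", bits[23] == "1", st, mode)
    tmp2 = _round(int(bits[1:25], 2), bits[25] == "1", bits[24] == "1", st, mode)
    tmp1_msb = (tmp1 >> 23) & 1
    if msb or tmp1_msb:
        return tmp1, False   # no 1-bit exponent correction
    return tmp2, True        # result taken one position lower


if __name__ == "__main__":
    halfway = "1" + "0" * 23 + "1" + "0" * 50
    print(final_round(halfway, False, "RN"))
    print(final_round(halfway, False, "RI"))
```

Computing both candidates and selecting afterwards mirrors the text: the choice between the two alignments depends on the most significant bit, which is only known once the addition has finished.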
The exponent correction unit (23) of C×D decides, according to whether a 1-bit left shift was performed in the final add/round unit (24) of C×D, whether the exp_CD output by the exponent and sign processing unit (1) must be corrected before it is used as the final exponent of C×D. If unit (24) determines that a correction is needed, the final exponent of C×D is exp_CD-1; otherwise it is exp_CD;
The final add/round unit (24) of C×D computes the mantissa of C×D from the upper 24 bits of the sum_CD and carry_CD obtained in the partial-product compression tree (9) built from 3:2 carry-save adders (CSA), together with the st1_CD and st1_CD_3MSB obtained in (10), and determines whether a 1-bit correction is needed;
First, the upper 24 bits of sum_CD and carry_CD and the most significant bit of st1_CD_3MSB are added to obtain CD_added:
CD_added=sum_CD[0:23]+carry_CD[0:23]+st1_CD_3MSB[0],
The mantissa of C×D is then computed by a method similar to that of the final add/round unit (22) of A+B+C×D: two temporary values, rounding_result_CD_tmp1 and rounding_result_CD_tmp2, are computed first,
rounding_result_CD_tmp1 is computed as follows:
If RI=1:
    if st1_CD=1 or st1_CD_3MSB[1]=1,
        rounding_result_CD_tmp1=CD_added+1;
    otherwise,
        rounding_result_CD_tmp1=CD_added;
If RI=0 and RN=1:
    if st1_CD_3MSB[1]=0,
        rounding_result_CD_tmp1=CD_added;
    otherwise, if st=1,
        rounding_result_CD_tmp1=CD_added+1;
    otherwise, if st1_CD_3MSB[1]=1 and CD_added[23]=1,
        rounding_result_CD_tmp1=CD_added+1;
    otherwise,
        rounding_result_CD_tmp1=CD_added;
If RI=0 and RN=0:
    rounding_result_CD_tmp1=CD_added;
rounding_result_CD_tmp2 is computed as follows:
If RI=1:
    if st1_CD=1 or st1_CD_3MSB[2]=1,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]}+1;
    otherwise,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
If RI=0 and RN=1:
    if st1_CD_3MSB[2]=0,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
    otherwise, if st=1,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]}+1;
    otherwise, if st1_CD_3MSB[2]=1 and st1_CD_3MSB[1]=1,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]}+1;
    otherwise,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
If RI=0 and RN=0:
    rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
If the most significant bit of rounding_result_CD_tmp1 is 1 and the most significant bit of CD_added is 0, or if the most significant bit of CD_added is 1, rounding_result_CD_tmp1 is chosen as the final result of C×D and no 1-bit correction is needed in the exponent correction unit (23) of C×D; otherwise rounding_result_CD_tmp2 is chosen as the final result of C×D and a 1-bit correction is needed in the exponent correction unit (23) of C×D.
The invention is implemented as a three-stage pipeline in Verilog HDL and, after verification, synthesized with a 0.18 micron standard cell library. Timing analysis of the synthesized design shows a maximum delay of 2.89 nanoseconds. Evaluation with SPEC 2000 shows that the invention achieves a performance improvement of about 20% over an ordinary multiply-add unit.
Description of drawings
Fig. 1 is a block diagram of an existing multiply-add unit described in reference 1, "Floating-Point Multiply-Add-Fused with Reduced Latency";
Fig. 2 is a block diagram of the single-precision parallel floating-point multiply-add unit of the invention, implemented as a three-stage pipeline;
Fig. 3a shows the position of A relative to C×D in the data path after alignment when exp_CD-exp_A≤-51 and sign_A ≠ sign_B;
Fig. 3b shows the position of A relative to C×D in the data path after alignment when exp_CD-exp_A≤-51 and sign_A=sign_B;
Fig. 3c shows the position of A relative to C×D in the data path after alignment when -27>exp_CD-exp_A>-51;
Fig. 3d shows the position of A relative to C×D in the data path after alignment when 23≥exp_CD-exp_A≥-27;
Fig. 3e shows the position of A relative to C×D in the data path after alignment when exp_CD-exp_A>23;
Fig. 4 is a block diagram of the multiplier compression tree, which consists of 11 49-bit CSAs.
Embodiment
The present invention is described in further detail below in conjunction with the drawings and specific embodiments.
The invention is implemented as a three-stage pipeline in Verilog HDL and, after verification, synthesized with a 0.18 micron standard cell library.
The single-precision parallel floating-point unit of the invention is divided in time into three pipeline stages. The whole operation is described below with reference to Fig. 2. In this embodiment, A+B+C×D again denotes one parallel multiply-add operation, and B is less than or equal to A, which is arranged in advance by the compiler.
First pipeline stage: alignment shifting of A and B, and Booth encoding and partial-product compression of C×D.
The radix-4 Booth encoder 8 encodes the mantissa of C; the encoded result is multiplied by the mantissa of D to obtain 13 partial products, which are fed into the 3:2 carry-save (CSA) compression tree 9. The structure of the CSA compression tree is shown in detail in Fig. 4. The inputs x, y, z of each unit module in Fig. 4 are the three 49-bit numbers to be compressed, and the outputs S and C are the 49-bit sum word and carry word after compression. Their logical relationship is:
S=x^y^z,
C=((x&y)|(x&z)|(y&z))<<1,
Here ^, & and | denote bitwise XOR, bitwise AND and bitwise OR, and << denotes a left shift.
The inputs in1 to in13 of Fig. 4 are the 13 partial products obtained after Booth encoding; the outputs are the sum word and carry word after compression, that is, the outputs sum_CD and carry_CD of 9 in Fig. 2. The whole compression tree consists of 11 49-bit CSAs; compressing 13 partial products into two requires a 5-level CSA tree.
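The figures of 11 CSAs and 5 levels can be reproduced with a small model of the reduction. The sketch below is one possible arrangement, assuming each level greedily groups as many operands as possible into triples and passes leftovers through unchanged; whether Fig. 4 wires the tree in exactly this way is not stated in the text. The function names and the test operands are hypothetical, and unbounded Python integers stand in for the 49-bit words.

```python
def csa(x, y, z):
    """3:2 carry-save adder: s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def reduce_tree(operands):
    """Reduce operands to two words; return (words, levels, csa_count)."""
    ops = list(operands)
    levels = 0
    csa_count = 0
    while len(ops) > 2:
        nxt = []
        full = len(ops) // 3
        for k in range(full):
            s, c = csa(*ops[3 * k:3 * k + 3])
            nxt += [s, c]
            csa_count += 1
        nxt += ops[3 * full:]   # leftovers pass to the next level
        ops = nxt
        levels += 1
    return ops, levels, csa_count


if __name__ == "__main__":
    pps = [(k * 2654435761) & ((1 << 49) - 1) for k in range(1, 14)]
    words, levels, count = reduce_tree(pps)
    print(levels, count, sum(words) == sum(pps))
```

With 13 inputs the operand count per level goes 13, 9, 6, 4, 3, 2, using 4+3+2+1+1 CSAs, which agrees with the numbers given in the text.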
The alignment shifting and inversion of A and B are executed in parallel with the Booth encoding and partial-product compression of C×D. If the sign of A or B differs from the sign of C×D, its two's complement is required, which is formed by inverting and then adding one. The 1 to be added can be placed in the vacant least significant bit of the 3:2 CSA carry word. A_Inv denotes the mantissa of A after alignment and inversion (no inversion is needed if the sign bit of A equals the sign bit of C×D).
In an ordinary multiply-add unit (here A+C×D denotes the ordinary multiply-add operation), A is generally aligned as follows: it is first placed starting 26 bits to the left of the most significant bit of C×D and then shifted right according to the exponent difference. Two vacant bits lie between the initial position of A and the most significant bit of C×D, to guarantee correct rounding when A is much larger than C×D. The EMAF has two addends and must use a new alignment strategy. The invention distinguishes five cases according to the exponent difference of A, C and D and uses a different alignment strategy for each. The five cases are:
1) exp_CD-exp_A≤-51 and sign_A ≠ sign_B
2) exp_CD-exp_A≤-51 and sign_A=sign_B
3)-27>exp_CD-exp_A>-51
4)23≥exp_CD-exp_A≥-27
5)exp_CD-exp_A>23
Here sign_A, sign_B, sign_C and sign_D are the signs of operands A, B, C and D respectively, and exp_A, exp_B, exp_C and exp_D are the exponents of operands A, B, C and D respectively. According to the IEEE 754 standard, the sign of a single-precision floating-point number is its most significant bit, and the exponent is its 2nd to 9th bits.
The data channel in each case, and the position of A relative to C × D in the data channel after shift alignment, are shown in Fig. 3. Fig. 3 does not show the placement of B in the data channel, because B does not affect the layout of the data channel: it is simply placed starting from the most significant bit of the data channel and then shifted relative to C × D according to its exponent difference with C × D.
When exp_CD-exp_A≤-51 and sign_A ≠ sign_B, the data channel is formed as shown in Fig. 3(a). A is far larger than C × D; A is placed starting from the most significant bit of the 74-bit data channel, the high 24 bits of C × D are placed on the low 24 bits of the 74-bit data channel, and its low 24 bits lie outside the data channel. If the exponent difference between B and A is less than or equal to 24, then after B is aligned its lowest bit lies to the left of the most significant bit of C × D, and C × D does not affect the final result at all, except for rounding. If the exponent difference between B and A is greater than 24, then after B is aligned its most significant bit lies to the right of the lowest bit of A, and B and C × D do not affect the final result of A+B+C × D. In both situations the parts of B and C × D shifted out of the data channel have no influence on the final result of A+B+C × D and therefore need not be considered.
When exp_CD-exp_A≤-51 and sign_A=sign_B, the data channel is formed as shown in Fig. 3(b). A is placed starting from the second bit of the data channel, to prevent the final result of A+B+C × D from overflowing the data channel; everything else is similar to the previous case.
When -27>exp_CD-exp_A>-51, the data channel is formed as shown in Fig. 3(c). C × D is placed in the data channel as in the previous two cases. Before shifting, A is placed on the highest 24 bits of the 74-bit data channel; it is then shifted according to the exponent difference between A and C × D. Because the exponent difference lies between -51 and -27, the lowest bit of A after shifting lies to the left of the most significant bit of C × D. When the shift amount of B is greater than 50, parts of both B and C × D lie outside the data channel, but the shifted B and C × D are then to the right of the lowest bit of A and do not affect the final result of A+B+C × D, so the parts of B and C × D shifted out of the data channel need not be considered together. Note that in this case the low 24 bits of the two partial products obtained after compressing C × D may produce a carry, and this carry must be taken into account.
When 23≥exp_CD-exp_A≥-27, the data channel is formed as shown in Fig. 3(d). C × D is placed on the low 48 bits of the data channel. Before shifting, A is placed on the highest 24 bits of the 74-bit data channel; it is then shifted according to the exponent difference between A and C × D. Because the exponent difference lies between -27 and 23, A after shifting may be anywhere in the data channel but cannot be shifted out of it.
When exp_CD-exp_A>23, the data channel is formed as shown in Fig. 3(e). C × D is placed on the high 48 bits of the data channel. Before shifting, A is placed on the highest 24 bits of the 74-bit data channel; it is then shifted according to the exponent difference between A and C × D. Because the exponent difference is greater than 23, A after shifting may be anywhere to the right of the 25th bit of the data channel, and may even be shifted out of it. When A is shifted out of the data channel, its most significant bit lies to the right of the lowest bit of C × D; because B is less than or equal to A, B is then also much smaller than C × D, and neither A nor B affects the final result.
Summarizing the above five cases:
1) A is never effectively shifted out of the data channel; that is, when A is shifted out of the data channel (possible only in the case shown in Fig. 3(e)), it does not affect the final result of A+B+C × D. The complementing of A is therefore greatly simplified: when sign_A ≠ sign_C ⊕ sign_D (sub_A=1), it suffices to add 1 at the lowest bit of the data channel. Here ⊕ denotes XOR.
2) B may be shifted out of the data channel in every case. Only when sign_B ≠ sign_C ⊕ sign_D (sub_B=1) and the part of B shifted out of the data channel is all 0 (st1_B=0) must a 1 be added at the lowest bit of the data channel to complete the complementing of B.
3) After shifting, parts of B and C × D may lie outside the data channel at the same time, but in that case B and C × D do not affect the data channel, so there is no need to consider whether the parts of B and C × D outside the data channel produce a carry when added.
4) When -27>exp_CD-exp_A>-51, the low 24 bits of the two partial products obtained after compressing C × D may produce a carry, and this carry must be taken into account.
The 1s added for complementing A and B are handled by unit 7 of Fig. 2. Because there are two add-1 operations, a 3:2 CSA is introduced here: the 1 required for complementing A is one input of the CSA, and the 1 required for complementing B uses the vacant lowest bit of the carry word output by the CSA. Because the delay of the multiplier's encoding and partial-product compression is greater than that of the shift alignment of A and B, this CSA does not lengthen the critical path.
When -27>exp_CD-exp_A>-51, the carry out of the low 24 bits of the two partial products obtained after compressing C × D affects the final result. This carry is added to the data channel by inserting st1_CD_3MSB[0] at the lowest bit of the carry word of the 4:2 CSA in Fig. 2, where st1_CD_3MSB is the three most significant bits of the 25-bit result of adding the low 24 bits of the two partial products sum_CD and carry_CD obtained after compressing C × D.
Second pipeline stage: the result of the shift alignment of A and B and the result of the partial-product compression of C × D are compressed by a 4:2 CSA, after which leading zero prediction, sign prediction, half-add operations and the normalization shift are carried out.
The previous pipeline stage has produced the results sum_AB and carry_AB of the shift alignment of A and B (the outputs of unit 7 in Fig. 2) and the two partial products sum_CD and carry_CD obtained after compressing C × D. These four inputs are first compressed into two by a 4:2 CSA, denoted sum and carry respectively; sum and carry are then input to the leading zero prediction unit 13, which computes the number of leading zeros (denoted LZN).
If sum and carry were normalized and then added directly, a negative addition result would still have to be complemented, which adds delay. To avoid this delay, the sign of sum+carry is determined during leading zero prediction; if sum+carry<0, the complement form of sum and carry is selected for the subsequent processing, such as the normalization shift, final addition and rounding. The 1s that must be added when complementing sum and carry are inserted using the vacant lowest bit of the carry words of half adders 15 and 16.
The 74-bit shifter 19 shifts the output of selector 18 left by the LZN computed in 13; its outputs are denoted sum_nor and carry_nor.
Third pipeline stage: using sum_nor and carry_nor output by the second pipeline stage, the final addition and rounding are completed and the exponent of A+B+C × D is computed. The mantissa and exponent of C × D are computed at the same time from the outputs of the first pipeline stage.
In 22, sum_nor and carry_nor are first added, and the result is denoted ABCD_added. Then, according to the rounding mode, the results of rounding with the 25th bit and with the 26th bit as the rounding bit are computed, denoted rounding_result_tmp1 and rounding_result_tmp2 respectively. If the most significant bit of rounding_result_tmp1 is 1 and the most significant bit of ABCD_added is 0, or if the most significant bit of ABCD_added is 1, rounding_result_tmp1 is chosen as the final result; otherwise rounding_result_tmp2 is chosen as the final result.
In 21, the exponent of A+B+C × D is computed from the interim exponent of the data channel computed in 1 and the normalization shift amount computed in 13, and is then corrected according to the result of 22: if rounding_result_tmp2 is chosen as the final result, the exponent is decreased by 1.
24 computes the mantissa of C × D by a method similar to that used in 22 for the mantissa of A+B+C × D; likewise, 23 corrects the exponent of C × D according to the result computed in 24.

Claims (1)

1. A parallel floating-point multiply-add unit, which implements the multiply-add operation A+B+(C × D), A ≥ B, characterized in that the unit contains a three-stage pipeline, has a throughput of one instruction per cycle, and can produce the result of C × D at the same time, the unit comprising:
First pipeline stage: consists of the exponent and sign processing unit (1), the first 74-bit alignment shifter (2), the second 74-bit alignment shifter (3), the sticky bit calculator (4), the first bitwise inverter (5), the second bitwise inverter (6), the 3:2 carry-save adder CSA (7), the radix-4 Booth encoder (8), the partial-product compression tree (9) composed of 3:2 carry-save adders, the 24-bit adder (10) and the selector (11); wherein,
The exponent and sign processing unit (1) computes, from the exponents and signs of operands A, B, C and D, the exponent exp of A+B+(C × D), the exponent exp_CD of C × D, whether the operation is an effective subtraction sub, the interim sign sign of A+B+(C × D), and the sign sign_CD of C × D; it also determines the shift amounts mv_A and mv_B used when A and B are aligned relative to C × D, and whether A and B need bitwise inversion after alignment (sub_A, sub_B). Bitwise inversion inverts every bit, i.e. 1 becomes 0 and 0 becomes 1;
exp_CD=exp_C+exp_D,
sub=sign_A⊕sign_C⊕sign_D,
sign=sign_CD=sign_C⊕sign_D,
sub_A=sign_A⊕sign_C⊕sign_D,
sub_B=sign_B⊕sign_C⊕sign_D,
Here sign_A, sign_B, sign_C and sign_D are the signs of operands A, B, C and D respectively, and exp_A, exp_B, exp_C and exp_D are the exponents of operands A, B, C and D respectively. According to the IEEE 754 standard, the sign of a single-precision floating-point number is its most significant bit, and the exponent is its 2nd to 9th bits; ⊕ denotes the XOR operation;
When exp_CD-exp_A≤-51 and sign_A ≠ sign_B,
exp=exp_A,
mv_A=0,
mv_B=exp-exp_B,
When exp_CD-exp_A≤-51 and sign_A=sign_B,
exp=exp_A+1,
mv_A=1,
mv_B=exp-exp_B,
When -27>exp_CD-exp_A>-51,
exp=exp_CD+51,
mv_A=exp-exp_A,
mv_B=exp-exp_B,
When 23≥exp_CD-exp_A≥-27,
exp=exp_CD+27,
mv_A=exp-exp_A,
mv_B=exp-exp_B,
When exp_CD-exp_A>23,
exp=exp_CD+1,
mv_A=exp-exp_A,
mv_B=exp-exp_B,
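The five-way case split and the resulting interim exponent and shift amounts can be sketched as a small function. This is a minimal sketch that follows the formulas listed above; the function name `align_control` and the returned case number are hypothetical conveniences, and exponents are treated as plain integers.

```python
def align_control(exp_A, exp_B, exp_CD, sign_A, sign_B):
    """Return (case, exp, mv_A, mv_B) for the five alignment cases.

    case is 1..5 in the order the cases are listed in the text;
    exp is the interim exponent of the data channel.
    """
    d = exp_CD - exp_A
    if d <= -51:
        if sign_A != sign_B:
            case, exp = 1, exp_A          # A starts at the top bit, mv_A = 0
        else:
            case, exp = 2, exp_A + 1      # one guard bit against overflow, mv_A = 1
    elif d < -27:
        case, exp = 3, exp_CD + 51
    elif d <= 23:
        case, exp = 4, exp_CD + 27
    else:
        case, exp = 5, exp_CD + 1
    mv_A = exp - exp_A
    mv_B = exp - exp_B
    return case, exp, mv_A, mv_B


# Example: C x D is far smaller than A and the signs of A and B differ.
print(align_control(exp_A=100, exp_B=90, exp_CD=40, sign_A=0, sign_B=1))
```

Writing mv_A uniformly as exp - exp_A reproduces the fixed values mv_A=0 and mv_A=1 of the first two cases, which is why the sketch does not special-case them.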
The first 74-bit shifter (2) shifts the mantissa man_A of A to the right according to the mv_A value obtained in the exponent and sign processing unit (1). According to the IEEE 754 standard, the mantissa of a single-precision floating-point number is its 10th to 32nd bits; a 1 is prepended as the most significant bit when the number is normalized, otherwise a 0 is prepended, and denormalized numbers are treated as 0. The output after shifting is denoted align_A,
align_A=man_A>>mv_A,
where >> denotes a right shift;
The second 74-bit shifter (3) shifts the mantissa man_B of B to the right according to the mv_B value obtained in the exponent and sign processing unit (1); the output after shifting is denoted align_B,
align_B=man_B>>mv_B;
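The mantissa extraction and right shift described above can be sketched in software as follows. This is an illustration only: the names `unpack_single` and `align` are hypothetical, the mantissa is assumed to be placed on the highest 24 bits of the 74-bit data channel before shifting (the placement used in cases 1 and 3 to 5 of the description), and infinities and NaNs are not handled.

```python
import struct

WIDTH = 74  # width of the data channel in bits


def unpack_single(x):
    """Split a Python float, rounded to single precision, into
    (sign, biased exponent, 24-bit mantissa with hidden bit).

    Denormalized numbers (exponent field 0) are treated as 0,
    as stated in the text.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    man = ((1 << 23) | frac) if exp != 0 else 0
    return sign, exp, man


def align(man, mv):
    """Place a 24-bit mantissa at the top of the data channel and
    shift it right by mv; bits shifted past the channel are dropped."""
    return (man << (WIDTH - 24)) >> mv


sign, exp, man = unpack_single(-2.5)
print(sign, exp, hex(man), hex(align(man, 3)))
```

In hardware the dropped bits feed the sticky bit logic; the sketch simply discards them.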
The sticky bit calculation unit (4) computes the sticky bit st1_B from the shift result of the second 74-bit shifter (3) and the sub_B computed in the exponent and sign processing unit (1): when mv_B>74, if sub_B=0 and the part of man_B shifted out of the 74-bit-wide data channel is all 0, or sub_B=1 and the part of man_B shifted out of the 74-bit-wide data channel is all 1, then st1_B=0, otherwise st1_B=1;
The first bitwise inverter (5) inverts the output align_A of the first 74-bit shifter (2) bitwise if the sign bit of A differs from the sign bit of C × D; otherwise it outputs align_A unchanged. The output of the first bitwise inverter (5) is denoted inv_A;
The second bitwise inverter (6) inverts every bit of the output align_B of the second 74-bit shifter (3) if the sign bit of B differs from the sign bit of C × D; otherwise it outputs align_B unchanged. The output of the second bitwise inverter (6) is denoted inv_B;
The outputs inv_A and inv_B of the first bitwise inverter (5) and the second bitwise inverter (6), together with the sub_A obtained in the exponent and sign processing unit (1), are sent to the 3:2 CSA (7) for one compression step, giving sum_AB and carry_AB, where
sum_AB=inv_A^inv_B^sub_A,
carry_AB=((inv_A&inv_B)|(inv_A&sub_A)|(inv_B&sub_A))<<1,
and the result of ANDing sub_B and st1_B is placed on the lowest bit of carry_AB,
carry_AB[73]=sub_B&st1_B,
Here ^, & and | denote bitwise XOR, bitwise AND and bitwise OR respectively, and << denotes a left shift;
The radix-4 Booth encoder (8) encodes the mantissa of C, and the encoded result is multiplied by the mantissa of D to obtain 13 partial products. These 13 partial products are fed into the 3:2 carry-save adder (CSA) compression tree (9). A 3:2 CSA tree is a tree composed of 3:2 CSAs, each of which compresses 3 inputs into 2 outputs. If the inputs are x, y, z and the outputs are s, c, the compression can be expressed as follows:
s=x^y^z,
c=((x&y)|(x&z)|(y&z))<<1,
Cascading 5 levels of 3:2 CSAs forms the 3:2 CSA tree, which compresses the 13 partial products into 2, denoted sum_CD and carry_CD respectively;
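The radix-4 Booth encoding step performed by the encoder (8) can be illustrated with the textbook recoding, which turns a 24-bit unsigned mantissa into 13 digits in the range -2..2. The sketch below is an assumption about the encoding details (the patent text does not spell them out), and the names `booth_radix4_digits` and `booth_partial_products` are hypothetical.

```python
def booth_radix4_digits(c, nbits=24):
    """Radix-4 Booth recoding of an unsigned nbits-wide multiplier.

    Digit i is c[2i-1] + c[2i] - 2*c[2i+1], with c[-1] = 0 and zero
    extension above bit nbits-1, giving nbits//2 + 1 digits in -2..2.
    """
    ext = c << 1  # append the implicit c[-1] = 0 on the right
    digits = []
    for i in range(nbits // 2 + 1):
        triplet = (ext >> (2 * i)) & 0b111
        b0 = triplet & 1
        b1 = (triplet >> 1) & 1
        b2 = (triplet >> 2) & 1
        digits.append(b0 + b1 - 2 * b2)
    return digits


def booth_partial_products(c, d, nbits=24):
    """One signed partial product per Booth digit, already weighted by 4**i."""
    return [(dig * d) << (2 * i)
            for i, dig in enumerate(booth_radix4_digits(c, nbits))]


man_C, man_D = 0xC90FDB, 0xADF854  # two arbitrary 24-bit mantissas
pps = booth_partial_products(man_C, man_D)
print(len(pps), sum(pps) == man_C * man_D)
```

Because the 24-bit mantissa is unsigned, one extra digit beyond the 12 bit pairs is needed, which gives the 13 partial products mentioned in the text. A hardware implementation would represent the negative partial products in complement form rather than as signed integers.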
The low 24 bits of sum_CD and carry_CD are sent to the 24-bit adder (10), and the addition result is summarized into two outputs, st1_CD and st1_CD_3MSB. st1_CD records whether the low 24 bits of the addition result are all zero: if they are all zero, st1_CD=0, otherwise st1_CD=1. st1_CD_3MSB records the three most significant bits of the 25-bit addition result. The selector (11) chooses one of st1_B and st1_CD as the output st1 according to the exponent range computed in the exponent and sign processing unit (1):
when -27>exp_CD-exp_A>-51, st1=st1_CD; in all other cases st1=st1_B;
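The behaviour of the 24-bit adder (10) and the selector (11) can be sketched as follows. The sketch assumes that sum_CD and carry_CD are held as integers whose low 24 bits are the bits in question, and that index 0 of st1_CD_3MSB is the carry out of the 24-bit addition (the text indexes bits from the most significant end); the function names are hypothetical.

```python
MASK24 = (1 << 24) - 1


def low_adder(sum_CD, carry_CD):
    """Add the low 24 bits of the two partial products.

    Returns (st1_CD, st1_CD_3MSB), where st1_CD is 1 if the low 24 bits
    of the result are not all zero and st1_CD_3MSB lists the top three
    bits of the 25-bit result, most significant first.
    """
    total = (sum_CD & MASK24) + (carry_CD & MASK24)  # 25-bit result
    st1_CD = int((total & MASK24) != 0)
    st1_CD_3MSB = [(total >> 24) & 1, (total >> 23) & 1, (total >> 22) & 1]
    return st1_CD, st1_CD_3MSB


def select_st1(exp_CD, exp_A, st1_B, st1_CD):
    """Selector (11): use st1_CD only in the case -27 > exp_CD-exp_A > -51."""
    return st1_CD if -51 < exp_CD - exp_A < -27 else st1_B


print(low_adder(0xFFFFFF, 0x000001))
```

In this example the addition overflows into bit 24 and leaves the low 24 bits at zero, so the carry bit is set while the sticky bit st1_CD is 0.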
Second pipeline stage: consists of the 4:2 CSA (12), the 74-bit leading zero prediction module (13), the first 74-bit half adder (14), the second 74-bit half adder (15), the third 74-bit half adder (16), the sign prediction logic (17), the selector (18), the 74-bit shifter (19) and the AND gate (20); wherein the 4:2 CSA (12) is equivalent to two cascaded 3:2 CSAs and compresses the four inputs sum_AB, carry_AB, sum_CD and carry_CD into two: sum and carry. sum_CD and carry_CD are shifted according to the exponent range computed in the exponent and sign processing unit (1) before being input to the CSA: when exp_CD-exp_A<-27, the first 24 bits of sum_CD and carry_CD are used as input; when 23≥exp_CD-exp_A≥-27, sum_CD and carry_CD are used as input directly; in the remaining case (exp_CD-exp_A>23), sum_CD and carry_CD shifted left by 26 bits are used as input. After compression, the most significant bit of the st1_CD_3MSB obtained in (10) is placed on the lowest bit of carry;
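The statement that a 4:2 CSA is equivalent to two cascaded 3:2 CSAs can be checked with a short sketch. It is an illustration only: the names are hypothetical, words are truncated to the 74-bit data channel, and the insertion of st1_CD_3MSB[0] into the lowest carry bit is left out.

```python
WIDTH = 74
MASK = (1 << WIDTH) - 1


def csa32(x, y, z):
    """74-bit 3:2 carry-save adder; the carry out of the top bit is dropped."""
    s = x ^ y ^ z
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK
    return s, c


def csa42(a, b, c, d):
    """4:2 compressor built from two cascaded 3:2 CSAs."""
    s1, c1 = csa32(a, b, c)
    return csa32(s1, c1, d)


sum_AB, carry_AB = 0x123456789ABCDEF, 0x0FEDCBA987654321  # arbitrary test words
sum_CD, carry_CD = 0x3FF00000000000000, 0x155555555555
s, c = csa42(sum_AB, carry_AB, sum_CD, carry_CD)
print(((s + c) & MASK) == ((sum_AB + carry_AB + sum_CD + carry_CD) & MASK))
```

The two outputs add up to the same value as the four inputs modulo 2^74, and the lowest bit of the carry output is always 0, which is the vacant position the text uses for the extra carry.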
The 74-bit leading zero prediction module (13) predicts the number of leading zeros in the result of adding the outputs sum and carry of (12). The number of leading zeros is the number of bits from the most significant bit up to the first non-zero bit. If the result of adding sum and carry is negative, the number of leading ones is predicted instead, i.e. the number of bits from the most significant bit up to the first bit that is not 1. The method determines which bit may be the most significant one by examining each bit together with its left and right neighbours. Let the prediction bit be $f_i$, with

$$T = sum \oplus carry,\quad G = sum\ \&\ carry,\quad Z = \overline{sum}\ \&\ \overline{carry},$$

$$f_0 = \overline{T_0}\,T_1,$$

$$f_i = T_{i-1}\left(G_i\,\overline{Z}_{i+1} + Z_i\,\overline{G}_{i+1}\right) + \overline{T}_{i-1}\left(Z_i\,\overline{Z}_{i+1} + G_i\,\overline{G}_{i+1}\right),\quad i > 0,$$

where sum and carry are the two outputs of the 4:2 CSA (12), $\overline{sum}$ denotes the bitwise inversion of sum, and $T_i$, $G_i$, $Z_i$ denote bit i of T, G and Z respectively. If $f_i=1$ and $f_j=0$ (j=0,1,...,i-1), then the leading zero number (LZN) is i;
If the inputs of a half adder are x, y and its outputs are s, c, its principle can be expressed as:
s=x^y,
c=(x&y)<<1,
The first half adder (14) processes the sum and carry output by the 4:2 CSA (12) according to the above principle and outputs sum_HApos and carry_HApos;
The second half adder (15) takes the bitwise-inverted sum and carry as its inputs and outputs sum_HAinv and carry_HAinv, and the lowest bit of carry_HAinv is set to 1;
The third half adder (16) takes sum_HAinv and carry_HAinv as its inputs and outputs sum_HAcom and carry_HAcom, and the lowest bit of carry_HAcom is set to 1, so that sum_HAcom+carry_HAcom is equivalent to the complement form of sum+carry;
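The reason two half adders are enough to form the complement can be seen from the identity -(sum+carry) = ~sum + ~carry + 2: each half adder leaves the lowest carry bit vacant, so each can absorb one of the two required 1s. The sketch below checks this modulo 2^74; the function names are hypothetical and the code is a software model, not the circuit.

```python
WIDTH = 74
MASK = (1 << WIDTH) - 1


def half_adder(x, y):
    """74-bit half adder: s = x ^ y, c = (x & y) << 1 (top carry dropped)."""
    return x ^ y, ((x & y) << 1) & MASK


def complement_pair(sum_, carry):
    """Return (sum_HAcom, carry_HAcom) whose total equals -(sum_ + carry)
    modulo 2**WIDTH, following the two half-adder steps in the text."""
    # Second half adder (15): inputs are the bitwise-inverted words.
    s_inv, c_inv = half_adder(~sum_ & MASK, ~carry & MASK)
    c_inv |= 1  # first +1, placed in the vacant lowest carry bit
    # Third half adder (16): frees the lowest carry bit again.
    s_com, c_com = half_adder(s_inv, c_inv)
    c_com |= 1  # second +1
    return s_com, c_com


s, c = complement_pair(0x2AAAAAAAAAAAAAAAAAA, 0x155555555555555555)
print(hex((s + c) & MASK))
```

Running the first half adder (14) on the non-inverted inputs in parallel gives both candidates in the same carry-save form, so the selector (18) can choose between them once the sign is known.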
The sign prediction module (17) predicts the sign by judging whether a carry is produced out of the most significant bit of sum+carry: if a carry is produced, the addition result is negative and the output signal complement is set to 1, otherwise complement=0;
The selector (18) selects, according to the result of sign prediction, one pair from sum_HApos, carry_HApos and sum_HAcom, carry_HAcom as its output, denoted sum_HA, carry_HA,
When complement=0, sum_HA=sum_HApos, carry_HA=carry_HApos,
When complement=1, sum_HA=sum_HAcom, carry_HA=carry_HAcom,
The 74-bit shifter (19) shifts the output of the selector (18) left according to the leading zero prediction result; the shift amount is LZN, and the outputs after shifting are denoted sum_nor and carry_nor;
The AND gate (20) ANDs the output complement of the sign prediction module (17) with the output sign of the exponent and sign processing unit (1) to obtain the sign of A+B+C × D;
Third pipeline stage: consists of the exponent calculation unit (21) of A+B+C × D, the final addition/rounding unit (22) of A+B+C × D, the exponent correction unit (23) of C × D, and the final addition/rounding unit (24) of C × D; wherein,
The exponent calculation unit (21) of A+B+C × D computes the exponent of A+B+C × D from the exp obtained in the exponent and sign processing unit (1), the LZN obtained in the 74-bit leading zero prediction module (13), and whether a 1-bit left shift occurs in the final addition/rounding unit (22) of A+B+C × D: if no 1-bit left shift occurs in unit (22), the exponent of A+B+C × D is exp-LZN; otherwise a 1-bit correction is required, and the final exponent of A+B+C × D is exp-LZN-1;
In the final addition/rounding unit (22) of A+B+C × D, the outputs sum_nor and carry_nor of the 74-bit shifter (19) are first added, and the result is denoted ABCD_added,
ABCD_added=sum_nor+carry_nor,
Rounding is then performed according to ABCD_added, the st1 obtained from the selector (11), and the rounding mode. The rounding modes are: round to nearest (RN), round toward positive infinity (RP), round toward negative infinity (RM), and round toward zero (RZ). From an implementation point of view, these four rounding modes can be reduced to three: RN, RI, RZ;
[Equation image S2007101799736C00051 not reproduced]
RZ(x)=x
Here ⌈x⌉ and ⌊x⌋ denote rounding up and rounding down respectively;
For negative, rounding mode RP can equivalence be RI, and RM can equivalence be RN; For positive number, rounding mode RP can equivalence be RN, and RM can equivalence be RI;
First the sticky bit st2 is computed: if the most significant bit of ABCD_added is 1, st2=|ABCD_added[25:74], otherwise st2=|ABCD_added[26:74]; the overall sticky bit st is composed of the two parts st1 and st2:
st=st1|st2,
Then, from st, ABCD_added and the rounding mode RI, RN or RZ, two interim values of the rounded result are computed, denoted rounding_result_tmp1 and rounding_result_tmp2 respectively. rounding_result_tmp1 is computed as follows:
When RI=1,
    if st=1 or ABCD_added[24]=1,
        rounding_result_tmp1=ABCD_added[0:23]+1;
    otherwise,
        rounding_result_tmp1=ABCD_added[0:23];
When RI=0 and RN=1,
    if ABCD_added[24]=0,
        rounding_result_tmp1=ABCD_added[0:23];
    otherwise, if st=1,
        rounding_result_tmp1=ABCD_added[0:23]+1;
    otherwise, if ABCD_added[23]=1,
        rounding_result_tmp1=ABCD_added[0:23]+1;
    otherwise,
        rounding_result_tmp1=ABCD_added[0:23];
When RI=0 and RN=0,
    rounding_result_tmp1=ABCD_added[0:23];
rounding_result_tmp2 is computed as follows:
When RI=1,
    if st=1 or ABCD_added[25]=1,
        rounding_result_tmp2=ABCD_added[1:24]+1;
    otherwise,
        rounding_result_tmp2=ABCD_added[1:24];
When RI=0 and RN=1,
    if ABCD_added[25]=0,
        rounding_result_tmp2=ABCD_added[1:24];
    otherwise, if st=1,
        rounding_result_tmp2=ABCD_added[1:24]+1;
    otherwise, if ABCD_added[24]=1,
        rounding_result_tmp2=ABCD_added[1:24]+1;
    otherwise,
        rounding_result_tmp2=ABCD_added[1:24];
When RI=0 and RN=0,
    rounding_result_tmp2=ABCD_added[1:24];
Finally, one of rounding_result_tmp1 and rounding_result_tmp2 is chosen as the mantissa of the final A+B+C × D, and whether the exponent in 21 needs a 1-bit correction is determined from the most significant bit of ABCD_added and the most significant bit of rounding_result_tmp1:
If the most significant bit of rounding_result_tmp1 is 1 and the most significant bit of ABCD_added is 0, or if the most significant bit of ABCD_added is 1, rounding_result_tmp1 is chosen as the final result and no 1-bit correction is needed in 21; otherwise rounding_result_tmp2 is chosen as the final result and a 1-bit correction is needed in 21;
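The per-window rounding rule shared by rounding_result_tmp1 and rounding_result_tmp2 can be sketched as one function parameterised by the position of the 24-bit window. This is a minimal sketch under stated assumptions: the name `round_window` is hypothetical, ABCD_added is passed as a list of bits with index 0 as the most significant bit (the indexing used in the text), and the final selection between the two interim values and the overflow of the incremented mantissa are not modelled.

```python
def round_window(bits, start, st1, mode):
    """Round the 24-bit window bits[start:start+24].

    bits  : list of 0/1 values, index 0 is the most significant bit
    start : 0 for rounding_result_tmp1, 1 for rounding_result_tmp2
    st1   : sticky bit delivered by the first pipeline stage
    mode  : "RN", "RI" or "RZ"
    """
    mant = bits[start:start + 24]
    guard = bits[start + 24]                    # first discarded bit
    st = st1 | int(any(bits[start + 25:]))      # st = st1 | st2
    value = int("".join(str(b) for b in mant), 2)
    if mode == "RI":
        increment = guard or st                 # any discarded bit rounds up
    elif mode == "RN":
        # Round to nearest, ties to even: above half rounds up, an exact
        # tie rounds up only when the kept lowest bit is 1.
        increment = guard and (st or mant[-1])
    else:                                       # "RZ": truncate
        increment = 0
    return value + int(bool(increment))


# 75-bit example: 1.00...01 followed by a set guard bit (an exact tie).
bits = [1] + [0] * 22 + [1, 1] + [0] * 50
print(round_window(bits, 0, 0, "RN"), round_window(bits, 0, 0, "RZ"))
```

Calling the function with start=0 and start=1 yields the two interim values; the text then picks one of them from the most significant bits of ABCD_added and rounding_result_tmp1.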
The exponent correction unit (23) of C × D decides, according to whether a 1-bit left shift was performed in the final addition/rounding unit (24) of C × D, whether the exp_CD output by the exponent and sign processing unit (1) must be corrected before being used as the final exponent of C × D: if unit (24) determines that a correction is needed, the final exponent of C × D is exp_CD-1; otherwise the final exponent of C × D is exp_CD;
In the final addition/rounding unit (24) of C × D, the mantissa of C × D is computed from the high 24 bits of sum_CD and carry_CD obtained in the partial-product compression tree (9) composed of 3:2 carry-save adders, together with the st1_CD and st1_CD_3MSB obtained in (10), and it is determined whether a 1-bit correction is needed;
First, the high 24 bits of sum_CD and carry_CD and the most significant bit of st1_CD_3MSB are added to obtain CD_added:
CD_added=sum_CD[0:23]+carry_CD[0:23]+st1_CD_3MSB[0],
The mantissa of C × D is then computed by a method similar to that of the final addition/rounding unit (22) of A+B+C × D: two interim values rounding_result_CD_tmp1 and rounding_result_CD_tmp2 are computed first,
rounding_result_CD_tmp1 is computed as follows:
If RI=1,
    if st1_CD=1 or st1_CD_3MSB[1]=1,
        rounding_result_CD_tmp1=CD_added+1;
    otherwise,
        rounding_result_CD_tmp1=CD_added;
If RI=0 and RN=1,
    if st1_CD_3MSB[1]=0,
        rounding_result_CD_tmp1=CD_added;
    otherwise, if st=1,
        rounding_result_CD_tmp1=CD_added+1;
    otherwise, if st1_CD_3MSB[1]=1 and CD_added[23]=1,
        rounding_result_CD_tmp1=CD_added+1;
    otherwise,
        rounding_result_CD_tmp1=CD_added;
If RI=0 and RN=0,
    rounding_result_CD_tmp1=CD_added;
rounding_result_CD_tmp2 is computed as follows:
If RI=1,
    if st1_CD=1 or st1_CD_3MSB[2]=1,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]}+1;
    otherwise,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
If RI=0 and RN=1,
    if st1_CD_3MSB[2]=0,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
    otherwise, if st=1,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]}+1;
    otherwise, if st1_CD_3MSB[2]=1 and st1_CD_3MSB[1]=1,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]}+1;
    otherwise,
        rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
If RI=0 and RN=0,
    rounding_result_CD_tmp2={CD_added[1:23],st1_CD_3MSB[1]};
If the most significant bit of rounding_result_CD_tmp1 is 1 and the most significant bit of CD_added is 0, or if the most significant bit of CD_added is 1, rounding_result_CD_tmp1 is chosen as the final result of C × D and no 1-bit correction is needed in the exponent correction unit (23) of C × D; otherwise rounding_result_CD_tmp2 is chosen as the final result of C × D and a 1-bit correction is needed in the exponent correction unit (23) of C × D.
CNB2007101799736A 2007-12-20 2007-12-20 A kind of paralleling floating point multiplication addition unit Active CN100570552C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101799736A CN100570552C (en) 2007-12-20 2007-12-20 A kind of paralleling floating point multiplication addition unit

Publications (2)

Publication Number Publication Date
CN101178645A true CN101178645A (en) 2008-05-14
CN100570552C CN100570552C (en) 2009-12-16

Family

ID=39404911

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101799736A Active CN100570552C (en) 2007-12-20 2007-12-20 A kind of paralleling floating point multiplication addition unit

Country Status (1)

Country Link
CN (1) CN100570552C (en)

Also Published As

Publication number Publication date
CN100570552C (en) 2009-12-16

Similar Documents

Publication Publication Date Title
CN100570552C (en) Parallel floating-point multiply-add unit
CN101221490B (en) Floating point multiplier and adder unit with data forwarding structure
CN101174200B (en) Five-stage pipeline structure of a floating-point fused multiply-add unit
CN107273090A (en) Approximate floating-point multiplier and floating-point multiplication method for neural network processors
CN101133389B (en) Multipurpose multiply-add functional unit
CN106897046B (en) Fixed-point multiply-accumulator
CN101847087A (en) Reconfigurable transverse summing network structure for supporting fixed and floating points
CN101692202A (en) 64-bit floating-point multiply-accumulator and method for pipelined processing of floating-point operations thereof
CN104991757A (en) Floating point processing method and floating point processor
CN112463113B (en) Floating point addition unit
CN101770355B (en) Floating-point multiply-add fused unit compatible with double-precision and double-single-precision and compatibility processing method thereof
CN116594590A (en) Multifunctional operation device and method for floating point data
CN101371221A (en) Pre-saturating fixed-point multiplier
Vázquez et al. Iterative algorithm and architecture for exponential, logarithm, powering, and root extraction
US20050228844A1 (en) Fast operand formatting for a high performance multiply-add floating point-unit
WO2011137209A1 (en) Operand-optimized asynchronous floating-point units and methods of use thereof
CN100476718C (en) 64-bit floating-point multiplier and pipeline stage division method
CN110727412B (en) Mask-based hybrid floating-point multiplication low-power-consumption control method and device
CN116450085A (en) Extensible BFloat16 dot-product arithmetic unit and microprocessor
CN102646033B (en) Implementation method and device of an RSA algorithm providing encryption and signature functions
EP1752872A2 (en) Method and system for high-speed floating-point operations and related computer program product
Schulte et al. Variable-precision, interval arithmetic coprocessors
US11182127B2 (en) Binary floating-point multiply and scale operation for compute-intensive numerical applications and apparatuses
Hsiao et al. Design of a low-cost floating-point programmable vertex processor for mobile graphics applications based on hybrid number system
US7689642B1 (en) Efficient accuracy check for Newton-Raphson divide and square-root operations

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant