CN112558920B

CN112558920B - Signed/unsigned multiply-accumulate device and method

Info

Publication number: CN112558920B
Application number: CN202011521792.9A
Authority: CN
Inventors: 尹首一; 谷江源; 孙庆斌; 张淞; 刘雷波; 魏少军
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2022-09-09
Anticipated expiration: 2040-12-21
Also published as: CN112558920A

Abstract

The invention provides a signed/unsigned multiply-accumulate device and a method, which are suitable for a coarse-grained reconfigurable processor architecture, wherein the device comprises a splitting module, an operation module, a processing module and an output module; the splitting module is used for acquiring a configuration control signal, splitting an input binary system multiplicand, a multiplier and an addend which are larger than a preset bit width according to the configuration control signal and generating a plurality of groups of binary numbers which are smaller than the preset bit width according to a preset splitting rule; the operation module is used for correspondingly grouping a plurality of groups of binary numbers smaller than the preset bit width through a plurality of MAC operation units according to the dynamic configuration file in the configuration control signal, and then respectively carrying out multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results; the processing module is used for respectively carrying out shifting and effective bit expansion processing on a plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than a preset bit width; the output module is used for accumulating the plurality of processing results to obtain an operation result.

Description

Signed/unsigned multiply-accumulate device and method

Technical Field

The present invention relates to the field of processor design, and more particularly to a signed/unsigned multiply-accumulate device and method.

Background

Coarse-grained reconfigurable processor architectures are attracting increasing attention because they combine low energy consumption, high performance, high energy efficiency, and flexible dynamic reconfigurability. A coarse-grained reconfigurable computing architecture is a high-performance computing architecture that combines characteristics of a general-purpose processor, notably its flexibility, with those of an application-specific integrated circuit. It is well suited to data-intensive and computation-intensive applications with very high parallelism, such as artificial intelligence, digital signal processing, video and image processing, scientific computing, and communication encryption. Meanwhile, the rapid rise of applications such as artificial intelligence, neural networks, big data, cloud computing, and 5G communication brings ever more intensive data and computation, and these applications often involve a large number of multiplication (MUL) operations and multiply-accumulate (MAC) operations with different bit-width requirements.

In 2017, Google built the TPU (Tensor Processing Unit), a dedicated integrated-circuit accelerator for neural network applications. It mainly uses multiply-accumulator (MAC) units that execute multiply-accumulate operations as a systolic array on a 256x256 MAC array, achieving a computing power of 92 TOPS at 8 bits and an energy efficiency ratio of 4 TOPS/W at 8 bits. However, it supports only 8-bit MAC operations. In many applications, such as image and video processing, speech recognition, and neural networks, the required computational precision varies, and some data need only a lower bit width to meet the precision requirement. Therefore, if a hardware processing unit that supports high-bit-width operation can also execute multiple groups of low-bit-width data in parallel, computing capability and performance can be nearly doubled with limited hardware resources and without much additional power consumption, and the energy efficiency of the computation can be greatly increased.

Current reconfigurable processing architectures perform multiplication and addition at a single bit width and as separate operations, so they often cannot adjust bit-width precision flexibly to suit the requirements of a specific application. In addition, one MAC operation usually requires two or more operation cycles: the first cycle multiplies the multiplier and the multiplicand, and the second cycle adds the result of the previous cycle to the addend through the accumulator. This greatly limits how flexibly and efficiently a reconfigurable processor can handle such tasks. A new device and method are therefore needed to improve efficiency.

Disclosure of Invention

The invention aims to provide a signed/unsigned multiply-accumulate device and method that can be used effectively in a coarse-grained reconfigurable processor architecture. Through flexible dynamic configuration of the operand bit width, multiple groups of multiply/multiply-accumulate operations with different bit widths are processed in parallel while the operation resources of the multiply-accumulator are fully utilized, so that the computing throughput, computing performance, and energy efficiency of the multiply-accumulator are improved nearly by multiples. Meanwhile, signed and unsigned multiply/multiply-accumulate operations are supported effectively and flexibly in the same multiply-accumulator circuit, and the dynamic reconfigurability of the reconfigurable processor is fully preserved at very low power and area overhead.

To achieve the above object, the present invention provides a signed/unsigned multiply-accumulate device suitable for a coarse-grained reconfigurable processor architecture, comprising a splitting module, an operation module, a processing module, and an output module. The splitting module is used for acquiring a configuration control signal and, according to the configuration control signal, splitting an input binary multiplicand, multiplier, and addend that are larger than a preset bit width into a plurality of groups of binary numbers smaller than the preset bit width. The operation module is used for grouping the plurality of groups of binary numbers smaller than the preset bit width across a plurality of MAC operation units according to the dynamic configuration file in the configuration control signal, and then performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results. The processing module is used for performing shifting and significant-bit extension on each of the calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than the preset bit width. The output module is used for accumulating the processing results to obtain an operation result.

In the above signed/unsigned multiply-accumulate device, preferably, the operation module includes a plurality of MAC operation units; each MAC operation unit is used for parsing a function identifier and an operation type identifier in the dynamic configuration file; obtaining, according to the function identifier and the operation type identifier, the operation mode that the MAC operation unit applies to the received binary numbers; and performing the corresponding multiply-accumulate calculation on the received binary numbers according to the operation mode to obtain a corresponding calculation result.

In the above signed/unsigned multiply-accumulate device, preferably, the MAC operation unit is further used for: identifying the sign condition of the binary number smaller than the preset bit width; performing the corresponding sign-bit extension or unsigned extension on the binary number according to the sign condition; and, after partial-product and addend shift processing of the extended binary number, obtaining a calculation result through multiplication and accumulation.

In the above signed/unsigned multiply-accumulate device, preferably, the operation module is further used for: acquiring the operation type of each MAC operation unit according to the requirement of the calling application, and acquiring the addend value of each MAC operation unit according to the operation type; the MAC operation unit then performs partial-product and addend shift processing on the extended binary number according to the addend value.

In the above signed/unsigned multiply-accumulate device, preferably, the processing module further includes a shift module configured to perform shifting and significant-bit extension on the calculation results according to a preset adjustment rule, based on the signed/unsigned condition of the calculation results and the operation type, so as to obtain a plurality of processing results with bit widths larger than the preset bit width.

The invention also provides a signed/unsigned multiply-accumulate method, which is suitable for a coarse-grained reconfigurable processor architecture and comprises the following steps: acquiring a configuration control signal, splitting an input binary multiplicand, a multiplier and an addend which are larger than a preset bit width according to the configuration control signal, and generating a plurality of groups of binary numbers which are smaller than the preset bit width; according to the dynamic configuration file in the configuration control signal, correspondingly grouping a plurality of groups of binary numbers smaller than a preset bit width through a plurality of MAC operation units, and then respectively performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results; respectively carrying out shifting and effective bit expansion processing on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than a preset bit width; and accumulating a plurality of processing results to obtain an operation result.

In the above signed/unsigned multiply-accumulate method, preferably, grouping the plurality of groups of binary numbers smaller than the preset bit width across a plurality of MAC operation units and then performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results includes: parsing a function identifier and an operation type identifier in the dynamic configuration file; obtaining, according to the function identifier and the operation type identifier, the operation mode that each MAC operation unit applies to the received binary numbers; and performing the corresponding multiply-accumulate calculation on the received binary numbers according to the operation mode to obtain a corresponding calculation result.

In the above signed/unsigned multiply-accumulate method, preferably, performing the corresponding multiply-accumulate calculation on the received binary numbers according to the operation mode to obtain a corresponding calculation result includes: identifying the sign condition of the binary number smaller than the preset bit width; performing the corresponding sign-bit extension or unsigned extension on the binary number according to the sign condition; and, after partial-product and addend shift processing of the extended binary number, obtaining a calculation result through multiplication and accumulation.

In the above signed/unsigned multiply-accumulate method, preferably, performing shifting and significant-bit extension on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results with bit widths larger than the preset bit width further includes: acquiring the operation type of each MAC operation unit according to the requirement of the calling application, and acquiring the addend value of each MAC operation unit according to the operation type; the MAC operation unit then performs partial-product and addend shift processing on the extended binary number according to the addend value.

In the above signed/unsigned multiply-accumulate method, preferably, the operation type includes: a high-bit-width signed/unsigned MAC operation and a parallel low-bit-width signed/unsigned MAC operation.

The beneficial technical effects of the invention are as follows: it supports both signed and unsigned multiplication and multiply-accumulate operations and unifies signed and unsigned operation in one operation circuit, which not only saves area and power consumption but also meets the requirements of various applications through configuration and reconfiguration, giving good reconfigurability and wider applicability.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention. In the drawings:

FIG. 1 is a schematic diagram of a signed multiply-accumulate device according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an application structure of a signed multiply-accumulate device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating an operation principle of the signed multiply-accumulate apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating multiply-accumulate operation with arbitrary precision and MAC operation according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating a signed multiply accumulate method according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a complete set of unsigned MAC multiply-accumulate operations for high bit widths according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating an application of a complete set of unsigned high bit width MAC multiply-accumulate operations according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating two parallel groups of low-bit-width unsigned MAC multiply-accumulate operations according to an embodiment of the present invention;

FIG. 9 is a diagram illustrating an application of two parallel sets of low-bit-width unsigned MAC multiply-accumulate calculations according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating a complete set of MAC multiply-accumulate operations for high-bit-width signed numbers according to an embodiment of the present invention;

FIG. 11 is a diagram illustrating an application of a complete set of signed high bit width MAC multiply-accumulate operations according to an embodiment of the present invention;

FIG. 12 is a diagram illustrating two parallel sets of MAC multiply-accumulate operations with low bit width and signed numbers according to an embodiment of the present invention;

FIG. 13 is a schematic diagram illustrating an application of two parallel MAC multiply-accumulate operations with low bit width and signed numbers according to an embodiment of the present invention.

Detailed Description

The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.

Additionally, the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.

Referring to FIG. 1, the signed/unsigned multiply-accumulate device provided by the present invention is suitable for a coarse-grained reconfigurable processor architecture and includes a splitting module (module 1), an operation module (module 2), a processing module (module 3), and an output module (module 4). The splitting module is used for acquiring a configuration control signal and, according to the configuration control signal, splitting an input binary multiplicand, multiplier, and addend that are larger than a preset bit width into a plurality of groups of binary numbers smaller than the preset bit width. The operation module is used for grouping the plurality of groups of binary numbers smaller than the preset bit width across a plurality of MAC operation units according to the dynamic configuration file in the configuration control signal, and then performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results. The processing module is used for performing shifting and significant-bit extension on each of the calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than the preset bit width. The output module is used for accumulating the processing results to obtain an operation result. In this way, the high-bit-width binary numbers required by a multiplication/multiply-accumulate operation are split into several groups of low-bit-width binary numbers, which are computed by low-bit-width multiply-accumulators and suitably combined, finally realizing a signed/unsigned multiply-accumulator with adjustable bit-width precision.

In an embodiment of the present invention, the operation module includes a plurality of MAC operation units; the MAC operation unit is used for analyzing a function identifier and an operation type identifier in the dynamic configuration file; obtaining the operation mode of each MAC operation unit on the received binary number according to the function identifier and the operation type identifier; and carrying out corresponding multiplication and accumulation calculation on the received binary number according to the operation mode to obtain a corresponding calculation result.

As shown in fig. 2, in actual operation, the main functions performed by the above modules are as follows:

The splitting module splits the high-bit-width input binary multiplicand A, multiplier B, and addend C into several groups of low-bit-width binary numbers according to the configuration control signal Config. Here it is assumed that each operand is split into 2 groups of low-bit-width data of arbitrary width.

Input signals: a 3-bit configuration control signal Config, an M-bit multiplicand A, an N-bit multiplier B, and an L-bit addend C. Output signals: after splitting, the m-bit A_L and (M-m)-bit A_H; the n-bit B_L and (N-n)-bit B_H; the l-bit C_L and (L-l)-bit C_H.
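The split itself is a mask and a shift. The following is a minimal sketch of that step, not the patented circuit; the function name `split_operand` and its parameters are hypothetical and chosen only for illustration.

```python
def split_operand(value: int, width: int, low_width: int) -> tuple[int, int]:
    """Split a `width`-bit pattern into (high, low) fields.

    `low` holds the low `low_width` bits (for example A_L with m bits);
    `high` holds the remaining `width - low_width` bits (for example A_H).
    Both fields are returned as raw, unsigned bit patterns.
    """
    if not 0 < low_width < width:
        raise ValueError("low_width must be between 1 and width - 1")
    value &= (1 << width) - 1               # keep only the width-bit pattern
    low = value & ((1 << low_width) - 1)    # low field
    high = value >> low_width               # high field
    return high, low


# Illustrative values only: M = 16, m = 8
a_h, a_l = split_operand(0xBEEF, 16, 8)
```

Whether a field is later treated as signed or unsigned is decided by the MAC operation unit, so the sketch deliberately returns uninterpreted bit patterns.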

The operation module performs multiply-accumulate calculation on the split multiplicands, multipliers, and addends, covering multiply-accumulate of signed numbers, of unsigned numbers, and of mixed signed and unsigned numbers. The configuration information Config 1 to Config 4 determines whether 1 group of high-bit-width signed/unsigned multiply/multiply-accumulate operations or 2 groups of low-bit-width signed/unsigned multiply/multiply-accumulate operations is performed; whether a multiplication or a multiply-accumulate is performed is determined by whether the specific addend A_in is zero.

Input signals: the m-bit A_L and (M-m)-bit A_H; the n-bit B_L and (N-n)-bit B_H; the l-bit C_L and (L-l)-bit C_H; and a 2-bit Config signal. Output signals: the multiply-accumulate results P1, P2, P3, and P4.

The processing module applies the appropriate shift and most-significant-bit extension to the results of the multiply/multiply-accumulate operations computed on the low-bit-width data. The result P1 produced by the first MAC is not shifted; the result P2 produced by the second MAC is shifted left by m bits; the result P3 produced by the third MAC is shifted left by n bits; and the result P4 produced by the fourth MAC is shifted left by m + n bits. All shifted results are then most-significant-bit extended to M + N bits.

Input signals: the multiply-accumulate results P1, P2, P3, and P4. Output signals: the results of P1, P2, P3, and P4 after shifting and most-significant-bit extension: P1_ext, P2_ext, P3_ext, and P4_ext.
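The shift amounts 0, m, n, and m + n follow from expanding the product of the split operands. The derivation below is a sketch for the unsigned case. It uses the assignment of products to the four MAC units that the description gives for the high-bit-width example, and it assumes that the addend C is split at the same position as A (that is, l = m), which the text does not state explicitly.

```latex
\begin{aligned}
A &= A_H \cdot 2^{m} + A_L, \qquad
B = B_H \cdot 2^{n} + B_L, \qquad
C = C_H \cdot 2^{m} + C_L \\
A \times B + C
  &= (A_L B_L + C_L)
   + (A_H B_L + C_H) \cdot 2^{m}
   + (A_L B_H) \cdot 2^{n}
   + (A_H B_H) \cdot 2^{m+n} \\
  &= P_1 + P_2 \cdot 2^{m} + P_3 \cdot 2^{n} + P_4 \cdot 2^{m+n}
\end{aligned}
```

Each bracketed term is one low-bit-width multiply-accumulate, so the high-bit-width result is recovered by shifting and summing the four partial results.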

The output module mainly accumulates the results calculated by the former modules to obtain the final result.

Input signals: the shifted and extended results P1_ext, P2_ext, P3_ext, and P4_ext. Output signal: the final multi-bit multiply-accumulate result Product.
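The processing and output modules can be modeled together in a few lines. This is a behavioral sketch under stated assumptions, not the hardware: the function name `recombine` is hypothetical, and most-significant-bit extension is modeled by masking a Python integer to the output width, which yields the two's-complement pattern for negative partial results.

```python
def recombine(p1: int, p2: int, p3: int, p4: int,
              m: int, n: int, out_width: int) -> int:
    """Shift, extend, and accumulate the four MAC results.

    m and n are the bit widths of A_L and B_L; out_width is M + N.
    The return value is the out_width-bit pattern of the final Product.
    """
    mask = (1 << out_width) - 1
    p1_ext = p1 & mask                  # first MAC: no shift
    p2_ext = (p2 << m) & mask           # second MAC: left shift by m
    p3_ext = (p3 << n) & mask           # third MAC: left shift by n
    p4_ext = (p4 << (m + n)) & mask     # fourth MAC: left shift by m + n
    # Output module: accumulate the extended results.
    return (p1_ext + p2_ext + p3_ext + p4_ext) & mask


# Illustrative values: A = 0xBEEF, B = 0x1234, C = 0xCAFE, m = n = 8
product = recombine(0xEF * 0x34 + 0xFE, 0xBE * 0x34 + 0xCA,
                    0xEF * 0x12, 0xBE * 0x12, 8, 8, 32)
```

The example assumes the addend is split at the same position as A (l = m), so that C_H carried by the second MAC lands at the correct bit position after the shift.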

In the above embodiment, the control signal Config is a 3-bit-wide configuration signal whose function is shown in Table 1.

TABLE 1

To make the precision adjustable, the invention further configures the control signal Config dynamically: configuration control signals Config 1 to Config 4 are generated within the control signal Config, each being a 2-bit-wide configuration signal, and the corresponding MAC operation modes are shown in Table 2.

TABLE 2

Value of Config 1 to Config 4	Operation performed by the MAC
00	Multiply-accumulate operation in which the multiplier and multiplicand are both unsigned
01	Multiply-accumulate operation with a signed multiplier and an unsigned multiplicand
10	Multiply-accumulate operation with a signed multiplicand and an unsigned multiplier
11	Multiply-accumulate operation in which the multiplier and multiplicand are both signed
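Read row by row, Table 2 behaves as if the upper bit of the 2-bit value marks the multiplicand as signed and the lower bit marks the multiplier as signed. That bit assignment is an inference from the table, not something the text states, and the names `decode_mac_config`, `to_int`, and `mac` below are hypothetical. A small behavioral sketch:

```python
def decode_mac_config(cfg: int) -> tuple[bool, bool]:
    """Return (multiplicand_signed, multiplier_signed) for a 2-bit value."""
    if cfg not in (0b00, 0b01, 0b10, 0b11):
        raise ValueError("cfg must be a 2-bit value")
    return bool(cfg & 0b10), bool(cfg & 0b01)


def to_int(bits: int, width: int, signed: bool) -> int:
    """Interpret a width-bit pattern as a signed or unsigned integer."""
    bits &= (1 << width) - 1
    if signed and bits >> (width - 1):
        return bits - (1 << width)      # two's-complement negative value
    return bits


def mac(a_bits: int, b_bits: int, addend: int,
        width_a: int, width_b: int, cfg: int) -> int:
    """Multiply-accumulate with operand signedness selected by cfg."""
    a_signed, b_signed = decode_mac_config(cfg)
    a = to_int(a_bits, width_a, a_signed)   # multiplicand
    b = to_int(b_bits, width_b, b_signed)   # multiplier
    return a * b + addend


# Same bit patterns, four interpretations
results = [mac(0xFF, 0xFF, 0, 8, 8, cfg) for cfg in (0b00, 0b01, 0b10, 0b11)]
```

The mixed modes matter for a split signed operand: its high half carries the sign while its low half is a plain unsigned field.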

Thus, the final operation modes and functions of the different MAC operation units can be determined based on the correspondence relationship of the tables, as shown in table 3 below.

TABLE 3

In an embodiment of the present invention, the MAC operation unit is further used for: identifying the sign condition of the binary number smaller than the preset bit width; performing the corresponding sign-bit extension or unsigned extension on the binary number according to the sign condition; and, after partial-product and addend shift processing of the extended binary number, obtaining a calculation result through multiplication and accumulation. In practice, signed and unsigned multiply-accumulate are computed uniformly in one hardware framework. The addend in a multiply-accumulate operation can be handled in the same way as a multiplication partial product, so the addend is hidden in the partial-product addition tree of the multiplier (such as a Wallace tree), where compression and accumulation are carried out uniformly, and the multiply-accumulate operation is completed with essentially no increase in area overhead. During operation, each MAC operation unit first determines whether each input number is signed, and then performs sign-bit extension or unsigned extension accordingly. Then, using a radix-4 Booth algorithm modified for the multiply-accumulate design, the partial products and the addend are shifted and accumulated, so that the addend is hidden in the multiplication and the multiply-accumulate operation is completed. Specifically, as shown in FIG. 3, after sign-bit extension of signed numbers, most-significant-bit extension of unsigned numbers, and Booth encoding of the partial products, the computed result carries four additional sign bits, so truncation is required to obtain the final result. This reduces the hardware resource overhead of the multiplier, reduces area and power consumption, shortens the computation delay of the multiply-accumulator, and improves operating frequency and energy efficiency. Here, S is the most significant bit after Booth encoding of the partial products and the addend in the multiply-accumulate operation; N indicates whether a partial product in the multiply-accumulate operation requires the negate-and-add-1 operation because its Booth encoding is negative; and M represents the data bit width of the signed/unsigned multiply-accumulate operation of arbitrary bit width.
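The idea of hiding the addend among the partial products can be illustrated with a textbook radix-4 Booth recoding. The sketch below is a behavioral model only: it is not the modified Booth scheme of FIG. 3, it does not model the four extra sign bits or the truncation, and a plain `sum` stands in for the Wallace-tree compression. The function names are hypothetical.

```python
def booth_radix4_digits(b: int, width: int) -> list[int]:
    """Radix-4 Booth digits of a signed width-bit multiplier, LSB first.

    Each digit is in {-2, -1, 0, 1, 2}; width must be even.
    """
    if width % 2:
        raise ValueError("width must be even for radix-4 recoding")
    bits = b & ((1 << width) - 1)
    digits, prev = [], 0                # prev is the implicit bit b[-1] = 0
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_mac(a: int, b: int, c: int, width: int) -> int:
    """Compute a * b + c with the addend treated as one more partial product."""
    rows = [(a * d) << (2 * i)
            for i, d in enumerate(booth_radix4_digits(b, width))]
    rows.append(c)                      # the addend joins the partial products
    return sum(rows)                    # stand-in for the compression tree


# An unsigned 8-bit multiplier is recoded at width 10, i.e. with two
# extra zero bits, so that it can be treated as a signed number.
unsigned_case = booth_mac(255, 255, 255, 10)
signed_case = booth_mac(-7, 5, 3, 8)
```

Because the addend is just an extra row, a multiply-accumulate costs one more input to the tree rather than a separate adder stage.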

In an embodiment of the present invention, the operation module is further used for: acquiring the operation type of each MAC operation unit according to the requirement of the calling application, and acquiring the addend value of each MAC operation unit according to the operation type; the MAC operation unit then performs partial-product and addend shift processing on the extended binary number according to the addend value. Further, the processing module may perform shifting and significant-bit extension on the calculation results according to a preset adjustment rule, based on the signed/unsigned condition of the calculation results and the operation type, to obtain a plurality of processing results larger than the preset bit width. Specifically, referring to FIG. 4 and assuming M = N = L and M > n, when a precision-adjustable signed/unsigned multiply-accumulate calculation is performed, the multiplicand A, the multiplier B, and the addend C are each split into 2 groups of low-bit-width data: A_H, A_L, B_H, B_L, C_H, and C_L. These are then combined correspondingly to obtain the 4 low-bit-width MAC operation parts shown in FIG. 4, denoted ①, ②, ③, and ④. Parts ① and ③ are ordinary multiply-accumulate operations whose addends are C_L and 0, respectively. Parts ② and ④ are multiply-accumulate operations whose addend may be C_H or 0: when 1 group of high-bit-width multiply-accumulate operations is performed, the addends of ② and ④ are C_H and 0, respectively; when 2 groups of low-bit-width multiply-accumulate operations are performed in parallel, the addends of ② and ④ are 0 and C_H, respectively. If 1 group of high-bit-width multiply-accumulate operations is performed, shift and most-significant-bit extension operations are applied to the MAC results of the 4 parts.
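The addend routing described above amounts to a small selection table keyed by the operation type. A minimal sketch, with a hypothetical function name and a boolean `parallel` flag standing in for the configured operation type:

```python
def select_addends(c_h: int, c_l: int, parallel: bool) -> dict[int, int]:
    """Return the addend fed to each of the four MAC parts (keys 1 to 4).

    The first and third parts always receive C_L and 0. The second and
    fourth receive (C_H, 0) for one high-bit-width operation and (0, C_H)
    for two parallel low-bit-width operations.
    """
    if parallel:
        return {1: c_l, 2: 0, 3: 0, 4: c_h}
    return {1: c_l, 2: c_h, 3: 0, 4: 0}


# Illustrative values only
high_mode = select_addends(0xCA, 0xFE, parallel=False)
parallel_mode = select_addends(0xCA, 0xFE, parallel=True)
```

One plausible reading, not stated outright in the text, is that in parallel mode the first and fourth parts deliver the two independent low-bit-width results, A_L × B_L + C_L and A_H × B_H + C_H, which is why C_H moves from the second part to the fourth.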

Therefore, in the precision-adjustable signed/unsigned multiply-accumulate device designed by the invention, the final output can be flexibly selected, according to the precision and performance requirements of different applications, to be either one group of high-bit-width signed/unsigned multiply/multiply-accumulate results or several groups of low-bit-width signed/unsigned multiply/multiply-accumulate results computed in parallel. Moreover, signed and unsigned operations are unified in one operation circuit, which not only saves area and power consumption but also meets the requirements of various applications through configuration and reconfiguration, giving good reconfigurability and wider applicability.

Referring to FIG. 5, the present invention further provides a signed/unsigned multiply-accumulate method suitable for a coarse-grained reconfigurable processor architecture, the method comprising:

S501, acquiring a configuration control signal, splitting an input binary multiplicand, a multiplier and an addend which are larger than a preset bit width according to the configuration control signal, and generating a plurality of groups of binary numbers which are smaller than the preset bit width;

S502, according to the dynamic configuration file in the configuration control signal, correspondingly grouping a plurality of groups of binary numbers smaller than a preset bit width through a plurality of MAC operation units, and then respectively performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results;

S503, respectively carrying out shifting and significant-bit extension processing on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than a preset bit width;

S504, accumulating the processing results to obtain an operation result.
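The four steps can be traced end to end in a short behavioral model of one high-bit-width operation. This is a sketch under stated assumptions rather than the claimed implementation: the name `split_mac` is hypothetical, all three operands share one width, the addend is split at the same position as the multiplicand, and in the signed case only the high halves carry a sign.

```python
def split_mac(a_bits: int, b_bits: int, c_bits: int,
              width: int, m: int, n: int, signed: bool) -> int:
    """Model S501 to S504; returns the 2*width-bit pattern of A*B + C."""
    full = (1 << width) - 1

    def split(x: int, low: int) -> tuple[int, int]:
        # S501: low field is unsigned; high field is signed if requested.
        x &= full
        hi, lo = x >> low, x & ((1 << low) - 1)
        hi_width = width - low
        if signed and hi >> (hi_width - 1):
            hi -= 1 << hi_width
        return hi, lo

    a_h, a_l = split(a_bits, m)
    b_h, b_l = split(b_bits, n)
    c_h, c_l = split(c_bits, m)          # assumption: C is split like A

    # S502: four low-bit-width multiply-accumulates.
    p = [a_l * b_l + c_l, a_h * b_l + c_h, a_l * b_h, a_h * b_h]

    # S503: shift each result and extend it to the output width.
    out_mask = (1 << (2 * width)) - 1
    ext = [(pi << s) & out_mask for pi, s in zip(p, (0, m, n, m + n))]

    # S504: accumulate.
    return sum(ext) & out_mask


unsigned_result = split_mac(0xBEEF, 0x1234, 0xCAFE, 16, 8, 8, signed=False)
signed_result = split_mac(-3 & 0xFFFF, 5, -7 & 0xFFFF, 16, 8, 8, signed=True)
```

In the signed call the inputs encode -3, 5, and -7, so the returned pattern is the 32-bit two's-complement encoding of -22.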

In the above embodiment, grouping the plurality of groups of binary numbers smaller than the preset bit width across a plurality of MAC operation units and then performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results includes: parsing a function identifier and an operation type identifier in the dynamic configuration file; obtaining, according to the function identifier and the operation type identifier, the operation mode that each MAC operation unit applies to the received binary numbers; and performing the corresponding multiply-accumulate calculation on the received binary numbers according to the operation mode to obtain a corresponding calculation result. Performing the corresponding multiply-accumulate calculation on the received binary numbers according to the operation mode includes: identifying the sign condition of the binary number smaller than the preset bit width; performing the corresponding sign-bit extension or unsigned extension on the binary number according to the sign condition; and, after partial-product and addend shift processing of the extended binary number, obtaining a calculation result through multiplication and accumulation. For a specific application example, refer to FIG. 4 and the corresponding embodiment described above; the details are not repeated here.

In an embodiment of the present invention, performing shifting and significant-bit extension on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than the preset bit width further includes: acquiring the operation type of each MAC operation unit according to the requirement of the calling application, and acquiring the addend value of each MAC operation unit according to the operation type; the MAC operation unit then performs partial-product and addend shift processing on the extended binary number according to the addend value. The operation types include: a high-bit-width signed/unsigned MAC operation and a parallel low-bit-width signed/unsigned MAC operation.

For a better understanding of the specific application of the above embodiments, the high-bit-width signed/unsigned MAC operation and the parallel low-bit-width signed/unsigned MAC operations are described in detail below using specific examples. Those skilled in the art will understand that these examples are illustrative only and are not intended to limit the invention.

A complete set of high-bit-width unsigned MAC operations is shown in fig. 6, where it is assumed that M = N = L and M > n. When the inputs A, B and C are unsigned numbers, A_H, A_L, B_H, B_L, C_H and C_L are all unsigned, so all operations performed by the 4 low-bit-width MAC operation parts are unsigned. However, the signed/unsigned multiply-accumulate method provided by the invention unifies unsigned numbers into signed numbers for calculation, so the operands of the 4 groups of unsigned MAC calculations must still be extended: following the logic of the MAC operation unit, a two-bit unsigned extension is added before calculation. Since the two partial-product MAC calculations ③ and ④ in fig. 6 have no addend, the addend supplied to the multiply-accumulator is treated as 0. The specific procedure is as follows:

The first step: for ①, A_L × B_L + C_L is calculated and the result is not shifted. A_L, B_L and C_L are unsigned-extended;

The second step: for ②, A_H × B_L + C_H is calculated and the result is logically shifted left by m bits (the bit width of A_L). A_H, B_L and C_H are unsigned-extended;

The third step: for ③, A_L × B_H + 0 is calculated and the result is logically shifted left by n bits (the bit width of B_L). A_L and B_H are unsigned-extended;

The fourth step: for ④, A_H × B_H + 0 is calculated and the result is logically shifted left by m + n bits (the sum of the bit widths of A_L and B_L). A_H and B_H are unsigned-extended;

The fifth step: the MAC operations ①, ②, ③ and ④ yield four calculation results, and the four results are accumulated to obtain the final unsigned high-bit-width result.

Referring to fig. 7, an example in which only one set of high-bit-width unsigned MAC operations is performed is illustrated with A = 155, B = 161 and C = 88, calculating the unsigned multiply-accumulate A × B + C. Each high-bit-width operand is split into 2 groups of low-bit-width data, the low-bit-width data are unsigned-extended, and the multiplications or multiply-accumulations are then performed; finally, P₁, P₂, P₃ and P₄ are shifted and added to obtain the final high-bit-width calculation result. This example shows that the multiply-accumulate method provided by the invention can perform accurate, precision-adjustable multiplication or multiply-accumulate calculation on unsigned numbers.
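The five steps can be sketched in a few lines. This is a minimal model, assuming 8-bit operands split into 4-bit halves (m = n = 4) and assuming C is split at the same bit position as A; the function names are hypothetical. Python integers are unbounded, so the two-bit unsigned extension needs no explicit modelling here.

```python
def split(value: int, low_bits: int):
    """Split an unsigned value into (high part, low part)."""
    return value >> low_bits, value & ((1 << low_bits) - 1)


def unsigned_partial_results(a: int, b: int, c: int, m: int = 4, n: int = 4):
    """Return the shifted results of units 1..4 for unsigned A * B + C."""
    a_h, a_l = split(a, m)
    b_h, b_l = split(b, n)
    c_h, c_l = split(c, m)                # assumption: C is split like A
    p1 = a_l * b_l + c_l                  # step 1: no shift
    p2 = (a_h * b_l + c_h) << m           # step 2: shift left by m
    p3 = (a_l * b_h + 0) << n             # step 3: shift left by n
    p4 = (a_h * b_h + 0) << (m + n)       # step 4: shift left by m + n
    return [p1, p2, p3, p4]


def unsigned_mac_high(a: int, b: int, c: int, m: int = 4, n: int = 4) -> int:
    """Step 5: accumulate the four shifted results."""
    return sum(unsigned_partial_results(a, b, c, m, n))


print(unsigned_partial_results(155, 161, 88))   # [19, 224, 1760, 23040]
print(unsigned_mac_high(155, 161, 88))          # 25043 == 155 * 161 + 88
```

With A = 155, B = 161 and C = 88 the halves are A_H = 9, A_L = 11, B_H = 10, B_L = 1, C_H = 5 and C_L = 8, and the four shifted results sum to 155 × 161 + 88 = 25043.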

In another embodiment, two sets of parallel low-bit-width unsigned MAC operations are shown in fig. 8, where it is assumed that M = N = L and M > n. When the inputs A, B and C are unsigned numbers, A_H, A_L, B_H, B_L, C_H and C_L are all unsigned. Units ② and ③ are not enabled, which reduces the corresponding computation power consumption; the 2 groups of low-bit-width MAC operations P₁ = A_L × B_L + C_L and P₄ = A_H × B_H + C_H are computed, and both are unsigned operations. However, the MAC operation designed by the present invention unifies unsigned numbers into signed numbers for calculation, so the unsigned operands of the MAC operations must still be extended: following the logic of the MAC multiplier, a two-bit unsigned extension is added before calculation. To obtain the two low-bit-width multiply-accumulate results simultaneously, the specific flow is as follows:

The first step: for ①, A_L × B_L + C_L is calculated and the result is not shifted. A_L, B_L and C_L are unsigned-extended;

The second step: units ② and ③ are not enabled and perform no calculation; all of their input data signals are set to 0.

The third step: for ④, A_H × B_H + C_H is calculated and the result is logically shifted left by m + n bits (the sum of the bit widths of A_L and B_L). A_H, B_H and C_H are unsigned-extended.

The fourth step: the two final output results, P₁ and P₄, are the calculation results of the 2 groups of low-bit-width unsigned MAC multiply-accumulate operations respectively.

Referring to fig. 9, the operation is illustrated with A_H = 9, B_H = 10, C_H = 5 and A_L = 11, B_L = 1, C_L = 8. When calculating the unsigned multiply-accumulate A_H × B_H + C_H, the MAC operation unit performs an unsigned multiply-accumulate and obtains the result P₄ = 95; in parallel, the MAC operation unit performs the unsigned multiply-accumulate A_L × B_L + C_L and obtains P₁ = 19. This example shows that, with the multiply-accumulate method provided by the invention, two groups of low-bit-width unsigned numbers can be calculated simultaneously in parallel and two accurate results obtained.
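The parallel unsigned mode can be sketched as follows, assuming 4-bit fields (m = n = 4). Placing P₄ above P₁ in a single output word by the m + n shift is an interpretation of the shift step, not something the text spells out, and the function name is hypothetical.

```python
def parallel_unsigned_mac(a_h: int, b_h: int, c_h: int,
                          a_l: int, b_l: int, c_l: int,
                          m: int = 4, n: int = 4):
    """Compute two independent low-bit-width unsigned MACs on units 1 and 4."""
    p1 = a_l * b_l + c_l                  # unit 1, not shifted
    p4 = a_h * b_h + c_h                  # unit 4
    # Units 2 and 3 are disabled, so nothing overlaps the two fields.
    # P1 always fits in m + n bits: (2**m - 1) * (2**n - 1) + (2**m - 1) < 2**(m + n).
    packed = (p4 << (m + n)) | p1
    return p1, p4, packed


p1, p4, packed = parallel_unsigned_mac(9, 10, 5, 11, 1, 8)
print(p1, p4, hex(packed))
```

For the operands of the example, this yields P₁ = 19 and P₄ = 95, matching the values stated above.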

In one embodiment of the present invention, a complete set of high-bit-width signed MAC operations is shown in fig. 10, where it is assumed that M = N = L and M > n. When the inputs A, B and C are signed numbers, the 4 MAC operations must be discussed case by case; E denotes the sign bit of the data to be extended. A_H, B_H and C_H are signed numbers, while A_L, B_L and C_L are unsigned numbers, so each of the 4 low-bit-width MAC operation parts corresponds to a different kind of MAC operation: ① performs an unsigned operation; ② multiplies a signed number by an unsigned number and adds a signed number; ③ multiplies an unsigned number by a signed number; ④ performs a signed operation. The operations are therefore discussed separately, and the calculation proceeds as follows:

The first step: as shown at ① in fig. 10, A_L × B_L + C_L is calculated first. This is an unsigned calculation, so A_L, B_L and C_L are unsigned-extended, i.e., extended by two 0 bits. The first 3 sign bits of the original calculation result are then discarded to save area overhead, so that m + n + 1 bits of the actual result P₁ are retained. Finally, bit-filling is performed: the result P₁ is MSB (most significant bit) extended until it reaches M + N bits.

The second step: as shown at ② in fig. 10, A_H × B_L + C_H is calculated first. A_H and C_H are signed numbers and B_L is an unsigned number, so they cannot be calculated directly. The unsigned number B_L is therefore unsigned-extended, i.e., extended by two 0 bits, and the signed numbers are sign-extended, i.e., extended by two copies of the sign bit E. The first 3 sign bits of the original calculation result are then discarded to save area overhead, so that (M − m) + n + 1 bits of the actual result P₂ are retained. Next, P₂ is shifted left by m bits. Finally, bit-filling is performed: the result P₂ is MSB-extended until it reaches M + N bits.

The third step: as shown at ③ in fig. 10, A_L × B_H + 0 is calculated first. A_L is an unsigned number and B_H is a signed number, so they cannot be calculated directly. The unsigned number A_L is therefore unsigned-extended, i.e., extended by two 0 bits, and the signed number is sign-extended, i.e., extended by two copies of the sign bit E. The first 3 sign bits of the original calculation result are then discarded to save area overhead, so that (N − n) + m + 1 bits of the actual result P₃ are retained. Next, P₃ is shifted left by n bits. Finally, bit-filling is performed: the result P₃ is MSB-extended until it reaches M + N bits.

The fourth step: as shown at ④ in fig. 10, A_H × B_H + 0 is calculated first. A_H and B_H are both signed numbers, so a signed calculation is performed and A_H and B_H are sign-extended, i.e., extended by two copies of the sign bit E. Unlike ①, ② and ③, the first 4 sign bits of the original calculation result are then discarded to save area overhead, so that (N − n) + (M − m) bits of the actual result P₄ are retained. Finally, P₄ is shifted left by m + n bits. As shown in fig. 10, no further MSB extension is required after this shift.

The fifth step: the four results obtained above are accumulated to obtain the final signed result.

Referring to fig. 11, a set of high-bit-width signed multiply-accumulate operations is illustrated with A = -1, B = 21 and C = -1. The method splits each high-bit-width operand into two groups of low-bit-width data and performs sign-bit extension or unsigned extension on the low-bit-width data as appropriate. The MAC operation unit provided by the invention unifies three different operations into one operation circuit, namely multiplication/multiply-accumulation of two signed numbers, of a signed number and an unsigned number, and of two unsigned numbers, thereby realizing accurate multiplication or multiply-accumulate calculation for every sign combination. Finally, P₁, P₂, P₃ and P₄ are shifted and then added to obtain the final result of the high-bit-width MAC operation. This example shows that the multiply-accumulate method provided by the invention can perform accurate, precision-adjustable multiplication or multiply-accumulate calculation on signed numbers.
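The signed high-bit-width flow can be checked with a small model. It assumes 8-bit two's-complement operands split into 4-bit halves, with the high halves read as signed and the low halves as unsigned, and it takes the example operands to be A = -1, B = 21 and C = -1 (an assumed reading of the example). The function names are hypothetical. Python's unbounded integers stand in for the sign-bit truncation and MSB extension to M + N bits, which only affect the stored width and not the value.

```python
def to_signed(field: int, width: int) -> int:
    """Read a width-bit two's-complement field as a signed int."""
    field &= (1 << width) - 1
    return field - (1 << width) if (field >> (width - 1)) & 1 else field


def signed_mac_high(a: int, b: int, c: int,
                    total: int = 8, m: int = 4, n: int = 4) -> int:
    """Signed A * B + C from four low-bit-width MACs (units 1..4)."""
    mask = (1 << total) - 1
    ua, ub, uc = a & mask, b & mask, c & mask            # raw bit patterns
    a_h, a_l = to_signed(ua >> m, total - m), ua & ((1 << m) - 1)
    b_h, b_l = to_signed(ub >> n, total - n), ub & ((1 << n) - 1)
    c_h, c_l = to_signed(uc >> m, total - m), uc & ((1 << m) - 1)
    p1 = a_l * b_l + c_l                  # unit 1: unsigned x unsigned + unsigned
    p2 = (a_h * b_l + c_h) << m           # unit 2: signed x unsigned + signed
    p3 = (a_l * b_h + 0) << n             # unit 3: unsigned x signed + 0
    p4 = (a_h * b_h + 0) << (m + n)       # unit 4: signed x signed + 0
    return p1 + p2 + p3 + p4


print(signed_mac_high(-1, 21, -1))        # -22 == (-1) * 21 + (-1)
```

For these operands the shifted partial results are 90, -96, 240 and -256, which sum to -22.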

In one embodiment of the present invention, two sets of parallel low-bit-width signed MAC operations are shown in fig. 12, where it is assumed that M = N = L and M > n. When signed multiply-accumulate calculation is performed, A_H, B_H, C_H and A_L, B_L, C_L are two sets of signed numbers. For the low-bit-width signed MAC calculations, A_H, B_H, C_H and A_L, B_L, C_L must therefore all be sign-extended, where E denotes the sign bit of the data to be extended. In addition, units ② and ③ are not enabled, which reduces the corresponding computation power consumption. P₁ = A_L × B_L + C_L and P₄ = A_H × B_H + C_H are calculated. To obtain the two low-bit-width multiply-accumulate results simultaneously using the multiply-accumulator principle designed by the invention, the specific method is as follows:

The first step: for ①, A_L × B_L + C_L is calculated and the result is not shifted. A_L, B_L and C_L are sign-extended;

The second step: units ② and ③ are not enabled and perform no calculation; all of their input data signals are set to 0.

The third step: for ④, A_H × B_H + C_H is calculated, where A_H, B_H and C_H are sign-extended.

The fourth step: the two sets of MAC operations finally output their results in parallel; P₁ and P₄ are the calculation results of the two groups of low-bit-width signed MAC multiply-accumulate operations respectively.

As shown in fig. 13, two sets of parallel low-bit-width signed multiply-accumulate operations are illustrated with A_H = -1, B_H = 1, C_H = -1 and A_L = -1, B_L = 5, C_L = -1. For the signed operation A_H × B_H + C_H, the operation unit performs a signed MAC multiply-accumulate and obtains the result P₄ = -2; in parallel, the operation unit performs the signed MAC multiply-accumulate A_L × B_L + C_L and obtains the result P₁ = -6. This example shows that, with the multiply-accumulate method provided by the invention, two groups of low-bit-width signed numbers can be calculated simultaneously in parallel and two accurate calculation results still obtained.
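The parallel signed mode can be sketched by treating each 8-bit input word as two independent 4-bit signed lanes. Carrying the two operand sets in the high and low halves of one word is an assumption made for illustration, and `lane` and `parallel_signed_mac` are hypothetical names.

```python
def lane(word: int, index: int, width: int = 4) -> int:
    """Extract lane `index` of `word` and read it as a signed value."""
    field = (word >> (index * width)) & ((1 << width) - 1)
    return field - (1 << width) if (field >> (width - 1)) & 1 else field


def parallel_signed_mac(a_word: int, b_word: int, c_word: int, width: int = 4):
    """Return (P1, P4): the low-lane and high-lane signed MAC results."""
    p1 = lane(a_word, 0, width) * lane(b_word, 0, width) + lane(c_word, 0, width)
    p4 = lane(a_word, 1, width) * lane(b_word, 1, width) + lane(c_word, 1, width)
    return p1, p4


# 0xFF holds lanes (-1, -1); 0x15 holds lanes (high = 1, low = 5).
print(parallel_signed_mac(0xFF, 0x15, 0xFF))     # (-6, -2)
```

The same bit patterns that represent one 8-bit signed operand set in the high-bit-width mode are reinterpreted here as two 4-bit signed operand sets, which is why only the extension and enable settings change between the two modes.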

The technology of the invention splits the high-bit-width binary numbers required by a multiplication/multiply-accumulate operation into several groups of low-bit-width binary numbers, and then, through appropriate calculation on several low-bit-width multiply-accumulators (the MAC operation units), realizes a signed/unsigned multiply-accumulator with adjustable bit-width precision. The method supports signed/unsigned multiply-accumulate and multiply operations at multiple bit-width precisions while making full use of the hardware resources and at very low additional cost. More importantly, depending on the specific application and where the calculation precision allows, multiple groups of low-bit-width multiplication/multiply-accumulate operations can be executed in parallel to meet the application's computing performance requirements.

Although the invention takes splitting into two groups of low-bit-width data as an example, the method can split the input data into any number of low-bit-width groups, which further improves flexibility and meets the calculation requirements of different precisions. For example, if each operand is split into 4 groups of low-bit-width data, 16 low-bit-width MAC operation units can realize 4 groups of parallel low-bit-width signed/unsigned multiply-accumulate/multiply operations, or 2 groups of lower-bit-width signed/unsigned multiply-accumulate/multiply operations, or 1 group of high-bit-width signed/unsigned multiply-accumulate/multiply operations, for example with data bit widths of 8/16/32 or 4/8/16. By analogy, the proposed method can be extended to any precision-adjustable computing application, splitting a high bit width into any number of low bit widths for flexible bit-width precision design. The energy-efficient signed/unsigned multiply-accumulate device with adjustable bit-width precision can therefore be applied to hardware acceleration circuits with different requirements, such as CGRA, FPGA, GPU, DSP, TPU and neural network acceleration chips (NPU), and has high universality and wide applicability.
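The generalisation to more pieces can be sketched for the unsigned case. The sketch assumes equal-width pieces and routes piece i of the addend to the unit that multiplies piece i of A by the lowest piece of B; this routing is a hypothetical extension of the two-way scheme and is not stated in the text. Splitting each operand into k pieces uses k × k low-bit-width MAC units, so k = 4 gives the 16 units mentioned above.

```python
def split_pieces(value: int, piece_bits: int, pieces: int):
    """Split an unsigned value into `pieces` fields, lowest field first."""
    mask = (1 << piece_bits) - 1
    return [(value >> (i * piece_bits)) & mask for i in range(pieces)]


def mac_k_way(a: int, b: int, c: int, piece_bits: int = 4, pieces: int = 4) -> int:
    """Unsigned A * B + C assembled from pieces * pieces low-bit-width MACs."""
    a_p = split_pieces(a, piece_bits, pieces)
    b_p = split_pieces(b, piece_bits, pieces)
    c_p = split_pieces(c, piece_bits, pieces)
    total = 0
    for i, a_i in enumerate(a_p):
        for j, b_j in enumerate(b_p):
            addend = c_p[i] if j == 0 else 0          # assumed addend routing
            total += (a_i * b_j + addend) << ((i + j) * piece_bits)
    return total


print(mac_k_way(0xBEEF, 0x1234, 0xCAFE) == 0xBEEF * 0x1234 + 0xCAFE)
```

With `pieces=2` the same function reduces to the four-unit arrangement used in the earlier unsigned example.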

The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and should not be used to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A signed/unsigned multiply-accumulate device, suitable for a coarse-grained reconfigurable processor architecture, characterized by comprising a splitting module, an operation module, a processing module and an output module;

the splitting module is used for acquiring a configuration control signal, splitting an input binary system multiplicand, a multiplier and an addend which are larger than a preset bit width according to the configuration control signal and generating a plurality of groups of binary numbers which are smaller than the preset bit width;

the operation module is used for correspondingly grouping a plurality of groups of binary numbers smaller than a preset bit width through a plurality of MAC operation units according to the dynamic configuration file in the configuration control signal, and then respectively performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results;

the processing module is used for respectively carrying out shifting and effective bit expansion processing on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than a preset bit width;

the output module is used for accumulating the processing results to obtain an operation result;

the operation module comprises a plurality of MAC operation units;

the MAC operation unit is used for analyzing a function identifier and an operation type identifier in the dynamic configuration file;

acquiring the operation mode of each MAC operation unit on the received binary number according to the function identifier and the operation type identifier;

and carrying out corresponding multiplication and accumulation calculation on the received binary number according to the operation mode to obtain a corresponding calculation result.

2. The signed/unsigned multiply-accumulate device of claim 1, wherein the MAC operation unit is further used for:

identifying the sign condition of the binary number smaller than the preset bit width;

carrying out corresponding sign-bit extension or unsigned extension on the binary number according to the sign condition;

and, after partial-product and addend shift processing is carried out on the extended binary number, obtaining a calculation result through multiplication and accumulation.

3. The signed/unsigned multiply-accumulate device of claim 2, wherein the operation module is further used for: acquiring the operation category of each MAC operation unit according to the application calling requirement, and acquiring the addend value of each MAC operation unit according to the operation category; and the MAC operation unit performs partial-product and addend shift processing on the extended binary number according to the addend value.

4. The signed/unsigned multiply-accumulate device according to claim 3, wherein the processing module is further used for performing shift and valid-bit extension processing on the calculation results according to the preset adjustment rule, based on the signed/unsigned condition of each calculation result and the operation category, to obtain a plurality of processing results with bit widths larger than the preset bit width.

5. A signed/unsigned multiply-accumulate method for use in a coarse-grained reconfigurable processor architecture, the method comprising:

acquiring a configuration control signal, splitting an input binary system multiplicand, a multiplier and an addend which are larger than a preset bit width according to the configuration control signal, and generating a plurality of groups of binary numbers which are smaller than the preset bit width according to a preset splitting rule;

according to the dynamic configuration file in the configuration control signal, correspondingly grouping a plurality of groups of binary numbers smaller than a preset bit width through a plurality of MAC operation units, and then respectively performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain a plurality of calculation results;

respectively carrying out shifting and effective bit expansion processing on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results larger than a preset bit width;

accumulating a plurality of processing results to obtain an operation result;

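wherein correspondingly grouping the plurality of groups of binary numbers smaller than the preset bit width through the plurality of MAC operation units, and then respectively performing multiply-accumulate calculation and/or parallel multiply-accumulate calculation to obtain the plurality of calculation results, comprises:

analyzing a function identifier and an operation type identifier in the dynamic configuration file;

acquiring the operation mode of each MAC operation unit on the received binary number according to the function identifier and the operation type identifier;

and carrying out corresponding multiplication and accumulation calculation on the received binary number according to the operation mode to obtain a corresponding calculation result.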

6. The signed/unsigned multiply-accumulate method according to claim 5, wherein performing the corresponding multiplication and accumulation calculation on the received binary number according to the operation mode to obtain the corresponding calculation result comprises:

identifying the sign condition of the binary number smaller than the preset bit width;

carrying out corresponding sign-bit extension or unsigned extension on the binary number according to the sign condition;

and, after partial-product and addend shift processing is carried out on the extended binary number, obtaining a calculation result through multiplication and accumulation calculation.

7. The signed/unsigned multiply-accumulate method according to claim 6, wherein performing shift and valid-bit extension processing on the plurality of calculation results according to a preset adjustment rule to obtain a plurality of processing results with bit widths larger than the preset bit width further comprises:

acquiring the operation category of each MAC operation unit according to the application calling requirement, and acquiring the addend value of each MAC operation unit according to the operation category;

and the MAC operation unit carries out partial-product and addend shift processing on the extended binary number according to the addend value.

8. The signed/unsigned multiply-accumulate method according to claim 7, wherein the operation categories comprise: a high-bit-width signed/unsigned MAC operation and parallel low-bit-width signed/unsigned MAC operations.