CN111694541B - Base 32 operation circuit for number theory transformation multiplication - Google Patents

Info

Publication number
CN111694541B
Authority
CN
China
Prior art keywords
operands
operand
bit
circuit
csa
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010371312.9A
Other languages
Chinese (zh)
Other versions
CN111694541A (en)
Inventor
华斯亮
张惠国
刘玉申
徐健
卞九辉
张静亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changshu Institute of Technology
Priority to CN202010371312.9A
Publication of CN111694541A
Application granted
Publication of CN111694541B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G06F7/501 Half or full adders, i.e. basic adder cells for one denomination
    • G06F7/503 Half or full adders, i.e. basic adder cells for one denomination using carry switching, i.e. the incoming carry being connected directly, or only via an inverter, to the carry output under control of a carry propagate signal
    • G06F7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/60 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • G06F7/72 Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a base 32 operation circuit for number theory transformation multiplication. The circuit comprises 32 operand generation modules. Each of the 32 input data is padded with zeros in its high-order bits and divided into 11 words of 6 bits each, and the words are merged into operands: 1 module outputs 32 96-bit operands, 16 modules output 11 192-bit operands each, 3 modules output 16 192-bit operands each, and 12 modules output 12 192-bit operands each. Each operand generation module is connected to an operand modulo addition module, which performs modular addition of the operands output by that module. A modulo p module reduces the data output by each operand modulo addition module modulo the prime number $p = 2^{64} - 2^{32} + 1$ and outputs the result. By merging, the invention reduces the number of operands from 1024 in the prior art to 400, which greatly reduces the computational cost and improves the computational efficiency of the base 32 operation.

Description

Base 32 operation circuit for number theory transformation multiplication
Technical Field
The present invention relates to an arithmetic circuit, and more particularly, to a base 32 arithmetic circuit for multiplication by number-theory transformation.
Background
Besides traditional long multiplication, large integer multiplication can also be performed with the Schönhage-Strassen algorithm. The core idea of the Schönhage-Strassen algorithm is as follows: perform an FFT over a ring on each of the two large integers of length n, converting them into frequency-domain representations; multiply the two frequency-domain representations pointwise to obtain the frequency-domain representation of the product; then perform an IFFT over the ring on that representation to obtain the product. Using a number theory transformation (NTT) instead of a discrete Fourier transform replaces floating-point arithmetic with modular arithmetic and thereby avoids rounding errors. Number theory transformation multiplication refers specifically to multiplication that uses a number theory transformation within the Schönhage-Strassen algorithm. The number theory transformation and the inverse number theory transformation are the computational core of this multiplication and account for more than 90% of the operations and run time of NTT multiplication, so optimizing the speed, area and power consumption of the number theory transformation has a critical influence on the overall performance of NTT multiplication.
A 1048576-point number theory transformation can be decomposed into 4 levels of base 32 operation units plus twiddle factor multiplications (32^4 = 1048576). The twiddle factors can be computed in advance, stored in ROM, and read directly when needed. The base 32 operation accounts for more than 90% of the computation of the number theory transformation, so it is critical to optimizing the number theory transformation.
Xie Xing et al., "FPGA design and implementation of a large integer multiplier", Journal of Electronics & Information Technology, 2019, describes a hardware architecture for a large integer multiplier based on the Schönhage-Strassen algorithm. The paper decomposes the 65536-point number theory transformation into 64-point and 1024-point transformations, and the 1024-point transformation uses a structure of 2 levels of base 32 operations connected in series. The base 32 operation includes 32 shift units and a tree-structured large-number summation processing unit. Because of the zero-padding scheme adopted in the paper, each tree-structured summation processing unit must process 32 operands of 192 bits, so the whole base 32 operation must process 32 × 32 = 1024 operands. This base 32 operation circuit is not efficient enough, so the implemented circuit requires relatively large power consumption and resources.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a base 32 operation circuit for number theory transformation multiplication, which solves the problems of high power consumption and high resource cost of the base 32 operation circuit.
The technical scheme of the invention is as follows: a base 32 operation circuit for number theory transformation multiplication, comprising:
32 operand generation modules, numbered Xk, k = 0, 1, 2, ..., 31, each comprising a dividing circuit, a merging circuit and a zero filling circuit, wherein the dividing circuit pads each of the 32 input data with zeros in its high-order bits and divides it into 11 words of 6 bits each, the divided input data being denoted $x_{n,m}$, $0 \le n < 32$, $0 \le m < 11$; the merging circuit forms the output operands from the 32 × 11 words of the divided input data, wherein, among the merging circuits of the 32 operand generation modules, 1 outputs 32 96-bit operands, 16 output 11 192-bit operands each, 3 output 16 192-bit operands each, and 12 output 12 192-bit operands each; and the zero filling circuit fills with 0 the vacant positions in the operands output by the merging circuit;
an operand modulo addition module for performing modular addition of the operands output by each operand generation module;
and
a modulo p module for reducing the data output by each operand modulo addition module modulo a prime number p and outputting the result, wherein $p = 2^{64} - 2^{32} + 1$.
Further, the operand generation module that outputs 32 96-bit operands is numbered X0; the low-order 11 words of each 96-bit operand are the input data, and the high-order 5 words are zero.
Further, the operand generation modules that output 11 192-bit operands are numbered Xk with k odd; each operand $OP_m$ is formed by merging the 32 words $x_{n,m}$, $0 \le n < 32$, which come from the 32 different input data and share the same word index m, $0 \le m < 11$; the position of the least significant bit of $x_{n,m}$ in $OP_m$ is given by $6 \times (m + nk) \pmod{192}$.
Further, the operand generation modules that output 16 192-bit operands are numbered X8, X16 and X24; the 16 operands are divided into 8 groups of 2 operands each, OP0 and OP1 forming one group, OP2 and OP3 forming one group, and so on; the operands $OP_{2j}$ and $OP_{2j+1}$ of each group are formed by merging the 44 words $x_{n,m}$, $4j \le n \le 4j+3$, $0 \le m < 11$; the position of the least significant bit of $x_{n,m}$ in $OP_{2j}$ or $OP_{2j+1}$ is given by $6 \times (m + nk) \pmod{192}$; $x_{n,m}$ is placed preferentially in $OP_{2j}$, and if the corresponding position in $OP_{2j}$ is already occupied, it is placed at the corresponding position in $OP_{2j+1}$.
Further, the operand generation modules that output 12 192-bit operands are numbered Xk with k even, excluding X0, X8, X16 and X24; the 12 operands are divided into 2 groups, OP0 to OP5 forming one group and OP6 to OP11 forming the other; the operands $OP_{6j}$ to $OP_{6j+5}$ of each group are formed by merging the 176 words $x_{n,m}$, $16j \le n \le 16j+15$, $0 \le m < 11$; the position of the least significant bit of $x_{n,m}$ in $OP_{6j}$ to $OP_{6j+5}$ is given by $6 \times (m + nk) \pmod{192}$; the words $x_{n,m}$ are merged into the operands with a period of 2 words and are placed preferentially in the operand with the smaller index among $OP_{6j}$ to $OP_{6j+5}$.
The technical scheme provided by the invention has the following advantage:
By exploiting the vacant "zero-filled" positions left after the operands are shifted, the operands of the base 32 operation in number theory transformation multiplication are merged, reducing their number from 1024 in the prior art to 400, which greatly reduces the computational cost and improves the computational efficiency of the base 32 operation.
Drawings
Fig. 1 is a schematic diagram of the general structure of a base 32 arithmetic circuit for number-theory transform multiplication according to the present invention.
Fig. 2 is a schematic diagram of how the dividing circuit in an operand generation module zero-pads and divides the input data.
FIG. 3 is a schematic diagram of the dividing circuit in an operand generation module.
FIG. 4 is a schematic diagram of output data obtained by the merging circuit of the X0 operand generation module.
FIG. 5 is a schematic diagram of a merging circuit of an X0 operand generation module.
FIG. 6 is a diagram of the merged operands of the merging circuit of the X1 operand generation module.
FIG. 7 is a merging circuit of operand number 0 OP0 in the X1 operand generation module.
FIG. 8 is a diagram of the merged operands of the merging circuit of the X3 operand generation module.
FIG. 9 is a diagram of the merged operands of the merging circuit of the X16 operand generation module.
FIG. 10 is a diagram of the merged operands of the merging circuit of the X2 operand generation module.
FIG. 11 is a circuit schematic of a 32-operand modulo addition module.
FIG. 12 is a circuit schematic of an 11-operand modulo addition module.
FIG. 13 is a circuit schematic of a 16-operand modulo addition module.
FIG. 14 is a circuit schematic of a 12-operand modulo addition module.
Detailed Description
The present invention is further described below with reference to embodiments. These embodiments merely illustrate the invention and do not limit its scope; after reading this disclosure, modifications of equivalent form made by those skilled in the art all fall within the scope defined by the appended claims.
The base 32 operation is defined as

$$X_k = \sum_{n=0}^{31} x_n W_{32}^{nk} \bmod p$$

where $0 \le k < 32$, p is a prime number, and $W_{32}$ is a 32nd root of unity.
Here p is the Solinas prime $p = 2^{64} - 2^{32} + 1$. This prime supports efficient modular operations: $2^{192} \bmod p = 1$, $2^{96} \bmod p = -1$, $2^{64} \bmod p = 2^{32} - 1$. With this prime the root of unity is $W_{32} = 2^{6}$; because it is a power of 2, the multiply-add operations can be converted into shift and modular addition operations, which reduces the computational complexity of the number theory transformation. Thus, the base 32 operation can be written as

$$X_k = \sum_{n=0}^{31} x_n 2^{6nk} \bmod p$$
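The identities quoted above can be checked numerically. The following is a minimal sketch in Python; the function names `order`, `radix32_direct` and `radix32_shift` are illustrative and do not come from the patent.

```python
# Solinas prime used throughout the document.
P = 2**64 - 2**32 + 1

# Identities quoted in the text.
assert pow(2, 192, P) == 1
assert pow(2, 96, P) == P - 1        # i.e. -1 mod p
assert pow(2, 64, P) == 2**32 - 1

W32 = 2**6  # the 32nd root of unity named in the text


def order(g, p):
    """Smallest e > 0 with g**e == 1 (mod p), found by brute force."""
    e, acc = 1, g % p
    while acc != 1:
        acc = acc * g % p
        e += 1
    return e


def radix32_direct(x, k):
    """X_k = sum_n x[n] * W32**(n*k) mod p, using multiplications."""
    return sum(xn * pow(W32, n * k, P) for n, xn in enumerate(x)) % P


def radix32_shift(x, k):
    """Same value using shifts only; exponents wrap because 2**192 = 1 mod p."""
    return sum(xn << ((6 * n * k) % 192) for n, xn in enumerate(x)) % P
```

Because 2 has multiplicative order 192 modulo p, every shift amount can be reduced modulo 192, which is what later bounds the operand width at 192 bits.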
Each $x_n$ is divided into 11 words in units of 6 bits, denoted $x_{n,m}$, $0 \le m < 11$, so that $x_n$ can be expressed as

$$x_n = \sum_{m=0}^{10} x_{n,m} 2^{6m}$$

where m is the word index, $x_n$ is 64 bits wide, each $x_{n,m}$ is 6 bits wide, and $x_{n,10}$ has only 4 valid data bits. After the input data are divided, the base 32 operation can be written as the following formula; the zero-filled positions of the shifted operands can then be used to merge operands, which reduces the number of operands of the modular addition.

$$X_k = \sum_{n=0}^{31} \sum_{m=0}^{10} x_{n,m} 2^{6(m+nk)} \bmod p$$
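As an illustration of the word decomposition, the sketch below splits each 64-bit input into 11 six-bit words and evaluates the double sum, reducing each shift amount modulo 192 (valid because $2^{192} \bmod p = 1$). The helper names `split_words`, `radix32_words` and `radix32_reference` are assumptions made for this example.

```python
P = 2**64 - 2**32 + 1


def split_words(x):
    """Split a 64-bit value into 11 six-bit words, least significant first.

    The top word only carries 4 valid bits, as noted in the text.
    """
    assert 0 <= x < 2**64
    return [(x >> (6 * m)) & 0x3F for m in range(11)]


def radix32_words(xs, k):
    """X_k from the double sum over words x[n][m] shifted by 6*(m + n*k)."""
    total = 0
    for n, x in enumerate(xs):
        for m, w in enumerate(split_words(x)):
            total += w << ((6 * (m + n * k)) % 192)
    return total % P


def radix32_reference(xs, k):
    """X_k computed directly from the undivided inputs."""
    return sum(x * pow(2, 6 * n * k, P) for n, x in enumerate(xs)) % P
```

Each shifted word occupies one of 32 aligned 6-bit fields of a 192-bit operand, which is what makes the merging described next possible.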
Referring to Fig. 1, the base 32 operation circuit for number theory transformation multiplication of this embodiment comprises 32 operand generation modules, operand modulo addition modules and modulo p modules. According to the number of input operands, the operand modulo addition modules are of four kinds: a 32-operand modulo addition module, an 11-operand modulo addition module, a 16-operand modulo addition module and a 12-operand modulo addition module. In the circuit structure, the 32 input 64-bit data serve as the input of every operand generation module; each operand generation module is connected to an operand modulo addition module, which in turn is connected to a modulo p module.
The operand generation module comprises a dividing circuit, a merging circuit and a zero filling circuit, which in turn divide the 32 64-bit data, merge the resulting words and fill the vacant positions with zeros to form the operands. Referring to Figs. 2 and 3, the dividing circuit pads each 64-bit input datum $x_n$ with two 0 bits above its most significant bit to form 66-bit data, which is then divided into 11 words of 6 bits each; because the top 2 bits are zero padding, the 11th word contains only 4 bits of valid data. Data division is easy to implement in hardware, with almost no hardware overhead.
The operand generation modules are numbered Xk, k = 0, 1, 2, ..., 31. The merging circuit differs from module to module, but the modules can be divided into 4 groups by type, and the circuits within each group are similar.
Group one: X0, 1 module in total; group two: Xk with k odd, such as X1, X3 and X5, 16 modules in total; group three: X8, X16 and X24, 3 modules in total; group four: Xk with k even, excluding groups one and three, such as X2, X4 and X6, 12 modules in total.
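The grouping and the resulting operand count can be tallied with a short sketch. The function name `group_of` is hypothetical; the per-group operand counts are the ones stated in this document.

```python
def group_of(k):
    """Return (group number, operands output) for module X_k, 0 <= k < 32."""
    if k == 0:
        return 1, 32   # group one: 32 operands of 96 bits
    if k % 2 == 1:
        return 2, 11   # group two: 11 operands of 192 bits
    if k % 8 == 0:
        return 3, 16   # group three: 16 operands of 192 bits
    return 4, 12       # group four: 12 operands of 192 bits


groups = [group_of(k) for k in range(32)]
modules_per_group = {g: sum(1 for gg, _ in groups if gg == g) for g in (1, 2, 3, 4)}
total_operands = sum(count for _, count in groups)
```

The total is 32 + 16 × 11 + 3 × 16 + 12 × 12 = 400 operands, compared with 32 × 32 = 1024 when every module sums 32 separate operands.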
The data merging operation of each group is explained below:
Group one: the merging circuit of the X0 operand generation module.
The operands are simply the aligned input data; in other words, each operand is formed from the 11 consecutive words of one input datum output by the dividing circuit. The merging circuit outputs 32 96-bit operands. Each 96-bit operand consists of 16 words: the low-order 11 words are the input data and the high-order 5 words are zero. As shown in Fig. 4, operand $OP_n$ has 96 bits and is obtained by placing $x_n$ in the low 66 bits and filling the high 30 bits with zeros. The merging circuit is shown in Fig. 5.
Group two: the merging circuits of the odd-numbered operand generation modules, such as X1, X3 and X5.
For the merging circuit of an Xk operand generation module with k odd, the input is the 32 64-bit input data and the output is 11 192-bit operands. Each operand $OP_m$ is formed by merging the 32 words $x_{n,m}$, $0 \le n < 32$, which come from the 32 different input data and share the same word index m, $0 \le m < 11$. The position of the least significant bit of $x_{n,m}$ in $OP_m$ is given by $6 \times (m + nk) \pmod{192}$. The operands output by X1 and X3 are taken as examples:
The operands merged by the merging circuit of the X1 operand generation module are shown in Fig. 6. There are 11 operands, each formed from the 32 words $x_{n,m}$, $0 \le n < 32$, with the same word index m, $0 \le m < 11$. The least significant bit of $x_{0,0}$ is at position $6 \times (0 + 0 \times 1) \pmod{192} = 0$ in OP0, and that of $x_{1,0}$ is at position $6 \times (0 + 1 \times 1) \pmod{192} = 6$ in OP0; the least significant bit of $x_{0,1}$ is at position $6 \times (1 + 0 \times 1) \pmod{192} = 6$ in OP1, and that of $x_{31,1}$ is at position $6 \times (1 + 31 \times 1) \pmod{192} = 0$ in OP1. The merging circuit for operand 0 (OP0) in the X1 operand generation module is shown in Fig. 7.
The operands merged by the merging circuit of the X3 operand generation module are shown in Fig. 8. The least significant bit of $x_{0,0}$ is at position $6 \times (0 + 0 \times 3) \pmod{192} = 0$ in OP0, and that of $x_{1,0}$ is at position $6 \times (0 + 1 \times 3) \pmod{192} = 18$ in OP0; the least significant bit of $x_{0,1}$ is at position $6 \times (1 + 0 \times 3) \pmod{192} = 6$ in OP1, and that of $x_{31,1}$ is at position $6 \times (1 + 31 \times 3) \pmod{192} = 180$ in OP1.
The operands output by the merging circuits of the remaining operand generation modules follow by analogy.
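The group two placement rule can be sketched in software as follows. This is an illustrative model rather than the circuit of Figs. 6 to 8; `merge_odd` and `split_words` are names chosen for this example. For odd k the map from n to nk mod 32 is a bijection, so the 32 words sharing a word index m land in 32 distinct 6-bit fields and never collide.

```python
P = 2**64 - 2**32 + 1


def split_words(x):
    """Split a 64-bit value into 11 six-bit words, least significant first."""
    return [(x >> (6 * m)) & 0x3F for m in range(11)]


def merge_odd(xs, k):
    """Build the 11 operands OP_0..OP_10 of module X_k for odd k."""
    assert k % 2 == 1 and len(xs) == 32
    ops = [0] * 11
    used = [0] * 11  # occupancy mask per operand
    for n, x in enumerate(xs):
        for m, w in enumerate(split_words(x)):
            pos = (6 * (m + n * k)) % 192   # bit position of the word's LSB
            field = 0x3F << pos
            assert used[m] & field == 0     # each field is written only once
            used[m] |= field
            ops[m] |= w << pos
    return ops
```

Summing the 11 operands modulo p gives the same $X_k$ as summing 32 shifted inputs, which is the source of the saving for this group.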
Group three: the merging circuits of the X8, X16 and X24 operand generation modules.
The input is the 32 64-bit input data and the output is 16 192-bit operands. The 16 operands are divided into 8 groups of 2 operands each: OP0 and OP1 form one group, OP2 and OP3 form one group, and so on. The operands $OP_{2j}$ and $OP_{2j+1}$ of each group are formed by merging the 44 words $x_{n,m}$, $4j \le n \le 4j+3$, $0 \le m < 11$, that is, the words of 4 input data. The position of the least significant bit of $x_{n,m}$ in $OP_{2j}$ or $OP_{2j+1}$ is given by $6 \times (m + nk) \pmod{192}$. Each $x_{n,m}$ is placed preferentially in $OP_{2j}$; if the corresponding position in $OP_{2j}$ is already occupied, it is placed at the corresponding position in $OP_{2j+1}$. All remaining positions are filled with "0". Taking the output of the merging circuit of the X16 operand generation module as an example, as shown in Fig. 9, there are 8 groups of operands, each containing 2 merged operands. Each new 192-bit operand consists of 32 words and is built from 2 different input data, each of which provides 11 consecutive words. Within the 192 bits, the upper 30 bits and the 30 bits between the two runs of 11 consecutive words are filled with 0.
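The "prefer $OP_{2j}$, otherwise $OP_{2j+1}$" rule can be modeled as below. This is a sketch under the assumption that words are processed in order of increasing n and m; the patent figures may fix the order differently. `merge_group3` and `split_words` are illustrative names.

```python
P = 2**64 - 2**32 + 1


def split_words(x):
    """Split a 64-bit value into 11 six-bit words, least significant first."""
    return [(x >> (6 * m)) & 0x3F for m in range(11)]


def merge_group3(xs, k):
    """16 operands for k in {8, 16, 24}; inputs 4j..4j+3 share OP_2j and OP_2j+1."""
    assert k in (8, 16, 24) and len(xs) == 32
    ops = [0] * 16
    used = [0] * 16  # occupancy mask per operand
    for n, x in enumerate(xs):
        j = n // 4
        for m, w in enumerate(split_words(x)):
            pos = (6 * (m + n * k)) % 192
            field = 0x3F << pos
            # Prefer OP_2j; fall back to OP_2j+1 when that field is taken.
            t = 2 * j if used[2 * j] & field == 0 else 2 * j + 1
            assert used[t] & field == 0  # two operands are always enough here
            used[t] |= field
            ops[t] |= w << pos
    return ops
```

For k = 16 this reproduces the layout described above: each operand holds one input in bits 0 to 65 and another starting at bit 96, with zeros elsewhere.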
Group four: the merging circuits of the even-numbered operand generation modules other than those of groups one and three.
For the merging circuit of an Xk operand generation module with k even and not equal to 0, 8, 16 or 24, the input is the 32 64-bit input data and the output is 12 192-bit operands. The 12 operands are divided into 2 groups of 6 operands each: OP0 to OP5 form one group and OP6 to OP11 form the other. The operands $OP_{6j}$ to $OP_{6j+5}$ of each group are formed by merging the 176 words $x_{n,m}$, $16j \le n \le 16j+15$, $0 \le m < 11$, that is, the words of 16 input data. The position of the least significant bit of $x_{n,m}$ in $OP_{6j}$ to $OP_{6j+5}$ is given by $6 \times (m + nk) \pmod{192}$. The words $x_{n,m}$ are merged into the operands with a period of 2 words and are placed preferentially in the operand with the smaller index among $OP_{6j}$ to $OP_{6j+5}$. All remaining positions are filled with "0". Taking the output of the merging circuit of the X2 operand generation module as an example, as shown in Fig. 10, there are 2 groups of operands, each containing 6 merged operands. The first group comprises OP0 to OP5; the second group comprises OP6 to OP11. Each new 192-bit operand consists of 32 words, which come from 16 different input data, each providing 2 consecutive words.
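A quick count suggests why 6 operands per group are enough in group four. The sketch below tallies, for each of the 32 six-bit fields of a 192-bit operand, how many words $x_{n,m}$ of one half of the inputs are mapped onto it; it uses the fact that $6 \times (m + nk) \pmod{192}$ equals 6 times $(m + nk) \bmod 32$. The name `slot_load` is hypothetical.

```python
from collections import Counter


def slot_load(k, n_values):
    """Count how many words x[n][m] land on each 6-bit field (0..31)."""
    load = Counter()
    for n in n_values:
        for m in range(11):
            load[(m + n * k) % 32] += 1
    return load


# Group four: even k that is not a multiple of 8.
GROUP4 = [k for k in range(2, 32, 2) if k % 8 != 0]

# Worst-case number of words competing for one field, per module and half.
worst = {
    k: max(max(slot_load(k, range(16 * j, 16 * j + 16)).values()) for j in (0, 1))
    for k in GROUP4
}
```

No field receives more than 6 of the 176 words in a half, so placing each word in the lowest-indexed operand whose field is still free never needs a seventh operand.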
Because the different groups of operand generation modules output different numbers of operands, the operand modulo addition modules comprise a 32-operand modulo addition module, an 11-operand modulo addition module, a 16-operand modulo addition module and a 12-operand modulo addition module.
The 32-operand modulo addition module is shown in Fig. 11, where CSA denotes a carry-save adder, CPA denotes a carry-propagate adder, and "<<1" denotes shifting the carry output of a carry-save adder left by 1 bit. Of the 32 operands, those at positions 4i, i = 1, 2, ..., 8, are held back, and the remaining operands are input to the first-layer CSAs in groups of three. For each first-layer CSA, its carry output shifted left by 1 bit, its sum output and the operand at the corresponding position 4i are input to a second-layer CSA. For every two second-layer CSAs, their sum outputs and the carry output of one of them shifted left by 1 bit are input to a third-layer CSA. The carry output of the third-layer CSA shifted left by 1 bit, the sum output of the third-layer CSA and the carry output of the other of the two second-layer CSAs shifted left by 1 bit are input to a fourth-layer CSA. For every two fourth-layer CSAs, their sum outputs and the carry output of one of them shifted left by 1 bit are input to a fifth-layer CSA. The carry output of the fifth-layer CSA shifted left by 1 bit, the sum output of the fifth-layer CSA and the carry output of the other of the two fourth-layer CSAs shifted left by 1 bit are input to a sixth-layer CSA. The sixth layer has two CSAs in total; the carry output of the second CSA shifted left by 1 bit, the sum output of the second CSA and the sum output of the first CSA are input to the seventh-layer CSA (1 in total). The carry output of the seventh-layer CSA shifted left by 1 bit, the sum output of the seventh-layer CSA and the carry output of the first sixth-layer CSA shifted left by 1 bit are input to the eighth-layer CSA. The carry output of the eighth-layer CSA shifted left by 1 bit and its sum output are input to the CPA, and the result is input to the modulo addition module. The modulo addition module takes 193-bit input data and adds the 193rd bit to the low 192 bits; because $2^{192} \bmod p = 1$, the output is congruent to the input modulo the prime number p.
The 11-operand modulo addition module is shown in Fig. 12, where CSA denotes a carry-save adder, CPA denotes a carry-propagate adder, and "ROL 1-bit" denotes rotating the carry output of a carry-save adder left by 1 bit. Of the 11 operands, operands 1, 2, 3, operands 5, 6, 7 and operands 9, 10, 11 are input to the three first-layer CSAs respectively. The sum output of the first first-layer CSA, operand 4 and the carry output of the second first-layer CSA rotated left by 1 bit are input to the first second-layer CSA. Operand 8 and the carry output of the third first-layer CSA rotated left by 1 bit are input to the second second-layer CSA. The carry output of the first first-layer CSA rotated left by 1 bit and the carry output of the first second-layer CSA rotated left by 1 bit are input to the first third-layer CSA. The sum output of the second first-layer CSA and the carry output of the second second-layer CSA rotated left by 1 bit are input to the second third-layer CSA. The sum output of the first third-layer CSA, the carry output of the second third-layer CSA rotated left by 1 bit and the sum output of the second third-layer CSA are input to the fourth-layer CSA. The carry output of the first third-layer CSA rotated left by 1 bit, the carry output of the fourth-layer CSA rotated left by 1 bit and the sum output of the fourth-layer CSA are input to the fifth-layer CSA. The carry output of the fifth-layer CSA rotated left by 1 bit and its sum output are input to the CPA, and the result is input to the modulo addition module.
The modulo addition module takes 193-bit input data and adds the 193rd bit to the low 192 bits; because $2^{192} \bmod p = 1$, the output is congruent to the input modulo the prime number p.
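The role of the rotated carry and of the final 193-bit fold can be illustrated with a small software model. This sketch reduces the operands with a simple chain of carry-save adders rather than the exact wiring of Fig. 12, and the names `rol1`, `csa`, `fold193` and `modular_sum` are assumptions made for the example. Rotating a 192-bit carry left by 1 bit equals doubling it modulo $2^{192} - 1$, and p divides $2^{192} - 1$, so residues modulo p are preserved.

```python
MASK = (1 << 192) - 1
P = 2**64 - 2**32 + 1


def rol1(v):
    """Rotate a 192-bit value left by one bit."""
    return ((v << 1) & MASK) | (v >> 191)


def csa(a, b, c):
    """3:2 carry-save compression: a + b + c == s + 2 * carry."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def fold193(v):
    """Add bit 192 back into the low 192 bits (uses 2**192 = 1 mod p)."""
    assert 0 <= v < 1 << 193
    return (v & MASK) + (v >> 192)


def modular_sum(operands):
    """Reduce two or more 192-bit operands with CSAs, then one final addition."""
    vals = list(operands)
    while len(vals) > 2:
        a, b, c = vals.pop(), vals.pop(), vals.pop()
        s, carry = csa(a, b, c)
        vals += [s, rol1(carry)]
    # The final carry-propagate addition yields at most 193 bits.
    return fold193(vals[0] + vals[1])
```

The result is congruent to the plain sum of the operands modulo p but is not yet fully reduced; that last step is left to the modulo p module.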
The 16-operand modulo addition module is shown in Fig. 13, where CSA denotes a carry-save adder, CPA denotes a carry-propagate adder, and "<<1" denotes shifting the carry output of a carry-save adder left by 1 bit. Of the 16 operands, those at positions 4i, i = 1, 2, 3, 4, are held back, and the remaining operands are input to the first-layer CSAs in groups of three. For each first-layer CSA, its carry output shifted left by 1 bit, its sum output and the operand at the corresponding position 4i are input to a second-layer CSA. For every two second-layer CSAs, their sum outputs and the carry output of one of them shifted left by 1 bit are input to a third-layer CSA. The carry output of the third-layer CSA shifted left by 1 bit, the sum output of the third-layer CSA and the carry output of the other of the two second-layer CSAs shifted left by 1 bit are input to a fourth-layer CSA. The fourth layer has two CSAs in total; the carry output of the second CSA shifted left by 1 bit, the sum output of the second CSA and the sum output of the first CSA are input to the fifth-layer CSA (1 in total). The carry output of the fifth-layer CSA shifted left by 1 bit, the sum output of the fifth-layer CSA and the carry output of the first fourth-layer CSA shifted left by 1 bit are input to the sixth-layer CSA. The carry output of the sixth-layer CSA shifted left by 1 bit and its sum output are input to the CPA, and the result is input to the modulo addition module. The modulo addition module takes 193-bit input data and adds the 193rd bit to the low 192 bits; because $2^{192} \bmod p = 1$, the output is congruent to the input modulo the prime number p.
The 12-operand modulo addition module is shown in Fig. 14, where CSA denotes a carry-save adder, CPA denotes a carry-propagate adder, and "ROL 1-bit" denotes rotating the carry output of a carry-save adder left by 1 bit. The 12 operands are input to the first-layer CSAs in groups of three. For every two first-layer CSAs, their sum outputs and the carry output of one of them rotated left by 1 bit are input to a second-layer CSA. The carry output of the second-layer CSA rotated left by 1 bit, the sum output of the second-layer CSA and the carry output of the other of the two first-layer CSAs rotated left by 1 bit are input to a third-layer CSA. The third layer has two CSAs in total; the carry output of the second CSA rotated left by 1 bit, the sum output of the second CSA and the sum output of the first CSA are input to the fourth-layer CSA (1 in total). The carry output of the fourth-layer CSA rotated left by 1 bit, the sum output of the fourth-layer CSA and the carry output of the first third-layer CSA rotated left by 1 bit are input to the fifth-layer CSA. The carry output of the fifth-layer CSA rotated left by 1 bit and its sum output are input to the CPA, and the result is input to the modulo addition module. The modulo addition module takes 193-bit input data and adds the 193rd bit to the low 192 bits; because $2^{192} \bmod p = 1$, the output is congruent to the input modulo the prime number p.
The modulo-p module performs modulo-p on the input data.
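The document does not describe the internals of the modulo p module. Purely as an illustration of how the identities $2^{64} \bmod p = 2^{32} - 1$ and $2^{96} \bmod p = -1$ could be used, the sketch below reduces a value below $2^{192}$ by combining its six 32-bit limbs with additions and subtractions only; `reduce_mod_p` is a hypothetical name and this is not claimed to be the patented circuit.

```python
P = 2**64 - 2**32 + 1


def reduce_mod_p(v):
    """Reduce 0 <= v < 2**192 modulo p using limb weights derived from the text.

    Weights of 2**(32*i) mod p for i = 0..5:
      1, 2**32, 2**32 - 1, -1, -2**32, 1 - 2**32
    """
    assert 0 <= v < 1 << 192
    l = [(v >> (32 * i)) & 0xFFFFFFFF for i in range(6)]
    lo = l[0] - l[2] - l[3] + l[5]   # coefficient of 2**0
    hi = l[1] + l[2] - l[4] - l[5]   # coefficient of 2**32
    r = lo + hi * 2**32              # small signed value, |r| < 2**67
    # Final correction; hardware would use a few conditional adds/subtracts of p.
    return r % P
```

After the limb combination the intermediate value spans only a few multiples of p, so the remaining correction is cheap compared with a general division.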

Claims (5)

1. A base 32 operation circuit for number theory transformation multiplication, characterized by comprising: 32 operand generation modules, numbered Xk, k = 0, 1, 2, ..., 31, each comprising a dividing circuit, a merging circuit and a zero filling circuit, wherein the dividing circuit pads each of the 32 input data with zeros in its high-order bits and divides it into 11 words of 6 bits each, the divided input data being denoted $x_{n,m}$, $0 \le n < 32$, $0 \le m < 11$; the merging circuit forms the output operands from the 32 × 11 words of the divided input data, wherein, among the merging circuits of the 32 operand generation modules, 1 outputs 32 96-bit operands, 16 output 11 192-bit operands each, 3 output 16 192-bit operands each, and 12 output 12 192-bit operands each; and the zero filling circuit fills with 0 the vacant positions in the operands output by the merging circuit;
an operand modulo addition module for performing modular addition of the operands output by each operand generation module;
and
a modulo p module for reducing the data output by each operand modulo addition module modulo a prime number p and outputting the result, wherein $p = 2^{64} - 2^{32} + 1$.
2. The base 32 operation circuit for number theoretic transform multiplication according to claim 1, wherein the operand generation module that outputs 32 96-bit operands is numbered X0; the last 11 words of each 96-bit operand are input data, and the first 5 words are set to zero.
3. The base 32 operation circuit for number theoretic transform multiplication according to claim 1, wherein the operand generation modules that output 11 192-bit operands are numbered Xk with k odd; each operand $OP_m$ is formed by combining 32 words $x_{n,m}$, $0 \le n < 32$, taken from the 32 different input data and sharing the same word index m, $0 \le m < 11$; and the position of the lowest bit of $x_{n,m}$ within $OP_m$ is given by $6 \times (m + nk) \pmod{192}$.
4. The base 32 operation circuit for number theoretic transform multiplication according to claim 1, wherein the operand generation modules that output 16 192-bit operands are numbered X8, X16 and X24; the 16 operands are divided into 8 groups of 2 operands each, OP0 and OP1 forming one group, OP2 and OP3 forming one group, and so on; the operands $OP_{2j}$ and $OP_{2j+1}$ within each group are formed by combining 44 different words $x_{n,m}$, $4j \le n \le 4j+3$, $0 \le m < 11$; the position of the lowest bit of $x_{n,m}$ within $OP_{2j}$ and $OP_{2j+1}$ is given by $6 \times (m + nk) \pmod{192}$; and $x_{n,m}$ is placed preferentially in $OP_{2j}$ or, if the corresponding position in $OP_{2j}$ is already occupied, in the corresponding position in $OP_{2j+1}$.
5. The base 32 operation circuit for number theoretic transform multiplication according to claim 1, wherein the operand generation modules that output 12 192-bit operands are numbered Xk with k even, excluding X0, X8, X16 and X24; the 12 operands are divided into 2 groups, OP0 to OP5 forming one group and OP6 to OP11 forming the other; the operands $OP_{6j}$ to $OP_{6j+5}$ within each group are formed by combining 176 different words $x_{n,m}$, $16j \le n \le 16j+15$, $0 \le m < 11$; the position of the lowest bit of $x_{n,m}$ within $OP_{6j}$ to $OP_{6j+5}$ is given by $6 \times (m + nk) \pmod{192}$; and the words $x_{n,m}$ are merged into the operands with a period of 2 words, each word being placed preferentially in the operand with the smallest index among $OP_{6j}$ to $OP_{6j+5}$.
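The word splitting of claim 1 and the placement rule of claim 3 can be illustrated with a short software model. This is a sketch under stated assumptions, not part of the claims: it assumes 64-bit input data, assumes the quantity being computed is $X_k = \sum_n x_n \cdot 2^{6nk} \bmod p$ (which is consistent with the position formula but is not restated in the claims), and uses the hypothetical names `split_words` and `operands_odd_k`.

```python
P = (1 << 64) - (1 << 32) + 1
W = 192

def split_words(x):
    """Zero-pad a 64-bit input to 66 bits and split it into 11 six-bit words."""
    assert 0 <= x < (1 << 64)
    return [(x >> (6 * m)) & 0x3F for m in range(11)]

def operands_odd_k(inputs, k):
    """Build the 11 operands OP_m of module Xk for odd k (claim 3 rule)."""
    assert len(inputs) == 32 and k % 2 == 1
    ops = [0] * 11
    used = [0] * 11                       # occupancy mask for each operand
    for n, x in enumerate(inputs):
        for m, word in enumerate(split_words(x)):
            pos = (6 * (m + n * k)) % W   # lowest-bit position of x[n][m]
            slot = 0x3F << pos
            assert used[m] & slot == 0    # for odd k no two words collide
            used[m] |= slot
            ops[m] |= word << pos
    return ops
```

For odd k the products nk cover every residue modulo 32 exactly once, so the 32 words with a given index m fill the 32 six-bit slots of $OP_m$ without overlap. That is why 11 operands suffice for these modules, while even k needs the grouping rules of claims 4 and 5.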

Priority Applications (1)

Application Number Publication Priority Date Filing Date Title
CN202010371312.9A CN111694541B (en) 2020-05-06 2020-05-06 Base 32 operation circuit for number theory transformation multiplication

Publications (2)

Publication Number Publication Date
CN111694541A (en) 2020-09-22
CN111694541B (en) 2023-04-21

Family

ID=72476909

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN202010371312.9A Active CN111694541B (en) 2020-05-06 2020-05-06 Base 32 operation circuit for number theory transformation multiplication

Country Status (1)

Country Link
CN (1) CN111694541B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870438A (en) * 2014-02-25 2014-06-18 复旦大学 Circuit structure using number theoretic transform for calculating cyclic convolution
CN106027227A (en) * 2016-07-01 2016-10-12 浙江工业大学 Fermat number number-theoretic transform and SAFER (Secure And Fast Encryption Routine) cipher algorithm combined block encryption method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8028015B2 (en) * 2007-08-10 2011-09-27 Inside Contactless S.A. Method and system for large number multiplication
US8549264B2 (en) * 2009-12-22 2013-10-01 Intel Corporation Add instructions to add three source operands
US20130332707A1 (en) * 2012-06-07 2013-12-12 Intel Corporation Speed up big-number multiplication using single instruction multiple data (simd) architectures
CN102866875B (en) * 2012-10-05 2016-03-02 刘杰 Multioperand adder
EP3610382A4 (en) * 2017-04-11 2021-03-24 The Governing Council of the University of Toronto A homomorphic processing unit (hpu) for accelerating secure computations under homomorphic encryption
US10853034B2 (en) * 2018-03-30 2020-12-01 Intel Corporation Common factor mass multiplication circuitry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Hua Siliang, Zhang Huiguo, Liu Yushen, Xu Jian, Bian Jiuhui, Zhang Jingya
Inventor before: Hua Siliang, Zhang Huiguo, Liu Yushen, Xu Jian, Bian Jiuhui, Zhang Jingya
GR01 Patent grant