CN115658005A - High-precision low-delay large integer division accelerating device based on redundancy - Google Patents

Publication number
CN115658005A
Authority
CN
China
Prior art keywords
module
bit
rsd
redundancy
multiplier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211237664.0A
Other languages
Chinese (zh)
Inventor
王中风
张容蓉
朱丹阳
田静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202211237664.0A priority Critical patent/CN115658005A/en
Publication of CN115658005A publication Critical patent/CN115658005A/en
Pending legal-status Critical Current

Landscapes

  • Error Detection And Correction (AREA)

Abstract

The invention provides a redundancy-based high-precision low-delay large integer division accelerating device comprising an RSD preprocessing module, a subtraction-like encoding module, an RSD multiplier and a truncation module. The RSD preprocessing module normalizes the input redundant number to meet the requirement of the algorithm; the subtraction-like encoding module uses a simple encoding to realize a large-number subtraction quickly; the RSD multiplier computes the product of two redundant numbers quickly; and the truncation module truncates the high-order half of the RSD multiplier result so that the RSD multiplier can be reused. The underlying adder used throughout the device is a redundant adder, which shortens the delay and greatly reduces the total operation time.

Description

High-precision low-delay large integer division accelerating device based on redundancy
Technical Field
The invention relates to a large integer division device and a modular reduction device with a non-fixed modulus in the technical field of cryptography, and in particular to a redundancy-based high-precision low-delay large integer division accelerating device.
Background
With the continuous development of computer technology, problems related to network information security keep emerging. At present, the core means of ensuring network security is to encrypt network data. In cryptography and related fields, many mature algorithms have been proposed and published for different application requirements, such as elliptic curve cryptography (ECC) and RSA in the field of public-key cryptography, which are widely used in blockchain, security chips and other technologies to enhance security. To reach the required security strength, these encryption and decryption techniques need addition, subtraction, multiplication, division, modular reduction and similar operations at word lengths such as 512 bits or 1024 bits. For example, the RSA algorithm, which relies on the difficulty of factoring large integers, obtains a modulus n as the product of two large primes p and q, and uses a public exponent e and a private exponent d to encrypt and decrypt, respectively, in arithmetic modulo n.
Currently, verifiable delay functions (VDFs) are widely applied and rapidly developing in many decentralized systems. The core property of a VDF is that its evaluation must run a specified number of sequential steps, while verification is fast. If an attacker can compute significantly faster than an ordinary user, applications of the VDF are at risk; therefore, to ensure the security of such applications, fast VDF implementations need to be made public. To this end, many studies have optimized algorithms and architectures, for example accelerating the squaring operation in VDF evaluation from the algorithmic side (reference: ZHU D, SONG Y, TIAN J, et al. An Efficient Accelerator of the Squaring for the Verifiable Delay Function Over a Class Group [C] // 2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 2020). However, at the hardware level, both these schemes and the modular operations mentioned above involve division. Large-number division is currently complex to implement, and its long critical path and many iteration cycles make it slow, which becomes an important factor limiting algorithm efficiency.
In view of the above, the hardware implementation of large-number division can be further optimized for most modern cryptosystems.
Existing division schemes fall roughly into three types:
The first type: digit-recurrence algorithms. The SRT division algorithm is a widely used representative of this class, and various improved schemes have been proposed for different application environments. Since each iteration yields only one bit of the quotient, convergence is very slow for operands of large bit width.
The second type: division based on Newton iteration. The Newton iteration method, also called the Newton-Raphson algorithm, is the earliest iterative algorithm in use and has a wide range of applications. The division N/D is converted into first computing the reciprocal of the divisor, 1/D, to the required precision, and then multiplying it by the dividend N to obtain the final result N*(1/D). The reciprocal is obtained by finding the zero of the function
$f(x) = 1/x - D$
which, with quadratic convergence, gives the iteration formula $x_{i+1} = x_i (2 - D x_i)$ for the required reciprocal. To ensure convergence, the input must be preprocessed to lie in (0.5, 1]. In addition, an approximation algorithm (reference: LUNGLMAYR M, PLODER O. Fast approximation reactions for iterative algorithms. arXiv preprint arXiv:2007.06241, 2020) can provide a suitable initial value, namely $x_0 = 3 - 2D$, to reduce the number of iterations. The algorithm is as follows:
Algorithm 1: Newton-Raphson algorithm
Input: D ∈ (0.5, 1]
Initialize: x_0 ← 3 - 2*D
for i = 0 to k do
    a ← D * x_i
    x_{i+1} ← x_i * (2 - a)
end for
Return x_{i+1} → 1/D
Output: 1/D ∈ [1, 2)
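A minimal Python sketch of Algorithm 1 is given below for illustration. It uses exact rational arithmetic (`fractions.Fraction`) rather than the redundant hardware datapath, and the function name `newton_reciprocal` and its `iterations` parameter are assumptions introduced here, not names from the original.

```python
from fractions import Fraction


def newton_reciprocal(d, iterations):
    """Approximate 1/d for d in (0.5, 1] using x <- x * (2 - d * x)."""
    d = Fraction(d)
    if not Fraction(1, 2) < d <= 1:
        raise ValueError("d must be normalized into (0.5, 1]")
    x = 3 - 2 * d  # initial value x_0 from Algorithm 1
    for _ in range(iterations):
        a = d * x        # first multiplication
        x = x * (2 - a)  # second multiplication, which has to wait for a
    return x


d = Fraction(3, 4)
approx = newton_reciprocal(d, 4)
error = abs(approx - 1 / d)  # exact, because Fraction arithmetic is exact
```

The two multiplications inside the loop are data-dependent, which is the serial bottleneck that the Goldschmidt variant described later avoids.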
Compared with digit-recurrence algorithms, this algorithm needs only about $\log_2 W$ iterations plus some additional operations to complete a division (W is the divisor bit width; the same applies below), so it is suitable for low-delay designs with large operands. However, each iteration involves two multiplications, so hardware resources and operating speed depend on the multiplier.
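The logarithmic iteration count follows from quadratic convergence. A short derivation is sketched here, using the relative error $e_i = 1 - D x_i$, a symbol introduced for this illustration only:

```latex
\begin{aligned}
e_{i+1} &= 1 - D x_{i+1} = 1 - D x_i (2 - D x_i) = (1 - D x_i)^2 = e_i^2, \\
e_0 &= 1 - D(3 - 2D) = (1 - D)(1 - 2D), \qquad |e_0| \le \tfrac{1}{8} \quad \text{for } D \in (0.5, 1], \\
|e_k| &= |e_0|^{2^k} \le 2^{-3 \cdot 2^k}.
\end{aligned}
```

The number of correct bits therefore at least doubles per iteration, and reaching W-bit accuracy takes on the order of $\log_2 W$ iterations.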
The third type: division based on the Goldschmidt algorithm. It relies on the Taylor expansion of the function 1/(1+x) at 0: $1/(1+x) = 1 - x + x^2 - x^3 + x^4 - x^5 + \dots = (1-x)(1+x^2)(1+x^4)(1+x^8)\cdots$. The input divisor D is preprocessed to lie in (0.5, 1], and writing $D = 1 + x$ (so that $|x| < 1$) gives:
$N/D = N/(1+x) = N(1-x)(1+x^2)(1+x^4)(1+x^8)\cdots$
With $D_0 = D = 1 + x$, $F_0 = 2 - D = 1 - x$ and $N_0 = N$, observe that
$D_1 = D_0 * F_0 = 1 - x^2$, $F_1 = 2 - D_1 = 1 + x^2$
$D_2 = D_1 * F_1 = 1 - x^4$, $F_2 = 2 - D_2 = 1 + x^4$
……
$D_i = D_{i-1} * F_{i-1} = 1 - x^{2^i}$, $F_i = 2 - D_i = 1 + x^{2^i}$
So the denominator can be written as $D_{i+1} = D_i * (2 - D_i) = D_i * F_i$ and the numerator as $N_{i+1} = N_i * (2 - D_i) = N_i * F_i$. As i increases, the denominator $D_i$ tends to 1 and the numerator $N_i$ tends to N/D, the final quotient. The algorithm is as follows:
Algorithm 2: Goldschmidt algorithm
Input: dividend N, divisor D ∈ (0.5, 1]
Initialize: D_0 ← D, F_0 ← 2 - D, N_0 ← N
for i = 0 to k do
    D_{i+1} ← D_i * F_i
    N_{i+1} ← N_i * F_i
    F_{i+1} ← 2 - D_{i+1}
end for
Return N_{i+1} → N/D
Output: N/D
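Algorithm 2 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: exact rationals stand in for the truncated redundant numbers of the device, and the name `goldschmidt_divide` and its parameters are hypothetical.

```python
from fractions import Fraction


def goldschmidt_divide(n, d, iterations):
    """Approximate n/d for d in (0.5, 1] by driving the denominator to 1."""
    n, d = Fraction(n), Fraction(d)
    if not Fraction(1, 2) < d <= 1:
        raise ValueError("d must be normalized into (0.5, 1]")
    f = 2 - d  # F_0
    for _ in range(iterations):
        d = d * f  # D_{i+1} = D_i * F_i
        n = n * f  # N_{i+1} = N_i * F_i, independent of the line above
        f = 2 - d  # F_{i+1} = 2 - D_{i+1}
    return n


numerator, divisor = Fraction(5, 7), Fraction(3, 4)
quotient = goldschmidt_divide(numerator, divisor, 5)
quotient_error = abs(quotient - numerator / divisor)
```

Both multiplications in the loop body read the same factor `f` and do not depend on each other, which is why they can be mapped onto two multipliers working in parallel.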
This algorithm is similar to the Newton iteration method in that division is converted into two multiplications per iteration, and the number of iterations is approximately $\log_2 W$. Compared with the Newton iteration, its advantages are that the final quotient is computed directly and that the two multiplications are independent and can run in parallel, which is more hardware-friendly. Its disadvantage is that an error introduced during iteration is not corrected by later iterations, so it depends heavily on the accuracy of the intermediate operations.
Disclosure of Invention
Purpose of the invention: existing large-number division devices based on functional iteration involve multiplication and subtraction in each iteration, which are in turn reduced to additions and shifts; since the delay of addition grows with the input bit width, the overall computation is slow and the running time long. To solve these problems, the invention provides a redundancy-based high-precision low-delay large integer division accelerating device in which the data format adopts the redundant signed digit (RSD) representation, and which comprises an RSD preprocessing module, a first data selector, a second data selector, a subtraction-like encoding module, a first RSD multiplier module, a second RSD multiplier module, a first truncation module, a second truncation module, a first register, a second register and a shift register;
the RSD preprocessing module is used for preprocessing the input divisor to obtain data $D_{norm}$ that meets the requirement of the functional iteration algorithm (such as the Goldschmidt algorithm), i.e. $D_{norm}$ is greater than 0 and less than 1, together with the most significant bit $n_D$; $D_{norm}$ is output to the first data selector and $n_D$ is output to the shift register;
the first data selector is used for selecting, according to the control signal, between the output $D_{norm}$ of the RSD preprocessing module and the data stored in the first register to obtain the iteration divisor (the data stored in the first register is the result of the first truncation module; the output of the preprocessing module is the initial input and the data stored in the first register is the input during iteration, so the loop function is not affected), and outputting the iteration divisor to the subtraction-like encoding module and the first RSD multiplier module;
the second data selector is used for selecting, according to the control signal, between the input dividend and the data stored in the second register to obtain the iteration dividend (the data stored in the second register is the result of the second truncation module; the input dividend is the initial input and the data stored in the second register is the input during iteration, so the loop function is not affected), and outputting the iteration dividend to the second RSD multiplier module;
the subtraction-like encoding module is used for encoding the iteration divisor, in redundant form, as two minus the divisor to obtain the common multiplier, and outputting the common multiplier to the first RSD multiplier module and the second RSD multiplier module;
the first RSD multiplier module is used for realizing multiplication operation of a common multiplier and an iteration divisor under the W-bit redundancy number and outputting the multiplication operation to the first truncation module;
the second RSD multiplier module is used for realizing multiplication operation of a common multiplier and an iteration dividend under the W-bit redundancy number and outputting the multiplication operation to the second truncation module;
the first truncation module is used for truncating the high W bits of the 2W-bit redundancy output by the first RSD multiplier module to obtain a new W-bit redundancy and outputting the new W-bit redundancy to the first register;
the second truncation module is used for truncating the high W bits of the 2W-bit redundancy output by the second RSD multiplier module to obtain new W-bit redundancy and outputting the new W-bit redundancy to the second register and the shift register;
the input end of the first register is connected to the output end of the first truncation module, and the output end of the first register is connected to the input end of the first data selector;
the input end of the second register is connected to the output end of the second truncation module, and the output end of the second register is connected to the input end of the second data selector;
and the shift register is used for right shifting the result obtained by the second truncation module, and the right shift number is twice of the difference between the input bit width W and the most significant bit obtained by the RSD preprocessing module, so that the final quotient value is obtained.
The RSD preprocessing module comprises a precoding module, a coding mapping module, a most significant bit detection (LOD) module, a detection module, a data selector and an internal shift register;
the pre-coding module is used for pre-coding the input redundant number D using an existing coding scheme (reference: Peter Kornerup. Correcting the normalization shift of redundant binary representations. IEEE Transactions on Computers, 58(10):1435-1439, 2009) to obtain the detection tree parameters a, b, c and the LOD parameter F, outputting the detection tree parameters a, b, c to the detection module and the LOD parameter F to the most significant bit detection module;
the coding mapping module is used for mapping the input redundant number D into a redundant number that has the same actual value but whose redundant bits above the most significant bit are all 0, and then outputting it to the internal shift register.
The most significant bit detection module is used for finding the most significant bit position of its input with a most significant bit detector (LOD), and then outputting that position to the data selector.
The detection module is used for detecting a deviation with a simplified tree structure (reference: J. D. Bruguera and T. Lang, "Leading-one prediction with concurrent position correction," IEEE Transactions on Computers, vol. 48, no. 10, pp. 1083-1097, Oct. 1999); it outputs 1 if a deviation exists and 0 otherwise, and its output serves as the control input of the data selector.
The data selector is used for selecting, according to the 0/1 result of the detection module, between the output of the most significant bit detection module and that output minus one, to obtain the correct most significant bit $n_D$ as the final output, which is then output to the internal shift register.
The internal shift register is used for left-shifting the result of the coding mapping module to obtain the final normalized result; the left-shift amount is the most significant bit obtained by the data selector.
The subtraction-like encoding module is used for processing an input number I of W redundant bits. Each redundant bit is stored as two bits, so the bit-level data format of I is written in terms of the 1st and 2nd bits of the 1st redundant bit of I, the 1st and 2nd bits of the 2nd redundant bit of I, and so on. The output is still a W-bit redundant number, denoted O, written in the same way in terms of the 1st and 2nd bits of the 1st redundant bit of O, the 1st and 2nd bits of the 2nd redundant bit of O, and so on.
[Formula images in the original publication: the bit-level formats of I and O and the five encoding expressions for the bits of O.]
The first truncation module and the second truncation module have the same function and are used for fast high-order truncation, i.e. quickly and correctly truncating a 2W-redundant-bit input whose actual value fits within W-1 bits to W redundant bits without changing the actual value. The truncation rule is as follows: the low W-1 redundant bits of the input are unchanged, and the new W-th redundant bit $n_{new} = \{n_{new}^+, n_{new}^-\}$ (one redundant bit consists of two bits) is determined only by the original W-th redundant bit, denoted $n = \{n^+, n^-\}$, and the original (W+1)-th redundant bit.
[Formula images in the original publication: the notation for the (W+1)-th redundant bit and the two expressions for $n_{new}^+$ and $n_{new}^-$.]
The first RSD multiplier module and the second RSD multiplier module have the same structure, each comprising a partial product generation module (PPG) and an accumulator. The partial product generation module forms the partial products according to the multiplier digit $a_i$: when $a_i$ is $\bar{1}$, 0 or 1, the partial product is -B, 0 or B, respectively, where -B is obtained simply by swapping the corresponding odd and even bits (the two bits of each redundant bit) of the multiplicand B. The accumulator reduces the partial products with redundant adders (RSDA) in a tree structure and uses the first truncation module to handle the overflowing redundant bits.
The invention is suitable for the field of modern cryptography, in particular for parameter computation in verifiable delay functions (VDFs), which place high demands on delay, and further provides a quadratic-form-based acceleration method for VDF computation.
Beneficial effects: all data in the device use the redundant representation, so the delay of the underlying redundant adder is independent of the bit width and the critical path stays short even for complex division operations; the Goldschmidt algorithm needs few iteration cycles, so the total operation time of the whole device is greatly reduced.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a top level architecture diagram.
Fig. 2 is a schematic diagram of an RSD preprocessing module.
Fig. 3 is a schematic diagram of a code mapping module.
Fig. 4 is a schematic diagram of an RSD multiplier block.
FIG. 5 is a schematic diagram of an RSD adder.
Detailed Description
The device is a fast implementation of large-number division based on the quadratically convergent Goldschmidt algorithm. Focusing on the iterative part of the algorithm, the top-level architecture is given in FIG. 1. All data use the redundant signed digit (RSD) representation, i.e. a signed number is represented as the difference of two unsigned numbers:
$A = \sum_{i=0}^{W-1} (a_i^+ - a_i^-) 2^i$
where $a_i^+, a_i^- \in \{0, 1\}$ are bits and
$a_i = a_i^+ - a_i^- \in \{\bar{1}, 0, 1\}$
are redundant bits.
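To make the representation concrete, the following Python sketch stores an RSD number as two unsigned integers holding the positive and negative bit vectors. The helper names (`rsd_value`, `rsd_digits`, `rsd_negate`) and the example operands are assumptions chosen for illustration.

```python
def rsd_value(plus, minus):
    """Actual value of an RSD number stored as two unsigned bit vectors."""
    return plus - minus


def rsd_digits(plus, minus, width):
    """Redundant digits a_i = a_i^+ - a_i^- in {-1, 0, 1}, most significant first."""
    return [((plus >> i) & 1) - ((minus >> i) & 1) for i in reversed(range(width))]


def rsd_negate(plus, minus):
    """Negation only swaps the two bit vectors; no carry chain is involved."""
    return minus, plus


# Two different RSD encodings of the same value 3 (illustrative operands).
first = (0b000100, 0b000001)   # digits 0 0 0 1 0 -1
second = (0b000011, 0b000000)  # digits 0 0 0 0 1 1
```

The two encodings above show why the representation is called redundant: the same value has several digit patterns, which is what later complicates most significant bit detection and truncation.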
Based on the above representation, several key functional modules of the present invention will now be described:
(1) RSD preprocessing module
For the Goldschmidt algorithm, to meet the convergence requirement the input operand $D_0$ must be preprocessed to lie in (0.5, 1]. In ordinary binary form, a single most significant bit detector (LOD) suffices to find the most significant bit position $n_D$ of the divisor D; shifting D left by $W - n_D$ bits (W is the divisor bit width) yields an input $D_0$ whose highest bit is 1, as required.
Because of the particularity of the redundant representation, a number may be represented with more significant redundant bits than its actual value has. The original gives an example, as a formula image, of a redundant number whose actual value is 000011 (the LOD outputs for the two forms obviously differ), so the redundant representation would need to be converted to ordinary form first. However, that conversion involves a conventional subtraction and would lose its advantage in most systems, so an existing scheme for the normalization shift in floating-point subtraction (see: Peter Kornerup. Correcting the normalization shift of redundant binary representations. IEEE Transactions on Computers, 58(10):1435-1439, 2009) is adopted with some improvements, reducing the critical path from a W-bit conventional subtraction plus a most significant bit detection module to a single most significant bit detection module; see FIG. 2. It comprises:
1. A pre-coding module, which pre-codes the input positive redundant divisor D (the coding form is given in the reference above) to obtain the inputs of the detection module and the most significant bit detection module.
2. A coding mapping module, which recodes the input positive redundant divisor D so that the redundant bits above the most significant bit of the actual value are all 0 (the original gives an example as formula images). The coding architecture is shown in FIG. 3, and its output is the input of the final shift register.
3. A most significant bit detection (LOD) module, which applies an existing LOD to the result of the pre-coding module to find the most significant bit $n_t$.
4. A detection module. In the example above, the actual value is 000011 and the corresponding most significant bit should be 2, but the most significant bit detection module returns 3, so there is a one-bit offset. This module uses a simplified tree structure as in the literature (J. D. Bruguera and T. Lang, "Leading-one prediction with concurrent position correction," IEEE Transactions on Computers, vol. 48, no. 10, pp. 1083-1097, Oct. 1999) to detect this offset for correction, outputting 1 if there is an offset and 0 otherwise.
5. A two-to-one data selector, which uses the 0/1 result of the detection module to select between the outputs $n_t$ and $n_t - 1$ of the most significant bit detection module, giving the correct most significant bit n.
6. A shift register, which shifts according to the result of the data selector to obtain the final normalized redundant number $D_{norm}$.
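The one-bit offset that the detection module corrects can be reproduced with a small reference model. The Python sketch below is not the patent's pre-coding and detection tree: it deliberately uses the conventional subtraction that the device avoids, purely to show what the correct answer is. The function names and the operand (value 3 stored as digits 0 0 0 1 0 -1) are assumptions for illustration.

```python
def leading_nonzero_position(plus, minus):
    """1-based position of the highest nonzero redundant digit, as a plain LOD sees it."""
    return (plus ^ minus).bit_length()  # a digit is nonzero when its two bits differ


def true_msb_position(plus, minus):
    """1-based MSB position of the actual value (reference model, uses subtraction)."""
    value = plus - minus
    if value <= 0:
        raise ValueError("the divisor must be positive")
    return value.bit_length()


def normalize(plus, minus, width):
    """Left-shift by width - n_D so that the top bit of the result is 1."""
    n_d = true_msb_position(plus, minus)
    shifted = (plus - minus) << (width - n_d)
    return shifted, 0, n_d  # returned as an RSD pair with an empty negative vector


plus, minus = 0b000100, 0b000001  # value 3, leading digit one position too high
offset = leading_nonzero_position(plus, minus) - true_msb_position(plus, minus)
```

For this operand the plain LOD reports position 3 while the true most significant bit is at position 2, matching the one-bit correction selected by the data selector.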
(2) Subtraction-like encoding module
In Algorithm 2, each iteration involves the subtraction $2 - D_{i+1}$. In this device $D_{i+1}$ is a redundant number, so a redundant subtractor could be used directly for this step. However, since $D_{i+1}$ is confined to the normalized range, the subtractor can be replaced by a 4-bit encoder to reduce area. The encoding rule is as follows:
For an input number I of W redundant bits (in the data storage format), the output is still a W-bit redundant number, denoted O.
[Formula images in the original publication: the bit-level formats of I and O and the five encoding expressions for the bits of O.]
(3) Fast high-order truncation module
In the hardware design the multipliers are reused across iterations, so the data bit width must stay consistent and the multiplier output must be truncated. The trailing bits are precision bits and can be truncated directly as long as the algorithm remains correct. Because of the particularity of the redundant representation mentioned above, however, high-order redundant bits that correspond to zeros in the actual value cannot simply be cut off, so a dedicated truncation module is needed to process the high-order bits quickly.
For a redundant number whose actual value fits within W-1 bits, no matter how many effective redundant bits it has, truncating it to W bits needs only one simple operation: the low W-1 redundant bits are unchanged, and the new W-th redundant bit $\{n_{new}^+, n_{new}^-\}$ is determined only by the original W-th redundant bit, denoted $\{n^+, n^-\}$, and the original (W+1)-th redundant bit.
[Formula images in the original publication: the two expressions for $n_{new}^+$ and $n_{new}^-$.]
Specific examples are shown in Table 1:
TABLE 1
[Table 1 is given as an image in the original publication.]
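The property that the new W-th redundant bit depends only on the original W-th and (W+1)-th redundant bits can be checked with a digit-level model. The Python sketch below is not the patent's bit-level formula, which is not reproduced here; it is one rule consistent with the stated property, derived under the assumption that "fits within W-1 bits" means the absolute value is below 2^(W-1). Digits are listed least significant first, and all names are hypothetical.

```python
from itertools import product


def rsd_value(digits):
    """Value of an RSD digit list, least significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def truncate_high(digits, w):
    """Keep the low w-1 digits; rebuild digit w from the old digits w and w+1."""
    k = (digits[w - 1] + 2 * digits[w]) % 4  # high part modulo 4, in 0..3
    if k == 2:
        raise ValueError("the value does not fit within w-1 bits")
    return list(digits[:w - 1]) + [-1 if k == 3 else k]


def count_mismatches(w):
    """Exhaustively compare values before and after truncation for 2w-digit inputs."""
    bad = 0
    for digits in product((-1, 0, 1), repeat=2 * w):
        if abs(rsd_value(digits)) < (1 << (w - 1)):
            if rsd_value(truncate_high(digits, w)) != rsd_value(digits):
                bad += 1
    return bad
```

The rule works because the digits above position W-1 contribute a multiple of 2^(W-1) whose multiplier must be -1, 0 or 1 when the value is that small, and that multiplier is already fixed by its residue modulo 4.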
(4) RSD multiplier module
In circuit design, an ordinary multiplication consists of two main steps: 1) generating the partial products; 2) accumulating the partial products with a tree structure and a fast carry-propagate adder (CPA).
The first step is relatively simple, but the critical path of the second step grows with the bit width, giving a large computation delay. Since multiplication in the iteration of the invention always uses the redundant signed digit representation, the second step can be accelerated by the carry-free addition property, with no conversion to ordinary format. The critical path is further reduced where appropriate by pipelining and similar techniques; the resulting architecture is shown in FIG. 4. The RSD multiplier module mainly comprises:
1. A partial product generator (PPG), which forms the partial products according to the multiplier digit $a_i$: when $a_i$ is $\bar{1}$, 0 or 1, the partial product is -B, 0 or B, respectively, where -B is obtained by swapping the corresponding odd and even bits of B.
2. An accumulator, which reduces the partial products in a tree structure, accumulating them with redundant adders (RSD adder, RSDA); see FIG. 5.
During accumulation, the fast high-order truncation module of (3) is used to keep the number of redundant bits from growing unevenly.
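The two steps of the RSD multiplier can be modelled at digit level as follows. This Python sketch uses a textbook carry-free signed-digit addition rule, which is an assumption here and not necessarily the RSDA circuit of FIG. 5; it also omits the truncation step, so the result simply grows in length. Digits are listed least significant first and all function names are hypothetical.

```python
def rsd_value(digits):
    """Value of an RSD digit list, least significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def rsd_add(x, y):
    """Carry-free addition: each output digit depends on a fixed number of input digits."""
    n = max(len(x), len(y))
    x = list(x) + [0] * (n - len(x))
    y = list(y) + [0] * (n - len(y))
    interim = [0] * (n + 1)
    transfer = [0] * (n + 1)  # transfer[i] is the transfer entering position i
    for i in range(n):
        p = x[i] + y[i]
        # If both lower digits are non-negative, the incoming transfer is 0 or 1.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            t, w = 1, 0
        elif p == 1:
            t, w = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            t, w = 0, 0
        elif p == -1:
            t, w = (0, -1) if lower_nonneg else (-1, 1)
        else:
            t, w = -1, 0
        interim[i] = w
        transfer[i + 1] = t
    return [interim[i] + transfer[i] for i in range(n + 1)]


def partial_products(a, b):
    """Digit a_i in {-1, 0, 1} selects -B, 0 or B, shifted left by i positions."""
    rows = []
    for i, a_i in enumerate(a):
        # Negating B digit by digit corresponds to swapping the two bits in hardware.
        row = [a_i * d for d in b]
        rows.append([0] * i + row)
    return rows


def rsd_multiply(a, b):
    """Reduce the partial products pairwise, level by level, as in an adder tree."""
    rows = partial_products(a, b)
    while len(rows) > 1:
        merged = [rsd_add(rows[j], rows[j + 1]) for j in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:
            merged.append(rows[-1])
        rows = merged
    return rows[0]


a = [1, 0, -1, 1]   # value 1 - 4 + 8 = 5
b = [-1, 1, 1, 0]   # value -1 + 2 + 4 = 5
product_digits = rsd_multiply(a, b)
```

Because no transfer in `rsd_add` propagates further than one position, the depth of each adder level does not grow with the operand width, which is the property the device relies on for its short critical path.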
Examples
Take the large-number division involved in implementing a VDF algorithm as an example. Since several divisions in the algorithm share the same divisor, N in the architecture can simply be set to 1 to compute the reciprocal of the divisor, which is then multiplied by each dividend on the existing multiplier; each product is the corresponding quotient.
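The reuse of one reciprocal across several dividends can be sketched as below. Exact rationals again stand in for the hardware datapath, and the function name `goldschmidt`, the divisor and the dividends are illustrative assumptions.

```python
from fractions import Fraction


def goldschmidt(n, d, iterations):
    """Goldschmidt iteration for n/d with d already normalized into (0.5, 1]."""
    f = 2 - d
    for _ in range(iterations):
        d, n = d * f, n * f
        f = 2 - d
    return n


divisor = Fraction(5, 8)
reciprocal = goldschmidt(Fraction(1), divisor, 6)  # N = 1 gives an approximation of 1/D

# One extra multiplication per dividend replaces a full division.
dividends = [Fraction(3), Fraction(7, 2), Fraction(11)]
quotients = [n * reciprocal for n in dividends]
worst_error = max(abs(q - n / divisor) for q, n in zip(quotients, dividends))
```

The iteration cost is paid once for the shared divisor, and every further quotient costs a single multiplication.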
A 2048-bit/1024-bit division architecture was synthesized as an ASIC using the TSMC 28 nm CMOS process library; the results are shown in Table 2:
TABLE 2
[Table 2 is given as an image in the original publication.]
The results show that the critical path of the architecture is 1.05 ns and the delay of a single operation is 86.1 ns.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit, where the computer storage medium is capable of storing a computer program, and the computer program, when executed by the data processing unit, may run the inventive content of the redundancy-based high-precision low-latency large integer division acceleration apparatus and some or all of the steps in each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
It is obvious to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a computer program, i.e. a software product, which may be stored in a storage medium and includes several instructions that enable a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU) or a network device) to execute the methods of the embodiments, or of some parts of the embodiments, of the present invention.
The present invention provides a redundancy-based high-precision low-delay large integer division accelerating device, and there are many methods and ways to implement this technical solution. The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make improvements and refinements without departing from the principle of the invention, and these should also be regarded as falling within the protection scope of the invention, including but not limited to the specific type of redundant representation, the specific scheme for time-multiplexing the multiplier, and the specific functional-iteration algorithm. All components not specified in this embodiment can be realized with the prior art.

Claims (10)

1. A redundancy-based high-precision low-delay large integer division accelerating device, wherein the data format adopts the redundant signed digit (RSD) representation, characterized by comprising an RSD preprocessing module, a first data selector, a second data selector, a subtraction-like encoding module, a first RSD multiplier module, a second RSD multiplier module, a first truncation module, a second truncation module, a first register, a second register and a shift register;
the RSD preprocessing module is used for preprocessing the input divisor to obtain data $D_{norm}$ that meets the requirement of the functional iteration algorithm, i.e. $D_{norm}$ is greater than 0 and less than 1, together with the most significant bit $n_D$; $D_{norm}$ is output to the first data selector and $n_D$ is output to the shift register;
the first data selector is used for selecting, according to the control signal, between the output $D_{norm}$ of the RSD preprocessing module and the data stored in the first register to obtain the iteration divisor, and outputting the iteration divisor to the subtraction-like encoding module and the first RSD multiplier module;
the second data selector is used for selecting between the input dividend and the data stored in the second register according to the control signal to obtain the iteration dividend, and outputting the iteration dividend to the second RSD multiplier module;
the subtraction-like encoding module is used for encoding the iteration divisor, in redundant form, as two minus the divisor to obtain the common multiplier, and outputting the common multiplier to the first RSD multiplier module and the second RSD multiplier module;
the first RSD multiplier module is used for realizing multiplication operation of a common multiplier and an iteration divisor under the W-bit redundancy number and outputting the multiplication operation to the first truncation module;
the second RSD multiplier module is used for realizing multiplication operation of a common multiplier and an iteration dividend under the W-bit redundancy number and outputting the multiplication operation to the second truncation module;
the first truncation module is used for truncating the high W bits of the 2W-bit redundancy output by the first RSD multiplier module to obtain a new W-bit redundancy and outputting the new W-bit redundancy to the first register;
the second truncation module is used for truncating the high W bits of the 2W-bit redundancy number output by the second RSD multiplier module to obtain a new W-bit redundancy number and outputting the new W-bit redundancy number to the second register and the shift register;
the input end of the first register is connected to the output end of the first truncation module, and the output end of the first register is connected to the input end of the first data selector;
the input end of the second register is connected to the output end of the second truncation module, and the output end of the second register is connected to the input end of the second data selector;
and the shift register is used for right-shifting the result obtained by the second truncation module to obtain the final quotient, the right-shift amount being twice the difference between the input bit width W and the most significant bit obtained by the RSD preprocessing module.
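The data flow of claim 1, in which the divisor and the dividend are both repeatedly multiplied by a common multiplier derived from the current divisor, has the shape of a Goldschmidt-style functional iteration. The following is a minimal Python sketch of that loop under stated assumptions: exact fractions stand in for the RSD hardware, the common multiplier is taken to be 2 minus the iteration divisor, the dividend is scaled together with the divisor instead of being corrected by a final shift, and the function name and iteration count are hypothetical. The truncation modules are not modelled.

```python
from fractions import Fraction


def functional_iteration_divide(dividend: int, divisor: int, iterations: int = 6) -> Fraction:
    """Approximate dividend / divisor by functional iteration (illustrative only)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    # Role of the preprocessing module: scale the divisor into [1/2, 1)
    # using the position of its most significant bit.
    n_d = divisor.bit_length()
    d = Fraction(divisor, 1 << n_d)
    # Assumption of this sketch: the dividend is scaled by the same factor,
    # so the ratio q / d always equals dividend / divisor.
    q = Fraction(dividend, 1 << n_d)
    for _ in range(iterations):
        f = 2 - d   # common multiplier (role of the encoding module)
        d = d * f   # role of the first multiplier: d converges towards 1
        q = q * f   # role of the second multiplier: q converges towards the quotient
    return q
```

With d starting in [1/2, 1), the error 1 - d is squared on every pass, so six passes leave a relative error of at most 2^-64 in this model. The result approaches the true quotient from below, so an exact integer quotient would need a separate correction step that is not shown here.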
2. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 1, wherein the RSD preprocessing module comprises a pre-coding module, a code mapping module, a most significant bit detection module, a data selector and an internal shift register;
the pre-coding module is used for pre-coding the input redundant number D to obtain detection tree parameters a, b and c and an LOD parameter F, outputting the detection tree parameters a, b and c to the detection module, and outputting the LOD parameter F to the most significant bit detection module.
3. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 2, wherein the code mapping module is configured to map the input redundant number D to a redundant number that has the same actual value as the input but in which all bits before the most significant bit are 0, and then output the mapped number to the internal shift register.
4. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 3, wherein the most significant bit detection module is configured to find the most significant position of the input by using a leading-one detector (LOD), and then output the most significant position to the data selector.
5. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 4, wherein the detection module is configured to use a tree-type reduction structure to detect whether a deviation exists, outputting 1 if a deviation exists and 0 otherwise, and the output terminal of the detection module serves as the control terminal of the data selector.
6. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 5, wherein the data selector is configured to select, according to the 0/1 result of the detection module, between the output of the most significant bit detection module and that output minus one, to obtain the correct most significant bit n_D as the final output, which is output to the internal shift register.
7. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 6, wherein the internal shift register is configured to left-shift the result of the code mapping module to obtain the final normalized result, the left-shift amount being the most significant bit obtained by the data selector.
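Claims 2 to 7 describe locating the most significant bit of a redundant input and normalizing by a shift. In a signed-digit number the first nonzero digit can sit above the true most significant bit (for example, the digits 1, -1 carry the same value as 0, 1), which is one plausible reading of the "deviation" that the detection module corrects. The following Python sketch illustrates the effect under stated assumptions: digits are held MSB-first as a list of values in {-1, 0, 1}, all function names are hypothetical, and the check converts to an ordinary integer instead of using the tree-type structure of the claims.

```python
from fractions import Fraction


def rsd_value(digits):
    """Actual value of MSB-first signed digits, each in {-1, 0, 1}."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def leading_nonzero_position(digits):
    """Bit position (0 = least significant) of the first nonzero digit, or -1."""
    for i, d in enumerate(digits):
        if d != 0:
            return len(digits) - 1 - i
    return -1


def true_msb_position(digits):
    """Bit position of the most significant 1 of the (positive) actual value."""
    value = rsd_value(digits)
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    return value.bit_length() - 1


def needs_correction(digits):
    """1 if the leading nonzero digit overstates the true MSB position, else 0."""
    return int(leading_nonzero_position(digits) != true_msb_position(digits))


def normalise(digits):
    """Scale the actual value into [1/2, 1) using the true MSB position."""
    return Fraction(rsd_value(digits), 1 << (true_msb_position(digits) + 1))
```

For the digits [1, -1, 0, 1] (actual value 5) the first nonzero digit is at position 3 while the true most significant bit is at position 2, so `needs_correction` reports a deviation. How the pre-coding of the claims bounds the deviation to a single position is not reproduced here.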
8. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 7, wherein the similar subtraction encoding module is used to process an input I of W redundant bits, whose bit-level data format is written as
Figure FDA0003882910340000021
wherein
Figure FDA0003882910340000022
is the 1st bit of the 1st redundant bit of I,
Figure FDA0003882910340000023
is the 2nd bit of the 1st redundant bit of I,
Figure FDA0003882910340000024
is the 1st bit of the 2nd redundant bit of I, and
Figure FDA0003882910340000025
is the 2nd bit of the 2nd redundant bit of I; the output is still a W-bit redundant number, denoted O and expressed as:
Figure FDA0003882910340000026
wherein
Figure FDA0003882910340000031
Figure FDA0003882910340000032
Figure FDA0003882910340000033
Figure FDA0003882910340000034
Figure FDA0003882910340000035
wherein
Figure FDA0003882910340000036
is the 1st bit of the 1st redundant bit of O,
Figure FDA0003882910340000037
is the 2nd bit of the 1st redundant bit of O,
Figure FDA0003882910340000038
is the 1st bit of the 2nd redundant bit of O, and
Figure FDA0003882910340000039
is the 2nd bit of the 2nd redundant bit of O.
9. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 8, wherein the first truncation module and the second truncation module have the same function and are configured to implement fast high-order truncation, i.e. to truncate an input of 2W redundant bits to W redundant bits quickly and correctly without changing the actual value; the truncation rule is as follows: the last W-1 redundant bits of the input are unchanged, and the new W-th redundant bit n_new, written as
Figure FDA00038829103400000310
is determined only by the original W-th redundant bit and the original (W+1)-th redundant bit, which are denoted n{n+, n-} and
Figure FDA00038829103400000311
respectively; the formulas are as follows:
Figure FDA00038829103400000312
Figure FDA00038829103400000313
10. The redundancy-based high-precision low-delay large integer division accelerating device according to claim 9, wherein the first RSD multiplier module and the second RSD multiplier module have the same structure and both comprise a partial product generation module and an accumulator; the partial product generation module forms the partial products according to whether the multiplier bit a_i is
Figure FDA00038829103400000314
0 or 1, the partial products being -B, 0 and B respectively, wherein -B is obtained merely by exchanging the corresponding odd and even bits of the multiplicand B; the accumulator uses redundant adders in a tree structure to reduce the partial products, and uses the first truncation module to process overflowing redundant bits.
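The partial product rule of claim 10 can be illustrated with a short Python sketch. It assumes that each redundant bit is stored as a pair (plus, minus) whose value is plus - minus, in line with the {n+, n-} notation of claim 9, so that negating a number amounts to swapping the two bits of every pair. All function names are hypothetical, and the accumulation uses ordinary Python integers in place of the tree of redundant adders.

```python
def digit_value(pair):
    """Value of one redundant bit stored as (plus, minus)."""
    plus, minus = pair
    return plus - minus


def rsd_to_int(pairs):
    """Actual value of an MSB-first list of (plus, minus) pairs."""
    value = 0
    for pair in pairs:
        value = 2 * value + digit_value(pair)
    return value


def negate(pairs):
    """Form -B by swapping the two bits of every redundant bit."""
    return [(minus, plus) for plus, minus in pairs]


def partial_products(a_pairs, b_pairs):
    """One partial product (-B, 0 or B) per multiplier digit, MSB first."""
    zero = [(0, 0)] * len(b_pairs)
    products = []
    for pair in a_pairs:
        d = digit_value(pair)
        if d == 1:
            products.append(list(b_pairs))
        elif d == -1:
            products.append(negate(b_pairs))
        else:
            products.append(zero)
    return products


def multiply(a_pairs, b_pairs):
    """Accumulate the weighted partial products with plain integers."""
    total = 0
    for product in partial_products(a_pairs, b_pairs):
        total = 2 * total + rsd_to_int(product)
    return total
```

For example, with a = [(1, 0), (0, 1), (1, 0)] (value 3) and b = [(1, 0), (1, 0), (0, 1)] (value 5), the three partial products are B, -B and B, and their weighted sum is 15.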
CN202211237664.0A 2022-10-10 2022-10-10 High-precision low-delay large integer division accelerating device based on redundancy Pending CN115658005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211237664.0A CN115658005A (en) 2022-10-10 2022-10-10 High-precision low-delay large integer division accelerating device based on redundancy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211237664.0A CN115658005A (en) 2022-10-10 2022-10-10 High-precision low-delay large integer division accelerating device based on redundancy

Publications (1)

Publication Number Publication Date
CN115658005A true CN115658005A (en) 2023-01-31

Family

ID=84988064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211237664.0A Pending CN115658005A (en) 2022-10-10 2022-10-10 High-precision low-delay large integer division accelerating device based on redundancy

Country Status (1)

Country Link
CN (1) CN115658005A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117353898A (en) * 2023-12-04 2024-01-05 粤港澳大湾区数字经济研究院(福田) Fully homomorphic encryption method, system, terminal and medium for floating point number plaintext

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117353898A (en) * 2023-12-04 2024-01-05 粤港澳大湾区数字经济研究院(福田) Fully homomorphic encryption method, system, terminal and medium for floating point number plaintext
CN117353898B (en) * 2023-12-04 2024-03-26 粤港澳大湾区数字经济研究院(福田) Fully homomorphic encryption method, system, terminal and medium for floating point number plaintext

Similar Documents

Publication Publication Date Title
Omondi et al. Residue number systems: theory and implementation
JP4554239B2 (en) Montgomery type modular multiplication apparatus and method
EP0917047B1 (en) Apparatus for modular inversion for information security
US20040098440A1 (en) Multiplication of multi-precision numbers having a size of a power of two
Shieh et al. Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures
US8898215B2 (en) High-radix multiplier-divider
KR100756137B1 (en) Division and square root arithmetic unit
US8862651B2 (en) Method and apparatus for modulus reduction
US20090006512A1 (en) NORMAL-BASIS TO CANONICAL-BASIS TRANSFORMATION FOR BINARY GALOIS-FIELDS GF(2m)
CN113076083B (en) Data multiply-add operation circuit
CN115756386A (en) Efficient lightweight NTT multiplier circuit based on lattice code
Chen et al. Scalable and systolic dual basis multiplier over GF (2m)
Li et al. N-term Karatsuba algorithm and its application to multiplier designs for special trinomials
CN115658005A (en) High-precision low-delay large integer division accelerating device based on redundancy
US8577952B2 (en) Combined binary/decimal fixed-point multiplier and method
US20010025293A1 (en) Divider
Bruguera Composite iterative algorithm and architecture for q-th root calculation
Parihar et al. Fast Montgomery modular multiplier for rivest–shamir–adleman cryptosystem
Selianinau Computationally efficient approach to implementation of the Chinese Remainder Theorem algorithm in minimally redundant Residue Number System
Morita A fast modular-multiplication algorithm based on a higher radix
Selianinau Efficient implementation of Chinese remainder theorem in minimally redundant residue number system
Shen et al. Low complexity bit parallel multiplier for GF (2m) generated by equally-spaced trinomials
KR100946256B1 (en) Scalable Dual-Field Montgomery Multiplier On Dual Field Using Multi-Precision Carry Save Adder
Ercegovac et al. Design of a complex divider
Arunachalamani et al. High Radix Design for Montgomery Multiplier in FPGA platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination