CN115658005A

CN115658005A - High-precision low-delay large integer division accelerating device based on redundancy

Info

Publication number: CN115658005A
Application number: CN202211237664.0A
Authority: CN
Inventors: 王中风; 张容蓉; 朱丹阳; 田静
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2022-10-10
Filing date: 2022-10-10
Publication date: 2023-01-31

Abstract

The invention provides a redundancy-based high-precision low-delay large integer division accelerating device comprising an RSD preprocessing module, a subtraction-like encoding module, an RSD multiplier and a truncation module. The RSD preprocessing module normalizes the input redundant number to meet the requirements of the algorithm; the subtraction-like encoding module implements large-number subtraction quickly using a simple encoding; the RSD multiplier quickly computes the product of two redundant numbers; and the high-order truncation module truncates the high-order half of the RSD multiplier's result so that the RSD multiplier can be reused. The underlying adder used throughout the device is a redundant adder, which shortens the delay and greatly reduces the total operation time.

Description

High-precision low-delay large integer division accelerating device based on redundancy

Technical Field

The invention relates to large integer division devices and to modular reduction devices with a non-fixed modulus in the technical field of cryptography, and in particular to a redundancy-based high-precision low-delay large integer division accelerating device.

Background

With the continuous development of computer technology, problems related to network information security keep emerging. At present, the core operation for ensuring network security is the encryption of network data. In cryptography and related fields, many proven algorithms have been proposed and published for different application requirements, such as Elliptic Curve Cryptography (ECC) and RSA in public-key cryptography, which are widely used in blockchain, secure chips and other technologies to enhance security. To reach the required security strength, these encryption and decryption techniques require arithmetic operations such as addition, subtraction, multiplication, division and modular reduction at word lengths of 512 bits, 1024 bits and more. For example, in the RSA algorithm, which relies on the difficulty of factoring large integers, the modulus n is the product of two large primes p and q, and encryption and decryption are carried out modulo n using the public exponent e and the private exponent d respectively.

Currently, Verifiable Delay Functions (VDFs) are widely applied and rapidly developing in many decentralized systems. The core property of a VDF is that its evaluation must run a specified number of sequential steps, while verification is fast. If an attacker can compute significantly faster than ordinary users, applications of the VDF are at risk; therefore, to ensure the security of such applications, fast implementations of the VDF need to be made public. In view of this, many studies have optimized algorithms, architectures and the like, for example accelerating the squaring operation in VDF evaluation from the perspective of algorithm optimization (reference: ZHU D, SONG Y, TIAN J, et al. An Efficient Accelerator of the Squaring for the Verifiable Delay Function Over a Class Group [C]// 2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), 2020). However, at the hardware level, these schemes and the modular operations mentioned above all involve division. Large-number division is currently very complex to implement, and its long critical path and many iteration cycles make it slow, which has become an important factor limiting algorithm efficiency.

In view of the above, a hardware implementation of a large number division operation can be further optimized for most modern cryptosystems.

The existing division operation schemes are roughly divided into three types:

First: digit-recurrence algorithms. The SRT division algorithm is a widely used representative of this class, and improved variants have been proposed for different application environments. Because each iteration produces only one bit of the quotient, convergence is very slow for operands of large bit width.

Second: division based on Newton iteration. Newton's method, also called the Newton-Raphson algorithm, is the earliest iterative algorithm in use and has a wide range of application. The division N/D is converted into first computing the reciprocal of the divisor, 1/D, to the required precision and then multiplying it by the dividend N to obtain the final result N·(1/D). The reciprocal is obtained by finding the zero of the function $f(x) = 1/x - D$: with quadratic convergence, the iteration $x_{i+1} = x_i (2 - D x_i)$ yields the required reciprocal. To ensure convergence, the input must be preprocessed to lie in (0.5, 1]; an approximation algorithm (reference: LUNGLMAYR M, PLODER O. Fast approximation reactions for iterative algorithms [J]. arXiv preprint arXiv:2007.06241, 2020) can supply a suitable initial value, $x_0 = 3 - 2D$, to reduce the number of iterations. The algorithm is as follows:

Algorithm 1: Newton-Raphson algorithm

Input: D ∈ (0.5, 1]

    Initialize: x_0 ← 3 - 2*D
    for i = 0 to k do
        a ← D * x_i
        x_{i+1} ← x_i * (2 - a)
    end for
    Return x_{i+1} → 1/D

Output: 1/D ∈ [1, 2)

In contrast to digit-recurrence algorithms, this algorithm needs only about $\log_2 W$ iterations plus a few additional operations to complete the division (W is the bit width of the divisor, here and below), so it is suitable for low-delay designs on large numbers. Each iteration involves two multiplications, so the hardware resources and the operation speed are determined by the multiplier.
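As a minimal sketch of Algorithm 1 (not the hardware datapath), the iteration can be checked in Python with exact rational arithmetic. The function name `newton_reciprocal` and the use of `fractions.Fraction` are illustrative assumptions; a hardware implementation would use fixed-width operands instead.

```python
from fractions import Fraction


def newton_reciprocal(d: Fraction, iterations: int) -> Fraction:
    """Approximate 1/d for d in (0.5, 1] using x <- x * (2 - d * x)."""
    if not (Fraction(1, 2) < d <= 1):
        raise ValueError("d must be normalized into (0.5, 1]")
    x = 3 - 2 * d  # initial value x_0 = 3 - 2*D, as in Algorithm 1
    for _ in range(iterations):
        a = d * x        # first multiplication
        x = x * (2 - a)  # second multiplication, which depends on the first
    return x


if __name__ == "__main__":
    d = Fraction(3, 4)
    for k in range(5):
        # Residual 1 - D*x_k; it is squared by every iteration.
        print(k, float(abs(1 - d * newton_reciprocal(d, k))))
```

Writing $e_i = 1 - D x_i$, one step gives $e_{i+1} = e_i^2$, which is the quadratic convergence mentioned above. The two multiplications in each step are sequential, since the second needs the result of the first.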

Third: division based on the Goldschmidt algorithm. It is based on the Taylor expansion of the function 1/(1+x) at 0: $1/(1+x) = 1 - x + x^2 - x^3 + x^4 - x^5 + \dots = (1-x)(1+x^2)(1+x^4)(1+x^8)\cdots$. After the input divisor D is preprocessed to lie in (0.5, 1], setting $D = 1 + x$ (so that $|x| < 1$) gives the following:

For $D_0 = D = 1 + x$, $F_0 = 2 - D = 1 - x$, $N_0 = N$, observe that

$D_1 = D_0 F_0 = 1 - x^2$, $F_1 = 2 - D_1 = 1 + x^2$;

$D_2 = D_1 F_1 = 1 - x^4$, $F_2 = 2 - D_2 = 1 + x^4$;

…

So the denominator can be written as $D_{i+1} = D_i (2 - D_i) = D_i F_i$ and the numerator as $N_{i+1} = N_i (2 - D_i) = N_i F_i$. As i increases, the denominator in equation (1) tends to 1 and the numerator tends to N/D, the final quotient. The algorithm is as follows:

Algorithm 2: Goldschmidt algorithm

Input: D ∈ (0.5, 1]

    Initialize: D_0 ← D, F_0 ← 2 - D, N_0 ← N
    for i = 0 to k do
        D_{i+1} ← D_i * F_i
        N_{i+1} ← N_i * F_i
        F_{i+1} ← 2 - D_{i+1}
    end for
    Return N_{i+1} → N/D

Output: N/D

The algorithm is similar to Newton iteration: the division is converted into two multiplications per iteration, and the number of iterations is approximately $\log_2 W$. Compared with the Newton iteration algorithm, its advantages are that the final quotient is computed directly and that the two multiplications are independent and can run in parallel, which is more hardware-friendly. Its disadvantage is that an error introduced during iteration is not corrected by later iterations, so the method depends heavily on the accuracy of the intermediate operations.
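The behaviour of Algorithm 2 can be sketched the same way. This is an illustrative model in exact rational arithmetic, assuming a hypothetical helper name `goldschmidt_divide`; it returns both the numerator and the denominator so that the convergence of the denominator to 1 is visible.

```python
from fractions import Fraction


def goldschmidt_divide(n: Fraction, d: Fraction, iterations: int):
    """Run Goldschmidt iterations; returns (N_k, D_k) with N_k / D_k == n / d."""
    if not (Fraction(1, 2) < d <= 1):
        raise ValueError("d must be normalized into (0.5, 1]")
    f = 2 - d  # F_0 = 2 - D
    for _ in range(iterations):
        # The two products share the factor f but do not depend on each
        # other, so hardware can evaluate them in parallel.
        d = d * f
        n = n * f
        f = 2 - d
    return n, d


if __name__ == "__main__":
    n_k, d_k = goldschmidt_divide(Fraction(5), Fraction(3, 4), 3)
    print(float(n_k), float(d_k))  # numerator approaches 5 / 0.75, denominator approaches 1
```

Because both sequences are multiplied by the same factor, the ratio $N_i / D_i$ stays equal to $N/D$ in exact arithmetic; in a truncated datapath this invariant holds only approximately, which is why the accuracy of the intermediate products matters.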

Disclosure of Invention

Purpose of the invention: existing large-number division devices based on functional iteration involve multiplication and subtraction in each iteration. These operations can be converted into additions and shifts, but the delay of addition grows with the input bit width, so the overall computation is slow and the running time is long. To solve these problems, the invention provides a redundancy-based high-precision low-delay large integer division accelerating device whose data format adopts the redundant signed digit (RSD) representation. The device comprises an RSD preprocessing module, a first data selector, a second data selector, a subtraction-like encoding module, a first RSD multiplier module, a second RSD multiplier module, a first truncation module, a second truncation module, a first register, a second register and a shift register;

the RSD preprocessing module is used for preprocessing the input divisor to obtain data $D_{norm}$ of the type required by the functional iteration algorithm (such as the Goldschmidt algorithm), i.e. $D_{norm}$ is greater than 0 and less than 1, together with the most significant bit $n_D$; $D_{norm}$ is output to the first data selector, and $n_D$ is output to the shift register;

the first data selector is used for selecting, according to the control signal, between the output $D_{norm}$ of the RSD preprocessing module and the data stored in the first register to obtain the iteration divisor (the data stored in the first register is the result of the first truncation module; the output of the preprocessing module is the initial input and the data stored in the first register is the input during iteration, so the loop function is not affected), and outputting the iteration divisor to the subtraction-like encoding module and the first RSD multiplier module;

the second data selector is used for selecting, according to the control signal, between the input dividend and the data stored in the second register to obtain the iteration dividend (the data stored in the second register is the result of the second truncation module; the input dividend is the initial input and the data stored in the second register is the input during iteration, so the loop function is not affected), and outputting the iteration dividend to the second RSD multiplier module;

the subtraction-like encoding module is used for encoding the "two minus" operation on the iteration divisor in redundant form to obtain the common multiplier, and outputting the common multiplier to the first RSD multiplier module and the second RSD multiplier module;

the first RSD multiplier module is used for multiplying the common multiplier by the iteration divisor as W-bit redundant numbers and outputting the product to the first truncation module;

the second RSD multiplier module is used for multiplying the common multiplier by the iteration dividend as W-bit redundant numbers and outputting the product to the second truncation module;

the first truncation module is used for truncating the high W bits of the 2W-bit redundant number output by the first RSD multiplier module to obtain a new W-bit redundant number, and outputting it to the first register;

the second truncation module is used for truncating the high W bits of the 2W-bit redundant number output by the second RSD multiplier module to obtain a new W-bit redundant number, and outputting it to the second register and the shift register;

the input end of the first register is connected to the output end of the first truncation module, and the output end of the first register is connected to the input end of the first data selector;

the input end of the second register is connected to the output end of the second truncation module, and the output end of the second register is connected to the input end of the second data selector;

and the shift register is used for right-shifting the result of the second truncation module to obtain the final quotient, the right-shift amount being twice the difference between the input bit width W and the most significant bit obtained by the RSD preprocessing module.

The RSD preprocessing module comprises a precoding module, a coding mapping module, a most significant bit detection (LOD) module, a detection module, a data selector and an internal shift register;

the pre-coding module is used for pre-coding the input redundant number D with an existing coding scheme (reference: Peter Kornerup. Correcting the normalization shift of redundant binary representations. IEEE Transactions on Computers, 58(10):1435-1439, 2009) to obtain the detection tree parameters a, b, c and the LOD parameter F, outputting the detection tree parameters a, b, c to the detection module and the LOD parameter F to the most significant bit detection module;

the coding mapping module is used for mapping the input redundant number D to a redundant number that has the same actual value but whose redundant bits above the most significant bit are all 0, and then outputting it to the internal shift register.

The most significant bit detection module is used for finding the most significant position of the input by using a most significant bit detector LOD, and then outputting the most significant position to a data selector.

The detection module is used for detecting a deviation by means of a simplified tree structure (reference: J. D. Bruguera and T. Lang, "Leading-one prediction with concurrent position correction," IEEE Transactions on Computers, vol. 48, no. 10, pp. 1083-1097, Oct. 1999); it outputs 1 if there is a deviation and 0 otherwise, and its output serves as the control input of the data selector.

The data selector is used for selecting, according to the 0/1 result of the detection module, between the output of the most significant bit detection module and that output minus one, to obtain the correct most significant bit $n_D$ as the final output, which is then output to the internal shift register.

The internal shift register is used for left-shifting the result of the coding mapping module to obtain the final normalization result, the left-shift amount being given by the most significant bit obtained by the data selector.

The subtraction-like encoding module processes an input number I of W redundant bits, whose bit-level data format is written as

where

is the 1st bit of the 1st redundant bit of I,

is the 2nd bit of the 1st redundant bit of I,

is the 1st bit of the 2nd redundant bit of I,

is the 2nd bit of the 2nd redundant bit of I; the output is still a W-bit redundant number, denoted O and expressed as:

where,

where

is the 1st bit of the 1st redundant bit of O,

is the 2nd bit of the 1st redundant bit of O,

is the 1st bit of the 2nd redundant bit of O,

is the 2nd bit of the 2nd redundant bit of O.

The first truncation module and the second truncation module have the same function and are used for fast high-order truncation, i.e. quickly and correctly truncating an input of 2W redundant bits (whose actual value lies within W-1 bits) to W bits without changing the actual value. The truncation rule is as follows: the lowest W-1 redundant bits of the input are unchanged, and the new W-th redundant bit $n_{new}$, written as

(one redundant bit is composed of two bits), is determined only by the original W-th and (W+1)-th redundant bits, denoted respectively $n\{n^{+}, n^{-}\}$ and

The formula is as follows:

The first RSD multiplier module and the second RSD multiplier module have the same structure, each comprising a partial product generation module (PPG) and an accumulator. The partial product generation module forms the partial products according to the multiplier digit $a_i$: when $a_i$ is $\bar{1}$, 0, 1, the partial product is -B, 0, B respectively, where -B only requires exchanging the corresponding odd and even bits of the multiplicand B; the accumulator uses redundant adders (RSDA) in a tree structure to reduce the partial products and uses the first truncation module to handle overflowing redundant bits.

The invention is suitable for the field of modern cryptography, in particular for parameter computation in Verifiable Delay Function (VDF) algorithms, which place high demands on delay, and further provides a method for accelerating quadratic-form-based VDF computation.

Advantageous effects: the data format throughout the device adopts the redundant representation, so the delay of the underlying redundant adder is independent of the bit width and the critical path is short even for complex division; the Goldschmidt algorithm needs few iteration cycles; and the total operation time of the device is greatly reduced.

Drawings

The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.

FIG. 1 is a top level architecture diagram.

Fig. 2 is a schematic diagram of an RSD preprocessing module.

Fig. 3 is a schematic diagram of a code mapping module.

Fig. 4 is a schematic diagram of an RSD multiplier block.

FIG. 5 is a schematic diagram of an RSD adder.

Detailed Description

The device is a fast implementation of large-number division based on the quadratically convergent Goldschmidt algorithm and focuses on the iterative part of the algorithm; the top-level architecture is shown in FIG. 1. The overall data format is the redundant signed digit (RSD) representation, in which a signed number is represented as the difference of two unsigned numbers:

$A = A^{+} - A^{-} = \sum_{i} (a_i^{+} - a_i^{-})\, 2^{i}$

where $a_i^{+}, a_i^{-} \in \{0, 1\}$ are bits, and $a_i = a_i^{+} - a_i^{-} \in \{\bar{1}, 0, 1\}$ are redundant bits.
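To make the representation concrete, the sketch below evaluates an RSD number stored as two bit vectors and shows that one value can have several encodings. The helper names `rsd_value` and `rsd_negate` and the least-significant-first list layout are assumptions made for this illustration only.

```python
def rsd_value(plus, minus):
    """Value of an RSD number given its positive and negative bit vectors.

    Index 0 is the least significant position; redundant bit i equals
    plus[i] - minus[i] and therefore lies in {-1, 0, 1}.
    """
    return sum((p - m) * 2 ** i for i, (p, m) in enumerate(zip(plus, minus)))


def rsd_negate(plus, minus):
    """Negation only exchanges the two bit vectors; no arithmetic is needed."""
    return minus, plus


if __name__ == "__main__":
    # Two different encodings of the value 3 (written most significant first):
    #   0 0 0 0 1 1      and      0 0 0 1 0 -1
    a = ([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0])
    b = ([0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0])
    print(rsd_value(*a), rsd_value(*b), rsd_value(*rsd_negate(*b)))
```

The non-uniqueness shown here is what allows carry-free addition, and it is also why leading-bit detection and high-order truncation need special treatment in the modules described next.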

Based on the above representation, several key functional modules of the present invention will now be described:

(1) RSD preprocessing module

For the Goldschmidt algorithm, the input operand $D_0$ must be preprocessed to lie in (0.5, 1] in order to meet the convergence requirement. In ordinary (non-redundant) form, a single most significant bit detector (LOD) suffices to compute the most significant bit position $n_D$ of the divisor D (counted from the low-order, right-hand end); shifting the divisor D left by $W - n_D$ bits (W is the divisor bit width) then gives an input $D_0$ that meets the requirement, with its highest bit equal to 1.

Owing to the particularities of the redundant representation, a number with fewer significant bits may be represented using more significant redundant bits, e.g.

has the actual value 000011 (the two clearly give different LOD outputs), so the redundant representation would need to be converted into ordinary form. However, that conversion involves a conventional subtraction and would forfeit the advantage in most systems. Therefore an existing scheme for the normalization operation in floating-point subtraction (see: Peter Kornerup. Correcting the normalization shift of redundant binary representations. IEEE Transactions on Computers, 58(10):1435-1439, 2009) is adopted with certain improvements, reducing the critical path from one W-bit ordinary subtraction plus a most significant bit detection module to a single most significant bit detection module; see FIG. 2. The module comprises:

1. A pre-coding module, which pre-codes the input positive redundant divisor D (the coding form is given in the reference) to obtain the inputs of the detection module and the most significant bit detection module.

2. A coding mapping module, which recodes the input positive redundant divisor D so that the redundant bits above the most significant bit of its actual value are all 0, for example

is coded into

The coding architecture is shown in FIG. 3; its output is the input of the final shift register.

3. A most significant bit detection (LOD) module, which applies an existing LOD to the result of the pre-coding in 1) to find the most significant bit $n_t$.

4. A detection module. In the example above the actual value is 000011, so the most significant bit should be 2, but 3) returns 3, i.e. there is a one-bit deviation. This module uses a simplified tree structure as in the literature (J. D. Bruguera and T. Lang, "Leading-one prediction with concurrent position correction," IEEE Transactions on Computers, vol. 48, no. 10, pp. 1083-1097, Oct. 1999) to detect this deviation for correction, outputting 1 if there is a deviation and 0 otherwise.

5. A two-to-one data selector, which uses the 0/1 result of the detection module to select between the output $n_t$ of the most significant bit detection module and $n_t - 1$, giving the correct most significant bit $n_D$.

6. A shift register, which shifts according to the value output by the data selector to obtain the final normalized redundant number $D_{norm}$.
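The intended result of this preprocessing can be modelled behaviourally. The sketch below is not the pre-coding and detection-tree circuit from the cited papers: it simply converts the digits to an integer to find the true position, and is meant only to show what $n_D$ and $D_{norm}$ should be and why a leading-digit search on the raw redundant digits is not enough. The function name, the digit-list layout (least significant first) and the digit pattern used in the example are assumptions.

```python
from fractions import Fraction


def normalize_divisor(digits, width):
    """Behavioural reference for RSD normalization.

    digits: redundant bits in {-1, 0, 1}, least significant first.
    Returns (n_d, raw_lod, d_norm):
      n_d     true most significant bit position, counted from 1 at the right
      raw_lod position of the highest nonzero redundant bit (may be too large)
      d_norm  the divisor shifted left by width - n_d, read as a fraction
    """
    value = sum(d * 2 ** i for i, d in enumerate(digits))
    if value <= 0:
        raise ValueError("the divisor must be positive")
    n_d = value.bit_length()
    if n_d > width:
        raise ValueError("the divisor does not fit in the given width")
    raw_lod = max(i for i, d in enumerate(digits) if d != 0) + 1
    d_norm = Fraction(value << (width - n_d), 1 << width)
    return n_d, raw_lod, d_norm


if __name__ == "__main__":
    # 0 0 0 1 0 -1 (most significant first) has the actual value 000011.
    print(normalize_divisor([-1, 0, 1, 0, 0, 0], 6))
```

For this pattern the highest nonzero redundant bit sits at position 3 while the actual value has its leading one at position 2, which is the kind of deviation the detection module and data selector are there to correct.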

(2) Class subtraction coding module

In Algorithm 2, each iteration involves the subtraction $2 - D_{i+1}$. In this device $D_{i+1}$ is a redundant number, so a redundant subtractor could be used directly for this step. However, considering that the integer part of $D_{i+1}$ is 0, the subtractor can be replaced by a 4-bit encoder to reduce area. The encoding rule is as follows:

For an input number I of W redundant bits, i.e.

(data storage format), the output is still a W-bit redundant number, denoted O, i.e.

where,
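The encoder's exact rule is not reproduced in this text, so the following is only a hedged sketch of the general idea, under assumptions chosen for the illustration: the operand is a fixed-point RSD number with `frac_bits` fractional redundant bits followed by two integer-position redundant bits (weights 1 and 2), negation is a per-digit sign flip (a swap of the two bits of each redundant bit), and only the two integer-position digits are recoded to absorb the constant 2. The names `two_minus` and `scaled_value` are hypothetical.

```python
def scaled_value(digits):
    """Integer value of the digit list (least significant first), i.e. value * 2**frac_bits."""
    return sum(d * 2 ** i for i, d in enumerate(digits))


def two_minus(digits, frac_bits):
    """Return RSD digits of 2 - D without a full-width subtraction.

    digits has frac_bits fractional digits followed by two integer digits.
    """
    if len(digits) != frac_bits + 2:
        raise ValueError("expected frac_bits fractional digits plus two integer digits")
    negated = [-d for d in digits]  # per-digit negation (bit-pair swap in hardware)
    # Only the two integer-position digits are touched by the constant 2.
    top = 2 + negated[frac_bits] + 2 * negated[frac_bits + 1]
    if not -3 <= top <= 3:
        raise ValueError("2 - D does not fit in two integer digits")
    hi = 1 if top >= 2 else (-1 if top <= -2 else 0)
    lo = top - 2 * hi
    return negated[:frac_bits] + [lo, hi]


if __name__ == "__main__":
    # D = 0.75 with 4 fractional digits, written as 1.0(-1)00 (16 - 4 = 12 sixteenths).
    d = [0, 0, -1, 0, 1, 0]
    f = two_minus(d, 4)
    print(scaled_value(d), scaled_value(f))  # values scaled by 2**4
```

The design point this illustrates is that the fractional digits need no arithmetic at all, so the cost of forming the common multiplier is independent of W.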

(3) Fast high-order truncation module

In the hardware design the multipliers are reused across iterations, so the data bit width must stay consistent and the multiplier output must be truncated. The low-order bits are precision bits and can be truncated directly as long as the algorithm remains correct. However, because of the particularity of the redundant representation mentioned above, the redundant high-order part that corresponds to actual zero bits cannot simply be cut off, so a dedicated truncation module is designed to process the high-order bits quickly.

For a redundant number whose actual value lies within W-1 bits, no matter how many effective redundant bits it has, truncation to W bits requires only one simple operation: the lowest W-1 redundant bits are unchanged, and the new W-th redundant bit, written as

is determined only by the original W-th and (W+1)-th redundant bits, denoted respectively $n\{n^{+}, n^{-}\}$ and

The formula is as follows:

specific examples are shown in table 1:

TABLE 1
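As a hedged illustration of such a rule (derived here from the stated properties, and not taken from the formula or from Table 1), suppose the input value v satisfies $|v| < 2^{W-1}$. Then the new W-th redundant bit d must satisfy $d \equiv n_W + 2 n_{W+1} \pmod 4$ with $d \in \{-1, 0, 1\}$, so it depends only on the original W-th and (W+1)-th redundant bits. The function name and the least-significant-first list layout are assumptions.

```python
def truncate_high(digits, width):
    """Truncate an RSD digit list (least significant first) to `width` digits.

    Assumes the represented value v satisfies |v| < 2**(width - 1), so that
    all digits above position `width` cancel out and can be discarded.
    """
    if len(digits) <= width:
        raise ValueError("nothing to truncate")
    low = digits[: width - 1]                       # lowest width-1 digits are kept
    r = (digits[width - 1] + 2 * digits[width]) % 4  # only two digits are inspected
    if r == 2:
        raise ValueError("value does not satisfy |v| < 2**(width - 1)")
    new_top = -1 if r == 3 else r
    return low + [new_top]


def rsd_value(digits):
    return sum(d * 2 ** i for i, d in enumerate(digits))


if __name__ == "__main__":
    # 0 0 1 -1 -1 0 (most significant first) encodes 8 - 4 - 2 = 2.
    wide = [0, -1, -1, 1, 0, 0]
    narrow = truncate_high(wide, 3)
    print(rsd_value(wide), narrow, rsd_value(narrow))
```

Simply dropping the upper digits of the example would turn 2 into -6; folding the (W+1)-th digit into the W-th one keeps the value while inspecting a constant number of digits, independent of W.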

(4) RSD multiplier module

In circuit design, an ordinary multiplication consists mainly of two steps: 1) generating the partial products; 2) accumulating the partial products in a tree structure using fast carry-propagate adders (CPA).

The first step is relatively simple, but in the second step the critical path grows with the bit width and the computation delay becomes large. Since the multiplications in the iterative process of the invention always use the redundant signed digit representation, the second step can be accelerated by exploiting carry-free addition, with no conversion to ordinary format. The critical path is further reduced where appropriate by pipelining and similar techniques; the resulting architecture is shown in FIG. 4. The RSD multiplier module mainly comprises:

1. A partial product generator (PPG), which forms the partial products according to the multiplier digit $a_i$: when $a_i$ is $\bar{1}$, 0, 1, the partial product is -B, 0, B respectively, where -B only requires exchanging the corresponding odd and even bits of B.

2. An accumulator (RSD adder, RSDA), which reduces the partial products in a tree structure using redundant adders; see FIG. 5.

During this process, the fast high-order truncation module of (3) is used to avoid uneven growth of the number of digits during accumulation.
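The two steps can be sketched in software as follows. The adder below uses a textbook radix-2 signed-digit addition rule in which each result digit depends only on positions i, i-1 and i-2; it is an assumption that this resembles the RSDA of FIG. 5, whose exact circuit is not given in this text. The function names, the least-significant-first digit lists and the `fold_top` helper (a stand-in for the high-order truncation step) are likewise illustrative.

```python
def rsd_value(digits):
    return sum(d * 2 ** i for i, d in enumerate(digits))


def rsd_add(x, y):
    """Carry-free addition of equal-length RSD digit lists; returns len + 1 digits."""
    n = len(x)
    p = [a + b for a, b in zip(x, y)]  # position sums in [-2, 2]
    w = [0] * n                        # interim digits
    t = [0] * (n + 1)                  # transfers; t[i] enters position i
    for i in range(n):
        below_nonneg = i == 0 or p[i - 1] >= 0  # incoming transfer is then 0 or +1
        if p[i] == 2:
            t[i + 1], w[i] = 1, 0
        elif p[i] == -2:
            t[i + 1], w[i] = -1, 0
        elif p[i] == 1:
            t[i + 1], w[i] = (1, -1) if below_nonneg else (0, 1)
        elif p[i] == -1:
            t[i + 1], w[i] = (0, -1) if below_nonneg else (-1, 1)
    return [w[i] + t[i] for i in range(n)] + [t[n]]


def fold_top(digits):
    """Drop the top digit, assuming |value| < 2**(len(digits) - 2)."""
    r = (digits[-2] + 2 * digits[-1]) % 4
    if r == 2:
        raise ValueError("value too large to drop the top digit")
    return digits[:-2] + [-1 if r == 3 else r]


def rsd_multiply(a, b):
    """Multiply RSD numbers via partial products -B, 0, B and a tree of RSD adders."""
    width = len(a) + len(b) + 1  # one spare digit so that fold_top is always valid
    parts = []
    for i, digit in enumerate(a):
        partial = [digit * d for d in b]  # -B is a per-digit sign flip of B
        parts.append([0] * i + partial + [0] * (width - i - len(b)))
    while len(parts) > 1:  # pairwise (tree) reduction
        nxt = [fold_top(rsd_add(parts[j], parts[j + 1]))
               for j in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            nxt.append(parts[-1])
        parts = nxt
    return parts[0]


if __name__ == "__main__":
    a = [1, -1, 1]   # 1 - 2 + 4 = 3
    b = [-1, 0, 1]   # -1 + 4 = 3
    print(rsd_value(rsd_multiply(a, b)))
```

Because no carry ripples through `rsd_add`, the delay of one adder level does not depend on the operand width; only the depth of the reduction tree grows, logarithmically, with the number of partial products.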

Examples

Take the large-number division involved in implementing the VDF algorithm as an example. Since several divisions in the algorithm share the same divisor, it suffices to set N to 1 in the architecture to compute the reciprocal of the divisor, and then to multiply this reciprocal by each dividend, reusing the existing multiplier; each product is the corresponding quotient.

The TSMC 28 nm CMOS process library is used for ASIC synthesis of a 2048-bit/1024-bit division architecture; the results are shown in Table 2 below:

TABLE 2

The results show that the critical path of the architecture is 1.05ns, and the single operation delay is 86.1ns.

In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit, where the computer storage medium is capable of storing a computer program, and the computer program, when executed by the data processing unit, may run the inventive content of the redundancy-based high-precision low-latency large integer division acceleration apparatus and some or all of the steps in each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.

It is obvious to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a computer program, that is, a software product, which may be stored in a storage medium and includes several instructions that enable a device including a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU) or a network device) to execute the methods of the embodiments, or of some parts of the embodiments, of the present invention.

The present invention provides a redundancy-based high-precision low-delay large integer division accelerator, and a plurality of methods and approaches for implementing the technical scheme, where the foregoing is merely a preferred embodiment of the present invention, it should be noted that, for those skilled in the art, a plurality of improvements and refinements may be made without departing from the principle of the present invention, and these improvements and refinements should also be regarded as the protection scope of the present invention, including but not limited to a specific type of redundancy representation, a specific algorithm of multiplier division multiplexing, and a specific algorithm of function iteration class. All the components not specified in the present embodiment can be realized by the prior art.

Claims

1. A redundancy-based high-precision low-delay large integer division accelerating device, whose data format adopts the redundant signed digit (RSD) representation, characterized by comprising an RSD preprocessing module, a first data selector, a second data selector, a subtraction-like encoding module, a first RSD multiplier module, a second RSD multiplier module, a first truncation module, a second truncation module, a first register, a second register and a shift register;

the RSD preprocessing module is used for preprocessing the input divisor to obtain data $D_{norm}$ of the type required by the functional iteration algorithm, i.e. $D_{norm}$ is greater than 0 and less than 1, together with the most significant bit $n_D$; $D_{norm}$ is output to the first data selector, and $n_D$ is output to the shift register;

the first data selector is used for selecting, according to the control signal, between the output $D_{norm}$ of the RSD preprocessing module and the data stored in the first register to obtain the iteration divisor, and outputting the iteration divisor to the subtraction-like encoding module and the first RSD multiplier module;

the second data selector is used for selecting between the input dividend and data stored in the second register according to the control signal to obtain an iterative dividend, and outputting the iterative dividend to the second RSD multiplier module;

the subtraction-like encoding module is used for encoding the "two minus" operation on the iteration divisor in redundant form to obtain the common multiplier, and outputting the common multiplier to the first RSD multiplier module and the second RSD multiplier module;

the second truncation module is used for truncating the high W bits of the 2W-bit redundant number output by the second RSD multiplier module to obtain a new W-bit redundant number, and outputting it to the second register and the shift register;

2. The redundancy-based high-precision low-latency large integer division accelerator as claimed in claim 1, wherein the RSD preprocessing module comprises a pre-coding module, a code mapping module, a most significant bit detection module, a data selector and an internal shift register;

the pre-coding module is used for pre-coding the input redundant number D to obtain the detection tree parameters a, b, c and the LOD parameter F, outputting the detection tree parameters a, b, c to the detection module and the LOD parameter F to the most significant bit detection module.

3. The apparatus as claimed in claim 2, wherein the coding mapping module is configured to map the input redundant number D to a redundant number that has the same actual value but whose redundant bits above the most significant bit are all 0, and then output it to the internal shift register.

4. The apparatus of claim 3, wherein the most significant bit detector module is configured to find the most significant position of the input by using the most significant bit detector LOD, and then output the most significant position to the data selector.

5. The redundancy-based high-precision low-delay large integer division accelerator as claimed in claim 4, wherein the detection module is configured to detect a deviation by means of a simplified tree structure, outputting 1 if there is a deviation and 0 otherwise, and the output of the detection module serves as the control input of the data selector.

6. The apparatus as claimed in claim 5, wherein the data selector is configured to select, according to the 0/1 result of the detection module, between the output of the most significant bit detection module and that output minus one, to obtain the correct most significant bit $n_D$ as the final output, which is output to the internal shift register.

7. The redundancy-based high-precision low-latency large integer division accelerator according to claim 6, wherein the internal shift register is configured to shift the result of the code mapping module to the left to obtain the final normalized result, and the left shift number is the most significant bit obtained by the data selector.

8. The apparatus as claimed in claim 7, wherein the subtraction-like encoding module is used to process an input number I of W redundant bits, whose bit-level data format is written as

where

is the 1st bit of the 1st redundant bit of I,

is the 2nd bit of the 1st redundant bit of I,

is the 1st bit of the 2nd redundant bit of I,

where,

where

is the 1st bit of the 1st redundant bit of O,

is the 2nd bit of the 1st redundant bit of O,

is the 1st bit of the 2nd redundant bit of O,

is the 2nd bit of the 2nd redundant bit of O.

9. The redundancy-based high-precision low-delay large integer division acceleration device according to claim 8, wherein the first truncation module and the second truncation module have the same function and are configured to implement fast high-order truncation, i.e. to quickly and correctly truncate an input of 2W redundant bits to W bits without changing the actual value, the truncation rule being as follows: the lowest W-1 redundant bits of the input are unchanged, and the new W-th redundant bit $n_{new}$, written as

is determined only by the original W-th and (W+1)-th redundant bits, denoted respectively $n\{n^{+}, n^{-}\}$ and

The formula is as follows:

10. The redundancy-based high-precision low-delay large integer division accelerator as claimed in claim 9, wherein the first RSD multiplier module and the second RSD multiplier module have the same structure, each comprising a partial product generation module and an accumulator; the partial product generation module forms the partial products according to the multiplier digit $a_i$: when $a_i$ is $\bar{1}$, 0, 1, the partial product is -B, 0, B respectively, where -B only requires exchanging the corresponding odd and even bits of the multiplicand B; the accumulator uses redundant adders in a tree structure to reduce the partial products and uses the first truncation module to handle overflowing redundant bits.