CN110908635A

CN110908635A - High-speed modular multiplier for isogeny-based post-quantum cryptography and modular multiplication method thereof

Info

Publication number: CN110908635A
Application number: CN201911073701.7A
Authority: CN
Inventors: 王中风; 汪漂洋; 田静; 林军
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2020-03-24

Abstract

The invention discloses a high-throughput modular multiplier for isogeny-based post-quantum cryptography and a corresponding modular multiplication method. The modular multiplier mainly comprises a multiplication module, a reduction module and a post-processing module. The multiplication module reduces the number of multipliers by means such as the Karatsuba method. The reduction module uses constant multipliers, which consume fewer resources, together with a parallelization strategy. The post-processing module parallelizes its adders and precomputes constant parameters. As a result, the modular multiplier achieves high throughput. In addition, the disclosed modular multiplication method works on primes of a special form using an unconventional radix and, by using an optimized Barrett reduction, computes faster than the traditional Montgomery representation. The invention thus provides an effective modular multiplier architecture and modular multiplication method for current isogeny-based post-quantum encryption schemes.

Description

High-speed modular multiplier for isogeny-based post-quantum cryptography and modular multiplication method thereof

Technical Field

The invention relates to a modular multiplier and a modular multiplication method in the field of cryptography; in particular to a modular multiplier with high throughput rate and a modular multiplication method thereof in a post-quantum encryption scheme.

Background

In recent years, great progress has been made in research on quantum computers. Many common public-key algorithms, such as RSA and elliptic curve cryptography (ECC), can be broken efficiently by a quantum computer running Shor's algorithm. This has accelerated the development of post-quantum cryptography (PQC). Since 2017, the National Institute of Standards and Technology (NIST) has held two rounds of a competition aimed at developing post-quantum standards. The supersingular isogeny key encapsulation protocol (SIKE) is one of the 26 candidates that advanced to the second round. The advantage of SIKE is that its public and private keys are very short compared with those of other candidates, and that it is highly compatible with conventional ECC protocols. SIKE was developed by wrapping the supersingular isogeny Diffie-Hellman (SIDH) key exchange protocol in a key encapsulation mechanism. SIDH was originally proposed in 2011; its resistance to quantum attacks rests on the difficulty of finding an isogeny between different supersingular curves. The large number of serial isogeny computations in the protocol results in long latency, which is a bottleneck in practical applications. Since SIKE is built on SIDH, any method that accelerates SIDH can be used directly to accelerate SIKE.

Many researchers have optimized the SIDH/SIKE protocol on software and hardware platforms. In 2011, Jao implemented SIDH using the GMP big-number library, which is considered the earliest implementation of SIDH. The latest versions provided by C. Costello, P. Longa et al. are generally considered the fastest software implementations today and continually integrate the most advanced supersingular isogeny cryptographic schemes. Other work, combining optimization methods proposed in the open literature, has provided implementations of SIDH on FPGA and ARM. Breaking these computations down shows that modular multiplication is one of the basic operations of the scheme and a major concern in its design.

It is noted that the smooth isogeny prime of a supersingular curve usually has the form $p = f \cdot a^x b^y \pm 1$, where $a$ and $b$ are small primes, $x$ and $y$ are positive integers, and $f$ is a cofactor chosen so that $p$ is prime. Because of this special structure, modular arithmetic with respect to $p$ can be specialized to improve performance. Karmakar published an efficient modular arithmetic algorithm, EFFM, for primes of the form $p = 2 \cdot 2^x b^y - 1$, where $x$ and $y$ are even. An element of the field can then be represented in the unconventional radix $R = 2^{x/2} b^{y/2}$, which halves the number of multiplication operations at the cost of a small number of additional additions. The FFM1 algorithm, based on this method, reduces the number of coefficients in EFFM from three to two through an additional mapping function, so that the precomputed constant parameter can be dropped without changing the complexity. The FFM2 algorithm extends the prime format to $p = f \cdot 2^x b^y \pm 1$ at the expense of more computation, and is the most advanced algorithm so far.
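As a rough illustration of the unconventional radix (base number) described above, the following Python sketch builds a modulus of the stated shape and the matching radix, then splits a value into radix-R coefficients. The function names `make_params`, `to_radix` and `from_radix` and the toy parameter sizes are assumptions made for illustration; they are not taken from EFFM, FFM1 or FFM2.

```python
def make_params(f, x, b, y, sign):
    """Return (p, R) for p = f * 2**x * b**y + sign, with x and y even."""
    assert x % 2 == 0 and y % 2 == 0 and sign in (1, -1)
    R = 2 ** (x // 2) * b ** (y // 2)   # unconventional radix
    p = f * R * R + sign                # same value as f * 2**x * b**y + sign
    return p, R


def to_radix(a, R, n):
    """Split a into n radix-R coefficients, least significant first.

    The top coefficient is simply whatever is left after n - 1 divisions.
    """
    coeffs = []
    for _ in range(n - 1):
        a, r = divmod(a, R)
        coeffs.append(r)
    coeffs.append(a)
    return coeffs


def from_radix(coeffs, R):
    """Recombine coefficients (least significant first) into an integer."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * R + c
    return acc


# Toy-sized parameters: R = 2**2 * 3**2 = 36 and p = 2 * 36**2 - 1 = 2591.
p, R = make_params(f=2, x=4, b=3, y=4, sign=-1)
a0, a1, a2 = to_radix(2000, R, 3)   # 2000 = a2*R**2 + a1*R + a0
```

With f = 2 a value below p needs three coefficients, the highest being 0 or 1; with f = 1 two coefficients are enough.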

Disclosure of Invention

Addressing the above problems and building on previous research, the invention provides a modular multiplication method for primes of a special form based on an unconventional radix. The method adopts an optimized Barrett reduction and is faster than the previous Montgomery representation. The invention also provides a corresponding modular multiplier architecture with high throughput. The details of the invention are as follows:

A high-throughput modular multiplier architecture for isogeny-based post-quantum encryption schemes is characterized by comprising the following main modules:

1) a multiplication module, used for multiplying the coefficients of the quadratic polynomials obtained by splitting the large operands;

2) a reduction module, used for reducing the data through quotient and remainder operations;

3) a post-processing module, used for post-processing the data to obtain the final output.

The multiplication module of the modular multiplier architecture is characterized in that its inputs are the coefficients of the quadratic polynomials obtained by splitting the large operands, and in that it is optimized using the Karatsuba method, which reduces the number of multipliers and the computational complexity.

The reduction module of the modular multiplier architecture is characterized in that the data are processed using an optimized Barrett reduction algorithm to obtain reduced data. Constant multipliers, which consume fewer resources than ordinary multipliers, are used in the module, and parallelization is used to shorten the critical path.

The post-processing module of the modular multiplier architecture is characterized in that its critical path is shortened by parallelizing the adders in the calculation and precomputing the constant parameters.

The modular multiplication method for the modular multiplier architecture is characterized by comprising five steps: input data processing, first-order Karatsuba calculation, optimized Barrett reduction calculation, output data calculation, and output data post-processing:

Step one, input data processing. Suppose the modular product of A and B with respect to a prime $p$ is to be computed. The smooth isogeny prime used in the algorithm has the form $p = f \cdot 2^x b^y \pm 1$, where $f$ is 1 or 2 and $x$ and $y$ are even, so $R = 2^{x/2} b^{y/2}$ can be used as an unconventional radix. Each input is thereby rewritten as a polynomial in $R$: $A = a_2 R^2 + a_1 R + a_0$ for $f = 2$, or $A = a_1 R + a_0$ for $f = 1$, and the coefficients $(a_2), a_1, a_0, (b_2), b_1, b_0$ are taken as the inputs. For the version supporting multi-precision operation, the coefficients cannot enter the operation module all at once, so a storage or cache unit must be added to hold the data. For the case $f = 2$, a mapping is additionally applied after $a_2, a_1, a_0, b_2, b_1, b_0$ are input.

This mapping reduces the number of input coefficients from three to two and effectively reduces the complexity of the operation.

Step two, first-order Karatsuba calculation. The coefficient products $a_1 b_1$, $a_0 b_0$ and $a_1 b_0 + a_0 b_1$ are computed using the Karatsuba formula:

$$a_1 b_0 + a_0 b_1 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1$$

This reduces the number of multipliers from four to three. In theory, applying the Karatsuba method at higher orders reduces the multiplication complexity further, but hardware resource consumption then increases rapidly, so a good compromise between the two must be found.
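The three-multiplication trick of step two can be sketched in a few lines of Python. The function name `karatsuba_products` is a hypothetical helper chosen for illustration; the sketch only shows the arithmetic identity, not the hardware data path.

```python
def karatsuba_products(a0, a1, b0, b1):
    """Return (a0*b0, a1*b0 + a0*b1, a1*b1) using three multiplications."""
    low = a0 * b0                               # multiplier 1
    high = a1 * b1                              # multiplier 2
    mixed = (a0 + a1) * (b0 + b1)               # multiplier 3
    cross = mixed - low - high                  # equals a1*b0 + a0*b1
    return low, cross, high


low, cross, high = karatsuba_products(20, 19, 7, 31)
```

The schoolbook approach would need four multiplications for the same three results; the saving is paid for with two extra additions before the multipliers and two subtractions after them.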

Step three, optimized Barrett reduction calculation. The reduced data are obtained through quotient and remainder operations. Because constant multipliers consume fewer resources than ordinary multipliers and the algorithm requires multiplication by constant parameters, separately designed constant multipliers are used here. In addition, for the multi-precision version, the algorithm formula requires that the output of step four of the previous iteration, after some shift operations, be added to the data before reduction.

Step four, output data calculation. The quotients and remainders obtained from the reduction in the previous step are combined according to the algorithm formula to obtain preliminary output data. For the multi-precision version, these data must, after some shift operations, be added to the data entering the reduction operation of step three. A parallelization strategy is adopted to further increase the clock frequency.

Step five, output data post-processing. For the case $f = 2$, the number of coefficients must be changed back to three through the inverse mapping. The data obtained in step four are correct but need further processing: because they may not satisfy the constraints of the algorithm formula, carries must be handled, and some addition and subtraction operations are introduced so that the output meets the algorithm constraints. Parallelizing the adders in this calculation and precomputing the constant parameters used improves the throughput.
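To make the quotient and remainder computation of step three more concrete, here is a minimal sketch of textbook Barrett reduction with respect to a fixed radix R. It is not the optimized variant of the invention, whose formulas are not reproduced in this text; the names `barrett_setup`, `barrett_divmod`, `k` and `mu` are assumptions for illustration.

```python
def barrett_setup(R):
    """Precompute the Barrett constant for divisor R.

    k is chosen so that inputs up to 2**k (comfortably above R**2) are covered.
    """
    k = 2 * R.bit_length() + 2
    mu = (1 << k) // R          # fixed constant, computed once
    return k, mu


def barrett_divmod(t, R, k, mu):
    """Return (t // R, t % R) for 0 <= t < 2**k without a division."""
    assert 0 <= t < (1 << k)
    q = (t * mu) >> k           # estimate of the quotient, never too large
    r = t - q * R               # matching remainder estimate
    while r >= R:               # small correction, at most a few steps
        r -= R
        q += 1
    return q, r


k, mu = barrett_setup(36)
q, r = barrett_divmod(1000, 36, k, mu)   # same result as divmod(1000, 36)
```

Both multiplications in `barrett_divmod` have one fixed operand (`mu` or `R`), and the right shift simply selects high-order bits, which is consistent with the text's use of dedicated constant multipliers in the reduction module.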

The combination of the modular multiplier architecture and the modular multiplication method has the following beneficial effects:

First, the modular multiplier computes in an unconventional radix, which is faster than the traditional Montgomery representation; at best, the interval between successive outputs in the output stream is only about one clock cycle.

Second, the invention has high throughput and supports multi-precision calculation. With equivalent or slightly higher hardware resource consumption, the throughput of the multi-precision version reaches about 10 times that of the prior design. For the non-multi-precision version, hardware resource consumption increases, but the increase is small relative to the throughput gain; its throughput advantage is more pronounced, at about 60 times or more that of the prior design.

Third, several modules of the invention adopt a parallelization strategy, which raises the clock frequency.

Fourth, the mapping operation that may be used in input data processing and the Karatsuba calculation reduce the computational complexity.

Fifth, constant multipliers and ordinary multipliers are designed separately, which reduces resource consumption.

Drawings

FIG. 1 is a block diagram of the modular multiplier according to the present invention.

Detailed Description

The following description will further describe embodiments of the present invention with reference to the accompanying drawings. Firstly, the modular multiplier architecture is introduced, and secondly, the modular multiplication method applicable to the modular multiplier architecture is introduced. The embodiments described below by referring to the drawings are exemplary and intended to be illustrative of the present invention and are not to be construed as limiting the present invention.

The architecture of the modular multiplier of the present invention is first described.

Fig. 1 is a schematic diagram of the modular multiplier of the present invention, which includes a multiplication module, a reduction module, and a post-processing module. The multiplication module multiplies the coefficients of the quadratic polynomials obtained by splitting the large operands, and contains only three ordinary multipliers. The reduction module performs quotient and remainder operations using an optimized Barrett reduction algorithm to reduce the data. The post-processing module processes the data so that they satisfy the output constraints of the algorithm.

The specific operation is as follows. For the $f = 2$ version, the number of coefficients of each input is first reduced to two through the mapping operation. Then, following the Karatsuba method, the three pairs $(a_0, b_0)$, $(a_1, b_1)$ and $(a_0 + a_1, b_0 + b_1)$ are fed to the three ordinary multipliers in the multiplication module to obtain three products. The data are then sent to the reduction module and processed with the optimized Barrett reduction algorithm to obtain two remainders $(c_0, c_1)$ and two quotients $(q_0, q_1)$. The quotients and remainders are accumulated according to the algorithm formula to obtain a preliminary output. For the multi-precision version, these data must, after some shift operations, be added back to the data entering the reduction module. Finally, the preliminary output is fed to the post-processing module, and some addition and subtraction operations yield a final output that satisfies the algorithm constraints.
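The data flow just described (three products, reduction into quotients and remainders, accumulation, post-processing) can be mimicked by a simplified software model. The sketch below is an assumption-laden toy: it covers only the case f = 1 with p = R² + 1, uses Python's built-in `divmod` in place of the optimized Barrett reduction, and the function name `modmul_radix` is hypothetical. It is not the formula set of the invention.

```python
def modmul_radix(A, B, R):
    """Compute A * B mod p for p = R*R + 1 using radix-R coefficients."""
    p = R * R + 1
    a1, a0 = divmod(A, R)               # A = a1*R + a0
    b1, b0 = divmod(B, R)               # B = b1*R + b0

    # Multiplication module: three products via Karatsuba.
    low = a0 * b0
    high = a1 * b1
    cross = (a0 + a1) * (b0 + b1) - low - high

    # Because R*R = -1 (mod p), the R**2 term folds into the constant term.
    t0 = low - high
    t1 = cross

    # Reduction module: quotient and remainder with respect to R.
    q0, c0 = divmod(t0, R)
    q1, c1 = divmod(t1 + q0, R)

    # Output calculation: q1 * R**2 = -q1 (mod p).
    result = c1 * R + c0 - q1

    # Post-processing: a few additions or subtractions bring the
    # result into the range [0, p).
    while result < 0:
        result += p
    while result >= p:
        result -= p
    return result


product = modmul_radix(1000, 1234, 36)   # toy radix R = 36, p = 1297
```

In this toy model the two remainders and two quotients play the roles of $(c_0, c_1)$ and $(q_0, q_1)$, and the final loops stand in for the addition and subtraction operations of the post-processing module.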

The following section describes the optimized modular multiplication method based on the modular multiplier architecture of the present invention. The method comprises five steps: input data processing, first-order Karatsuba calculation, optimized Barrett reduction calculation, output data calculation, and output data post-processing:

$$a_1 b_0 + a_0 b_1 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1$$

Step three, optimized Barrett reduction calculation. The reduced data are obtained through quotient and remainder operations. Because constant multipliers consume fewer resources than ordinary multipliers and the algorithm requires multiplication by constant parameters, separately designed constant multipliers are used here. In addition, for the multi-precision version, the algorithm formula requires that the output of step four of the previous iteration, after some shift operations, be added to the data before reduction.

Step four, output data calculation. The quotients and remainders obtained from the reduction in the previous step are combined according to the algorithm formula to obtain preliminary output data. For the multi-precision version, according to the formula $C^{(j)} = C^{(j+1)} \cdot 2^k + A_i B \bmod p$, these data must undergo some shift operations and be added to the data entering the reduction operation of step three. A parallelization strategy is adopted to further increase the clock frequency.

Step five, output data post-processing. For the case $f = 2$, the number of coefficients must be changed back to three through the inverse mapping. The data obtained in step four are correct but need further processing: because they may not satisfy the constraints of the algorithm formula, carries must be handled, and some addition and subtraction operations are introduced so that the output meets the algorithm constraints. Parallelizing the adders in this calculation and precomputing the constant parameters used improves the throughput.
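The multi-precision recurrence quoted in step four can be read as a word-serial (interleaved) multiplication. The sketch below assumes that $A_i$ denotes the k-bit word of A consumed in the current iteration, most significant word first, and that the running value C starts at zero; this reading, the function name `interleaved_modmul` and the parameter `n_words` are assumptions, and the plain `%` operator stands in for the hardware reduction.

```python
def interleaved_modmul(A, B, p, k, n_words):
    """Compute A * B mod p by consuming A in k-bit words, top word first.

    Each iteration applies C <- (C * 2**k + A_i * B) mod p, so the previous
    output is shifted by k bits and added in before the next reduction.
    Requires 0 <= A < 2**(k * n_words).
    """
    assert 0 <= A < (1 << (k * n_words))
    mask = (1 << k) - 1
    C = 0
    for i in reversed(range(n_words)):
        A_i = (A >> (k * i)) & mask         # current k-bit word of A
        C = ((C << k) + A_i * B) % p        # shift, accumulate, reduce
    return C


# Toy parameters: p = 2591, A split into three 4-bit words.
c = interleaved_modmul(2000, 1234, 2591, k=4, n_words=3)
```

Only one k-bit word of A is multiplied by B per iteration, which is why this version trades longer latency for smaller multipliers.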

Next, the modular multiplication method is explained in conjunction with the hardware architecture, detailing how data flow through the hardware. The method proceeds as follows:

Step one, input data processing. The coefficients $(a_2), a_1, a_0, (b_2), b_1, b_0$ are taken as inputs. For the version supporting multi-precision operation, the coefficients cannot enter the operation module all at once, so a storage or cache unit must be added to hold the data. For the case $f = 2$, a mapping operation is applied after $a_2, a_1, a_0, b_2, b_1, b_0$ are input, changing the input coefficients to $a_1, a_0, b_1, b_0$; that is, the coefficients pass through an inverter, an adder and a selector, which reduces the number of input coefficients.

Step two, first-order Karatsuba calculation. The input coefficients from step one pass through two adders, giving six data items in total, which are grouped into the three pairs $(a_0, b_0)$, $(a_1, b_1)$ and $(a_0 + a_1, b_0 + b_1)$ and fed to the three ordinary multipliers to compute the three products $a_0 b_0$, $a_1 b_1$ and $(a_0 + a_1)(b_0 + b_1)$. A series of subtractions then yields terms such as $(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1$. For the multi-precision version, the output of step four of the previous iteration, after shift operations, is added before entering step three.

Step three, optimized Barrett reduction calculation. Quotients and remainders are obtained through quotient and remainder operations, which in hardware amount to taking the high-order or low-order bits of the data. The multiplications by constant parameters use two constant multipliers to reduce resource consumption. In addition, for the multi-precision version, the algorithm formula requires that the output of step four of the previous iteration, after some shift operations, be added before step three.

Step four, output data calculation. The quotients and remainders computed in step three pass through several adders, subtractors and selectors, according to the algorithm formula, to give the preliminary output data. For the multi-precision version, according to the formula $C^{(j)} = C^{(j+1)} \cdot 2^k + A_i B \bmod p$, these data must undergo some shift operations and be added to the data entering the reduction operation of step three.

Step five, output data post-processing. For the case $f = 2$, the number of coefficients must be changed back to three through the inverse mapping. The inverse mapping operation is similar to the mapping operation, except that one additional exclusive-OR operation serves as a selection signal, and one additional inverter and one additional selector are provided to generate the extra coefficient output. The subsequent post-processing passes through a series of adders and subtractors, and the final data output is then produced by a selector.

Implementation example: the modular multiplier of the present invention was implemented in hardware for the prime $p = 2 \cdot 2^{386} \cdot 3^{242} - 1$, corresponding to the security level p771. The implementation used Xilinx Vivado 2016.4, targeting the Kintex-7 based xc7k325tffg900-2 development board and the Virtex-7 based xc7vx690tffg1157-3 development board.
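The parameters of this example can be checked with a few lines of Python. The sketch assumes the prime is $2 \cdot 2^{386} \cdot 3^{242} - 1$ as read from the text, and the variable names are illustrative; it checks only the size and shape of the modulus, not its primality.

```python
x, y = 386, 242
p = 2 * 2**x * 3**y - 1                 # modulus of the implementation example
R = 2**(x // 2) * 3**(y // 2)           # unconventional radix 2**193 * 3**121

p_has_expected_shape = (p == 2 * R * R - 1)
p_bits = p.bit_length()                 # size of the modulus in bits
R_bits = R.bit_length()                 # size of one radix-R coefficient
```

The modulus is 771 bits long, which matches the p771 label, and each radix-R coefficient is about half that width, which is what lets the multiplication module work on roughly half-width operands.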

The resource consumption and performance after synthesis of the multi-precision version are shown in the following table:

Table 1: Synthesis results on the Kintex-7 xc7k325tffg900-2 development board

Algorithm	FFM1	FFM2	Multi-precision version
FFs	9675	11635	12902
LUTs	16627	33051	25743
DSPs	122	529	210
fclk (MHz)	55	25	57
Time (ns)	1164	1120	122
Throughput (Mb/s)	663	688	6278

As the table shows, with equivalent or slightly higher hardware resource consumption, the multi-precision version reaches a throughput about 10 times that of the previous designs.

The results of the implementation of the non-multi-precision version are shown in the following table:

Table 2: Synthesis results on the Virtex-7 xc7vx690tffg1157-3 development board

Algorithm	Multi-precision version	Non-multi-precision version
FFs	12902	38976
LUTs	25743	63173
DSPs	210	729
fclk (MHz)	56	60
Time (ns)	124	17
Throughput (Mb/s)	6168	46260

It can be seen that, although hardware resource consumption increases, the increase is small relative to the throughput gain. The throughput advantage of the non-multi-precision version over the previous FFM2 is even more pronounced, at more than 60 times that of FFM2.
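The speed-up figures quoted for the two tables can be reproduced with simple arithmetic. The sketch below copies the time and throughput values from the tables; the assumption that throughput is roughly one 771-bit result per reported time interval is ours, not stated in the text, and the names are illustrative.

```python
OPERAND_BITS = 771   # assumed: one 771-bit result per output interval

# (time in ns, reported throughput in Mb/s), copied from the two tables
results = {
    "FFM1": (1164, 663),
    "FFM2": (1120, 688),
    "multi-precision (Kintex-7)": (122, 6278),
    "multi-precision (Virtex-7)": (124, 6168),
    "non-multi-precision (Virtex-7)": (17, 46260),
}


def estimated_mbps(time_ns):
    """Bits per nanosecond is Gb/s; multiply by 1000 to get Mb/s."""
    return OPERAND_BITS / time_ns * 1000.0


# Ratios of reported throughput relative to FFM2.
speedup_multi = results["multi-precision (Kintex-7)"][1] / results["FFM2"][1]
speedup_non_multi = results["non-multi-precision (Virtex-7)"][1] / results["FFM2"][1]

# Relative gap between the rough estimate and each reported figure.
gaps = {name: abs(estimated_mbps(t) - mbps) / mbps
        for name, (t, mbps) in results.items()}
```

The ratios come out at roughly 9 and 67, in line with the "about 10 times" and "more than 60 times" statements, and the rough bits-per-interval estimate lands within a few percent of every reported throughput value.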

With the modular multiplier architecture and modular multiplication algorithm of the embodiments of the invention, the throughput of the modular multiplier can be improved to the greatest extent. The multiplication module, reduction module and post-processing module of the architecture are described mainly in functional terms, and there are many methods and ways of realizing the function of each part. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention. Components not specified in this embodiment can be implemented using the prior art.

Claims

1. A high-throughput modular multiplier architecture for isogeny-based post-quantum encryption schemes, characterized by comprising the following main modules:

2. The multiplication module of the modular multiplier architecture according to claim 1, characterized in that its inputs are the coefficients of the quadratic polynomials obtained by splitting the large operands, and in that it is optimized using the Karatsuba method, which reduces the number of multipliers and the computational complexity.

3. The reduction module of the modular multiplier architecture according to claim 1, characterized in that the data are processed using an optimized Barrett reduction algorithm to obtain reduced data, that constant multipliers, which consume fewer resources than ordinary multipliers, are used in the module, and that parallelization is used to shorten the critical path.

4. The post-processing module of the modular multiplier architecture according to claim 1, characterized in that the adders in the calculation are parallelized and constant parameters are precomputed, which shortens its critical path and improves the throughput.

5. A modular multiplication method based on the modular multiplier architecture of claim 1, characterized by comprising five steps: input data processing, first-order Karatsuba calculation, optimized Barrett reduction calculation, output data calculation, and output data post-processing:

Step one, input data processing. Suppose the modular product of A and B with respect to a prime $p$ is to be computed. The smooth isogeny prime used in the algorithm has the form $p = f \cdot 2^x b^y \pm 1$, where $f$ is 1 or 2 and $x$ and $y$ are even, so $R = 2^{x/2} b^{y/2}$ can be used as an unconventional radix. Each input is thereby rewritten as a polynomial in $R$: $A = a_2 R^2 + a_1 R + a_0$ for $f = 2$, or $A = a_1 R + a_0$ for $f = 1$, and the coefficients $(a_2), a_1, a_0, (b_2), b_1, b_0$ are taken as the inputs. For the version supporting multi-precision operation, the coefficients cannot enter the operation module all at once, so a storage or cache unit must be added to hold the data. For the case $f = 2$, a mapping is additionally applied after $a_2, a_1, a_0, b_2, b_1, b_0$ are input.

$$a_1 b_0 + a_0 b_1 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1$$

Step five, output data post-processing. For the case $f = 2$, the number of coefficients must be changed back to three through the inverse mapping. The data obtained in step four are correct but need further processing: because they may not satisfy the constraint range of the algorithm formula, carries must be handled, and some addition and subtraction operations are introduced so that the output meets the algorithm constraints. Parallelizing the adders in this calculation and precomputing the constant parameters used improves the throughput.