CN117785129A

CN117785129A - Montgomery modular multiplication operation method based on GPU

Info

Publication number: CN117785129A
Application number: CN202410200625.6A
Authority: CN
Inventors: 冯黎明; 董建阔; 叶青波; 陈昕; 马煜翔; 王超; 吴凡
Original assignee: Lanxiang Zhilian Hangzhou Technology Co ltd
Current assignee: Lanxiang Zhilian Hangzhou Technology Co ltd
Priority date: 2024-02-23
Filing date: 2024-02-23
Publication date: 2024-03-29
Anticipated expiration: 2044-02-23
Also published as: CN117785129B

Abstract

The invention discloses a GPU-based Montgomery modular multiplication method. The method comprises the following steps: selecting q threads of the GPU; dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing them in shared memory; and having the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using their sub-data X, their sub-moduli M and the p×q sub-data Y stored in shared memory, to obtain the calculation result w. The method implements Montgomery modular multiplication on the GPU, greatly improving calculation efficiency and reducing calculation latency.

Description

Montgomery modular multiplication operation method based on GPU

Technical Field

The invention relates to the technical field of computers, in particular to a Montgomery modular multiplication operation method based on a GPU.

Background

In recent years, data volumes have grown explosively and data has become increasingly varied and complex, while large amounts of valuable client information, personal privacy records and enterprise operating data are continuously being mined. In this era of data explosion, information security for big data is particularly important. Information security rests on encryption algorithms, and the encryption algorithms in common use today comprise symmetric and asymmetric encryption algorithms. The asymmetric encryption algorithms mainly comprise the RSA and ECC encryption algorithms, both of which need to perform modular multiplication.

Among modular multiplication algorithms, the Montgomery modular multiplication algorithm is efficient and convenient to implement. Existing Montgomery modular multiplication is implemented on a CPU, where a single thread performs the operation. Because the operation is constrained by its formula, it currently takes a long time and has low calculation efficiency, which in turn reduces encryption efficiency.

Disclosure of Invention

In order to solve the technical problems, the invention provides a Montgomery modular multiplication operation method based on a GPU, which can realize Montgomery modular multiplication operation on the GPU, thereby greatly improving the calculation efficiency and reducing the calculation time delay.

In order to solve the problems, the invention is realized by adopting the following technical scheme:

The invention provides a GPU-based Montgomery modular multiplication method for performing Montgomery modular multiplication, wherein the data x, the data y and the modulus m are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:

S1: selecting q threads of the GPU;

S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing them in shared memory;

S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using their sub-data X, their sub-moduli M and the p×q sub-data Y stored in shared memory, to obtain the calculation result w.

In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p×q parts and stored in shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel, which greatly improves calculation efficiency. CIOS stands for Coarsely Integrated Operand Scanning.

Preferably, the q threads are numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q;

step S2 comprises the following steps:

dividing the data x equally into q sub-data X, denoted $X_1, X_2, \dots, X_q$ from low order to high order, and sending sub-data $X_i$ to the thread numbered i, where $X_i$ represents bits $32p(i-1)$ to $32pi-1$ of the data x;

dividing the modulus m equally into q sub-moduli M, denoted $M_1, M_2, \dots, M_q$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32p(i-1)$ to $32pi-1$ of the modulus m;

dividing the data y equally into p×q sub-data Y, denoted $Y_1, Y_2, \dots, Y_{p \cdot q}$ from low order to high order, and storing them in shared memory, with 1 ≤ j ≤ p×q, where $Y_j$ represents bits $32(j-1)$ to $32j-1$ of the data y.
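The partitioning in step S2 can be sketched on the host side as follows. This is a minimal Python illustration of the bit ranges only, not the GPU implementation; the function names `split_operands` and `join` are hypothetical, and the 0-based list index k corresponds to the 1-based thread or word number k+1 used in the text.

```python
def split_operands(x, y, m, p, q):
    """Split x and m into q chunks of 32*p bits and y into p*q 32-bit words."""
    chunk_bits = 32 * p
    chunk_mask = (1 << chunk_bits) - 1
    # X[i] holds bits 32*p*i .. 32*p*(i+1) - 1 of x (the share of thread i+1).
    X = [(x >> (chunk_bits * i)) & chunk_mask for i in range(q)]
    # M[i] holds the same bit range of the modulus m.
    M = [(m >> (chunk_bits * i)) & chunk_mask for i in range(q)]
    # Y[j] holds bits 32*j .. 32*(j+1) - 1 of y (kept in shared memory on the GPU).
    Y = [(y >> (32 * j)) & 0xFFFFFFFF for j in range(p * q)]
    return X, M, Y


def join(parts, bits):
    """Concatenate parts (low order first), each `bits` wide, back into one integer."""
    return sum(part << (bits * k) for k, part in enumerate(parts))
```

Joining the pieces again with `join(X, 32 * p)` or `join(Y, 32)` recovers the original operand, which is a quick way to check that the bit ranges were cut as described.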

Preferably, the step S1 further includes the steps of: each thread constructs a storage space for storing an intermediate calculation result Z, wherein the storage space comprises p+2 subspaces, and each subspace has 32 bits;

the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$; the data stored in the p+2 subspaces built by the thread numbered i are denoted $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ from low order to high order, with 1 ≤ r ≤ p+2, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$.

Preferably, the step S3 includes the steps of:

S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;

S32: each thread fetches the sub-data $Y_j$ from shared memory and calculates its intermediate calculation result Z by combining it with the sub-data X and the sub-modulus M that the thread holds;

S33: setting j = j + 1 and judging whether j is greater than p×q; if so, executing step S34, otherwise jumping to step S32;

S34: extracting the effective value from the intermediate calculation result Z calculated by each thread and concatenating the effective values into the calculation result w.
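For orientation, the loop S31 to S34 follows the structure of the standard single-threaded CIOS algorithm, which can be sketched in Python as below. This is the textbook word-serial version, not the multi-threaded variant described here: it performs the final conditional subtraction once at the end, and it assumes the usual Montgomery constant m' = −m⁻¹ mod 2³², which the text does not spell out. The names `cios_montmul`, `to_words` and `from_words` are illustrative.

```python
W = 32
MASK = (1 << W) - 1


def to_words(v, s):
    """Split v into s little-endian 32-bit words."""
    return [(v >> (W * k)) & MASK for k in range(s)]


def from_words(ws):
    """Reassemble little-endian 32-bit words into one integer."""
    return sum(w << (W * k) for k, w in enumerate(ws))


def cios_montmul(x, y, m, s):
    """Return x*y*R^-1 mod m with R = 2**(32*s); requires odd m and x, y < m."""
    a, b, n = to_words(x, s), to_words(y, s), to_words(m, s)
    m_prime = (-pow(m, -1, 1 << W)) & MASK  # assumed: m' = -m^-1 mod 2^32
    t = [0] * (s + 2)
    for j in range(s):
        # Multiplication step: t += a * b[j], one word at a time.
        carry = 0
        for r in range(s):
            acc = t[r] + a[r] * b[j] + carry
            t[r], carry = acc & MASK, acc >> W
        acc = t[s] + carry
        t[s], t[s + 1] = acc & MASK, acc >> W
        # Reduction step: t = (t + u*m) / 2^32, shifting one word down while adding.
        u = (t[0] * m_prime) & MASK
        carry = (t[0] + u * n[0]) >> W
        for r in range(1, s):
            acc = t[r] + u * n[r] + carry
            t[r - 1], carry = acc & MASK, acc >> W
        acc = t[s] + carry
        t[s - 1], carry = acc & MASK, acc >> W
        t[s] = t[s + 1] + carry
    result = from_words(t[:s + 1])
    # Final conditional subtraction keeps the result below m.
    return result - m if result >= m else result
```

With s = p×q words, the outer loop over j plays the role of steps S32 and S33, and the working array `t` of s+2 words corresponds to the concatenation of the per-thread results Z.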

Preferably, the step S32 includes the steps of:

S321: each thread fetches the sub-data $Y_j$ from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, the current intermediate calculation result Z and the sub-data $Y_j$, and assigns the latest intermediate calculation result to Z;

S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;

S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and the current intermediate calculation result Z, and assigns the latest intermediate calculation result to Z;

S324: extracting the effective value from the intermediate calculation result Z calculated by each thread and concatenating the effective values into a parameter f; judging whether the parameter f is smaller than the modulus m, and if so setting the parameter flag to 0, otherwise setting the parameter flag to 1;

each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f that it holds, and assigns the latest intermediate calculation result to Z.

Preferably, in step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data $X_i$ it holds, the current intermediate calculation result $Z_i$ and the sub-data $Y_j$, and assigns it to $Z_i$, as follows:

N1: reassigning the data $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ in sequence according to the following formula:

，

where $C_i(r)$ represents the carry generated when calculating the latest data $Z_i(r)$, $X_i(r)$ represents bits $32(r-1)$ to $32r-1$ of sub-data $X_i$, and the two remaining terms of the formula represent the low 32 bits of $X_i(r) \cdot Y_j$ and the high 32 bits of $X_i(r-1) \cdot Y_j$ respectively;

N2: reassigning the data $Z_i(2), Z_i(3), \dots, Z_i(p+2)$ in sequence according to the following formula:

；

N3: concatenating the latest data $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ in sequence to obtain the intermediate calculation result $Z_i$.

Because $X_i$ and $Y_j$ are large numbers, these two rounds of calculation quickly realize $Z_i = Z_i + X_i \cdot Y_j$, which greatly improves calculation efficiency.
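One way to read the two rounds N1 and N2 is sketched below in Python. The exact formulas are not reproduced in this text, so the split used here is an assumption: round one adds to each word the low half of one partial product and the high half of the neighbouring one while only recording the carries, and round two ripples those carries upward. The names `mul_add_two_rounds`, `words` and `value` are hypothetical, and lists are 0-based, so `Z[r]` corresponds to $Z_i(r+1)$.

```python
MASK = 0xFFFFFFFF


def words(v, n):
    """Split v into n little-endian 32-bit words."""
    return [(v >> (32 * k)) & MASK for k in range(n)]


def value(ws):
    """Reassemble little-endian 32-bit words into one integer."""
    return sum(w << (32 * k) for k, w in enumerate(ws))


def mul_add_two_rounds(Z, X, y):
    """Return Z + X*y, where Z has p+2 words, X has p words and y is one word."""
    p = len(X)
    Z = list(Z)
    C = [0] * (p + 2)
    # Round 1 (assumed reading of N1): per-word sums; carries are recorded, not propagated.
    for r in range(p + 2):
        lo = (X[r] * y) & MASK if r < p else 0          # low 32 bits of X(r)*y
        hi = (X[r - 1] * y) >> 32 if 1 <= r <= p else 0  # high 32 bits of X(r-1)*y
        acc = Z[r] + lo + hi
        Z[r], C[r] = acc & MASK, acc >> 32
    # Round 2 (assumed reading of N2): add each recorded carry to the next word up.
    carry = 0
    for r in range(1, p + 2):
        acc = Z[r] + C[r - 1] + carry
        Z[r], carry = acc & MASK, acc >> 32
    return Z
```

The sketch relies on the result fitting in p+2 words, which holds in the algorithm because $Z_i$ occupies at most p+1 words before the multiplication.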

Preferably, the method for calculating the intermediate parameter u by the thread numbered 1 in the step S322 includes the following steps:

the thread numbered 1 fetches the data $Z_1(1)$ and calculates the intermediate parameter u using the following formula:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

where $m'$ and m are inverse elements of each other.
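The role of the intermediate parameter u can be illustrated with a short Python sketch. It assumes the conventional Montgomery constant m' = −m⁻¹ mod 2³² (the text only states that m' and m are inverse elements), and the function names are hypothetical. With that choice, adding u·m to the running value clears its low 32 bits, so the later shift by one word loses no information.

```python
def montgomery_word_inverse(m):
    """Return m' = -m^-1 mod 2^32 (assumed definition); m must be odd."""
    return (-pow(m, -1, 1 << 32)) % (1 << 32)


def intermediate_u(z_low_word, m_prime):
    """Return u = (Z_1(1) * m') mod 2^32; only the low word of Z is needed."""
    return (z_low_word * m_prime) & 0xFFFFFFFF
```

Because the product is reduced modulo 2³², only the lowest word of the modulus influences `montgomery_word_inverse`, so on the GPU the constant can be computed once per modulus and reused for every multiplication.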

Preferably, the step S323 includes the steps of:

S3231: each thread reassigns the intermediate calculation result Z it holds;

the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:

reassigning the data $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ in sequence according to the following formula:

，

where $C_i(r)$ represents the carry generated when calculating the latest data $Z_i(r)$, $M_i(r)$ represents bits $32(r-1)$ to $32r-1$ of sub-modulus $M_i$, and the two remaining terms of the formula represent the low 32 bits of $M_i(r) \cdot u$ and the high 32 bits of $M_i(r-1) \cdot u$ respectively;

S3232: the thread numbered k+1 sends the data $Z_{k+1}(1)$ to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3, …, q−1;

S3233: each thread reassigns the intermediate calculation result Z it holds;

setting tmp(q) = 0, the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:

reassigning the data $Z_i(1), Z_i(2), \dots, Z_i(p+1)$ in sequence according to the following formula:

，

S3234: the thread numbered k sends the data $Z_k(p+1)$ to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);

S3235: the threads numbered 1, 2, …, q reassign the intermediate calculation results Z they hold, in sequence;

when i = 1, the intermediate calculation result $Z_1$ remains unchanged;

when i = 2, reassigning the data $Z_i(1), Z_i(2), \dots, Z_i(p)$ in sequence according to the following formula:

；

when 2 < i < q, reassigning the data $Z_i(1), Z_i(2), \dots, Z_i(p)$ in sequence according to the following formula:

；

when i = q, reassigning the data $Z_i(1), Z_i(2), \dots, Z_i(p)$, $Z_i(p+1)$ in sequence according to the following formula:

。
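The data flow of steps S3231 to S3235 can be simulated in Python with one integer per thread. Since the formulas themselves are not reproduced in this text, the sketch below is an interpretation under stated assumptions: each thread adds its share of u·m, the one-word right shift is realized by every thread handing its lowest word to the thread below it, and any overflow above p words is then passed upward in thread order. The name `reduce_and_shift` is hypothetical and list index k stands for thread k+1.

```python
def reduce_and_shift(Z, M, u, p):
    """Simulate Z := (Z + u*m) / 2^32 on per-thread chunks (low-order thread first)."""
    q = len(Z)
    chunk_bits = 32 * p
    chunk_mask = (1 << chunk_bits) - 1
    # S3231: every thread adds u times its own sub-modulus.
    Z = [z + mi * u for z, mi in zip(Z, M)]
    # S3232: each thread sends its lowest word to the thread below; the last gets 0.
    tmp = [Z[k + 1] & 0xFFFFFFFF for k in range(q - 1)] + [0]
    # S3233: shift right by one word; the received word lands in word p.
    Z = [(z >> 32) + (t << (32 * (p - 1))) for z, t in zip(Z, tmp)]
    # S3234/S3235 (assumed reading): pass what overflows p words upward, in thread order.
    for k in range(q - 1):
        Z[k + 1] += Z[k] >> chunk_bits
        Z[k] &= chunk_mask
    return Z
```

The lowest word of thread 1 is dropped by the shift without loss because u was chosen so that this word is zero after the addition. Concatenating the returned chunks gives the same value as shifting the full-width sum right by 32 bits.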

Preferably, step S324 comprises the following steps:

S3241: extracting the effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;

the sub-parameter $F_i$ is extracted from the intermediate calculation result $Z_i$ of the thread numbered i as follows:

when 1 ≤ i < q, the sub-parameter $F_i$ is the first 32×p bits of the intermediate calculation result $Z_i$;

when i = q, the sub-parameter $F_i$ is the first 32×(p+1) bits of the intermediate calculation result $Z_i$;

S3242: concatenating the sub-parameters $F_1, F_2, \dots, F_q$ into a parameter f and judging whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;

S3243: the threads numbered 1, 2, …, q reassign the sub-parameters F they hold, in sequence;

the thread numbered i reassigns the sub-parameter $F_i$ it holds as follows:

calculating $F_i = F_i + \mathrm{flag} \cdot (-M_i)$, borrowing from $F_{i+1}$ if $F_i < M_i$;

S3244: each thread reassigns the intermediate calculation result Z it holds;

the thread numbered i reassigns the intermediate calculation result $Z_i$ it holds as follows:

when 1 ≤ i < q, the first 32×p bits of the intermediate calculation result $Z_i$ are assigned the value of the sub-parameter $F_i$, and the last 64 bits of $Z_i$ are set to 0;

when i = q, the first 32×(p+1) bits of the intermediate calculation result $Z_i$ are assigned the value of the sub-parameter $F_i$, and the last 32 bits of $Z_i$ are set to 0.
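The conditional subtraction of steps S3242 and S3243 can be sketched in Python as a chunk-wise subtraction with borrow. This is an illustrative reading: the comparison of f with m is done here on reassembled integers for brevity, and the function name `conditional_subtract` is hypothetical. List index i stands for thread i+1, and the last chunk may be one word wider than the others.

```python
def conditional_subtract(F, M, p):
    """Return (chunks of f - flag*m, flag), where flag = 1 if f >= m else 0."""
    q = len(F)
    chunk_bits = 32 * p
    # S3242: reassemble f and m to decide the flag.
    f = sum(c << (chunk_bits * i) for i, c in enumerate(F))
    m = sum(c << (chunk_bits * i) for i, c in enumerate(M))
    flag = 0 if f < m else 1
    out, borrow = [], 0
    # S3243: subtract chunk by chunk, low order first.
    for i in range(q):
        d = F[i] - flag * M[i] - borrow
        if d < 0 and i < q - 1:
            # Borrow one unit from the next-higher chunk.
            d += 1 << chunk_bits
            borrow = 1
        else:
            borrow = 0
        out.append(d)
    return out, flag
```

When flag is 1 the overall difference f − m is non-negative, so the highest chunk never needs to borrow, and when flag is 0 nothing is subtracted at all.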

Preferably, step S34 comprises the following steps:

S341: extracting the effective value from the intermediate calculation result Z of each thread as a sub-result W;

the sub-result $W_i$ is extracted from the intermediate calculation result $Z_i$ of the thread numbered i as follows:

when 1 ≤ i < q, the sub-result $W_i$ is the first 32×p bits of the intermediate calculation result $Z_i$;

when i = q, the sub-result $W_i$ is the first 32×(p+1) bits of the intermediate calculation result $Z_i$;

S342: concatenating the sub-results $W_1, W_2, \dots, W_q$ to obtain the calculation result w.

The beneficial effects of the invention are as follows: by using the multi-threaded parallel computing capability of the GPU, the Montgomery modular multiplication is optimized and accelerated, calculation efficiency is greatly improved and calculation latency is reduced, which in turn greatly improves the encryption efficiency of encryption algorithms, such as RSA and ECC, that rely on Montgomery modular multiplication.

Drawings

FIG. 1 is a flow chart of an embodiment;

FIG. 2 is a schematic diagram of an illustrative example.

Detailed Description

The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.

Example 1: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 1, the method comprises the following steps:

S1: selecting q threads of the GPU, numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;

the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$; the data stored in the p+2 subspaces built by the thread numbered i are denoted $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ from low order to high order, with 1 ≤ r ≤ p+2, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$; the p+2 subspaces are numbered 1, 2, …, p+2 in sequence, and the data $Z_i(r)$ is stored in the subspace numbered r;

Step S2 comprises the steps of:

dividing the data x equally into q sub-data X, denoted $X_1, X_2, \dots, X_q$ from low order to high order, and sending sub-data $X_i$ to the thread numbered i, where $X_i$ represents bits $32p(i-1)$ to $32pi-1$ of the data x, $x = X_q \,\|\, \dots \,\|\, X_2 \,\|\, X_1$, and $\|$ denotes concatenation;

dividing the modulus m equally into q sub-moduli M, denoted $M_1, M_2, \dots, M_q$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32p(i-1)$ to $32pi-1$ of the modulus m, $m = M_q \,\|\, \dots \,\|\, M_2 \,\|\, M_1$;

dividing the data y equally into p×q sub-data Y, denoted $Y_1, Y_2, \dots, Y_{p \cdot q}$ from low order to high order, and storing them in shared memory, with 1 ≤ j ≤ p×q, where $Y_j$ represents bits $32(j-1)$ to $32j-1$ of the data y, $y = Y_{p \cdot q} \,\|\, \dots \,\|\, Y_2 \,\|\, Y_1$.

step S3 comprises the steps of:

S31: setting the intermediate calculation result Z of each thread to zero (that is, setting every bit of the storage space to zero), and setting j = 1;

Step S32 includes the steps of:

In step S321, the thread numbered i calculates the latest intermediate calculation result from the sub-data $X_i$ it holds, the current intermediate calculation result $Z_i$ and the sub-data $Y_j$, and assigns it to $Z_i$, as follows:

，

；

In step S322, the thread numbered 1 calculates the intermediate parameter u as follows:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

where $m'$ and m are inverse elements of each other.

Because the result is taken modulo $2^{32}$, all bits above the low 32 bits can be discarded directly, so only the product of the low 32 bits of the data, $Z_1(1)$, and $m'$ needs to be calculated and reduced modulo $2^{32}$, which improves calculation efficiency.

Step S323 includes the steps of:

，

S3232: the thread numbered k+1 sends the data $Z_{k+1}(1)$ to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3, …, q−1; that is, each of the threads numbered 2 to q takes the low 32 bits of the intermediate calculation result Z it holds and sends them to the preceding thread;

，

when i=1, the intermediate calculation result Z ₁ Remain unchanged;

；

。

The calculation and the shift are carried out together, which greatly improves calculation efficiency.

Step S324 includes the steps of:

when 1 ≤ i < q, the sub-parameter $F_i$ is the first 32×p bits of the intermediate calculation result $Z_i$;

when i = q, the sub-parameter $F_i$ is the first 32×(p+1) bits of the intermediate calculation result $Z_i$;

S3242: concatenating the sub-parameters $F_1, F_2, \dots, F_q$ into a parameter f and judging whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;

calculating $F_i = F_i + \mathrm{flag} \cdot (-M_i)$, borrowing from $F_{i+1}$ if $F_i < M_i$ (since f ≥ m whenever flag = 1, $F_q \geq M_q$);

Step S34 includes the steps of:

when 1 ≤ i < q, the sub-result $W_i$ is the first 32×p bits of the intermediate calculation result $Z_i$;

when i = q, the sub-result $W_i$ is the first 32×(p+1) bits of the intermediate calculation result $Z_i$;

S342: concatenating the sub-results $W_1, W_2, \dots, W_q$ to obtain the calculation result w.

In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p×q parts and stored in shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel by the above method. In the prior art, Montgomery modular multiplication is performed by a single thread executing the CIOS algorithm on a CPU, so the operation takes a long time and calculation efficiency is low, which reduces encryption efficiency. This scheme adopts an algorithm flow designed specifically for the multiple threads of the GPU, so that the q threads of the GPU, executing the method, obtain the same result as a single thread executing the CIOS algorithm on a CPU. Because the q threads execute the method in parallel and the individual calculation steps are further optimized for parallel execution, calculation efficiency is greatly improved and calculation latency is reduced, which in turn greatly improves the encryption efficiency of encryption algorithms, such as RSA and ECC, that rely on Montgomery modular multiplication.
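The claim that the q-thread flow yields the same result as single-threaded CIOS can be checked with a small sequential simulation in Python. The sketch below keeps one integer per simulated thread and walks through steps S31 to S34; it is an interpretation rather than the GPU kernel itself. It assumes m' = −m⁻¹ mod 2³², performs the comparison and subtraction of step S324 on the reassembled value for brevity, and the name `montmul_chunked` is hypothetical.

```python
MASK = 0xFFFFFFFF


def montmul_chunked(x, y, m, p, q):
    """Simulate the q-thread flow; returns x*y*2^(-32*p*q) mod m for odd m, x, y < m."""
    cb = 32 * p
    cmask = (1 << cb) - 1
    X = [(x >> (cb * i)) & cmask for i in range(q)]   # per-thread shares of x
    M = [(m >> (cb * i)) & cmask for i in range(q)]   # per-thread shares of m
    Y = [(y >> (32 * j)) & MASK for j in range(p * q)]  # shared 32-bit words of y
    m_prime = (-pow(m, -1, 1 << 32)) & MASK            # assumed Montgomery constant
    Z = [0] * q                                        # S31
    for yj in Y:                                       # S32 / S33
        Z = [z + xi * yj for z, xi in zip(Z, X)]       # S321
        u = ((Z[0] & MASK) * m_prime) & MASK           # S322 (thread 1 only)
        Z = [z + mi * u for z, mi in zip(Z, M)]        # S3231
        tmp = [Z[k + 1] & MASK for k in range(q - 1)] + [0]          # S3232
        Z = [(z >> 32) + (t << (cb - 32)) for z, t in zip(Z, tmp)]   # S3233
        for k in range(q - 1):                         # S3234 / S3235
            Z[k + 1] += Z[k] >> cb
            Z[k] &= cmask
        f = sum(z << (cb * i) for i, z in enumerate(Z))  # S3241 / S3242
        if f >= m:                                     # flag = 1
            f -= m                                     # S3243
        # S3244: write the reduced value back into the per-thread chunks.
        Z = [(f >> (cb * i)) & cmask for i in range(q - 1)] + [f >> (cb * (q - 1))]
    return sum(z << (cb * i) for i, z in enumerate(Z))  # S34
```

For example, with p = 4 and q = 3 (384-bit operands), the returned value can be compared against `(x * y * pow(1 << 384, -1, m)) % m` computed directly with Python integers.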

Illustrative example:

In the RSA encryption calculation process, whenever a Montgomery modular multiplication is required, it is performed with the method of this embodiment. Here the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×4×3 bits (p = 4, q = 3), the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 2, the method comprises the following steps:

S1: selecting 3 threads of the GPU, numbered 1, 2 and 3, with 1 ≤ i ≤ 3; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising 6 subspaces of 32 bits each; the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$; the data stored in the 6 subspaces built by the thread numbered i are denoted $Z_i(1), Z_i(2), \dots, Z_i(6)$ from low order to high order, with 1 ≤ r ≤ 6, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$;

S2: dividing the data x equally into 3 sub-data X, denoted $X_1, X_2, X_3$ from low order to high order, and sending sub-data $X_i$ to the thread numbered i, where $X_i$ represents bits $32 \cdot 4 \cdot (i-1)$ to $32 \cdot 4 \cdot i - 1$ of the data x;

dividing the modulus m equally into 3 sub-moduli M, denoted $M_1, M_2, M_3$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32 \cdot 4 \cdot (i-1)$ to $32 \cdot 4 \cdot i - 1$ of the modulus m;

dividing the data y equally into 12 sub-data Y, denoted $Y_1, Y_2, \dots, Y_{12}$ from low order to high order, and storing them in shared memory, with 1 ≤ j ≤ 12, where $Y_j$ represents bits $32(j-1)$ to $32j-1$ of the data y;

S3: the 3 threads perform the Montgomery modular multiplication with the CIOS algorithm, using their sub-data X, their sub-moduli M and the 12 sub-data Y stored in shared memory, to obtain the calculation result w, specifically as follows:

S32: each thread fetches the sub-data $Y_1$ from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, the current intermediate calculation result Z and the sub-data $Y_1$, and assigns it to Z; that is, threads 1, 2 and 3 each carry out steps N1 to N3 to calculate the latest intermediate calculation results $Z_1$, $Z_2$ and $Z_3$ respectively;

thread 1 fetches the data $Z_1(1)$ and calculates the intermediate parameter u using the following formula:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

where $m'$ and m are inverse elements of each other;

each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and the current intermediate calculation result Z, and assigns it to Z; that is, threads 1, 2 and 3 each carry out steps S3231 to S3235 to calculate the latest intermediate calculation results $Z_1$, $Z_2$ and $Z_3$ respectively; in step S3232, thread 2 sends the data $Z_2(1)$ to thread 1 and thread 3 sends the data $Z_3(1)$ to thread 2; in step S3234, thread 1 sends the data $Z_1(5)$ to thread 2 and thread 2 sends the data $Z_2(5)$ to thread 3;

extracting the effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F: the sub-parameter $F_1$ extracted by thread 1 is the first 32×4 bits of the intermediate calculation result $Z_1$, the sub-parameter $F_2$ extracted by thread 2 is the first 32×4 bits of the intermediate calculation result $Z_2$, and the sub-parameter $F_3$ extracted by thread 3 is the first 32×5 bits of the intermediate calculation result $Z_3$;

will sub-parameter F ₁ 、F ₂ 、F ₃ Splicing into a parameter f, judging whether the parameter f is smaller than a modulus m, if f is smaller than m, the value of the parameter flag is 0, and if f is larger than or equal to m, the value of the parameter flag is 1;

threads 1, 2 and 3 reassign the sub-parameters F they hold, in sequence: thread 1 calculates $F_1 = F_1 + \mathrm{flag} \cdot (-M_1)$, borrowing from $F_2$ if $F_1 < M_1$; thread 2 calculates $F_2 = F_2 + \mathrm{flag} \cdot (-M_2)$, borrowing from $F_3$ if $F_2 < M_2$; thread 3 calculates $F_3 = F_3 + \mathrm{flag} \cdot (-M_3)$, where, since f ≥ m whenever flag = 1, $F_3 \geq M_3$;

each thread reassigns the intermediate calculation result Z it holds: thread 1 assigns the value of the sub-parameter $F_1$ to the first 32×4 bits of the intermediate calculation result $Z_1$; thread 2 assigns the value of the sub-parameter $F_2$ to the first 32×4 bits of the intermediate calculation result $Z_2$; thread 3 assigns the value of the sub-parameter $F_3$ to the first 32×5 bits of the intermediate calculation result $Z_3$;

S33: setting j = j + 1 and judging whether j is greater than 12; if so, executing step S34, otherwise jumping to step S32;

S34: extracting the effective value from the intermediate calculation result Z of each thread as a sub-result W: the sub-result $W_1$ extracted by thread 1 is the first 32×4 bits of the intermediate calculation result $Z_1$; the sub-result $W_2$ extracted by thread 2 is the first 32×4 bits of the intermediate calculation result $Z_2$; the sub-result $W_3$ extracted by thread 3 is the first 32×5 bits of the intermediate calculation result $Z_3$;

concatenating the sub-results $W_1$, $W_2$ and $W_3$ to obtain the calculation result w.

Example 2: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:

the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$; the data stored in the p+2 subspaces built by the thread numbered i are denoted $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ from low order to high order, with 1 ≤ r ≤ p+2, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$; the p+2 subspaces are numbered 1, 2, …, p+2 in sequence, and the data $Z_i(r)$ is stored in the subspace numbered r;

Step S2 comprises the steps of:

dividing the modulus m equally into q sub-moduli M, denoted $M_1, M_2, \dots, M_q$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32p(i-1)$ to $32pi-1$ of the modulus m, $m = M_q \,\|\, \dots \,\|\, M_2 \,\|\, M_1$, and $\|$ denotes concatenation;

step S3 comprises the steps of:

Step S32 includes the steps of:

the thread numbered i calculates the latest intermediate calculation result from the sub-data $X_i$ it holds, the current intermediate calculation result $Z_i$ and the sub-data $Y_j$, and assigns it to $Z_i$ according to the formula $Z_i = Z_i + X_i \cdot Y_j$;

S322: the thread numbered 1 fetches the data $Z_1(1)$ and calculates the intermediate parameter u using the following formula:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

where $m'$ and m are inverse elements of each other;

the thread with the number 1 sends the intermediate parameter u to other threads;

the thread numbered i calculates the latest intermediate calculation result from the sub-modulus $M_i$ it holds, the intermediate parameter u and the current intermediate calculation result $Z_i$, and assigns it to $Z_i$ according to the following formulas:

，

；

Step S324 includes the steps of:

the thread numbered i reassigns the sub-parameter $F_i$ it holds as follows:

Step S34 includes the steps of:

The present embodiment is different from embodiment 1 in steps S321, S323.

Claims

1. The Montgomery modular multiplication operation method based on the GPU is characterized in that data x, data y and modulus m used for Montgomery modular multiplication are all positive integers of 32 x p x q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is taken as a multiplicand, and the data y is taken as a multiplier, and the method comprises the following steps:

S1: selecting q threads of the GPU;

S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using their sub-data X, their sub-moduli M and the p×q sub-data Y stored in shared memory, to obtain the calculation result w.

2. The GPU-based Montgomery modular multiplication method according to claim 1, wherein the q threads are numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q;

the step S2 includes the steps of:

3. The method of GPU-based montgomery modular multiplication according to claim 2, wherein the step S1 further comprises the steps of: each thread constructs a storage space for storing an intermediate calculation result Z, wherein the storage space comprises p+2 subspaces, and each subspace has 32 bits;

the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$; the data stored in the p+2 subspaces built by the thread numbered i are denoted $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ from low order to high order, with 1 ≤ r ≤ p+2, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$.

4. A method of performing a montgomery modular multiplication operation based on a GPU as recited in claim 3, wherein said step S3 comprises the steps of:

5. The method of performing Montgomery modular multiplication operations on a GPU according to claim 4, wherein the step S32 comprises the steps of:

6. The GPU-based Montgomery modular multiplication method according to claim 5, wherein in step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data $X_i$ it holds, the current intermediate calculation result $Z_i$ and the sub-data $Y_j$, and assigns it to $Z_i$, as follows:

，

；

N3: concatenating the latest data $Z_i(1), Z_i(2), \dots, Z_i(p+2)$ in sequence to obtain the intermediate calculation result $Z_i$.

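7. The GPU-based Montgomery modular multiplication method according to claim 5, wherein the method by which the thread numbered 1 calculates the intermediate parameter u in step S322 comprises the following steps:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

where $m'$ and m are inverse elements of each other.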

8. The method of claim 5, wherein the step S323 includes the steps of:

，

when i=1, the intermediate calculation result Z ₁ Remain unchanged;

；

。

9. the method according to claim 8, wherein the step S324 includes the steps of:

10. The method of GPU-based montgomery modular multiplication according to claim 9, wherein step S34 comprises the steps of: