CN117785129B

CN117785129B - Montgomery modular multiplication operation method based on GPU

Info

Publication number: CN117785129B
Application number: CN202410200625.6A
Authority: CN
Inventors: 冯黎明; 董建阔; 叶青波; 陈昕; 马煜翔; 王超; 吴凡
Original assignee: Lanxiang Zhilian Hangzhou Technology Co ltd
Current assignee: Lanxiang Zhilian Hangzhou Technology Co ltd
Priority date: 2024-02-23
Filing date: 2024-02-23
Publication date: 2024-05-07
Anticipated expiration: 2044-02-23
Also published as: CN117785129A

Abstract

The invention discloses a GPU-based Montgomery modular multiplication method comprising the following steps: selecting q threads of the GPU; dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing them in a shared memory; and having the q threads perform the Montgomery modular multiplication with the CIOS algorithm according to the sub-data X, the sub-moduli M and the p×q sub-data Y stored in the shared memory, so as to obtain a calculation result w. The method implements Montgomery modular multiplication on the GPU, greatly improving computational efficiency and reducing computation latency.

Description

Montgomery modular multiplication operation method based on GPU

Technical Field

The invention relates to the technical field of computers, in particular to a Montgomery modular multiplication operation method based on a GPU.

Background

In recent years data has grown explosively: its volume and variety have become increasingly complex, and large amounts of valuable customer information, personal privacy records and enterprise operating data are continuously being mined. In this era of data explosion, information security for big data is particularly important. Information security rests on encryption algorithms, and the encryption algorithms in common use today comprise symmetric and asymmetric algorithms. The main asymmetric algorithms are RSA and ECC, both of which need to perform modular multiplication.

Among modular multiplication algorithms, the Montgomery algorithm is efficient and convenient to implement. Existing Montgomery modular multiplication is implemented on a CPU, where a single thread performs the operation. Because the operation is constrained by its formula, it currently takes a long time and has low computational efficiency, which in turn reduces encryption efficiency.

Disclosure of Invention

To solve the above technical problem, the invention provides a GPU-based Montgomery modular multiplication method that implements Montgomery modular multiplication on the GPU, thereby greatly improving computational efficiency and reducing computation latency.

To this end, the invention adopts the following technical scheme:

The invention provides a GPU-based Montgomery modular multiplication method, wherein the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:

S1: q threads of the GPU are selected;

S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing the sub-data Y in a shared memory;

S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm according to the sub-data X, the sub-moduli M and the p×q sub-data Y stored in the shared memory, so as to obtain a calculation result w.

In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p×q parts and stored in the shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel, which greatly improves computational efficiency. CIOS stands for Coarsely Integrated Operand Scanning.
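For orientation, the single-threaded computation that the q threads reproduce can be sketched in Python as follows. This is a minimal reference sketch, not the patented kernel: the function name `mont_mul_cios` is hypothetical, Python integers stand in for the multi-word accumulator, and it assumes that m is odd, that x < m, and that m′ is the negated inverse of m modulo 2^32 (the usual convention when u×m is added to the accumulator).

```python
# Reference sketch of word-serial Montgomery multiplication in the CIOS style,
# using 32-bit words. All names here are illustrative assumptions.
WORD_BITS = 32
WORD = 1 << WORD_BITS

def mont_mul_cios(x: int, y: int, m: int, n_words: int) -> int:
    """Return x * y * R^-1 mod m, with R = 2**(32 * n_words), m odd, x < m."""
    # Assumed convention: m * m_prime ≡ -1 (mod 2^32).
    m_prime = (-pow(m, -1, WORD)) % WORD
    z = 0
    for j in range(n_words):
        y_j = (y >> (WORD_BITS * j)) % WORD   # j-th 32-bit word of y
        z += x * y_j                          # multiply-accumulate
        u = ((z % WORD) * m_prime) % WORD     # reduction factor for this word
        z += u * m                            # low word of z is now zero
        z >>= WORD_BITS                       # exact division by 2^32
        if z >= m:                            # keep the accumulator below m
            z -= m
    return z
```

With p = 4 and q = 3, the operands are 384 bits wide and `n_words` is 12.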

Preferably, the q threads are numbered 1, 2, …, q in sequence, and the thread number i satisfies 1 ≤ i ≤ q;

the step S2 includes the steps of:

dividing the data x equally into q sub-data X, marked X_1, X_2, …, X_q in order from low order to high order, and sending the sub-data X_i to the thread numbered i, wherein the sub-data X_i represents bits 32×p×(i−1) to 32×p×i−1 of the data x;

dividing the modulus m equally into q sub-moduli M, marked M_1, M_2, …, M_q in order from low order to high order, and sending the sub-modulus M_i to the thread numbered i, wherein the sub-modulus M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m;

dividing the data y equally into p×q sub-data Y, marked Y_1, Y_2, …, Y_{p×q} in order from low order to high order, and storing the sub-data Y in a shared memory, wherein the sub-data Y_j represents bits 32×(j−1) to 32×j−1 of the data y and 1 ≤ j ≤ p×q.
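The partitioning in step S2 can be illustrated with a short Python sketch. The helper names `partition_operands` and `splice` are hypothetical; lists are ordered from low order to high order, so list index 0 corresponds to X_1, M_1 or Y_1.

```python
WORD_BITS = 32

def partition_operands(x: int, y: int, m: int, p: int, q: int):
    """Split x and m into q limbs of 32*p bits and y into p*q 32-bit words."""
    limb_bits = WORD_BITS * p
    limb_mask = (1 << limb_bits) - 1
    word_mask = (1 << WORD_BITS) - 1
    # X_i and M_i hold bits 32*p*(i-1) .. 32*p*i - 1 of x and m.
    x_limbs = [(x >> (limb_bits * i)) & limb_mask for i in range(q)]
    m_limbs = [(m >> (limb_bits * i)) & limb_mask for i in range(q)]
    # Y_j holds bits 32*(j-1) .. 32*j - 1 of y (kept in shared memory on a GPU).
    y_words = [(y >> (WORD_BITS * j)) & word_mask for j in range(p * q)]
    return x_limbs, y_words, m_limbs

def splice(parts, part_bits: int) -> int:
    """Concatenate low-to-high parts back into a single integer."""
    return sum(part << (part_bits * k) for k, part in enumerate(parts))
```

Splicing the limbs back together recovers the original operands, which is a convenient sanity check on the bit ranges.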

Preferably, the step S1 further includes the steps of: each thread constructs a storage space for storing an intermediate calculation result Z, wherein the storage space comprises p+2 subspaces, and each subspace has 32 bits;

The intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), …, Z_i(p+2) in sequence, wherein 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i.

Preferably, the step S3 includes the steps of:

S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;

S32: each thread fetches the sub-data Y_j from the shared memory and calculates its intermediate calculation result Z in combination with the sub-data X and the sub-modulus M that it holds;

S33: setting j = j + 1 and judging whether j is greater than p×q; if so, executing step S34; otherwise, returning to step S32;

S34: extracting an effective value from the intermediate calculation result Z calculated by each thread, and splicing the effective values into the calculation result w.

Preferably, the step S32 includes the steps of:

S321: each thread fetches the sub-data Y_j from the shared memory, calculates the latest intermediate calculation result according to the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest result to Z;

S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;

S323: each thread calculates the latest intermediate calculation result according to the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to Z;

S324: extracting an effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into a parameter f; judging whether the parameter f is smaller than the modulus m; if so, setting the parameter flag to 0; otherwise, setting the parameter flag to 1;

each thread then calculates the latest intermediate calculation result according to the parameter flag and the part of the parameter f that it holds, and assigns the latest result to Z.

Preferably, the method by which the thread numbered i, in step S321, calculates the latest intermediate calculation result according to the sub-data X_i, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns the latest result to Z_i, includes the following steps:

N1: the data Z_i(1), Z_i(2), …, Z_i(p+2) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]

wherein C_i(r) represents the carry generated when calculating the latest data Z_i(r), X_i(r) represents bits 32×(r−1) to 32×r−1 of the sub-data X_i, and the two product terms appearing in the formulas of N1 and N2 are the low 32 bits of X_i(r)×Y_j and the high 32 bits of X_i(r−1)×Y_j;

N2: the data Z_i(2), Z_i(3), …, Z_i(p+2) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]

N3: the latest data Z_i(1), Z_i(2), …, Z_i(p+2) are spliced together in sequence to obtain the intermediate calculation result Z_i.

Because X_i and Y_j are large numbers, these two rounds of calculation realize Z_i = Z_i + X_i×Y_j quickly, which greatly improves computational efficiency.
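Since the formulas for N1 and N2 are not available in this text, the following Python sketch shows only one plausible reading of the two rounds: a first pass that adds the low halves of the word products with a rippling carry, and a second pass that adds the high halves one word position higher. The function and helper names are hypothetical, and the split of work between the passes is an assumption; the only property taken from the text is that the result equals Z_i + X_i×Y_j.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1

def to_words(v: int, n: int):
    """Split v into n 32-bit words, low order first."""
    return [(v >> (WORD_BITS * k)) & MASK for k in range(n)]

def from_words(words) -> int:
    """Splice 32-bit words (low order first) into one integer (step N3)."""
    return sum(w << (WORD_BITS * k) for k, w in enumerate(words))

def mac_two_pass(z_words, x_words, y_j: int):
    """Return the words of Z + X * y_j; Z has len(x_words) + 2 words.

    Assumes the sum fits in p + 2 words, so the final carry is zero.
    """
    p = len(x_words)
    z = list(z_words)
    # Assumed pass 1: add the low 32 bits of X(r) * y_j, carrying upward.
    carry = 0
    for r in range(p + 2):
        lo = (x_words[r] * y_j) & MASK if r < p else 0
        t = z[r] + lo + carry
        z[r], carry = t & MASK, t >> WORD_BITS
    # Assumed pass 2: add the high 32 bits of X(r-1) * y_j at word r.
    carry = 0
    for r in range(1, p + 2):
        hi = (x_words[r - 1] * y_j) >> WORD_BITS if r - 1 < p else 0
        t = z[r] + hi + carry
        z[r], carry = t & MASK, t >> WORD_BITS
    return z
```

Splitting each 64-bit product into halves keeps every addition within a 32-bit word plus a one-bit carry, which suits hardware with add-with-carry instructions.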

Preferably, the method for calculating the intermediate parameter u by the thread numbered 1 in the step S322 includes the following steps:

The thread numbered 1 fetches the data Z_1(1) and calculates the intermediate parameter u with the following formula:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

wherein m′ and m are inverse elements of each other.
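A small Python sketch of this step follows. The text states only that m′ and m are inverse elements; the sketch assumes the usual convention for the additive form of Montgomery reduction, namely m×m′ ≡ −1 (mod 2^32), which requires m to be odd. The function names are hypothetical.

```python
WORD = 1 << 32

def mont_word_inverse(m: int) -> int:
    """Return m' with m * m' ≡ -1 (mod 2^32); assumes m is odd."""
    return (-pow(m, -1, WORD)) % WORD

def reduction_factor(z1_low_word: int, m_prime: int) -> int:
    """u = Z_1(1) * m' mod 2^32, computed from the lowest 32-bit word only."""
    return (z1_low_word * m_prime) % WORD
```

With this choice of m′, adding u×m to the accumulator clears its lowest 32-bit word, so the subsequent division by 2^32 is exact. Only the lowest word of the accumulator influences u, which is why thread 1 alone can compute it.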

Preferably, the step S323 includes the steps of:

S3231: each thread reassigns its own intermediate calculation result Z;

the method by which the thread numbered i reassigns the intermediate calculation result Z_i is as follows:

the data Z_i(1), Z_i(2), …, Z_i(p+2) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]

wherein C_i(r) represents the carry generated when calculating the latest data Z_i(r), M_i(r) represents bits 32×(r−1) to 32×r−1 of the sub-modulus M_i, and the two product terms appearing in the formula are the low 32 bits of M_i(r)×u and the high 32 bits of M_i(r−1)×u;

S3232: the thread numbered k+1 sends the data Z_{k+1}(1) to the thread numbered k, which marks the received data as tmp(k), k = 1, 2, 3, …, q−1;

S3233: each thread reassigns its own intermediate calculation result Z;

with tmp(q) = 0, the method by which the thread numbered i reassigns the intermediate calculation result Z_i is as follows:

the data Z_i(1), Z_i(2), …, Z_i(p+1) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]

S3234: the thread numbered k sends the data Z_k(p+1) to the thread numbered k+1, which marks the received data as hmp(k);

S3235: the threads numbered 1, 2, …, q reassign the intermediate calculation results Z they hold in sequence;

when i = 1, the intermediate calculation result Z_1 remains unchanged;

when i = 2, the data Z_i(1), Z_i(2), …, Z_i(p) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]

when 2 < i < q, the data Z_i(1), Z_i(2), …, Z_i(p) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]

when i = q, the data Z_i(1), Z_i(2), …, Z_i(p), Z_i(p+1) are reassigned in sequence according to the following formula:

[formula not reproduced in the source text]
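The word exchange in steps S3232 to S3235 can be pictured with the following Python sketch, in which each list entry plays the role of one thread's accumulator Z_i held as an integer. It is a simplified reading under stated assumptions: the exact assignment formulas are not available here, the function name `shift_and_propagate` is hypothetical, and the overflow above 32×p bits is passed upward in a single step rather than word by word.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def shift_and_propagate(z, p: int):
    """Divide the spliced accumulator by 2^32 and normalise the limbs.

    z[i] is the accumulator of thread i+1 (up to p + 2 words); the lowest
    word of z[0] is assumed to be zero already.
    """
    q = len(z)
    limb_bits = WORD_BITS * p
    # S3232: thread k+1 hands its lowest word to thread k; tmp(q) = 0.
    tmp = [z[k + 1] & WORD_MASK for k in range(q - 1)] + [0]
    # S3233: drop the lowest word and place the received word at word p.
    z = [(z[i] >> WORD_BITS) + (tmp[i] << (limb_bits - WORD_BITS))
         for i in range(q)]
    # S3234/S3235 (assumed form): pass overflow above 32*p bits upward,
    # thread by thread; the last thread keeps its extra word.
    for i in range(q - 1):
        z[i + 1] += z[i] >> limb_bits
        z[i] &= (1 << limb_bits) - 1
    return z
```

After the exchange, the spliced value of all limbs equals the previous spliced value divided by 2^32, and every limb except the last fits in 32×p bits.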

Preferably, the step S324 includes the steps of:

S3241: extracting an effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;

the method of extracting the sub-parameter F_i from the intermediate calculation result Z_i of the thread numbered i is as follows:

when 1 ≤ i < q, the sub-parameter F_i is the first (lowest-order) 32×p bits of the intermediate calculation result Z_i;

when i = q, the sub-parameter F_i is the first (lowest-order) 32×(p+1) bits of the intermediate calculation result Z_i;

S3242: splicing the sub-parameters F_1, F_2, …, F_q into a parameter f and judging whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;

S3243: the threads numbered 1, 2, …, q reassign the sub-parameters F they hold in sequence;

the method of reassigning the sub-parameter F_i held by the thread numbered i is as follows:

calculating F_i = F_i + flag×(−M_i); if F_i < M_i, borrowing from F_{i+1};

S3244: each thread reassigns its own intermediate calculation result Z;

the method of reassigning the intermediate calculation result Z_i held by the thread numbered i is as follows:

when 1 ≤ i < q, the first 32×p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;

when i = q, the first 32×(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
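The conditional subtraction of steps S3242 and S3243 can be sketched in Python as below. The function name `conditional_subtract` is hypothetical, limbs are ordered from low order to high order, and the borrow is modelled as a single unit taken from the next higher limb, which is an assumption about how the borrowing is carried out.

```python
def conditional_subtract(f_limbs, m_limbs, p: int):
    """Compute F - flag * M limb by limb, borrowing from the next limb.

    The last F limb may hold p + 1 words; all other limbs hold p words.
    Returns (flag, new_f_limbs).
    """
    limb_bits = 32 * p
    q = len(f_limbs)
    f = sum(v << (limb_bits * i) for i, v in enumerate(f_limbs))
    m = sum(v << (limb_bits * i) for i, v in enumerate(m_limbs))
    flag = 0 if f < m else 1               # S3242: compare spliced f with m
    out, borrow = [], 0
    for i in range(q):                     # S3243: threads in order 1..q
        d = f_limbs[i] - flag * m_limbs[i] - borrow
        if d < 0 and i < q - 1:
            d += 1 << limb_bits            # borrow one unit from F_{i+1}
            borrow = 1
        else:
            borrow = 0                     # top limb never underflows: f >= m
        out.append(d)
    return flag, out
```

When flag is 0 the limbs are returned unchanged; when flag is 1 the spliced result equals f − m, so the accumulator stays below the modulus.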

Preferably, the step S34 includes the steps of:

S341: extracting an effective value from the intermediate calculation result Z of each thread as a sub-result W;

the method of extracting the sub-result W_i from the intermediate calculation result Z_i of the thread numbered i is as follows:

when 1 ≤ i < q, the sub-result W_i is the first 32×p bits of the intermediate calculation result Z_i;

when i = q, the sub-result W_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;

S342: the sub-results W_1, W_2, …, W_q are spliced into the calculation result w.

The beneficial effects of the invention are as follows: by exploiting the multithreaded parallel computing capability of the GPU, the Montgomery modular multiplication is optimized and accelerated, computational efficiency is greatly improved and computation latency is reduced, which in turn greatly improves the encryption efficiency of algorithms such as RSA and ECC that rely on Montgomery modular multiplication.

Drawings

FIG. 1 is a flow chart of an embodiment;

FIG. 2 is a schematic diagram of the illustrative example.

Detailed Description

The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.

Example 1: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 1, the method comprises the following steps:

S1: q threads of the GPU are selected and numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;

the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), …, Z_i(p+2) in sequence, wherein 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i; the p+2 subspaces are numbered 1, 2, …, p+2 in sequence, and the data Z_i(r) is stored in the subspace numbered r;

Step S2 comprises the steps of:

dividing the data x equally into q sub-data X, marked X_1, X_2, …, X_q in order from low order to high order, and sending the sub-data X_i to the thread numbered i, wherein the sub-data X_i represents bits 32×p×(i−1) to 32×p×i−1 of the data x, so that the data x is the splice (concatenation) of X_1, X_2, …, X_q;

dividing the modulus m equally into q sub-moduli M, marked M_1, M_2, …, M_q in order from low order to high order, and sending the sub-modulus M_i to the thread numbered i, wherein the sub-modulus M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m, so that the modulus m is the splice of M_1, M_2, …, M_q;

dividing the data y equally into p×q sub-data Y, marked Y_1, Y_2, …, Y_{p×q} in order from low order to high order, and storing them in a shared memory, wherein 1 ≤ j ≤ p×q and the sub-data Y_j represents bits 32×(j−1) to 32×j−1 of the data y, so that the data y is the splice of Y_1, Y_2, …, Y_{p×q}.

Step S3 comprises the steps of:

S31: setting the intermediate calculation result Z corresponding to each thread to zero (namely, setting all the positions of the storage space to zero), and setting j=1;

Step S32 includes the steps of:

The method by which the thread numbered i, in step S321, calculates the latest intermediate calculation result according to the sub-data X_i, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns the latest result to Z_i, includes the following steps:

[formulas not reproduced in the source text]

The method by which the thread numbered 1 calculates the intermediate parameter u in step S322 includes the following steps:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

wherein m′ and m are inverse elements of each other.

Because the result is taken modulo 2^32, any data above the low 32 bits can be discarded directly. The value of u is therefore obtained by multiplying only the lowest 32 bits, Z_1(1), by m′ and taking the product modulo 2^32, which improves computational efficiency.

Step S323 includes the steps of:

[formula not reproduced in the source text]

S3232: the thread numbered k+1 sends the data Z_{k+1}(1) to the thread numbered k, which marks the received data as tmp(k), k = 1, 2, 3, …, q−1; that is, each of the threads numbered 2 to q fetches the low 32 bits of the intermediate calculation result Z it holds and sends them to the preceding thread;

[formula not reproduced in the source text]

when i = 1, the intermediate calculation result Z_1 remains unchanged;

[formulas not reproduced in the source text]

Calculation and shifting are carried out at the same time, which greatly improves computational efficiency.

Step S324 includes the steps of:

when 1 ≤ i < q, the sub-parameter F_i is the first 32×p bits of the intermediate calculation result Z_i;

when i = q, the sub-parameter F_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;

S3242: splicing the sub-parameters F_1, F_2, …, F_q into a parameter f and judging whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;

calculating F_i = F_i + flag×(−M_i); if F_i < M_i, borrowing from F_{i+1} (F_q is not less than M_q, because f ≥ m);

Step S34 includes the steps of:

when 1 ≤ i < q, the sub-result W_i is the first 32×p bits of the intermediate calculation result Z_i;

when i = q, the sub-result W_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;

S342: the sub-results W_1, W_2, …, W_q are spliced into the calculation result w.

In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p×q parts and stored in the shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel by the above method. In the prior art, the Montgomery modular multiplication is performed by a single thread executing the CIOS algorithm on a CPU, so the operation takes a long time and computational efficiency is low, which affects encryption efficiency. The present scheme adopts an algorithm flow designed specifically for the multithreading of the GPU, so that the q threads of the GPU executing the method obtain the same result as a single thread executing the CIOS algorithm on a CPU. Because the method is executed by the q threads in parallel and the specific calculation steps are further optimized for parallel execution, computational efficiency is greatly improved and computation latency is reduced, which greatly improves the encryption efficiency of algorithms such as RSA and ECC that rely on Montgomery modular multiplication.

Illustrative example:

When a Montgomery modular multiplication is required during RSA encryption, it is performed with the method of this embodiment, wherein the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×4×3 bits (that is, p = 4 and q = 3), the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 2, the method comprises the following steps:

S1: 3 threads of the GPU are selected and numbered 1, 2 and 3, with 1 ≤ i ≤ 3; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising 6 subspaces of 32 bits each; the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the 6 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), …, Z_i(6) in sequence, wherein 1 ≤ r ≤ 6 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i;

S2: dividing the data x equally into 3 sub-data X, marked X_1, X_2, X_3 in order from low order to high order, and sending the sub-data X_i to the thread numbered i, wherein the sub-data X_i represents bits 32×p×(i−1) to 32×p×i−1 of the data x;

dividing the modulus m equally into 3 sub-moduli M, marked M_1, M_2, M_3 in order from low order to high order, and sending the sub-modulus M_i to the thread numbered i, wherein the sub-modulus M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m;

dividing the data y equally into 12 sub-data Y, marked Y_1, Y_2, …, Y_12 in order from low order to high order, and storing them in a shared memory, wherein 1 ≤ j ≤ 12 and the sub-data Y_j represents bits 32×(j−1) to 32×j−1 of the data y;

S3: the 3 threads perform the Montgomery modular multiplication with the CIOS algorithm according to the sub-data X, the sub-moduli M and the 12 sub-data Y stored in the shared memory, so as to obtain a calculation result w; the specific steps are as follows:

S32: each thread fetches the sub-data Y_1 from the shared memory, calculates the latest intermediate calculation result according to the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_1, and assigns the latest result to Z; that is, thread 1, thread 2 and thread 3 calculate the latest intermediate calculation results Z_1, Z_2 and Z_3 respectively by steps N1 to N3;

thread 1 fetches the data Z_1(1) and calculates the intermediate parameter u with the following formula:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

wherein m′ and m are inverse elements of each other;

each thread calculates the latest intermediate calculation result according to the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to Z; that is, thread 1, thread 2 and thread 3 calculate the latest intermediate calculation results Z_1, Z_2 and Z_3 respectively by steps S3231 to S3235; in step S3232, thread 2 sends the data Z_2(1) to thread 1, and thread 3 sends the data Z_3(1) to thread 2; in step S3234, thread 1 sends the data Z_1(5) to thread 2, and thread 2 sends the data Z_2(5) to thread 3;

an effective value is extracted from the intermediate calculation result Z calculated by each thread as a sub-parameter F: the sub-parameter F_1 extracted by thread 1 is the first 32×4 bits of the intermediate calculation result Z_1, the sub-parameter F_2 extracted by thread 2 is the first 32×4 bits of the intermediate calculation result Z_2, and the sub-parameter F_3 extracted by thread 3 is the first 32×5 bits of the intermediate calculation result Z_3;

the sub-parameters F_1, F_2 and F_3 are spliced into a parameter f, and it is judged whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;

thread 1, thread 2 and thread 3 reassign the sub-parameters F they hold in sequence: thread 1 calculates F_1 = F_1 + flag×(−M_1) and, if F_1 < M_1, borrows from F_2; thread 2 calculates F_2 = F_2 + flag×(−M_2) and, if F_2 < M_2, borrows from F_3; thread 3 calculates F_3 = F_3 + flag×(−M_3), where F_3 is not less than M_3 because f is not less than m;

each thread reassigns its own intermediate calculation result Z: thread 1 assigns the value of the sub-parameter F_1 to the first 32×4 bits of the intermediate calculation result Z_1; thread 2 assigns the value of the sub-parameter F_2 to the first 32×4 bits of the intermediate calculation result Z_2; thread 3 assigns the value of the sub-parameter F_3 to the first 32×5 bits of the intermediate calculation result Z_3;

S33: setting j = j + 1 and judging whether j is greater than 12; if so, executing step S34; otherwise, returning to step S32;

S34: extracting an effective value from the intermediate calculation result Z of each thread as a sub-result W, wherein the sub-result W_1 extracted by thread 1 is the first 32×4 bits of the intermediate calculation result Z_1, the sub-result W_2 extracted by thread 2 is the first 32×4 bits of the intermediate calculation result Z_2, and the sub-result W_3 extracted by thread 3 is the first 32×5 bits of the intermediate calculation result Z_3;

the sub-results W_1, W_2 and W_3 are spliced into the calculation result w.
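The walk-through above can be checked end to end with a small Python simulation in which list entries stand in for the 3 threads. This is a sketch under stated assumptions rather than the patented kernel: the function name `mont_mul_threads` is hypothetical, m is assumed odd with x < m, m′ is taken as the negated inverse of m modulo 2^32, the carry hand-off between threads is done in one step, and the conditional subtraction is applied to the spliced value for brevity.

```python
WORD_BITS = 32
WORD = 1 << WORD_BITS

def mont_mul_threads(x: int, y: int, m: int, p: int, q: int) -> int:
    """Simulate q cooperating threads; returns x*y*R^-1 mod m, R = 2**(32*p*q)."""
    limb_bits = WORD_BITS * p
    limb_mask = (1 << limb_bits) - 1
    X = [(x >> (limb_bits * i)) & limb_mask for i in range(q)]   # per thread
    M = [(m >> (limb_bits * i)) & limb_mask for i in range(q)]   # per thread
    Y = [(y >> (WORD_BITS * j)) % WORD for j in range(p * q)]    # shared memory
    m_prime = (-pow(m, -1, WORD)) % WORD          # assumed sign convention
    Z = [0] * q                                   # S31
    for y_j in Y:                                 # S32 and S33: loop over j
        Z = [Z[i] + X[i] * y_j for i in range(q)]         # S321
        u = ((Z[0] % WORD) * m_prime) % WORD              # S322, thread 1
        Z = [Z[i] + M[i] * u for i in range(q)]           # S3231
        tmp = [Z[k + 1] % WORD for k in range(q - 1)] + [0]   # S3232
        Z = [(Z[i] >> WORD_BITS) + (tmp[i] << (limb_bits - WORD_BITS))
             for i in range(q)]                           # S3233
        for i in range(q - 1):                            # S3234 and S3235
            Z[i + 1] += Z[i] >> limb_bits
            Z[i] &= limb_mask
        f = sum(Z[i] << (limb_bits * i) for i in range(q))    # S324
        if f >= m:
            f -= m
        Z = [(f >> (limb_bits * i)) & limb_mask for i in range(q)]
    return sum(Z[i] << (limb_bits * i) for i in range(q))     # S34
```

For 384-bit operands, calling `mont_mul_threads(x, y, m, 4, 3)` should agree with the direct computation `(x * y * pow(1 << 384, -1, m)) % m`.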

Example 2: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:

Step S2 comprises the steps of:

dividing the modulus m equally into q sub-moduli M, marked M_1, M_2, …, M_q in order from low order to high order, and sending the sub-modulus M_i to the thread numbered i, wherein the sub-modulus M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m, so that the modulus m is the splice of M_1, M_2, …, M_q;

Step S3 comprises the steps of:

Step S32 includes the steps of:

The thread numbered i calculates the latest intermediate calculation result according to the sub-data X_i, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns the latest result to Z_i, by the following formula: Z_i = Z_i + X_i×Y_j;

S322: the thread numbered 1 fetches the data Z_1(1) and calculates the intermediate parameter u with the following formula:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

wherein m′ and m are inverse elements of each other;

the thread numbered 1 sends the intermediate parameter u to the other threads;

the thread numbered i calculates the latest intermediate calculation result according to the sub-modulus M_i, the intermediate parameter u and the current intermediate calculation result Z_i, and assigns the latest result to Z_i, by the following formulas:

[formulas not reproduced in the source text]

Step S324 includes the steps of:

calculating F_i = F_i + flag×(−M_i); if F_i < M_i, borrowing from F_{i+1};

Step S34 includes the steps of:

This embodiment differs from Example 1 in steps S321 and S323.

Claims

1. A GPU-based Montgomery modular multiplication method, characterized in that the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier, the method comprising the following steps:

S1: q threads of the GPU are selected;

S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm according to the sub-data X, the sub-moduli M and the p×q sub-data Y stored in the shared memory, so as to obtain a calculation result w;

the q threads are numbered 1, 2, …, q in sequence, and the thread number i satisfies 1 ≤ i ≤ q;

the step S2 includes the steps of:

dividing the data y equally into p×q sub-data Y, marked Y_1, Y_2, …, Y_{p×q} in order from low order to high order, and storing them in a shared memory, wherein 1 ≤ j ≤ p×q and the sub-data Y_j represents bits 32×(j−1) to 32×j−1 of the data y;

The step S1 further includes the steps of: each thread constructs a storage space for storing an intermediate calculation result Z, wherein the storage space comprises p+2 subspaces, and each subspace has 32 bits;

the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), …, Z_i(p+2) in sequence, wherein 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i;

The step S3 includes the steps of:

S34: extracting an effective value from the intermediate calculation result Z calculated by each thread, and splicing the effective values into the calculation result w;

the step S32 includes the steps of:

Each thread calculates the latest intermediate calculation result according to the parameter flag and the parameter f held by the thread, and assigns the current intermediate calculation result Z as the latest intermediate calculation result;

The step S34 includes the steps of:

2. The method of claim 1, wherein the method by which the thread numbered i, in the step S321, calculates the latest intermediate calculation result according to the sub-data X_i, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns the latest result to Z_i, comprises the following steps:

[formulas not reproduced in the source text]

3. The method by which the thread numbered 1 calculates the intermediate parameter u in the step S322 includes the following steps:

$u = \left(Z_1(1) \cdot m'\right) \bmod 2^{32}$,

wherein m′ and m are inverse elements of each other.

4. The method of claim 1, wherein the step S323 includes the steps of:

[formula not reproduced in the source text]

when i = 1, the intermediate calculation result Z_1 remains unchanged;

[formulas not reproduced in the source text]

5. The method according to claim 4, wherein the step S324 includes the steps of:

calculating F_i = F_i + flag×(−M_i); if F_i < M_i, borrowing from F_{i+1};