CN117785129B - Montgomery modular multiplication operation method based on GPU - Google Patents

Publication number
CN117785129B
Authority
CN
China
Prior art keywords
data
calculation result
sub
thread
intermediate calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410200625.6A
Other languages
Chinese (zh)
Other versions
CN117785129A (en)
Inventor
冯黎明
董建阔
叶青波
陈昕
马煜翔
王超
吴凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanxiang Zhilian Hangzhou Technology Co ltd
Original Assignee
Lanxiang Zhilian Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanxiang Zhilian Hangzhou Technology Co ltd
Priority to CN202410200625.6A
Publication of CN117785129A
Application granted
Publication of CN117785129B

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU-based Montgomery modular multiplication method comprising the following steps: selecting q threads of the GPU; dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing them in shared memory; and having the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X and sub-modulus M each thread holds together with the p×q sub-data Y stored in shared memory, to obtain a calculation result w. The method implements Montgomery modular multiplication on the GPU, improving calculation efficiency and reducing calculation latency.

Description

Montgomery modular multiplication operation method based on GPU
Technical Field
The invention relates to the technical field of computers, in particular to a Montgomery modular multiplication operation method based on a GPU.
Background
In recent years data has grown explosively, and both its volume and its variety have become increasingly complex. Large amounts of valuable customer information, personal privacy records and enterprise operating data are continuously being mined. In this era of data explosion, information security for big data is particularly important. Information security rests on encryption algorithms, and the encryption algorithms in common use today are symmetric and asymmetric. Asymmetric encryption mainly comprises the RSA and ECC algorithms, both of which need to perform modular multiplication.
Among modular multiplication algorithms, Montgomery modular multiplication is efficient and convenient to implement. Existing Montgomery modular multiplication is implemented on the CPU, where a single thread performs the operation. Because the operation is constrained by its formula, it currently takes a long time and has low calculation efficiency, which reduces encryption efficiency.
Disclosure of Invention
To solve the above technical problem, the invention provides a GPU-based Montgomery modular multiplication method that implements Montgomery modular multiplication on the GPU, thereby improving calculation efficiency and reducing calculation latency.
To solve the above problem, the invention adopts the following technical scheme:
The GPU-based Montgomery modular multiplication method of the invention is used for Montgomery modular multiplication in which the data x, the data y and the modulus m are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:
S1: selecting q threads of the GPU;
S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing the sub-data Y in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X and sub-modulus M each thread holds together with the p×q sub-data Y stored in shared memory, to obtain a calculation result w.
In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p×q parts and stored in shared memory, and the q GPU threads carry out the Montgomery modular multiplication in parallel, which improves calculation efficiency. CIOS stands for Coarsely Integrated Operand Scanning.
Preferably, the q threads are numbered 1, 2, ..., q in sequence, with 1 ≤ i ≤ q;
step S2 includes the following steps:
dividing the data x equally into q sub-data X, labeled X_1, X_2, ..., X_q from low order to high order, and sending sub-data X_i to the thread numbered i, where X_i represents bits 32×p×(i−1) to 32×p×i−1 of the data x;
dividing the modulus m equally into q sub-moduli M, labeled M_1, M_2, ..., M_q from low order to high order, and sending sub-modulus M_i to the thread numbered i, where M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m;
dividing the data y equally into p×q sub-data Y, labeled Y_1, Y_2, ..., Y_{p×q} from low order to high order, where Y_j represents bits 32×(j−1) to 32×j−1 of the data y and 1 ≤ j ≤ p×q, and storing the sub-data Y in shared memory.
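The partition in step S2 can be pictured with a minimal Python sketch. It is illustrative only: the helper names `split_words`, `join_words` and `partition_operands` are hypothetical, Python integers stand in for GPU registers and shared memory, and each list is ordered low-order first so that list index i−1 corresponds to thread i (or to sub-data Y_j at index j−1).

```python
WORD_BITS = 32  # width of one sub-data Y_j and of one storage subspace


def split_words(value, count, width=WORD_BITS):
    """Split a non-negative integer into `count` chunks of `width` bits, low-order first."""
    mask = (1 << width) - 1
    return [(value >> (width * k)) & mask for k in range(count)]


def join_words(chunks, width=WORD_BITS):
    """Inverse of split_words: splice low-order-first chunks back into one integer."""
    total = 0
    for k, chunk in enumerate(chunks):
        total |= chunk << (width * k)
    return total


def partition_operands(x, y, m, p, q):
    """Return (X, M, Y) for 32*p*q-bit operands x, y and modulus m."""
    X = split_words(x, q, WORD_BITS * p)  # X[i-1] is sent to thread i
    M = split_words(m, q, WORD_BITS * p)  # M[i-1] is sent to thread i
    Y = split_words(y, p * q, WORD_BITS)  # Y[j-1] is read from shared memory in pass j
    return X, M, Y
```

For p = 4 and q = 3 this yields three 128-bit chunks each for x and m and twelve 32-bit words for y.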
Preferably, step S1 further includes the following step: each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), ..., Z_i(p+2) in sequence, where 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i.
Preferably, step S3 includes the following steps:
S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;
S32: each thread fetches sub-data Y_j from shared memory and calculates its intermediate calculation result Z using the sub-data X and sub-modulus M it holds;
S33: setting j = j+1 and judging whether j is greater than p×q; if so, executing step S34, otherwise returning to step S32;
S34: extracting the effective value from the intermediate calculation result Z calculated by each thread, and splicing the effective values into the calculation result w.
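As a point of reference for the loop S31 to S34, the following is a minimal single-threaded Python sketch of word-serial Montgomery multiplication. It is not the multi-thread flow of the method: it works on whole integers instead of per-thread chunks, the function name `montgomery_multiply` is hypothetical, and it assumes the usual CIOS convention m′ = −m⁻¹ mod 2^32 with an odd modulus m and operands smaller than m. It needs Python 3.8 or later for `pow` with a negative exponent.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def montgomery_multiply(x, y, m, n_words):
    """Return x*y*R^-1 mod m with R = 2**(32*n_words), for odd m and x, y < m."""
    m_prime = (-pow(m, -1, 1 << WORD_BITS)) & MASK  # assumed convention: -m^-1 mod 2^32
    z = 0                                           # S31: Z starts at zero
    for j in range(n_words):                        # S32/S33: one pass per word Y_j
        y_j = (y >> (WORD_BITS * j)) & MASK
        z += x * y_j                                # S321: Z = Z + X * Y_j
        u = ((z & MASK) * m_prime) & MASK           # S322: u from the low word of Z
        z = (z + u * m) >> WORD_BITS                # S323: add u*m, drop the zero low word
        if z >= m:                                  # S324: flag = 1, subtract the modulus
            z -= m
    return z                                        # S34: the spliced result w
```

With n_words = p×q, the q-thread method is stated to produce the same result as this kind of single-threaded CIOS computation.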
Preferably, step S32 includes the following steps:
S321: each thread fetches sub-data Y_j from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest result to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to Z;
S324: the effective values are extracted from the intermediate calculation results Z of the threads and spliced into a parameter f, and it is judged whether f is smaller than the modulus m; if so, the parameter flag is set to 0, otherwise flag is set to 1;
each thread then calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns the latest result to Z.
Preferably, in step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i, as follows:
N1: the data Z_i(1), Z_i(2), ..., Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
where C_i(r) denotes the carry generated when calculating the latest data Z_i(r), X_i(r) denotes bits 32×(r−1) to 32×r−1 of the sub-data X_i, and the two product terms denote the low 32 bits of X_i(r)×Y_j and the high 32 bits of X_i(r−1)×Y_j respectively;
N2: the data Z_i(2), Z_i(3), ..., Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
N3: the latest data Z_i(1), Z_i(2), ..., Z_i(p+2) are spliced together in sequence to obtain the intermediate calculation result Z_i.
Because X_i and Y_j are large numbers, carrying out Z_i = Z_i + X_i×Y_j in these two rounds implements the calculation quickly and improves calculation efficiency.
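One plausible reading of the two rounds N1 and N2, based only on the symbol definitions above, is sketched below in Python. It is an assumption, not the formulas of the method: round 1 adds the low halves of the partial products X_i(r)×Y_j, round 2 adds the high halves one word higher, and each round propagates its own carry. The names `mac_two_rounds` and `_add_words` are hypothetical, and lists are 0-indexed, so `Z[r-1]` plays the role of Z_i(r).

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def _add_words(Z, addend, start):
    """Add `addend` to Z word by word from index `start`, rippling the carry upward."""
    carry = 0
    for r in range(start, len(Z)):
        total = Z[r] + addend[r] + carry
        Z[r] = total & MASK
        carry = total >> WORD_BITS
    if carry:
        raise OverflowError("Z does not fit in its p+2 words")


def mac_two_rounds(Z, X, y_j):
    """Update Z <- Z + X*y_j in place; Z has p+2 words, X has p words, low-order first."""
    p = len(X)
    if len(Z) != p + 2:
        raise ValueError("Z must hold p+2 words")
    products = [x_r * y_j for x_r in X]                      # 64-bit partial products
    lows = [prod & MASK for prod in products] + [0, 0]       # low half lands in word r
    highs = [0] + [prod >> WORD_BITS for prod in products] + [0]  # high half in word r+1
    _add_words(Z, lows, start=0)    # round N1: words 1 .. p+2
    _add_words(Z, highs, start=1)   # round N2: words 2 .. p+2
    return Z                        # N3: the words, read together, are the new Z_i
```

Splitting the products this way keeps every addition within 32-bit words plus a small carry, which is the kind of operation a GPU thread performs natively.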
Preferably, in step S322 the thread numbered 1 calculates the intermediate parameter u as follows:
the thread numbered 1 fetches the data Z_1(1) and calculates the intermediate parameter u with the following formula:
$u = (Z_1(1) \cdot m') \bmod 2^{32}$
where m′ and m are inverse elements of each other.
Preferably, step S323 includes the following steps:
S3231: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
the data Z_i(1), Z_i(2), ..., Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
where C_i(r) denotes the carry generated when calculating the latest data Z_i(r), M_i(r) denotes bits 32×(r−1) to 32×r−1 of the sub-modulus M_i, and the two product terms denote the low 32 bits of M_i(r)×u and the high 32 bits of M_i(r−1)×u respectively;
S3232: the thread numbered k+1 sends the data Z_{k+1}(1) to the thread numbered k, which marks the received data as tmp(k), k = 1, 2, 3, ..., q−1;
S3233: each thread reassigns its own intermediate calculation result Z;
with tmp(q) = 0, the thread numbered i reassigns the intermediate calculation result Z_i as follows:
the data Z_i(1), Z_i(2), ..., Z_i(p+1) are reassigned in sequence, the assignment formula being as follows:
S3234: the thread numbered k sends the data Z_k(p+1) to the thread numbered k+1, which marks the received data as hmp(k);
S3235: the threads numbered 1, 2, ..., q reassign in sequence the intermediate calculation results Z they hold;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
when i = 1, the intermediate calculation result Z_1 remains unchanged;
when i = 2, the data Z_i(1), Z_i(2), ..., Z_i(p) are reassigned in sequence, the assignment formula being:
when 2 < i < q, the data Z_i(1), Z_i(2), ..., Z_i(p) are reassigned in sequence, the assignment formula being:
when i = q, the data Z_i(1), Z_i(2), ..., Z_i(p), Z_i(p+1) are reassigned in sequence, the assignment formula being:
Preferably, step S324 includes the following steps:
S3241: extracting the effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32×p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;
S3242: splicing the sub-parameters F_1, F_2, ..., F_q into a parameter f and judging whether f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, ..., q reassign in sequence the sub-parameters F they hold;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculating F_i = F_i + flag×(−M_i); if F_i < M_i, a borrow is taken from F_{i+1};
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32×p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32×(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
Preferably, step S34 includes the following steps:
S341: extracting the effective value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32×p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;
S342: splicing the sub-results W_1, W_2, ..., W_q into the calculation result w.
The beneficial effects of the invention are as follows: the multi-thread parallel computing capability of the GPU is used to optimize and accelerate Montgomery modular multiplication, which improves calculation efficiency and reduces calculation latency, and thereby improves the efficiency of encryption algorithms such as RSA and ECC that rely on Montgomery modular multiplication.
Drawings
FIG. 1 is a flow chart of the embodiment;
FIG. 2 is a schematic diagram of the illustrative example.
Detailed Description
The technical scheme of the invention is described in further detail below through an embodiment and with reference to the accompanying drawings.
Example 1: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×p×q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 1, the method comprises the following steps:
S1: selecting q threads of the GPU, numbered 1, 2, ..., q in sequence, with 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), ..., Z_i(p+2) in sequence, where 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i; the p+2 subspaces are numbered 1, 2, ..., p+2 in sequence, and the data Z_i(r) is stored in the subspace numbered r;
S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p×q sub-data Y and storing the sub-data Y in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X and sub-modulus M each thread holds together with the p×q sub-data Y stored in shared memory, to obtain a calculation result w.
Step S2 comprises the following steps:
dividing the data x equally into q sub-data X, labeled X_1, X_2, ..., X_q from low order to high order, and sending sub-data X_i to the thread numbered i, where X_i represents bits 32×p×(i−1) to 32×p×i−1 of the data x, so that the data x is the splice (concatenation) of the sub-data X_1, X_2, ..., X_q;
dividing the modulus m equally into q sub-moduli M, labeled M_1, M_2, ..., M_q from low order to high order, and sending sub-modulus M_i to the thread numbered i, where M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m, so that the modulus m is the splice of the sub-moduli M_1, M_2, ..., M_q;
dividing the data y equally into p×q sub-data Y, labeled Y_1, Y_2, ..., Y_{p×q} from low order to high order, and storing them in shared memory, where 1 ≤ j ≤ p×q and Y_j represents bits 32×(j−1) to 32×j−1 of the data y, so that the data y is the splice of the sub-data Y_1, Y_2, ..., Y_{p×q}.
Step S3 comprises the following steps:
S31: setting the intermediate calculation result Z of each thread to zero (that is, setting every bit of the storage space to zero), and setting j = 1;
S32: each thread fetches sub-data Y_j from shared memory and calculates its intermediate calculation result Z using the sub-data X and sub-modulus M it holds;
S33: setting j = j+1 and judging whether j is greater than p×q; if so, executing step S34, otherwise returning to step S32;
S34: extracting the effective value from the intermediate calculation result Z calculated by each thread, and splicing the effective values into the calculation result w.
Step S32 includes the following steps:
S321: each thread fetches sub-data Y_j from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest result to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to Z;
S324: the effective values are extracted from the intermediate calculation results Z of the threads and spliced into a parameter f, and it is judged whether f is smaller than the modulus m; if so, the parameter flag is set to 0, otherwise flag is set to 1;
each thread then calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns the latest result to Z.
In step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i, as follows:
N1: the data Z_i(1), Z_i(2), ..., Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
where C_i(r) denotes the carry generated when calculating the latest data Z_i(r), X_i(r) denotes bits 32×(r−1) to 32×r−1 of the sub-data X_i, and the two product terms denote the low 32 bits of X_i(r)×Y_j and the high 32 bits of X_i(r−1)×Y_j respectively;
N2: the data Z_i(2), Z_i(3), ..., Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
N3: the latest data Z_i(1), Z_i(2), ..., Z_i(p+2) are spliced together in sequence to obtain the intermediate calculation result Z_i.
Because X_i and Y_j are large numbers, carrying out Z_i = Z_i + X_i×Y_j in these two rounds implements the calculation quickly and improves calculation efficiency.
In step S322 the thread numbered 1 calculates the intermediate parameter u as follows:
the thread numbered 1 fetches the data Z_1(1) and calculates the intermediate parameter u with the following formula:
$u = (Z_1(1) \cdot m') \bmod 2^{32}$
where m′ and m are inverse elements of each other.
Because the result is taken modulo 2^32, any bits above the low 32 bits can be discarded directly. The intermediate parameter u is therefore obtained by multiplying only the first 32 bits of the intermediate result by m′ and reducing the product modulo 2^32, which improves calculation efficiency.
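The role of u can be checked numerically with a short Python sketch. It assumes the standard Montgomery convention that the "inverse element" is m′ = −m⁻¹ mod 2^32 (the sign is an assumption, since the text only calls m′ an inverse element) and that m is odd; the helper names `montgomery_constant` and `intermediate_u` are hypothetical. Python 3.8 or later is needed for `pow` with a negative exponent.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def montgomery_constant(m):
    """Return m' = -m^-1 mod 2^32 for an odd modulus m (assumed convention)."""
    if m % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    return (-pow(m, -1, 1 << WORD_BITS)) & MASK


def intermediate_u(z_low_word, m_prime):
    """u = (Z_1(1) * m') mod 2^32; only the low 32-bit word of Z is needed."""
    return (z_low_word * m_prime) & MASK
```

With this choice, adding u×m to the intermediate result makes its low 32 bits zero, so the following step can drop one word without losing information.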
Step S323 includes the following steps:
S3231: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
the data Z_i(1), Z_i(2), ..., Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
where C_i(r) denotes the carry generated when calculating the latest data Z_i(r), M_i(r) denotes bits 32×(r−1) to 32×r−1 of the sub-modulus M_i, and the two product terms denote the low 32 bits of M_i(r)×u and the high 32 bits of M_i(r−1)×u respectively;
S3232: the thread numbered k+1 sends the data Z_{k+1}(1) to the thread numbered k, which marks the received data as tmp(k), k = 1, 2, 3, ..., q−1; that is, each of the threads numbered 2 to q fetches the low 32 bits of the intermediate calculation result Z it holds and sends them to the preceding thread;
S3233: each thread reassigns its own intermediate calculation result Z;
with tmp(q) = 0, the thread numbered i reassigns the intermediate calculation result Z_i as follows:
the data Z_i(1), Z_i(2), ..., Z_i(p+1) are reassigned in sequence, the assignment formula being as follows:
S3234: the thread numbered k sends the data Z_k(p+1) to the thread numbered k+1, which marks the received data as hmp(k);
S3235: the threads numbered 1, 2, ..., q reassign in sequence the intermediate calculation results Z they hold;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
when i = 1, the intermediate calculation result Z_1 remains unchanged;
when i = 2, the data Z_i(1), Z_i(2), ..., Z_i(p) are reassigned in sequence, the assignment formula being:
when 2 < i < q, the data Z_i(1), Z_i(2), ..., Z_i(p) are reassigned in sequence, the assignment formula being:
when i = q, the data Z_i(1), Z_i(2), ..., Z_i(p), Z_i(p+1) are reassigned in sequence, the assignment formula being:
Calculation and shifting are carried out together, which improves calculation efficiency.
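The net effect of S3231 to S3235 can be illustrated with a simplified Python model. It is a sketch under stated assumptions and not the per-word formulas of the method: each thread's Z_i is modeled as one Python integer positioned at bit 32×p×(i−1) of the overall value, the whole excess above 32×p bits is passed upward in one sequential loop (the text sends the single word Z_k(p+1) as hmp(k)), and the function name `reduce_and_shift` is hypothetical.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def reduce_and_shift(A, M, u, p):
    """Return per-thread chunks of (Z + u*m) >> 32; A[i] and M[i] belong to thread i+1."""
    q = len(A)
    chunk_bits = WORD_BITS * p
    chunk_mask = (1 << chunk_bits) - 1
    # S3231: every thread adds u * M_i to its own intermediate result.
    A = [a + u * m_i for a, m_i in zip(A, M)]
    if A[0] & MASK:
        raise ValueError("u does not clear the low word of Z")
    # S3232: thread k+1 hands its low word to thread k (tmp); the last thread gets 0.
    tmp = [A[k + 1] & MASK for k in range(q - 1)] + [0]
    # S3233: shift each chunk right by one word and place tmp in word p.
    A = [(a >> WORD_BITS) + (t << (chunk_bits - WORD_BITS)) for a, t in zip(A, tmp)]
    # S3234/S3235: in thread order, pass what exceeds p words up to the next thread.
    for k in range(q - 1):
        overflow = A[k] >> chunk_bits   # plays the role of hmp(k)
        A[k] &= chunk_mask
        A[k + 1] += overflow
    return A
```

The word that thread k+1 shifts out is exactly the word that thread k needs at its top position, which is why the shift can be done with one neighbor-to-neighbor transfer per thread.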
Step S324 includes the following steps:
S3241: extracting the effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32×p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;
S3242: splicing the sub-parameters F_1, F_2, ..., F_q into a parameter f and judging whether f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, ..., q reassign in sequence the sub-parameters F they hold;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculating F_i = F_i + flag×(−M_i); if F_i < M_i, a borrow is taken from F_{i+1} (F_q is not less than M_q, because f ≥ m);
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32×p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32×(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
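The conditional subtraction of S3242 and S3243 can be sketched in Python as follows. The sketch is illustrative: the function name `conditional_subtract` is hypothetical, each F_i and M_i is modeled as a Python integer (low-order chunk first), and the borrow is carried from thread i to thread i+1 in a plain sequential loop.

```python
WORD_BITS = 32


def conditional_subtract(F, M, p):
    """Return (flag, new F chunks): subtract m from f chunk by chunk when f >= m."""
    chunk_bits = WORD_BITS * p
    # S3242: splice the chunks and compare f with m to obtain flag.
    f = sum(c << (chunk_bits * i) for i, c in enumerate(F))
    m = sum(c << (chunk_bits * i) for i, c in enumerate(M))
    flag = 0 if f < m else 1
    out, borrow = [], 0
    for f_i, m_i in zip(F, M):          # S3243: threads 1 .. q in order
        d = f_i - flag * m_i - borrow
        borrow = 1 if d < 0 else 0
        if borrow:
            d += 1 << chunk_bits        # borrow one unit from the next chunk F_{i+1}
        out.append(d)
    if borrow:
        raise ArithmeticError("top chunk underflowed, which f >= m should rule out")
    return flag, out
```

When flag is 0 every chunk is returned unchanged, so the same code path serves both cases without branching on the comparison inside the loop.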
Step S34 includes the following steps:
S341: extracting the effective value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32×p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32×(p+1) bits of the intermediate calculation result Z_i;
S342: splicing the sub-results W_1, W_2, ..., W_q into the calculation result w.
In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p×q parts and stored in shared memory, and the q GPU threads carry out the Montgomery modular multiplication in parallel by the above method. In the prior art, Montgomery modular multiplication is performed by a single thread executing the CIOS algorithm on the CPU, so the operation takes a long time and calculation efficiency is low, which affects encryption efficiency. This scheme adopts an algorithm flow specially designed for the multithreading of the GPU, so that the q GPU threads executing the method obtain the same result as a single thread executing the CIOS algorithm on the CPU. Because the q threads execute the method in parallel, and the individual calculation steps are further optimized for parallel execution, calculation efficiency is improved and calculation latency is reduced, which in turn improves the efficiency of encryption algorithms such as RSA and ECC that rely on Montgomery modular multiplication.
Example:
In the RSA encryption calculation process, when a Montgomery modular multiplication is required, it is performed with the method of this embodiment. The data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32×4×3 bits (p = 4, q = 3), the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 2, the method comprises the following steps:
S1: selecting 3 threads of the GPU, numbered 1, 2 and 3, with 1 ≤ i ≤ 3; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising 6 subspaces of 32 bits each; the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored from low order to high order in the 6 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), ..., Z_i(6), where 1 ≤ r ≤ 6 and Z_i(r) represents bits 32×(r−1) to 32×r−1 of the intermediate calculation result Z_i;
S2: dividing the data x equally into 3 sub-data X, labeled X_1, X_2, X_3 from low order to high order, and sending sub-data X_i to the thread numbered i, where X_i represents bits 32×p×(i−1) to 32×p×i−1 of the data x;
dividing the modulus m equally into 3 sub-moduli M, labeled M_1, M_2, M_3 from low order to high order, and sending sub-modulus M_i to the thread numbered i, where M_i represents bits 32×p×(i−1) to 32×p×i−1 of the modulus m;
dividing the data y equally into 12 sub-data Y, labeled Y_1, Y_2, ..., Y_12 from low order to high order, and storing them in shared memory, where 1 ≤ j ≤ 12 and Y_j represents bits 32×(j−1) to 32×j−1 of the data y;
S3: the 3 threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X and sub-modulus M each thread holds together with the 12 sub-data Y stored in shared memory, to obtain a calculation result w. The specific steps are as follows:
s31: setting the intermediate calculation result Z corresponding to each thread to zero, and setting j=1;
s32: each thread takes out sub data Y 1 from the shared memory, calculates the latest intermediate calculation result according to the sub data X, the current intermediate calculation result Z and the sub data Y 1 held by the thread, and assigns the current intermediate calculation result Z as the latest intermediate calculation result, namely, the thread 1, the thread 2 and the thread 3 respectively calculate the latest intermediate calculation result Z 1、Z2、Z3 by adopting steps N1 to N3;
thread number 1 fetches data Z 1 (1), and the intermediate parameter u is calculated using the following formula:
Wherein, And m are inverse elements;
Each thread computes the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u, and its current intermediate calculation result Z, and assigns that value to Z; that is, threads 1, 2, and 3 compute the latest intermediate calculation results Z_1, Z_2, Z_3 using steps S3231 to S3235; in step S3232, thread 2 sends data Z_2(1) to thread 1, and thread 3 sends data Z_3(1) to thread 2; in step S3234, thread 1 sends data Z_1(5) to thread 2, and thread 2 sends data Z_2(5) to thread 3;
An effective value is extracted from the intermediate calculation result Z of each thread as a sub-parameter F: sub-parameter F_1 extracted by thread 1 is the first 32*4 bits of Z_1, sub-parameter F_2 extracted by thread 2 is the first 32*4 bits of Z_2, and sub-parameter F_3 extracted by thread 3 is the first 32*5 bits of Z_3;
Sub-parameters F_1, F_2, F_3 are spliced into a parameter f, and f is compared with the modulus m: if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;
Threads 1, 2, and 3 reassign the sub-parameters F they hold, in sequence: thread 1 calculates F_1 = F_1 + flag*(-M_1) and, if F_1 < M_1, borrows from F_2; thread 2 calculates F_2 = F_2 + flag*(-M_2) and, if F_2 < M_2, borrows from F_3; thread 3 calculates F_3 = F_3 + flag*(-M_3); since f is not less than m, F_3 is not less than M_3;
Each thread reassigns its own intermediate calculation result Z: thread 1 assigns the value of sub-parameter F_1 to the first 32*4 bits of Z_1; thread 2 assigns the value of sub-parameter F_2 to the first 32*4 bits of Z_2; thread 3 assigns the value of sub-parameter F_3 to the first 32*5 bits of Z_3;
S33: set j=j+1 and judge whether j is greater than 12; if so, execute step S34, otherwise jump to step S32;
S34: an effective value is extracted from the intermediate calculation result Z of each thread as a sub-result W: sub-result W_1 extracted by thread 1 is the first 32*4 bits of Z_1; sub-result W_2 extracted by thread 2 is the first 32*4 bits of Z_2; sub-result W_3 extracted by thread 3 is the first 32*5 bits of Z_3;
The sub-results W_1, W_2, W_3 are spliced into the calculation result w.
Example 2: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, data y, and modulus m used for the Montgomery modular multiplication are all positive integers of 32*p*q bits, where p is an integer multiple of 2 and q is an integer greater than 2; data x is the multiplicand and data y is the multiplier. The method comprises the following steps:
S1: q threads of the GPU are selected and numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q denoting a thread number; each thread constructs a storage space for its intermediate calculation result Z, comprising p+2 subspaces of 32 bits each;
The intermediate calculation result of the thread numbered i is denoted Z_i. The data stored in the p+2 subspaces of thread i, from low order to high order, are denoted Z_i(1), Z_i(2), …, Z_i(p+2); for 1 ≤ r ≤ p+2, Z_i(r) represents bits 32*(r-1) to 32*r-1 of Z_i. The p+2 subspaces are numbered 1, 2, …, p+2 in sequence, and data Z_i(r) is stored in the subspace numbered r;
S2: divide data x equally into q sub-data X and send them to the q threads, one per thread; divide the modulus m equally into q sub-moduli M and send them to the q threads, one per thread; divide data y equally into p*q sub-data Y and store them in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M, and the p*q sub-data Y stored in shared memory, to obtain the calculation result w.
Step S2 comprises the following steps:
Divide data x equally into q sub-data X, labeled X_1, X_2, …, X_q from low order to high order, and send sub-data X_i to the thread numbered i, where X_i represents bits 32*p*(i-1) to 32*p*i-1 of data x, so that x = X_q ∥ X_(q-1) ∥ … ∥ X_1, where ∥ denotes splicing;
Divide the modulus m equally into q sub-moduli M, labeled M_1, M_2, …, M_q from low order to high order, and send sub-modulus M_i to the thread numbered i, where M_i represents bits 32*p*(i-1) to 32*p*i-1 of the modulus m, so that m = M_q ∥ M_(q-1) ∥ … ∥ M_1;
Divide data y equally into p*q sub-data Y, labeled Y_1, Y_2, …, Y_(p*q) from low order to high order, and store them in shared memory, where 1 ≤ j ≤ p*q and Y_j represents bits 32*(j-1) to 32*j-1 of data y, so that y = Y_(p*q) ∥ … ∥ Y_2 ∥ Y_1.
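The partitioning in step S2 can be illustrated with a short host-side sketch in Python. It is not GPU code; the helper names `split_chunks`, `split_limbs`, and `splice` are hypothetical, and the lists are 0-based, so `X[0]` stands for X_1 and `Y[0]` for Y_1. The sizes p = 4 and q = 3 are those of Example 1.

```python
MASK32 = (1 << 32) - 1

def split_chunks(value, p, q):
    """Split a 32*p*q-bit integer into q chunks of 32*p bits, low order first."""
    width = 32 * p
    mask = (1 << width) - 1
    return [(value >> (width * i)) & mask for i in range(q)]

def split_limbs(value, p, q):
    """Split a 32*p*q-bit integer into p*q limbs of 32 bits, low order first."""
    return [(value >> (32 * j)) & MASK32 for j in range(p * q)]

def splice(parts, width):
    """Concatenate equal-width parts (low order first) back into one integer."""
    result = 0
    for i, part in enumerate(parts):
        result |= part << (width * i)
    return result

p, q = 4, 3                                          # 384-bit operands, 3 threads
x = int.from_bytes(bytes(range(1, 49)), "little")   # arbitrary 384-bit test value
X = split_chunks(x, p, q)   # X[0] is X_1 (for thread 1), X[1] is X_2, ...
Y = split_limbs(x, p, q)    # Y[0] is Y_1, the lowest 32 bits
```

Splicing either list back together with the matching width returns the original integer, which is the property the splice relations in step S2 express.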
Step S3 comprises the following steps:
S31: set the intermediate calculation result Z of each thread to zero, and set j=1;
S32: each thread fetches sub-data Y_j from shared memory and computes its intermediate calculation result Z from Y_j together with the sub-data X and the sub-modulus M it holds;
S33: set j=j+1 and judge whether j is greater than p*q; if so, execute step S34, otherwise jump to step S32;
S34: an effective value is extracted from the intermediate calculation result Z of each thread, and the extracted values are spliced into the calculation result w.
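As a point of reference for the loop S31 to S34, the following Python sketch performs the same word-serial computation on whole integers, without any split across threads. It rests on two assumptions that the text above does not spell out: the constant `m_prime` is taken to be -m^(-1) mod 2^32, and each iteration drops the zero low limb by a 32-bit right shift, as in the textbook CIOS formulation. The function name `montgomery_reference` is hypothetical, and the modulus must be odd.

```python
def montgomery_reference(x, y, m, n_limbs):
    """Return x * y * 2^(-32*n_limbs) mod m for odd m, one 32-bit limb of y at a time."""
    mask = (1 << 32) - 1
    m_prime = (-pow(m, -1, 1 << 32)) & mask    # assumed constant: -m^(-1) mod 2^32
    z = 0                                       # S31: Z = 0
    for j in range(n_limbs):                    # S32/S33: loop over Y_1 .. Y_(p*q)
        y_j = (y >> (32 * j)) & mask
        z += x * y_j                            # S321: Z = Z + x * Y_j
        u = ((z & mask) * m_prime) & mask       # S322: u from the lowest limb of Z
        z = (z + u * m) >> 32                   # S323: add u*m, drop the zero low limb
        if z >= m:                              # S324: flag = 1 when Z >= m
            z -= m
    return z                                    # S34: the calculation result w
```

With x, y < m the value of Z stays below 2m after each shift, so a single conditional subtraction per iteration keeps it reduced. `pow(m, -1, 1 << 32)` needs Python 3.8 or later.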
Step S32 includes the steps of:
S321: each thread fetches sub-data Y_j from shared memory, computes the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z, and sub-data Y_j, and assigns that value to Z;
The thread numbered i computes the latest intermediate calculation result from sub-data X_i, its current intermediate calculation result Z_i, and sub-data Y_j, and assigns that value to Z_i, using the formula: Z_i = Z_i + X_i*Y_j;
S322: the thread numbered 1 fetches data Z_1(1) and calculates the intermediate parameter u using the following formula:
where the constant in the formula and the modulus m are inverse elements of each other;
The thread numbered 1 sends the intermediate parameter u to the other threads;
S323: each thread computes the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u, and its current intermediate calculation result Z, and assigns that value to Z;
The thread numbered i computes the latest intermediate calculation result from sub-modulus M_i, the intermediate parameter u, and its current intermediate calculation result Z_i, and assigns that value to Z_i, according to the following formula:
S324: an effective value is extracted from the intermediate calculation result Z of each thread, and the extracted values are spliced into a parameter f; f is compared with the modulus m: if f < m, the parameter flag is set to 0, otherwise the parameter flag is set to 1;
Each thread computes the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns that value to its current intermediate calculation result Z.
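One pass of steps S321 to S323 can be sketched per thread as below. The sketch models each thread's Z_i, X_i, and M_i as Python integers in lists (index 0 is thread 1) and again assumes the textbook constant `m_prime` = -m^(-1) mod 2^32, which the text above does not state. The names `chunks`, `weighted_sum`, and `one_iteration` are hypothetical, and the inter-thread limb exchange that follows in Example 1 is not modelled.

```python
MASK32 = (1 << 32) - 1

def chunks(value, p, q):
    """Split value into q chunks of 32*p bits, low order first."""
    width = 32 * p
    return [(value >> (width * i)) & ((1 << width) - 1) for i in range(q)]

def weighted_sum(parts, p):
    """Recombine per-thread values; a part may overflow its 32*p-bit slot."""
    return sum(part << (32 * p * i) for i, part in enumerate(parts))

def one_iteration(Z, X, M, y_j, m_prime, p):
    """Apply S321, S322, and S323 once to the per-thread lists Z, X, M."""
    q = len(X)
    Z = [Z[i] + X[i] * y_j for i in range(q)]     # S321: independent in every thread
    u = ((Z[0] & MASK32) * m_prime) & MASK32      # S322: thread 1 only, then broadcast
    Z = [Z[i] + M[i] * u for i in range(q)]       # S323: independent in every thread
    return Z, u
```

Under these assumptions the recombined value equals Z + x*Y_j + u*m and its lowest 32 bits are zero, while each per-thread value stays below 2^(32*(p+2)), which is consistent with the p+2 subspaces reserved in step S1.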
Step S324 includes the following steps:
S3241: an effective value is extracted from the intermediate calculation result Z of each thread as a sub-parameter F;
The sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
When 1 ≤ i < q, sub-parameter F_i is the first 32*p bits of Z_i;
When i = q, sub-parameter F_i is the first 32*(p+1) bits of Z_i;
S3242: sub-parameters F_1, F_2, …, F_q are spliced into a parameter f, and f is compared with the modulus m: if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, …, q reassign the sub-parameters F they hold, in sequence;
The sub-parameter F_i held by the thread numbered i is reassigned as follows:
Calculate F_i = F_i + flag*(-M_i); if F_i < M_i, borrow from F_(i+1);
S3244: each thread reassigns its own intermediate calculation result Z;
The intermediate calculation result Z_i held by the thread numbered i is reassigned as follows:
When 1 ≤ i < q, the first 32*p bits of Z_i are assigned the value of sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
When i = q, the first 32*(p+1) bits of Z_i are assigned the value of sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
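The comparison and borrow chain of steps S3242 and S3243 can be sketched in Python as follows. The function name `conditional_subtract` is hypothetical. One detail is an interpretation rather than something the text states: once a borrow has been taken from F_(i+1), the sketch subtracts it in thread i+1 before testing for the next borrow.

```python
def conditional_subtract(F, M, p):
    """Subtract m from f when f >= m, working chunk by chunk with borrows.

    F and M are per-thread lists, low order first; the last entry of F may be
    one 32-bit limb wider than the others, as in S3241.
    """
    q = len(F)
    width = 32 * p
    f = sum(part << (width * i) for i, part in enumerate(F))   # splice F_1 .. F_q
    m = sum(part << (width * i) for i, part in enumerate(M))   # splice M_1 .. M_q
    flag = 0 if f < m else 1                                   # S3242
    out, borrow = [], 0
    for i in range(q):                                         # S3243: threads 1 .. q in order
        t = F[i] - flag * M[i] - borrow
        if t < 0 and i < q - 1:
            t += 1 << width                                    # borrow one unit from F_(i+1)
            borrow = 1
        else:
            borrow = 0
        out.append(t)
    return out, flag
```

Because flag is 1 only when f ≥ m, the last thread never needs to borrow, which matches the remark in Example 1 that F_3 is not less than M_3.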
Step S34 includes the following steps:
S341: an effective value is extracted from the intermediate calculation result Z of each thread as a sub-result W;
The sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
When 1 ≤ i < q, sub-result W_i is the first 32*p bits of Z_i;
When i = q, sub-result W_i is the first 32*(p+1) bits of Z_i;
S342: the sub-results W_1, W_2, …, W_q are spliced into the calculation result w.
This embodiment differs from Example 1 in steps S321 and S323.
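Steps S341 and S342 amount to masking and shifting, as the Python sketch below shows. It assumes that "the first 32*p bits" means the low-order bits, that is, subspaces 1 to p; this reading is suggested by step S3244, where the remaining high subspaces are cleared, but it is an interpretation. The names `extract_sub_results` and `splice_result` are hypothetical.

```python
def extract_sub_results(Z, p):
    """S341 sketch: keep the low 32*p bits of Z_i for i < q, and 32*(p+1) bits for i = q."""
    q = len(Z)
    W = []
    for i, z_i in enumerate(Z):
        limbs = p + 1 if i == q - 1 else p
        W.append(z_i & ((1 << (32 * limbs)) - 1))
    return W

def splice_result(W, p):
    """S342 sketch: W_i starts at bit 32*p*(i-1) of the calculation result w."""
    return sum(w_i << (32 * p * i) for i, w_i in enumerate(W))
```

For p = 4 and q = 3, the three sub-results occupy bit offsets 0, 128, and 256 of w, and the last one may carry one extra 32-bit limb.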

Claims (5)

1. A GPU-based Montgomery modular multiplication method, characterized in that the data x, data y, and modulus m used for the Montgomery modular multiplication are all positive integers of 32*p*q bits, where p is an integer multiple of 2 and q is an integer greater than 2, data x being the multiplicand and data y being the multiplier, the method comprising the following steps:
S1: q threads of the GPU are selected;
S2: data x is divided equally into q sub-data X, which are sent to the q threads, one per thread; the modulus m is divided equally into q sub-moduli M, which are sent to the q threads, one per thread; data y is divided equally into p*q sub-data Y, which are stored in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M, and the p*q sub-data Y stored in shared memory, to obtain the calculation result w;
The q threads are numbered 1, 2, …, q in sequence, and 1 ≤ i ≤ q;
The step S2 includes the following steps:
Data x is divided equally into q sub-data X, labeled X_1, X_2, …, X_q from low order to high order, and sub-data X_i is sent to the thread numbered i, where X_i represents bits 32*p*(i-1) to 32*p*i-1 of data x;
The modulus m is divided equally into q sub-moduli M, labeled M_1, M_2, …, M_q from low order to high order, and sub-modulus M_i is sent to the thread numbered i, where M_i represents bits 32*p*(i-1) to 32*p*i-1 of the modulus m;
Data y is divided equally into p*q sub-data Y, labeled Y_1, Y_2, …, Y_(p*q) from low order to high order, and stored in shared memory, where 1 ≤ j ≤ p*q and Y_j represents bits 32*(j-1) to 32*j-1 of data y;
The step S1 further includes the following step: each thread constructs a storage space for its intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
The intermediate calculation result of the thread numbered i is denoted Z_i, and the data stored in the p+2 subspaces of thread i, from low order to high order, are denoted Z_i(1), Z_i(2), …, Z_i(p+2), where 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32*(r-1) to 32*r-1 of Z_i;
The step S3 includes the following steps:
S31: the intermediate calculation result Z of each thread is set to zero, and j is set to 1;
S32: each thread fetches sub-data Y_j from shared memory and computes its intermediate calculation result Z from Y_j together with the sub-data X and the sub-modulus M it holds;
S33: j is set to j+1, and it is judged whether j is greater than p*q; if so, step S34 is executed, otherwise the method jumps to step S32;
S34: an effective value is extracted from the intermediate calculation result Z of each thread, and the extracted values are spliced into the calculation result w;
The step S32 includes the following steps:
S321: each thread fetches sub-data Y_j from shared memory, computes the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z, and sub-data Y_j, and assigns that value to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread computes the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u, and its current intermediate calculation result Z, and assigns that value to Z;
S324: an effective value is extracted from the intermediate calculation result Z of each thread, and the extracted values are spliced into a parameter f; f is compared with the modulus m: if f < m, the parameter flag is set to 0, otherwise the parameter flag is set to 1;
Each thread computes the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns that value to its current intermediate calculation result Z;
The step S34 includes the following steps:
S341: an effective value is extracted from the intermediate calculation result Z of each thread as a sub-result W;
The sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
When 1 ≤ i < q, sub-result W_i is the first 32*p bits of Z_i;
When i = q, sub-result W_i is the first 32*(p+1) bits of Z_i;
S342: the sub-results W_1, W_2, …, W_q are spliced into the calculation result w.
2. The method of claim 1, wherein in the step S321 the thread numbered i computes the latest intermediate calculation result from sub-data X_i, its current intermediate calculation result Z_i, and sub-data Y_j, and assigns that value to Z_i, by the following steps:
N1: data Z_i(1), Z_i(2), …, Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
where C_i(r) denotes the carry generated when computing the latest data Z_i(r), X_i(r) denotes bits 32*(r-1) to 32*r-1 of sub-data X_i, and the two product terms of the formula denote the low 32 bits of X_i(r)*Y_j and the high 32 bits of X_i(r-1)*Y_j, respectively;
N2: data Z_i(2), Z_i(3), …, Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
N3: the latest data Z_i(1), Z_i(2), …, Z_i(p+2) are spliced together in sequence to obtain the intermediate calculation result Z_i.
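For illustration only, and not as part of the claim, the limb-level accumulation of steps N1 to N3 might look like the Python sketch below. The assignment formulas themselves are not reproduced in this text, so the sketch is one arrangement that fits the description of their terms: pass N1 adds the low half of X_i(r)*Y_j and the high half of X_i(r-1)*Y_j to each limb and records the carry C_i(r), and pass N2 folds each carry into the next limb up. The function name `mul_accumulate` and the treatment of X_i(0) and X_i(r > p) as zero are assumptions.

```python
MASK32 = (1 << 32) - 1

def mul_accumulate(Z, X, y_j):
    """Add X*y_j into Z limb by limb. Z has p+2 limbs, X has p limbs, low order first."""
    p = len(X)
    Xp = [0] + list(X) + [0, 0]        # Xp[r] is X_i(r); X_i(0) and X_i(r > p) taken as 0
    C = [0] * (p + 3)                  # C[r]: carry out of limb r in pass N1
    Z = [0] + list(Z)                  # shift to 1-based indexing, as in the text
    for r in range(1, p + 3):          # N1: low half of this product, high half of the previous one
        lo = (Xp[r] * y_j) & MASK32
        hi = (Xp[r - 1] * y_j) >> 32
        t = Z[r] + lo + hi
        Z[r], C[r] = t & MASK32, t >> 32
    carry = 0
    for r in range(2, p + 3):          # N2: fold each carry into the next limb up
        t = Z[r] + C[r - 1] + carry
        Z[r], carry = t & MASK32, t >> 32
    return Z[1:]                       # N3: limbs Z_i(1) .. Z_i(p+2), low order first
```

The sketch relies on the sum fitting in p+2 limbs, so that no carry leaves the top limb; this holds when the incoming Z_i occupies at most p+1 limbs.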
3. The method of claim 1, wherein in the step S322 the thread numbered 1 calculates the intermediate parameter u by the following steps:
The thread numbered 1 fetches data Z_1(1) and calculates the intermediate parameter u using the following formula:
where the constant in the formula and the modulus m are inverse elements of each other.
4. The method of claim 1, wherein the step S323 includes the following steps:
S3231: each thread reassigns its own intermediate calculation result Z;
The thread numbered i reassigns the intermediate calculation result Z_i as follows:
Data Z_i(1), Z_i(2), …, Z_i(p+2) are reassigned in sequence, the assignment formula being as follows:
where C_i(r) denotes the carry generated when computing the latest data Z_i(r), M_i(r) denotes bits 32*(r-1) to 32*r-1 of sub-modulus M_i, and the two product terms of the formula denote the low 32 bits of M_i(r)*u and the high 32 bits of M_i(r-1)*u, respectively;
S3232: the thread numbered k+1 sends data Z_(k+1)(1) to the thread numbered k, which records the received data as tmp(k), k = 1, 2, 3, …, q-1;
S3233: each thread reassigns its own intermediate calculation result Z;
With tmp(q) = 0, the thread numbered i reassigns the intermediate calculation result Z_i as follows:
Data Z_i(1), Z_i(2), …, Z_i(p+1) are reassigned in sequence, the assignment formula being as follows:
S3234: the thread numbered k sends data Z_k(p+1) to the thread numbered k+1, which records the received data as hmp(k);
S3235: the threads numbered 1, 2, …, q reassign the intermediate calculation results Z they hold, in sequence;
The thread numbered i reassigns the intermediate calculation result Z_i as follows:
When i = 1, the intermediate calculation result Z_1 remains unchanged;
When i = 2, data Z_i(1), Z_i(2), …, Z_i(p) are reassigned in sequence, the assignment formula being:
When 2 < i < q, data Z_i(1), Z_i(2), …, Z_i(p) are reassigned in sequence, the assignment formula being:
When i = q, data Z_i(1), Z_i(2), …, Z_i(p), Z_i(p+1) are reassigned in sequence, the assignment formula being:
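For illustration only, and not as part of the claim, the neighbour exchange of step S3232 can be pictured as a one-limb right shift of a value spread over q threads: each thread passes its lowest limb down to the thread below it. The Python sketch below models only this tmp(k) exchange on already-normalized limbs, with p limbs per thread. The carry hand-off through hmp(k) in steps S3234 and S3235 and the wider storage of the last thread are not modelled, since the corresponding formulas are not reproduced in this text. The function name `shift_right_one_limb` is hypothetical.

```python
def shift_right_one_limb(chunks_of_limbs):
    """Divide a distributed value by 2^32.

    chunks_of_limbs[k] holds the p limbs of thread k+1, low order first.
    The lowest limb of thread 1 is discarded; it is zero after step S3231.
    """
    q = len(chunks_of_limbs)
    # S3232 analogue: thread k receives the lowest limb of thread k+1 as tmp(k); tmp(q) = 0
    tmp = [chunks_of_limbs[k + 1][0] for k in range(q - 1)] + [0]
    # S3233 analogue: each thread drops its lowest limb and places tmp(k) on top
    return [chunk[1:] + [tmp[k]] for k, chunk in enumerate(chunks_of_limbs)]
```

On a GPU the same exchange would typically be a register-to-register transfer between neighbouring threads; the list comprehension here only mirrors the data movement.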
5. The method of claim 4, wherein the step S324 includes the following steps:
S3241: an effective value is extracted from the intermediate calculation result Z of each thread as a sub-parameter F;
The sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
When 1 ≤ i < q, sub-parameter F_i is the first 32*p bits of Z_i;
When i = q, sub-parameter F_i is the first 32*(p+1) bits of Z_i;
S3242: sub-parameters F_1, F_2, …, F_q are spliced into a parameter f, and f is compared with the modulus m: if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, …, q reassign the sub-parameters F they hold, in sequence;
The sub-parameter F_i held by the thread numbered i is reassigned as follows:
Calculate F_i = F_i + flag*(-M_i); if F_i < M_i, borrow from F_(i+1);
S3244: each thread reassigns its own intermediate calculation result Z;
The intermediate calculation result Z_i held by the thread numbered i is reassigned as follows:
When 1 ≤ i < q, the first 32*p bits of Z_i are assigned the value of sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
When i = q, the first 32*(p+1) bits of Z_i are assigned the value of sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
CN202410200625.6A 2024-02-23 2024-02-23 Montgomery modular multiplication operation method based on GPU Active CN117785129B (en)

Publications (2)

Publication Number Publication Date
CN117785129A CN117785129A (en) 2024-03-29
CN117785129B true CN117785129B (en) 2024-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant