CN117785129A - Montgomery modular multiplication operation method based on GPU - Google Patents
- Publication number
- CN117785129A (application number CN202410200625.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a GPU-based Montgomery modular multiplication method comprising the following steps: selecting q threads of the GPU; dividing the data x equally into q pieces of sub-data X and sending them to the q threads, one per thread; dividing the modulus m equally into q sub-moduli M and sending them to the q threads, one per thread; dividing the data y equally into p*q pieces of sub-data Y and storing them in shared memory; and having the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the p*q pieces of sub-data Y stored in shared memory, to obtain the calculation result w. The method implements Montgomery modular multiplication on the GPU, greatly improving calculation efficiency and reducing calculation latency.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a Montgomery modular multiplication operation method based on a GPU.
Background
In recent years, data volumes have grown explosively and data types have become increasingly complex, while large amounts of valuable customer information, personal privacy records and enterprise operating data are continuously being mined. In this era of data explosion, information security for big data is particularly important. Information security rests on encryption algorithms, and the encryption algorithms in common use today include symmetric and asymmetric encryption algorithms. The main asymmetric encryption algorithms are RSA and ECC, both of which need to perform modular multiplication.
Among modular multiplication algorithms, Montgomery modular multiplication is efficient and convenient to implement. Existing Montgomery modular multiplication is implemented on the CPU, where a single thread performs the operation. Because the operation is constrained by its formula, it currently takes a long time and has low calculation efficiency, which in turn limits encryption efficiency.
Disclosure of Invention
In order to solve the technical problems, the invention provides a Montgomery modular multiplication operation method based on a GPU, which can realize Montgomery modular multiplication operation on the GPU, thereby greatly improving the calculation efficiency and reducing the calculation time delay.
In order to solve the problems, the invention is realized by adopting the following technical scheme:
The invention provides a GPU-based Montgomery modular multiplication method in which the data x, the data y and the modulus m are all positive integers of 32*p*q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:
S1: selecting q threads of the GPU;
S2: dividing the data x equally into q pieces of sub-data X and sending them to the q threads, one per thread; dividing the modulus m equally into q sub-moduli M and sending them to the q threads, one per thread; dividing the data y equally into p*q pieces of sub-data Y and storing them in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the p*q pieces of sub-data Y stored in shared memory, to obtain the calculation result w.
In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p*q parts and stored in shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel, which greatly improves calculation efficiency. CIOS stands for the Coarsely Integrated Operand Scanning algorithm.
Preferably, the q threads are numbered 1, 2, ..., q in sequence, with 1 ≤ i ≤ q;
step S2 includes the following steps:
dividing the data x equally into q pieces of sub-data X, denoted X_1, X_2, ..., X_q from low order to high order, and sending sub-data X_i to the thread numbered i, where X_i represents bits 32*p*(i-1) through 32*p*i - 1 of the data x;
dividing the modulus m equally into q sub-moduli M, denoted M_1, M_2, ..., M_q from low order to high order, and sending sub-modulus M_i to the thread numbered i, where M_i represents bits 32*p*(i-1) through 32*p*i - 1 of the modulus m;
dividing the data y equally into p*q pieces of sub-data Y, denoted Y_1, Y_2, ..., Y_(p*q) from low order to high order, and storing them in shared memory, where 1 ≤ j ≤ p*q and sub-data Y_j represents bits 32*(j-1) through 32*j - 1 of the data y.
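The partitioning in step S2 can be sketched on the host side with ordinary big-integer arithmetic. This is a minimal Python illustration, not the patent's GPU code: the helper names `split_chunks` and `join_chunks`, the sizes p = 4 and q = 3, and the test values are all assumptions chosen for the example.

```python
WORD_BITS = 32  # each sub-data Y_j is one 32-bit word


def split_chunks(value, count, chunk_bits):
    """Split value into `count` chunks of `chunk_bits` bits, lowest chunk first."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (chunk_bits * k)) & mask for k in range(count)]


def join_chunks(chunks, chunk_bits):
    """Inverse of split_chunks: splice the chunks back, lowest chunk first."""
    return sum(c << (chunk_bits * k) for k, c in enumerate(chunks))


p, q = 4, 3  # example sizes, so 32*p*q = 384 bits

# Arbitrary 384-bit test values (48 bytes each); m is forced odd.
x = int.from_bytes(bytes(range(1, 49)), "little")
y = int.from_bytes(bytes(range(101, 149)), "little")
m = int.from_bytes(bytes(range(201, 249)), "little") | 1

X = split_chunks(x, q, WORD_BITS * p)  # X[i-1] is X_i, sent to thread i
M = split_chunks(m, q, WORD_BITS * p)  # M[i-1] is M_i, sent to thread i
Y = split_chunks(y, p * q, WORD_BITS)  # Y[j-1] is Y_j, kept in shared memory
```

Lists here are 0-indexed, so `X[i-1]` corresponds to X_i in the text. Each X_i and M_i carries p words, while each Y_j is a single word that every thread reads in iteration j.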
Preferably, step S1 further includes the following step: each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), ..., Z_i(p+2) from low order to high order, with 1 ≤ r ≤ p+2, where Z_i(r) represents bits 32*(r-1) through 32*r - 1 of the intermediate calculation result Z_i.
Preferably, step S3 includes the following steps:
S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;
S32: each thread fetches sub-data Y_j from shared memory and calculates its intermediate calculation result Z using Y_j together with the sub-data X and the sub-modulus M that it holds;
S33: setting j = j + 1 and judging whether j is greater than p*q; if so, executing step S34, otherwise jumping to step S32;
S34: extracting the effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into the calculation result w.
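The loop S31 to S34 has the shape of a word-serial Montgomery multiplication. The following single-threaded Python sketch models that loop with big integers instead of per-thread words, as a reference against which a parallel version could be checked. It assumes that m is odd, that x and y are smaller than m, and that the constant m' follows the common convention m * m' ≡ -1 (mod 2^32); the text only states that m' and m are inverse elements, so that convention is an assumption, as are the function name and the test values.

```python
W = 32
B = 1 << W


def montgomery_multiply(x, y, m, n_words):
    """Word-serial Montgomery product x*y*2^(-32*n_words) mod m (m odd, x, y < m)."""
    m_prime = (-pow(m, -1, B)) % B          # assumed: m * m_prime = -1 mod 2^32
    z = 0                                   # S31: Z starts at zero
    for j in range(n_words):                # S32/S33: one pass per word Y_j
        y_j = (y >> (W * j)) & (B - 1)
        z += x * y_j                        # S321: Z = Z + X * Y_j
        u = ((z & (B - 1)) * m_prime) % B   # S322: u from the low word of Z
        z = (z + u * m) >> W                # S323: low word becomes zero, shift it out
        if z >= m:                          # S324: conditional subtraction
            z -= m
    return z                                # S34: final result


p, q = 4, 3
n = p * q                                   # 12 words of 32 bits
m = (1 << 383) + 12345                      # arbitrary odd 384-bit modulus
x = 3 ** 200                                # arbitrary operands below m
y = 7 ** 130
z = montgomery_multiply(x, y, m, n)
```

With R = 2^(32*p*q), the value returned is x*y*R^(-1) mod m, which is the usual definition of a Montgomery product. Since Z stays below m after each pass and x is below m, a single subtraction per pass is enough under the stated assumptions.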
Preferably, step S32 includes the following steps:
S321: each thread fetches sub-data Y_j from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest intermediate calculation result to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest intermediate calculation result to Z;
S324: extracting the effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into a parameter f, then judging whether the parameter f is smaller than the modulus m; if so, setting the parameter flag to 0, otherwise setting the parameter flag to 1;
each thread then calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f that it holds, and assigns the latest intermediate calculation result to Z.
Preferably, in step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i it holds, its current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i, as follows:
N1: reassigning the data Z_i(1), Z_i(2), ..., Z_i(p+2) in turn according to the following formula:
,
where C_i(r) represents the carry generated when calculating the latest data Z_i(r), X_i(r) represents bits 32*(r-1) through 32*r - 1 of the sub-data X_i, and the two product terms are the low 32 bits of X_i(r)*Y_j and the high 32 bits of X_i(r-1)*Y_j;
N2: reassigning the data Z_i(2), Z_i(3), ..., Z_i(p+2) in turn according to the following formula:
;
N3: splicing the latest data Z_i(1), Z_i(2), ..., Z_i(p+2) together in order to obtain the intermediate calculation result Z_i.
Because X_i and Y_j are very large numbers, these two rounds of calculation carry out Z_i = Z_i + X_i*Y_j quickly, greatly improving calculation efficiency.
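One way to read the two rounds N1 and N2 is sketched below in Python. In round 1 every word position adds the low half of its own product and the high half of its lower neighbour's product, recording a small carry; in round 2 the recorded carries are folded into the next higher word. This is an assumed reading built from the symbol definitions above: the ripple carry in round 2, the function names, and the requirement that the sum fits in p+2 words are choices made for this illustration.

```python
MASK = 0xFFFFFFFF


def words_to_int(words):
    """Splice 32-bit words, lowest word first, into one integer."""
    return sum(w << (32 * k) for k, w in enumerate(words))


def multiply_accumulate(z_words, x_words, y_word):
    """Return the p+2 words of Z + X*Y_j, assuming the sum fits in p+2 words."""
    n = len(z_words)                                  # p + 2 words
    x_pad = list(x_words) + [0] * (n - len(x_words))  # X has only p words
    sums, carries = [], []
    for r in range(n):                                # round 1 (N1): per-word sums
        lo = (x_pad[r] * y_word) & MASK               # low 32 bits of X(r)*Y_j
        hi = (x_pad[r - 1] * y_word) >> 32 if r > 0 else 0  # high 32 bits of X(r-1)*Y_j
        t = z_words[r] + lo + hi
        sums.append(t & MASK)
        carries.append(t >> 32)                       # carry C(r), at most 2
    out, ripple = [sums[0]], 0
    for r in range(1, n):                             # round 2 (N2): fold carries upward
        t = sums[r] + carries[r - 1] + ripple
        out.append(t & MASK)
        ripple = t >> 32
    assert carries[-1] == 0 and ripple == 0           # holds when the sum fits in n words
    return out


p = 4
z_words = [0xFFFFFFFF] * (p + 1) + [0]                # worst-case words for this sketch
x_words = [0xFFFFFFFF] * p
y_word = 0xFFFFFFFF
result = multiply_accumulate(z_words, x_words, y_word)
```

The point of splitting the work this way is that round 1 has no dependency between word positions, so the p+2 partial sums can be formed independently before the carries are resolved.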
Preferably, in step S322 the thread numbered 1 calculates the intermediate parameter u as follows:
the thread numbered 1 fetches the data Z_1(1) and calculates the intermediate parameter u using the following formula:
$u = (Z_1(1) \cdot m') \bmod 2^{32}$,
where m' and m are inverse elements of each other.
Preferably, step S323 includes the following steps:
S3231: each thread reassigns the intermediate calculation result Z it holds;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
reassigning the data Z_i(1), Z_i(2), ..., Z_i(p+2) in turn according to the following formula:
,
where C_i(r) represents the carry generated when calculating the latest data Z_i(r), M_i(r) represents bits 32*(r-1) through 32*r - 1 of the sub-modulus M_i, and the two product terms are the low 32 bits of M_i(r)*u and the high 32 bits of M_i(r-1)*u;
S3232: the thread numbered k+1 sends the data Z_(k+1)(1) to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3, ..., q-1;
S3233: each thread reassigns the intermediate calculation result Z it holds;
setting tmp(q) = 0, the thread numbered i reassigns the intermediate calculation result Z_i as follows:
reassigning the data Z_i(1), Z_i(2), ..., Z_i(p+1) in turn according to the following formula:
,
S3234: the thread numbered k sends the data Z_k(p+1) to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);
S3235: the threads numbered 1, 2, ..., q reassign the intermediate calculation results Z they hold, in sequence;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
when i = 1, the intermediate calculation result Z_1 remains unchanged;
when i = 2, reassigning the data Z_i(1), Z_i(2), ..., Z_i(p) in turn according to the following formula:
;
when 2 < i < q, reassigning the data Z_i(1), Z_i(2), ..., Z_i(p) in turn according to the following formula:
;
when i = q, reassigning the data Z_i(1), Z_i(2), ..., Z_i(p), Z_i(p+1) in turn according to the following formula:
。
Preferably, step S324 includes the following steps:
S3241: extracting the effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S3242: splicing the sub-parameters F_1, F_2, ..., F_q into a parameter f and judging whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, ..., q reassign the sub-parameters F they hold, in sequence;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculating F_i = F_i + flag*(-M_i); if F_i < M_i, borrowing from F_(i+1);
S3244: each thread reassigns the intermediate calculation result Z it holds;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32*p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32*(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
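The conditional subtraction in steps S3242 and S3243 can be illustrated with a short Python sketch that works chunk by chunk and passes a borrow from each chunk to the next higher one. The function name `conditional_subtract`, the sequential borrow handling and the toy values are assumptions for the illustration. The top chunk F_q is allowed to be wider than the others, matching the 32*(p+1) bits described above.

```python
def conditional_subtract(f_chunks, m_chunks, chunk_bits):
    """Subtract m from f only when f >= m, one chunk at a time (lowest chunk first)."""
    base = 1 << chunk_bits
    f = sum(c << (chunk_bits * k) for k, c in enumerate(f_chunks))
    m = sum(c << (chunk_bits * k) for k, c in enumerate(m_chunks))
    flag = 0 if f < m else 1                 # S3242: compare the spliced value with m
    out, borrow = [], 0
    last = len(f_chunks) - 1
    for i, (f_i, m_i) in enumerate(zip(f_chunks, m_chunks)):
        t = f_i - flag * m_i - borrow        # S3243: F_i + flag * (-M_i)
        borrow = 0
        if t < 0 and i < last:               # borrow from the next higher chunk
            t += base
            borrow = 1
        out.append(t)                        # top chunk is non-negative because f >= m
    return flag, out


CHUNK = 128                                  # 32 * p bits with p = 4
m_chunks = [5, 7, 9]
f_big = [3, 7, 9 + (1 << CHUNK)]             # f >= m, top chunk uses its extra word
f_small = [9, 1, 0]                          # f < m

flag_big, out_big = conditional_subtract(f_big, m_chunks, CHUNK)
flag_small, out_small = conditional_subtract(f_small, m_chunks, CHUNK)
```

When flag is 0 every chunk is returned unchanged, so the same code path serves both cases and no thread needs to branch on the comparison result.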
Preferably, step S34 includes the following steps:
S341: extracting the effective value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S342: splicing the sub-results W_1, W_2, ..., W_q to obtain the calculation result w.
The beneficial effects of the invention are as follows: by exploiting the multithreaded parallel computing capability of the GPU, the Montgomery modular multiplication is optimized and accelerated, calculation efficiency is greatly improved and calculation latency is reduced, which in turn greatly improves the encryption efficiency of algorithms such as RSA and ECC that rely on Montgomery modular multiplication.
Drawings
FIG. 1 is a flow chart of an embodiment;
FIG. 2 is a schematic diagram of an illustrative example.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Example 1: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32*p*q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 1, the method comprises the following steps:
S1: selecting q threads of the GPU, numbered 1, 2, ..., q in sequence, with 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), ..., Z_i(p+2) from low order to high order, with 1 ≤ r ≤ p+2, where Z_i(r) represents bits 32*(r-1) through 32*r - 1 of the intermediate calculation result Z_i; the p+2 subspaces are numbered 1, 2, ..., p+2 in sequence, and the data Z_i(r) is stored in the subspace numbered r;
S2: dividing the data x equally into q pieces of sub-data X and sending them to the q threads, one per thread; dividing the modulus m equally into q sub-moduli M and sending them to the q threads, one per thread; dividing the data y equally into p*q pieces of sub-data Y and storing them in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the p*q pieces of sub-data Y stored in shared memory, to obtain the calculation result w.
Step S2 comprises the following steps:
dividing the data x equally into q pieces of sub-data X, denoted X_1, X_2, ..., X_q from low order to high order, and sending sub-data X_i to the thread numbered i, where X_i represents bits 32*p*(i-1) through 32*p*i - 1 of the data x, so that the data x is the splice of X_1 through X_q with X_1 in the lowest position;
dividing the modulus m equally into q sub-moduli M, denoted M_1, M_2, ..., M_q from low order to high order, and sending sub-modulus M_i to the thread numbered i, where M_i represents bits 32*p*(i-1) through 32*p*i - 1 of the modulus m;
dividing the data y equally into p*q pieces of sub-data Y, denoted Y_1, Y_2, ..., Y_(p*q) from low order to high order, and storing them in shared memory, where 1 ≤ j ≤ p*q and sub-data Y_j represents bits 32*(j-1) through 32*j - 1 of the data y.
Step S3 comprises the following steps:
S31: setting the intermediate calculation result Z of each thread to zero (that is, setting every bit of the storage space to zero), and setting j = 1;
S32: each thread fetches sub-data Y_j from shared memory and calculates its intermediate calculation result Z using Y_j together with the sub-data X and the sub-modulus M that it holds;
S33: setting j = j + 1 and judging whether j is greater than p*q; if so, executing step S34, otherwise jumping to step S32;
S34: extracting the effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into the calculation result w.
Step S32 includes the following steps:
S321: each thread fetches sub-data Y_j from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest intermediate calculation result to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest intermediate calculation result to Z;
S324: extracting the effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into a parameter f, then judging whether the parameter f is smaller than the modulus m; if so, setting the parameter flag to 0, otherwise setting the parameter flag to 1;
each thread then calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f that it holds, and assigns the latest intermediate calculation result to Z.
In step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i it holds, its current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i, as follows:
N1: reassigning the data Z_i(1), Z_i(2), ..., Z_i(p+2) in turn according to the following formula:
,
where C_i(r) represents the carry generated when calculating the latest data Z_i(r), X_i(r) represents bits 32*(r-1) through 32*r - 1 of the sub-data X_i, and the two product terms are the low 32 bits of X_i(r)*Y_j and the high 32 bits of X_i(r-1)*Y_j;
N2: reassigning the data Z_i(2), Z_i(3), ..., Z_i(p+2) in turn according to the following formula:
;
N3: splicing the latest data Z_i(1), Z_i(2), ..., Z_i(p+2) together in order to obtain the intermediate calculation result Z_i.
Because X_i and Y_j are very large numbers, these two rounds of calculation carry out Z_i = Z_i + X_i*Y_j quickly, greatly improving calculation efficiency.
In step S322 the thread numbered 1 calculates the intermediate parameter u as follows:
the thread numbered 1 fetches the data Z_1(1) and calculates the intermediate parameter u using the following formula:
$u = (Z_1(1) \cdot m') \bmod 2^{32}$,
where m' and m are inverse elements of each other.
Taking a value modulo 2^32 discards every bit above the low 32, so only the product of the low 32 bits of the data and m' needs to be calculated modulo 2^32, which improves calculation efficiency.
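The truncation argument can be checked numerically. The Python sketch below compares u computed from only the low word with u computed from the full value, and confirms that adding u*m clears the low word. It assumes the common convention m * m' ≡ -1 (mod 2^32), which goes beyond the text's statement that m' and m are inverse elements; the modulus and the stand-in value for Z are arbitrary.

```python
B = 1 << 32

m = (1 << 383) + 12345                 # arbitrary odd modulus for the check
m_prime = (-pow(m, -1, B)) % B         # assumed convention: m * m_prime = -1 mod 2^32

z = 3 ** 230                           # stand-in for a running intermediate result Z

u_from_low_word = ((z % B) * m_prime) % B    # uses only the low 32 bits of Z
u_from_full_value = (z * m_prime) % B        # uses every bit of Z

low_word_after_reduction = (z + u_from_low_word * m) % B
```

Both ways of computing u agree because the reduction modulo 2^32 can be applied to either factor before multiplying, which is why only thread 1, holding the lowest word, has to do this work.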
Step S323 includes the following steps:
S3231: each thread reassigns the intermediate calculation result Z it holds;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
reassigning the data Z_i(1), Z_i(2), ..., Z_i(p+2) in turn according to the following formula:
,
where C_i(r) represents the carry generated when calculating the latest data Z_i(r), M_i(r) represents bits 32*(r-1) through 32*r - 1 of the sub-modulus M_i, and the two product terms are the low 32 bits of M_i(r)*u and the high 32 bits of M_i(r-1)*u;
S3232: the thread numbered k+1 sends the data Z_(k+1)(1) to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3, ..., q-1; that is, each of the threads numbered 2 to q takes the low 32 bits of the intermediate calculation result Z it holds and sends them to the preceding thread;
S3233: each thread reassigns the intermediate calculation result Z it holds;
setting tmp(q) = 0, the thread numbered i reassigns the intermediate calculation result Z_i as follows:
reassigning the data Z_i(1), Z_i(2), ..., Z_i(p+1) in turn according to the following formula:
,
S3234: the thread numbered k sends the data Z_k(p+1) to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);
S3235: the threads numbered 1, 2, ..., q reassign the intermediate calculation results Z they hold, in sequence;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
when i = 1, the intermediate calculation result Z_1 remains unchanged;
when i = 2, reassigning the data Z_i(1), Z_i(2), ..., Z_i(p) in turn according to the following formula:
;
when 2 < i < q, reassigning the data Z_i(1), Z_i(2), ..., Z_i(p) in turn according to the following formula:
;
when i = q, reassigning the data Z_i(1), Z_i(2), ..., Z_i(p), Z_i(p+1) in turn according to the following formula:
。
Calculation and shifting are carried out at the same time, which greatly improves calculation efficiency.
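The word exchange in step S3232 is what lets a value spread across several threads be shifted right by one 32-bit word. The Python sketch below isolates just that shift, under simplifying assumptions: each thread is modelled as a list of p words, the extra carry words Z_i(p+1) and Z_i(p+2) are ignored, and the function names are made up for the illustration.

```python
def shift_right_one_word(thread_words):
    """thread_words[k] holds the p words of thread k+1, lowest word first."""
    q = len(thread_words)
    # S3232: thread k+1 sends its lowest word to thread k; the last thread receives 0.
    tmp = [thread_words[k + 1][0] for k in range(q - 1)] + [0]
    # Each thread drops its lowest word and places the received word on top.
    return [words[1:] + [tmp[k]] for k, words in enumerate(thread_words)]


def to_int(thread_words):
    """Splice all threads' words, lowest thread and lowest word first."""
    flat = [w for words in thread_words for w in words]
    return sum(w << (32 * k) for k, w in enumerate(flat))


threads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # q = 3 threads, p = 4 words
shifted = shift_right_one_word(threads)
```

Every thread only talks to its immediate neighbour, so the shift costs one exchange regardless of how many threads take part.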
Step S324 includes the following steps:
S3241: extracting the effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S3242: splicing the sub-parameters F_1, F_2, ..., F_q into a parameter f and judging whether the parameter f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, ..., q reassign the sub-parameters F they hold, in sequence;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculating F_i = F_i + flag*(-M_i); if F_i < M_i, borrowing from F_(i+1) (since f ≥ m, F_q ≥ M_q);
S3244: each thread reassigns the intermediate calculation result Z it holds;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32*p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32*(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
Step S34 includes the following steps:
S341: extracting the effective value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S342: splicing the sub-results W_1, W_2, ..., W_q into the calculation result w.
In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p*q parts and stored in shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel by the method above. In the prior art, Montgomery modular multiplication is performed by a single thread executing the CIOS algorithm on the CPU, which takes a long time and has low calculation efficiency, limiting encryption efficiency. The present method uses an algorithm flow designed specifically for GPU multithreading, so that the q GPU threads executing it obtain the same result as a single thread executing the CIOS algorithm on the CPU. Because the q threads execute in parallel and the individual calculation steps are further optimized for parallel execution, calculation efficiency is greatly improved and calculation latency is reduced, which in turn greatly improves the encryption efficiency of algorithms such as RSA and ECC that rely on Montgomery modular multiplication.
Illustration:
In an RSA encryption calculation, whenever a Montgomery modular multiplication is required, it is performed by the method of this embodiment. The data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32*4*3 = 384 bits (p = 4, q = 3), the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 2, the method comprises the following steps:
S1: selecting 3 threads of the GPU, numbered 1, 2 and 3, with 1 ≤ i ≤ 3; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising 6 subspaces of 32 bits each; the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored in the 6 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2), ..., Z_i(6) from low order to high order, with 1 ≤ r ≤ 6, where Z_i(r) represents bits 32*(r-1) through 32*r - 1 of the intermediate calculation result Z_i;
S2: dividing the data x equally into 3 pieces of sub-data X, denoted X_1, X_2, X_3 from low order to high order, and sending sub-data X_i to the thread numbered i, where X_i represents bits 32*p*(i-1) through 32*p*i - 1 of the data x;
dividing the modulus m equally into 3 sub-moduli M, denoted M_1, M_2, M_3 from low order to high order, and sending sub-modulus M_i to the thread numbered i, where M_i represents bits 32*p*(i-1) through 32*p*i - 1 of the modulus m;
dividing the data y equally into 12 pieces of sub-data Y, denoted Y_1, Y_2, ..., Y_12 from low order to high order, and storing them in shared memory, where 1 ≤ j ≤ 12 and sub-data Y_j represents bits 32*(j-1) through 32*j - 1 of the data y;
S3: the 3 threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the 12 pieces of sub-data Y stored in shared memory, to obtain the calculation result w. The specific steps are as follows:
S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;
S32: each thread fetches sub-data Y_j from shared memory (Y_1 in the first pass), calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest intermediate calculation result to Z; that is, threads 1, 2 and 3 each use steps N1 to N3 to calculate the latest intermediate calculation results Z_1, Z_2 and Z_3 respectively;
thread 1 fetches the data Z_1(1) and calculates the intermediate parameter u using the following formula:
$u = (Z_1(1) \cdot m') \bmod 2^{32}$,
where m' and m are inverse elements of each other;
Each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to the current intermediate calculation result Z; that is, threads 1, 2 and 3 each use steps S3231 to S3235 to calculate the latest intermediate calculation results Z_1, Z_2 and Z_3, respectively. In step S3232, thread 2 sends data Z_2(1) to thread 1, and thread 3 sends data Z_3(1) to thread 2; in step S3234, thread 1 sends data Z_1(5) to thread 2, and thread 2 sends data Z_2(5) to thread 3;
An effective value is extracted from the intermediate calculation result Z calculated by each thread as a sub-parameter F: the sub-parameter F_1 extracted by thread 1 is the first 32*4 bits of the intermediate calculation result Z_1, the sub-parameter F_2 extracted by thread 2 is the first 32*4 bits of Z_2, and the sub-parameter F_3 extracted by thread 3 is the first 32*5 bits of Z_3;
The sub-parameters F_1, F_2 and F_3 are spliced into a parameter f, and f is compared with the modulus m: if f < m, the parameter flag is set to 0; if f ≥ m, the parameter flag is set to 1;
Threads 1, 2 and 3 reassign the sub-parameters F they hold: thread 1 calculates F_1 = F_1 + flag*(-M_1), borrowing from F_2 if F_1 < M_1; thread 2 calculates F_2 = F_2 + flag*(-M_2), borrowing from F_3 if F_2 < M_2; thread 3 calculates F_3 = F_3 + flag*(-M_3), where, since f ≥ m, F_3 ≥ M_3;
Each thread reassigns its own intermediate calculation result Z: thread 1 assigns the value of sub-parameter F_1 to the first 32*4 bits of Z_1; thread 2 assigns the value of sub-parameter F_2 to the first 32*4 bits of Z_2; thread 3 assigns the value of sub-parameter F_3 to the first 32*5 bits of Z_3;
S33: Set j = j + 1 and judge whether j is greater than 12; if so, execute step S34, otherwise jump to step S32;
S34: An effective value is extracted from the intermediate calculation result Z of each thread as a sub-result W: the sub-result W_1 extracted by thread 1 is the first 32*4 bits of Z_1; the sub-result W_2 extracted by thread 2 is the first 32*4 bits of Z_2; the sub-result W_3 extracted by thread 3 is the first 32*5 bits of Z_3;
The sub-results W_1, W_2 … W_q are spliced to obtain the calculation result w.
Example 2: In the GPU-based Montgomery modular multiplication operation method of this embodiment, the data x, the data y and the modulus m used for Montgomery modular multiplication are all positive integers of 32*p*q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:
S1: q threads of the GPU are selected and numbered 1, 2 … q in sequence, 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprises p+2 subspaces, and each subspace has 32 bits;
The intermediate calculation result calculated by the thread numbered i is denoted Z_i. The p+2 subspaces constructed by the thread numbered i are numbered 1, 2 … p+2 in sequence, and the data they store are denoted Z_i(1), Z_i(2) … Z_i(p+2) from low order to high order; for 1 ≤ r ≤ p+2, Z_i(r) represents the data in bits 32*(r-1) to 32*r-1 of the intermediate calculation result Z_i and is stored in the subspace numbered r;
S2: The data x is divided equally into q sub-data X, which are sent to the q threads respectively; the modulus m is divided equally into q sub-moduli M, which are sent to the q threads respectively; the data y is divided equally into p*q sub-data Y, which are stored in a shared memory;
S3: The q threads perform the Montgomery modular multiplication operation using the CIOS algorithm according to the sub-data X, the sub-moduli M and the p*q sub-data Y stored in the shared memory, obtaining the calculation result w.
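The storage layout described in step S1 can be pictured with a small sketch. The Python fragment below is illustrative only: a Python integer stands in for one thread's p+2 subspaces, and the helper names `get_word` and `set_word` are hypothetical, not taken from the patent.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def get_word(z: int, r: int) -> int:
    """Return Z(r): bits 32*(r-1) .. 32*r-1 of z, with r counted from 1."""
    return (z >> (WORD_BITS * (r - 1))) & WORD_MASK


def set_word(z: int, r: int, value: int) -> int:
    """Return z with the subspace numbered r overwritten by a 32-bit value."""
    shift = WORD_BITS * (r - 1)
    cleared = z & ~(WORD_MASK << shift)   # zero out subspace r
    return cleared | ((value & WORD_MASK) << shift)
```

With p = 4, as in Example 1, each thread would hold six such subspaces, so r ranges from 1 to 6.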
Step S2 comprises the following steps:
The data x is divided equally into q sub-data X, denoted X_1, X_2 … X_q from low order to high order; sub-data X_i is sent to the thread numbered i and represents bits 32*p*(i-1) to 32*p*i-1 of the data x, so that x = X_q || … || X_2 || X_1, where || denotes splicing;
The modulus m is divided equally into q sub-moduli M, denoted M_1, M_2 … M_q from low order to high order; sub-modulus M_i is sent to the thread numbered i and represents bits 32*p*(i-1) to 32*p*i-1 of the modulus m, so that m = M_q || … || M_2 || M_1;
The data y is divided equally into p*q sub-data Y, denoted Y_1, Y_2 … Y_(p*q) from low order to high order and stored in the shared memory, where 1 ≤ j ≤ p*q and sub-data Y_j represents bits 32*(j-1) to 32*j-1 of the data y, so that y = Y_(p*q) || … || Y_2 || Y_1.
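The partitioning in step S2 can be sketched as follows. This is a minimal host-side illustration in Python, assuming the operands are available as ordinary integers; the function names `split_chunks` and `partition_operands` are hypothetical, and the returned lists use 0-based indices, so `X[0]` corresponds to X_1.

```python
def split_chunks(value: int, chunk_bits: int, count: int) -> list:
    """Split value into `count` chunks of `chunk_bits` bits, low order first."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (chunk_bits * k)) & mask for k in range(count)]


def partition_operands(x: int, y: int, m: int, p: int, q: int):
    """Return (X, M, Y) for operands of 32*p*q bits."""
    X = split_chunks(x, 32 * p, q)      # one 32*p-bit chunk per thread
    M = split_chunks(m, 32 * p, q)      # one 32*p-bit chunk per thread
    Y = split_chunks(y, 32, p * q)      # 32-bit words kept in shared memory
    return X, M, Y
```

For the parameters of Example 1 (p = 4, q = 3), this yields three 128-bit chunks for x and m and twelve 32-bit words for y.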
Step S3 comprises the following steps:
S31: The intermediate calculation result Z of each thread is set to zero, and j is set to 1;
S32: Each thread fetches sub-data Y_j from the shared memory and calculates the corresponding intermediate calculation result Z in combination with the sub-data X and the sub-modulus M it holds;
S33: Set j = j + 1 and judge whether j is greater than p*q; if so, execute step S34, otherwise jump to step S32;
S34: An effective value is extracted from the intermediate calculation result Z calculated by each thread, and the effective values are spliced into the calculation result w.
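As a point of comparison for the loop S31 to S34, the following Python sketch runs the same word-by-word iteration in a single thread on whole integers. It is a reference model under stated assumptions, not the patented multi-thread procedure: it assumes m is odd and x, y < m, it takes the reduction constant as m' = -m^(-1) mod 2^32 (the usual CIOS convention), and the name `montgomery_cios_reference` is hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def montgomery_cios_reference(x: int, y: int, m: int, n_words: int) -> int:
    """Return x*y*R^-1 mod m with R = 2**(32*n_words), one word of y per round."""
    # Assumed constant: m' = -m^-1 mod 2**32 (m must be odd for this to exist).
    m_prime = (-pow(m, -1, 1 << WORD_BITS)) & WORD_MASK
    z = 0                                          # S31: Z = 0
    for j in range(n_words):                       # S33: one round per word of y
        y_j = (y >> (WORD_BITS * j)) & WORD_MASK   # S32: fetch Y_j
        z += x * y_j                               # multiply-accumulate
        u = ((z & WORD_MASK) * m_prime) & WORD_MASK
        z = (z + u * m) >> WORD_BITS               # low word is zero, shift is exact
        if z >= m:                                 # conditional subtraction per round
            z -= m
    return z                                       # S34: result w
```

Here `n_words` plays the role of p*q. A multi-thread implementation can be checked against this model by comparing final results for the same x, y and m.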
Step S32 comprises the following steps:
S321: Each thread fetches sub-data Y_j from the shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest result to the current intermediate calculation result Z;
The thread numbered i calculates the latest intermediate calculation result from the sub-data X_i it holds, its current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i using the formula: Z_i = Z_i + X_i * Y_j;
S322: The thread numbered 1 fetches data Z_1(1) and calculates the intermediate parameter u using the following formula:
u = (Z_1(1) * m') mod 2^32,
wherein m' and m are inverse elements;
The thread numbered 1 sends the intermediate parameter u to the other threads;
S323: Each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to the current intermediate calculation result Z;
The thread numbered i calculates the latest intermediate calculation result from the sub-modulus M_i it holds, the intermediate parameter u and its current intermediate calculation result Z_i, and assigns it to Z_i using the following formulas:
,
;
S324: An effective value is extracted from the intermediate calculation result Z calculated by each thread and the effective values are spliced into a parameter f; f is compared with the modulus m: if f is smaller than m, the parameter flag is set to 0, otherwise the parameter flag is set to 1;
Each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns the latest result to the current intermediate calculation result Z.
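Steps S321 to S323 can be illustrated with a short Python sketch in which a list holds one value per thread. Several points are assumptions rather than statements from the patent: the reduction constant is taken as m' = -m^(-1) mod 2^32 (the standard CIOS choice), the S323 update is modelled only as the addition Z_i + M_i*u, and the subsequent one-word shift with data exchange between neighbouring threads is left out. The names `split_chunks`, `combine` and `multiply_and_reduce_step` are hypothetical.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def split_chunks(value, chunk_bits, count):
    """Split value into `count` chunks of `chunk_bits` bits, low order first."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (chunk_bits * k)) & mask for k in range(count)]


def combine(chunks, p):
    """Recombine per-thread values; thread index i (0-based) has weight 2**(32*p*i)."""
    return sum(c << (WORD_BITS * p * i) for i, c in enumerate(chunks))


def multiply_and_reduce_step(Z, X, M, y_j, m_prime):
    """One round of S321..S323 (addition part only) on per-thread lists."""
    # S321: every thread accumulates its own partial product.
    Z = [z_i + x_i * y_j for z_i, x_i in zip(Z, X)]
    # S322: only the lowest word of thread 1 is needed to derive u.
    u = ((Z[0] & WORD_MASK) * m_prime) & WORD_MASK
    # S323 (assumed form): every thread adds its share of u*m.
    Z = [z_i + m_i * u for z_i, m_i in zip(Z, M)]
    return Z, u
```

After this round the lowest word held by thread 1 is zero, which is what makes the following one-word shift exact; the per-thread values are also allowed to overflow 32*p bits, which is why each thread reserves p+2 subspaces.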
Step S324 comprises the following steps:
S3241: An effective value is extracted from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
The sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S3242: The sub-parameters F_1, F_2 … F_q are spliced into a parameter f, and f is compared with the modulus m: if f < m, the parameter flag is set to 0; if f ≥ m, the parameter flag is set to 1;
S3243: The threads numbered 1, 2 … q reassign the sub-parameters F they hold in sequence;
The thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculate F_i = F_i + flag*(-M_i); if F_i < M_i, borrow from F_(i+1);
S3244: Each thread reassigns its own intermediate calculation result Z;
The thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32*p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32*(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
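The conditional subtraction in steps S3242 and S3243 amounts to a chunk-wise subtraction with borrow propagation from thread 1 upwards. The Python sketch below illustrates this under a few assumptions: the lists `F` and `M` hold one value per thread (0-based), every chunk of `F` except the last is already below 2^(32*p), and the function name `conditional_subtract` is hypothetical.

```python
def conditional_subtract(F, M, p):
    """Subtract m from f chunk by chunk when f >= m; return (new_F, flag)."""
    chunk = 1 << (32 * p)
    q = len(F)
    # S3242: splice the chunks and compare with the modulus.
    f = sum(c << (32 * p * i) for i, c in enumerate(F))
    m = sum(c << (32 * p * i) for i, c in enumerate(M))
    flag = 1 if f >= m else 0
    # S3243: threads 1..q update F_i in sequence, passing a borrow upwards.
    out, borrow = [], 0
    for i in range(q):
        d = F[i] - flag * M[i] - borrow
        if i < q - 1 and d < 0:
            d += chunk        # borrow one unit from the next thread's chunk
            borrow = 1
        else:
            borrow = 0        # the top chunk never goes negative when f >= m
        out.append(d)
    return out, flag
```

When flag is 0 the chunks are returned unchanged, so every thread can execute the same instructions regardless of the comparison result.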
Step S34 comprises the following steps:
S341: An effective value is extracted from the intermediate calculation result Z of each thread as a sub-result W;
The sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S342: The sub-results W_1, W_2 … W_q are spliced to obtain the calculation result w.
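The extraction and splicing of step S34 can be sketched as below. The fragment is illustrative: it assumes that "the first bits" of Z_i are its low-order bits, consistent with the low-to-high numbering of the subspaces, and the name `extract_and_splice` is hypothetical.

```python
def extract_and_splice(Z, p):
    """Extract W_i from each per-thread value Z_i and splice them into w."""
    q = len(Z)
    w = 0
    for i, z_i in enumerate(Z):
        # Threads 1..q-1 contribute 32*p bits; thread q contributes 32*(p+1) bits.
        width = 32 * (p + 1) if i == q - 1 else 32 * p
        w_i = z_i & ((1 << width) - 1)
        w |= w_i << (32 * p * i)      # thread i (0-based) has weight 2**(32*p*i)
    return w
```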
This embodiment differs from Example 1 in steps S321 and S323.
Claims (10)
1. A GPU-based Montgomery modular multiplication operation method, characterized in that the data x, the data y and the modulus m used for Montgomery modular multiplication are all positive integers of 32*p*q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier, the method comprising the following steps:
S1: q threads of the GPU are selected;
S2: the data x is divided equally into q sub-data X, which are sent to the q threads respectively; the modulus m is divided equally into q sub-moduli M, which are sent to the q threads respectively; the data y is divided equally into p*q sub-data Y, which are stored in a shared memory;
S3: the q threads perform the Montgomery modular multiplication operation using the CIOS algorithm according to the sub-data X, the sub-moduli M and the p*q sub-data Y stored in the shared memory, obtaining the calculation result w.
2. The GPU-based Montgomery modular multiplication operation method according to claim 1, wherein the q threads are numbered 1, 2 … q in sequence, 1 ≤ i ≤ q;
the step S2 comprises the following steps:
the data x is divided equally into q sub-data X, denoted X_1, X_2 … X_q from low order to high order; sub-data X_i is sent to the thread numbered i and represents bits 32*p*(i-1) to 32*p*i-1 of the data x;
the modulus m is divided equally into q sub-moduli M, denoted M_1, M_2 … M_q from low order to high order; sub-modulus M_i is sent to the thread numbered i and represents bits 32*p*(i-1) to 32*p*i-1 of the modulus m;
the data y is divided equally into p*q sub-data Y, denoted Y_1, Y_2 … Y_(p*q) from low order to high order and stored in the shared memory, where 1 ≤ j ≤ p*q and sub-data Y_j represents bits 32*(j-1) to 32*j-1 of the data y.
3. The GPU-based Montgomery modular multiplication operation method according to claim 2, wherein the step S1 further comprises the following step: each thread constructs a storage space for storing an intermediate calculation result Z, wherein the storage space comprises p+2 subspaces and each subspace has 32 bits;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i; the data stored in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2) … Z_i(p+2) from low order to high order, where 1 ≤ r ≤ p+2 and Z_i(r) represents the data in bits 32*(r-1) to 32*r-1 of the intermediate calculation result Z_i.
4. The GPU-based Montgomery modular multiplication operation method according to claim 3, wherein the step S3 comprises the following steps:
S31: the intermediate calculation result Z of each thread is set to zero, and j is set to 1;
S32: each thread fetches sub-data Y_j from the shared memory and calculates the corresponding intermediate calculation result Z in combination with the sub-data X and the sub-modulus M it holds;
S33: set j = j + 1 and judge whether j is greater than p*q; if so, execute step S34, otherwise jump to step S32;
S34: an effective value is extracted from the intermediate calculation result Z calculated by each thread, and the effective values are spliced into the calculation result w.
5. The GPU-based Montgomery modular multiplication operation method according to claim 4, wherein the step S32 comprises the following steps:
S321: each thread fetches sub-data Y_j from the shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns the latest result to the current intermediate calculation result Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends the intermediate parameter u to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest result to the current intermediate calculation result Z;
S324: an effective value is extracted from the intermediate calculation result Z calculated by each thread and the effective values are spliced into a parameter f; f is compared with the modulus m: if f is smaller than m, the parameter flag is set to 0, otherwise the parameter flag is set to 1;
each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns the latest result to the current intermediate calculation result Z.
6. The GPU-based Montgomery modular multiplication operation method according to claim 5, wherein in the step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i it holds, its current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i by the following steps:
N1: the data Z_i(1), Z_i(2) … Z_i(p+2) are reassigned in sequence using the following formula:
,
wherein C_i(r) represents the carry generated when calculating the latest data Z_i(r), X_i(r) represents the data in bits 32*(r-1) to 32*r-1 of the sub-data X_i, and the formula uses the low 32 bits of X_i(r)*Y_j and the high 32 bits of X_i(r-1)*Y_j;
N2: the data Z_i(2), Z_i(3) … Z_i(p+2) are reassigned in sequence using the following formula:
;
N3: the latest data Z_i(1), Z_i(2) … Z_i(p+2) are spliced together in sequence to obtain the intermediate calculation result Z_i.
7. The GPU-based Montgomery modular multiplication operation method according to claim 5, wherein the thread numbered 1 calculates the intermediate parameter u in the step S322 as follows:
the thread numbered 1 fetches data Z_1(1) and calculates the intermediate parameter u using the following formula:
u = (Z_1(1) * m') mod 2^32,
wherein m' is the inverse element of m.
8. The GPU-based Montgomery modular multiplication operation method according to claim 5, wherein the step S323 comprises the following steps:
S3231: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
the data Z_i(1), Z_i(2) … Z_i(p+2) are reassigned in sequence using the following formula:
,
wherein C_i(r) represents the carry generated when calculating the latest data Z_i(r), M_i(r) represents the data in bits 32*(r-1) to 32*r-1 of the sub-modulus M_i, and the formula uses the low 32 bits of M_i(r)*u and the high 32 bits of M_i(r-1)*u;
S3232: the thread numbered k+1 sends data Z_(k+1)(1) to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3 … q-1;
S3233: each thread reassigns its own intermediate calculation result Z;
tmp(q) is set to 0, and the thread numbered i reassigns the intermediate calculation result Z_i as follows:
the data Z_i(1), Z_i(2) … Z_i(p+1) are reassigned in sequence using the following formula:
,
S3234: the thread numbered k sends data Z_k(p+1) to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);
S3235: the threads numbered 1, 2 … q reassign the intermediate calculation results Z they hold in sequence;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
when i = 1, the intermediate calculation result Z_1 remains unchanged;
when i = 2, the data Z_i(1), Z_i(2) … Z_i(p) are reassigned in sequence using the following formula:
;
when 2 < i < q, the data Z_i(1), Z_i(2) … Z_i(p) are reassigned in sequence using the following formula:
;
when i = q, the data Z_i(1), Z_i(2) … Z_i(p), Z_i(p+1) are reassigned in sequence using the following formula:
。
9. The GPU-based Montgomery modular multiplication operation method according to claim 8, wherein the step S324 comprises the following steps:
S3241: an effective value is extracted from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S3242: the sub-parameters F_1, F_2 … F_q are spliced into a parameter f, and f is compared with the modulus m: if f < m, the parameter flag is set to 0; if f ≥ m, the parameter flag is set to 1;
S3243: the threads numbered 1, 2 … q reassign the sub-parameters F they hold in sequence;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculate F_i = F_i + flag*(-M_i); if F_i < M_i, borrow from F_(i+1);
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32*p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32*(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
10. The GPU-based Montgomery modular multiplication operation method according to claim 9, wherein the step S34 comprises the following steps:
S341: an effective value is extracted from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S342: the sub-results W_1, W_2 … W_q are spliced to obtain the calculation result w.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410200625.6A CN117785129B (en) | 2024-02-23 | 2024-02-23 | Montgomery modular multiplication operation method based on GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117785129A (en) | 2024-03-29 |
CN117785129B (en) | 2024-05-07 |
Family
ID=90391339
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410200625.6A Active CN117785129B (en) | 2024-02-23 | 2024-02-23 | Montgomery modular multiplication operation method based on GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117785129B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN117785129B (en) | 2024-05-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||