CN117785129A - Montgomery modular multiplication operation method based on GPU - Google Patents


Publication number
CN117785129A
Authority
CN
China
Prior art keywords
data
calculation result
thread
sub
intermediate calculation
Legal status
Granted
Application number
CN202410200625.6A
Other languages
Chinese (zh)
Other versions
CN117785129B (en)
Inventor
冯黎明
董建阔
叶青波
陈昕
马煜翔
王超
吴凡
Current Assignee
Lanxiang Zhilian Hangzhou Technology Co ltd
Original Assignee
Lanxiang Zhilian Hangzhou Technology Co ltd
Application filed by Lanxiang Zhilian Hangzhou Technology Co ltd
Priority to CN202410200625.6A
Publication of CN117785129A
Application granted
Publication of CN117785129B

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU-based Montgomery modular multiplication method comprising the following steps: selecting q threads of the GPU; dividing the data x equally into q sub-data X and sending them to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending them to the q threads respectively; dividing the data y equally into p·q sub-data Y and storing them in shared memory; and having the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the p·q sub-data Y stored in shared memory, to obtain a calculation result w. The method implements Montgomery modular multiplication on the GPU, greatly improving computational efficiency and reducing computation latency.

Description

Montgomery modular multiplication operation method based on GPU
Technical Field
The invention relates to the technical field of computers, in particular to a Montgomery modular multiplication operation method based on a GPU.
Background
In recent years the volume of data has grown explosively and its variety has become increasingly complex, while large amounts of valuable customer information, personal privacy records and enterprise operating data are continuously being mined. In this era of data explosion, information security for big data is particularly important. Information security rests on encryption algorithms, and the encryption algorithms in common use today comprise symmetric and asymmetric encryption algorithms. Asymmetric encryption algorithms mainly include the RSA and ECC encryption algorithms, both of which need to perform modular multiplication.
Among modular multiplication algorithms, the Montgomery modular multiplication algorithm is efficient and convenient to implement. Existing Montgomery modular multiplication is implemented on a CPU and executed by a single CPU thread. Because the operation is constrained by its formula, it currently takes a long time and has low computational efficiency, which in turn reduces encryption efficiency.
Disclosure of Invention
To solve the above technical problem, the invention provides a GPU-based Montgomery modular multiplication method that implements Montgomery modular multiplication on the GPU, thereby greatly improving computational efficiency and reducing computation latency.
To solve the above problem, the invention adopts the following technical scheme:
The GPU-based Montgomery modular multiplication method of the invention operates on data x, data y and a modulus m that are all positive integers of 32·p·q bits, where p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. The method comprises the following steps:
S1: selecting q threads of the GPU;
S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p·q sub-data Y and storing them in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the p·q sub-data Y stored in shared memory, to obtain a calculation result w.
In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p·q parts and stored in shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel, which greatly improves computational efficiency. CIOS stands for Coarsely Integrated Operand Scanning.
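For orientation, the single-threaded CIOS baseline that the scheme parallelises can be sketched as follows. This is a minimal Python sketch of the textbook word-serial CIOS loop with 32-bit words, not the multi-thread flow claimed here. The names `to_words`, `from_words` and `cios_montmul`, and the parameter `n` (the number of 32-bit words, corresponding to p·q), are illustrative assumptions, and the constant is taken as m′ = −m⁻¹ mod 2³² as in the standard formulation.

```python
MASK = 0xFFFFFFFF  # one 32-bit word


def to_words(value, n):
    """Split an integer into n 32-bit words, least significant first."""
    return [(value >> (32 * k)) & MASK for k in range(n)]


def from_words(words):
    """Reassemble an integer from 32-bit words, least significant first."""
    return sum(w << (32 * k) for k, w in enumerate(words))


def cios_montmul(x, y, m, n):
    """Return x * y * 2^(-32 n) mod m for odd m, with x, y < m < 2^(32 n)."""
    xs, ys, ms = to_words(x, n), to_words(y, n), to_words(m, n)
    m_prime = (-pow(m, -1, 1 << 32)) & MASK  # assumed convention: -m^(-1) mod 2^32
    z = [0] * (n + 2)                        # n + 2 words of working storage
    for j in range(n):
        # Multiplication pass: z += x * y[j]
        carry = 0
        for r in range(n):
            t = z[r] + xs[r] * ys[j] + carry
            z[r], carry = t & MASK, t >> 32
        t = z[n] + carry
        z[n], z[n + 1] = t & MASK, t >> 32
        # Reduction pass: z = (z + u * m) / 2^32, shifting down one word
        u = (z[0] * m_prime) & MASK
        carry = (z[0] + u * ms[0]) >> 32
        for r in range(1, n):
            t = z[r] + u * ms[r] + carry
            z[r - 1], carry = t & MASK, t >> 32
        t = z[n] + carry
        z[n - 1], carry = t & MASK, t >> 32
        z[n] = z[n + 1] + carry
    result = from_words(z[: n + 1])
    return result - m if result >= m else result
```

The inner loops over `r` are the part that the scheme distributes across q threads, each thread owning p consecutive words.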
Preferably, the q threads are numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q;
step S2 includes the following steps:
dividing the data x equally into q sub-data X, denoted $X_1, X_2, \ldots, X_q$ from low order to high order, and sending sub-data $X_i$ to the thread numbered i, where $X_i$ represents bits $32p(i-1)$ to $32pi-1$ of the data x;
dividing the modulus m equally into q sub-moduli M, denoted $M_1, M_2, \ldots, M_q$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32p(i-1)$ to $32pi-1$ of the modulus m;
dividing the data y equally into p·q sub-data Y, denoted $Y_1, Y_2, \ldots, Y_{pq}$ from low order to high order, and storing them in shared memory, with 1 ≤ j ≤ p·q, where $Y_j$ represents bits $32(j-1)$ to $32j-1$ of the data y.
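The partition described above can be sketched in Python as follows. This is an illustrative sketch only: the function name `split_operands` is hypothetical, the lists are 0-based (so `X[i - 1]` corresponds to $X_i$), and no GPU thread or shared-memory handling is modelled.

```python
MASK32 = (1 << 32) - 1


def split_operands(x, y, m, p, q):
    """Split x and m into q segments of 32*p bits and y into p*q 32-bit words."""
    seg_bits = 32 * p
    seg_mask = (1 << seg_bits) - 1
    # X[i] holds bits 32*p*i .. 32*p*(i+1)-1 of x (X_1 is X[0])
    X = [(x >> (seg_bits * i)) & seg_mask for i in range(q)]
    # M[i] holds the same bit range of the modulus m
    M = [(m >> (seg_bits * i)) & seg_mask for i in range(q)]
    # Y[j] holds bits 32*j .. 32*(j+1)-1 of y (Y_1 is Y[0])
    Y = [(y >> (32 * j)) & MASK32 for j in range(p * q)]
    return X, M, Y
```

Each segment of x and m stays with one thread for the whole computation, whereas every thread reads the same word of y in a given iteration, which is why only y is placed in shared memory.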
Preferably, step S1 further includes the following step: each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$, and the data stored in the p+2 subspaces constructed by the thread numbered i are denoted $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ from low order to high order, with 1 ≤ r ≤ p+2, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$.
Preferably, step S3 includes the following steps:
S31: setting the intermediate calculation result Z of each thread to zero, and setting j=1;
S32: each thread fetches sub-data $Y_j$ from shared memory and calculates its intermediate calculation result Z from $Y_j$ together with the sub-data X and the sub-modulus M that it holds;
S33: setting j=j+1 and judging whether j is greater than p·q; if so, executing step S34, otherwise returning to step S32;
S34: extracting a valid value from the intermediate calculation result Z calculated by each thread and splicing the valid values into the calculation result w.
Preferably, step S32 includes the following steps:
S321: each thread fetches sub-data $Y_j$ from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data $Y_j$, and assigns the latest intermediate calculation result to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest intermediate calculation result to Z;
S324: extracting a valid value from the intermediate calculation result Z calculated by each thread and splicing the valid values into a parameter f, then judging whether the parameter f is smaller than the modulus m; if so, the parameter flag is set to 0, otherwise it is set to 1;
each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f that it holds, and assigns the latest intermediate calculation result to Z.
Preferably, in step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data $X_i$ it holds, its current intermediate calculation result $Z_i$ and the sub-data $Y_j$, and assigns it to $Z_i$, as follows:
N1: the data $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ are reassigned in turn according to the following formula (not reproduced here):
where $C_i(r)$ denotes the carry generated when computing the latest data $Z_i(r)$, $X_i(r)$ denotes bits $32(r-1)$ to $32r-1$ of the sub-data $X_i$, and the formulas use the low 32 bits of $X_i(r) \cdot Y_j$ and the high 32 bits of $X_i(r-1) \cdot Y_j$;
N2: the data $Z_i(2), Z_i(3), \ldots, Z_i(p+2)$ are reassigned in turn according to the following formula (not reproduced here):
N3: the latest data $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ are spliced together in order to obtain the intermediate calculation result $Z_i$.
Because $X_i$ and $Y_j$ are very large numbers, computing $Z_i = Z_i + X_i \cdot Y_j$ in these two rounds is fast and greatly improves computational efficiency.
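One plausible reading of the two rounds, given that the formulas themselves are not reproduced in this text, is a pass that adds the low halves of the partial products followed by a pass that adds the high halves one word higher. The sketch below follows that assumption; the names `mul_accumulate` and `words_to_int` are hypothetical, words are 0-based, and the exact carry handling of the original formulas may differ.

```python
MASK = 0xFFFFFFFF


def words_to_int(words):
    """Reassemble an integer from 32-bit words, least significant first."""
    return sum(w << (32 * k) for k, w in enumerate(words))


def mul_accumulate(z, x_words, y_j):
    """Return z + x_words * y_j as p+2 words, using a low pass and a high pass."""
    p = len(x_words)
    z = list(z)  # z has p + 2 words; work on a copy
    # Round 1 (cf. N1): word r receives the low 32 bits of x[r] * y_j
    carry = 0
    for r in range(p + 2):
        lo = (x_words[r] * y_j) & MASK if r < p else 0
        t = z[r] + lo + carry
        z[r], carry = t & MASK, t >> 32
    # Round 2 (cf. N2): word r receives the high 32 bits of x[r-1] * y_j
    carry = 0
    for r in range(1, p + 2):
        hi = (x_words[r - 1] * y_j) >> 32 if r - 1 < p else 0
        t = z[r] + hi + carry
        z[r], carry = t & MASK, t >> 32
    return z
```

Splitting each 64-bit product into halves keeps every addition within 32-bit operands plus a carry, which matches the 32-bit subspaces each thread maintains.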
Preferably, in step S322 the thread numbered 1 calculates the intermediate parameter u as follows:
the thread numbered 1 fetches the data $Z_1(1)$ and calculates the intermediate parameter u using the following formula: $u = (Z_1(1) \cdot m') \bmod 2^{32}$,
where m′ and m are mutually inverse elements.
Preferably, step S323 includes the following steps:
S3231: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:
the data $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ are reassigned in turn according to the following formula (not reproduced here):
where $C_i(r)$ denotes the carry generated when computing the latest data $Z_i(r)$, $M_i(r)$ denotes bits $32(r-1)$ to $32r-1$ of the sub-modulus $M_i$, and the formula uses the low 32 bits of $M_i(r) \cdot u$ and the high 32 bits of $M_i(r-1) \cdot u$;
S3232: the thread numbered k+1 sends the data $Z_{k+1}(1)$ to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3, …, q-1;
S3233: each thread reassigns its own intermediate calculation result Z;
with tmp(q)=0, the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:
the data $Z_i(1), Z_i(2), \ldots, Z_i(p+1)$ are reassigned in turn according to the following formula (not reproduced here):
S3234: the thread numbered k sends the data $Z_k(p+1)$ to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);
S3235: the threads numbered 1, 2, …, q reassign the intermediate calculation results Z they hold, one after another;
the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:
when i=1, the intermediate calculation result $Z_1$ remains unchanged;
when i=2, the data $Z_i(1), Z_i(2), \ldots, Z_i(p)$ are reassigned in turn according to the following formula (not reproduced here):
when 2 < i < q, the data $Z_i(1), Z_i(2), \ldots, Z_i(p)$ are reassigned in turn according to the following formula (not reproduced here):
when i=q, the data $Z_i(1), Z_i(2), \ldots, Z_i(p), Z_i(p+1)$ are reassigned in turn according to the following formula (not reproduced here):
Preferably, step S324 includes the following steps:
S3241: extracting a valid value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter $F_i$ is extracted from the intermediate calculation result $Z_i$ of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter $F_i$ is the low-order 32·p bits of the intermediate calculation result $Z_i$;
when i=q, the sub-parameter $F_i$ is the low-order 32·(p+1) bits of the intermediate calculation result $Z_i$;
S3242: splicing the sub-parameters $F_1, F_2, \ldots, F_q$ into a parameter f and judging whether f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, …, q reassign the sub-parameters F they hold, one after another;
the thread numbered i reassigns the sub-parameter $F_i$ it holds as follows:
calculating $F_i = F_i + \mathrm{flag} \cdot (-M_i)$, borrowing from $F_{i+1}$ if $F_i < M_i$;
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result $Z_i$ it holds as follows:
when 1 ≤ i < q, the low-order 32·p bits of the intermediate calculation result $Z_i$ are assigned the value of the sub-parameter $F_i$, and the high-order 64 bits of $Z_i$ are set to 0;
when i=q, the low-order 32·(p+1) bits of the intermediate calculation result $Z_i$ are assigned the value of the sub-parameter $F_i$, and the high-order 32 bits of $Z_i$ are set to 0.
Preferably, step S34 includes the following steps:
S341: extracting a valid value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result $W_i$ is extracted from the intermediate calculation result $Z_i$ of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result $W_i$ is the low-order 32·p bits of the intermediate calculation result $Z_i$;
when i=q, the sub-result $W_i$ is the low-order 32·(p+1) bits of the intermediate calculation result $Z_i$;
S342: splicing the sub-results $W_1, W_2, \ldots, W_q$ to obtain the calculation result w.
The beneficial effects of the invention are as follows: the multi-thread parallel computing capability of the GPU is used to optimize and accelerate Montgomery modular multiplication, which greatly improves computational efficiency and reduces computation latency, and thereby greatly improves the encryption efficiency of algorithms such as RSA and ECC that rely on Montgomery modular multiplication.
Drawings
FIG. 1 is a flow chart of an embodiment;
FIG. 2 is an illustrative schematic.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Example 1: in the GPU-based Montgomery modular multiplication method of this embodiment, the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32·p·q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 1, the method comprises the following steps:
S1: selecting q threads of the GPU, numbered 1, 2, …, q in sequence, with 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted $Z_i$, and the data stored in the p+2 subspaces constructed by the thread numbered i are denoted $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ from low order to high order, with 1 ≤ r ≤ p+2, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$; the p+2 subspaces are numbered 1, 2, …, p+2 in sequence, and the data $Z_i(r)$ is stored in the subspace numbered r;
S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p·q sub-data Y and storing them in shared memory;
S3: the q threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the p·q sub-data Y stored in shared memory, to obtain a calculation result w.
Step S2 comprises the following steps:
dividing the data x equally into q sub-data X, denoted $X_1, X_2, \ldots, X_q$ from low order to high order, and sending sub-data $X_i$ to the thread numbered i, where $X_i$ represents bits $32p(i-1)$ to $32pi-1$ of the data x, so that x is the splice of $X_q, \ldots, X_2, X_1$ from high order to low order;
dividing the modulus m equally into q sub-moduli M, denoted $M_1, M_2, \ldots, M_q$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32p(i-1)$ to $32pi-1$ of the modulus m, so that m is the splice of $M_q, \ldots, M_2, M_1$ from high order to low order;
dividing the data y equally into p·q sub-data Y, denoted $Y_1, Y_2, \ldots, Y_{pq}$ from low order to high order, and storing them in shared memory, with 1 ≤ j ≤ p·q, where $Y_j$ represents bits $32(j-1)$ to $32j-1$ of the data y, so that y is the splice of $Y_{pq}, \ldots, Y_2, Y_1$ from high order to low order.
Step S3 comprises the following steps:
S31: setting the intermediate calculation result Z of each thread to zero (that is, setting every bit of the storage space to zero), and setting j=1;
S32: each thread fetches sub-data $Y_j$ from shared memory and calculates its intermediate calculation result Z from $Y_j$ together with the sub-data X and the sub-modulus M that it holds;
S33: setting j=j+1 and judging whether j is greater than p·q; if so, executing step S34, otherwise returning to step S32;
S34: extracting a valid value from the intermediate calculation result Z calculated by each thread and splicing the valid values into the calculation result w.
Step S32 includes the following steps:
S321: each thread fetches sub-data $Y_j$ from shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data $Y_j$, and assigns the latest intermediate calculation result to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends it to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns the latest intermediate calculation result to Z;
S324: extracting a valid value from the intermediate calculation result Z calculated by each thread and splicing the valid values into a parameter f, then judging whether the parameter f is smaller than the modulus m; if so, the parameter flag is set to 0, otherwise it is set to 1;
each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f that it holds, and assigns the latest intermediate calculation result to Z.
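Viewed on whole integers rather than per-thread words, steps S321 to S324 amount to the loop sketched below, in which the conditional subtraction is applied in every iteration as described above. This is a sketch under stated assumptions: the function name `montmul_per_word` is hypothetical, m is assumed odd with x < m, and the sign convention m′ = −m⁻¹ mod 2³² is the standard one rather than something stated in this text.

```python
def montmul_per_word(x, y, m, n_words):
    """Return x * y * 2^(-32 * n_words) mod m, one 32-bit word of y per pass."""
    mask = (1 << 32) - 1
    m_prime = (-pow(m, -1, 1 << 32)) & mask   # assumed: -m^(-1) mod 2^32
    z = 0                                      # S31: Z starts at zero
    for j in range(n_words):                   # S33: j runs over all p*q words
        y_j = (y >> (32 * j)) & mask           # word fetched from shared memory
        z += x * y_j                           # S321: Z = Z + X * Y_j
        u = ((z & mask) * m_prime) & mask      # S322: u from the lowest word
        z = (z + u * m) >> 32                  # S323: add u*M, drop the zero word
        if z >= m:                             # S324: flag = 1 when f >= m
            z -= m
    return z
```

Since Z < m at the start of each pass and x < m, the value after S323 is below 2m, so a single subtraction in S324 is enough to bring it back below m.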
In step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data $X_i$ it holds, its current intermediate calculation result $Z_i$ and the sub-data $Y_j$, and assigns it to $Z_i$, as follows:
N1: the data $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ are reassigned in turn according to the following formula (not reproduced here):
where $C_i(r)$ denotes the carry generated when computing the latest data $Z_i(r)$, $X_i(r)$ denotes bits $32(r-1)$ to $32r-1$ of the sub-data $X_i$, and the formulas use the low 32 bits of $X_i(r) \cdot Y_j$ and the high 32 bits of $X_i(r-1) \cdot Y_j$;
N2: the data $Z_i(2), Z_i(3), \ldots, Z_i(p+2)$ are reassigned in turn according to the following formula (not reproduced here):
N3: the latest data $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ are spliced together in order to obtain the intermediate calculation result $Z_i$.
Because $X_i$ and $Y_j$ are very large numbers, computing $Z_i = Z_i + X_i \cdot Y_j$ in these two rounds is fast and greatly improves computational efficiency.
In step S322 the thread numbered 1 calculates the intermediate parameter u as follows:
the thread numbered 1 fetches the data $Z_1(1)$ and calculates the intermediate parameter u using the following formula: $u = (Z_1(1) \cdot m') \bmod 2^{32}$,
where m′ and m are mutually inverse elements.
Because the result is taken modulo $2^{32}$, all bits above the low 32 can be discarded. It therefore suffices to multiply only the low 32 bits of the data, $Z_1(1)$, by m′ and reduce the product modulo $2^{32}$, which improves computational efficiency.
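A short Python sketch of this step follows. The text only states that m′ and m are inverse elements; the sketch assumes the usual Montgomery convention m′ = −m⁻¹ mod 2³², which is what makes the low word of Z + u·m vanish. The function names `montgomery_constant` and `intermediate_u` are hypothetical.

```python
WORD = 1 << 32


def montgomery_constant(m):
    """Return m' = -m^(-1) mod 2^32 for an odd modulus m (assumed convention)."""
    return (-pow(m, -1, WORD)) % WORD


def intermediate_u(z_low_word, m_prime):
    """Return u = (Z_1(1) * m') mod 2^32, using only the lowest word of Z."""
    return (z_low_word * m_prime) % WORD
```

Only the lowest word of Z and the lowest word of m influence u, which is why thread 1 can compute it alone and then send it to the other threads.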
Step S323 includes the following steps:
S3231: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:
the data $Z_i(1), Z_i(2), \ldots, Z_i(p+2)$ are reassigned in turn according to the following formula (not reproduced here):
where $C_i(r)$ denotes the carry generated when computing the latest data $Z_i(r)$, $M_i(r)$ denotes bits $32(r-1)$ to $32r-1$ of the sub-modulus $M_i$, and the formula uses the low 32 bits of $M_i(r) \cdot u$ and the high 32 bits of $M_i(r-1) \cdot u$;
S3232: the thread numbered k+1 sends the data $Z_{k+1}(1)$ to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3, …, q-1; that is, each of the threads numbered 2 to q takes the low 32 bits of the intermediate calculation result Z it holds and sends them to the preceding thread;
S3233: each thread reassigns its own intermediate calculation result Z;
with tmp(q)=0, the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:
the data $Z_i(1), Z_i(2), \ldots, Z_i(p+1)$ are reassigned in turn according to the following formula (not reproduced here):
S3234: the thread numbered k sends the data $Z_k(p+1)$ to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);
S3235: the threads numbered 1, 2, …, q reassign the intermediate calculation results Z they hold, one after another;
the thread numbered i reassigns its intermediate calculation result $Z_i$ as follows:
when i=1, the intermediate calculation result $Z_1$ remains unchanged;
when i=2, the data $Z_i(1), Z_i(2), \ldots, Z_i(p)$ are reassigned in turn according to the following formula (not reproduced here):
when 2 < i < q, the data $Z_i(1), Z_i(2), \ldots, Z_i(p)$ are reassigned in turn according to the following formula (not reproduced here):
when i=q, the data $Z_i(1), Z_i(2), \ldots, Z_i(p), Z_i(p+1)$ are reassigned in turn according to the following formula (not reproduced here):
Calculation and shifting are carried out at the same time, which greatly improves computational efficiency.
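The word exchange in S3232 and S3233 can be read as a right shift by one 32-bit word of a value that is spread over q overlapping per-thread results. The Python sketch below illustrates only that shift and rests on assumptions: each thread's $Z_i$ is modelled as one integer of up to p+2 words, the lowest word of $Z_1$ is already zero after S3231, and the carry hand-over of S3234 and S3235 is left out. The names `combined_value` and `shift_segments` are hypothetical.

```python
MASK = (1 << 32) - 1


def combined_value(Z, p):
    """Value represented by per-thread results Z[i], each weighted by 2^(32*p*i)."""
    return sum(z << (32 * p * i) for i, z in enumerate(Z))


def shift_segments(Z, p):
    """Divide the represented value by 2^32 using neighbour-to-neighbour words."""
    q = len(Z)
    # S3232: thread k receives the lowest word of thread k+1; tmp(q) = 0
    tmp = [Z[k + 1] & MASK for k in range(q - 1)] + [0]
    # S3233: drop the own lowest word and place tmp at word position p
    return [(Z[i] >> 32) + (tmp[i] << (32 * (p - 1))) for i in range(q)]
```

For example, with p = 2 and three threads, `combined_value(shift_segments(Z, 2), 2)` equals `combined_value(Z, 2) >> 32` whenever the lowest word of `Z[0]` is zero. Each thread needs only one word from its neighbour, so the shift costs a single exchange per iteration.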
Step S324 includes the following steps:
S3241: extracting a valid value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter $F_i$ is extracted from the intermediate calculation result $Z_i$ of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter $F_i$ is the low-order 32·p bits of the intermediate calculation result $Z_i$, that is, the data $Z_i(1)$ to $Z_i(p)$;
when i=q, the sub-parameter $F_i$ is the low-order 32·(p+1) bits of the intermediate calculation result $Z_i$, that is, the data $Z_i(1)$ to $Z_i(p+1)$;
S3242: splicing the sub-parameters $F_1, F_2, \ldots, F_q$ into a parameter f, with $F_1$ as the lowest-order part, and judging whether f is smaller than the modulus m; if f < m, the parameter flag is 0, and if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2, …, q reassign the sub-parameters F they hold, one after another;
the thread numbered i reassigns the sub-parameter $F_i$ it holds as follows:
calculating $F_i = F_i + \mathrm{flag} \cdot (-M_i)$, borrowing from $F_{i+1}$ if $F_i < M_i$ (since f ≥ m whenever the subtraction takes place, $F_q \ge M_q$);
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result $Z_i$ it holds as follows:
when 1 ≤ i < q, the low-order 32·p bits of the intermediate calculation result $Z_i$ are assigned the value of the sub-parameter $F_i$, and the high-order 64 bits of $Z_i$ are set to 0;
when i=q, the low-order 32·(p+1) bits of the intermediate calculation result $Z_i$ are assigned the value of the sub-parameter $F_i$, and the high-order 32 bits of $Z_i$ are set to 0.
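The comparison and segment-wise subtraction of S3242 and S3243 can be sketched as follows. This is an illustrative Python model, not the thread-level procedure: the function name `conditional_subtract` is hypothetical, `F` and `M` are 0-based lists of segment integers with the last entry of `F` allowed one extra word, and the borrow is passed along in a plain loop rather than between threads.

```python
def conditional_subtract(F, M, p):
    """Return (flag, F') where F' holds f - flag * m, segment by segment."""
    seg = 1 << (32 * p)
    # S3242: splice the segments and compare f with m
    f = sum(v << (32 * p * i) for i, v in enumerate(F))
    m = sum(v << (32 * p * i) for i, v in enumerate(M))
    flag = 0 if f < m else 1
    out, borrow = [], 0
    # S3243: subtract flag * M_i in thread order, borrowing from the next segment
    for f_i, m_i in zip(F, M):
        d = f_i - flag * m_i - borrow
        borrow = 1 if d < 0 else 0
        out.append(d + borrow * seg)
    # When flag = 1 we have f >= m, so the highest segment never leaves a borrow
    assert borrow == 0
    return flag, out
```

Multiplying by flag instead of branching keeps every thread on the same instruction path whether or not the subtraction is needed.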
Step S34 includes the following steps:
S341: extracting a valid value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result $W_i$ is extracted from the intermediate calculation result $Z_i$ of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result $W_i$ is the low-order 32·p bits of the intermediate calculation result $Z_i$, that is, the data $Z_i(1)$ to $Z_i(p)$;
when i=q, the sub-result $W_i$ is the low-order 32·(p+1) bits of the intermediate calculation result $Z_i$, that is, the data $Z_i(1)$ to $Z_i(p+1)$;
S342: splicing the sub-results $W_1, W_2, \ldots, W_q$ into the calculation result w, with $W_1$ as the lowest-order part.
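The extraction and splice in S341 and S342 can be sketched as below, assuming that the "first" bits of $Z_i$ are its low-order bits and that each $Z_i$ is modelled as one Python integer. The function name `extract_and_splice` is hypothetical.

```python
def extract_and_splice(Z, p):
    """Take the valid low-order bits of each Z[i] and splice them into w."""
    seg_mask = (1 << (32 * p)) - 1          # 32*p bits for threads 1 .. q-1
    last_mask = (1 << (32 * (p + 1))) - 1   # 32*(p+1) bits for thread q
    # S341: extract the sub-results W_1 .. W_q (0-based list)
    W = [z & seg_mask for z in Z[:-1]] + [Z[-1] & last_mask]
    # S342: splice, with W_1 as the lowest-order part
    return sum(w << (32 * p * i) for i, w in enumerate(W))
```

The last thread keeps one extra word because the running value can briefly exceed 32·p·q bits before the conditional subtraction brings it back below m.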
In this scheme, the data x and the modulus m are each divided equally into q parts and sent to the corresponding threads, the data y is divided equally into p·q parts and stored in shared memory, and the q threads of the GPU carry out the Montgomery modular multiplication in parallel by the above method. In the prior art, Montgomery modular multiplication is performed by a single CPU thread executing the CIOS algorithm, so the operation takes a long time, computational efficiency is low, and encryption efficiency suffers. The present scheme uses an algorithm flow designed specifically for the multiple threads of the GPU, so that the q GPU threads executing the method obtain the same result as a single CPU thread executing the CIOS algorithm. Because the method is executed by q threads in parallel and the individual calculation steps are further optimized for parallel execution, computational efficiency is greatly improved and computation latency is reduced, which in turn greatly improves the encryption efficiency of algorithms such as RSA and ECC that rely on Montgomery modular multiplication.
Illustration:
When a Montgomery modular multiplication is required during RSA encryption, the method of this embodiment is used. The data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32·4·3 bits (p = 4, q = 3), the data x is the multiplicand and the data y is the multiplier. As shown in FIG. 2, the method comprises the following steps:
S1: selecting 3 threads of the GPU, numbered 1, 2 and 3, with 1 ≤ i ≤ 3; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising 6 subspaces of 32 bits each. The intermediate calculation result calculated by the thread numbered i is denoted $Z_i$, and the data stored in the 6 subspaces constructed by the thread numbered i are denoted $Z_i(1), Z_i(2), \ldots, Z_i(6)$ from low order to high order, with 1 ≤ r ≤ 6, where $Z_i(r)$ represents bits $32(r-1)$ to $32r-1$ of the intermediate calculation result $Z_i$;
S2: dividing the data x equally into 3 sub-data X, denoted $X_1, X_2, X_3$ from low order to high order, and sending sub-data $X_i$ to the thread numbered i, where $X_i$ represents bits $32p(i-1)$ to $32pi-1$ of the data x;
dividing the modulus m equally into 3 sub-moduli M, denoted $M_1, M_2, M_3$ from low order to high order, and sending sub-modulus $M_i$ to the thread numbered i, where $M_i$ represents bits $32p(i-1)$ to $32pi-1$ of the modulus m;
dividing the data y equally into 12 sub-data Y, denoted $Y_1, Y_2, \ldots, Y_{12}$ from low order to high order, and storing them in shared memory, with 1 ≤ j ≤ 12, where $Y_j$ represents bits $32(j-1)$ to $32j-1$ of the data y;
S3: the 3 threads perform the Montgomery modular multiplication with the CIOS algorithm, using the sub-data X, the sub-moduli M and the 12 sub-data Y stored in shared memory, to obtain a calculation result w. The specific steps are as follows:
s31: setting the intermediate calculation result Z corresponding to each thread to zero, and setting j=1;
s32: each thread fetches sub-data Y from shared memory 1 Each thread is based on the sub-data X, the current intermediate calculation result Z and the sub-data Y held by the thread 1 Calculating the latest intermediate calculation result, and assigning the current intermediate calculation result Z as the latest intermediate calculation result, namely, the latest intermediate calculation result Z is calculated by adopting the steps N1 to N3 by the thread 1, the thread 2 and the thread 3 respectively 1 、Z 2 、Z 3
Thread number 1 fetch data Z 1 (1) The intermediate parameter u is calculated using the following formula:
wherein,and m are inverse elements;
each thread calculates the latest intermediate calculation result according to the sub-module M, the intermediate parameter u and the current intermediate calculation result Z held by the thread, and assigns the current intermediate calculation result Z as the latest intermediate calculation result, namely, the thread 1, the thread 2 and the thread 3 respectively calculate the latest intermediate calculation result Z by adopting the steps S3231 to S3235 1 、Z 2 、Z 3 The method comprises the steps of carrying out a first treatment on the surface of the In step S3232, thread number 2 transfers data Z 2 (1) Send to thread number 1, thread number 3 will data Z 3 (1) Sending to thread number 2; in step S3234, thread 1 transfers data Z 1 (5) Send to thread number 2, thread number 2 will data Z 2 (5) Sending to thread number 3;
extracting an effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F: the sub-parameter F_1 extracted by thread 1 is the first 32*4 bits of the intermediate calculation result Z_1, the sub-parameter F_2 extracted by thread 2 is the first 32*4 bits of the intermediate calculation result Z_2, and the sub-parameter F_3 extracted by thread 3 is the first 32*5 bits of the intermediate calculation result Z_3;
splicing the sub-parameters F_1, F_2 and F_3 into a parameter f and judging whether f is smaller than the modulus m: if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;
thread 1, thread 2 and thread 3 reassign, in sequence, the sub-parameters F they hold: thread 1 calculates F_1 = F_1 + flag*(-M_1) and, if F_1 < M_1, borrows from F_2; thread 2 calculates F_2 = F_2 + flag*(-M_2) and, if F_2 < M_2, borrows from F_3; thread 3 calculates F_3 = F_3 + flag*(-M_3), where F_3 is not less than M_3 because the subtraction only takes effect when f ≥ m;
each thread reassigns its own intermediate calculation result Z: thread 1 assigns the value of sub-parameter F_1 to the first 32*4 bits of the intermediate calculation result Z_1; thread 2 assigns the value of sub-parameter F_2 to the first 32*4 bits of the intermediate calculation result Z_2; thread 3 assigns the value of sub-parameter F_3 to the first 32*5 bits of the intermediate calculation result Z_3;
S33: setting j = j + 1 and judging whether j is greater than 12; if so, executing step S34; otherwise, returning to step S32;
S34: extracting an effective value from the intermediate calculation result Z of each thread as a sub-result W: the sub-result W_1 extracted by thread 1 is the first 32*4 bits of the intermediate calculation result Z_1; the sub-result W_2 extracted by thread 2 is the first 32*4 bits of the intermediate calculation result Z_2; the sub-result W_3 extracted by thread 3 is the first 32*5 bits of the intermediate calculation result Z_3;
splicing the sub-results W_1, W_2 and W_3 to obtain the calculation result w.
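The data flow of this example (p = 4, q = 3) can be modelled sequentially on the host. The sketch below is an illustration only, not the patented GPU kernel: each thread's intermediate result Z_i is held as a Python integer instead of p+2 32-bit subspaces, the inter-thread transfers of steps S3232 and S3234 become list accesses, and the formula for u together with the constant m_prime = -m^(-1) mod 2^32 follows the usual CIOS definition, which is an assumption because the source formulas are not preserved. The function names and test operands are hypothetical.

```python
WORD = 32
MASK = (1 << WORD) - 1

def split(value, parts, bits):
    """Cut value into `parts` chunks of `bits` bits each, low order first."""
    return [(value >> (bits * k)) & ((1 << bits) - 1) for k in range(parts)]

def mont_mul_threads(x, y, m, p, q):
    """Sequential model of S31-S34; list index i stands for thread i + 1."""
    chunk = WORD * p
    X, M = split(x, q, chunk), split(m, q, chunk)    # per-thread operands
    Y = split(y, p * q, WORD)                        # words in shared memory
    # Assumed CIOS constant; pow(m, -1, n) needs Python 3.8 or later.
    m_prime = (-pow(m, -1, 1 << WORD)) & MASK
    Z = [0] * q                                      # S31
    for y_j in Y:                                    # S32, repeated via S33
        Z = [Z[i] + X[i] * y_j for i in range(q)]    # S321
        u = ((Z[0] & MASK) * m_prime) & MASK         # S322 (assumed formula)
        Z = [Z[i] + M[i] * u for i in range(q)]      # S3231
        # S3232/S3233: drop the (now zero) lowest word; the low word of
        # thread k+1 moves into the top regular word of thread k.
        tmp = [Z[k + 1] & MASK for k in range(q - 1)] + [0]
        Z = [(Z[i] >> WORD) + (tmp[i] << (chunk - WORD)) for i in range(q)]
        # S3234/S3235: pass each thread's overflow up to the next thread.
        carry = 0
        for i in range(q - 1):
            Z[i] += carry
            carry, Z[i] = Z[i] >> chunk, Z[i] & ((1 << chunk) - 1)
        Z[q - 1] += carry
        # S324: compare the spliced value with m, then subtract with borrow.
        f = sum(Z[i] << (chunk * i) for i in range(q))
        flag = 1 if f >= m else 0
        borrow = 0
        for i in range(q):
            width = chunk if i < q - 1 else chunk + WORD
            d = Z[i] - flag * M[i] - borrow
            borrow = 1 if d < 0 else 0
            Z[i] = d % (1 << width)
    return sum(Z[i] << (chunk * i) for i in range(q))    # S34

p, q = 4, 3
m = (1 << 384) - 317          # arbitrary odd 384-bit modulus
x, y = 3**240, 5**160         # arbitrary operands smaller than m
w = mont_mul_threads(x, y, m, p, q)
```

Under these assumptions w equals x*y*R^(-1) mod m with R = 2^384, which is the usual output of a Montgomery multiplication; the operands must be smaller than an odd modulus m for the single conditional subtraction per pass to be sufficient.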
Example 2: in the GPU-based Montgomery modular multiplication operation method of this embodiment, the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32*p*q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier; the method comprises the following steps:
S1: q threads of the GPU are selected and numbered 1, 2 ... q in sequence, with 1 ≤ i ≤ q; each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i; the p+2 subspaces constructed by the thread numbered i are numbered 1, 2 ... p+2 in sequence, and the data they store are denoted Z_i(1), Z_i(2) ... Z_i(p+2) from low order to high order, where 1 ≤ r ≤ p+2, Z_i(r) represents bits 32*(r-1) to 32*r-1 of the intermediate calculation result Z_i, and data Z_i(r) is stored in the subspace numbered r;
S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p*q sub-data Y and storing them in a shared memory;
S3: the q threads perform the Montgomery modular multiplication using the CIOS algorithm on the sub-data X, the sub-moduli M and the p*q sub-data Y stored in the shared memory to obtain the calculation result w.
Step S2 comprises the steps of:
dividing the data x equally into q sub-data X, denoted X_1, X_2 ... X_q from low order to high order, and sending sub-data X_i to the thread numbered i, where sub-data X_i represents bits 32*p*(i-1) to 32*p*i-1 of data x, so that data x is the splice (concatenation) of the q sub-data X;
dividing the modulus m equally into q sub-moduli M, denoted M_1, M_2 ... M_q from low order to high order, and sending sub-modulus M_i to the thread numbered i, where sub-modulus M_i represents bits 32*p*(i-1) to 32*p*i-1 of the modulus m, so that the modulus m is the splice of the q sub-moduli M;
dividing the data y equally into p*q sub-data Y, denoted Y_1, Y_2 ... Y_(p*q) from low order to high order, and storing them in a shared memory, where 1 ≤ j ≤ p*q and sub-data Y_j represents bits 32*(j-1) to 32*j-1 of data y, so that data y is the splice of the p*q sub-data Y.
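The partitioning of step S2 can be sketched with ordinary integer shifts and masks. The helper names split and splice, and the sample operands, are hypothetical and chosen only for illustration; list index k corresponds to the sub-data numbered k+1.

```python
def split(value, parts, bits):
    """Return `parts` chunks of `bits` bits each, low order first."""
    mask = (1 << bits) - 1
    return [(value >> (bits * k)) & mask for k in range(parts)]

def splice(chunks, bits):
    """Inverse of split: concatenate chunks, chunks[0] being the low bits."""
    return sum(c << (bits * k) for k, c in enumerate(chunks))

p, q = 4, 3                      # 32*p*q = 384-bit operands
x = 3**240                       # arbitrary sample data x
m = (1 << 384) - 317             # arbitrary sample modulus m
y = 5**160                       # arbitrary sample data y

X = split(x, q, 32 * p)          # X[i-1] is sent to the thread numbered i
M = split(m, q, 32 * p)          # M[i-1] is sent to the thread numbered i
Y = split(y, p * q, 32)          # all p*q words go to shared memory
```

Each thread therefore holds 32*p bits of x and of m, while every thread reads the same 32-bit word Y_j in a given pass of the loop.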
Step S3 comprises the steps of:
S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;
S32: each thread fetches sub-data Y_j from the shared memory and calculates its intermediate calculation result Z in combination with the sub-data X and the sub-modulus M it holds;
S33: setting j = j + 1 and judging whether j is greater than p*q; if so, executing step S34; otherwise, returning to step S32;
S34: extracting an effective value from the intermediate calculation result Z calculated by each thread, and splicing the effective values into the calculation result w.
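Ignoring the distribution across threads, the loop S31 to S34 amounts to a word-serial Montgomery multiplication that consumes one 32-bit word of y per pass. The following reference model, a sketch rather than the claimed method, works on whole integers; the function name is hypothetical and the constant m_prime = -m^(-1) mod 2^32 is the usual CIOS choice, assumed here because the source formula is missing.

```python
MASK32 = (1 << 32) - 1

def mont_mul_reference(x, y, m, words):
    """Word-serial Montgomery product x*y*2^(-32*words) mod m (m odd, x, y < m)."""
    m_prime = (-pow(m, -1, 1 << 32)) & MASK32   # assumed constant, Python 3.8+
    z = 0                                       # S31: Z = 0, j = 1
    for j in range(words):                      # S33: repeat for all p*q words
        y_j = (y >> (32 * j)) & MASK32          # S32: fetch Y_j
        z += x * y_j                            # multiply-accumulate
        u = ((z & MASK32) * m_prime) & MASK32   # intermediate parameter u
        z = (z + u * m) >> 32                   # low word is zero, so shift it out
        if z >= m:                              # conditional subtraction (flag)
            z -= m
    return z                                    # S34: calculation result w

m = (1 << 384) - 317
x, y = 3**240, 5**160
w = mont_mul_reference(x, y, m, 12)
```

The result is in Montgomery form: w is congruent to x*y*R^(-1) mod m with R = 2^(32*p*q), so callers normally convert operands into and out of that form around a chain of multiplications.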
Step S32 includes the steps of:
S321: each thread fetches sub-data Y_j from the shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns it to Z;
the formula by which the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i it holds, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i, is: Z_i = Z_i + X_i * Y_j;
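Because X_i is the i-th 32*p-bit slice of x, the per-thread updates Z_i = Z_i + X_i * Y_j together represent x * Y_j once each Z_i is weighted by 2^(32*p*(i-1)). A small check of that identity, with arbitrary sample values and hypothetical variable names:

```python
p, q = 4, 3
chunk = 32 * p
x = 3**240                        # arbitrary 384-bit sample for data x
y_j = 0x9E3779B9                  # arbitrary 32-bit sample for sub-data Y_j

# Sub-data X_1 ... X_q, low order first (list index i is thread i + 1).
X = [(x >> (chunk * i)) & ((1 << chunk) - 1) for i in range(q)]

Z = [0] * q                                    # intermediate results after S31
Z = [Z[i] + X[i] * y_j for i in range(q)]      # S321 in every thread

# Weighted sum of the per-thread results.
total = sum(Z[i] << (chunk * i) for i in range(q))
```

Each Z_i may now be wider than 32*p bits, which is why every thread reserves p+2 subspaces; the overlap between neighbouring threads is resolved later by the transfers in step S323.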
S322: the thread numbered 1 takes data Z_1(1) and calculates the intermediate parameter u from it using a formula (not reproduced in this text) whose constant and the modulus m are inverse elements of each other;
the thread with the number 1 sends the intermediate parameter u to other threads;
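The formula for u is missing from this text. In the standard CIOS algorithm, which is assumed here and not confirmed by the source, u is the product of the lowest word of the accumulator and a precomputed constant m' = -m^(-1) mod 2^32, reduced mod 2^32. A sketch with hypothetical function names:

```python
MASK32 = (1 << 32) - 1

def m_prime_for(m):
    """Assumed CIOS constant: m * m_prime is congruent to -1 mod 2^32 (m odd)."""
    return (-pow(m, -1, 1 << 32)) & MASK32     # pow(.., -1, ..) needs Python 3.8+

def intermediate_u(z1_low_word, m_prime):
    """u = Z_1(1) * m_prime mod 2^32, as in textbook CIOS."""
    return (z1_low_word * m_prime) & MASK32

m = (1 << 384) - 317              # arbitrary odd sample modulus
z = 3**240                        # arbitrary sample accumulator value
u = intermediate_u(z & MASK32, m_prime_for(m))
reduced = z + u * m               # what step S323 adds across all threads
```

With this choice the lowest 32 bits of z + u*m are zero, so the following step can discard one word without losing information; only thread 1 needs Z_1(1) to compute u, which is why it broadcasts u to the other threads.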
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns it to Z;
the thread numbered i calculates the latest intermediate calculation result from the sub-modulus M_i it holds, the intermediate parameter u and the current intermediate calculation result Z_i, and assigns it to Z_i, according to a formula that is not reproduced in this text;
S324: extracting an effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into a parameter f; judging whether the parameter f is smaller than the modulus m; if so, the parameter flag is set to 0; otherwise, the parameter flag is set to 1;
each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns it to its current intermediate calculation result Z.
Step S324 includes the steps of:
S3241: extracting an effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S3242: splicing the sub-parameters F_1, F_2 ... F_q into a parameter f and judging whether f is smaller than the modulus m: if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2 ... q reassign, in sequence, the sub-parameters F they hold;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculating F_i = F_i + flag*(-M_i) and, if F_i < M_i, borrowing from F_(i+1);
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32*p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32*(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
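Steps S3241 to S3244 perform one conditional subtraction of m that is spread across the threads, with a borrow handed from thread i to thread i+1. The sketch below models that chain on lists of integers; the function name and the small sample values are hypothetical, and list index i stands for the thread numbered i+1.

```python
def conditional_subtract(F, M, p):
    """Return (new F, flag): F - flag*m computed chunk by chunk with borrow."""
    q = len(F)
    chunk = 32 * p
    f = sum(F[i] << (chunk * i) for i in range(q))      # S3242: splice F
    m = sum(M[i] << (chunk * i) for i in range(q))
    flag = 1 if f >= m else 0
    out, borrow = [], 0
    for i in range(q):                                  # S3243: threads in order
        width = chunk if i < q - 1 else chunk + 32      # last thread has p+1 words
        d = F[i] - flag * M[i] - borrow
        borrow = 1 if d < 0 else 0                      # borrow from F_(i+1)
        out.append(d % (1 << width))
    return out, flag

p = 2                              # 64-bit chunks keep the sample readable
M = [5, 0, 1]                      # m = 2^128 + 5
F_small = [3, 0, 1]                # f = 2^128 + 3, smaller than m
F_large = [3, 0, 2]                # f = 2^129 + 3, at least m
kept, flag_small = conditional_subtract(F_small, M, p)
reduced, flag_large = conditional_subtract(F_large, M, p)
```

Multiplying M_i by flag instead of branching means every thread executes the same instruction sequence whether or not the subtraction takes effect, which suits lock-step execution on a GPU.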
Step S34 includes the steps of:
S341: extracting an effective value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S342: splicing the sub-results W_1, W_2 ... W_q to obtain the calculation result w.
This embodiment differs from Example 1 in steps S321 and S323.

Claims (10)

1. A GPU-based Montgomery modular multiplication operation method, characterized in that the data x, the data y and the modulus m used for the Montgomery modular multiplication are all positive integers of 32*p*q bits, p is an integer multiple of 2, q is an integer greater than 2, the data x is the multiplicand and the data y is the multiplier, the method comprising the following steps:
S1: q threads of the GPU are selected;
S2: dividing the data x equally into q sub-data X and sending the q sub-data X to the q threads respectively; dividing the modulus m equally into q sub-moduli M and sending the q sub-moduli M to the q threads respectively; dividing the data y equally into p*q sub-data Y and storing them in a shared memory;
S3: the q threads perform the Montgomery modular multiplication using the CIOS algorithm on the sub-data X, the sub-moduli M and the p*q sub-data Y stored in the shared memory to obtain the calculation result w.
2. The GPU-based Montgomery modular multiplication operation method according to claim 1, wherein the q threads are numbered 1, 2 ... q in sequence, with 1 ≤ i ≤ q;
the step S2 comprises the steps of:
dividing the data x equally into q sub-data X, denoted X_1, X_2 ... X_q from low order to high order, and sending sub-data X_i to the thread numbered i, where sub-data X_i represents bits 32*p*(i-1) to 32*p*i-1 of data x;
dividing the modulus m equally into q sub-moduli M, denoted M_1, M_2 ... M_q from low order to high order, and sending sub-modulus M_i to the thread numbered i, where sub-modulus M_i represents bits 32*p*(i-1) to 32*p*i-1 of the modulus m;
dividing the data y equally into p*q sub-data Y, denoted Y_1, Y_2 ... Y_(p*q) from low order to high order, and storing them in a shared memory, where 1 ≤ j ≤ p*q and sub-data Y_j represents bits 32*(j-1) to 32*j-1 of data y.
3. The GPU-based Montgomery modular multiplication operation method according to claim 2, wherein the step S1 further comprises the steps of: each thread constructs a storage space for storing an intermediate calculation result Z, the storage space comprising p+2 subspaces of 32 bits each;
the intermediate calculation result calculated by the thread numbered i is denoted Z_i, and the data stored in the p+2 subspaces constructed by the thread numbered i are denoted Z_i(1), Z_i(2) ... Z_i(p+2) from low order to high order, where 1 ≤ r ≤ p+2 and Z_i(r) represents bits 32*(r-1) to 32*r-1 of the intermediate calculation result Z_i.
4. The GPU-based Montgomery modular multiplication operation method according to claim 3, wherein the step S3 comprises the steps of:
S31: setting the intermediate calculation result Z of each thread to zero, and setting j = 1;
S32: each thread fetches sub-data Y_j from the shared memory and calculates its intermediate calculation result Z in combination with the sub-data X and the sub-modulus M it holds;
S33: setting j = j + 1 and judging whether j is greater than p*q; if so, executing step S34; otherwise, returning to step S32;
S34: extracting an effective value from the intermediate calculation result Z calculated by each thread, and splicing the effective values into the calculation result w.
5. The GPU-based Montgomery modular multiplication operation method according to claim 4, wherein the step S32 comprises the steps of:
S321: each thread fetches sub-data Y_j from the shared memory, calculates the latest intermediate calculation result from the sub-data X it holds, its current intermediate calculation result Z and the sub-data Y_j, and assigns it to Z;
S322: the thread numbered 1 calculates an intermediate parameter u and sends the intermediate parameter u to the other threads;
S323: each thread calculates the latest intermediate calculation result from the sub-modulus M it holds, the intermediate parameter u and its current intermediate calculation result Z, and assigns it to Z;
S324: extracting an effective value from the intermediate calculation result Z calculated by each thread and splicing the effective values into a parameter f; judging whether the parameter f is smaller than the modulus m; if so, the parameter flag is set to 0; otherwise, the parameter flag is set to 1;
each thread calculates the latest intermediate calculation result from the parameter flag and the part of the parameter f it holds, and assigns it to its current intermediate calculation result Z.
6. The GPU-based Montgomery modular multiplication operation method according to claim 5, wherein in the step S321 the thread numbered i calculates the latest intermediate calculation result from the sub-data X_i it holds, the current intermediate calculation result Z_i and the sub-data Y_j, and assigns it to Z_i, by the following method:
N1: data Z_i(1), Z_i(2) ... Z_i(p+2) are reassigned in sequence according to a formula (not reproduced in this text),
wherein C_i(r) represents the carry generated when calculating the latest data Z_i(r), X_i(r) represents bits 32*(r-1) to 32*r-1 of sub-data X_i, and the two product terms of the formula denote the low 32 bits of X_i(r)*Y_j and the high 32 bits of X_i(r-1)*Y_j respectively;
N2: data Z_i(2), Z_i(3) ... Z_i(p+2) are reassigned in sequence according to a formula (not reproduced in this text);
N3: the latest data Z_i(1), Z_i(2) ... Z_i(p+2) are spliced together in sequence to obtain the intermediate calculation result Z_i.
7. The GPU-based Montgomery modular multiplication operation method according to claim 5, wherein the thread numbered 1 calculates the intermediate parameter u in the step S322 as follows:
the thread numbered 1 takes data Z_1(1) and calculates the intermediate parameter u from it using a formula (not reproduced in this text) whose constant and the modulus m are inverse elements of each other.
8. The GPU-based Montgomery modular multiplication operation method according to claim 5, wherein the step S323 comprises the steps of:
S3231: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
data Z_i(1), Z_i(2) ... Z_i(p+2) are reassigned in sequence according to a formula (not reproduced in this text),
wherein C_i(r) represents the carry generated when calculating the latest data Z_i(r), M_i(r) represents bits 32*(r-1) to 32*r-1 of the sub-modulus M_i, and the two product terms of the formula denote the low 32 bits of M_i(r)*u and the high 32 bits of M_i(r-1)*u respectively;
S3232: the thread numbered k+1 sends data Z_(k+1)(1) to the thread numbered k, and the thread numbered k records the received data as tmp(k), where k = 1, 2, 3 ... q-1;
S3233: each thread reassigns its own intermediate calculation result Z;
setting tmp(q) = 0, the thread numbered i reassigns the intermediate calculation result Z_i as follows:
data Z_i(1), Z_i(2) ... Z_i(p+1) are reassigned in sequence according to a formula (not reproduced in this text);
S3234: the thread numbered k sends data Z_k(p+1) to the thread numbered k+1, and the thread numbered k+1 records the received data as hmp(k);
S3235: the threads numbered 1, 2 ... q reassign, in sequence, the intermediate calculation results Z they hold;
the thread numbered i reassigns the intermediate calculation result Z_i as follows:
when i = 1, the intermediate calculation result Z_1 remains unchanged;
when i = 2, data Z_i(1), Z_i(2) ... Z_i(p) are reassigned in sequence according to a formula (not reproduced in this text);
when 2 < i < q, data Z_i(1), Z_i(2) ... Z_i(p) are reassigned in sequence according to a formula (not reproduced in this text);
when i = q, data Z_i(1), Z_i(2) ... Z_i(p), Z_i(p+1) are reassigned in sequence according to a formula (not reproduced in this text).
9. The GPU-based Montgomery modular multiplication operation method according to claim 8, wherein the step S324 comprises the steps of:
S3241: extracting an effective value from the intermediate calculation result Z calculated by each thread as a sub-parameter F;
the sub-parameter F_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-parameter F_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-parameter F_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S3242: splicing the sub-parameters F_1, F_2 ... F_q into a parameter f and judging whether f is smaller than the modulus m: if f < m, the parameter flag is 0; if f ≥ m, the parameter flag is 1;
S3243: the threads numbered 1, 2 ... q reassign, in sequence, the sub-parameters F they hold;
the thread numbered i reassigns the sub-parameter F_i it holds as follows:
calculating F_i = F_i + flag*(-M_i) and, if F_i < M_i, borrowing from F_(i+1);
S3244: each thread reassigns its own intermediate calculation result Z;
the thread numbered i reassigns the intermediate calculation result Z_i it holds as follows:
when 1 ≤ i < q, the first 32*p bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 64 bits of Z_i are set to 0;
when i = q, the first 32*(p+1) bits of the intermediate calculation result Z_i are assigned the value of the sub-parameter F_i, and the last 32 bits of Z_i are set to 0.
10. The GPU-based Montgomery modular multiplication operation method according to claim 9, wherein the step S34 comprises the steps of:
S341: extracting an effective value from the intermediate calculation result Z of each thread as a sub-result W;
the sub-result W_i is extracted from the intermediate calculation result Z_i of the thread numbered i as follows:
when 1 ≤ i < q, the sub-result W_i is the first 32*p bits of the intermediate calculation result Z_i;
when i = q, the sub-result W_i is the first 32*(p+1) bits of the intermediate calculation result Z_i;
S342: splicing the sub-results W_1, W_2 ... W_q to obtain the calculation result w.
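Steps N1 to N3 of claim 6 describe the update Z_i = Z_i + X_i * Y_j at the level of 32-bit words, combining the low 32 bits of X_i(r)*Y_j, the high 32 bits of X_i(r-1)*Y_j and a carry. The exact formulas are not preserved, so the sketch below is only one plausible word-level realisation: it merges the accumulation of N1 and the carry propagation of N2 into a single pass, and its function name and sample values are hypothetical.

```python
MASK32 = (1 << 32) - 1

def mul_acc_words(Z, X, y_j):
    """Add X * y_j to Z, where Z has p+2 and X has p 32-bit words, low first."""
    p = len(X)
    out = list(Z)
    carry = 0
    for r in range(p + 2):                       # word positions Z(1) ... Z(p+2)
        lo = (X[r] * y_j) & MASK32 if r < p else 0          # low half of X(r)*Y_j
        hi = (X[r - 1] * y_j) >> 32 if 1 <= r <= p else 0   # high half of X(r-1)*Y_j
        t = out[r] + lo + hi + carry
        out[r] = t & MASK32                      # new 32-bit word
        carry = t >> 32                          # carry into the next word
    return out

def to_words(value, count):
    return [(value >> (32 * k)) & MASK32 for k in range(count)]

def from_words(words):
    return sum(w << (32 * k) for k, w in enumerate(words))

p = 4
x_i, z_i, y_j = 3**80, 7**50, 0xDEADBEEF         # arbitrary sample values
Z_new = mul_acc_words(to_words(z_i, p + 2), to_words(x_i, p), y_j)
```

Splitting each 64-bit product into a low and a high half lets every addition stay within 32-bit words plus a small carry, which maps naturally onto add-with-carry instructions; the same pattern applies to the M_i * u update of step S3231.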
CN202410200625.6A 2024-02-23 2024-02-23 Montgomery modular multiplication operation method based on GPU Active CN117785129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410200625.6A CN117785129B (en) 2024-02-23 2024-02-23 Montgomery modular multiplication operation method based on GPU

Publications (2)

Publication Number Publication Date
CN117785129A true CN117785129A (en) 2024-03-29
CN117785129B CN117785129B (en) 2024-05-07

Family

ID=90391339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410200625.6A Active CN117785129B (en) 2024-02-23 2024-02-23 Montgomery modular multiplication operation method based on GPU

Country Status (1)

Country Link
CN (1) CN117785129B (en)

Also Published As

Publication number Publication date
CN117785129B (en) 2024-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant