CN117971136B - Method and device for accelerating number theory transformation hardware, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117971136B
Authority
CN
China
Legal status
Active
Application number
CN202410381414.7A
Other languages
Chinese (zh)
Other versions
CN117971136A
Inventor
李丽
李仁刚
赵雅倩
李茹杨
李雪雷
郭文烁
Current Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410381414.7A
Publication of CN117971136A
Application granted
Publication of CN117971136B


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to the field of computer technology and discloses a method and device for hardware acceleration of number theory transformation (NTT), an electronic device and a storage medium. The method comprises: acquiring characteristic parameter information of the number theory transformation; determining, according to the characteristic parameter information, the twiddle factors of each butterfly unit in each round of iterative computation, and storing them in the twiddle factor memory corresponding to each butterfly unit; and, while the butterfly units iterate using the twiddle factors stored in those memories, writing the output operands of a butterfly unit into the same target memory while its number of iterations has not reached a preset count threshold, and into several different target memories once its number of iterations reaches that threshold. This reduces the memory resources occupied when NTT is implemented on given hardware and lays a foundation for improving NTT throughput.

Description

Method and device for accelerating number theory transformation hardware, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to a method and apparatus for hardware acceleration of number theory transformation, an electronic device, and a storage medium.
Background
With the development of technologies such as artificial intelligence and cloud computing, servers and similar devices face growing demands on big-data processing capability. The number theory transformation (Number Theoretic Transform, abbreviated as NTT) reduces the time complexity of polynomial multiplication from $O(N^2)$ to $O(N \log N)$, greatly lowering the cost of such computations, and it is highly amenable to parallel computation. However, existing NTT implementations increase memory occupation, so how to apply NTT efficiently to data computation has become an active research topic.
Disclosure of Invention
The application provides a method, a device, an electronic device and a storage medium for hardware acceleration of number theory transformation, which address the increased memory occupation of NTT implementations in the related art.
The first aspect of the application provides a method for accelerating number theory transformation hardware, which comprises the following steps:
Acquiring characteristic parameter information of number theory transformation;
Determining twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information;
storing the twiddle factors into twiddle factor memories corresponding to the butterfly units;
In the iterative computation of the butterfly units based on the twiddle factors stored in the twiddle factor memories, for any butterfly unit, when its number of iterations has not reached a preset count threshold, writing the output operands obtained by its iterative computation into the same target memory;
When the number of iterations of the butterfly unit reaches the preset count threshold, writing the output operands obtained by its iterative computation into several different target memories.
In an optional embodiment, determining the twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information includes:
determining the total number of iteration rounds and the twiddle factor expression of each butterfly unit according to the butterfly unit type and the dimension of the input vector characterized by the characteristic parameter information;
determining the index of the twiddle factor in each round of iterative computation according to the total number of iteration rounds of each butterfly unit;
And determining the twiddle factors of the butterfly units in each round of iterative computation according to the index of the twiddle factor in each round and the twiddle factor expression.
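The three steps above can be sketched in Python for the common radix-2 decimation-in-time (DIT) Cooley-Tukey case. This is an illustration under assumptions, not the patent's exact formulas: the twiddle factor expression is taken to be $\omega^e \bmod q$ with $\omega$ a primitive $N$-th root of unity, the index $e$ of butterfly $k$ in round $s$ follows the standard DIT pattern, and the helper names `total_rounds`, `twiddle_indices` and `twiddles_per_round` are hypothetical.

```python
from typing import Dict, List


def total_rounds(n: int, radix: int) -> int:
    """Number of iteration rounds, i.e. log_radix(n); n must be a power of radix."""
    rounds, size = 0, 1
    while size < n:
        size *= radix
        rounds += 1
    if size != n:
        raise ValueError("n must be a power of the radix")
    return rounds


def twiddle_indices(n: int, s: int) -> List[int]:
    """Twiddle exponent e for each of the n/2 butterflies in round s (1-based).

    Assumes radix-2 DIT with butterfly k = group * half + j, so that
    e = j * n / (2 * half) where half = 2**(s - 1).
    """
    half = 1 << (s - 1)
    stride = n // (2 * half)
    return [(k % half) * stride for k in range(n // 2)]


def twiddles_per_round(n: int, q: int, omega: int) -> Dict[int, List[int]]:
    """Evaluate the assumed twiddle expression omega**e mod q for every round."""
    return {s: [pow(omega, e, q) for e in twiddle_indices(n, s)]
            for s in range(1, total_rounds(n, 2) + 1)}
```

For $N = 8$, round 1 needs only $\omega^0 = 1$ and round 2 only two distinct values, so the early rounds carry very few distinct twiddle factors.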
In an optional implementation manner, the storing the twiddle factor in twiddle factor memories corresponding to the butterfly units includes:
For any butterfly unit, determining a storage address of each twiddle factor in a twiddle factor memory corresponding to the butterfly unit according to iterative computation rounds corresponding to each twiddle factor of the butterfly unit;
storing each twiddle factor in the twiddle factor memory corresponding to the butterfly unit according to its storage address in that memory;
wherein the twiddle factor memories are in one-to-one correspondence with the butterfly units.
In an optional implementation manner, determining the storage address of each twiddle factor in the twiddle factor memory corresponding to the butterfly unit according to the iteration round of each twiddle factor includes:
dividing, in each twiddle factor memory, a twiddle factor storage area for each round of iterative computation of the butterfly unit, where each storage area comprises several storage addresses and the storage areas correspond one-to-one to the iteration rounds;
when the iteration round is lower than a preset round threshold, treating all twiddle factor storage areas corresponding to that round in the twiddle factor memory as identical twiddle factor storage areas, i.e., storage areas that store the same twiddle factors;
when the iteration round is not lower than the preset round threshold, determining the number of identical twiddle factor storage areas according to the dimension of the input vector and the iteration round;
locating the identical twiddle factor storage areas according to that number, so that they store the same twiddle factors;
and, when the number of identical twiddle factor storage areas is determined to be 0, determining that the twiddle factor storage areas of the current iteration round store different twiddle factors.
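One plausible reading of the storage-area scheme above is sketched below. It assumes (hypothetically) radix-2 DIT twiddle indices, that butterfly $k$ of every round is assigned to unit $k \bmod P$, and that a round's storage area collapses to a single shared entry whenever all factors in it are identical; it stores twiddle indices rather than values for readability. `build_twiddle_roms` and its layout tuples are illustrative names only.

```python
from typing import List, Tuple

Layout = List[Tuple[int, int, int]]  # (round, base_address, area_length)


def round_indices(n: int, s: int) -> List[int]:
    """Radix-2 DIT twiddle exponents for the n/2 butterflies of round s (1-based)."""
    half = 1 << (s - 1)
    return [(k % half) * (n // (2 * half)) for k in range(n // 2)]


def build_twiddle_roms(n: int, num_units: int) -> List[Tuple[List[int], Layout]]:
    """Hypothetical ROM image per butterfly unit.

    Butterfly k of each round is assigned to unit k % num_units (an assumption).
    Returns, per unit, (rom_contents, layout) describing each round's storage area.
    """
    rounds = n.bit_length() - 1  # log2(n) for power-of-two n
    images = []
    for unit in range(num_units):
        rom: List[int] = []
        layout: Layout = []
        for s in range(1, rounds + 1):
            # Twiddle indices this unit needs, one per iteration in round s.
            area = round_indices(n, s)[unit::num_units]
            if len(set(area)) == 1:
                area = area[:1]  # all factors identical: keep one shared entry
            layout.append((s, len(rom), len(area)))
            rom.extend(area)
        images.append((rom, layout))
    return images
```

For $N = 8$ and $P = 2$, each unit's ROM holds 4 entries instead of 3 rounds of 2 iterations (6 entries), because rounds 1 and 2 each reduce to one shared entry per unit.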
In an optional implementation manner, for any butterfly unit, writing the output operands obtained by its iterative computation into the same target memory when its number of iterations has not reached the preset count threshold includes:
For any butterfly unit, when its number of iterations has not reached the preset count threshold, determining the corresponding target memory address according to the memory read address of the input operands of the current round of iterative computation;
And writing the output operands obtained by the iterative computation of the butterfly unit into the same target memory according to the target memory address.
In an optional implementation manner, for any butterfly unit, determining the corresponding target memory address according to the memory read address of the input operands of the current round when its number of iterations has not reached the preset count threshold includes:
For any butterfly unit, when its number of iterations has not reached the preset count threshold, determining the target memory according to the memory read address of the input operands of the current round of iterative computation;
and determining the target memory address of each output operand in the target memory according to the characteristic parameter information of the number theory transformation.
In an optional implementation manner, the determining, according to the characteristic parameter information of the number theory transformation, a target memory address of each output operand in the target memory includes:
determining a first target memory address according to the dimension of the input vector represented by the characteristic parameter information of the number theory transformation, the number of butterfly units and the current iterative calculation round;
Determining several second target memory addresses according to the first target memory address;
the target memory address comprises the first target memory address and a plurality of second target memory addresses.
In an optional implementation manner, the determining the first target memory address according to the dimension of the input vector, the number of butterfly units and the current iteration calculation round, which are characterized by the feature parameter information of the number theory transformation, includes:
Determining a first target memory address based on the following formula:
Wherein, the quantities in the formula represent, respectively, the first target memory address, the dimension of the input vector, the number of butterfly units, the radix of the butterfly unit, the current iteration round, a first positioning parameter and a second positioning parameter.
In an optional embodiment, the determining a number of second target memory addresses according to the first target memory address includes:
Determining a plurality of second target memory addresses based on the following formula:
Wherein, the quantities in the formula represent, respectively, each second target memory address, the first target memory address, the dimension of the input vector, the number of butterfly units and the current iteration round.
In an optional implementation manner, writing the output operands obtained by the iterative computation of the butterfly unit into several different target memories when its number of iterations reaches the preset count threshold includes:
when the number of iterations of the butterfly unit reaches the preset count threshold, determining the arrangement order information of the butterfly unit according to the characteristic parameter information of the number theory transformation;
determining several target memories according to the arrangement order information of the butterfly unit and the actual sequence number of the butterfly unit;
and writing the output operands obtained by the iterative computation of the butterfly unit into the respective target memories.
In an optional implementation manner, the determining the arrangement order information of the butterfly unit according to the feature parameter information of the number theory transformation includes:
And determining the arrangement sequence information of the butterfly units according to the quantity of the butterfly units represented by the characteristic parameter information of the number theory transformation and the current iterative calculation round.
In an optional implementation manner, the determining the arrangement sequence information of the butterfly units according to the number of the butterfly units characterized by the characteristic parameter information of the number theory transformation and the current iterative calculation round includes:
Determining the arrangement order information of the butterfly units based on the following formula:
Wherein, the quantities in the formula represent, respectively, the arrangement order information of the butterfly units, the number of butterfly units, the radix of the butterfly unit, the current iteration round, a first order parameter, a second order parameter and a third order parameter.
In an optional implementation manner, the determining a plurality of target memories according to the arrangement sequence information of the butterfly unit and the actual sequence number of the butterfly unit includes:
determining the first order parameter, the second order parameter and the third order parameter of the butterfly unit according to its arrangement order information and its actual sequence number;
And determining several target memories according to the first, second and third order parameters of the butterfly unit.
In an alternative embodiment, determining several target memories according to the first order parameter, the second order parameter and the third order parameter of the butterfly unit includes:
determining several target butterfly units according to the first order parameter and the third order parameter of the butterfly unit;
And selecting a target memory in each target butterfly unit according to the second order parameter, so as to determine several target memories.
In an alternative embodiment, the determining a plurality of target butterfly units according to the first order parameter and the third order parameter of the butterfly units includes:
A number of target butterfly units are determined based on the following formula:
Wherein, the quantities in the formula represent, respectively, the sequence number of each target butterfly unit, the radix of the butterfly unit, the number of butterfly units, the current iteration round, the first order parameter and the third order parameter.
In an alternative embodiment, the selecting, according to the second order parameter, the target memory in each of the target butterfly units includes:
taking the sequence number of the target butterfly unit as a first coordinate of a target memory;
Taking the second sequence parameter as a second coordinate of the target memory;
And selecting a target memory from each target butterfly unit according to the first coordinate and the second coordinate of the target memory.
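The two-coordinate selection above can be modeled as a grid of operand memories. In this sketch the first coordinate indexes the target butterfly unit and the second indexes a bank inside that unit (0 for the even RAM and 1 for the odd RAM in the radix-2 case, an assumption); the write address is passed in because the target butterfly units and the second order parameter would come from the formulas of the preceding embodiments. `OperandMemoryGrid` and `scatter_outputs` are hypothetical names.

```python
from typing import List, Optional


class OperandMemoryGrid:
    """Hypothetical operand memories addressed by two coordinates.

    First coordinate: target butterfly unit sequence number.
    Second coordinate: second order parameter, modeled as the bank index
    inside the unit (0 = even RAM, 1 = odd RAM for radix 2).
    """

    def __init__(self, num_units: int, radix: int, depth: int) -> None:
        self.banks: List[List[List[Optional[int]]]] = [
            [[None] * depth for _ in range(radix)] for _ in range(num_units)
        ]

    def select(self, unit_seq: int, second_param: int) -> List[Optional[int]]:
        """Return the memory at (first coordinate, second coordinate)."""
        return self.banks[unit_seq][second_param]


def scatter_outputs(grid: OperandMemoryGrid, outputs: List[int],
                    target_units: List[int], second_param: int, address: int) -> None:
    """Write each output operand into the memory of a different target unit."""
    if len(outputs) != len(target_units) or len(set(target_units)) != len(target_units):
        raise ValueError("need one distinct target unit per output operand")
    for value, unit in zip(outputs, target_units):
        grid.select(unit, second_param)[address] = value
```

For example, with four units, `scatter_outputs(grid, [11, 22], [1, 3], 0, 5)` places the two outputs of one radix-2 butterfly in the even RAMs of units 1 and 3.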
In an alternative embodiment, the method further comprises:
Determining the preset count threshold based on the following formula:
Wherein, the quantities in the formula represent, respectively, the preset count threshold, the dimension of the input vector, the number of butterfly units and the radix of the butterfly unit.
A second aspect of the present application provides a number theory transformation hardware acceleration device, including:
The acquisition module is used for acquiring the characteristic parameter information of the number theory transformation;
the determining module is used for determining the twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information;
The storage module is used for storing the twiddle factors to twiddle factor memories corresponding to the butterfly units;
The first acceleration module is used for, during the iterative computation of the butterfly units based on the twiddle factors stored in the twiddle factor memories, writing the output operands obtained by any butterfly unit into the same target memory when its number of iterations has not reached the preset count threshold;
and the second acceleration module is used for writing the output operands obtained by the butterfly unit into several different target memories when its number of iterations reaches the preset count threshold.
A third aspect of the present application provides an electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored by the memory such that the at least one processor performs the method as described above in the first aspect and the various possible designs of the first aspect.
A fourth aspect of the application provides a computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method as described above for the first aspect and the various possible designs of the first aspect.
The technical scheme of the application has the following advantages:
In the method, device, electronic device and storage medium for hardware acceleration of number theory transformation provided by the application, the characteristic parameter information of the number theory transformation is acquired, the twiddle factors of each butterfly unit in each round of iterative computation are determined from it and stored in the twiddle factor memory of each butterfly unit, and, during iterative computation, the output operands of a butterfly unit are written into the same target memory while its number of iterations has not reached the preset count threshold and into several different target memories once it has. This reduces the memory resources occupied when NTT is implemented on given hardware and lays a foundation for improving NTT throughput.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following descriptions are some embodiments of the present application, and other drawings may be obtained according to the drawings for those skilled in the art.
FIG. 1 is a schematic structural diagram of the number theory transformation hardware acceleration system on which the embodiments of the application are based;
FIG. 2 is a flow chart of a method for accelerating the number theory transformation hardware according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a twiddle factor memory according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an exemplary operand memory scheduling logic provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of another exemplary operand memory scheduling logic provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of still another exemplary operand memory scheduling logic provided in accordance with an embodiment of the present application;
FIG. 7 is a flowchart of NTT calculation according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an exemplary number theory transformation hardware acceleration system according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a structure of a hardware acceleration device for number theory transformation according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concept in any way, but to illustrate the inventive concept to those skilled in the art by reference to specific embodiments.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. In the following description of the embodiments, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
With the development of artificial intelligence, cloud computing and related technologies, data has become a core factor of production and a new driver of high-quality economic development. However, data is intangible and non-consumable and can be copied at near-zero cost, so leakage during data circulation and processing cannot be ignored. One industry analysis predicts that 50% of large organizations will adopt privacy-enhancing computation to process data in untrusted environments and in multi-party data analytics. Homomorphic encryption in particular is one of the key techniques for making data "usable but invisible" and protecting data privacy, and has been called the holy grail of cryptography. Yet homomorphic encryption schemes are complex to design and slow to compute, which is a bottleneck to their wide deployment. In CKKS, the homomorphic encryption scheme for floating-point numbers, key switching is the most time-consuming part of homomorphic multiplication, and NTT accounts for up to 70% of the key-switching time; accelerating NTT is therefore important for putting homomorphic encryption into practice and protecting data privacy.
On the other hand, in lattice-based post-quantum encryption and decryption schemes, naive polynomial multiplication takes more than 95% of the encryption and decryption time, and NTT reduces the time complexity of polynomial multiplication from $O(N^2)$ to $O(N \log N)$. NTT is therefore an important general-purpose algorithm, and NTT acceleration schemes based on CPUs, GPUs, FPGAs, ASICs and other platforms have been proposed one after another. Thanks to its divide-and-conquer structure, NTT is highly parallelizable, so hardware platforms can improve its performance significantly. When NTT is implemented on given hardware, however, memory scheduling is complex, which makes the algorithm harder to design. Designing an efficient, compact memory scheduling scheme is therefore a key challenge in accelerating NTT.
To solve the above problems, the embodiments of the present application provide a method, an apparatus, an electronic device and a storage medium for hardware acceleration of number theory transformation. The method acquires the characteristic parameter information of the number theory transformation, determines from it the twiddle factors of each butterfly unit in each round of iterative computation, and stores them in the twiddle factor memory of each butterfly unit. During the iterative computation, for any butterfly unit, the output operands are written into the same target memory while its number of iterations has not reached the preset count threshold, and into several different target memories once the threshold is reached. This reduces the memory resources occupied when NTT is implemented on given hardware and lays a foundation for improving NTT throughput.
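To make the complexity claim concrete, the following sketch multiplies two polynomials modulo $x^N - 1$ over $\mathbb{Z}_q$ both naively ($O(N^2)$) and with an iterative radix-2 Cooley-Tukey NTT ($O(N \log N)$): forward transforms, pointwise product, inverse transform. It is a textbook cyclic NTT chosen for illustration; lattice schemes such as Kyber use negacyclic and other variants, and the toy parameters $q = 17$, $N = 8$, $\omega = 2$ are assumptions picked so that $\omega$ is a primitive 8th root of unity.

```python
from typing import List


def naive_cyclic_mul(a: List[int], b: List[int], q: int) -> List[int]:
    """Schoolbook product reduced mod x^N - 1: O(N^2) multiplications."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c


def bit_reverse_permute(x: List[int]) -> List[int]:
    """Reorder x by bit-reversed index (len(x) must be a power of two >= 2)."""
    bits = len(x).bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(len(x))]


def ntt(x: List[int], omega: int, q: int) -> List[int]:
    """Iterative Cooley-Tukey DIT NTT; omega is a primitive N-th root of unity mod q."""
    n = len(x)
    a = bit_reverse_permute(x)
    half = 1
    while half < n:  # one round of N/2 radix-2 butterflies
        w_step = pow(omega, n // (2 * half), q)
        for start in range(0, n, 2 * half):
            w = 1
            for j in range(half):
                u = a[start + j]
                t = w * a[start + j + half] % q
                a[start + j] = (u + t) % q
                a[start + j + half] = (u - t) % q
                w = w * w_step % q
        half *= 2
    return a


def intt(x: List[int], omega: int, q: int) -> List[int]:
    """Inverse NTT: forward NTT with omega^-1, then scale by N^-1 mod q."""
    inv_n = pow(len(x), -1, q)
    return [v * inv_n % q for v in ntt(x, pow(omega, -1, q), q)]


def ntt_cyclic_mul(a: List[int], b: List[int], omega: int, q: int) -> List[int]:
    """Cyclic product via NTT: transform, multiply pointwise, transform back."""
    fa, fb = ntt(a, omega, q), ntt(b, omega, q)
    return intt([x * y % q for x, y in zip(fa, fb)], omega, q)
```

With `q = 17` and `omega = 2`, `ntt_cyclic_mul([1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 0, 0, 0, 0, 0], 2, 17)` returns `[5, 16, 0, 1, 11, 11, 0, 0]`, matching `naive_cyclic_mul`.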
The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
First, the structure of the number theory transformation hardware acceleration system on which the present application is based will be described:
The method, the device, the electronic equipment and the storage medium for accelerating the number theory transformation hardware are suitable for realizing the acceleration of the number theory transformation through memory scheduling. Fig. 1 is a schematic structural diagram of a number-theory transformation hardware acceleration system according to an embodiment of the present application, which mainly includes a twiddle factor memory, a butterfly unit, a memory and a number-theory transformation hardware acceleration device. Specifically, the number theory transformation hardware accelerating device is used for storing the twiddle factor of the butterfly unit into a twiddle factor memory, and storing an output operand generated in the iterative calculation process of the butterfly unit into a memory.
An embodiment of the application provides a number theory transformation hardware acceleration method that schedules the memory of the NTT butterfly units so as to accelerate the number theory transformation. The executing entity of this embodiment is an electronic device capable of hardware acceleration of number theory transformation, such as a server, a desktop computer, a notebook computer or a tablet computer.
As shown in fig. 2, a flow chart of a method for accelerating a number-theory transformation hardware according to an embodiment of the present application is shown, where the method includes:
Step 201, acquiring the characteristic parameter information of the number theory transformation.
The characteristic parameter information of the number theory transformation may include the number of NTT points $N$ (the dimension of the input vector), the modulus $q$, the butterfly unit type, and so on. The input vector may be a ciphertext vector in homomorphic encryption, such as the coefficient vector of a ciphertext polynomial, or the coefficient vector of a polynomial in a cyclotomic integer ring. The number theory transformation is used in key generation, encryption and decryption of the Kyber algorithm, where it takes up most of the encryption and decryption time; in homomorphic ciphertext computation, the polynomial is the ciphertext polynomial. NTT in lattice-based encryption schemes and NTT in homomorphic encryption differ mainly in the polynomial degree $N$ and the value of the modulus $q$.
Step 202, determining the twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information.
It should be noted that the number theory transformation is implemented by iterative computation of a plurality of butterfly units, and the iterative computation of the butterfly units is implemented based on twiddle factors.
Specifically, twiddle factor determination logic of the butterfly units can be determined according to characteristic parameters of number theory transformation, and twiddle factors of the butterfly units in each round of iterative computation are determined according to the twiddle factor determination logic.
And 203, storing the twiddle factors in twiddle factor memories corresponding to the butterfly units.
The twiddle factor Memory is specifically Read Only Memory (ROM), and the twiddle factor Memory corresponds to the butterfly units one by one, and in determining twiddle factors of each butterfly unit in each round of iterative computation, the twiddle factors are stored in twiddle factor memories of the corresponding butterfly units, so that the corresponding twiddle factors are Read from the twiddle factor memories in the iterative computation process of the butterfly units.
In step 204, in the iterative computation process of the butterfly unit based on the twiddle factors stored in the twiddle factor memory, for any butterfly unit, when the number of iterative computation times of the butterfly unit does not reach the preset number of times threshold, writing a plurality of output operands obtained by iterative computation of the butterfly unit into the same target memory.
It should be noted that, taking the base 2C-T butterfly unit as an example, there are 2 input operands and 2 output operands, so 2 pseudo-dual port RAMs are needed to store the operands, and these two RAMs are respectively an even RAM (even memory) and an odd RAM (odd memory). When the iterative computation times of the butterfly units do not reach a preset time threshold, the input operands and the output operands of the butterfly units are determined to be independent of each other, so that a plurality of output operands obtained by the iterative computation of the butterfly units can be written into the same target memory, such as even memory or odd memory;
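A minimal Python model of one radix-2 C-T butterfly is shown below. It assumes the textbook form X = u + w·v, Y = u - w·v (mod q), where u and v would come from the unit's even and odd RAMs and w from its twiddle factor ROM; the function name is illustrative.

```python
def ct_butterfly(u: int, v: int, w: int, q: int) -> tuple:
    """Radix-2 Cooley-Tukey butterfly: (u, v) -> (u + w*v, u - w*v) mod q."""
    t = (w * v) % q  # the only modular multiplication in the butterfly
    return (u + t) % q, (u - t) % q


print(ct_butterfly(3, 5, 9, 17))  # (14, 9)
```

Both outputs reuse the single product w·v, which is why one multiplier per butterfly unit suffices.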
Step 205, when the number of iterative computations of the butterfly unit reaches the preset count threshold, writing the plurality of output operands obtained by the iterative computation of the butterfly unit into a plurality of different target memories.
Specifically, when the number of iterative computations of the butterfly unit reaches the preset count threshold, the input and output operands of the butterfly units are interdependent, so the plurality of output operands of a butterfly unit are written into different target memories; for example, the two output operands of a radix-2 butterfly unit are written either into the even memories of two different target butterfly units or into the odd memories of two different target butterfly units.
On the basis of the above embodiment, to improve the efficiency of determining the twiddle factors, in an embodiment, determining the twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information includes:
Step 2021, determining the total number of iterative computation rounds and the twiddle factor expression of each butterfly unit according to the butterfly unit type and the dimension of the input vector represented by the characteristic parameter information;
Step 2022, determining the index of the twiddle factor in each round of iterative computation according to the total number of iterative computation rounds of each butterfly unit;
Step 2023, determining the twiddle factor of each butterfly unit in each round of iterative computation according to the index of each round and the twiddle factor expression.
Taking the determination of the twiddle factors of the radix-2 C-T butterfly unit as an example, let the modulus be $q$ and let the input vector $a = (a_0, a_1, \dots, a_{N-1})$ be an $N$-dimensional vector over $\mathbb{Z}_q$. The NTT of $a$ is computed as:

$$\hat{a}_j = \sum_{i=0}^{N-1} a_i \, \omega^{ij} \bmod q, \qquad j = 0, 1, \dots, N-1$$

where $\omega$ is a primitive $N$-th root of unity in $\mathbb{Z}_q$, and the powers of $\omega$ are the twiddle factors required by the NTT. To speed up the NTT, these twiddle factors are pre-computed and ordered according to the number of butterfly units. For input operands $u$, $v$ and twiddle factor $w$, the radix-2 C-T butterfly unit computes:

$$X = (u + w v) \bmod q, \qquad Y = (u - w v) \bmod q$$

Thus, in each stage of iteration, the twiddle factor required by each butterfly unit is a power of $\omega$ whose exponent depends on the stage and on the position of the butterfly. If the subscripts of the input vector $a$ are in natural order, then in each stage the exponents of the twiddle factors are arranged in bit-reversed order.
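The bit-reversed ordering of the twiddle exponents can be checked with a short Python sketch. It assumes an in-place radix-2 C-T NTT with natural-order input and bit-reversed output, in which block k of stage s uses ω raised to the (log2 N - 1)-bit reversal of k; this is one standard arrangement, not necessarily the exact table layout of the embodiment. The result is compared against the defining sum.

```python
def bit_reverse(x: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def naive_ntt(a: list, w: int, q: int) -> list:
    """Defining sum: a_hat[j] = sum_i a[i] * w^(i*j) mod q (natural order)."""
    n = len(a)
    return [sum(a[i] * pow(w, i * j, q) for i in range(n)) % q for j in range(n)]


def stage_twiddle_exponents(n: int) -> list:
    """Exponent of w used by block k of stage s: (log2(n) - 1)-bit reversal of k."""
    logn = n.bit_length() - 1
    return [[bit_reverse(k, logn - 1) for k in range(1 << s)] for s in range(logn)]


def ntt_ct(a: list, w: int, q: int) -> list:
    """In-place radix-2 C-T NTT: natural-order input, bit-reversed output."""
    n = len(a)
    a = list(a)
    for s, exps in enumerate(stage_twiddle_exponents(n)):
        d = n >> (s + 1)  # distance between the two operands of a butterfly
        for k, e in enumerate(exps):  # block k spans indices [2dk, 2dk + 2d)
            tw = pow(w, e, q)
            for j in range(2 * d * k, 2 * d * k + d):
                t = tw * a[j + d] % q
                a[j], a[j + d] = (a[j] + t) % q, (a[j] - t) % q
    return a


q, w = 17, 9  # 9 is a primitive 8th root of unity mod 17
x = [1, 2, 3, 4, 5, 6, 7, 8]
out, ref = ntt_ct(x, w, q), naive_ntt(x, w, q)
assert all(out[p] == ref[bit_reverse(p, 3)] for p in range(8))
print(stage_twiddle_exponents(8))  # [[0], [0, 2], [0, 2, 1, 3]]
```

Stage s needs only 2^s distinct twiddle factors, which is why the early stages can share one value across all butterfly units.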
Specifically, in an embodiment, for any butterfly unit, the storage address of each twiddle factor in the twiddle factor memory of that butterfly unit may be determined according to the iterative computation round in which the twiddle factor is used, and each twiddle factor is then stored at that address in the twiddle factor memory of the butterfly unit.
The twiddle factor memories are in one-to-one correspondence with the butterfly units.
Specifically, because the butterfly unit uses different twiddle factors in different rounds of iterative computation, the twiddle factors can be stored in the twiddle factor memory of the butterfly unit in the order in which they are used.
Specifically, in an embodiment, each twiddle factor memory may be divided into twiddle factor storage areas, one per round of iterative computation of the butterfly unit, each storage area containing a plurality of storage addresses. When the iterative computation round is lower than a preset round threshold, the storage areas corresponding to that round are identical in all twiddle factor memories, i.e., they store the same twiddle factors. When the iterative computation round is not lower than the preset round threshold, the number of identical storage areas is determined according to the dimension of the input vector and the iterative computation round, and the identical storage areas are located according to this number so that they store the same twiddle factors. When this number is determined to be 0, the storage areas corresponding to the current round store different twiddle factors.
Fig. 3 is a schematic diagram of the twiddle factor memories according to an embodiment of the present application, showing the data stored in the twiddle factor ROMs (twiddle factor memories) for particular values of N and of the number of butterfly units PE. With PE butterfly units, PE blocks of twiddle factor ROM are needed to store the pre-computed twiddle factors. In the stages of iteration before the preset round threshold, the data stored at the same address of the twiddle factor ROMs of all butterfly units is identical, i.e., each twiddle factor is stored, in order, in all PE ROM blocks. In the stages from the preset round threshold onward, the data stored at the same address is identical only within each group of consecutive butterfly units, i.e., each twiddle factor is stored, in order, in the ROM blocks of one such group. The integers in Fig. 3 are the indices (exponents) of the twiddle factors. Stages 0 to 3 in Fig. 3 denote the 0th to 3rd stages of iterative computation; each stage comprises several rounds of iterative computation, and each bracketed range is the twiddle factor storage area corresponding to one round.
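To make the sharing pattern concrete, the Python sketch below fills one twiddle table per butterfly unit using the bit-reversed exponent ordering of a standard radix-2 C-T NTT. It assumes, hypothetically, that at address c of stage s unit p processes butterfly c·PE + p (butterflies numbered by their upper operand index); the actual unit-to-butterfly mapping behind Fig. 3 may differ. Under this assumption all units hold the same value at an address while a block of the stage still contains at least PE butterflies, and afterwards only groups of consecutive units share.

```python
def bit_reverse(x: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def build_twiddle_roms(n: int, pe: int) -> list:
    """roms[p][s][c] = twiddle exponent read by unit p at address c of stage s.

    Hypothetical mapping: at address c of stage s, unit p handles butterfly
    b = c * pe + p, which lies in block b // d with d = n >> (s + 1)
    butterflies per block; the block's twiddle is w^brv(block).
    """
    logn = n.bit_length() - 1
    depth = n // (2 * pe)  # addresses per stage in each ROM
    roms = [[[] for _ in range(logn)] for _ in range(pe)]
    for s in range(logn):
        d = n >> (s + 1)
        for c in range(depth):
            for p in range(pe):
                block = (c * pe + p) // d
                roms[p][s].append(bit_reverse(block, logn - 1))
    return roms


def distinct_per_address(roms: list, s: int, c: int) -> int:
    """Number of different values the PE ROMs hold at address c of stage s."""
    return len({rom[s][c] for rom in roms})


roms = build_twiddle_roms(32, 4)
print([distinct_per_address(roms, s, 0) for s in range(5)])  # [1, 1, 1, 2, 4]
```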
When the butterfly units are used for polynomial multiplication, let the polynomials be taken in $\mathbb{Z}_q[x]/(x^N+1)$, let $\omega$ be a primitive $N$-th root of unity in $\mathbb{Z}_q$, and let $\psi$ be a primitive $2N$-th root of unity in $\mathbb{Z}_q$ with $\psi^2 = \omega$. The operation $\circ$ denotes coordinate-wise multiplication. Each coefficient vector is first multiplied coordinate-wise by $(1, \psi, \psi^2, \dots, \psi^{N-1})$ and then transformed, so that:

$$\hat{a}_j = \sum_{i=0}^{N-1} \psi^{i} a_i \, \omega^{ij} = \sum_{i=0}^{N-1} a_i \, \psi^{(2j+1)i} \bmod q$$

The butterfly unit keeps the radix-2 C-T form, with the twiddle factor now a power of $\psi$ rather than of $\omega$.
By changing the values of the pre-computed twiddle factors, i.e., replacing the powers of $\omega$ with the corresponding powers of $\psi$, the separate coordinate-wise multiplications by the powers of $\psi$ are eliminated, which accelerates polynomial multiplication and improves the iterative computation efficiency of the butterfly units. The twiddle factors in this formula are arranged in the same way as in the above embodiment.
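The effect of folding ψ into the twiddle factors can be checked numerically. The Python sketch below assumes the common merged arrangement in which block k of stage s uses ψ raised to the log2(N)-bit reversal of 2^s + k; it compares that transform with the explicit version (pre-multiply by ψ^i, then a plain NTT with ω = ψ^2) and confirms that coordinate-wise products correspond to multiplication modulo x^N + 1. Parameters q = 17, N = 8, ψ = 3 are illustrative.

```python
def bit_reverse(x: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def ntt_merged(a: list, psi: int, q: int) -> list:
    """C-T NTT with psi folded into the twiddles (no separate a_i * psi^i pass).

    Block k of stage s uses psi^brv(2^s + k) over log2(n) bits; the output
    is in bit-reversed order.
    """
    n = len(a)
    logn = n.bit_length() - 1
    a = list(a)
    for s in range(logn):
        d = n >> (s + 1)
        for k in range(1 << s):
            tw = pow(psi, bit_reverse((1 << s) + k, logn), q)
            for j in range(2 * d * k, 2 * d * k + d):
                t = tw * a[j + d] % q
                a[j], a[j + d] = (a[j] + t) % q, (a[j] - t) % q
    return a


def ntt_explicit(a: list, psi: int, q: int) -> list:
    """Reference: multiply a_i by psi^i, then a naive NTT with w = psi^2."""
    n = len(a)
    logn = n.bit_length() - 1
    w = psi * psi % q
    scaled = [a[i] * pow(psi, i, q) % q for i in range(n)]
    hat = [sum(scaled[i] * pow(w, i * j, q) for i in range(n)) % q for j in range(n)]
    return [hat[bit_reverse(p, logn)] for p in range(n)]


def negacyclic_mul(a: list, b: list, q: int) -> list:
    """Schoolbook product in Z_q[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]  # x^n = -1
    return [x % q for x in c]


q, psi = 17, 3  # 3 has order 16 = 2N modulo 17, so N = 8
a, b = [1, 2, 3, 4, 5, 6, 7, 8], [5, 0, 16, 2, 0, 1, 3, 9]
assert ntt_merged(a, psi, q) == ntt_explicit(a, psi, q)
prod_hat = [x * y % q for x, y in zip(ntt_merged(a, psi, q), ntt_merged(b, psi, q))]
assert prod_hat == ntt_merged(negacyclic_mul(a, b, q), psi, q)
```

Because the transform is a bijection, equality of the transformed vectors in the last check is equivalent to the product being correct.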
On the basis of the foregoing embodiment, in an embodiment, for any butterfly unit, writing the plurality of output operands obtained by the iterative computation of the butterfly unit into the same target memory when the number of iterative computations has not reached the preset count threshold includes:
Step 2041, for any butterfly unit, when the number of iterative computations of the butterfly unit has not reached the preset count threshold, determining the corresponding target memory address according to the memory read address of the input operands of the current round of iterative computation;
Step 2042, writing the plurality of output operands obtained by the iterative computation of the butterfly unit into the same target memory according to the target memory address.
It should be noted that, apart from the twiddle factor, each radix-2 C-T butterfly unit has 2 input operands and 2 output operands, so each butterfly unit requires 2 blocks of pseudo-dual-port RAM to store operands, namely an even RAM and an odd RAM. With PE butterfly units, the total number of pseudo-dual-port RAMs is 2PE, and each RAM stores N/(2PE) operands. Let the subscripts of the input vector be in natural order. The first N/2 coordinate components are written in turn into the PE blocks of even operand RAM, and the last N/2 coordinate components are written in turn into the PE blocks of odd operand RAM.
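The initial placement can be sketched in Python. The index formula is not legible in this text, so the sketch assumes that within each bank type the coordinates are dealt round-robin over the PE units (coordinate r of a bank type going to unit r mod PE at address r div PE); the function names are hypothetical. The check confirms the property the scheduling relies on: the operands of every first-stage butterfly sit at the same address of one unit's RAMs. Setting radix=8 gives the analogous placement over RAMs 0 to 7.

```python
def initial_layout(n: int, pe: int, radix: int = 2) -> dict:
    """Map (unit, bank, address) -> coordinate index of the input vector.

    Hypothetical rule: coordinate i goes to bank i // (n // radix) (for
    radix 2, bank 0 = even RAM and bank 1 = odd RAM); within a bank type
    the coordinates are dealt round-robin over the pe units.
    """
    seg = n // radix  # coordinates per bank type
    layout = {}
    for i in range(n):
        bank, r = divmod(i, seg)
        layout[(r % pe, bank, r // pe)] = i
    return layout


def first_stage_groups_aligned(n: int, pe: int, radix: int = 2) -> bool:
    """True if the inputs i, i + n/radix, ... of every first-stage butterfly
    sit at the same address of one unit's banks 0 .. radix-1."""
    layout = initial_layout(n, pe, radix)
    seg, depth = n // radix, n // (radix * pe)
    return all(
        [layout[(u, b, addr)] for b in range(radix)]
        == [layout[(u, 0, addr)] + b * seg for b in range(radix)]
        for u in range(pe)
        for addr in range(depth)
    )


lay = initial_layout(32, 4)
print(lay[(1, 0, 2)], lay[(1, 1, 2)])  # 9 25
```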
Specifically, the output operands of a butterfly unit are the input operands of the next round of iterative computation. While the number of iterative computations of the butterfly unit has not reached the preset count threshold, the input and output operands of each butterfly unit are independent of those of the other butterfly units, so the plurality of output operands of a butterfly unit can be written into the same target memory: the corresponding target memory address is determined from the memory read address of the input operands of the current round, and the output operands generated in the current round are written to that address. This improves the efficiency of memory address allocation and avoids wasting storage space.
Specifically, in an embodiment, for any butterfly unit, when its number of iterative computations has not reached the preset count threshold, the target memory may be determined according to the memory read address of the input operands of the current round, and the target memory address of each output operand in the target memory is determined according to the characteristic parameter information of the number theory transformation.
Here, the characteristic parameter information of the number theory transformation may refer to the dimension of the input vector, the number of butterfly units, the current iterative computation round and the like.
Specifically, in an embodiment, a first target memory address may be determined according to the dimension of the input vector, the number of butterfly units and the current iterative computation round represented by the characteristic parameter information, and a plurality of second target memory addresses are then determined from the first target memory address.
The target memory addresses comprise a first target memory address and a plurality of second target memory addresses.
Specifically, in one embodiment, the first target memory address may be determined based on the following formula:
where the first target memory address is expressed in terms of the dimension N of the input vector, the number PE of butterfly units, the radix of the butterfly unit, the current iterative computation round, and a first and a second positioning parameter, each taking values in its own range.
Accordingly, in one embodiment, the plurality of second target memory addresses may be determined based on the following formula:
where each second target memory address is expressed in terms of the first target memory address, the dimension N of the input vector, the number PE of butterfly units and the current iterative computation round.
When the butterfly unit is a radix-2 butterfly unit, the second target memory address is determined as follows:
where the second target memory address is expressed in terms of the first target memory address, the dimension N of the input vector, the number PE of butterfly units and the current iterative computation round.
Specifically, taking the radix-2 butterfly unit as an example, Figs. 4 and 5 are exemplary operand memory scheduling logic diagrams according to an embodiment of the present application: Fig. 4 shows the operand memory scheduling logic of the butterfly units in the 0th stage of iterative computation for N = 32 and PE = 4, and Fig. 5 shows it for the 1st stage, where each cycle corresponds to one round of iterative computation. Let the two input operands of the radix-2 C-T butterfly unit be $u$ and $v$ and the twiddle factor be $w$; the radix-2 C-T butterfly unit computes:

$$X = (u + w v) \bmod q, \qquad Y = (u - w v) \bmod q$$

The two RAMs corresponding to the p-th butterfly unit are its even memory and its odd memory. In the stages of iteration before the preset count threshold, the p-th butterfly unit fetches the operands $u$ and $v$ from the same address of its even and odd memories, this read address taking one of two forms. If the read address is of the first form, the two output operands of the p-th butterfly unit are written into its even memory; if it is of the second form, they are written into its odd memory. The write addresses of the two output operands $X$ and $Y$ are the first and second target memory addresses given above.
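The local write rule can be illustrated with a small Python simulation of one butterfly unit in stage 0 (the setting of Fig. 4, N = 32 and PE = 4, gives a RAM depth of N/(2PE) = 4). The concrete rule is an assumption chosen to reproduce the behaviour described in the text, not the patent's address formula: reads alternate between an address a in the lower half and one in the upper half; a lower-half read sends both outputs to the even RAM at addresses 2a and 2a+1, an upper-half read sends them to the odd RAM; and the second output waits one beat in a register. Operands are tracked by their local coordinate label rather than by value.

```python
from collections import defaultdict


def stage0_local_writes(depth: int):
    """Simulate one butterfly unit's writes in stage 0 (hypothetical rule).

    Before the stage, even[a] holds local index a and odd[a] holds a + depth.
    Outputs keep the labels of the inputs they replace (in-place butterfly).
    Returns the new bank contents and the banks written in each cycle.
    """
    half = depth // 2
    even, odd = [None] * depth, [None] * depth
    writes = defaultdict(list)  # cycle -> names of banks written that cycle
    # Interleaved read order 0, half, 1, half + 1, ... alternates target banks.
    order = [c // 2 + (c % 2) * half for c in range(depth)]
    for cycle, a in enumerate(order):
        x, y = a, a + depth  # labels of the outputs X and Y
        if a < half:
            bank, name, base = even, "even", 2 * a
        else:
            bank, name, base = odd, "odd", 2 * (a - half)
        bank[base] = x  # X is written in the same cycle
        writes[cycle].append(name)
        bank[base + 1] = y  # Y is held in a register for one beat
        writes[cycle + 1].append(name)
    return even, odd, writes


even, odd, writes = stage0_local_writes(4)  # N = 32, PE = 4 -> depth 4
print(even, odd)  # [0, 4, 1, 5] [2, 6, 3, 7]
```

After the stage, every address holds a pair whose labels differ by depth/2, which is exactly the operand distance of the next stage, and no RAM receives two writes in one cycle.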
The preset count threshold is calculated by the following formula:
where the preset count threshold is expressed in terms of the dimension N of the input vector, the number PE of butterfly units and the radix of the butterfly unit.
It should be noted that, when the butterfly unit is a radix-2 butterfly unit, an N-point NTT requires log2 N stages of radix-2 butterfly iteration. In the stages before the threshold, the input and output operands of each butterfly unit are independent of those of the other butterfly units: the two input operands are read from the same address of the unit's two RAMs, and the two output operands are written into the same RAM, the second output operand being held in a register for one beat before it is written. In the remaining stages, the input and output operands of the butterfly units are interdependent, and the two output operands of each butterfly unit are written into different RAMs.
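The threshold formula itself is not legible in this text. One way to arrive at a value consistent with Figs. 4 to 6 is sketched below in Python, under our own assumption that operands are distributed round-robin over the units, so that a stage can keep its outputs inside the producing unit exactly when the next stage's butterfly span is still a multiple of PE. This yields log_r(N/(r·PE)) local stages for radix r; for N = 32 and PE = 4 that is stages 0 and 1, with stages 2 to 4 writing across units, as in the figures.

```python
def num_stages(n: int, radix: int) -> int:
    """log_radix(n) for n a power of radix."""
    s = 0
    while n > 1:
        n //= radix
        s += 1
    return s


def local_stage_count(n: int, pe: int, radix: int) -> int:
    """Count stages whose outputs can stay in the producing unit (hypothetical).

    The outputs of stage s feed stage s + 1, whose butterfly span is
    n / radix**(s + 2); they stay local iff that span is a multiple of pe.
    """
    count = 0
    for s in range(num_stages(n, radix)):
        span_next = n // radix ** (s + 2)
        if span_next >= pe and span_next % pe == 0:
            count += 1
    return count


# N = 32, PE = 4: stages 0-1 local, 2-4 cross-unit, matching Figs. 4-6.
print(local_stage_count(32, 4, 2), num_stages(32 // (2 * 4), 2))  # 2 2
```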
Specifically, in an embodiment, for any butterfly unit, when its number of iterative computations reaches the preset count threshold, the arrangement order information of the butterfly unit is determined according to the characteristic parameter information of the number theory transformation; a plurality of target memories are determined according to the arrangement order information and the actual sequence number of the butterfly unit; and the plurality of output operands obtained by the iterative computation of the butterfly unit are written into these target memories respectively.
Specifically, in an embodiment, for any butterfly unit, when its number of iterative computations reaches the preset count threshold, the arrangement order information of the butterfly unit is determined according to the number of butterfly units represented by the characteristic parameter information and the current iterative computation round.
Specifically, the logical arrangement order of the butterfly unit is determined from characteristic parameters such as the number of butterfly units and the current iterative computation round, while each butterfly unit also has an actual (physical) sequence number.
Specifically, in an embodiment, determining the arrangement order information of the butterfly units according to the number of butterfly units represented by the characteristic parameter information and the current iterative computation round includes:
Determining the arrangement order information of the butterfly units based on the following formula:
where the arrangement order information is expressed in terms of the number PE of butterfly units, the radix of the butterfly unit, the current iterative computation round, and a first, a second and a third order parameter, each taking values in its own range.
For example, when the butterfly unit is a radix-2 butterfly unit, the arrangement order information of the butterfly unit may be specifically determined based on the following formula:
where the arrangement order information of the radix-2 butterfly unit is expressed in terms of the number PE of butterfly units, the total number of iterative computation rounds, the current iterative computation round, and the first, second and third order parameters.
Specifically, in an embodiment, the first, second and third order parameters of the butterfly unit are determined according to its arrangement order information and its actual sequence number, and the plurality of target memories are then determined according to these three order parameters.
Specifically, the values of the first, second and third order parameters in the arrangement order formula are chosen from the correspondence between the arrangement order information and the actual sequence number, so that the arrangement order information equals the preset sequence number of the butterfly unit.
Specifically, in an embodiment, a plurality of target butterfly units may be determined according to the first and third order parameters of the butterfly unit, and one target memory is selected in each target butterfly unit according to the second order parameter, thereby determining the plurality of target memories.
Specifically, in one embodiment, the target butterfly units are determined based on the following formula:
where each target butterfly unit sequence number is expressed in terms of the number PE of butterfly units, the current iterative computation round, and the first and third order parameters.
When the butterfly unit is a radix-2 butterfly unit, there are two target butterfly units, a first target butterfly unit and a second target butterfly unit, determined as follows:
where the first and second target butterfly unit sequence numbers are expressed in terms of the number PE of butterfly units, the total number of iterative computation rounds, the current iterative computation round, and the first, second and third order parameters.
Specifically, in one embodiment, the target butterfly unit sequence number is used as the first coordinate of the target memory and the second order parameter as its second coordinate, and one target memory is selected in each target butterfly unit according to these two coordinates.
That is, a target memory is identified by the pair (target butterfly unit sequence number, second order parameter): the memory block indexed by the second order parameter within each target butterfly unit is used as a target memory.
When the butterfly unit is a radix-2 butterfly unit, the even memory of each target butterfly unit is used as the target memory when the second order parameter of the butterfly unit is 1, and the odd memory when it is 0.
Specifically, taking the radix-2 butterfly unit as an example, Fig. 6 is an exemplary operand memory scheduling logic diagram according to an embodiment of the present application, showing the operand memory scheduling logic of the butterfly units in the 2nd to 4th stages of iterative computation for N = 32 and PE = 4. Let the two input operands of the radix-2 C-T butterfly unit be $u$ and $v$ and the twiddle factor be $w$; the radix-2 C-T butterfly unit computes:

$$X = (u + w v) \bmod q, \qquad Y = (u - w v) \bmod q$$

Here, the p-th butterfly unit fetches the operands $u$ and $v$ from the same address of its even and odd RAMs. Depending on the value of the second order parameter, the two output operands $X$ and $Y$ are written either into the even RAMs of the two target butterfly units or into their odd RAMs, at a write address equal to the read address.
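The cross-unit write pattern can be checked with the Python sketch below. It rests on two assumptions of ours: at each address, the 2·PE operands held by all units form one closed group on which the remaining stages act (which the same-address write rule implies), and within a stage of span d the butterflies of that group are numbered in natural order of their upper offset. The code derives, for every unit, the (unit, RAM) pair that receives each of its two outputs, and confirms the properties stated in the text: both outputs go to the same RAM type of two different target units, and every RAM at that address is written exactly once, so no write conflict arises. The even/odd choice plays the role that the text assigns to the second order parameter.

```python
def unit_of(o: int, d: int) -> int:
    """Unit holding offset o when butterflies have span d (pairs o, o + d)."""
    return (o // (2 * d)) * d + (o % d)


def bank_of(o: int, d: int) -> str:
    """Upper operands of a span-d butterfly live in the even RAM, lower in odd."""
    return "even" if o % (2 * d) < d else "odd"


def cross_stage_targets(pe: int, d: int) -> dict:
    """For a cross-unit stage of span d (2 <= d <= pe), return
    {unit: [(target unit, RAM) of output X, (target unit, RAM) of output Y]},
    assuming the next stage has span d // 2 and write address = read address."""
    nd = d // 2
    targets = {}
    for o in range(2 * pe):
        if bank_of(o, d) == "even":  # o is the upper input of its butterfly
            targets[unit_of(o, d)] = [(unit_of(x, nd), bank_of(x, nd)) for x in (o, o + d)]
    return targets


print(cross_stage_targets(4, 4)[0])  # [(0, 'even'), (2, 'even')]

for pe in (2, 4, 8):
    d = pe
    while d >= 2:
        t = cross_stage_targets(pe, d)
        flat = [x for v in t.values() for x in v]
        assert len(set(flat)) == 2 * pe  # each RAM written exactly once
        for (u1, b1), (u2, b2) in t.values():
            assert u1 != u2 and b1 == b2  # two units, same RAM type
        d //= 2
```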
Specifically, once the number of iterative computations of the butterfly unit reaches the preset count threshold, the write address of each output operand in its target memory is the same as the read address of the input operands of the current round, so no extra time or resources are spent computing write addresses, which further improves the iterative computation efficiency of the butterfly units.
Fig. 7 is an exemplary NTT calculation flowchart according to an embodiment of the present application. The control unit, which executes the method provided by the embodiment, initializes the twiddle factor ROMs, i.e., determines the twiddle factors and stores them in the twiddle factor memories, and the butterfly units then iterate on the input vector with these twiddle factors to obtain the output vector. Fig. 8 is a schematic structural diagram of an exemplary number theory transformation hardware acceleration system according to an embodiment of the present application, in which the twiddle factor ROM is the twiddle factor memory and the operand RAM is the operand memory.
For the radix-2 Gentleman-Sande (G-S) butterfly unit, which is used for the inverse transform (INTT), let $\hat{a}$ be an $N$-dimensional vector over $\mathbb{Z}_q$; then:

$$a_i = N^{-1} \sum_{j=0}^{N-1} \hat{a}_j \, \omega^{-ij} \bmod q$$

For input operands $u$, $v$ and twiddle factor $w$, the radix-2 G-S butterfly unit computes:

$$X = (u + v) \bmod q, \qquad Y = (u - v)\, w \bmod q$$

Thus, in each stage of iteration, the twiddle factor required by each butterfly unit is a power of $\omega^{-1}$ whose exponent depends on the stage and on the position of the butterfly. If the subscripts of the input vector are in bit-reversed order, then in each stage the exponents of the twiddle factors are arranged in bit-reversed order.
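A Python sketch of the radix-2 G-S inverse transform is given below. It pairs the G-S pass with a matching C-T forward transform (natural-order input, bit-reversed output, twiddle exponents in bit-reversed order), included so the block runs on its own. The G-S pass consumes the bit-reversed vector, runs the stages in reverse order with inverted twiddle factors, and returns coefficients in natural order; the function names and the final N^{-1} scaling step are illustrative assumptions.

```python
def bit_reverse(x: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def ntt_ct(a: list, w: int, q: int) -> list:
    """Radix-2 C-T NTT: natural-order input, bit-reversed output."""
    n = len(a)
    logn = n.bit_length() - 1
    a = list(a)
    for s in range(logn):
        d = n >> (s + 1)
        for k in range(1 << s):
            tw = pow(w, bit_reverse(k, logn - 1), q)
            for j in range(2 * d * k, 2 * d * k + d):
                t = tw * a[j + d] % q
                a[j], a[j + d] = (a[j] + t) % q, (a[j] - t) % q
    return a


def intt_gs(a: list, w: int, q: int) -> list:
    """Radix-2 G-S inverse: bit-reversed input, natural-order output.

    Undoes ntt_ct stage by stage (last stage first) with inverse twiddles.
    Each G-S butterfly (u, v) -> (u + v, (u - v) * w) doubles the data, so
    the result is scaled by n^{-1} at the end.
    """
    n = len(a)
    logn = n.bit_length() - 1
    w_inv = pow(w, -1, q)
    a = list(a)
    for s in reversed(range(logn)):
        d = n >> (s + 1)
        for k in range(1 << s):
            tw = pow(w_inv, bit_reverse(k, logn - 1), q)
            for j in range(2 * d * k, 2 * d * k + d):
                u, v = a[j], a[j + d]
                a[j], a[j + d] = (u + v) % q, (u - v) * tw % q
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in a]


q, w = 17, 9  # 9 is a primitive 8th root of unity mod 17
x = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt_gs(ntt_ct(x, w, q), w, q) == x
```

Because the NTT output is already in bit-reversed order, it can be fed to the G-S pass without any reordering, which is why the two transforms are naturally used in pairs.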
Let the number of twiddle factors computed in parallel be PE; then PE blocks of ROM are needed to store the pre-computed twiddle factors. In the earlier stages of iteration, the data stored at the same address of the twiddle factor ROMs is identical within each group of consecutive butterfly units, i.e., each twiddle factor is stored, in order, in the ROM blocks of one group. In the later stages, the data stored at the same address of the twiddle factor ROMs of all butterfly units is identical, i.e., each twiddle factor is stored, in order, in all PE ROM blocks.
For polynomial multiplication in $\mathbb{Z}_q[x]/(x^N+1)$, with $\psi$ a primitive $2N$-th root of unity in $\mathbb{Z}_q$ and $\psi^2 = \omega$, the result $c$ is recovered from its $N$-dimensional transform $\hat{c}$ by the inverse transform followed by coordinate-wise multiplication by the powers of $\psi^{-1}$:

$$c_i = N^{-1} \psi^{-i} \sum_{j=0}^{N-1} \hat{c}_j \, \omega^{-ij} = N^{-1} \sum_{j=0}^{N-1} \hat{c}_j \, \psi^{-(2j+1)i} \bmod q$$

The radix-2 G-S butterfly unit keeps the same form, with the twiddle factor now a power of $\psi^{-1}$.
By changing the values of the pre-computed twiddle factors, i.e., replacing the powers of $\omega^{-1}$ with the corresponding powers of $\psi^{-1}$, the separate multiplications by the powers of $\psi^{-1}$ in the formula are eliminated, which accelerates polynomial multiplication. The twiddle factors in this formula are arranged in the same way as described above.
When initializing the input vector, as in the NTT based on radix-2 C-T butterfly iteration, the transform based on radix-2 G-S butterfly iteration needs 2PE blocks of pseudo-dual-port RAM to store operands: each butterfly unit corresponds to two RAMs, an even RAM and an odd RAM, and each RAM stores N/(2PE) operands. In polynomial modular multiplication, the NTT and the INTT are computed in sequence, i.e., they appear in pairs. The output vector of the NTT can therefore be used directly as the input vector of the INTT; the subscripts of the INTT input vector are then in bit-reversed order, and those of its output vector are in natural order.
The operand RAM scheduling when computing the INTT by radix-2 G-S butterfly iteration is the reverse of that when computing the NTT by radix-2 C-T butterfly iteration. Specifically, an N-point transform requires log2 N stages of radix-2 butterfly iteration. In the earlier stages, the input and output operands of the butterfly units are interdependent, and the two output operands of each butterfly unit are written into different RAMs. In the later stages, the input and output operands of each butterfly unit are independent of those of the other units: the two input operands are read from the same address of the unit's two RAMs, and the two output operands are written into the same RAM, the second output operand being held in a register for one beat before it is written. In addition, by adding one shift operation inside the butterfly unit, a multiplication in the above formula can be omitted.
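The formula for this shift is not legible here. A common way to realize it, assumed in the Python sketch below, is to fold the final scaling by N^{-1} (with N a power of two) into the stages: each of the log2 N G-S stages multiplies its outputs by 2^{-1} mod q, and for odd q that halving needs only an addition and a one-bit shift. The helper names are hypothetical.

```python
def half_mod(x: int, q: int) -> int:
    """Return x * 2^{-1} mod q for odd q and 0 <= x < q, using only add and shift."""
    # Even x halves exactly; for odd x, x + q is even and (x + q) / 2 < q.
    return x >> 1 if x % 2 == 0 else (x + q) >> 1


def gs_butterfly_halved(u: int, v: int, w: int, q: int) -> tuple:
    """G-S butterfly with the factor 1/2 folded in: ((u+v)/2, (u-v)*w/2) mod q."""
    return half_mod((u + v) % q, q), half_mod((u - v) * w % q, q)


print(gs_butterfly_halved(3, 5, 9, 17))  # (4, 8)
```

After log2 N stages the halvings accumulate to N^{-1}, so the separate scaling pass over the output vector is no longer needed.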
Specifically, for the radix-2 G-S butterfly unit, during the iterative computation of the butterfly units based on the twiddle factors stored in the twiddle factor memories, for any butterfly unit, the plurality of output operands are written into a plurality of different target memories while the number of iterative computations has not reached the preset count threshold, and into the same target memory once it has. The target memories are determined as in the above embodiments and are not described again here.
Similarly, a radix-8 C-T butterfly unit based on a two-stage pipelined architecture requires 7 twiddle factors, and the number of twiddle factors required in each stage of iteration depends on the stage index. These twiddle factors are stored in 7 ROMs, and their exponents are arranged in octal-digit-reversed order. In the earlier stages of iteration, each twiddle factor is repeated a number of times and stored at the same address of the twiddle factor ROMs of the butterfly units; in the later stages, each twiddle factor is likewise repeated, with a repetition count that depends on the stage. As with the NTT based on radix-2 butterfly iteration, changing the values of the pre-computed twiddle factors accelerates polynomial multiplication without increasing the amount of computation.
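Octal digit reversal, which takes the place of bit reversal when ordering radix-8 twiddle exponents, can be written as a short Python helper. The name is illustrative; with radix=2 it reduces to ordinary bit reversal.

```python
def digit_reverse(x: int, digits: int, radix: int = 8) -> int:
    """Reverse the lowest `digits` base-`radix` digits of x (octal by default)."""
    r = 0
    for _ in range(digits):
        x, d = divmod(x, radix)
        r = r * radix + d
    return r


# A 512-point radix-8 NTT uses 3 octal digits per exponent.
print(oct(digit_reverse(0o123, 3)))  # 0o321
print([digit_reverse(k, 2) for k in range(4)])  # [0, 8, 16, 24]
```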
Apart from the twiddle factors, the radix-8 C-T butterfly unit has 8 input operands and 8 output operands, so each butterfly unit needs 8 blocks of pseudo-dual-port RAM, numbered 0, 1, ..., 7; the total number of pseudo-dual-port RAMs is 8PE, and each RAM stores N/(8PE) operands. Let the subscripts of the input vector be in natural order. The first N/8 coordinate components are written in turn into the PE blocks of RAM number 0, the next N/8 coordinate components into the PE blocks of RAM number 1, and so on, until the last N/8 coordinate components are written in turn into the PE blocks of RAM number 7.
An N-point NTT requires log8 N stages of radix-8 butterfly iteration. In the stages before the threshold, the input and output operands of each butterfly unit are independent of those of the other units: the 8 input operands of a butterfly unit are read from the same address of its 8 RAMs, and the 8 output operands are written into the same RAM, the 2nd to 8th output operands being held in registers for 1 to 7 beats respectively before they are written. In the remaining stages, the input and output operands of the butterfly units are interdependent, and the 8 output operands of a butterfly unit are written into different RAMs. In the stages before the threshold, the write address is given by the following formula:
In the remaining stages, the write address is the same as the read address, and the output operands are not registered. If the subscripts of the input vector are in natural order, the subscripts of the output vector are in octal-digit-reversed order.
In practical application of number theory transformation, if memory resources on a selected hardware chip are limited, as NTT point number N, modulus value q and the like are increased, an off-chip memory is required to be used for storing rotation factors, intermediate calculation results and the like, and memory delay is increased; by instantiating a plurality of butterfly units, the parallelism of NTT calculation can be improved, the calculation speed is increased, and the memory scheduling difficulty is increased. The scheduling mode based on the intermediate result cache is simple in design, but memory occupation amount and calculation period are increased. The embodiment of the application provides a number theory transformation hardware acceleration method based on a dynamic memory read-write strategy, which is characterized in that the RAM resources are not additionally occupied in the NTT calculation process except for the initialized RAM resources, and the butterfly unit can realize the flow calculation, so that the memory occupation amount is reduced, and the off-chip memory access is reduced.
The NTT hardware acceleration method provided by the embodiment of the application proceeds as follows. Characteristic parameter information of the NTT is obtained. The twiddle factors of each butterfly unit in each round of iterative computation are determined from that information and stored in the twiddle factor memory corresponding to each butterfly unit. During iterative computation based on the stored twiddle factors, consider any butterfly unit. While its number of iterations has not reached a preset threshold, the multiple output operands produced by its iterations are written into the same target memory; once the threshold is reached, they are written into multiple different target memories. This reduces the memory resources occupied when the NTT is implemented on given hardware and lays a foundation for improving NTT throughput.
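The twiddle-factor step above can be illustrated in software: each butterfly unit's twiddle factors are determined per round and stored in that unit's twiddle factor memory. The Python sketch below is a hypothetical model, not the patent's address map. It rests on these assumptions:

- It uses a radix-8 decimation-in-frequency decomposition.
- In a given round, the butterfly at offset j of a block of length L uses the factors w_L^(j*k) for k = 0..7, where w_L is a primitive L-th root of unity.
- Butterflies are assigned to `num_units` butterfly units in round-robin order.

The function name and the tuple layout of each memory entry are illustrative only.

```python
from collections import defaultdict


def build_twiddle_memories(n: int, w: int, q: int, num_units: int, radix: int = 8):
    """Precompute per-unit twiddle tables for a radix-`radix` DIF NTT (hypothetical layout).

    w must be a primitive n-th root of unity mod q.
    Returns {unit: [(stage, j, twiddles), ...]} with twiddles[k] = w_len^(j*k),
    where w_len is a primitive root for the stage's block length.
    Assumed schedule: butterfly b of each stage runs on unit b % num_units.
    """
    # total number of rounds = log_radix(n)
    stages, t = 0, n
    while t > 1:
        t //= radix
        stages += 1

    memories = defaultdict(list)
    length = n
    for s in range(stages):
        m = length // radix
        w_len = pow(w, n // length, q)      # twiddle base for round s
        b = 0
        for _start in range(0, n, length):
            for j in range(m):
                # exponent of the k-th twiddle in this round is j*k (always < length)
                tw = [pow(w_len, j * k, q) for k in range(radix)]
                memories[b % num_units].append((s, j, tw))
                b += 1
        length = m
    return dict(memories)
```

As a usage example, take N = 64 with four units. Each unit's memory ends up with four entries. All entries of the last round are 1, because that round has block length 8 and only offset j = 0. In this decomposition, each round needs only L/8 distinct twiddle vectors, which is the kind of redundancy a shared storage area can exploit.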
In addition, the outputs of the butterfly units are stored locally in pseudo-dual-port RAM. The multiple outputs that each butterfly unit produces in the first-stage iterations are split and registered so that they are rearranged. This resolves read/write conflicts and allows the butterfly units to be pipelined.
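The write skew described above can be checked with a small cycle-level model. This is a hypothetical sketch, not the patented schedule. It makes three assumptions:

- Each pseudo-dual-port RAM has one write port and therefore accepts one write per cycle.
- One radix-8 butterfly issues per cycle.
- Consecutive butterflies write to the RAMs in round-robin order.

Under these assumptions, delaying output k by k cycles leaves every RAM with at most one write per cycle. Writing all 8 outputs in the issue cycle instead collides on a single RAM.

```python
def schedule_writes(num_butterflies: int, radix: int = 8, skew: bool = True):
    """Model write-back timing; returns {cycle: [(ram, butterfly, output_index), ...]}.

    Assumed (hypothetical): butterfly t issues at cycle t and writes all of its
    outputs into RAM t % radix; with skew, output k is registered for k cycles.
    """
    writes = {}
    for t in range(num_butterflies):
        ram = t % radix                        # all outputs of butterfly t go to one RAM
        for k in range(radix):
            cycle = t + (k if skew else 0)     # output k delayed by k register stages
            writes.setdefault(cycle, []).append((ram, t, k))
    return writes


def has_port_conflict(writes) -> bool:
    """True if any RAM receives more than one write in the same cycle."""
    return any(len({ram for ram, _, _ in ws}) != len(ws) for ws in writes.values())
```

With the skew, the butterflies writing in any one cycle form a window of at most 8 consecutive issue slots, so their target RAMs (t % 8) are all distinct.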
An embodiment of the application provides a number-theoretic transform hardware acceleration device for executing the number-theoretic transform hardware acceleration method provided by the above embodiment.
Fig. 9 is a schematic structural diagram of a number-theoretic transform hardware acceleration device according to an embodiment of the present application. The number-theoretic transform hardware acceleration device 90 includes an acquisition module 901, a determination module 902, a storage module 903, a first acceleration module 904 and a second acceleration module 905.
The acquisition module is configured to obtain characteristic parameter information of the number-theoretic transform. The determination module is configured to determine the twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information. The storage module is configured to store the twiddle factors in the twiddle factor memories corresponding to the butterfly units. The first acceleration module operates during iterative computation based on the twiddle factors stored in the twiddle factor memories. For any butterfly unit whose number of iterations has not reached a preset threshold, it writes the multiple output operands obtained by the unit's iterative computation into the same target memory. The second acceleration module is configured to write those output operands into multiple different target memories once the butterfly unit's number of iterations reaches the preset threshold.
The specific manner in which each module of the number-theoretic transform hardware acceleration device in this embodiment performs its operations has been described in detail in the method embodiments and is not repeated here.
The number-theoretic transform hardware acceleration device provided by the embodiment of the application executes the number-theoretic transform hardware acceleration method provided by the above embodiment. Its implementation and principle are the same and are not repeated.
An embodiment of the application provides an electronic device for executing the number-theoretic transform hardware acceleration method provided by the above embodiment.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 100 includes: at least one processor 1001 and memory 1002.
The memory stores computer-executable instructions. The at least one processor executes the computer-executable instructions stored in the memory, so that it performs the number-theoretic transform hardware acceleration method provided by the above embodiments.
The electronic device provided by the embodiment of the application executes the number-theoretic transform hardware acceleration method provided by the above embodiment. Its implementation and principle are the same and are not repeated.
An embodiment of the application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the number-theoretic transform hardware acceleration method provided by any of the above embodiments.
The storage medium containing computer-executable instructions provided by the embodiment of the application may be used to store the computer-executable instructions of the number-theoretic transform hardware acceleration method provided by the above embodiments. Its implementation and principle are the same and are not repeated.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium. It includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform some of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above. The specific working process of the above-described device may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (19)

1. A number-theoretic transform (NTT) hardware acceleration method, comprising:
obtaining characteristic parameter information of the NTT, wherein the characteristic parameter information comprises a butterfly unit type and a dimension of an input vector;
determining, according to the characteristic parameter information, twiddle factors of each butterfly unit in each round of iterative computation;
storing the twiddle factors in twiddle factor memories corresponding to the butterfly units;
during iterative computation of the butterfly units based on the twiddle factors stored in the twiddle factor memories, for any butterfly unit, when the number of iterations of the butterfly unit has not reached a preset iteration-count threshold, writing a plurality of output operands obtained by the iterative computation of the butterfly unit into the same target memory; and
when the number of iterations of the butterfly unit reaches the preset iteration-count threshold, writing the plurality of output operands obtained by the iterative computation of the butterfly unit into a plurality of different target memories;
wherein determining the twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information comprises:
determining a total number of iteration rounds and a twiddle factor expression for each butterfly unit according to the butterfly unit type and the dimension of the input vector represented by the characteristic parameter information;
determining an index of the twiddle factor in each round of iterative computation according to the total number of iteration rounds of each butterfly unit; and
determining the twiddle factors of each butterfly unit in each round of iterative computation according to the index of the twiddle factor in each round and the twiddle factor expression.
2. The method of claim 1, wherein storing the twiddle factors in the twiddle factor memories corresponding to the butterfly units comprises:
for any butterfly unit, determining a storage address of each twiddle factor in the twiddle factor memory corresponding to the butterfly unit according to the iteration round corresponding to that twiddle factor;
storing each twiddle factor in the twiddle factor memory corresponding to the butterfly unit according to its storage address;
wherein the twiddle factor memories are in one-to-one correspondence with the butterfly units.
3. The method of claim 2, wherein determining the storage address of each twiddle factor in the twiddle factor memory corresponding to the butterfly unit according to the iteration round corresponding to each twiddle factor of the butterfly unit comprises:
dividing each twiddle factor memory into twiddle factor storage areas corresponding to the rounds of iterative computation of the butterfly unit, wherein each twiddle factor storage area comprises a plurality of storage addresses, and the twiddle factor storage areas correspond one-to-one to the iteration rounds;
in the twiddle factor memory, when the iteration round is lower than a preset round threshold, treating all twiddle factor storage areas corresponding to that iteration round as identical twiddle factor storage areas, wherein identical twiddle factor storage areas store the same twiddle factors;
when the iteration round is not lower than the preset round threshold, determining the number of identical twiddle factor storage areas according to the dimension of the input vector and the iteration round;
locating the identical twiddle factor storage areas according to their number, so that the identical twiddle factor storage areas store the same twiddle factors; and
when the number of identical twiddle factor storage areas is determined to be 0, determining that the twiddle factor storage areas corresponding to the current iteration round store different twiddle factors.
4. The method of claim 1, wherein, for any butterfly unit, writing the plurality of output operands obtained by the iterative computation of the butterfly unit into the same target memory when the number of iterations of the butterfly unit has not reached the preset iteration-count threshold comprises:
for any butterfly unit, when the number of iterations of the butterfly unit has not reached the preset iteration-count threshold, determining a corresponding target memory address according to the memory read address of the input operands of the current iteration of the butterfly unit; and
writing the plurality of output operands obtained by the iterative computation of the butterfly unit into the same target memory according to the target memory address.
5. The method of claim 4, wherein, for any butterfly unit, determining the corresponding target memory address according to the memory read address of the input operands of the current round of iterative computation when the number of iterations of the butterfly unit has not reached the preset iteration-count threshold comprises:
for any butterfly unit, when the number of iterations of the butterfly unit has not reached the preset iteration-count threshold, determining the target memory according to the memory read address of the input operands of the current round of iterative computation; and
determining the target memory address of each output operand in the target memory according to the characteristic parameter information of the NTT.
6. The method of claim 5, wherein determining the target memory address of each output operand in the target memory according to the characteristic parameter information of the NTT comprises:
determining a first target memory address according to the dimension of the input vector represented by the characteristic parameter information of the NTT, the number of butterfly units and the current iteration round;
determining a plurality of second target memory addresses according to the first target memory address;
wherein the target memory address comprises the first target memory address and the plurality of second target memory addresses.
7. The method of claim 6, wherein determining the first target memory address according to the dimension of the input vector represented by the characteristic parameter information of the NTT, the number of butterfly units and the current iteration round comprises:
determining the first target memory address based on the following formula: […]
where […] denotes the first target memory address, […] the dimension of the input vector, […] the number of butterfly units, […] the radix of the butterfly unit and […] the current iteration round; […] denotes a first positioning parameter, with […]; and […] denotes a second positioning parameter, with […].
8. The method of claim 6, wherein determining the plurality of second target memory addresses according to the first target memory address comprises:
determining the plurality of second target memory addresses based on the following formula: […]
where […] denotes the […]-th second target memory address, with […]; […] denotes the first target memory address, […] the dimension of the input vector, […] the number of butterfly units and […] the current iteration round.
9. The method of claim 1, wherein writing the plurality of output operands obtained by the iterative computation of the butterfly unit into the plurality of different target memories when the number of iterations of the butterfly unit reaches the preset iteration-count threshold comprises:
when the number of iterations of the butterfly unit reaches the preset iteration-count threshold, determining arrangement order information of the butterfly units according to the characteristic parameter information of the NTT;
determining a plurality of target memories according to the arrangement order information of the butterfly units and the actual sequence number of the butterfly unit; and
writing the plurality of output operands obtained by the iterative computation of the butterfly unit into the respective target memories.
10. The method of claim 9, wherein determining the arrangement order information of the butterfly units according to the characteristic parameter information of the NTT comprises:
determining the arrangement order information of the butterfly units according to the number of butterfly units represented by the characteristic parameter information of the NTT and the current iteration round.
11. The method of claim 10, wherein determining the arrangement order information of the butterfly units according to the number of butterfly units represented by the characteristic parameter information of the NTT and the current iteration round comprises:
determining the arrangement order information of the butterfly units based on the following formula: […]
where […] denotes the arrangement order information of the butterfly units, […] the number of butterfly units, […] the radix of the butterfly unit and […] the current iteration round, with […]. In addition, […] denotes the first order parameter, with […]; […] denotes the second order parameter, with […]; and […] denotes the third order parameter, with […].
12. The method of claim 9, wherein determining the plurality of target memories according to the arrangement order information of the butterfly units and the actual sequence number of the butterfly unit comprises:
determining a first order parameter, a second order parameter and a third order parameter of the butterfly unit according to the arrangement order information of the butterfly units and the actual sequence number of the butterfly unit; and
determining the plurality of target memories according to the first, second and third order parameters of the butterfly unit.
13. The method of claim 12, wherein determining the plurality of target memories according to the first, second and third order parameters of the butterfly unit comprises:
determining a plurality of target butterfly units according to the first order parameter and the third order parameter of the butterfly unit; and
selecting one target memory in each target butterfly unit according to the second order parameter, thereby determining the plurality of target memories.
14. The method of claim 13, wherein determining the plurality of target butterfly units according to the first order parameter and the third order parameter of the butterfly unit comprises:
determining the plurality of target butterfly units based on the following formula: […]
where […]. Here […] denotes the sequence number of the […]-th target butterfly unit, […] the radix of the butterfly unit, […] the number of butterfly units and […] the current iteration round, with […]. In addition, […] denotes the first order parameter, with […]; and […] denotes the third order parameter, with […].
15. The method of claim 14, wherein selecting one target memory in each target butterfly unit according to the second order parameter comprises:
taking the sequence number of the target butterfly unit as a first coordinate of the target memory;
taking the second order parameter as a second coordinate of the target memory; and
selecting a target memory in each target butterfly unit according to the first coordinate and the second coordinate of the target memory.
16. The method according to any one of claims 1 to 15, further comprising:
determining the preset iteration-count threshold based on the following formula: […]
where […] denotes the preset iteration-count threshold, […] the dimension of the input vector, […] the number of butterfly units and […] the radix of the butterfly unit.
17. A number-theoretic transform (NTT) hardware acceleration device, comprising:
an acquisition module, configured to obtain characteristic parameter information of the NTT, wherein the characteristic parameter information comprises a butterfly unit type and a dimension of an input vector;
a determination module, configured to determine twiddle factors of each butterfly unit in each round of iterative computation according to the characteristic parameter information;
a storage module, configured to store the twiddle factors in twiddle factor memories corresponding to the butterfly units;
a first acceleration module, configured to, during iterative computation of the butterfly units based on the twiddle factors stored in the twiddle factor memories and for any butterfly unit, write a plurality of output operands obtained by the iterative computation of the butterfly unit into the same target memory when the number of iterations of the butterfly unit has not reached a preset iteration-count threshold; and
a second acceleration module, configured to write the plurality of output operands obtained by the iterative computation of the butterfly unit into a plurality of different target memories when the number of iterations of the butterfly unit reaches the preset iteration-count threshold;
wherein the determination module is specifically configured to:
determine a total number of iteration rounds and a twiddle factor expression for each butterfly unit according to the butterfly unit type and the dimension of the input vector represented by the characteristic parameter information;
determine an index of the twiddle factor in each round of iterative computation according to the total number of iteration rounds of each butterfly unit; and
determine the twiddle factors of each butterfly unit in each round of iterative computation according to the index of the twiddle factor in each round and the twiddle factor expression.
18. An electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method of any one of claims 1 to 16.
19. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any one of claims 1 to 16.
CN202410381414.7A 2024-03-29 2024-03-29 Method and device for accelerating number theory transformation hardware, electronic equipment and storage medium Active CN117971136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410381414.7A CN117971136B (en) 2024-03-29 2024-03-29 Method and device for accelerating number theory transformation hardware, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117971136A CN117971136A (en) 2024-05-03
CN117971136B true CN117971136B (en) 2024-06-25

Family

ID=90859853

Country Status (1)

Country Link
CN (1) CN117971136B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant