CN113536228A - FPGA acceleration implementation method for matrix singular value decomposition


Info

Publication number
CN113536228A
Authority
CN
China
Prior art keywords
matrix, sub-block, columns, FPGA
Prior art date
Legal status
Granted
Application number
CN202111083549.8A
Other languages
Chinese (zh)
Other versions
CN113536228B (en)
Inventor
胡塘
李相迪
徐志伟
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111083549.8A
Publication of CN113536228A
Application granted
Publication of CN113536228B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization


Abstract

The invention discloses an FPGA (field programmable gate array) acceleration method for matrix singular value decomposition. A matrix of m rows × n columns stored in off-chip DRAM is first divided evenly into p = n/k sub-blocks, each a group of k column vectors. The sub-blocks are combined pairwise, in an alternating order, into small m-row × 2k-column matrices that are written into BRAM (block RAM) inside the FPGA, where the one-sided Jacobi rotation transformation is performed. Half of the column vectors of the result are written back to the off-chip DRAM, while the other half are combined with the next sub-block to form a new m-row × 2k-column matrix, and the operation is repeated on the FPGA until the p sub-blocks have been combined pairwise and a full round of one-sided Jacobi rotation transformations has been executed. The full round is repeated until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column large matrix is complete. By adopting a divide-and-conquer decomposition strategy and alternating combination of the sub-blocks, the invention improves the data reuse rate, reduces frequent data movement, and relieves the bandwidth pressure of on-chip/off-chip data transfers.

Description

FPGA acceleration implementation method for matrix singular value decomposition
Technical Field
The invention relates to the field of signal processing, in particular to an FPGA (field programmable gate array) acceleration implementation method for matrix singular value decomposition.
Background
Singular value decomposition is an important matrix factorization in linear algebra and is widely used in signal processing, image compression, deep learning, and other fields. Existing research mostly implements matrix singular value decomposition in software on a CPU or GPU. With the rapid development of FPGA technology in recent years, implementing matrix singular value decomposition on an FPGA has gradually become a popular approach, especially in application scenarios where an FPGA is already deployed: replacing the GPU with an FPGA-based singular value decomposition reduces cost and power consumption, and compared with a CPU implementation it delivers real-time, low-latency performance.
However, singular value decomposition involves a large number of mathematical operations and loop iterations. For large matrices in particular, the problem is distinctly compute-intensive and storage-intensive, placing harsh demands on the computation resources, logic resources, BRAM capacity, and on-chip/off-chip transfer bandwidth of the FPGA, and making FPGA development unusually difficult and laborious. In published research and inventions, owing to the limited resources of the FPGA and the complexity of singular value decomposition itself, FPGA-based singular value decompositions generally handle only small matrices, offer poor real-time performance, and support only fixed-size matrix inputs.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an FPGA acceleration method for matrix singular value decomposition. Based on the one-sided Jacobi algorithm, it adopts a divide-and-conquer strategy that partitions the input large matrix into blocks of column vectors, converting the originally unwieldy large-matrix singular value decomposition into a series of small-matrix decompositions that are comparatively easy to implement, which significantly reduces the FPGA development difficulty and the harsh demands on FPGA resources.
The purpose of the invention is realized by the following technical scheme:
An FPGA acceleration method for matrix singular value decomposition, wherein the FPGA has 3k BRAMs and the matrix has m rows × n columns, comprising the following steps:
S1: divide the matrix evenly into p = n/k sub-blocks of k column vectors each; if n is not divisible by k, first pad the tail of the matrix with additional column vectors, whose elements are all 0, so that it divides evenly;
S2: combine the 1st sub-block and the 2nd sub-block into a new m-row × 2k-column matrix and write its column vectors one-to-one into the corresponding 2k BRAMs of the FPGA, one BRAM per column vector;
S3: perform the one-sided Jacobi rotation transformation on the new m-row × 2k-column matrix, running 2k-1 rounds under a round-robin scheduling mechanism, and pre-write the k column vectors of the 3rd sub-block into the remaining k BRAMs;
S4: combine the intermediate results of the k column vectors corresponding to the 1st sub-block in S3 with the pre-written 3rd sub-block to form a new m-row × 2k-column matrix and execute 2k-1 rounds of the one-sided Jacobi rotation transformation; at the same time, write the intermediate result A'_2 of the k column vectors corresponding to the 2nd sub-block in S3 back to the off-chip DRAM, and, once that group of k BRAMs is freed, write the k column vectors of the 4th sub-block into them;
by analogy, the sub-blocks are combined according to the following rule and order: (1,2) → (1,3) → (1,4) → ... → (1,p); (2,p) → (3,p) → (4,p) → ... → (p-1,p); (p-1,2) → (p-1,3) → (p-1,4) → ... → (p-1,p-2); ...; (3,2); when the p sub-blocks have been combined pairwise in all p(p-1)/2 cases, one full round of one-sided Jacobi rotation transformations of the whole matrix is complete (an illustrative software sketch of this combination order is given after these steps);
S5: take the intermediate result of each sub-block after the one-sided Jacobi rotation transformations as the input of a new full round, and repeat the same combination and calculation operations of S2-S4 until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column matrix is complete.
Further, in S4, after the 2k-1 rounds of one-sided Jacobi rotation transformation for the last sub-block combination (3,2) of the whole matrix are completed, the intermediate result of the 3rd sub-block is written back to the off-chip DRAM while the intermediate result of the 2nd sub-block is kept in k on-chip BRAMs, i.e., the intermediate calculation result of the 2nd sub-block is reused.
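For illustration only, and not as part of the claimed method, the combination rule and order above can be reproduced in software. The following Python sketch (the helper name pairing_schedule and the set-based bookkeeping are assumptions of this sketch, not taken from the patent) enumerates the p(p-1)/2 sub-block pairs in the stated order: the sub-block whose intermediate result stays on chip, here called the pivot, meets every sub-block it has not yet met, and its last partner becomes the next pivot.

```python
def pairing_schedule(p):
    """Enumerate the p*(p-1)/2 sub-block combinations in the order stated in S4:
    the on-chip "pivot" block meets every block it has not met yet; its last
    partner then becomes the next pivot."""
    remaining = {frozenset((i, j)) for i in range(1, p + 1) for j in range(i + 1, p + 1)}
    schedule, pivot = [], 1
    while remaining:
        partners = [j for j in range(1, p + 1)
                    if j != pivot and frozenset((pivot, j)) in remaining]
        for j in partners:
            schedule.append((pivot, j))            # pairs are unordered: (16, 2) means (2, 16)
            remaining.discard(frozenset((pivot, j)))
        pivot = partners[-1]                       # last partner stays in BRAM as the new pivot
    return schedule

sched = pairing_schedule(16)
assert len(sched) == 16 * 15 // 2                  # 120 combinations for p = 16
assert sched[0] == (1, 2) and sched[-1] == (3, 2)
```

For p = 16 this reproduces the 120 combinations listed in the embodiment below.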
The invention has the following beneficial effects:
the method is particularly suitable for large-size matrixes, the large-size matrixes with m rows and n columns are averagely divided into p = n/k sub-blocks by taking each k column vector as a group, the sub-blocks are combined pairwise to form small-size matrixes with m rows and 2k columns, the original abnormal and complicated singular value decomposition is remarkably simplified by the divide-and-conquer strategy, the singular value decomposition of matrixes with any size can be supported, the combination rule among the sub-blocks is simple and clear, and the FPGA development and implementation are easy; when any two groups of BRAMs are used for Jacobi rotation conversion operation, the rest BRAMs are used for read-write caching with an off-chip DRAM to form a ping-pong structure working mechanism, so that the parallel efficiency of the whole circuit can be improved, and the working capacity of a production line can be improved; unilateral Jacobi rotation transformation is executed in a round-robin mode in the sub-block combination, so that the data reuse rate is improved, data reuse is further improved in an alternate combination mode among the sub-block combinations, the carrying amount of intermediate calculation results is reduced, and the requirements of high-bandwidth transmission inside and outside the chip are reduced.
Drawings
FIG. 1 is a schematic diagram of a 1024-row × 1024-column matrix divided into 16 sub-blocks and of their pairwise combination;
FIG. 2 is a schematic diagram of the 2k-1 rounds of one-sided Jacobi rotation transformation of the matrix formed by sub-blocks 1 and 2;
FIG. 3 is a schematic diagram of the relationship between sub-blocks 1, 2, 3 and the off-chip DRAM;
FIG. 4 is a schematic diagram of the relationship between the calculation results of sub-blocks 1 and 2, sub-block 3, and the off-chip DRAM;
FIG. 5 is a schematic diagram of the ping-pong operation among the Jacobi transformation, the BRAMs, and the off-chip DRAM.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and preferred embodiments so that its objects and effects become more apparent. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
First, technical term explanations are given:
(1) FPGA: field Programmable Gate Array
(2) BRAM: Block RAM, the block RAM inside the FPGA
(3) Jacobi: in this invention, specifically the one-sided Jacobi rotation commonly used in FPGA-based matrix singular value decomposition
(4) round-robin: the round-robin scheduling mechanism commonly used in singular value decomposition by one-sided Jacobi rotation
(5) DRAM: Dynamic Random Access Memory, here specifically off-chip DRAM such as DDR3 (or DDR4) SDRAM.
The FPGA acceleration method for matrix singular value decomposition requires the FPGA to provide 3k BRAMs; with the matrix to be decomposed defined as m rows × n columns, the method comprises the following steps:
S1: divide the matrix evenly into p = n/k sub-blocks of k column vectors each; if n is not divisible by k, first pad the tail of the matrix with additional column vectors, whose elements are all 0, so that it divides evenly.
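As a software illustration of S1 only (the function name pad_to_multiple is a hypothetical choice, not patent terminology), a minimal NumPy sketch of the zero-padding:

```python
import numpy as np

def pad_to_multiple(A, k=64):
    """Append all-zero column vectors so the column count divides evenly into k-column sub-blocks."""
    pad = (-A.shape[1]) % k
    if pad == 0:
        return A
    return np.hstack([A, np.zeros((A.shape[0], pad), dtype=A.dtype)])

A = pad_to_multiple(np.ones((1024, 1000), dtype=np.float32))   # 1000 -> 1024 columns, p = 16
```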
S2: combine sub-block 1 and sub-block 2 into a new m-row × 2k-column matrix and write its column vectors one-to-one into the corresponding 2k BRAMs of the FPGA, one BRAM per column vector.
S3: perform the one-sided Jacobi rotation transformation on the new m-row × 2k-column matrix, running 2k-1 rounds under a round-robin scheduling mechanism, and pre-write the k column vectors of the 3rd sub-block into the remaining k BRAMs.
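The patent requires 2k-1 rounds under a round-robin mechanism but does not spell out the exact rotation pattern; the sketch below uses the standard circle-method tournament schedule as one plausible realization (round_robin_rounds is an illustrative name, not patent terminology). In each round the 2k columns form k disjoint pairs, so the k pairs can be processed in parallel, and after 2k-1 rounds every pair of columns has met exactly once.

```python
def round_robin_rounds(num_cols):
    """Circle-method tournament schedule: num_cols (even) columns are grouped into
    num_cols//2 disjoint pairs per round; after num_cols-1 rounds every pair of
    columns has met exactly once."""
    assert num_cols % 2 == 0
    cols = list(range(num_cols))
    rounds = []
    for _ in range(num_cols - 1):
        rounds.append([(cols[i], cols[num_cols - 1 - i]) for i in range(num_cols // 2)])
        cols = [cols[0]] + [cols[-1]] + cols[1:-1]   # keep one column fixed, rotate the rest
    return rounds

rounds = round_robin_rounds(128)                     # 2k = 128 columns -> 127 rounds of 64 pairs
assert len(rounds) == 127 and all(len(r) == 64 for r in rounds)
```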
S4: combine the intermediate results of the k column vectors corresponding to the 1st sub-block in S3 with the pre-written 3rd sub-block to form a new m-row × 2k-column matrix and execute 2k-1 rounds of the one-sided Jacobi rotation transformation; at the same time, write the intermediate result A'_2 of the k column vectors corresponding to the 2nd sub-block in S3 back to the off-chip DRAM, and, once that group of k BRAMs is freed, write the k column vectors of the 4th sub-block into them;
by analogy, following the combination rule and order (1,2) → (1,3) → (1,4) → ... → (1,p); (2,p) → (3,p) → (4,p) → ... → (p-1,p); (p-1,2) → (p-1,3) → (p-1,4) → ... → (p-1,p-2); ...; (3,2), the p sub-blocks are combined pairwise in all p(p-1)/2 cases, completing one full round of one-sided Jacobi rotation transformations of the entire matrix.
Define the i-th sub-block as A_i, and let A'_i denote the intermediate result of its k column vectors after A_i has undergone Jacobi rotation transformations with other sub-blocks. The combination and read/write rules can then be expressed more generally as follows:
A'_1 is combined with the pre-written A_i (i ≥ 3) to form a new m-row × 2k-column matrix, and 2k-1 rounds of the one-sided Jacobi rotation transformation are executed; at the same time, the intermediate result A'_(i-1) of the previous sub-block is written back to the off-chip DRAM, and once that group of k BRAMs is freed, the next sub-block A_(i+1) is written into them. This continues until the 2k-1 rounds of one-sided Jacobi rotation transformation of A'_1 and A_p are completed, at which point sub-block A_1 has been combined and computed with every other sub-block; A'_p is kept unchanged in its BRAMs inside the FPGA.
The intermediate result A'_1 is then written back to the off-chip DRAM; once that group of k BRAMs is freed, A'_2 is written from the off-chip DRAM into them and combined with A'_p, and the above process is repeated until A'_p has been combined and computed with all the remaining sub-blocks.
In the same way, the intermediate result of one sub-block is kept on chip while the sub-blocks that have not yet been combined and computed with it are written in sequence, and the 2k-1 rounds of one-sided Jacobi rotation transformation are executed, until the p sub-blocks have been combined pairwise in all p(p-1)/2 cases and one full round of one-sided Jacobi rotation transformation calculations for the entire matrix is complete.
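The effect of these rules on off-chip traffic can be illustrated with a small software model (not patent text). It reuses the pairing_schedule sketch given after the steps in the Disclosure section and counts one DRAM read and one DRAM write per k-column sub-block transfer:

```python
def simulate_dram_traffic(p):
    """Count sub-block transfers implied by the pivot-based combination order:
    each combination loads only the new partner from DRAM and writes back only
    the previous partner, because the pivot's intermediate result stays in BRAM."""
    schedule = pairing_schedule(p)                  # helper from the earlier sketch
    reads, writes = 2, 0                            # first combination loads both sub-blocks
    for prev, cur in zip(schedule, schedule[1:]):
        keep = set(prev) & set(cur)                 # the pivot carried over on chip
        writes += len(set(prev) - keep)             # previous partner written back to DRAM
        reads += len(set(cur) - keep)               # next partner prefetched from DRAM
    writes += 2                                     # both resident blocks written back at the end
    return reads, writes

print(simulate_dram_traffic(16))                    # (121, 121) for p = 16
```

For p = 16 this gives roughly 121 sub-block reads and 121 write-backs per full round, versus 240 of each if no intermediate result were kept on chip; the refinement described below (and in claim 2) additionally keeps the 2nd sub-block's result on chip across full rounds.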
S5: take the intermediate result of each sub-block after the one-sided Jacobi rotation transformations as the input of a new full round, and repeat the same combination and calculation operations of S2-S4 until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column matrix is complete.
In addition, after S4 is completed, the intermediate results of the k column vectors of the 3rd and 2nd sub-blocks could both be written back to the off-chip DRAM, with the write, combine, and calculate operations restarted in the next full round at S5. However, to further improve data reuse and reduce the movement of intermediate results, only the intermediate result of the k column vectors of the 3rd sub-block is written back to the off-chip DRAM, while the intermediate result of the k column vectors of the 2nd sub-block is kept in k on-chip BRAMs. When the new full round of one-sided Jacobi rotation transformations starts, the intermediate result of the 2nd sub-block does not need to be exported and re-imported; that is, the intermediate result of the 2nd sub-block is reused, the back-and-forth transfer of this intermediate result between the BRAMs inside the FPGA and the off-chip DRAM is avoided, and the number of data movements is reduced. When the new full round begins, the combinations at the start are therefore (2,1) → (1,3) → (1,4) → ... → (1,p) in sequence.
The process of the invention is explained and illustrated below in a specific example.
The specific embodiment is explained with the singular value decomposition of a 1024-row × 1024-column matrix whose elements are single-precision floating-point numbers, i.e., 32 bits wide. The VC707 development board from Xilinx is used; the FPGA model is XC7VX485T-2FFG1761C, containing 1015 blocks of 36 Kb BRAM, of which 4 Kb per BRAM is used for parity, leaving 32 Kb usable. Without a blocking approach, merely storing the whole matrix in internal BRAM would require 1024 blocks of 36 Kb BRAM, exceeding the internal BRAM resources the FPGA can provide. To solve this problem, this embodiment takes 64 column vectors as one sub-block, dividing the original input matrix into p = 1024/64 = 16 sub-blocks; any two sub-blocks are combined into a new 1024-row × 128-column matrix. Each column vector consists of 1024 single-precision floating-point numbers and occupies exactly one BRAM, so a group of 64 column vectors occupies 64 BRAMs; this embodiment uses 3 such groups, 192 BRAMs in total. As shown in FIG. 1 (with m = 1024), the 1024-row × 1024-column matrix is divided into 16 sub-blocks that are combined pairwise in alternation: columns 1 to 64 form the 1st sub-block, columns 65 to 128 the 2nd sub-block, and so on, with columns 961 to 1024 forming the 16th sub-block. The combination rule and order are: (1,2) → (1,3) → (1,4) → ... → (1,16); (2,16) → (3,16) → (4,16) → ... → (15,16); (15,2) → (15,3) → (15,4) → ... and so on until (3,2), for a total of 16 × 15 / 2 = 120 combinations.
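The storage arithmetic above can be checked directly; a small illustrative calculation (not patent text):

```python
# 1024 rows x 32-bit single precision per column vector:
rows, bits_per_element = 1024, 32
bits_per_column = rows * bits_per_element             # 32768 bits = 32 Kb -> fits one 36 Kb BRAM
columns_per_subblock, bram_groups = 64, 3             # two working groups plus one prefetch group
print(bits_per_column, columns_per_subblock * bram_groups)   # 32768, 192 BRAMs
```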
The specific implementation procedure of this embodiment is as follows:
step 1: combining the 1 st sub-block and the 2 nd sub-block to obtain a new matrix of 1024 rows by 128 columns, writing the 1 st sub-block, namely the 1 st column to the 64 th column vector, into a 1 st group BRAM from the off-chip DRAM, and writing the 2 nd sub-block, namely the 65 th column to the 128 th column vector, into a 2 nd group BRAM from the off-chip DRAM, wherein each column vector corresponds to one block BRAM.
Step 2: perform the one-sided Jacobi rotation transformation on the 1024-row × 128-column matrix of step 1, as shown in FIG. 2 (where k = 64), running 127 rounds under the round-robin scheduling mechanism so that data within the sub-block combination are highly reused. After the 127 rounds, the order of the 128 column vectors is exactly the same as before: the updated column vectors 1 to 64 are written back to blocks 1 to 64 of the 1st BRAM group, and the updated column vectors 65 to 128 are written back to blocks 1 to 64 of the 2nd BRAM group. While the one-sided Jacobi rotation transformation is being executed, the column vectors of the next combination, i.e., columns 129 to 192 forming the 3rd sub-block, are written in advance into blocks 1 to 64 of the 3rd BRAM group, achieving a ping-pong working mechanism and improving the parallel, pipelined operating performance of the whole circuit, as shown in FIG. 3 (where k = 64).
Step 3: combine the intermediate results of the 64 column vectors of the 1st sub-block from step 2, i.e., the 1st BRAM group, with the 3rd BRAM group to form a new 1024-row × 128-column matrix, and execute the 127 rounds of one-sided Jacobi rotation transformation in the same way as in step 2; at the same time, write the intermediate results in the 2nd BRAM group from step 2 back to the off-chip DRAM, again achieving the ping-pong working mechanism. The specific arrangement is shown in FIG. 4 (where k = 64): matrix element A'_1,1 denotes the intermediate result of element A_1,1 in row 1, column 1 of the original 1st sub-block after the Jacobi transformation has been performed, and A'_2,k+1 denotes the intermediate result of element A_2,k+1 of the original matrix (row 2, column 1 of the 2nd sub-block) after the Jacobi transformation; the other elements follow the same principle and are not repeated. FIG. 5 shows the ping-pong mechanism among the Jacobi transformation, the BRAMs, and the off-chip DRAM in the whole circuit.
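The column-pair rotation executed in each of the 127 rounds of steps 2 and 3 is not written out in the patent; the sketch below uses the standard Hestenes (one-sided Jacobi) formulas as an assumed software model of one sub-block combination, reusing round_robin_rounds from the sketch after S3 (jacobi_rotate_pair is an illustrative name, and the sketch models only the arithmetic, not the FPGA pipeline or BRAM layout):

```python
import numpy as np

def jacobi_rotate_pair(x, y, eps=1e-12):
    """Standard one-sided (Hestenes) Jacobi rotation of a column pair:
    returns rotated columns whose inner product is (numerically) zero."""
    alpha, beta, gamma = float(x @ x), float(y @ y), float(x @ y)
    if abs(gamma) <= eps * np.sqrt(alpha * beta):
        return x, y                                   # already orthogonal, skip
    tau = (beta - alpha) / (2.0 * gamma)
    t = 1.0 if tau == 0.0 else np.sign(tau) / (abs(tau) + np.sqrt(1.0 + tau * tau))
    c = 1.0 / np.sqrt(1.0 + t * t)
    s = c * t
    return c * x - s * y, s * x + c * y

# One sub-block combination: 127 round-robin rounds over a 1024 x 128 column block.
rng = np.random.default_rng(0)
block = rng.standard_normal((1024, 128)).astype(np.float32)
for round_pairs in round_robin_rounds(128):           # helper from the sketch after S3
    for i, j in round_pairs:
        block[:, i], block[:, j] = jacobi_rotate_pair(block[:, i], block[:, j])
```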
Step 4: proceed in the same way until the intermediate result of the 1st sub-block has been combined with the 16th sub-block and the one-sided Jacobi rotation transformation executed. At this point the intermediate result of the column vectors of the 1st sub-block, i.e., the data in the 1st BRAM group, is written back to the off-chip DRAM for storage, while the intermediate result of the column vectors of the 16th sub-block enters the next combination, with the 2nd sub-block; that is, the sub-block combination (1,16) becomes the sub-block combination (2,16), and operations similar to steps 2 and 3 continue. The intermediate result of sub-block 16 is thus reused, the corresponding off-chip DRAM transfers of that intermediate result are avoided, and the number of data movements is reduced.
Step 5: repeat the preceding operations until the 16 sub-blocks have been alternately combined pairwise in all 120 cases, each combination executing the same kind of operations, until one full round of one-sided Jacobi rotation transformations of the whole large matrix is complete; the order of all sub-blocks is then exactly the same as in the original large matrix, and all intermediate calculation results are written back to the off-chip DRAM for temporary storage. The operations of steps 1 to 5 are called one sweep.
Step 6: repeat all the above steps; executing 8 sweeps on the 1024-row × 1024-column matrix is sufficient for the convergence condition to be met, completing the singular value decomposition of the 1024-row × 1024-column large matrix.
The FPGA synthesis results show that, for the 1024-row × 1024-column single-precision floating-point matrix, 384 BRAMs and 253K LUTs are used on the XC7VX485T-2FFG1761C, and the singular value decomposition completes in 0.690 seconds at a 200 MHz clock. By comparison, the matrix singular value decomposition Solver library published by Xilinx decomposes a 512-row × 512-column real symmetric single-precision floating-point matrix using 128 URAMs plus 307 BRAMs and 65K LUTs on an Alveo U250 accelerator card in 1.687 seconds; scaled by matrix size, this embodiment of the invention achieves nearly a 20-fold improvement in real-time performance.
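Putting the earlier sketches together, the sweep-level flow of steps 1 to 6 can be modeled in software as below. This is a slow, purely illustrative model, not the FPGA pipeline: run_full_sweep and off_diag_ratio are hypothetical helpers built on pairing_schedule, round_robin_rounds, and jacobi_rotate_pair, and the 1e-6 tolerance is an assumption, since the patent only states that 8 sweeps suffice for the 1024 × 1024 case.

```python
def run_full_sweep(A, k=64):
    """One sweep (steps 1-4): every pair of k-column sub-blocks is combined once,
    and each combined 2k-column block runs 2k-1 round-robin Jacobi rounds."""
    p = A.shape[1] // k
    for a, b in pairing_schedule(p):
        cols = list(range((a - 1) * k, a * k)) + list(range((b - 1) * k, b * k))
        for round_pairs in round_robin_rounds(2 * k):
            for i, j in round_pairs:
                ci, cj = cols[i], cols[j]
                A[:, ci], A[:, cj] = jacobi_rotate_pair(A[:, ci], A[:, cj])

def off_diag_ratio(A):
    """Distance of the columns from mutual orthogonality: off-diagonal energy
    of the Gram matrix relative to its diagonal."""
    G = A.T @ A
    return float(np.linalg.norm(G - np.diag(np.diag(G))) / np.linalg.norm(np.diag(G)))

rng = np.random.default_rng(1)
A = rng.standard_normal((1024, 1024)).astype(np.float32)
for sweep in range(8):                       # the embodiment finds 8 sweeps sufficient
    run_full_sweep(A)
    if off_diag_ratio(A) < 1e-6:             # hypothetical tolerance, not specified in the patent
        break
# After convergence, the Euclidean norms of the columns of A are the singular values.
```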
From this embodiment it can be seen that, for an m-row × n-column large matrix, increasing the number of columns hardly increases the consumption of logic and storage resources and only increases the number of pairwise sub-block combinations, while a different number of rows only changes the required depth of the BRAMs, so the method adapts to matrix decompositions with different row counts. The invention can therefore perform singular value decomposition on matrices of arbitrary size.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing example, those skilled in the art may still modify the described embodiments or substitute equivalents for some of their features. All modifications, equivalents, and the like that come within the spirit and principle of the invention are intended to be included within its scope.

Claims (2)

1. An FPGA acceleration method for matrix singular value decomposition, characterized in that the FPGA has 3k BRAMs and the matrix has m rows × n columns, the method comprising the following steps:
S1: dividing the matrix evenly into p = n/k sub-blocks of k column vectors each; if n is not divisible by k, first padding the tail of the matrix with additional column vectors, whose elements are all 0, so that it divides evenly;
S2: combining the 1st sub-block and the 2nd sub-block to obtain a new m-row × 2k-column matrix, and writing its column vectors one-to-one into the corresponding 2k BRAMs of the FPGA, one BRAM per column vector;
S3: performing the one-sided Jacobi rotation transformation on the new m-row × 2k-column matrix, running 2k-1 rounds under a round-robin scheduling mechanism, and pre-writing the k column vectors of the 3rd sub-block into the remaining k BRAMs;
S4: combining the intermediate results of the k column vectors corresponding to the 1st sub-block in S3 with the pre-written 3rd sub-block to obtain a new m-row × 2k-column matrix, and executing 2k-1 rounds of the one-sided Jacobi rotation transformation; at the same time, writing the intermediate result A'_2 of the k column vectors corresponding to the 2nd sub-block in S3 back to the off-chip DRAM, and, once that group of k BRAMs is freed, writing the k column vectors of the 4th sub-block into them;
by analogy, combining the sub-blocks according to the following rule and order: (1,2) → (1,3) → (1,4) → ... → (1,p); (2,p) → (3,p) → (4,p) → ... → (p-1,p); (p-1,2) → (p-1,3) → (p-1,4) → ... → (p-1,p-2); ...; (3,2); when the p sub-blocks have been combined pairwise in all p(p-1)/2 cases, one full round of one-sided Jacobi rotation transformations of the whole matrix is complete;
S5: taking the intermediate result of each sub-block after the one-sided Jacobi rotation transformations as the input of a new full round, and repeating the same combination and calculation operations of S2-S4 until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column matrix is complete.
2. The FPGA acceleration method for matrix singular value decomposition according to claim 1, characterized in that in S4, after the 2k-1 rounds of one-sided Jacobi rotation transformation for the last sub-block combination (3,2) of the whole matrix are completed, the intermediate result of the 3rd sub-block is written back to the off-chip DRAM while the intermediate result of the 2nd sub-block is kept in k on-chip BRAMs, i.e., the intermediate calculation result of the 2nd sub-block is reused.
CN202111083549.8A 2021-09-16 2021-09-16 FPGA acceleration implementation method for matrix singular value decomposition Active CN113536228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083549.8A CN113536228B (en) 2021-09-16 2021-09-16 FPGA acceleration implementation method for matrix singular value decomposition

Publications (2)

Publication Number Publication Date
CN113536228A 2021-10-22
CN113536228B CN113536228B (en) 2021-12-24

Family

ID=78123221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083549.8A Active CN113536228B (en) 2021-09-16 2021-09-16 FPGA acceleration implementation method for matrix singular value decomposition

Country Status (1)

Country Link
CN (1) CN113536228B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105323036A (en) * 2014-08-01 2016-02-10 ***通信集团公司 Method and device for performing singular value decomposition on complex matrix and computing equipment
CN112596701A (en) * 2021-03-05 2021-04-02 之江实验室 FPGA acceleration realization method based on unilateral Jacobian singular value decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xu Qiao et al.: "FPGA-based implementation of singular value decomposition of large matrices", Electronic Measurement Technology *
Ma Yafeng: "Design and implementation of an FPGA-based matrix singular value decomposition acceleration scheme", China Masters' Theses Full-text Database (Electronic Journal) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116170601A (en) * 2023-04-25 2023-05-26 之江实验室 Image compression method based on four-column vector block singular value decomposition
CN116170601B (en) * 2023-04-25 2023-07-11 之江实验室 Image compression method based on four-column vector block singular value decomposition
CN116382617A (en) * 2023-06-07 2023-07-04 之江实验室 Singular value decomposition accelerator with parallel ordering function based on FPGA
CN116382617B (en) * 2023-06-07 2023-08-29 之江实验室 Singular value decomposition accelerator with parallel ordering function based on FPGA

Also Published As

Publication number Publication date
CN113536228B (en) 2021-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant