CN113536228A - FPGA acceleration implementation method for matrix singular value decomposition


Info

Publication number
CN113536228A
Authority
CN
China
Prior art keywords
matrix, sub-block, columns, FPGA
Prior art date
Legal status
Granted
Application number
CN202111083549.8A
Other languages
Chinese (zh)
Other versions
CN113536228B (en)
Inventor
胡塘
李相迪
徐志伟
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111083549.8A
Publication of CN113536228A
Application granted
Publication of CN113536228B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization


Abstract

The invention discloses an FPGA (field programmable gate array) acceleration method for matrix singular value decomposition. A matrix of m rows × n columns stored in off-chip DRAM is first divided evenly into p = n/k sub-blocks, each a group of k column vectors. The sub-blocks are combined pairwise, in an alternating order, into small m-row × 2k-column matrices that are written into BRAM (block RAM) inside the FPGA, where the one-sided Jacobi rotation transformation is performed. Half of the column vectors of the result are written back to the off-chip DRAM, while the other half are combined with the next sub-block to form a new m-row × 2k-column matrix, and the operation is repeated on the FPGA until the p sub-blocks have been combined pairwise and a full round of one-sided Jacobi rotation transformations has been executed. The full round is repeated until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column large matrix is complete. By adopting a divide-and-conquer decomposition strategy and alternating combination of the sub-blocks, the invention improves the data reuse rate, reduces frequent data movement, and relieves the bandwidth pressure of on-chip/off-chip data transfers.

Description

FPGA acceleration implementation method for matrix singular value decomposition
Technical Field
The invention relates to the field of signal processing, in particular to an FPGA (field programmable gate array) acceleration implementation method for matrix singular value decomposition.
Background
Singular value decomposition is an important matrix factorization in linear algebra and is widely used in signal processing, image compression, deep learning, and other fields. Existing research mostly implements matrix singular value decomposition in software on a CPU or GPU. With the rapid development of FPGA technology in recent years, implementing matrix singular value decomposition on an FPGA has gradually become a popular approach, especially in application scenarios where an FPGA is already deployed: replacing the GPU with an FPGA-based singular value decomposition reduces cost and power consumption, and compared with a CPU implementation it delivers real-time, low-latency performance.
However, singular value decomposition involves a large number of mathematical operations and loop iterations. For large matrices in particular, the problem is distinctly compute-intensive and storage-intensive, placing harsh demands on the computation resources, logic resources, BRAM capacity, and on-chip/off-chip transfer bandwidth of the FPGA, and making FPGA development unusually difficult and laborious. In published research and inventions, owing to the limited resources of the FPGA and the complexity of singular value decomposition itself, FPGA-based singular value decompositions generally handle only small matrices, offer poor real-time performance, and support only fixed-size matrix inputs.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an FPGA acceleration method for matrix singular value decomposition. Based on the one-sided Jacobi algorithm, it adopts a divide-and-conquer strategy that partitions the input large matrix into blocks of column vectors, converting the originally unwieldy large-matrix singular value decomposition into a series of small-matrix decompositions that are comparatively easy to implement, which significantly reduces the FPGA development difficulty and the harsh demands on FPGA resources.
The purpose of the invention is realized by the following technical scheme:
An FPGA acceleration method for matrix singular value decomposition, wherein the FPGA has 3k BRAMs and the matrix has m rows × n columns, comprising the following steps:
S1: divide the matrix evenly into p = n/k sub-blocks of k column vectors each; if n is not divisible by k, first pad the tail of the matrix with additional column vectors, whose elements are all 0, so that it divides evenly;
S2: combine the 1st sub-block and the 2nd sub-block into a new m-row × 2k-column matrix and write its column vectors one-to-one into the corresponding 2k BRAMs of the FPGA, one BRAM per column vector;
S3: perform the one-sided Jacobi rotation transformation on the new m-row × 2k-column matrix, running 2k-1 rounds under a round-robin scheduling mechanism, and pre-write the k column vectors of the 3rd sub-block into the remaining k BRAMs;
S4: combine the intermediate results of the k column vectors corresponding to the 1st sub-block in S3 with the pre-written 3rd sub-block to form a new m-row × 2k-column matrix and execute 2k-1 rounds of the one-sided Jacobi rotation transformation; at the same time, write the intermediate result A'_2 of the k column vectors corresponding to the 2nd sub-block in S3 back to the off-chip DRAM, and, once that group of k BRAMs is freed, write the k column vectors of the 4th sub-block into them;
by analogy, the sub-blocks are combined according to the following rule and order: (1,2) → (1,3) → (1,4) → ... → (1,p); (2,p) → (3,p) → (4,p) → ... → (p-1,p); (p-1,2) → (p-1,3) → (p-1,4) → ... → (p-1,p-2); ...; (3,2); when the p sub-blocks have been combined pairwise in all p(p-1)/2 cases, one full round of one-sided Jacobi rotation transformations of the whole matrix is complete (an illustrative software sketch of this combination order is given after these steps);
S5: take the intermediate result of each sub-block after the one-sided Jacobi rotation transformations as the input of a new full round, and repeat the same combination and calculation operations of S2-S4 until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column matrix is complete.
Further, in S4, after the 2k-1 rounds of one-sided Jacobi rotation transformation for the last sub-block combination (3,2) of the whole matrix are completed, the intermediate result of the 3rd sub-block is written back to the off-chip DRAM while the intermediate result of the 2nd sub-block is kept in k on-chip BRAMs, i.e., the intermediate calculation result of the 2nd sub-block is reused.
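For illustration only, and not as part of the claimed method, the combination rule and order above can be reproduced in software. The following Python sketch (the helper name pairing_schedule and the set-based bookkeeping are assumptions of this sketch, not taken from the patent) enumerates the p(p-1)/2 sub-block pairs in the stated order: the sub-block whose intermediate result stays on chip, here called the pivot, meets every sub-block it has not yet met, and its last partner becomes the next pivot.

```python
def pairing_schedule(p):
    """Enumerate the p*(p-1)/2 sub-block combinations in the order stated in S4:
    the on-chip "pivot" block meets every block it has not met yet; its last
    partner then becomes the next pivot."""
    remaining = {frozenset((i, j)) for i in range(1, p + 1) for j in range(i + 1, p + 1)}
    schedule, pivot = [], 1
    while remaining:
        partners = [j for j in range(1, p + 1)
                    if j != pivot and frozenset((pivot, j)) in remaining]
        for j in partners:
            schedule.append((pivot, j))            # pairs are unordered: (16, 2) means (2, 16)
            remaining.discard(frozenset((pivot, j)))
        pivot = partners[-1]                       # last partner stays in BRAM as the new pivot
    return schedule

sched = pairing_schedule(16)
assert len(sched) == 16 * 15 // 2                  # 120 combinations for p = 16
assert sched[0] == (1, 2) and sched[-1] == (3, 2)
```

For p = 16 this reproduces the 120 combinations listed in the embodiment below.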
The invention has the following beneficial effects:
the method is particularly suitable for large-size matrixes, the large-size matrixes with m rows and n columns are averagely divided into p = n/k sub-blocks by taking each k column vector as a group, the sub-blocks are combined pairwise to form small-size matrixes with m rows and 2k columns, the original abnormal and complicated singular value decomposition is remarkably simplified by the divide-and-conquer strategy, the singular value decomposition of matrixes with any size can be supported, the combination rule among the sub-blocks is simple and clear, and the FPGA development and implementation are easy; when any two groups of BRAMs are used for Jacobi rotation conversion operation, the rest BRAMs are used for read-write caching with an off-chip DRAM to form a ping-pong structure working mechanism, so that the parallel efficiency of the whole circuit can be improved, and the working capacity of a production line can be improved; unilateral Jacobi rotation transformation is executed in a round-robin mode in the sub-block combination, so that the data reuse rate is improved, data reuse is further improved in an alternate combination mode among the sub-block combinations, the carrying amount of intermediate calculation results is reduced, and the requirements of high-bandwidth transmission inside and outside the chip are reduced.
Drawings
FIG. 1 is a schematic diagram of a 1024-row × 1024-column matrix divided into 16 sub-blocks and of their pairwise combination;
FIG. 2 is a schematic diagram of the 2k-1 rounds of one-sided Jacobi rotation transformation of the matrix formed by sub-blocks 1 and 2;
FIG. 3 is a schematic diagram of the relationship between sub-blocks 1, 2, 3 and the off-chip DRAM;
FIG. 4 is a schematic diagram of the relationship between the calculation results of sub-blocks 1 and 2, sub-block 3, and the off-chip DRAM;
FIG. 5 is a schematic diagram of the ping-pong operation among the Jacobi transformation, the BRAMs, and the off-chip DRAM.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and preferred embodiments so that its objects and effects become more apparent. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
First, technical term explanations are given:
(1) FPGA: field Programmable Gate Array
(2) BRAM: Block RAM, the block RAM inside the FPGA
(3) Jacobi: in this invention, specifically the one-sided Jacobi rotation commonly used in FPGA-based matrix singular value decomposition
(4) round-robin: the round-robin scheduling mechanism commonly used in singular value decomposition by one-sided Jacobi rotation
(5) DRAM: Dynamic Random Access Memory, here specifically off-chip DRAM such as DDR3 (or DDR4) SDRAM.
The FPGA acceleration method for matrix singular value decomposition requires the FPGA to provide 3k BRAMs; with the matrix to be decomposed defined as m rows × n columns, the method comprises the following steps:
S1: divide the matrix evenly into p = n/k sub-blocks of k column vectors each; if n is not divisible by k, first pad the tail of the matrix with additional column vectors, whose elements are all 0, so that it divides evenly.
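As a software illustration of S1 only (the function name pad_to_multiple is a hypothetical choice, not patent terminology), a minimal NumPy sketch of the zero-padding:

```python
import numpy as np

def pad_to_multiple(A, k=64):
    """Append all-zero column vectors so the column count divides evenly into k-column sub-blocks."""
    pad = (-A.shape[1]) % k
    if pad == 0:
        return A
    return np.hstack([A, np.zeros((A.shape[0], pad), dtype=A.dtype)])

A = pad_to_multiple(np.ones((1024, 1000), dtype=np.float32))   # 1000 -> 1024 columns, p = 16
```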
S2: combine sub-block 1 and sub-block 2 into a new m-row × 2k-column matrix and write its column vectors one-to-one into the corresponding 2k BRAMs of the FPGA, one BRAM per column vector.
S3: perform the one-sided Jacobi rotation transformation on the new m-row × 2k-column matrix, running 2k-1 rounds under a round-robin scheduling mechanism, and pre-write the k column vectors of the 3rd sub-block into the remaining k BRAMs.
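The patent requires 2k-1 rounds under a round-robin mechanism but does not spell out the exact rotation pattern; the sketch below uses the standard circle-method tournament schedule as one plausible realization (round_robin_rounds is an illustrative name, not patent terminology). In each round the 2k columns form k disjoint pairs, so the k pairs can be processed in parallel, and after 2k-1 rounds every pair of columns has met exactly once.

```python
def round_robin_rounds(num_cols):
    """Circle-method tournament schedule: num_cols (even) columns are grouped into
    num_cols//2 disjoint pairs per round; after num_cols-1 rounds every pair of
    columns has met exactly once."""
    assert num_cols % 2 == 0
    cols = list(range(num_cols))
    rounds = []
    for _ in range(num_cols - 1):
        rounds.append([(cols[i], cols[num_cols - 1 - i]) for i in range(num_cols // 2)])
        cols = [cols[0]] + [cols[-1]] + cols[1:-1]   # keep one column fixed, rotate the rest
    return rounds

rounds = round_robin_rounds(128)                     # 2k = 128 columns -> 127 rounds of 64 pairs
assert len(rounds) == 127 and all(len(r) == 64 for r in rounds)
```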
S4: combine the intermediate results of the k column vectors corresponding to the 1st sub-block in S3 with the pre-written 3rd sub-block to form a new m-row × 2k-column matrix and execute 2k-1 rounds of the one-sided Jacobi rotation transformation; at the same time, write the intermediate result A'_2 of the k column vectors corresponding to the 2nd sub-block in S3 back to the off-chip DRAM, and, once that group of k BRAMs is freed, write the k column vectors of the 4th sub-block into them;
by analogy, following the combination rule and order (1,2) → (1,3) → (1,4) → ... → (1,p); (2,p) → (3,p) → (4,p) → ... → (p-1,p); (p-1,2) → (p-1,3) → (p-1,4) → ... → (p-1,p-2); ...; (3,2), the p sub-blocks are combined pairwise in all p(p-1)/2 cases, completing one full round of one-sided Jacobi rotation transformations of the entire matrix.
Define the i-th sub-block as A_i, and let A'_i denote the intermediate result of its k column vectors after A_i has undergone Jacobi rotation transformations with other sub-blocks. The combination and read/write rules can then be expressed more generally as follows:
A'_1 is combined with the pre-written A_i (i ≥ 3) to form a new m-row × 2k-column matrix, and 2k-1 rounds of the one-sided Jacobi rotation transformation are executed; at the same time, the intermediate result A'_(i-1) of the previous sub-block is written back to the off-chip DRAM, and once that group of k BRAMs is freed, the next sub-block A_(i+1) is written into them. This continues until the 2k-1 rounds of one-sided Jacobi rotation transformation of A'_1 and A_p are completed, at which point sub-block A_1 has been combined and computed with every other sub-block; A'_p is kept unchanged in its BRAMs inside the FPGA.
The intermediate result A'_1 is then written back to the off-chip DRAM; once that group of k BRAMs is freed, A'_2 is written from the off-chip DRAM into them and combined with A'_p, and the above process is repeated until A'_p has been combined and computed with all the remaining sub-blocks.
In the same way, the intermediate result of one sub-block is kept on chip while the sub-blocks that have not yet been combined and computed with it are written in sequence, and the 2k-1 rounds of one-sided Jacobi rotation transformation are executed, until the p sub-blocks have been combined pairwise in all p(p-1)/2 cases and one full round of one-sided Jacobi rotation transformation calculations for the entire matrix is complete.
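The effect of these rules on off-chip traffic can be illustrated with a small software model (not patent text). It reuses the pairing_schedule sketch given after the steps in the Disclosure section and counts one DRAM read and one DRAM write per k-column sub-block transfer:

```python
def simulate_dram_traffic(p):
    """Count sub-block transfers implied by the pivot-based combination order:
    each combination loads only the new partner from DRAM and writes back only
    the previous partner, because the pivot's intermediate result stays in BRAM."""
    schedule = pairing_schedule(p)                  # helper from the earlier sketch
    reads, writes = 2, 0                            # first combination loads both sub-blocks
    for prev, cur in zip(schedule, schedule[1:]):
        keep = set(prev) & set(cur)                 # the pivot carried over on chip
        writes += len(set(prev) - keep)             # previous partner written back to DRAM
        reads += len(set(cur) - keep)               # next partner prefetched from DRAM
    writes += 2                                     # both resident blocks written back at the end
    return reads, writes

print(simulate_dram_traffic(16))                    # (121, 121) for p = 16
```

For p = 16 this gives roughly 121 sub-block reads and 121 write-backs per full round, versus 240 of each if no intermediate result were kept on chip; the refinement described below (and in claim 2) additionally keeps the 2nd sub-block's result on chip across full rounds.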
S5: take the intermediate result of each sub-block after the one-sided Jacobi rotation transformations as the input of a new full round, and repeat the same combination and calculation operations of S2-S4 until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column matrix is complete.
In addition, after S4 is completed, the intermediate results of the k column vectors of the 3rd and 2nd sub-blocks could both be written back to the off-chip DRAM, with the write, combine, and calculate operations restarted in the next full round at S5. However, to further improve data reuse and reduce the movement of intermediate results, only the intermediate result of the k column vectors of the 3rd sub-block is written back to the off-chip DRAM, while the intermediate result of the k column vectors of the 2nd sub-block is kept in k on-chip BRAMs. When the new full round of one-sided Jacobi rotation transformations starts, the intermediate result of the 2nd sub-block does not need to be exported and re-imported; that is, the intermediate result of the 2nd sub-block is reused, the back-and-forth transfer of this intermediate result between the BRAMs inside the FPGA and the off-chip DRAM is avoided, and the number of data movements is reduced. When the new full round begins, the combinations at the start are therefore (2,1) → (1,3) → (1,4) → ... → (1,p) in sequence.
The process of the invention is explained and illustrated below in a specific example.
The specific embodiment is explained with the singular value decomposition of a 1024-row × 1024-column matrix whose elements are single-precision floating-point numbers, i.e., 32 bits wide. The VC707 development board from Xilinx is used; the FPGA model is XC7VX485T-2FFG1761C, containing 1015 blocks of 36 Kb BRAM, of which 4 Kb per BRAM is used for parity, leaving 32 Kb usable. Without a blocking approach, merely storing the whole matrix in internal BRAM would require 1024 blocks of 36 Kb BRAM, exceeding the internal BRAM resources the FPGA can provide. To solve this problem, this embodiment takes 64 column vectors as one sub-block, dividing the original input matrix into p = 1024/64 = 16 sub-blocks; any two sub-blocks are combined into a new 1024-row × 128-column matrix. Each column vector consists of 1024 single-precision floating-point numbers and occupies exactly one BRAM, so a group of 64 column vectors occupies 64 BRAMs; this embodiment uses 3 such groups, 192 BRAMs in total. As shown in FIG. 1 (with m = 1024), the 1024-row × 1024-column matrix is divided into 16 sub-blocks that are combined pairwise in alternation: columns 1 to 64 form the 1st sub-block, columns 65 to 128 the 2nd sub-block, and so on, with columns 961 to 1024 forming the 16th sub-block. The combination rule and order are: (1,2) → (1,3) → (1,4) → ... → (1,16); (2,16) → (3,16) → (4,16) → ... → (15,16); (15,2) → (15,3) → (15,4) → ... and so on until (3,2), for a total of 16 × 15 / 2 = 120 combinations.
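The storage arithmetic above can be checked directly; a small illustrative calculation (not patent text):

```python
# 1024 rows x 32-bit single precision per column vector:
rows, bits_per_element = 1024, 32
bits_per_column = rows * bits_per_element             # 32768 bits = 32 Kb -> fits one 36 Kb BRAM
columns_per_subblock, bram_groups = 64, 3             # two working groups plus one prefetch group
print(bits_per_column, columns_per_subblock * bram_groups)   # 32768, 192 BRAMs
```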
The specific implementation procedure of this embodiment is as follows:
step 1: combining the 1 st sub-block and the 2 nd sub-block to obtain a new matrix of 1024 rows by 128 columns, writing the 1 st sub-block, namely the 1 st column to the 64 th column vector, into a 1 st group BRAM from the off-chip DRAM, and writing the 2 nd sub-block, namely the 65 th column to the 128 th column vector, into a 2 nd group BRAM from the off-chip DRAM, wherein each column vector corresponds to one block BRAM.
Step 2: perform the one-sided Jacobi rotation transformation on the 1024-row × 128-column matrix of step 1, as shown in FIG. 2 (where k = 64), running 127 rounds under the round-robin scheduling mechanism so that data within the sub-block combination are highly reused. After the 127 rounds, the order of the 128 column vectors is exactly the same as before: the updated column vectors 1 to 64 are written back to blocks 1 to 64 of the 1st BRAM group, and the updated column vectors 65 to 128 are written back to blocks 1 to 64 of the 2nd BRAM group. While the one-sided Jacobi rotation transformation is being executed, the column vectors of the next combination, i.e., columns 129 to 192 forming the 3rd sub-block, are written in advance into blocks 1 to 64 of the 3rd BRAM group, achieving a ping-pong working mechanism and improving the parallel, pipelined operating performance of the whole circuit, as shown in FIG. 3 (where k = 64).
Step 3: combine the intermediate results of the 64 column vectors of the 1st sub-block from step 2, i.e., the 1st BRAM group, with the 3rd BRAM group to form a new 1024-row × 128-column matrix, and execute the 127 rounds of one-sided Jacobi rotation transformation in the same way as in step 2; at the same time, write the intermediate results in the 2nd BRAM group from step 2 back to the off-chip DRAM, again achieving the ping-pong working mechanism. The specific arrangement is shown in FIG. 4 (where k = 64): matrix element A'_1,1 denotes the intermediate result of element A_1,1 in row 1, column 1 of the original 1st sub-block after the Jacobi transformation has been performed, and A'_2,k+1 denotes the intermediate result of element A_2,k+1 of the original matrix (row 2, column 1 of the 2nd sub-block) after the Jacobi transformation; the other elements follow the same principle and are not repeated. FIG. 5 shows the ping-pong mechanism among the Jacobi transformation, the BRAMs, and the off-chip DRAM in the whole circuit.
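The column-pair rotation executed in each of the 127 rounds of steps 2 and 3 is not written out in the patent; the sketch below uses the standard Hestenes (one-sided Jacobi) formulas as an assumed software model of one sub-block combination, reusing round_robin_rounds from the sketch after S3 (jacobi_rotate_pair is an illustrative name, and the sketch models only the arithmetic, not the FPGA pipeline or BRAM layout):

```python
import numpy as np

def jacobi_rotate_pair(x, y, eps=1e-12):
    """Standard one-sided (Hestenes) Jacobi rotation of a column pair:
    returns rotated columns whose inner product is (numerically) zero."""
    alpha, beta, gamma = float(x @ x), float(y @ y), float(x @ y)
    if abs(gamma) <= eps * np.sqrt(alpha * beta):
        return x, y                                   # already orthogonal, skip
    tau = (beta - alpha) / (2.0 * gamma)
    t = 1.0 if tau == 0.0 else np.sign(tau) / (abs(tau) + np.sqrt(1.0 + tau * tau))
    c = 1.0 / np.sqrt(1.0 + t * t)
    s = c * t
    return c * x - s * y, s * x + c * y

# One sub-block combination: 127 round-robin rounds over a 1024 x 128 column block.
rng = np.random.default_rng(0)
block = rng.standard_normal((1024, 128)).astype(np.float32)
for round_pairs in round_robin_rounds(128):           # helper from the sketch after S3
    for i, j in round_pairs:
        block[:, i], block[:, j] = jacobi_rotate_pair(block[:, i], block[:, j])
```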
Step 4: proceed in the same way until the intermediate result of the 1st sub-block has been combined with the 16th sub-block and the one-sided Jacobi rotation transformation executed. At this point the intermediate result of the column vectors of the 1st sub-block, i.e., the data in the 1st BRAM group, is written back to the off-chip DRAM for storage, while the intermediate result of the column vectors of the 16th sub-block enters the next combination, with the 2nd sub-block; that is, the sub-block combination (1,16) becomes the sub-block combination (2,16), and operations similar to steps 2 and 3 continue. The intermediate result of sub-block 16 is thus reused, the corresponding off-chip DRAM transfers of that intermediate result are avoided, and the number of data movements is reduced.
Step 5: repeat the preceding operations until the 16 sub-blocks have been alternately combined pairwise in all 120 cases, each combination executing the same kind of operations, until one full round of one-sided Jacobi rotation transformations of the whole large matrix is complete; the order of all sub-blocks is then exactly the same as in the original large matrix, and all intermediate calculation results are written back to the off-chip DRAM for temporary storage. The operations of steps 1 to 5 are called one sweep.
Step 6: repeat all the above steps; executing 8 sweeps on the 1024-row × 1024-column matrix is sufficient for the convergence condition to be met, completing the singular value decomposition of the 1024-row × 1024-column large matrix.
The FPGA synthesis results show that, for the 1024-row × 1024-column single-precision floating-point matrix, 384 BRAMs and 253K LUTs are used on the XC7VX485T-2FFG1761C, and the singular value decomposition completes in 0.690 seconds at a 200 MHz clock. By comparison, the matrix singular value decomposition Solver library published by Xilinx decomposes a 512-row × 512-column real symmetric single-precision floating-point matrix using 128 URAMs plus 307 BRAMs and 65K LUTs on an Alveo U250 accelerator card in 1.687 seconds; scaled by matrix size, this embodiment of the invention achieves nearly a 20-fold improvement in real-time performance.
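Putting the earlier sketches together, the sweep-level flow of steps 1 to 6 can be modeled in software as below. This is a slow, purely illustrative model, not the FPGA pipeline: run_full_sweep and off_diag_ratio are hypothetical helpers built on pairing_schedule, round_robin_rounds, and jacobi_rotate_pair, and the 1e-6 tolerance is an assumption, since the patent only states that 8 sweeps suffice for the 1024 × 1024 case.

```python
def run_full_sweep(A, k=64):
    """One sweep (steps 1-4): every pair of k-column sub-blocks is combined once,
    and each combined 2k-column block runs 2k-1 round-robin Jacobi rounds."""
    p = A.shape[1] // k
    for a, b in pairing_schedule(p):
        cols = list(range((a - 1) * k, a * k)) + list(range((b - 1) * k, b * k))
        for round_pairs in round_robin_rounds(2 * k):
            for i, j in round_pairs:
                ci, cj = cols[i], cols[j]
                A[:, ci], A[:, cj] = jacobi_rotate_pair(A[:, ci], A[:, cj])

def off_diag_ratio(A):
    """Distance of the columns from mutual orthogonality: off-diagonal energy
    of the Gram matrix relative to its diagonal."""
    G = A.T @ A
    return float(np.linalg.norm(G - np.diag(np.diag(G))) / np.linalg.norm(np.diag(G)))

rng = np.random.default_rng(1)
A = rng.standard_normal((1024, 1024)).astype(np.float32)
for sweep in range(8):                       # the embodiment finds 8 sweeps sufficient
    run_full_sweep(A)
    if off_diag_ratio(A) < 1e-6:             # hypothetical tolerance, not specified in the patent
        break
# After convergence, the Euclidean norms of the columns of A are the singular values.
```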
From this embodiment it can be seen that, for an m-row × n-column large matrix, increasing the number of columns hardly increases the consumption of logic and storage resources and only increases the number of pairwise sub-block combinations, while a different number of rows only changes the required depth of the BRAMs, so the method adapts to matrix decompositions with different row counts. The invention can therefore perform singular value decomposition on matrices of arbitrary size.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing example, those skilled in the art may still modify the described embodiments or substitute equivalents for some of their features. All modifications, equivalents, and the like that come within the spirit and principle of the invention are intended to be included within its scope.

Claims (2)

1. An FPGA acceleration method for matrix singular value decomposition, characterized in that the FPGA has 3k BRAMs and the matrix has m rows × n columns, the method comprising the following steps:
S1: dividing the matrix evenly into p = n/k sub-blocks of k column vectors each; if n is not divisible by k, first padding the tail of the matrix with additional column vectors, whose elements are all 0, so that it divides evenly;
S2: combining the 1st sub-block and the 2nd sub-block to obtain a new m-row × 2k-column matrix, and writing its column vectors one-to-one into the corresponding 2k BRAMs of the FPGA, one BRAM per column vector;
S3: performing the one-sided Jacobi rotation transformation on the new m-row × 2k-column matrix, running 2k-1 rounds under a round-robin scheduling mechanism, and pre-writing the k column vectors of the 3rd sub-block into the remaining k BRAMs;
S4: combining the intermediate results of the k column vectors corresponding to the 1st sub-block in S3 with the pre-written 3rd sub-block to obtain a new m-row × 2k-column matrix, and executing 2k-1 rounds of the one-sided Jacobi rotation transformation; at the same time, writing the intermediate result A'_2 of the k column vectors corresponding to the 2nd sub-block in S3 back to the off-chip DRAM, and, once that group of k BRAMs is freed, writing the k column vectors of the 4th sub-block into them;
by analogy, combining the sub-blocks according to the following rule and order: (1,2) → (1,3) → (1,4) → ... → (1,p); (2,p) → (3,p) → (4,p) → ... → (p-1,p); (p-1,2) → (p-1,3) → (p-1,4) → ... → (p-1,p-2); ...; (3,2); when the p sub-blocks have been combined pairwise in all p(p-1)/2 cases, one full round of one-sided Jacobi rotation transformations of the whole matrix is complete;
S5: taking the intermediate result of each sub-block after the one-sided Jacobi rotation transformations as the input of a new full round, and repeating the same combination and calculation operations of S2-S4 until the convergence condition is met, at which point the singular value decomposition of the m-row × n-column matrix is complete.
2. The FPGA acceleration method for matrix singular value decomposition according to claim 1, characterized in that in S4, after the 2k-1 rounds of one-sided Jacobi rotation transformation for the last sub-block combination (3,2) of the whole matrix are completed, the intermediate result of the 3rd sub-block is written back to the off-chip DRAM while the intermediate result of the 2nd sub-block is kept in k on-chip BRAMs, i.e., the intermediate calculation result of the 2nd sub-block is reused.
CN202111083549.8A 2021-09-16 2021-09-16 FPGA acceleration implementation method for matrix singular value decomposition Active CN113536228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083549.8A CN113536228B (en) 2021-09-16 2021-09-16 FPGA acceleration implementation method for matrix singular value decomposition

Publications (2)

Publication Number Publication Date
CN113536228A 2021-10-22
CN113536228B CN113536228B (en) 2021-12-24

Family

ID=78123221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083549.8A Active CN113536228B (en) 2021-09-16 2021-09-16 FPGA acceleration implementation method for matrix singular value decomposition

Country Status (1)

Country Link
CN (1) CN113536228B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105323036A (en) * 2014-08-01 2016-02-10 ***通信集团公司 Method and device for performing singular value decomposition on complex matrix and computing equipment
CN112596701A (en) * 2021-03-05 2021-04-02 之江实验室 FPGA acceleration realization method based on unilateral Jacobian singular value decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xu Qiao et al.: "FPGA-based implementation of singular value decomposition of large matrices", Electronic Measurement Technology *
Ma Yafeng: "Design and implementation of an FPGA-based matrix singular value decomposition acceleration scheme", China Masters' Theses Full-text Database (Electronic Journal) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116170601A (en) * 2023-04-25 2023-05-26 之江实验室 Image compression method based on four-column vector block singular value decomposition
CN116170601B (en) * 2023-04-25 2023-07-11 之江实验室 Image compression method based on four-column vector block singular value decomposition
CN116382617A (en) * 2023-06-07 2023-07-04 之江实验室 Singular value decomposition accelerator with parallel ordering function based on FPGA
CN116382617B (en) * 2023-06-07 2023-08-29 之江实验室 Singular value decomposition accelerator with parallel ordering function based on FPGA

Also Published As

Publication number Publication date
CN113536228B (en) 2021-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant