CN116382617B

CN116382617B - Singular value decomposition accelerator with parallel ordering function based on FPGA

Info

Publication number: CN116382617B
Application number: CN202310669739.0A
Authority: CN
Inventors: 胡塘; 李相迪; 任嵩楠; 闫力; 王跃明
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-06-07
Filing date: 2023-06-07
Publication date: 2023-08-29
Anticipated expiration: 2043-06-07
Also published as: CN116382617A

Abstract

The application discloses an FPGA-based singular value decomposition accelerator with a parallel ordering function, comprising an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, and 2k internal BRAM memory blocks. The k one-sided Jacobi orthogonal transformation circuits generate the norms α and β in parallel and select the form of the rotation matrix J according to the magnitude relation between α and β. Under the control of a round-robin scheduling state machine, the one-sided Jacobi calculation is executed in this way for rounds 1 through k; in rounds k+1 through n-1 the same rule is kept, except that the norm values α and β are exchanged for every column vector pair other than the last, and the corresponding rotation matrices J are replaced by their transposes J^T. The iteration repeats until convergence. The accelerator completes the sorting of the singular values during the singular value decomposition itself, which eliminates the time required by a separate sorting pass, saves the hardware resources otherwise dedicated to sorting, and significantly improves the hardware acceleration.

Description

Singular value decomposition accelerator with parallel ordering function based on FPGA

Technical Field

The application relates to the field of signal processing, in particular to a singular value decomposition accelerator with a parallel ordering function based on an FPGA.

Background

Matrix singular value decomposition is a classical and important technique in signal processing and plays an important role in data dimensionality reduction, hyperspectral image processing, robot positioning and navigation, artificial-intelligence recommendation algorithms, and similar fields. Singular value decomposition projects data onto different subspaces through orthogonal transformations, so that the principal components are extracted effectively and dimensionality reduction is achieved; singular value decomposition operators or accelerators are therefore often integrated into CPU, GPU, AI-processor, and FPGA systems to improve performance. However, singular value decomposition itself is computationally complex, and when the results must be sorted, it is important to arrange the singular values and their corresponding singular vectors in descending order while the decomposition is being completed.

In current singular value decomposition schemes implemented on very large scale integration (VLSI) circuits, most designs, including the matrix algorithm library provided by a certain company, separate the decomposition from the sorting: the singular value decomposition is computed first, and the singular values and singular vectors are sorted afterwards. The sorting operation therefore executes serially after the decomposition, which increases the overall latency, and implementing the sorting function additionally requires dedicated sorting circuitry.

The patent application No. CN201010151981.1 mentions singular value decomposition, sorting the singular values by magnitude, and constructing an image from the first N singular values and their corresponding singular vectors; there, the singular value decomposition and the sorting are performed serially, which requires additional time and computational resources.

The patent application CN2202111040096.0 mentions integrating singular value decomposition operators into the lifting AI processor to improve its performance, including the application of selecting the K largest singular values to approximate the original matrix, but it does not describe how the integrated singular value decomposition operators implement singular value sorting.

Disclosure of Invention

To address the shortcomings of the prior art, the application provides an FPGA-based singular value decomposition accelerator with a parallel ordering function. Building on the classical one-sided Jacobi algorithm, it processes the norms α and β of the column vectors and the corresponding rotation matrix J differently in different rounds, so that the norms of all column vectors tend toward a descending arrangement overall and, after several sweeps, converge to descending order. The square root of each second-order norm is then computed to obtain the corresponding singular value, and each column vector is divided by its singular value to obtain the corresponding left singular vector. The application thus sorts the singular values and singular vectors in parallel with the decomposition, hiding the latency of the sorting operation inside the singular value decomposition process; two steps that were originally performed serially become a single parallel step, which improves overall real-time performance and saves the hardware resources otherwise dedicated to the sorting function.

The aim of the application is achieved by the following technical scheme:

In one aspect, an FPGA-based singular value decomposition accelerator with a parallel ordering function comprises an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k internal BRAM memory blocks; the singular value decomposition accelerator performs singular value decomposition through the following steps:

S1: Matrix data of m rows and n columns is written from the external DDR memory into the BRAMs inside the FPGA through the AXI interface, with each column stored in one BRAM; the n BRAMs are grouped in order, every two adjacent BRAMs forming a pair, giving k = n/2 pairs. Here n is even; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make the count even;
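
The padding and pairing rule of S1 can be sketched in software as follows. This is a minimal illustration, not the hardware data path: the function name `pad_and_pair`, the list-of-columns representation, and the 0-based pair indices are assumptions introduced here.

```python
def pad_and_pair(columns):
    """Pad to an even number of columns with an all-zero column, then pair neighbours.

    `columns` is a list of equal-length columns (one list per matrix column).
    Returns the (possibly padded) columns and the 0-based index pairs.
    """
    cols = [list(c) for c in columns]
    if len(cols) % 2 == 1:
        m = len(cols[0])                 # number of rows
        cols.append([0.0] * m)           # the appended all-zero column
    pairs = [(i, i + 1) for i in range(0, len(cols), 2)]
    return cols, pairs


# A 2-row, 3-column matrix is padded to 4 columns, giving k = 2 pairs.
cols, pairs = pad_and_pair([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Each index pair stands for one pair of BRAMs served by one transformation circuit.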

S2: The k one-sided Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ for each of the k BRAM pairs;
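
As a rough software analogue of what one transformation circuit computes in S2, the sketch below evaluates α, β, and γ for a single column pair. It assumes that the "second-order norms" α and β are the squared 2-norms of the two columns, which is consistent with the singular values later being obtained as their square roots; the function name `pair_statistics` is hypothetical.

```python
def pair_statistics(u, v):
    """Return (alpha, beta, gamma) for one column pair.

    alpha = u.u and beta = v.v (assumed to be the squared 2-norms),
    gamma = u.v (the inner product).
    """
    alpha = sum(x * x for x in u)
    beta = sum(y * y for y in v)
    gamma = sum(x * y for x, y in zip(u, v))
    return alpha, beta, gamma


alpha, beta, gamma = pair_statistics([3.0, 4.0], [1.0, 0.0])
```

In the accelerator, k such computations run side by side, one per BRAM pair.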

S3: If the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix J is generated according to the one-sided Jacobi algorithm [formula not reproduced in the source text]; otherwise, a rotation matrix J is generated as [formula not reproduced in the source text]. The k rotation matrices J_i, i = 1, 2, …, k, corresponding to the k BRAM pairs are generated synchronously;

S4: The k one-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are stored in the n BRAMs;

S5: The column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated for a total of k rounds;

S6: The (k+1)-th round is executed, comprising the following sub-steps:

S6.1: The second-order norms α, β and the inner product γ of the k BRAM pairs are calculated, and the values of the two second-order norms are exchanged within each of BRAM pairs 1 to k-1;

S6.2: The same operation as S3 is performed to synchronously generate the k rotation matrices J_i, i = 1, 2, …, k;

S6.3: The last rotation matrix J_k is kept unchanged, and each of the remaining rotation matrices is replaced by its transpose;

S6.4: The same operation as S4 is performed;

S7: The column vector data stored in the BRAMs is exchanged according to the round-robin scheduling state machine, and the same operations as S6 are executed for rounds k+2 through n-1; completing round n-1 completes one "sweep";

S8: Steps S2 to S7 are repeated, executing several sweeps until the iteration termination condition is met; the singular value decomposition task is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest.

Further, the one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign decision module, a k-th rotation matrix decision module, a k-round decision module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.

Further, the round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, calculating α, β, and γ, exchanging α and β, calculating cos θ and sin θ, generating the rotation matrix J, and writing the Jacobi orthogonal rotation results back to the BRAMs.

Further, the initial column vector index rule is that the column vectors in the lower row have odd indices, namely column vectors 1, 3, 5, …, n-1, and the column vectors in the upper row have even indices, namely column vectors 2, 4, 6, …, n.

Further, in the first k rounds of each sweep, the column vector index in the upper row is always greater than the column vector index in the lower row; in rounds k+1 through n-1 of each sweep, the column vector index in the upper row is always smaller than that in the lower row, except for the last column.

Further, the second-order norms of the lower-row column vectors are denoted α, namely α₁, α₂, α₃, …, α_k; the second-order norms of the upper-row column vectors are denoted β, namely β₁, β₂, β₃, …, β_k.

Further, after S8 has been executed, the second-order norms of the column vectors show an overall descending trend, with local oscillation in the first few sweeps, and finally satisfy α₁ ≥ β₁ ≥ α₂ ≥ β₂ ≥ α₃ ≥ … ≥ α_k ≥ β_k.

Further, cos θ and sin θ are generated according to the following formulas:

[formula not reproduced in the source text]
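
For orientation, the textbook one-sided (Hestenes) Jacobi rotation for a column pair (u, v) with α = u·u, β = v·v, and inner product γ = u·v ≠ 0 is commonly written as below. The auxiliary symbols ζ and t are introduced here for illustration, sgn(0) is taken as +1, and the application's own expressions may differ in form or sign convention.

```latex
\zeta = \frac{\beta - \alpha}{2\gamma}, \qquad
t = \frac{\operatorname{sgn}(\zeta)}{|\zeta| + \sqrt{1 + \zeta^{2}}}, \qquad
\cos\theta = \frac{1}{\sqrt{1 + t^{2}}}, \qquad
\sin\theta = t\,\cos\theta
```

With the update u' = cos θ · u − sin θ · v and v' = sin θ · u + cos θ · v, the new inner product is u'·v' = cos θ sin θ (α − β) + (cos²θ − sin²θ) γ, which vanishes because t satisfies t² + 2ζt − 1 = 0.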

Further, after S8 has been executed and the convergence condition is satisfied, square roots of the second-order norms α₁, β₁, α₂, β₂, α₃, …, α_k, β_k of the column vectors are computed to obtain the corresponding singular values σ₁, σ₂, σ₃, σ₄, σ₅, …, σ_{n-1}, σ_n, with σ₁ ≥ σ₂ ≥ σ₃ ≥ σ₄ ≥ σ₅ ≥ … ≥ σ_{n-1} ≥ σ_n, and the results σ₁, σ₂, σ₃, σ₄, σ₅, …, σ_{n-1}, σ_n are written sequentially to the external DDR memory through the AXI interface.

Further, the column vectors u₁, u₂, u₃, …, u_n that satisfy convergence in S8 are divided by their corresponding singular values σ₁, σ₂, σ₃, …, σ_n to obtain the corresponding left singular vectors u₁/σ₁, u₂/σ₂, u₃/σ₃, …, u_n/σ_n, and the results u₁/σ₁, u₂/σ₂, u₃/σ₃, …, u_n/σ_n are written sequentially to the external DDR memory through the AXI interface.
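
The post-processing described in the two preceding paragraphs (square root, then division) can be sketched as follows. The sketch assumes the second-order norm is the squared 2-norm of a converged column; the name `finalize` and the handling of zero-norm columns (left undivided) are assumptions introduced here, since the text does not say how a zero column is treated.

```python
import math


def finalize(columns):
    """Turn converged columns into singular values and left singular vectors."""
    sigmas, left_vectors = [], []
    for u in columns:
        alpha = sum(x * x for x in u)          # assumed: squared 2-norm
        sigma = math.sqrt(alpha)               # singular value
        sigmas.append(sigma)
        if sigma > 0.0:
            left_vectors.append([x / sigma for x in u])
        else:
            left_vectors.append(list(u))       # assumption: leave a zero column as-is
    return sigmas, left_vectors


sigmas, left_vectors = finalize([[3.0, 4.0], [0.0, 1.0]])
```

If the columns are already in descending-norm order, as the application states they are after convergence, `sigmas` comes out in descending order without any sorting pass.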

The beneficial effects of the application are as follows:

The application is particularly suitable for matrix singular value decomposition implemented on VLSI (including FPGAs). It sorts the singular values and singular vectors in parallel within the cyclic iterative calculation of the singular value decomposition and hides the sorting latency inside the overall decomposition, so that two steps originally executed serially become a single synchronous parallel step. This improves the overall real-time performance of singular value decomposition; in particular, for application scenarios such as image compression and principal component analysis, the larger singular values and their corresponding singular vectors can be extracted more quickly. In addition, the application saves the hardware resources otherwise dedicated to implementing the sorting function.

Drawings

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.

FIG. 1 is a block diagram of a singular value decomposition accelerator with parallel ordering function;

FIG. 2 is a circuit diagram of a detailed control channel and data channel of a singular value decomposition accelerator with parallel ordering function;

FIG. 3 is a schematic diagram of the one-sided Jacobi algorithm for a 512-row by 512-column matrix based on a round-robin state machine;

FIG. 4 is a schematic diagram of the column vector exchange operations during one sweep for a column dimension of 6;

FIG. 5 is a diagram of the magnitude relations among the column vector norms during one sweep for a column dimension of 6;

FIG. 6 is a graph showing the descending trend of some of the column vector norms of a 512-row by 512-column matrix over 5 sweeps.

Detailed Description

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".

First, the technical terms are explained:

(1) VLSI: Very Large Scale Integration, i.e., very large scale integrated circuits.

(2) FPGA: Field Programmable Gate Array.

(3) BRAM: Block RAM, the block RAM inside an FPGA.

(4) Jacobi: in this application, refers to the one-sided Jacobi rotation commonly used for FPGA-based matrix singular value decomposition.

(5) round-robin: polling scheduling, a scheduling mechanism commonly used in one-sided Jacobi rotation singular value decomposition.

(6) DDR SDRAM: Double Data Rate Synchronous Dynamic Random Access Memory, the external DDR storage.

The FPGA-based singular value decomposition accelerator with a parallel ordering function comprises an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k internal BRAM memory blocks; the accelerator performs singular value decomposition through the following steps:

S3: If the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix J is generated according to the one-sided Jacobi algorithm [formula not reproduced in the source text]; otherwise, a rotation matrix J is generated as [formula not reproduced in the source text]. The k rotation matrices J_i, i = 1, 2, …, k, corresponding to the k BRAM pairs are generated synchronously;

Here, cos θ and sin θ are generated according to the following formulas:

[formula not reproduced in the source text]

S4: The k one-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are temporarily stored in the n BRAMs;

S6: The (k+1)-th round is executed, comprising the following sub-steps:

S6.3: The last rotation matrix J_k is kept unchanged, and each remaining rotation matrix J_i is replaced by its transpose J_i^T, i = 1, 2, …, k-1;

S6.4: The same operation as S4 is performed;

S8: Steps S2 to S7 are repeated, executing several sweeps until the iteration termination condition is met; the singular value decomposition task is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest by sequence number.

After S8 has been executed, the second-order norms of the column vectors show an overall descending trend, apart from local oscillation in the first few sweeps, and finally satisfy α₁ ≥ β₁ ≥ α₂ ≥ β₂ ≥ α₃ ≥ … ≥ α_k ≥ β_k.

In this embodiment, after S8 has been executed and the convergence condition is satisfied, square roots of the second-order norms α₁, β₁, α₂, β₂, α₃, …, α_k, β_k of the column vectors stored in the BRAMs are computed to obtain the corresponding singular values σ₁, σ₂, σ₃, σ₄, σ₅, …, σ_{n-1}, σ_n, with σ₁ ≥ σ₂ ≥ σ₃ ≥ σ₄ ≥ σ₅ ≥ … ≥ σ_{n-1} ≥ σ_n, and the results are written sequentially to the external DDR storage through the AXI interface. Further, the column vectors u₁, u₂, u₃, …, u_n that satisfy convergence in S8 may be divided by their corresponding singular values σ₁, σ₂, σ₃, …, σ_n to obtain the corresponding left singular vectors u₁/σ₁, u₂/σ₂, u₃/σ₃, …, u_n/σ_n, and these results are likewise written sequentially to the external DDR storage through the AXI interface.

The one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign decision module, a k-th rotation matrix decision module, a k-round decision module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.

The round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, calculating α, β, and γ, exchanging α and β, calculating cos θ and sin θ, generating the rotation matrix J, and writing the Jacobi orthogonal rotation results back to the BRAMs.

In addition, in the singular value decomposition process of the present application, the initial column vector index rule is that the column vectors in the lower row have odd indices, namely column vectors 1, 3, 5, …, n-1, and the column vectors in the upper row have even indices, namely column vectors 2, 4, 6, …, n. In the first k rounds of each sweep, the column vector index in the upper row is always greater than the column vector index in the lower row; in rounds k+1 through n-1 of each sweep, the column vector index in the upper row is always smaller than that in the lower row, except for the last column. Further, the second-order norms of the lower-row column vectors are denoted α, namely α₁, α₂, α₃, …, α_k; the second-order norms of the upper-row column vectors are denoted β, namely β₁, β₂, β₃, …, β_k.

The method of the present application is explained and illustrated in the following by a specific example.

This embodiment describes the singular value decomposition of a 512-row by 512-column matrix whose elements are single-precision floating-point numbers conforming to the IEEE 754 standard. A Xilinx XC7V690T-3FFG1761 FPGA is selected as the target hardware for deployment and verification. The smallest physical unit of the internal BRAM in this FPGA is an 18 Kb block; a single-precision floating-point column vector of depth 512 occupies exactly one 18 Kb BRAM, so 512 BRAMs are needed in total.
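
The storage figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes that "18 Kb" means 18 × 1024 bits and that a single-precision value occupies 32 bits; the constant names are introduced here for illustration.

```python
import math

BITS_PER_FLOAT32 = 32            # IEEE 754 single precision
COLUMN_DEPTH = 512               # rows per column vector
NUM_COLUMNS = 512                # columns of the matrix
BRAM_CAPACITY_BITS = 18 * 1024   # assumed meaning of "18 Kb"

column_bits = BITS_PER_FLOAT32 * COLUMN_DEPTH                 # bits per column
brams_per_column = math.ceil(column_bits / BRAM_CAPACITY_BITS)
brams_needed = brams_per_column * NUM_COLUMNS
```

One column needs 16,384 bits, which fits in a single 18 Kb block, so one BRAM per column gives the 512 BRAMs stated above.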

The specific implementation process of this embodiment is as follows:

Step 1: Through the AXI interface, the 512-row by 512-column matrix data is read from the external DDR memory and written column by column into the corresponding 512 BRAMs inside the FPGA. Column 1 is written to BRAM 1 and column 2 to BRAM 2; together they form pair 1, the internal storage of the 1# one-sided Jacobi orthogonal transformation circuit. Column 3 is written to BRAM 3 and column 4 to BRAM 4; together they form pair 2, the internal storage of the 2# one-sided Jacobi orthogonal transformation circuit. This continues in the same way until column 511 is written to BRAM 511 and column 512 to BRAM 512; together they form pair 256, the internal storage of the 256# one-sided Jacobi orthogonal transformation circuit, as shown in FIG. 1.

Step 2: The k one-sided Jacobi orthogonal transformation circuits inside the FPGA in FIG. 1 synchronously compute, in parallel, the second-order norms α, β and the inner products γ of the column vectors stored in the 256 BRAM pairs of step 1. That is, α₁ is the second-order norm computed from BRAM 1, β₁ is the second-order norm computed from BRAM 2, and γ₁ is the inner product of the two; α₂ is the second-order norm computed from BRAM 3, β₂ is the second-order norm computed from BRAM 4, and γ₂ is the inner product of the two; and so on, until α₂₅₆ is the second-order norm computed from BRAM 511, β₂₅₆ is the second-order norm computed from BRAM 512, and γ₂₅₆ is the inner product of the two.

Step 3: Taking the 1# one-sided Jacobi orthogonal transformation circuit as an example, cos θ₁ and sin θ₁ are generated according to the one-sided Jacobi algorithm by the following formulas:

[formula not reproduced in the source text]

To realize the parallel ordering function, the output of the α and β comparison circuit is then given the following special treatment:

If α₁ ≥ β₁, then [rotation matrix J₁ formula not reproduced in the source text];

otherwise, [rotation matrix J₁ formula not reproduced in the source text].

By analogy, the 256 rotation matrices J₁, J₂, …, J₂₅₆ are generated synchronously in parallel.

Step 4: The 256 one-sided Jacobi orthogonal transformation circuits synchronously execute the one-sided Jacobi orthogonal rotation calculation. One symbol denotes the current vector of column 1 in round 1 and a second symbol denotes the vector obtained by updating column 1 through the round-1 orthogonal rotation transformation [symbols not reproduced in the source text]. The Jacobi orthogonal rotation calculation is executed for the 1st pair of column vectors, for the 2nd pair of column vectors, and so on up to the last pair [formulas not reproduced in the source text].
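
Because the rotation formulas are not legible in this text, the sketch below shows only the textbook one-sided (Hestenes) Jacobi rotation of a single column pair, which makes the two columns orthogonal. It is not the application's sorting variant of J; the function name `hestenes_rotate` and the treatment of α and β as squared 2-norms are assumptions introduced here.

```python
import math


def hestenes_rotate(u, v):
    """Apply one textbook one-sided Jacobi rotation to the column pair (u, v)."""
    alpha = sum(x * x for x in u)
    beta = sum(y * y for y in v)
    gamma = sum(x * y for x, y in zip(u, v))
    if gamma == 0.0:
        return list(u), list(v)                      # already orthogonal
    zeta = (beta - alpha) / (2.0 * gamma)
    # Smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0.
    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)                 # cos(theta)
    s = c * t                                        # sin(theta)
    new_u = [c * x - s * y for x, y in zip(u, v)]
    new_v = [s * x + c * y for x, y in zip(u, v)]
    return new_u, new_v


new_u, new_v = hestenes_rotate([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The rotation preserves the sum of the two squared norms and drives the inner product of the pair to zero, which is the per-pair work each transformation circuit repeats every round.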

Step 5: FIG. 2 details the control channel and data channel inside the accelerator for one pair of column vectors. The round-robin scheduling state machine is responsible for overall flow control and, according to the current round, controls the data inputs and calculation results of each module. After the 256 pairs of column vectors have undergone the one-sided Jacobi orthogonal rotation transformation, the updated column vectors are exchanged according to the round-robin scheduling mechanism in FIG. 3. Specifically, the last column, u₅₁₂, is fixed, and the other column vectors rotate counter-clockwise concurrently: u₁ passes right to the position of u₃, u₃ passes right to u₅, …, u₅₀₉ passes right to u₅₁₁, u₅₁₁ passes diagonally to u₅₁₀, u₅₁₀ passes left to u₅₀₈, u₅₀₈ passes left to u₅₀₆, …, u₄ passes left to u₂, and u₂ passes down to u₁. In terms of the data scheduling in FIG. 1: the data stored in BRAM_4 is transferred to BRAM_2, the data in BRAM_6 to BRAM_4, and so on; the data stored in BRAM_1 is transferred to BRAM_3, the data in BRAM_3 to BRAM_5, and so on; the data stored in BRAM_2 is transferred to BRAM_1; and the data stored in BRAM_512 remains unchanged. Steps 2 to 4 are repeated, so that the above operations are executed for k = 512/2 = 256 rounds in total; the results are shown in FIG. 3.
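
The exchange pattern of step 5 can be modelled as a permutation of BRAM contents. The sketch below is a software illustration under the assumption that entry i of a list stands for the column currently held in BRAM_(i+1); the names `round_robin_exchange` and `sweep_schedule` are hypothetical.

```python
def round_robin_exchange(slots):
    """One exchange step; slots[i] is the column held in BRAM_(i+1), last slot fixed."""
    n = len(slots)
    if n < 4:
        return list(slots)               # a single pair has nothing to exchange
    new = list(slots)
    for i in range(0, n - 3, 2):         # BRAM_1 -> BRAM_3, BRAM_3 -> BRAM_5, ...
        new[i + 2] = slots[i]
    new[n - 3] = slots[n - 2]            # BRAM_(n-1) -> BRAM_(n-2), the diagonal move
    for i in range(3, n - 2, 2):         # BRAM_4 -> BRAM_2, BRAM_6 -> BRAM_4, ...
        new[i - 2] = slots[i]
    new[0] = slots[1]                    # BRAM_2 -> BRAM_1
    return new


def sweep_schedule(n):
    """List the (lower, upper) column labels of every pair for each of n-1 rounds."""
    slots = list(range(1, n + 1))        # initial labels 1..n, n assumed even
    rounds = []
    for _ in range(n - 1):
        rounds.append([(slots[i], slots[i + 1]) for i in range(0, n, 2)])
        slots = round_robin_exchange(slots)
    return rounds, slots


rounds, final_slots = sweep_schedule(6)
```

For n = 6 the fourth round pairs (u₅, u₃), (u₄, u₁), and (u₂, u₆), every pair of columns meets exactly once in the five rounds, and the columns return to their starting BRAMs after the fifth exchange.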

Step 6: The (k+1)-th round, i.e., round 256 + 1 = 257, is executed, comprising the following sub-steps:

Step 6.1: As in step 2, the second-order norms α, β and the inner product γ of the 256 pairs of column vectors are calculated, but with special processing: the norms α₂₅₆, β₂₅₆ of the last pair of column vectors are left as they are, while the norm values α_i and β_i of each remaining pair are interchanged, i.e., the values are exchanged between α₁ and β₁, between α₂ and β₂, …, and between α₂₅₅ and β₂₅₅. Note that the positions of the column vectors themselves do not change.

Step 6.2: After the special processing of step 6.1, the same operation as step 3 is executed to generate the 256 rotation matrices J₁, J₂, …, J₂₅₆.

Step 6.3: The rotation matrix J₂₅₆ of the last pair of column vectors is kept unchanged, and each remaining rotation matrix J_i (i = 1, 2, …, 255) is replaced by its transpose J_i^T.

Step 6.4: The 256 pairs of column vectors synchronously undergo the one-sided Jacobi orthogonal rotation transformation.

Step 7: and repeatedly executing the step 6 until the 511 th round of operation is completed.

Steps 2 to 7 together are called one sweep. Each time a sweep is executed, the column vector norms show an overall descending trend, i.e., they follow α₁ ≥ β₁ ≥ α₂ ≥ β₂ ≥ α₃ ≥ … ≥ α₂₅₆ ≥ β₂₅₆. In the first few sweeps, isolated oscillations occasionally occur, i.e., for a very small number of column vectors the second-order norms satisfy β_i < α_{i+1} (i = 1, 2, …, 255).
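
The oscillation condition β_i < α_{i+1} is easy to check in software when examining intermediate results. The following sketch is an illustrative diagnostic, not part of the accelerator; the function name `oscillation_indices` and the sample values are made up for the example.

```python
def oscillation_indices(alphas, betas):
    """Return the 1-based indices i for which beta_i < alpha_(i+1)."""
    return [i + 1 for i in range(len(alphas) - 1) if betas[i] < alphas[i + 1]]


# Made-up norms for k = 3 pairs: beta_1 = 7 is smaller than alpha_2 = 8.
violations = oscillation_indices([9.0, 8.0, 3.0], [7.0, 4.0, 1.0])
```

An empty result means no such oscillation remains between neighbouring pairs.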

For ease of description and understanding, a matrix with a column dimension of 6 is used to illustrate the overall process and the resulting ordering of the column vector second-order norms. As shown in FIG. 4, during one sweep the first k rounds are the first 3 steps, shown in the upper half of the figure. In the following rounds k+1 through n-1, i.e., steps 4 and 5 in the lower half of the figure, the index of u₆ is 6, always greater than the other column indices, so the second-order norms α₃, β₃ of the last pair of column vectors are left unchanged, while the second-order norm values of the first 2 pairs must be exchanged, although the column vectors themselves are not exchanged. That is, in step 4, α₁ equals the second-order norm of column vector u₃, β₁ equals that of u₅, α₂ equals that of u₁, and β₂ equals that of u₄; step 5 operates in the same way. FIG. 5 shows how the second-order norms of the column vectors change after the orthogonal transformations and their overall trend. The application's handling of the orthogonal rotation matrix J and the exchange of the second-order norms cause the second-order norms of the column vectors to be arranged in descending order of column index, i.e., the overall trend is α₁ ≥ β₁ ≥ α₂ ≥ β₂ ≥ α₃ ≥ β₃.

Step 8: For the 512-row by 512-column matrix, the preset convergence condition is met after 6 sweeps. At that point, square roots of the column vector norms of the 512 columns are computed in parallel to obtain the singular values σ₁, σ₂, σ₃, …, σ₅₁₂, with σ₁ ≥ σ₂ ≥ σ₃ ≥ … ≥ σ₅₁₂; each column vector is then divided by its corresponding singular value to obtain the left singular vectors u₁/σ₁, u₂/σ₂, u₃/σ₃, …, u₅₁₂/σ₅₁₂, completing the singular value decomposition task. FIG. 6 shows an excerpt of the column vector second-order norm values: after 1 sweep they show an overall descending trend with some local oscillation, and after 4 sweeps they are essentially monotonically decreasing.

Step 9: The singular values in descending order obtained in step 8 are written back to the external DDR storage through the AXI interface.

The FPGA results show that, on the XC7V690T-3FFG1761 target hardware running at a 200 MHz clock, the singular value decomposition of a 512-row by 512-column single-precision floating-point matrix completes in 52.9 milliseconds, with the singular values and singular vectors arranged in descending order. By comparison, according to the results published for the matrix singular value decomposition Solver library of Xilinx, the singular value decomposition of a 512-row by 512-column real symmetric single-precision floating-point matrix takes 1.687 seconds on an Alveo U250 accelerator card, and sorting the singular values and singular vectors additionally requires extra functional circuitry and more time.
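
The two timing figures can be put side by side with simple arithmetic. The sketch only restates the numbers quoted above; the variable names are introduced here, and because the two measurements come from different devices, the ratio is an indicative figure rather than a like-for-like benchmark.

```python
fpga_time_s = 52.9e-3        # reported time on the XC7V690T at 200 MHz
clock_hz = 200e6             # reported clock frequency
reference_time_s = 1.687     # reported Solver-library time on the Alveo U250

cycles = round(fpga_time_s * clock_hz)          # clock cycles for the whole decomposition
time_ratio = reference_time_s / fpga_time_s     # how many times shorter the reported time is
```

This gives about 10.58 million clock cycles for the decomposition and a reported time roughly 31.9 times shorter than the reference figure.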

The embodiments show that, for matrix singular value decomposition implemented on VLSI (including FPGAs), the application's special processing of α_i, β_i and the rotation matrix J arranges the singular values and singular vectors in descending order in parallel with the decomposition, saving the time and hardware cost of a dedicated sorting task. The application therefore improves the real-time performance of singular value decomposition and saves hardware resources.

Corresponding to the embodiment of the FPGA-based singular value decomposition accelerator with the parallel ordering function, the application also provides an embodiment of the FPGA-based singular value decomposition system with the parallel ordering function.

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. An FPGA-based singular value decomposition accelerator with a parallel ordering function, characterized by comprising an external DDR memory, an AXI interface, k single-sided Jacobian orthogonal transformation circuits, a round-robin scheduling mechanism state machine and 2k blocks of internal BRAM memory; the singular value decomposition accelerator performs singular value decomposition by the following steps:

S6: executing the (k+1)-th round, comprising the following sub-steps:

S6.4: performing the same operation as S4;

2. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein the single-sided Jacobian orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign decision module, a k-th round rotation matrix decision module, a less-than-k round decision module, a rotation matrix J generation module, a single-sided Jacobian orthogonal rotation calculation module, and a square root calculation module.

3. The FPGA-based singular value decomposition accelerator with parallel ordering function of claim 1, wherein the round-robin scheduling mechanism state machine controls the generation of the data streams and control streams for each single-sided Jacobian orthogonal transformation circuit, including the reading of BRAM, the computation of α, β and γ, the exchange of α and β, the computation of cos θ and sin θ, the generation of the rotation matrix J, and the write-back of the Jacobian orthogonal rotation results to BRAM.

4. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 1, wherein the initial column vector index rule is that the column vector index of the lower row is odd, respectively column vectors 1, 3, 5 … n-1, and the column vector index of the upper row is even, respectively column vectors 2, 4, 6 … n.

5. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 2, wherein, in the first k rounds of each "sweep", the column vector index of the upper row is always greater than the column vector index of the lower row; in rounds k+1 through n-1 of each "sweep", the column vector index of the upper row is always smaller than the column vector index of the lower row, except for the last column.

6. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 1, wherein the second-order norms of the lower-row column vectors are α, namely α₁, α₂, α₃, …, α_k; the second-order norms of the upper-row column vectors are β, namely β₁, β₂, β₃, …, β_k.

7. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein, after S8 is performed, local oscillation exists in the first few "sweeps" while the second-order norms of the column vectors show an overall descending trend, and α₁ ≥ β₁ ≥ α₂ ≥ β₂ ≥ α₃ ≥ … ≥ α_k ≥ β_k is finally achieved.

8. The FPGA-based singular value decomposition accelerator with parallel ordering function of claim 1, wherein the generation formulas of cos θ and sin θ are as follows:

(formula image not reproduced in the source text)

9. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 1, wherein, after S8 is executed and the convergence condition is satisfied, square roots of the second-order norms α₁, β₁, α₂, β₂, α₃, …, α_k, β_k of the column vectors are computed respectively to obtain the corresponding singular values σ₁, σ₂, σ₃, σ₄, σ₅, …, σ_{n-1}, σ_n, with σ₁ ≥ σ₂ ≥ σ₃ ≥ σ₄ ≥ σ₅ ≥ … ≥ σ_{n-1} ≥ σ_n, and the results σ₁, σ₂, σ₃, σ₄, σ₅, …, σ_{n-1}, σ_n are written sequentially to the external DDR memory through the AXI interface.

10. The FPGA-based singular value decomposition accelerator with parallel ordering according to claim 9, wherein the column vectors u₁, u₂, u₃, …, u_n satisfying convergence in S8 are divided by their corresponding singular values σ₁, σ₂, σ₃, …, σ_n to obtain the corresponding left singular vectors u₁/σ₁, u₂/σ₂, u₃/σ₃, …, u_n/σ_n, and the results u₁/σ₁, u₂/σ₂, u₃/σ₃, …, u_n/σ_n are written sequentially to the external DDR memory through the AXI interface.
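The initial index layout of claim 4 and the interleaved ordering of claims 7 and 9 can be illustrated with a short sketch. It assumes n is even with n = 2k and that α and β hold squared column norms; the function names and sample values are hypothetical and are not part of the claims.

```python
import math

def initial_layout(n):
    """Initial column indices: odd indices in the lower row, even in the upper row."""
    lower = list(range(1, n, 2))        # 1, 3, 5, ..., n-1
    upper = list(range(2, n + 1, 2))    # 2, 4, 6, ..., n
    return lower, upper

def interleaved_singular_values(alphas, betas):
    """Interleave alpha_1, beta_1, alpha_2, beta_2, ... and take square roots."""
    sigmas = []
    for a, b in zip(alphas, betas):
        sigmas.append(math.sqrt(a))
        sigmas.append(math.sqrt(b))
    return sigmas

lower, upper = initial_layout(8)
print(lower, upper)

# Converged squared norms satisfying alpha_1 >= beta_1 >= alpha_2 >= beta_2.
print(interleaved_singular_values([25.0, 9.0], [16.0, 4.0]))
```

Reading the lower and upper rows alternately is what turns the per-circuit pairs (α_i, β_i) into the single descending sequence σ₁, σ₂, …, σ_n that is written back to DDR.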