CN116382617B - Singular value decomposition accelerator with parallel ordering function based on FPGA - Google Patents

Info

Publication number
CN116382617B
Authority
CN
China
Prior art keywords
singular value
value decomposition
column
fpga
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310669739.0A
Other languages
Chinese (zh)
Other versions
CN116382617A (en
Inventor
胡塘
李相迪
任嵩楠
闫力
王跃明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310669739.0A
Publication of CN116382617A
Application granted
Publication of CN116382617B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/08 Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821 Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/20 Comparing separate sets of record carriers arranged in the same sequence to determine whether at least some of the data in one set is identical with that in the other set or sets
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The application discloses an FPGA-based singular value decomposition accelerator with a parallel sorting function, comprising an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits and 2k internal BRAM blocks. The k one-sided Jacobi orthogonal transformation circuits compute the norms α and β in parallel, select the form of the rotation matrix J according to the magnitude relation between α and β, and execute the one-sided Jacobi calculation for rounds 1 through k under the control of a round-robin scheduling state machine. Rounds k+1 through n-1 keep the same rule, except that α and β are exchanged for every column-vector pair other than the last, and the corresponding rotation matrix J is replaced by its transpose $J^T$. The iteration is repeated until convergence. The accelerator thus completes singular value sorting synchronously with the decomposition itself, eliminating the time needed for a separate sorting pass, saving the hardware resources otherwise dedicated to sorting, and significantly improving the hardware acceleration effect.

Description

Singular value decomposition accelerator with parallel ordering function based on FPGA
Technical Field
The application relates to the field of signal processing, in particular to a singular value decomposition accelerator with a parallel ordering function based on an FPGA.
Background
Matrix singular value decomposition is a classical and important technique in the field of signal processing, and plays an important role in data dimensionality reduction, hyperspectral image processing, robot positioning and navigation, artificial intelligence recommendation algorithms and the like. Singular value decomposition projects a matrix onto different subspaces through orthogonal transformations, so that the principal components can be extracted effectively to achieve dimensionality reduction; singular value decomposition operators or accelerators are therefore often integrated into CPUs, GPUs, AI processors and FPGA systems to improve performance. However, singular value decomposition itself involves complex computation, and an important question is how to arrange the singular values and their corresponding singular vectors in descending order while the decomposition is being computed.
In current singular value decomposition schemes implemented on very-large-scale integrated circuits (VLSI), most designs, including a matrix algorithm library provided by a certain company, separate the decomposition and sorting processes: the singular value decomposition is computed first, and the singular values and singular vectors are sorted afterwards. The sorting operation and the decomposition therefore execute serially, which increases overall latency, and a dedicated sorting circuit consumes additional hardware resources.
Patent application CN201010151981.1 mentions singular value decomposition, sorting singular values by magnitude, and constructing an image from the first N singular values and their corresponding singular vectors; there, the decomposition and the sorting are performed serially, which costs additional time and computing resources.
Patent application CN2202111040096.0 mentions integrating singular value decomposition operators into the lifting AI processor to improve its performance, including an application that selects the K largest singular values to approximate the original matrix, but it does not describe how the integrated operators implement singular value sorting.
Disclosure of Invention
To address the shortcomings of the prior art, the application provides an FPGA-based singular value decomposition accelerator with a parallel sorting function. Building on the classical one-sided Jacobi algorithm, it handles the column-vector norms α and β and the corresponding rotation matrix J differently in different rounds, so that the norms of all column vectors trend toward descending order overall; after several sweeps the column-vector norms converge to a descending arrangement. The square root of each second-order norm then gives the corresponding singular value, and dividing each column vector by its singular value gives the corresponding left singular vector. The application thus sorts the singular values and singular vectors in parallel with the decomposition and hides the sorting latency inside the decomposition process, turning two steps that were originally serial into a single parallel step, which improves overall real-time performance and saves the hardware resources otherwise dedicated to sorting.
The aim of the application is achieved by the following technical scheme:
In one aspect, an FPGA-based singular value decomposition accelerator with a parallel sorting function comprises an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine and 2k internal BRAM blocks; the accelerator performs singular value decomposition through the following steps:
S1: write the matrix data of m rows and n columns from the external DDR memory into the BRAMs inside the FPGA through the AXI interface, with each column stored in one BRAM; every two adjacent BRAMs in sequence form a pair, giving k = n/2 pairs, where n is even; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make the column count even;
S2: the k one-sided Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ for the k BRAM pairs;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, generate the rotation matrix J according to the one-sided Jacobi algorithm (matrix not reproduced in this text); otherwise, generate the alternative rotation matrix J (matrix not reproduced in this text); the k rotation matrices $J_i$, i = 1, 2, …, k, corresponding to the k BRAM pairs are generated synchronously;
S4: the k one-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are stored in the n BRAMs;
S5: exchange the column vectors according to the round-robin scheduling state machine and repeat steps S2 to S5, executing k rounds in total;
S6: execute round k+1, comprising the following sub-steps:
S6.1: compute the second-order norms α and β and the inner product γ for the k BRAM pairs, and exchange the two second-order norm values within each of BRAM pairs 1 to k-1;
S6.2: perform the same operation as S3 to synchronously generate the k rotation matrices $J_i$, i = 1, 2, …, k;
S6.3: keep the last rotation matrix $J_k$ unchanged, and replace each of the remaining rotation matrices by its transpose;
S6.4: perform the same operation as S4;
S7: exchange the column vector data stored in each BRAM according to the round-robin scheduling state machine, and perform the same operation as S6 for rounds k+2 to n-1; completing round n-1 completes one "sweep";
S8: repeat steps S2 to S7, performing multiple sweeps until the iteration termination condition is met; the singular value decomposition is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest.
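As a software reference for the flow in S1 to S8, the following is a minimal sketch and not the patented circuit. It visits column pairs serially instead of using k parallel circuits with a round-robin state machine, and it uses the textbook Hestenes rotation with an explicit column swap when α < β; that swap is an assumption standing in for the patent's rotation-matrix selection, whose formulas are not reproduced in this text. The function name `sorted_one_sided_jacobi`, the tolerance and the sweep limit are all hypothetical.

```python
import math

def sorted_one_sided_jacobi(columns, tol=1e-12, max_sweeps=60):
    """Orthogonalize the given column vectors with one-sided Jacobi rotations.

    Returns (sigma, cols): the column norms (singular values, in descending
    order once converged) and the orthogonalized columns.
    """
    cols = [list(map(float, c)) for c in columns]  # work on a copy
    n = len(cols)

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = dot(cols[p], cols[p])   # squared norm of left column
                beta = dot(cols[q], cols[q])    # squared norm of right column
                gamma = dot(cols[p], cols[q])   # inner product of the pair
                if alpha < beta:                # assumption: swap so the larger norm is on the left
                    cols[p], cols[q] = cols[q], cols[p]
                    alpha, beta = beta, alpha
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                    # pair is already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta == 0.0:
                    t = -math.copysign(1.0, gamma)  # equal norms: pick the root that grows the left column
                else:
                    t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                cols[p], cols[q] = (
                    [c * x - s * y for x, y in zip(cols[p], cols[q])],
                    [s * x + c * y for x, y in zip(cols[p], cols[q])],
                )
        if not rotated:                         # a full sweep without rotations: converged
            break
    sigma = [math.sqrt(dot(c, c)) for c in cols]
    return sigma, cols
```

For the two columns (3, 4) and (0, 5), the squared norms are both 25 and the inner product is 20, so a single rotation yields squared norms 45 and 5, already in descending order.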
Further, the one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign judgment module, a k-th rotation matrix judgment module, a k-round judgment module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module and a square root calculation module.
Further, the round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, calculating α, β and γ, exchanging α and β, calculating cos θ and sin θ, generating the rotation matrix J, and writing the Jacobi orthogonal rotation results back to the BRAMs.
Further, under the initial column vector index rule, the column vectors in the lower row have odd indices, namely column vectors 1, 3, 5, …, n-1, and the column vectors in the upper row have even indices, namely column vectors 2, 4, 6, …, n.
Further, in the first k rounds of each sweep, the column vector index in the upper row is always greater than that in the lower row; in rounds k+1 through n-1 of each sweep, the column vector index in the upper row is always smaller than that in the lower row, except for the last column.
Further, the second-order norms of the column vectors in the lower row are denoted α, namely $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_k$; the second-order norms of the column vectors in the upper row are denoted β, namely $\beta_1, \beta_2, \beta_3, \dots, \beta_k$.
Further, when S8 is executed, the first few sweeps exhibit local oscillation, but the second-order norms of the column vectors show an overall descending trend and finally satisfy $\alpha_1 \ge \beta_1 \ge \alpha_2 \ge \beta_2 \ge \alpha_3 \ge \dots \ge \alpha_k \ge \beta_k$.
Further, the generation formulas of cos θ and sin θ are as follows: (formulas not reproduced in this text).
Further, after S8 has been executed and the convergence condition is satisfied, square roots are taken of the second-order norms $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \dots, \alpha_k, \beta_k$ of the column vectors to obtain the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \dots, \sigma_{n-1}, \sigma_n$, with $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \sigma_4 \ge \sigma_5 \ge \dots \ge \sigma_{n-1} \ge \sigma_n$, and the results are written sequentially to the external DDR memory through the AXI interface.
Further, the converged column vectors $u_1, u_2, u_3, \dots, u_n$ from S8 are each divided by the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \dots, \sigma_n$ to obtain the corresponding left singular vectors $u_{11}, u_{22}, u_{33}, \dots, u_{nn}$, and the results are written sequentially to the external DDR memory through the AXI interface.
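The two post-processing steps (square root, then division) can be sketched as follows. This is an illustration under stated assumptions: "second-order norm" is read as the sum of squares of a column, since the text takes its square root to obtain the singular value; the function name `finalize_svd` is hypothetical; and leaving a zero-norm column (for example the all-zero padding column) undivided is a choice made here, not something the source specifies.

```python
import math

def finalize_svd(cols, squared_norms):
    """cols: converged, mutually orthogonal columns in descending-norm order.
    squared_norms: their second-order norms (alpha_1, beta_1, alpha_2, ...)."""
    sigma = [math.sqrt(v) for v in squared_norms]      # singular values
    left = [
        [x / s for x in c] if s > 0.0 else list(c)     # assumption: skip division for zero columns
        for c, s in zip(cols, sigma)
    ]
    return sigma, left

# Two orthogonal columns with squared norms 9 and 4
sigma, left = finalize_svd([[3.0, 0.0], [0.0, 2.0]], [9.0, 4.0])
```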
The beneficial effects of the application are as follows:
The method is particularly suitable for matrix singular value decomposition implemented in VLSI (including FPGAs). It sorts the singular values and singular vectors in parallel with the iterative computation of the decomposition and hides the sorting latency inside the overall decomposition process, so that two steps originally executed serially become a single parallel, synchronized step. This improves the overall real-time performance of singular value decomposition; in application scenarios such as image compression and principal component analysis in particular, the larger singular values and their corresponding singular vectors can be extracted more quickly. In addition, the application saves the hardware resources otherwise dedicated to implementing the sorting function.
Drawings
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
FIG. 1 is a block diagram of the singular value decomposition accelerator with a parallel sorting function;
FIG. 2 is a detailed circuit diagram of the control channel and data channel of the singular value decomposition accelerator with a parallel sorting function;
FIG. 3 is a schematic diagram of the one-sided Jacobi algorithm for a 512-row by 512-column matrix based on a round-robin state machine;
FIG. 4 is a schematic diagram of the column vector exchange operations in one sweep for a column dimension of 6;
FIG. 5 is a diagram of the column vector norm magnitude relationships in one sweep for a column dimension of 6;
FIG. 6 is a graph of the descending trend of part of the column vector norms of a 512-row by 512-column matrix over 5 sweeps.
Detailed Description
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
First, explanation of technical terms is given:
(1) VLSI: Very Large Scale Integration, i.e. very-large-scale integrated circuits
(2) FPGA: Field Programmable Gate Array
(3) BRAM: Block RAM, the block RAM inside the FPGA
(4) Jacobi: in this application, refers to one-sided (unilateral) Jacobi rotation, commonly used for FPGA-based matrix singular value decomposition
(5) round-robin: polling scheduling, a scheduling mechanism commonly used in one-sided Jacobi rotation singular value decomposition
(6) DDR SDRAM: Double Data Rate Synchronous Dynamic Random Access Memory, the external DDR memory.
The FPGA-based singular value decomposition accelerator with a parallel sorting function comprises an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine and 2k internal BRAM blocks; the accelerator performs singular value decomposition through the following steps:
S1: write the matrix data of m rows and n columns from the external DDR memory into the BRAMs inside the FPGA through the AXI interface, with each column stored in one BRAM; every two adjacent BRAMs in sequence form a pair, giving k = n/2 pairs, where n is even; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make the column count even;
S2: the k one-sided Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ for the k BRAM pairs;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, generate the rotation matrix J according to the one-sided Jacobi algorithm (matrix not reproduced in this text); otherwise, generate the alternative rotation matrix J (matrix not reproduced in this text); the k rotation matrices $J_i$, i = 1, 2, …, k, corresponding to the k BRAM pairs are generated synchronously;
Here, the generation formulas of cos θ and sin θ are as follows: (formulas not reproduced in this text).
S4: the k one-sided Jacobi orthogonal transformation circuits synchronously execute the orthogonal rotation calculation, and the intermediate results are temporarily stored in the n BRAMs;
S5: exchange the column vectors according to the round-robin scheduling state machine and repeat steps S2 to S5, executing k rounds in total;
S6: execute round k+1, comprising the following sub-steps:
S6.1: compute the second-order norms α and β and the inner product γ for the k BRAM pairs, and exchange the two second-order norm values within each of BRAM pairs 1 to k-1;
S6.2: perform the same operation as S3 to synchronously generate the k rotation matrices $J_i$, i = 1, 2, …, k;
S6.3: keep the last rotation matrix $J_k$ unchanged, and replace each remaining rotation matrix $J_i$ by its transpose $J_i^T$, i = 1, 2, …, k-1;
S6.4: perform the same operation as S4;
S7: exchange the column vector data stored in each BRAM according to the round-robin scheduling state machine, and perform the same operation as S6 for rounds k+2 to n-1; completing round n-1 completes one "sweep";
S8: repeat steps S2 to S7, performing multiple sweeps until the iteration termination condition is met; the singular value decomposition is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest by BRAM sequence number.
When S8 is executed, the first few sweeps exhibit local oscillation, but otherwise the second-order norms of the column vectors show an overall descending trend and finally satisfy $\alpha_1 \ge \beta_1 \ge \alpha_2 \ge \beta_2 \ge \alpha_3 \ge \dots \ge \alpha_k \ge \beta_k$.
In this embodiment, after S8 has been executed and the convergence condition is satisfied, square roots are taken of the second-order norms $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \dots, \alpha_k, \beta_k$ of the column vectors stored in the BRAMs to obtain the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \dots, \sigma_{n-1}, \sigma_n$, with $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \sigma_4 \ge \sigma_5 \ge \dots \ge \sigma_{n-1} \ge \sigma_n$. The results $\sigma_1, \sigma_2, \dots, \sigma_n$ are written sequentially to the external DDR memory through the AXI interface. Furthermore, the converged column vectors $u_1, u_2, u_3, \dots, u_n$ from S8 are each divided by the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \dots, \sigma_n$ to obtain the corresponding left singular vectors $u_{11}, u_{22}, u_{33}, \dots, u_{nn}$, and the results are written sequentially to the external DDR memory through the AXI interface.
The one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign judgment module, a k-th rotation matrix judgment module, a k-round judgment module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module and a square root calculation module.
The round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, calculating α, β and γ, exchanging α and β, calculating cos θ and sin θ, generating the rotation matrix J, and writing the Jacobi orthogonal rotation results back to the BRAMs.
In addition, in the singular value decomposition process of the present application, the initial column vector index rule is that the column vectors in the lower row have odd indices, namely column vectors 1, 3, 5, …, n-1, and the column vectors in the upper row have even indices, namely column vectors 2, 4, 6, …, n. In the first k rounds of each sweep, the column vector index in the upper row is always greater than that in the lower row; in rounds k+1 through n-1 of each sweep, the column vector index in the upper row is always smaller than that in the lower row, except for the last column. Further, the second-order norms of the column vectors in the lower row are denoted α, namely $\alpha_1, \alpha_2, \alpha_3, \dots, \alpha_k$; the second-order norms of the column vectors in the upper row are denoted β, namely $\beta_1, \beta_2, \beta_3, \dots, \beta_k$.
The method of the present application is explained and illustrated in the following by a specific example.
The specific embodiment is described using singular value decomposition of a 512-row by 512-column matrix. The matrix elements are single-precision floating-point numbers conforming to the IEEE 754 standard, and a Xilinx XC7V690T-3FFG1761 FPGA is selected as the target hardware for deployment and verification. The smallest physical unit of the internal BRAM in this FPGA is an 18 Kb block; a single-precision floating-point column vector of depth 512 fits in one 18 Kb BRAM, so 512 BRAMs are needed in total.
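The storage figures above can be checked with simple arithmetic. This sketch assumes 1 Kb = 1024 bits and ignores how each block is configured (port width, parity bits); the constant names are illustrative only.

```python
DEPTH = 512                 # elements per column vector
BITS_PER_ELEMENT = 32       # IEEE 754 single precision
BRAM_BITS = 18 * 1024       # one 18 Kb block, assuming Kb = 1024 bits
NUM_COLUMNS = 512

column_bits = DEPTH * BITS_PER_ELEMENT                 # bits needed per column
blocks_per_column = -(-column_bits // BRAM_BITS)       # ceiling division
total_blocks = NUM_COLUMNS * blocks_per_column         # BRAMs for the whole matrix
```

One column needs 16,384 bits, which is below the 18,432 bits of a single block, so one BRAM per column suffices.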
The specific implementation process of this embodiment is as follows:
step 1: through an AXI interface, 512 rows and 512 columns of matrix data are read from an external DDR memory device and written into corresponding 512 blocks BRAM in the FPGA according to columns, wherein the 1 st column is written into the 1 st block BRAM, the 2 nd column is written into the 2 nd block BRAM, and the 1 st pair is formed by the 1 st column and the 2 nd column, so that the internal memory of the 1# unilateral Jacobian orthogonal transformation circuit is formed; column 3 is written to the 3 rd block BRAM, column 4 is written to the 4 th block BRAM, and the two form a 2 nd pair to form the internal storage of the 2# unilateral Jacobian orthogonal transformation circuit; … …; and the method is characterized in that the method is repeated until the 511 th column is written into the 511 th block BRAM, the 512 th column is written into the 512 th block BRAM, and the two blocks form a 256 th pair to form the internal storage of a 256# unilateral Jacobian orthogonal transformation circuit, as shown in figure 1.
Step 2: the k one-sided Jacobi orthogonal transformation circuits inside the FPGA in FIG. 1 compute, synchronously and in parallel, the second-order norms α and β and the inner products γ of the column vectors stored in the 256 BRAM pairs from step 1. That is, $\alpha_1$ is the second-order norm computed from BRAM 1, $\beta_1$ is the second-order norm computed from BRAM 2, and $\gamma_1$ is the inner product of the two; $\alpha_2$ is the second-order norm computed from BRAM 3, $\beta_2$ is the second-order norm computed from BRAM 4, and $\gamma_2$ is the inner product of the two; and so on, until $\alpha_{256}$ is the second-order norm computed from BRAM 511, $\beta_{256}$ is the second-order norm computed from BRAM 512, and $\gamma_{256}$ is the inner product of the two.
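A software model of this step might look like the sketch below. It loops over the pairs serially, whereas the hardware evaluates all pairs at once; it also assumes that "second-order norm" means the sum of squares of a column, consistent with the later square-root step. The function name `pair_statistics` is hypothetical.

```python
def pair_statistics(pairs):
    """For each (lower, upper) column pair return (alpha, beta, gamma)."""
    stats = []
    for lower, upper in pairs:
        alpha = sum(x * x for x in lower)                 # squared norm, odd-numbered BRAM
        beta = sum(y * y for y in upper)                  # squared norm, even-numbered BRAM
        gamma = sum(x * y for x, y in zip(lower, upper))  # inner product of the pair
        stats.append((alpha, beta, gamma))
    return stats

# Columns (3, 4) and (0, 5): alpha = 25, beta = 25, gamma = 20
stats = pair_statistics([([3.0, 4.0], [0.0, 5.0])])
```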
Step 3: Taking one-sided Jacobi orthogonal transformation circuit #1 as an example, cos θ_1 and sin θ_1 are generated according to the one-sided Jacobi algorithm. The formula is as follows:
To realize the parallel ordering function, the output of the α and β comparison circuit is given the following special treatment:
If α_1 ≥ β_1, then
Otherwise,
By analogy, the 256 rotation matrices J_1, J_2, …, J_256 are generated synchronously in parallel.
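For readers who want a concrete reference point, the sketch below shows one textbook way to build a rotation that both orthogonalizes a column pair and leaves the larger norm in the first column. It is an assumption throughout: it uses the classical Hestenes formulas with ζ = (β − α)/(2γ) and, for the α < β branch, the same rotation offset by 90 degrees. These are not claimed to be the matrices of this application, and the names `make_rotation` and `apply_rotation` are hypothetical.

```python
import math


def make_rotation(alpha, beta, gamma):
    """Return J (2x2, row-major) such that [a, b] @ J has orthogonal columns
    and the first output column has the larger norm.

    Textbook Hestenes rotation; NOT taken from the patent text.
    """
    if gamma == 0.0:
        c, s = 1.0, 0.0                      # already orthogonal
    else:
        zeta = (beta - alpha) / (2.0 * gamma)
        sign = math.copysign(1.0, gamma)
        if alpha >= beta:
            sign = -sign                     # also resolves the tie alpha == beta
        t = sign / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
        c = 1.0 / math.sqrt(1.0 + t * t)
        s = c * t
    if alpha >= beta:
        return [[c, s], [-s, c]]             # small-angle rotation keeps the order
    return [[s, -c], [c, s]]                 # same rotation plus a 90 degree turn


def apply_rotation(a, b, J):
    """Compute [a_new, b_new] = [a, b] @ J element by element."""
    a_new = [J[0][0] * x + J[1][0] * y for x, y in zip(a, b)]
    b_new = [J[0][1] * x + J[1][1] * y for x, y in zip(a, b)]
    return a_new, b_new


a, b = [1.0, 0.0], [1.0, 1.0]                # alpha = 1, beta = 2, gamma = 1
J = make_rotation(1.0, 2.0, 1.0)
a2, b2 = apply_rotation(a, b, J)
print(a2, b2)
```

In this sketch the branch on α ≥ β plays the role of the comparison circuit: the rotation angle is the same in both branches, and only the assignment of the two rotated vectors to the two output columns differs.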
Step 4: The 256 one-sided Jacobi orthogonal transformation circuits synchronously perform the one-sided Jacobi orthogonal rotation calculation. Let u_1^(1) denote the current vector of column 1 in round 1, and û_1^(1) the vector obtained by updating column 1 through the round-1 orthogonal rotation transformation. For the 1st pair of column vectors, the Jacobi orthogonal rotation calculation is [û_1^(1), û_2^(1)] = [u_1^(1), u_2^(1)] J_1; for the 2nd pair it is [û_3^(1), û_4^(1)] = [u_3^(1), u_4^(1)] J_2; and so on, until [û_511^(1), û_512^(1)] = [u_511^(1), u_512^(1)] J_256.
Step 5: As shown in Fig. 2, the control channel and data channel inside the accelerator are described in detail for one pair of column vectors. A round-robin scheduling state machine is responsible for overall flow control and, according to the current round, controls the data input and calculation results of each unit module. After the 256 pairs of column vectors have undergone the one-sided Jacobi orthogonal rotation transformation, the updated column vectors are exchanged according to the round-robin scheduling mechanism in Fig. 3. Specifically, the last column, u_512, is fixed, and the other column vectors rotate counter-clockwise concurrently: u_1 passes right to u_3, u_3 passes right to u_5, …, u_509 passes right to u_511, u_511 passes diagonally to u_510, u_510 passes left to u_508, u_508 passes left to u_506, …, u_4 passes left to u_2, and u_2 passes down to u_1. In terms of the data exchange in Fig. 1: the data stored in BRAM_4 is transferred to BRAM_2, the data stored in BRAM_6 to BRAM_4, …; the data stored in BRAM_1 is transferred to BRAM_3, the data stored in BRAM_3 to BRAM_5, …; the data stored in BRAM_2 is transferred to BRAM_1; and the data stored in BRAM_512 remains unchanged. Steps 2 to 4 are then repeated, for a total of k = 512/2 = 256 rounds; the results are shown in Fig. 3.
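The exchange pattern of step 5 can be written out as a small permutation. The following is a minimal sketch that models each BRAM as a list slot holding a column label; the function name `exchange` is hypothetical, and the code only mirrors the transfers listed above (odd BRAMs pass to the next odd BRAM, BRAM_(n-1) passes to BRAM_(n-2), even BRAMs pass to the previous even BRAM, BRAM_2 passes to BRAM_1, BRAM_n is fixed).

```python
def exchange(slots):
    """One round-robin exchange. slots[p - 1] is the content of BRAM_p (n >= 4, n even)."""
    n = len(slots)
    if n < 4 or n % 2 != 0:
        raise ValueError("expects an even number of at least 4 BRAMs")
    new = slots[:]                          # BRAM_n keeps its content
    for p in range(1, n - 2, 2):            # BRAM_1 -> BRAM_3, BRAM_3 -> BRAM_5, ...
        new[p + 2 - 1] = slots[p - 1]
    new[n - 2 - 1] = slots[n - 1 - 1]       # BRAM_(n-1) -> BRAM_(n-2)
    for p in range(4, n - 1, 2):            # BRAM_4 -> BRAM_2, BRAM_6 -> BRAM_4, ...
        new[p - 2 - 1] = slots[p - 1]
    new[0] = slots[1]                       # BRAM_2 -> BRAM_1
    return new


small = exchange([1, 2, 3, 4, 5, 6])
large = exchange(list(range(1, 513)))
print(small)
```

Applying `exchange` once per round, n − 1 times in total, brings every column together with every other column exactly once, which is the purpose of the round-robin ordering.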
Step 6: The (k+1)-th round, i.e. round 256 + 1 = 257, is executed, comprising the following sub-steps:
Step 6.1: As in step 2, the second-order norms α, β and the inner product γ of the 256 pairs of column vectors are calculated, but with special processing: the norms α_256, β_256 of the last pair of column vectors are left as they are, while the norm values α_i and β_i of each remaining pair are interchanged, i.e. the values are exchanged between α_1 and β_1, between α_2 and β_2, …, and between α_255 and β_255. Note that the column vectors themselves do not change position.
Step 6.2: After the special processing of step 6.1, the same operation as in step 3 is performed to generate the 256 rotation matrices J_1, J_2, …, J_256.
Step 6.3: The rotation matrix J_256 of the last pair of column vectors is kept unchanged, and each remaining rotation matrix J_i (i = 1, 2, …, 255) is replaced by its transpose J_i^T.
Step 6.4: The 256 pairs of column vectors synchronously undergo the one-sided Jacobi orthogonal rotation transformation.
Step 7: Step 6 is repeated until the 511th round is completed.
Steps 2 to 7 together are called a sweep. After each sweep, the column vector norms as a whole show a descending trend, i.e. they follow α_1 ≥ β_1 ≥ α_2 ≥ β_2 ≥ α_3 ≥ … ≥ α_256 ≥ β_256. In the first few sweeps there is occasional oscillation in a very few places, i.e. a small number of column vector second-order norms satisfy β_i < α_(i+1) (i = 1, 2, …, 255).
For ease of description and understanding, a matrix with 6 columns is used to illustrate the overall process and the resulting ordering of the column vector second-order norms. As shown in Fig. 4, within one sweep the first k rounds are the first 3 steps, shown in the upper half of the figure; the following rounds k+1 to n−1 are steps 4 and 5, shown in the lower half. Because the index of u_6 is 6, always greater than the other column indices, the second-order norms α_3, β_3 of the last pair of column vectors are left as they are, while the second-order norm values of the first 2 pairs must be exchanged, although the column vectors themselves are not exchanged. That is, in step 4, α_1 equals the second-order norm of column vector u_3, β_1 equals that of u_5, α_2 equals that of u_1, and β_2 equals that of u_4. Step 5 is handled in the same way. After the orthogonal transformations, the change in magnitude and the overall trend of the column vector second-order norms are shown in Fig. 5. The exchange processing applied to the orthogonal rotation matrix J and the second-order norms in the present application causes the second-order norms of the column vectors to be arranged in descending order of column index.
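The 6-column walk-through can be checked with a short enumeration of one sweep. This is a sketch under the assumption that the lower row initially holds the odd columns and the upper row the even columns, with the ring rotation described in step 5; the function name `sweep_pairs` is hypothetical. It lists, per round, the (lower, upper) column indices of each pair, so the rounds where the norm values must be exchanged are the ones where the lower index exceeds the upper index.

```python
def sweep_pairs(n):
    """Return, for rounds 1..n-1, the list of (lower, upper) column indices per pair."""
    k = n // 2
    lower = list(range(1, n, 2))           # odd columns 1, 3, ..., n-1
    upper = list(range(2, n + 1, 2))       # even columns 2, 4, ..., n
    rounds = []
    for _ in range(n - 1):
        rounds.append(list(zip(lower, upper)))
        # Ring of all movable positions: lower row left to right,
        # then upper row right to left without the fixed last column.
        ring = lower + upper[-2::-1]
        ring = ring[-1:] + ring[:-1]       # every column moves one position along the ring
        lower, upper = ring[:k], ring[k:][::-1] + upper[-1:]
    return rounds


rounds = sweep_pairs(6)
for r, pairs in enumerate(rounds, start=1):
    swap = [lo > up for lo, up in pairs]   # True where alpha and beta values are exchanged
    print(r, pairs, swap)
```

For round 4 this yields the pairs (5, 3), (4, 1), (2, 6): after exchanging the norm values in the first two pairs, α_1 refers to u_3, β_1 to u_5, α_2 to u_1 and β_2 to u_4, matching the description of step 4 above.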
Step 8: For a matrix of 512 rows by 512 columns, 6 sweeps are performed to satisfy the preset convergence condition. At this point, square roots of the column vector norms of the 512 columns are computed in parallel to obtain the singular values σ_1, σ_2, σ_3, …, σ_512, with σ_1 ≥ σ_2 ≥ σ_3 ≥ … ≥ σ_512. Each column vector is then divided by its corresponding singular value to obtain the left singular vectors u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_512/σ_512, completing the singular value decomposition task. Fig. 6 shows an excerpt of the column vector second-order norm values: after 1 sweep they show an overall descending trend with some local oscillation, and after 4 sweeps they are essentially monotonically decreasing.
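The final stage of step 8 amounts to one square root and one division per column. The sketch below illustrates it on already-orthogonal columns; the name `finalize` is hypothetical, and the guard for a zero singular value is an added assumption (the text does not say how such a column is handled).

```python
import math


def finalize(columns, norms_sq):
    """Turn converged columns and their squared norms into (sigmas, left vectors)."""
    sigmas = [math.sqrt(v) for v in norms_sq]
    left = []
    for col, sigma in zip(columns, sigmas):
        if sigma > 0.0:
            left.append([x / sigma for x in col])
        else:
            left.append(list(col))          # assumption: leave a zero column untouched
    return sigmas, left


# Two mutually orthogonal columns with squared norms 25 and 4, already in descending order.
columns = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
sigmas, left = finalize(columns, [25.0, 4.0])
print(sigmas, left)
```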
Step 9: The singular values obtained in step 8, in descending order, are written back to the external DDR memory through the AXI interface.
The FPGA results show that, on XC7V690T-3FFG1761 target hardware running at a 200 MHz clock, singular value decomposition of a single-precision floating-point matrix of 512 rows and 512 columns completes in 52.9 milliseconds, with the singular values and singular vectors arranged in descending order. By comparison, the matrix singular value decomposition in the Solver library published by Xilinx takes 1.687 seconds on an Alveo U250 accelerator card for a real symmetric single-precision floating-point matrix of 512 rows by 512 columns, and ordering the singular values and singular vectors additionally requires a separate functional circuit, which consumes further time.
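The reported figures can be put into perspective with some simple arithmetic. The sketch below assumes that the 52.9 ms measurement corresponds to the 6 sweeps mentioned in step 8, which the text does not state explicitly; the variable names are hypothetical. Since the two timings come from different devices, the ratio is indicative rather than a like-for-like comparison.

```python
CLOCK_HZ = 200e6        # reported clock frequency
T_ACCEL_S = 52.9e-3     # reported time on XC7V690T
T_SOLVER_S = 1.687      # reported time for the Xilinx Solver library on Alveo U250
N, SWEEPS = 512, 6      # matrix columns; sweeps per step 8 (assumed to apply here)

cycles = CLOCK_HZ * T_ACCEL_S              # clock cycles in 52.9 ms
rounds = SWEEPS * (N - 1)                  # n - 1 rounds per sweep
rotations = rounds * (N // 2)              # k parallel rotations per round
cycles_per_round = cycles / rounds
speedup = T_SOLVER_S / T_ACCEL_S

print(round(cycles), rounds, rotations)
print(round(cycles_per_round), round(speedup, 1))
```

Under these assumptions, each round of 256 parallel rotations plus the column exchange takes on the order of a few thousand clock cycles, and the reported time is roughly 32 times shorter than the quoted Solver library figure.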
It can be seen from the embodiments of the present application that, for matrix singular value decomposition on VLSI (including FPGA), the special treatment of α_i, β_i and the rotation matrix J allows the singular values and singular vectors to be arranged in descending order in parallel with the decomposition itself, saving the time and hardware cost of a dedicated sorting task. The application therefore improves the real-time performance of singular value decomposition and reduces hardware resource cost.
Corresponding to the embodiment of the FPGA-based singular value decomposition accelerator with the parallel ordering function, the application also provides an embodiment of the FPGA-based singular value decomposition system with the parallel ordering function.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. An FPGA-based singular value decomposition accelerator with a parallel ordering function, characterized by comprising an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine and 2k blocks of internal BRAM memory; the singular value decomposition accelerator performs singular value decomposition by the following steps:
S1: matrix data of m rows and n columns is written from the external DDR memory into BRAMs in the FPGA through the AXI interface, each column corresponding to 1 BRAM block; the n BRAM blocks are grouped in order, every two adjacent blocks forming a pair, giving k = n/2 pairs; n is an even number; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make the number even;
S2: the k one-sided Jacobi orthogonal transformation circuits calculate in parallel the second-order norms α, β and the inner product γ of the k pairs of BRAM groups;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix is generated according to the one-sided Jacobi algorithm; otherwise, an alternative rotation matrix is generated; the k rotation matrices J_i, i = 1, 2, …, k, corresponding to the k pairs of BRAMs are generated synchronously;
S4: the k one-sided Jacobi orthogonal transformation circuits synchronously perform the orthogonal rotation calculation, and the intermediate results are stored in the n BRAM blocks;
S5: the column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated for a total of k rounds;
S6: the (k+1)-th round is executed, comprising the following sub-steps:
S6.1: the second-order norms α, β and the inner product γ of the k pairs of BRAM groups are calculated, and the values of the two second-order norms are exchanged within each of BRAM groups 1 to k−1;
S6.2: the same operation as S3 is performed to synchronously generate the k rotation matrices J_i, i = 1, 2, …, k;
S6.3: the last rotation matrix J_k is kept unchanged, and each of the remaining rotation matrices is replaced by its transpose;
S6.4: the same operation as S4 is performed;
S7: the column vector data stored in each BRAM is exchanged according to the round-robin scheduling state machine, and rounds k+2 to n−1 perform the same operations as S6 until round n−1 is completed, which completes one "sweep";
S8: steps S2 to S7 are repeated, performing multiple "sweeps" until the iteration termination condition is met, which completes the singular value decomposition task, with the singular values of the column vectors stored in the BRAM blocks arranged from largest to smallest.
2. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein the one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign decision module, a k-th rotation matrix decision module, a less-than-k round decision module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.
3. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein the round-robin scheduling state machine controls the generation of the data streams and control streams of each one-sided Jacobi orthogonal transformation circuit, including the reading of the BRAMs, the calculation of α, β, γ and the exchange of α and β, the calculation of cos θ and sin θ, the generation of the rotation matrix J, and the write-back of the Jacobi orthogonal rotation results to the BRAMs.
4. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein the initial column vector index rule is that the column vector indices of the lower row are odd, namely column vectors 1, 3, 5, …, n−1, and the column vector indices of the upper row are even, namely column vectors 2, 4, 6, …, n.
5. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 2, wherein in the first k rounds of each "sweep" the column vector index of the upper row is always greater than the column vector index of the lower row; in rounds k+1 to n−1 of each "sweep", the column vector index of the upper row is always smaller than the column vector index of the lower row, except for the last column.
6. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein the second-order norms of the lower-row column vectors are α, namely α_1, α_2, α_3, …, α_k; and the second-order norms of the upper-row column vectors are β, namely β_1, β_2, β_3, …, β_k.
7. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein, after S8 is performed, there is local oscillation in the first few "sweeps", the second-order norms of the column vectors show an overall descending trend, and finally α_1 ≥ β_1 ≥ α_2 ≥ β_2 ≥ α_3 ≥ … ≥ α_k ≥ β_k is achieved.
8. The FPGA-based singular value decomposition accelerator with parallel ordering function of claim 1, wherein the generation formulas of cos θ and sin θ are as follows:
9. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 1, wherein, after S8 is executed and the convergence condition is satisfied, square roots of the second-order norms α_1, β_1, α_2, β_2, α_3, …, α_k, β_k of the column vectors are calculated to obtain the corresponding singular values σ_1, σ_2, σ_3, σ_4, σ_5, …, σ_(n−1), σ_n, with σ_1 ≥ σ_2 ≥ σ_3 ≥ σ_4 ≥ σ_5 ≥ … ≥ σ_(n−1) ≥ σ_n, and the results σ_1, σ_2, σ_3, σ_4, σ_5, …, σ_(n−1), σ_n are written sequentially to the external DDR memory through the AXI interface.
10. The FPGA-based singular value decomposition accelerator with parallel ordering function according to claim 9, wherein the column vectors u_1, u_2, u_3, …, u_n satisfying convergence in S8 are divided by their corresponding singular values σ_1, σ_2, σ_3, …, σ_n to obtain the corresponding left singular vectors u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_n/σ_n, and the results u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_n/σ_n are written sequentially to the external DDR memory through the AXI interface.
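The zero-column padding mentioned in S1 of claim 1 can be illustrated as follows. This is a minimal sketch that represents a matrix as a list of columns; the name `pad_to_even` is hypothetical. An all-zero column has norm 0 and inner product 0 with every other column, so it contributes a zero singular value.

```python
def pad_to_even(columns):
    """Return a new list of columns with one all-zero column appended if the count is odd."""
    padded = [list(col) for col in columns]
    if len(padded) % 2 == 1:
        rows = len(padded[0])
        padded.append([0.0] * rows)
    return padded


padded = pad_to_even([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
k = len(padded) // 2      # number of column pairs and of rotation circuits
print(len(padded), k, padded[-1])
```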

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310669739.0A CN116382617B (en) 2023-06-07 2023-06-07 Singular value decomposition accelerator with parallel ordering function based on FPGA

Publications (2)

Publication Number Publication Date
CN116382617A (en) 2023-07-04
CN116382617B (en) 2023-08-29

Family

ID=86961959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310669739.0A Active CN116382617B (en) 2023-06-07 2023-06-07 Singular value decomposition accelerator with parallel ordering function based on FPGA

Country Status (1)

Country Link
CN (1) CN116382617B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant