CN116382617A - Singular value decomposition accelerator with parallel ordering function based on FPGA - Google Patents


Info

Publication number
CN116382617A
Authority
CN
China
Prior art keywords
singular value
value decomposition
column
fpga
column vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310669739.0A
Other languages
Chinese (zh)
Other versions
CN116382617B (en)
Inventor
胡塘
李相迪
任嵩楠
闫力
王跃明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310669739.0A
Publication of CN116382617A
Application granted
Publication of CN116382617B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/08 Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821 Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/20 Comparing separate sets of record carriers arranged in the same sequence to determine whether at least some of the data in one set is identical with that in the other set or sets
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an FPGA-based singular value decomposition accelerator with a parallel ordering function, comprising an external DDR memory, an AXI interface, k unilateral Jacobi orthogonal transformation circuits, and 2k internal BRAM blocks. The k unilateral Jacobi orthogonal transformation circuits compute the norms α and β in parallel, select the form of the rotation matrix J according to the relative size of α and β, and execute the unilateral Jacobi computation in rounds 1 through k under the control of a round-robin scheduling state machine. In rounds k+1 through n-1 the same rule is kept, except that the norms α and β are exchanged for every column vector pair other than the last, and the rotation matrix J is replaced by its transpose $J^T$; the iteration is repeated until convergence. The invention completes the sorting of the singular values synchronously with the singular value decomposition itself, eliminating the time otherwise required for a separate sorting step, saving the hardware resources otherwise dedicated to sorting, and significantly improving the hardware acceleration effect.

Description

Singular value decomposition accelerator with parallel ordering function based on FPGA
Technical Field
The invention relates to the field of signal processing, in particular to a singular value decomposition accelerator with a parallel ordering function based on an FPGA.
Background
Matrix singular value decomposition is a classical and important technique in signal processing, and plays an important role in data dimensionality reduction, hyperspectral image processing, robot positioning and navigation, artificial intelligence recommendation algorithms, and similar applications. Singular value decomposition projects a matrix into different subspaces through orthogonal transformations, so that the principal components can be extracted effectively to achieve dimensionality reduction; singular value decomposition operators or accelerators are therefore often integrated into CPUs, GPUs, AI processors, and FPGA systems to improve performance. However, singular value decomposition itself involves complex computation, and an important question is how to arrange the singular values and their corresponding singular vectors in descending order while the decomposition itself is being completed.
In current singular value decomposition schemes implemented on very large scale integrated circuits (Very Large Scale Integration, VLSI), most designs separate the decomposition from the sorting, including the matrix algorithm library provided by a certain company: the singular value decomposition is computed first, and the singular values and singular vectors are sorted afterwards. The sorting operation and the decomposition are therefore executed serially, which increases overall latency, and the dedicated sorting circuit consumes additional hardware resources.
The invention patent application CN201010151981.1 describes singular value decomposition, sorting the singular values by size, and constructing an image from the first N singular values and their corresponding singular vectors; there, the decomposition and the sorting are performed serially, which costs additional time and computing resources.
The patent application CN2202111040096.0 mentions integrating a singular value decomposition operator into the "lifting" AI processor to improve that processor's performance, including an application that selects the K largest singular values to approximate the original matrix, but it does not describe how the integrated singular value decomposition operator sorts the singular values.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an FPGA-based singular value decomposition accelerator with a parallel ordering function. Building on the classical unilateral Jacobi algorithm, it treats the column vector norms α and β and the corresponding rotation matrix J differently in different rounds, so that the norms of all column vectors tend toward descending order overall and converge to a fully descending arrangement after several sweeps. The square root of each second-order norm then gives the corresponding singular value, and dividing each column vector by its singular value gives the corresponding left singular vector. The sorting of the singular values and singular vectors thus proceeds in parallel with the decomposition, and its latency is hidden inside the decomposition: two steps that were originally executed serially become a single parallel step, which improves overall real-time performance and saves the hardware resources otherwise dedicated to implementing the sorting function.
The aim of the invention is achieved by the following technical scheme:
In one aspect, an FPGA-based singular value decomposition accelerator with a parallel ordering function comprises an external DDR memory, an AXI interface, k unilateral Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k internal BRAM blocks; the accelerator performs singular value decomposition through the following steps:
S1: matrix data of m rows and n columns is written from the external DDR memory through the AXI interface into the BRAMs inside the FPGA, one column per BRAM; the n BRAMs are grouped in order into pairs of adjacent blocks, giving k = n/2 pairs, where n is even; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make n even;
S2: the k unilateral Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ for each of the k BRAM pairs;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix J is generated according to the unilateral Jacobi algorithm (formula image SMS_1, not reproduced in this text); otherwise, a rotation matrix J of the alternative form is generated (formula image SMS_2, not reproduced in this text); the k rotation matrices $J_i$, $i = 1, 2, \ldots, k$, corresponding to the k BRAM pairs are generated synchronously;
S4: the k unilateral Jacobi orthogonal transformation circuits synchronously perform the orthogonal rotation calculation, and the intermediate results are stored in the n BRAMs;
S5: the column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated for a total of k rounds;
S6: round k+1 is executed, comprising the following sub-steps:
S6.1: the second-order norms α and β and the inner product γ of the k BRAM pairs are computed, and the values of the two second-order norms are exchanged within each of BRAM pairs 1 to k-1;
S6.2: the same operation as S3 is performed to synchronously generate the k rotation matrices $J_i$, $i = 1, 2, \ldots, k$;
S6.3: the last rotation matrix $J_k$ is kept unchanged, and every other rotation matrix is replaced by its transpose;
S6.4: the same operation as S4 is performed;
S7: the column vector data stored in each BRAM is exchanged according to the round-robin scheduling state machine, and the same operations as S6 are performed for rounds k+2 through n-1; completing round n-1 completes one "sweep";
S8: steps S2 to S7 are repeated, performing multiple "sweeps" until the iteration termination condition is met; the singular value decomposition task is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest.
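As a software illustration of the column grouping in S1, the following minimal Python sketch pads an odd column count with one all-zero column and forms the k = n/2 adjacent pairs. The function name `pad_and_pair` and the use of Python lists in place of BRAMs are assumptions made for illustration only; they do not appear in the patent.

```python
def pad_and_pair(columns):
    """Group columns into adjacent pairs, padding with a zero column if needed.

    `columns` is a list of equal-length lists, one list per matrix column
    (a stand-in for "one column per BRAM"). Returns (pairs, k).
    """
    cols = [list(c) for c in columns]
    if len(cols) % 2 == 1:
        # Odd column count: append one all-zero column so that n becomes even.
        cols.append([0.0] * len(cols[0]))
    k = len(cols) // 2
    # Pair i holds columns 2i and 2i+1 (0-based), i.e. adjacent BRAMs.
    pairs = [(cols[2 * i], cols[2 * i + 1]) for i in range(k)]
    return pairs, k


matrix_columns = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # m = 2 rows, n = 3 columns
pairs, k = pad_and_pair(matrix_columns)
print(k)          # 2
print(pairs[-1])  # ([5.0, 6.0], [0.0, 0.0])
```

In the accelerator each pair would feed one transformation circuit; here the pairs are simply returned as tuples.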
Further, the unilateral Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign judgment module, a kth rotation matrix judgment module, a k-round judgment module, a rotation matrix J generation module, a unilateral Jacobi orthogonal rotation calculation module, and a square root calculation module.
Further, the round-robin scheduling mechanism state machine controls the generation of data streams and control streams of each single-sided Jacobi orthogonal transformation circuit, including the reading of BRAM, the calculation of alpha, beta and gamma, the exchange of alpha and beta, the calculation of cos theta and sin theta, the generation of a rotation matrix J, and the write-back operation of Jacobi orthogonal rotation calculation results to BRAM.
Further, the initial column vector index rule is that the lower row holds the odd-indexed column vectors 1, 3, 5, …, n-1, and the upper row holds the even-indexed column vectors 2, 4, 6, …, n.
Further, in the first k rounds of each "sweep", the column vector index in the upper row is always greater than the column vector index in the lower row; in rounds k+1 through n-1 of each "sweep", the column vector index in the upper row is always smaller than the column vector index in the lower row, except for the last column.
Further, the second-order norms of the lower-row column vectors are denoted α, namely $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_k$; the second-order norms of the upper-row column vectors are denoted β, namely $\beta_1, \beta_2, \beta_3, \ldots, \beta_k$.
Further, over the course of S8, apart from local oscillation in the first few "sweeps", the second-order norms of the column vectors show an overall descending trend and finally satisfy $\alpha_1 \ge \beta_1 \ge \alpha_2 \ge \beta_2 \ge \alpha_3 \ge \cdots \ge \alpha_k \ge \beta_k$.
Further, cos θ and sin θ are generated by the formulas given in formula image SMS_3 (not reproduced in this text).
Further, after S8 has been performed and the convergence condition is satisfied, the square root of each column vector second-order norm $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \ldots, \alpha_k, \beta_k$ is computed to obtain the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \ldots, \sigma_{n-1}, \sigma_n$, with $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \sigma_4 \ge \sigma_5 \ge \cdots \ge \sigma_{n-1} \ge \sigma_n$, and the results $\sigma_1, \sigma_2, \ldots, \sigma_n$ are written in order to the external DDR memory through the AXI interface.
Further, each converged column vector $u_1, u_2, u_3, \ldots, u_n$ from S8 is divided by its corresponding singular value $\sigma_1, \sigma_2, \sigma_3, \ldots, \sigma_n$ to obtain the corresponding left singular vectors $u_1/\sigma_1, u_2/\sigma_2, u_3/\sigma_3, \ldots, u_n/\sigma_n$, and the results are written in order to the external DDR memory through the AXI interface.
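The final square-root and division steps can be sketched in a few lines of Python. Because the text takes the square root of each "second-order norm" to obtain a singular value, this sketch assumes that α and β denote squared 2-norms. The function name `finalize` and the handling of an all-zero column (left as zeros instead of dividing by zero) are illustrative assumptions, not details stated in the patent.

```python
import math


def finalize(columns):
    """Return (singular_values, left_vectors) for converged, orthogonal columns."""
    sigmas, left = [], []
    for u in columns:
        sq_norm = sum(x * x for x in u)   # squared 2-norm (alpha or beta)
        sigma = math.sqrt(sq_norm)        # singular value
        sigmas.append(sigma)
        # Assumption: a zero column (e.g. padding) is kept as zeros.
        left.append([x / sigma for x in u] if sigma > 0.0 else list(u))
    return sigmas, left


converged = [[3.0, 4.0], [0.0, 0.0]]
sigmas, left = finalize(converged)
print(sigmas)  # [5.0, 0.0]
```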
The beneficial effects of the invention are as follows:
The invention is particularly suited to VLSI-based (including FPGA-based) implementations of matrix singular value decomposition. The sorting of the singular values and singular vectors is performed in parallel within the cyclic iterative computation of the decomposition, and its latency is hidden inside the overall decomposition, so that two steps originally executed serially become a single parallel, synchronous step. This improves the overall real-time performance of singular value decomposition; in particular, for image compression and principal component analysis, the larger singular values and their corresponding singular vectors can be extracted more quickly. In addition, the invention saves the hardware resources otherwise dedicated to implementing the sorting function.
Drawings
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
FIG. 1 is a block diagram of a singular value decomposition accelerator with parallel ordering function;
FIG. 2 is a circuit diagram of a detailed control channel and data channel of a singular value decomposition accelerator with parallel ordering function;
FIG. 3 is a schematic diagram of a single-sided Jacobi algorithm of a 512 row by 512 column matrix based on a round-robin state machine;
FIG. 4 is a schematic diagram of the column vector exchange operations in one sweep for a column dimension of 6;
FIG. 5 is a diagram of the column vector norm magnitude relationships in one sweep for a column dimension of 6;
FIG. 6 is a graph of the descending trend of some of the column vector norms of a 512-row by 512-column matrix over 5 sweeps.
Detailed Description
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
First, explanation of technical terms is given:
(1) VLSI: Very Large Scale Integration, i.e. very large scale integrated circuits
(2) FPGA: Field Programmable Gate Array
(3) BRAM: Block RAM, the block RAM inside the FPGA
(4) Jacobi: in this invention, unilateral Jacobi rotation, which is commonly used for FPGA-based matrix singular value decomposition
(5) round-robin: polling scheduling, a scheduling mechanism commonly used in unilateral Jacobi rotation singular value decomposition
(6) DDR SDRAM: Double Data Rate Synchronous Dynamic Random Access Memory, the external DDR storage.
The FPGA-based singular value decomposition accelerator with a parallel ordering function comprises an external DDR memory, an AXI interface, k unilateral Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k internal BRAM blocks; the accelerator performs singular value decomposition through the following steps:
S1: matrix data of m rows and n columns is written from the external DDR memory through the AXI interface into the BRAMs inside the FPGA, one column per BRAM; the n BRAMs are grouped in order into pairs of adjacent blocks, giving k = n/2 pairs, where n is even; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make n even;
S2: the k unilateral Jacobi orthogonal transformation circuits compute, in parallel, the second-order norms α and β and the inner product γ for each of the k BRAM pairs;
S3: if the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, a rotation matrix J is generated according to the unilateral Jacobi algorithm (formula image SMS_4, not reproduced in this text); otherwise, a rotation matrix J of the alternative form is generated (formula image SMS_5, not reproduced in this text); the k rotation matrices $J_i$, $i = 1, 2, \ldots, k$, corresponding to the k BRAM pairs are generated synchronously;
Here, cos θ and sin θ are generated by the formulas given in formula image SMS_6 (not reproduced in this text).
S4: the k unilateral Jacobi orthogonal transformation circuits synchronously perform the orthogonal rotation calculation, and the intermediate results are temporarily stored in the n BRAMs;
S5: the column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated for a total of k rounds;
S6: round k+1 is executed, comprising the following sub-steps:
S6.1: the second-order norms α and β and the inner product γ of the k BRAM pairs are computed, and the values of the two second-order norms are exchanged within each of BRAM pairs 1 to k-1;
S6.2: the same operation as S3 is performed to synchronously generate the k rotation matrices $J_i$, $i = 1, 2, \ldots, k$;
S6.3: the last rotation matrix $J_k$ is kept unchanged, and each remaining rotation matrix $J_i$ is replaced by its transpose $J_i^T$, $i = 1, 2, \ldots, k-1$;
S6.4: the same operation as S4 is performed;
S7: the column vector data stored in each BRAM is exchanged according to the round-robin scheduling state machine, and the same operations as S6 are performed for rounds k+2 through n-1; completing round n-1 completes one "sweep";
S8: steps S2 to S7 are repeated, performing multiple "sweeps" until the iteration termination condition is met; the singular value decomposition task is then complete, and the singular values of the column vectors stored in the BRAMs are arranged from largest to smallest in order of BRAM sequence number.
Over the course of S8, apart from local oscillation in the first few "sweeps", the second-order norms of the column vectors show an overall descending trend and finally satisfy $\alpha_1 \ge \beta_1 \ge \alpha_2 \ge \beta_2 \ge \alpha_3 \ge \cdots \ge \alpha_k \ge \beta_k$.
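The text does not spell out the iteration termination condition. One common choice for one-sided Jacobi methods, assumed here purely for illustration, is to stop when every pair of columns is orthogonal to within a relative tolerance. The sketch below checks that criterion and also checks the descending-norm property stated above; the names `is_converged`, `is_descending`, and the tolerance `tol` are hypothetical.

```python
import math


def sq_norm(u):
    """Squared 2-norm of a column (the quantity called alpha or beta)."""
    return sum(x * x for x in u)


def inner(u, v):
    """Inner product of two columns (the quantity called gamma)."""
    return sum(x * y for x, y in zip(u, v))


def is_converged(columns, tol=1e-12):
    """True if |gamma| <= tol * sqrt(alpha * beta) for every column pair."""
    n = len(columns)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = sq_norm(columns[i]), sq_norm(columns[j])
            if abs(inner(columns[i], columns[j])) > tol * math.sqrt(a * b):
                return False
    return True


def is_descending(columns):
    """True if the column norms are in non-increasing order by position."""
    norms = [sq_norm(u) for u in columns]
    return all(norms[i] >= norms[i + 1] for i in range(len(norms) - 1))


print(is_converged([[2.0, 0.0], [0.0, 1.0]]), is_descending([[2.0, 0.0], [0.0, 1.0]]))
```

A hardware implementation would more likely track the largest normalized inner product seen during a sweep than re-scan all pairs; the exhaustive scan here is only for clarity.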
In this embodiment, after S8 has been performed and the convergence condition is satisfied, the square root of each column vector second-order norm $\alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \ldots, \alpha_k, \beta_k$ stored for each BRAM is computed to obtain the corresponding singular values $\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \ldots, \sigma_{n-1}, \sigma_n$, with $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \sigma_4 \ge \sigma_5 \ge \cdots \ge \sigma_{n-1} \ge \sigma_n$, and the results $\sigma_1, \sigma_2, \ldots, \sigma_n$ are written in order to the external DDR memory through the AXI interface. Optionally, each converged column vector $u_1, u_2, u_3, \ldots, u_n$ from S8 is further divided by its corresponding singular value $\sigma_1, \sigma_2, \sigma_3, \ldots, \sigma_n$ to obtain the corresponding left singular vectors $u_1/\sigma_1, u_2/\sigma_2, u_3/\sigma_3, \ldots, u_n/\sigma_n$, and these results are likewise written in order to the external DDR memory through the AXI interface.
The unilateral Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign judgment module, a kth rotation matrix judgment module, a k-round judgment module, a rotation matrix J generation module, a unilateral Jacobi orthogonal rotation calculation module, and a square root calculation module.
The round-robin scheduling mechanism state machine controls the generation of data flow and control flow of each single-sided Jacobi orthogonal transformation circuit, and comprises the reading of BRAM, the calculation of alpha, beta and gamma, the exchange of alpha and beta, the calculation of cos theta and sin theta, the generation of a rotation matrix J and the write-back operation of Jacobi orthogonal rotation calculation results to the BRAM.
In addition, in the singular value decomposition process of the invention, the initial column vector index rule is that the lower row holds the odd-indexed column vectors 1, 3, 5, …, n-1, and the upper row holds the even-indexed column vectors 2, 4, 6, …, n. In the first k rounds of each "sweep", the column vector index in the upper row is always greater than the column vector index in the lower row; in rounds k+1 through n-1 of each "sweep", the column vector index in the upper row is always smaller than the column vector index in the lower row, except for the last column. Further, the second-order norms of the lower-row column vectors are denoted α, namely $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_k$; the second-order norms of the upper-row column vectors are denoted β, namely $\beta_1, \beta_2, \beta_3, \ldots, \beta_k$.
The method of the present invention is explained and illustrated in the following by a specific example.
This embodiment is described using the singular value decomposition of a 512-row by 512-column matrix whose elements are single-precision floating-point numbers conforming to the IEEE 754 standard. A Xilinx XC7V690T-3FFG1761 FPGA is selected as the target hardware for deployment and verification. The smallest physical unit of internal BRAM in this FPGA is an 18 Kb block; a column vector of 512 single-precision floating-point numbers fits in exactly 1 such 18 Kb BRAM, so 512 BRAMs are needed in total.
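The storage figures in this embodiment can be checked with a few lines of arithmetic. The sketch assumes that 18 Kb means 18 × 1024 bits and that each element occupies 32 bits; the constant names are illustrative.

```python
BRAM_BITS = 18 * 1024   # assumed capacity of one 18 Kb block RAM, in bits
FLOAT_BITS = 32         # IEEE 754 single precision
DEPTH = 512             # elements per column vector (matrix rows)
N_COLUMNS = 512         # matrix columns, one BRAM each

bits_per_column = DEPTH * FLOAT_BITS          # 16384 bits = 16 Kb of data
fits_in_one_bram = bits_per_column <= BRAM_BITS
brams_needed = N_COLUMNS                      # one BRAM per column
pair_count = N_COLUMNS // 2                   # k = n / 2 transformation circuits

print(bits_per_column, fits_in_one_bram, brams_needed, pair_count)
# 16384 True 512 256
```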
The specific implementation process of this embodiment is as follows:
Step 1: Through the AXI interface, the 512-row by 512-column matrix data is read from the external DDR memory and written column by column into the corresponding 512 BRAMs inside the FPGA. Column 1 is written to the 1st BRAM and column 2 to the 2nd BRAM; the two form the 1st pair and constitute the internal storage of the 1# unilateral Jacobi orthogonal transformation circuit. Column 3 is written to the 3rd BRAM and column 4 to the 4th BRAM; the two form the 2nd pair and constitute the internal storage of the 2# unilateral Jacobi orthogonal transformation circuit. And so on, until column 511 is written to the 511th BRAM and column 512 to the 512th BRAM; these two form the 256th pair and constitute the internal storage of the 256# unilateral Jacobi orthogonal transformation circuit, as shown in FIG. 1.
Step 2: The k unilateral Jacobi orthogonal transformation circuits in the FPGA of FIG. 1 compute, synchronously and in parallel, the second-order norms α and β and the inner product γ of the column vectors stored in the 256 BRAM pairs of step 1. That is, $\alpha_1$ is the second-order norm computed from the 1st BRAM, $\beta_1$ is the second-order norm computed from the 2nd BRAM, and $\gamma_1$ is the inner product of the two; $\alpha_2$ is the second-order norm computed from the 3rd BRAM, $\beta_2$ is the second-order norm computed from the 4th BRAM, and $\gamma_2$ is the inner product of the two; and so on, until $\alpha_{256}$ is the second-order norm computed from the 511th BRAM, $\beta_{256}$ is the second-order norm computed from the 512th BRAM, and $\gamma_{256}$ is the inner product of the two.
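The per-pair quantities of step 2 can be written out as a short Python sketch. It assumes, as elsewhere in this text, that the "second-order norm" is the squared 2-norm; the function name `pair_statistics` is hypothetical, and the sequential loop stands in for the k circuits that operate concurrently in hardware.

```python
def pair_statistics(pairs):
    """For each (first, second) column pair return (alpha, beta, gamma)."""
    stats = []
    for first, second in pairs:
        alpha = sum(x * x for x in first)                   # squared norm of first column
        beta = sum(y * y for y in second)                   # squared norm of second column
        gamma = sum(x * y for x, y in zip(first, second))   # inner product of the two
        stats.append((alpha, beta, gamma))
    return stats


pairs = [([1.0, 2.0], [3.0, 4.0]), ([0.0, 1.0], [1.0, 0.0])]
print(pair_statistics(pairs))  # [(5.0, 25.0, 11.0), (1.0, 1.0, 0.0)]
```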
Step 3: Taking the 1# one-sided Jacobi orthogonal transformation circuit as an example, cos θ_1 and sin θ_1 are generated according to the one-sided Jacobi algorithm, using the following formula:
Figure SMS_7
To realize the parallel ordering function, the output of the α/β comparison circuit is then given the following special treatment:
If α_1 ≥ β_1, then
Figure SMS_8
Otherwise,
Figure SMS_9
By analogy, the 256 rotation matrices J_1, J_2, …, J_256 are generated synchronously in parallel.
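The formulas of Step 3 survive here only as figure placeholders, so the following sketch does not reproduce them. It uses the textbook Hestenes one-sided Jacobi rotation, with α and β taken as squared norms, and models the "otherwise" branch as that rotation composed with a column swap. Both choices are assumptions made for illustration; the name `sorting_rotation` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def sorting_rotation(alpha, beta, gamma, tol=1e-15):
    """Return a 2x2 rotation J such that [a b] @ J has orthogonal columns
    and the first new column has the larger (or equal) norm."""
    if abs(gamma) <= tol * np.sqrt(alpha * beta):
        c, s = 1.0, 0.0                      # columns already orthogonal
    else:
        zeta = (beta - alpha) / (2.0 * gamma)
        t = 1.0 / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))   # |tan(theta)| <= 1
        # Sign chosen so that the column with the larger norm grows further.
        t = -np.copysign(t, gamma) if alpha >= beta else np.copysign(t, gamma)
        c = 1.0 / np.sqrt(1.0 + t * t)
        s = c * t
    if alpha >= beta:
        return np.array([[c, s], [-s, c]])   # plain rotation keeps the order
    return np.array([[s, -c], [c, s]])       # rotation combined with a swap
```

With this convention the new squared norms are α − tγ and β + tγ, so the branch on α ≥ β is enough to decide which output column ends up larger.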
Step 4: The 256 one-sided Jacobi orthogonal transformation circuits synchronously perform the one-sided Jacobi orthogonal rotation. Let
Figure SMS_10
denote the current vector of column 1 in round 1, and let
Figure SMS_11
denote the vector obtained after column 1 has been updated by the round-1 orthogonal rotation. For the 1st pair of column vectors, the Jacobi orthogonal rotation is
Figure SMS_12
for the 2nd pair of column vectors it is
Figure SMS_13
and so on, up to
Figure SMS_14
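The per-pair update of Step 4 is given in the figures and is not reproduced in the text. A minimal sketch, assuming the usual one-sided form in which each adjacent column pair is right-multiplied by its own 2×2 matrix, might look like this (the function name is hypothetical):

```python
import numpy as np

def apply_pair_rotations(U, rotations):
    """Right-multiply each adjacent column pair of U by its own 2x2 matrix."""
    V = U.copy()
    for i, J in enumerate(rotations):
        # Pair i occupies columns 2i and 2i+1 (BRAM blocks 2i+1 and 2i+2).
        V[:, 2 * i:2 * i + 2] = U[:, 2 * i:2 * i + 2] @ J
    return V
```

The loop is sequential only for readability; the pairs are independent, which is what allows the hardware to update all of them at once.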
Step 5: FIG. 2 shows in detail the control path and data path inside the accelerator for one pair of column vectors. The round-robin scheduling state machine is responsible for overall flow control and, according to the current round, controls the data input and the computation results of each unit module. After the 256 pairs of column vectors have undergone the one-sided Jacobi orthogonal rotation, the updated column vectors are exchanged according to the round-robin scheduling mechanism of FIG. 3. Specifically, the last column, u_512, is fixed, and the other column vectors are rotated counter-clockwise concurrently: u_1 passes right to u_3, u_3 passes right to u_5, …, u_509 passes right to u_511; u_511 passes diagonally to u_510; u_510 passes left to u_508, u_508 passes left to u_506, …, u_4 passes left to u_2; and u_2 passes down to u_1. In terms of the storage in FIG. 1, the data stored in BRAM_4 are transferred to BRAM_2, the data in BRAM_6 to BRAM_4, …; the data in BRAM_1 are transferred to BRAM_3, the data in BRAM_3 to BRAM_5, …; the data in BRAM_2 are transferred to BRAM_1; and the data in BRAM_512 remain unchanged. Steps 2 to 4 are then repeated, so that k = 512/2 = 256 rounds are executed in total; the result is shown in FIG. 3.
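The exchange pattern of Step 5 can be written as a permutation of slot contents. In the sketch below, `slots[j]` stands for whatever is currently stored in BRAM_{j+1}; the function name and the list representation are illustrative assumptions.

```python
def round_robin_exchange(slots):
    """Return the BRAM contents for the next round; the last slot stays fixed."""
    n = len(slots)
    new = list(slots)
    if n <= 2:
        return new                     # a single pair has nothing to exchange
    new[0] = slots[1]                  # BRAM_2 -> BRAM_1
    for j in range(2, n - 1, 2):       # BRAM_1 -> BRAM_3, BRAM_3 -> BRAM_5, ...
        new[j] = slots[j - 2]
    for j in range(1, n - 3, 2):       # BRAM_4 -> BRAM_2, BRAM_6 -> BRAM_4, ...
        new[j] = slots[j + 2]
    new[n - 3] = slots[n - 2]          # BRAM_{n-1} -> BRAM_{n-2} (diagonal move)
    return new
```

For six columns labelled 1 to 6, one exchange turns the layout `[1, 2, 3, 4, 5, 6]` into `[2, 4, 1, 5, 3, 6]`, so the pairs of the second round are (2, 4), (1, 5) and (3, 6). Over n − 1 rounds every pair of columns meets exactly once.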
Step 6: Round k + 1 = 256 + 1 = 257 is then executed, comprising the following sub-steps:
Step 6.1: As in Step 2, the second-order norms α, β and the inner product γ of the 256 pairs of column vectors are computed, but the norms receive special treatment: for the last pair of column vectors, the relationship between α_256 and β_256 is left unchanged, while for every other pair the norm values α_i and β_i are exchanged, that is, the values are swapped between α_1 and β_1, between α_2 and β_2, …, and between α_255 and β_255. Note that the positions of the column vectors themselves do not change.
Step 6.2: After the special treatment of Step 6.1, the same operation as in Step 3 is performed to generate the 256 rotation matrices J_1, J_2, …, J_256.
Step 6.3: The rotation matrix J_256 of the last pair of column vectors is kept unchanged, and each remaining rotation matrix J_i (i = 1, 2, …, 255) is replaced by its transpose J_i^T.
Step 6.4: The one-sided Jacobi orthogonal rotation transformation is performed synchronously on the 256 pairs of column vectors.
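One way to see why exchanging the norm values and then transposing J leaves the column vectors in place is the identity P J P = J^T, which holds for any proper 2×2 rotation J when P is the matrix that swaps the two columns of a pair. The sketch below checks this numerically. It assumes J has the form [[c, s], [−s, c]], which the text does not confirm, and the helper name and the angle 0.3 are arbitrary illustrations.

```python
import numpy as np

def second_half_rotation(J_from_swapped_norms, is_last_pair):
    """Step 6.3: keep the last pair's matrix, transpose all the others."""
    return J_from_swapped_norms if is_last_pair else J_from_swapped_norms.T

theta = 0.3                                # arbitrary illustrative angle
c, s = np.cos(theta), np.sin(theta)
J = np.array([[c, s], [-s, c]])            # assumed form of a proper rotation
P = np.array([[0.0, 1.0], [1.0, 0.0]])     # swaps the two columns of a pair

# Reversing the pair, rotating it, and reversing it back equals using J transposed.
assert np.allclose(P @ J @ P, J.T)
```

Under that assumption, treating the pair as if its columns were reversed (Step 6.1) and then applying J^T to the unreversed pair gives the same result as physically swapping the columns before and after the rotation.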
Step 7: and repeatedly executing the step 6 until the 511 th round of operation is completed.
Steps 2 to 7 together are called one sweep. After each sweep, the second-order norms of the column vectors show an overall descending trend, i.e. α_1 ≥ β_1 ≥ α_2 ≥ β_2 ≥ α_3 ≥ … ≥ α_256 ≥ β_256. In the first few sweeps there are occasional oscillations at a very small number of positions, i.e. a few column vectors whose second-order norms satisfy β_i < α_{i+1} (i = 1, 2, …, 255).
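To make the notion of a sweep concrete, here is a small software model that runs several sweeps on a dense matrix and returns the column norms. It is a sketch under several assumptions, not the patented circuit: the rotation is the textbook Hestenes formula, the first-half and second-half rules are expressed as "the column with the smaller original index should receive the larger norm", the columns stay in place while a list `order` tracks which column sits in which slot, and all names are hypothetical.

```python
import numpy as np

def rotation(alpha, beta, gamma, first_should_be_larger, tol=1e-15):
    """2x2 rotation that orthogonalises a pair and places the larger norm as requested."""
    if not first_should_be_larger:
        # Second-half rule: exchange the two norms, then transpose J.
        return rotation(beta, alpha, gamma, True, tol).T
    if abs(gamma) <= tol * np.sqrt(alpha * beta):
        c, s = 1.0, 0.0
    else:
        zeta = (beta - alpha) / (2.0 * gamma)
        t = 1.0 / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
        t = -np.copysign(t, gamma) if alpha >= beta else np.copysign(t, gamma)
        c = 1.0 / np.sqrt(1.0 + t * t)
        s = c * t
    if alpha >= beta:
        return np.array([[c, s], [-s, c]])
    return np.array([[s, -c], [c, s]])

def exchange(order):
    """Round-robin exchange of slot contents with the last slot fixed."""
    n = len(order)
    new = list(order)
    if n > 2:
        new[0] = order[1]
        for j in range(2, n - 1, 2):
            new[j] = order[j - 2]
        for j in range(1, n - 3, 2):
            new[j] = order[j + 2]
        new[n - 3] = order[n - 2]
    return new

def sorted_jacobi_singular_values(A, sweeps=15):
    """Run a fixed number of sweeps; A must have an even number of columns."""
    U = np.array(A, dtype=float)
    n = U.shape[1]
    order = list(range(n))          # order[slot] = original index of the column held there
    for _ in range(sweeps):
        for _ in range(n - 1):      # n - 1 rounds per sweep
            for i in range(n // 2):
                p, q = order[2 * i], order[2 * i + 1]
                a, b = U[:, p], U[:, q]
                J = rotation(a @ a, b @ b, a @ b, first_should_be_larger=(p < q))
                U[:, [p, q]] = np.column_stack([a, b]) @ J
            order = exchange(order)
    return np.sqrt(np.sum(U * U, axis=0))
```

The returned vector is indexed by original column position, so a descending result indicates that the ordering emerged from the rotations themselves rather than from a separate sort. The fixed sweep count is a simplification; a real design would test a convergence condition instead.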
For ease of description and understanding, a matrix with 6 columns is used to illustrate the whole process and the resulting ordering of the column-vector second-order norms. As shown in FIG. 4, within one sweep the first k rounds are the first 3 steps, shown in the upper half of the figure. In the subsequent rounds k+1 to n−1, i.e. steps 4 and 5 in the lower half of the figure, the column index of u_6 is 6, which is always greater than the other column indices, so the second-order norms α_3 and β_3 of the last pair of column vectors are left unchanged, while the second-order norm values of the first 2 pairs of column vectors must be exchanged, although the column vectors themselves are not exchanged. That is, in step 4, α_1 equals the second-order norm of column vector u_3, β_1 equals that of u_5, α_2 equals that of u_1, and β_2 equals that of u_4; step 5 is handled in the same way. After the orthogonal transformations, the changes in magnitude and the overall trend of the column-vector second-order norms are shown in FIG. 5. The exchange treatment of the orthogonal rotation matrix J and of the second-order norms in the present invention causes the second-order norms of the column vectors to be ordered from large to small according to column index, i.e. the overall trend is
Figure SMS_15
Step 8: according to the matrix size of 512 rows by 512 columns, 6 times sweep is performed to meet the preset convergence condition. At this time
Figure SMS_16
Performing square root computation on column vector norms of 512 columns in parallel to obtain singular values of sigma respectively 1 ,σ 2 ,σ 3 ,…,σ 512 Corresponding sigma 1 ≥σ 2 ≥σ 3 ≥…≥σ 512 Further dividing each column vector by the corresponding singular value to obtain a left singular vector, i.e. u 11 ,u 22 ,u 33 ,…,u 512512 And (5) completing the singular value decomposition task. As shown in fig. 6, a part of the column vector second-order norm values are truncated, wherein the whole exhibits a tendency of descending order after 1 sweep execution, but there is a partial concussion, and the column vector second-order norm values are basically monotonically decreasing after 4 sweep execution.
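The final normalisation of Step 8 can be sketched as follows, assuming the converged columns are held in one array and are already mutually orthogonal. The function name is hypothetical, and the guard against zero singular values is an added assumption, since the text does not discuss that case.

```python
import numpy as np

def finalize_svd(U_converged):
    """Return singular values and unit-length left singular vectors."""
    norms_sq = np.sum(U_converged * U_converged, axis=0)   # the alpha/beta values
    sigma = np.sqrt(norms_sq)                               # sigma_i = sqrt(norm_i)
    left = np.zeros_like(U_converged)
    nonzero = sigma > 0      # assumption: leave columns with sigma = 0 as zero vectors
    left[:, nonzero] = U_converged[:, nonzero] / sigma[nonzero]   # u_i / sigma_i
    return sigma, left
```

Because each column is scaled independently, the 512 square roots and divisions can run in parallel, as the step describes.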
Step 9: The singular values obtained in Step 8, arranged in descending order, are written back to the external DDR memory through the AXI interface.
The FPGA results show that, on the XC7V690T-3FFG1761 target hardware running at a 200 MHz clock, the singular value decomposition of a single-precision floating-point matrix of 512 rows × 512 columns completes in 52.9 milliseconds, with the singular values and singular vectors arranged in descending order. By comparison, according to the results published by Xilinx for its matrix singular value decomposition Solver library, the singular value decomposition of a 512 × 512 real symmetric single-precision floating-point matrix takes 1.687 seconds on an Alveo U250 accelerator card, and ordering the singular values and singular vectors additionally requires extra functional circuitry and more time.
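The two timing figures quoted above can be put side by side with a short calculation. This is plain arithmetic on the stated numbers; the function names are hypothetical, and because the two measurements come from different devices the ratio is only indicative.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of the two run times."""
    return baseline_seconds / accelerated_seconds

def cycle_count(seconds, clock_hz):
    """Number of clock cycles that fit into the given run time."""
    return seconds * clock_hz

ratio = speedup(1.687, 52.9e-3)         # roughly 31.9
budget = cycle_count(52.9e-3, 200e6)    # roughly 1.058e7 cycles at 200 MHz
```

The cycle figure gives a feel for how many clock periods the six sweeps of the 512 × 512 case occupy in total.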
The embodiments show that, for matrix singular value decomposition implemented in VLSI (including FPGAs), the special treatment of α_i, β_i and the rotation matrix J in the present invention arranges the singular values and singular vectors in descending order in parallel with the decomposition itself, saving the time and hardware cost of a dedicated sorting task. The invention therefore improves the real-time performance of singular value decomposition and reduces hardware resource cost.
Corresponding to the embodiment of the FPGA-based singular value decomposition accelerator with the parallel ordering function, the invention also provides an embodiment of the FPGA-based singular value decomposition system with the parallel ordering function.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. An FPGA-based singular value decomposition accelerator with a parallel ordering function, characterized by comprising an external DDR memory, an AXI interface, k one-sided Jacobi orthogonal transformation circuits, a round-robin scheduling state machine, and 2k internal BRAM memory blocks; the singular value decomposition accelerator performs singular value decomposition through the following steps:
S1: Matrix data of m rows and n columns are written from the external DDR memory into the BRAMs inside the FPGA through the AXI interface, each column corresponding to one BRAM block; every two adjacent BRAM blocks of the n blocks are combined in sequence into a pair, giving k = n/2 pairs, where n is even; if the matrix originally has an odd number of columns, one all-zero column is appended at the end to make the number even;
S2: The k one-sided Jacobi orthogonal transformation circuits compute in parallel the second-order norms α, β and the inner product γ of the k BRAM pairs;
S3: If the second-order norm α of the first column vector is greater than or equal to the second-order norm β of the second column vector, the rotation matrix
Figure QLYQS_1
is generated according to the one-sided Jacobi algorithm; otherwise, the rotation matrix
Figure QLYQS_2
is generated; the k rotation matrices J_i, i = 1, 2, …, k, corresponding to the k BRAM pairs are generated synchronously;
S4: The k one-sided Jacobi orthogonal transformation circuits synchronously perform the orthogonal rotation, and the intermediate results are stored in the n BRAM blocks;
S5: The column vectors are exchanged according to the round-robin scheduling state machine, and steps S2 to S5 are repeated, so that k rounds are executed in total;
S6: Round k+1 is executed, comprising the following sub-steps:
S6.1: The second-order norms α, β and the inner product γ of the k BRAM pairs are computed, and the values of the two second-order norms are exchanged within each of BRAM pairs 1 to k−1;
S6.2: The same operation as S3 is performed to synchronously generate the k rotation matrices J_i, i = 1, 2, …, k;
S6.3: The last rotation matrix J_k is kept unchanged, and each of the remaining rotation matrices is replaced by its transpose;
S6.4: The same operation as S4 is performed;
S7: The column vector data stored in each BRAM block are exchanged according to the round-robin scheduling state machine, and the same operations as S6 are performed for rounds k+2 to n−1, until round n−1 is completed, which completes one "sweep";
S8: Steps S2 to S7 are repeated, performing multiple "sweeps" until the iteration termination condition is met, which completes the singular value decomposition task, with the singular values of the column vectors stored in the BRAM blocks arranged from large to small.
2. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 1, wherein each one-sided Jacobi orthogonal transformation circuit comprises a norm and inner product calculation module, a cos θ and sin θ calculation module, a norm comparison module, a γ sign decision module, a k-th rotation matrix decision module, a round-less-than-k decision module, a rotation matrix J generation module, a one-sided Jacobi orthogonal rotation calculation module, and a square root calculation module.
3. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 1, wherein the round-robin scheduling state machine controls the data flow and control flow of each one-sided Jacobi orthogonal transformation circuit, including reading the BRAMs, computing α, β, γ and exchanging α and β, computing cos θ and sin θ, generating the rotation matrix J, and writing the Jacobi orthogonal rotation results back to the BRAMs.
4. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 1, wherein the initial column vector indexing rule is that the column vector indices of the lower row are odd, namely column vectors 1, 3, 5, …, n−1, and the column vector indices of the upper row are even, namely column vectors 2, 4, 6, …, n.
5. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 2, wherein, in the first k rounds of each "sweep", the column vector index of the upper row is always greater than the column vector index of the lower row; and in rounds k+1 to n−1 of each "sweep", the column vector index of the upper row is always smaller than the column vector index of the lower row, except for the last column.
6. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 1, wherein the second-order norms of the lower-row column vectors are α, namely α_1, α_2, α_3, …, α_k; and the second-order norms of the upper-row column vectors are β, namely β_1, β_2, β_3, …, β_k.
7. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 1, wherein, after S8 is executed, although local oscillation exists in the first few "sweeps", the second-order norms of the column vectors show an overall descending trend and finally satisfy α_1 ≥ β_1 ≥ α_2 ≥ β_2 ≥ α_3 ≥ … ≥ α_k ≥ β_k.
8. The FPGA-based singular value decomposition accelerator with parallel ordering function of claim 1, wherein the generation formulas of cos θ and sin θ are as follows:
Figure QLYQS_3
9. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 1, wherein, after S8 is executed and the convergence condition is satisfied, square roots are computed for the second-order norms α_1, β_1, α_2, β_2, α_3, …, α_k, β_k of the column vectors to obtain the corresponding singular values σ_1, σ_2, σ_3, σ_4, σ_5, …, σ_{n−1}, σ_n, with σ_1 ≥ σ_2 ≥ σ_3 ≥ σ_4 ≥ σ_5 ≥ … ≥ σ_{n−1} ≥ σ_n, and the results σ_1, σ_2, σ_3, σ_4, σ_5, …, σ_{n−1}, σ_n are written sequentially to the external DDR memory through the AXI interface.
10. The FPGA-based singular value decomposition accelerator with a parallel ordering function according to claim 9, wherein each column vector u_1, u_2, u_3, …, u_n satisfying convergence in S8 is divided by its corresponding singular value σ_1, σ_2, σ_3, …, σ_n to obtain the corresponding left singular vectors u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_n/σ_n, and the results u_1/σ_1, u_2/σ_2, u_3/σ_3, …, u_n/σ_n are written sequentially to the external DDR memory through the AXI interface.
CN202310669739.0A 2023-06-07 2023-06-07 Singular value decomposition accelerator with parallel ordering function based on FPGA Active CN116382617B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310669739.0A CN116382617B (en) 2023-06-07 2023-06-07 Singular value decomposition accelerator with parallel ordering function based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310669739.0A CN116382617B (en) 2023-06-07 2023-06-07 Singular value decomposition accelerator with parallel ordering function based on FPGA

Publications (2)

Publication Number Publication Date
CN116382617A true CN116382617A (en) 2023-07-04
CN116382617B CN116382617B (en) 2023-08-29

Family

ID=86961959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310669739.0A Active CN116382617B (en) 2023-06-07 2023-06-07 Singular value decomposition accelerator with parallel ordering function based on FPGA

Country Status (1)

Country Link
CN (1) CN116382617B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118153494A (en) * 2024-05-11 2024-06-07 南京邮电大学 Hardware acceleration system for realizing matrix SVD decomposition based on AXI bus

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060173948A1 (en) * 2005-01-28 2006-08-03 Bae Systems Information And Electronic Systems Integration Inc Scalable 2X2 rotation processor for singular value decomposition
CN101390351A (en) * 2004-11-15 2009-03-18 高通股份有限公司 Eigenvalue decomposition and singular value decomposition of matrices using jacobi rotation
CN106528490A (en) * 2016-11-30 2017-03-22 郑州云海信息技术有限公司 FPGA (Field Programmable Gate Array) heterogeneous accelerated computing device and system
CN107506173A (en) * 2017-08-30 2017-12-22 郑州云海信息技术有限公司 A kind of accelerated method, the apparatus and system of singular value decomposition computing
KR20190059033A (en) * 2017-11-22 2019-05-30 한국전자통신연구원 Input vector generating apparatus and method using singular vaule decomposition for deep neural network speech recognition system
CN112596701A (en) * 2021-03-05 2021-04-02 之江实验室 FPGA acceleration realization method based on unilateral Jacobian singular value decomposition
CN113536228A (en) * 2021-09-16 2021-10-22 之江实验室 FPGA acceleration implementation method for matrix singular value decomposition
US11190244B1 (en) * 2020-07-31 2021-11-30 Samsung Electronics Co., Ltd. Low complexity algorithms for precoding matrix calculation
CN115659880A (en) * 2022-09-01 2023-01-31 重庆邮电大学 Hardware circuit and method of principal component analysis algorithm based on singular value decomposition
CN116170601A (en) * 2023-04-25 2023-05-26 之江实验室 Image compression method based on four-column vector block singular value decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
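Ying Jun; Zhu Yunpeng: "FPGA implementation of matrix singular value decomposition based on CORDIC", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 03 *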

Also Published As

Publication number Publication date
CN116382617B (en) 2023-08-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant