CN117077734A - Convolution input conversion method, hardware accelerator and accelerator structure determination method


Info

Publication number: CN117077734A
Application number: CN202311101953.2A
Authority: CN
Prior art keywords: input, vector, convolution, input transformation, result
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李明峻, 沈小勇, 吕江波
Applicant and current assignee: Shenzhen Smartmore Technology Co Ltd
Priority to: CN202311101953.2A
Publication of: CN117077734A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 - Multiplying; Dividing
    • G06F 7/523 - Multiplying only


Abstract

The present application relates to a convolution input transformation method, a hardware accelerator, and an accelerator structure determination method. The convolution input transformation method is executed by a hardware accelerator in which a target input transformation processing unit is obtained by configuring and combining a basic input transformation processing unit and a peripheral unit. The peripheral unit performs cross-correlation operations between a column vector and a first coefficient vector and between the column vector and each second coefficient vector, obtaining a first cross-correlation result and a plurality of second cross-correlation results. The basic input transformation processing unit obtains the first input transformation result vector elements corresponding to the column vector from the product of a basic input transformation parameter matrix and the first cross-correlation result. The peripheral unit obtains the second input transformation result vector elements corresponding to the column vector from the products of a basic coefficient vector and each second cross-correlation result. A convolution input transformation result corresponding to the input feature matrix is determined from the input transformation result vectors of the column vectors. The present application improves the configurability of the hardware accelerator.

Description

Convolution input conversion method, hardware accelerator and accelerator structure determination method
Technical Field
The present application relates to the field of embedded technologies, and in particular, to a convolution input transformation method, a hardware accelerator, and an accelerator structure determining method.
Background
In recent years, convolutional Neural Networks (CNNs) have been widely used in various fields and play an important role. Convolution operators are the fundamental component of convolutional neural networks, and are also the most time-consuming part, containing a large number of multiplications. Winograd convolution is considered to be a highly efficient fast convolution algorithm because it greatly reduces the multiplication operations in the convolution. The two-dimensional Winograd convolution function may be defined by F (k, n), where k is the side length of the output feature matrix and n is the side length of the convolution kernel, and the side length ω=k+n-1 of the input feature matrix, that is, the convolution represented by F (k, n) completes each convolution operation of the (k+n-1 ) sized input feature matrix and the (n, n) sized convolution kernel, outputting the (k, k) sized output feature matrix. The matrix calculation of the convolution is:
S=A T [(B T dB)⊙(GgG T )]A
wherein A, B, G is a parameter matrix, and when k and n are determined, the parameter matrix is also determined. The symbol as follows indicates that two matrices of the same size are multiplied by each other in the corresponding positions. Wherein, for the input transform part of Winograd convolution, the calculation formula is:
U=B T dB
Where d is an input feature matrix, which is a matrix of size (k+n-1 ). The size of the input transformation parameter matrix B is (k+n-1 ). The result of the computation of the input transform is a matrix U of size (k+n-1 ).
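To make the input transform concrete, the following is a minimal sketch, assuming the F(2, 3) input transformation parameter matrix commonly published for Winograd convolution (its row signs may differ from the matrix derived later in this application):

```python
import numpy as np

# Transpose B^T of the F(2, 3) input transformation parameter matrix as commonly
# published for Winograd convolution; the matrix used in this application may
# differ in the sign convention of individual rows (an assumption).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]])

d = np.arange(16.0).reshape(4, 4)  # a (k+n-1, k+n-1) = (4, 4) input feature tile

U = B_T @ d @ B_T.T                # U = B^T d B, also of size (4, 4)
print(U)
```

As stated above, the transformed tile U has the same (k+n-1, k+n-1) size as the input tile d.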
In recent years, much research has been devoted to designing hardware accelerators dedicated to the Winograd convolution input transform, mainly deployed on platforms such as FPGAs, to compute the Winograd convolution input transform U = B^T d B. However, the Winograd convolution input transform hardware accelerators designed by conventional approaches are generally specific to a few convolution forms, such as F(2, 3) and F(4, 3); when the convolution parameters (i.e., at least one of k and n) change, large-scale modifications of the underlying implementation of the hardware accelerator, or even of the architecture of the whole system, are required, resulting in poor configurability.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a convolution input conversion method, a hardware accelerator, an accelerator structure determination method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product, which can achieve an improvement in the configurability of the hardware accelerator.
In a first aspect, the present application provides a convolution input conversion method, which is executed by a hardware accelerator, wherein a target input conversion processing unit corresponding to a specified convolution parameter in the hardware accelerator is configured and combined by a basic input conversion processing unit and a peripheral unit, and the method includes:
For each column vector in the input feature matrix, performing, by a peripheral unit, cross-correlation operations between the column vector and a first coefficient vector and between the column vector and each second coefficient vector, to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first coefficient vector is determined by a plurality of increment polynomials; the second coefficient vectors are the coefficient vectors corresponding to the respective increment polynomials; the increment polynomials are the polynomials that the convolution polynomials corresponding to the specified convolution parameters additionally have compared with the basic convolution polynomials;
obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and a first cross-correlation result through a basic input transformation processing unit;
obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and the second cross-correlation results respectively through the peripheral units; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
and determining a convolution input transformation result corresponding to the input feature matrix according to the first input transformation result vector element and the second input transformation result vector element respectively corresponding to each column vector.
In a second aspect, the present application provides a hardware accelerator for convolution input transformation, the hardware accelerator comprising a plurality of target input transformation processing units corresponding to specified convolution parameters; the target input transformation processing unit is obtained by combining a basic input transformation processing unit and a peripheral unit configuration;
the peripheral unit is used for carrying out cross-correlation operation on the input column vector, the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first coefficient vector is determined by a plurality of delta polynomials; the second coefficient vectors are coefficient vectors corresponding to the increment polynomials respectively; the increment polynomial is a polynomial which is obtained by adding a convolution polynomial corresponding to a specified convolution parameter compared with a basic convolution polynomial;
the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of the basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross-correlation result;
the peripheral unit is also used for obtaining second input transformation result vector elements corresponding to the column vectors according to the products of the basic coefficient vectors and the second cross correlation results respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials; and obtaining an input transformation result vector corresponding to the column vector according to the first input transformation result vector element and the second input transformation result vector element corresponding to the column vector.
In a third aspect, the present application provides a method for determining a structure of a hardware accelerator for convolutional input transformation, including:
the hardware accelerator comprises a plurality of target input transformation processing units;
the structure determination step of the target input transform processing unit includes:
acquiring input convolution parameters;
determining a first coefficient vector according to a plurality of increment polynomials, and determining a second coefficient vector according to each increment polynomial respectively; the increment polynomial is a polynomial which is obtained by adding a convolution polynomial corresponding to the input convolution parameter compared with a basic convolution polynomial;
determining the structure of a peripheral unit according to the first coefficient vector, the second coefficient vector and a basic input transformation parameter matrix corresponding to the basic convolution polynomial; the peripheral unit is used for carrying out cross-correlation operation on column vectors in the input feature matrix and the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results, and obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and each second cross-correlation result respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
Acquiring the structure of a basic input transformation processing unit according to the input convolution parameters; the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result;
the structure of the target input transform processing unit is determined based on the structures of the base input transform processing unit and the peripheral unit.
In a fourth aspect, the present application provides a structure determining apparatus of a hardware accelerator for convolution input transformation, comprising:
the first acquisition module is used for acquiring input convolution parameters;
the first determining module is used for determining a first coefficient vector according to a plurality of increment polynomials and determining each second coefficient vector according to each increment polynomial respectively; the increment polynomial is a polynomial which is obtained by adding a convolution polynomial corresponding to the input convolution parameter compared with a basic convolution polynomial;
the second determining module is used for determining the structure of the peripheral unit according to the first coefficient vector, the second coefficient vector and the basic input transformation parameter matrix corresponding to the basic convolution polynomial; the peripheral unit is used for carrying out cross-correlation operation on column vectors in the input feature matrix and the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results, and obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and each second cross-correlation result respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
The second acquisition module is used for acquiring the structure of the basic input transformation processing unit according to the input convolution parameters; the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result;
and the third determining module is used for determining the structure of the target input transformation processing unit according to the structures of the basic input transformation processing unit and the peripheral unit.
In a fifth aspect, the present application provides a computer device comprising a memory storing a computer program and a processor, the processor implementing the steps of the method described above when executing the computer program.
In a sixth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described above.
In a seventh aspect, the application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
With the above convolution input transformation method, hardware accelerator, accelerator structure determination method, apparatus, computer device, computer-readable storage medium and computer program product, the target input transformation processing unit in the hardware accelerator for the convolution input transform of the specified convolution parameters can be obtained simply by configuring and combining the basic input transformation processing unit and the peripheral unit. The peripheral unit obtains a first cross-correlation result and a plurality of second cross-correlation results through cross-correlation operations, and obtains the second input transformation result vector elements corresponding to a column vector from the products of the basic coefficient vector and each second cross-correlation result; the basic input transformation processing unit obtains the first input transformation result vector elements corresponding to the column vector from the product of the basic input transformation parameter matrix corresponding to the basic convolution polynomials and the first cross-correlation result. In this way, the first and second input transformation result vector elements corresponding to each column vector in the input feature matrix can be obtained, and the convolution input transformation result corresponding to the input feature matrix can be determined. That is, the target input transformation processing unit in the hardware accelerator for the convolution input transform of the specified convolution parameters can be obtained by adding the peripheral unit on the basis of the basic input transformation processing unit, which improves the configurability of the hardware accelerator for convolution input transformation.
Drawings
FIG. 1 is an application environment diagram of a convolution input transformation method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of Winograd convolution according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a transpose matrix of an input transform parameter matrix according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an equivalent relationship of a convolution input transformation according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a convolution input transformation method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a hardware accelerator structure and interaction among parts according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a target input transformation processing unit according to an embodiment of the present application;
FIG. 8 is a block diagram of a hardware accelerator for convolutional input transformation according to an embodiment of the present application;
FIG. 9 is a block diagram of another hardware accelerator for convolving input transformations provided by an embodiment of the application;
FIG. 10 is a flowchart of a method for determining a structure of a hardware accelerator for convolutional input transformation according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an overall flow chart according to an embodiment of the present application;
FIG. 12 is a block diagram of a hardware accelerator structure determination device for convolutional input transformation according to an embodiment of the present application;
FIG. 13 is a diagram illustrating an internal architecture of a computer device according to an embodiment of the present application;
FIG. 14 is an internal structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The convolution input transformation method provided by the embodiments of the present application can be applied to the application environment shown in FIG. 1. The computer device 102 may obtain the input convolution parameters, perform the structure determination method of a hardware accelerator for convolution input transformation to determine the structure of the target input transformation processing unit, and obtain the structure of the hardware accelerator based on the structure of the target input transformation processing unit. The computer device 102 may apply the determined hardware accelerator structure to obtain the hardware accelerator 104 for convolution input transformation. The hardware accelerator 104 may perform the convolution input transformation method, performing the convolution input transform on an input feature matrix to obtain the corresponding convolution input transformation result. The computer device 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device, or portable wearable device; the Internet of Things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like, and the portable wearable devices may be smart watches, smart bracelets, headsets, and the like. The hardware accelerator 104 may be obtained by loading the structure of the hardware accelerator determined by the computer device 102 into a programmable logic device, for example an FPGA (Field Programmable Gate Array). The hardware accelerator 104 may also be an application-specific integrated circuit (ASIC) custom manufactured according to the structure of the hardware accelerator determined by the computer device 102.
The two-dimensional Winograd convolution may be denoted by F(k, n), where k is the side length of the output feature matrix and n is the side length of the convolution kernel (i.e., the convolution parameters include k and n), and the side length of the input feature matrix is ω = k+n-1; that is, the Winograd convolution represented by F(k, n) is a convolution process between an input feature matrix of size (k+n-1, k+n-1) and a convolution kernel of size (n, n), outputting an output feature matrix of size (k, k). In general, k is even and n is odd. The matrix calculation formula of the Winograd convolution is:
S = A^T [(B^T d B) ⊙ (G g G^T)] A
where A, B and G are parameter matrices that are fixed once k and n are determined, and the symbol ⊙ denotes element-wise multiplication of two matrices of the same size.
As shown in FIG. 2, in the above formula, B^T d B belongs to the input transform and G g G^T belongs to the convolution kernel transform; the element-wise product of the input transform result and the convolution kernel transform result is denoted by Y, and A^T Y A belongs to the output transform.
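For illustration, the following sketch runs the whole flow of FIG. 2 for F(2, 3), using the F(2, 3) parameter matrices commonly published in the Winograd convolution literature (an assumption; the matrices in FIG. 3 of this application may differ in sign convention), and checks the result against a direct sliding-window computation:

```python
import numpy as np

# Commonly published Winograd F(2, 3) parameter matrices (an assumption; the
# matrices in FIG. 3 of this application may differ in sign convention).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.random.rand(4, 4)   # input feature tile, size (k+n-1, k+n-1)
g = np.random.rand(3, 3)   # convolution kernel, size (n, n)

U = B_T @ d @ B_T.T        # input transform
V = G @ g @ G.T            # convolution kernel transform
Y = U * V                  # element-wise product
S = A_T @ Y @ A_T.T        # output transform, size (k, k) = (2, 2)

# Direct computation: each output element is a 3x3 window of d multiplied
# element-wise with g (CNN-style convolution, i.e. cross-correlation).
S_ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                  for i in range(2)])
assert np.allclose(S, S_ref)
```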
The hardware accelerator in the embodiments of the present application is used to perform the input transformation of the Winograd convolution corresponding to the specified convolution parameters. The calculation formula of the input transformation of Winograd convolution is:
U = B^T d B
where d is the input feature matrix, a matrix of size (k+n-1, k+n-1); the input transformation parameter matrix B is of size (k+n-1, k+n-1); and the result of the input transform is a matrix U of size (k+n-1, k+n-1).
It will be appreciated that the size of the input transformation parameter matrix B and the values of the internal elements are only related to the value of (k+n-1), and therefore, when the convolution parameters k and n change, the size of the input transformation parameter matrix B and the values of the internal elements change.
By studying how the input transformation parameter matrix changes when the convolution parameters change, the following recurrence relation among the input transformation parameter matrices corresponding to different convolution parameters can be summarized:
According to the basic principle of Winograd convolution, when constructing and using the Winograd convolution F(k, n) (i.e., the Winograd convolution with convolution parameters k and n), k+n-1 polynomials (i.e., the convolution polynomials), denoted n_i(x), are first defined.
for example: when constructing F (2, 3), 4 polynomials are defined: x, x-1, x+1, 1. When constructing F (4, 3), 6 polynomials are defined: x, x-1, x+1, x-2, x+2, 1.
The product of all of these convolution polynomials is then defined.
transposed matrix B of input transformation parameter matrix B T Is denoted as S' i (x),i=0、1、2、…、(k+n-2)。S′ i (x) The values of the elements are polynomialsThe coefficients of the sub-terms are arranged from low to high. Wherein S is 0 (x) Is more special: if the constant term coefficients are complex (the condition is equivalent to the value of (k+n-1)/2 being even), then the entire polynomial needs to be scaled Multiplied by-1. Accordingly, F (2, 3) and F (4, 3) correspond to the transposed matrix B of the input transform parameter matrix B, respectively T As shown in fig. 3.
According to the above principle, when the sum of the convolution parameters increases by 2, i.e., F(k, n) becomes F(k', n') with (k'+n'-1) - (k+n-1) = 2, the set of Winograd convolution polynomials n_i(x) gains 2 new members. For example, F(4, 3) has two more polynomials than F(2, 3), namely x-2 and x+2; these newly added polynomials are the increment polynomials.
Therefore, for the B^T of F(k', n'), the row vectors S'_i(x) at i = 0, 1, 2, …, k'+n'-5 and at i = k'+n'-2 can be obtained from the rows of the B^T of F(k, n) by additionally multiplying each row by the product n_{k'+n'-4}(x) × n_{k'+n'-3}(x) of the increment polynomials, and the row vectors S'_i(x) at i = k'+n'-4 and i = k'+n'-3 can be obtained from the last row vector S'_{k+n-2}(x) of the B^T of F(k, n) by additionally multiplying it by the polynomial n_{k'+n'-3}(x) and by n_{k'+n'-4}(x), respectively. That is, the B^T of F(k', n') is composed of the rows of the B^T of F(k, n) each multiplied by the product of the increment polynomials, together with the last row vector of the B^T of F(k, n) multiplied by each increment polynomial individually. For example, as shown in FIG. 3, the second-to-last and third-to-last rows of the B^T of F(4, 3) are obtained from the last row vector of the B^T of F(2, 3) multiplied by the increment polynomials (x-2) and (x+2), respectively (the second-to-last row is multiplied by (x-2), the third-to-last row by (x+2)), while the remaining rows of the B^T of F(4, 3) (all rows except the second-to-last and third-to-last) are obtained from the rows of the B^T of F(2, 3) each multiplied by the product (x-2) × (x+2) of the increment polynomials.
The following recurrence relation is obtained from this change rule: for a row S'_m(x) of the B^T of F(k', n'), its polynomial can be obtained from the polynomial of a row S'_h(x) of the B^T of F(k, n) additionally multiplied by an increment polynomial (or the product of the increment polynomials) n_l(x). Let v_l be the coefficient vector of n_l(x); then S'_m(x) can be obtained by the convolution operation of S'_h(x) and v_l (here, "convolution operation" means the convolution of two discrete sequences in the conventional mathematical sense, which is different from the convolution in a convolutional neural network (CNN)). Consequently, the vector multiplication of S'_m(x) with a column vector of the input feature matrix d can be converted into first performing a cross-correlation operation between the column vector and v_l, and then performing a vector multiplication of S'_h(x) with the cross-correlation result. Take the row S'_5(x) = [0 4 0 -5 0 1] of the B^T of F(4, 3) as an example: the corresponding row vector of the B^T of F(2, 3) is S'_3(x) = [0 -1 0 1], and the coefficient vector of the increment polynomial product is v_l = [-4 0 1]. The multiplication of S'_5(x) = [0 4 0 -5 0 1] with the column vector [d_0 d_1 d_2 d_3 d_4 d_5]^T of the input feature matrix d can then be converted into: first performing the cross-correlation operation between v_l = [-4 0 1] and [d_0 d_1 d_2 d_3 d_4 d_5]^T, and then multiplying the row vector S'_3(x) = [0 -1 0 1] of the B^T of F(2, 3) with the result of the cross-correlation operation, namely:
[0 4 0 -5 0 1] · [d_0 d_1 d_2 d_3 d_4 d_5]^T = [0 -1 0 1] · [d_2-4d_0, d_3-4d_1, d_4-4d_2, d_5-4d_3]^T
where [d_2-4d_0, d_3-4d_1, d_4-4d_2, d_5-4d_3]^T is the cross-correlation result of v_l = [-4 0 1] and [d_0 d_1 d_2 d_3 d_4 d_5]^T.
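This example can be checked numerically; in the sketch below, np.convolve plays the role of polynomial (coefficient-sequence) multiplication and np.correlate with mode 'valid' plays the role of the cross-correlation operation:

```python
import numpy as np

s3 = np.array([0, -1, 0, 1])        # row S'_3(x) of B^T for F(2, 3)
v_l = np.array([-4, 0, 1])          # coefficient vector of the increment polynomial product x^2 - 4
s5 = np.array([0, 4, 0, -5, 0, 1])  # row S'_5(x) of B^T for F(4, 3)

# Polynomial multiplication is a convolution of coefficient sequences:
assert np.array_equal(np.convolve(s3, v_l), s5)

# Equivalence used for the hardware decomposition:
d = np.random.rand(6)               # a column vector [d0 ... d5]^T of the input feature matrix
corr = np.correlate(d, v_l, mode='valid')   # cross-correlation of d with v_l (length 4)
assert np.isclose(s5 @ d, s3 @ corr)
```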
Extending this recurrence to the computation of the whole B^T with the input feature matrix d: when the B^T of F(k', n') is to be multiplied by a column vector of the input feature matrix d, the column vector can first be cross-correlated with the coefficient vector of the product of the increment polynomials n_{k'+n'-4}(x) and n_{k'+n'-3}(x), and with the coefficient vector corresponding to each individual increment polynomial, to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first cross-correlation result is then multiplied by the B^T of F(k, n), and each second cross-correlation result is multiplied by the last row vector S'_{k+n-2}(x) of the B^T of F(k, n). For example, the equivalence relationship between the Winograd input transform operations of F(2, 3) and F(4, 3) obtained in this way is shown in FIG. 4.
According to the above equivalence relationship, from the point of view of hardware implementation, when constructing the input transform processing unit of F(k', n') in the hardware accelerator, if an input transform processing unit capable of processing the input transform of F(k, n) is already available (i.e., the basic input transform processing unit), the input transform processing unit of F(k', n') can be constructed by reusing the basic input transform processing unit and adding a peripheral unit that performs the cross-correlation operations and multiplies the last row vector of the input transform parameter matrix by each second cross-correlation result.
As shown in FIG. 5, based on the above derivation, an embodiment of the present application provides a convolution input transformation method, described here by taking its application to the hardware accelerator 104 in FIG. 1 as an example. The target input transformation processing unit corresponding to the specified convolution parameters in the hardware accelerator is obtained by configuring and combining a basic input transformation processing unit and a peripheral unit; the method includes the following steps:
S502, for each column vector in the input feature matrix, performing, by the peripheral unit, cross-correlation operations between the column vector and the first coefficient vector and between the column vector and each second coefficient vector, to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first coefficient vector is determined by a plurality of increment polynomials; the second coefficient vectors are the coefficient vectors corresponding to the respective increment polynomials; the increment polynomials are the polynomials that the convolution polynomials corresponding to the specified convolution parameters additionally have compared with the basic convolution polynomials.
The target input transformation processing unit is a hardware circuit unit in the hardware accelerator, and the hardware circuit unit is used for executing convolution input transformation corresponding to the specified convolution parameters. The basic input transformation processing unit is a hardware circuit unit for executing convolution input transformation corresponding to the basic convolution parameters. The sum of the parameter values in the base convolution parameters is less than the sum of the parameter values in the specified convolution parameters. The basic convolution polynomial is a convolution polynomial corresponding to the basic convolution parameter. The convolution polynomial is used to define the input transformation parameter matrix of the Winograd convolution. The peripheral unit is a hardware circuit unit added at the periphery of the basic input conversion processing unit.
The convolution input transform refers to the input transform in Winograd convolution. The convolution parameters include the side length (k) of the output feature matrix of the Winograd convolution and the side length (n) of the convolution kernel. The sum of the side length of the convolution output feature matrix and the side length of the convolution kernel in the basic convolution parameters is smaller than the sum of the side length of the convolution output feature matrix and the side length of the convolution kernel in the specified convolution parameters. The input feature matrix is a feature matrix to be subjected to Winograd convolution input transformation, which is input into the hardware accelerator. For example: the input feature matrix may be a matrix of pixel values corresponding to the feature image after slicing.
In some embodiments, the sum of the parameter values in the basic convolution parameters is smaller than the sum of the parameter values in the specified convolution parameters by 2.
The convolution polynomials are the polynomials n_i(x) defined in the derivation process described above.
For example, if the specified convolution parameters are k=4 and n=3, the basic convolution parameters may be k=2 and n=3 (the sum of k and n being 2 smaller than that of the specified convolution parameters). The convolution polynomials corresponding to F(2, 3) are x, x-1, x+1 and 1, while the convolution polynomials corresponding to F(4, 3) are x, x-1, x+1, x-2, x+2 and 1; compared with the convolution polynomials corresponding to F(2, 3), F(4, 3) additionally has the polynomials x-2 and x+2, i.e., the increment polynomials are x-2 and x+2.
In some embodiments, the first coefficient vector may be the coefficient vector of the product of the increment polynomials. For example, if the increment polynomials are x-2 and x+2, their product is (x-2) × (x+2) = x^2 - 4, so the first coefficient vector is [-4 0 1].
For example: if the increment polynomials are x-2 and x+2, the coefficient vectors corresponding to the increment polynomials are [ -2 1 0] and [2 1 0], respectively, i.e., the second coefficient vectors are [ -2 1 0] and [2 1 0], respectively.
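For the same example, the first and second coefficient vectors can be derived directly from the increment polynomials; a brief sketch (coefficients are kept in low-to-high order and zero-padded as in the text):

```python
import numpy as np

# Increment polynomials x - 2 and x + 2, coefficients from low to high degree.
inc_polys = [np.array([-2, 1]), np.array([2, 1])]

# First coefficient vector: coefficients of the product (x - 2)(x + 2) = x^2 - 4.
first = np.convolve(inc_polys[0], inc_polys[1])
print(first)            # [-4  0  1]

# Second coefficient vectors: one per increment polynomial, zero-padded to length 3.
second = [np.pad(p, (0, 3 - len(p))) for p in inc_polys]
print(second)           # [array([-2, 1, 0]), array([2, 1, 0])]
```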
In some embodiments, the target input transformation processing unit in the hardware accelerator obtains each column vector of the input feature matrix, and the peripheral unit in the target input transformation processing unit performs the cross-correlation operations between the column vector and the first coefficient vector and between the column vector and each second coefficient vector, obtaining a first cross-correlation result and a plurality of second cross-correlation results. For example, for the column vector [d_0 d_1 d_2 d_3 d_4 d_5]^T, the first coefficient vector [-4 0 1] and the second coefficient vectors [-2 1 0] and [2 1 0], as shown in FIG. 4, the cross-correlation result of the column vector and [-4 0 1] is the first cross-correlation result, and the cross-correlation results of the column vector and [-2 1 0] and [2 1 0] are the second cross-correlation results.
S504, obtaining a first input transformation result vector element corresponding to the column vector according to the product of the basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result through the basic input transformation processing unit.
The basic input transformation parameter matrix refers to the transposed matrix B^T of the Winograd convolution input transformation parameter matrix B corresponding to the basic convolution parameters. The first input transformation result vector elements refer to the elements of the first part of the input transformation result vector corresponding to the column vector.
In some embodiments, the basic input transformation processing unit may obtain the first cross-correlation result output by the peripheral unit and multiply (i.e., perform a matrix multiplication of) the basic input transformation parameter matrix corresponding to the basic convolution polynomials with the first cross-correlation result, obtaining the first input transformation result vector elements corresponding to the column vector. For example, the basic input transformation processing unit can implement the multiplication of the basic input transformation parameter matrix with the first cross-correlation result shown in FIG. 4; the elements obtained in this way represent the first input transformation result vector elements corresponding to the column vector.
In some embodiments, the internal structure of the basic input transformation processing unit may itself consist of the input transformation processing unit corresponding to a first convolution parameter smaller than the basic convolution parameters plus a corresponding peripheral unit; the internal structure of the input transformation processing unit corresponding to the first convolution parameter may in turn consist of the input transformation processing unit corresponding to a second convolution parameter smaller than the first convolution parameter plus a corresponding peripheral unit, and so on. That is, the interior of the target input transformation processing unit is a recursive structure, and the recursion endpoint is the transposed matrix B^T of the input transformation parameter matrix corresponding to k+n-1 = 2. In other words, the target input transformation processing unit is obtained by adding the respective peripheral units layer by layer on the basis of the initial input transformation parameter matrix (i.e., the transposed matrix of the input transformation parameter matrix corresponding to k+n-1 = 2).
S506, obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and the second cross-correlation results through the peripheral units; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials.
The second input transformation result vector element refers to an element of a second part in the input transformation result vector corresponding to the column vector.
In some embodiments, the last row vector of the basic input transformation parameter matrix may be taken as the basic coefficient vector. It will be appreciated that since the last row of the basic input transformation parameter matrix corresponds to the product of all the basic convolution polynomials divided by the polynomial 1, the last row vector of the basic input transformation parameter matrix is equal to the coefficient vector of the product of the basic convolution polynomials. For example, [0 -1 0 1] in FIG. 4 is the last row vector of the basic input transformation parameter matrix.
In some embodiments, the peripheral unit may multiply (i.e. matrix multiply) the base coefficient vector with each second cross-correlation result respectively, to obtain the second input transform result vector element corresponding to the column vector.
As shown in FIG. 4, the peripheral unit may multiply [0 -1 0 1] with each of the second cross-correlation results to obtain D_3 and D_4, which represent the second input transformation result vector elements corresponding to the column vector.
S508, determining a convolution input transformation result corresponding to the input feature matrix according to the first input transformation result vector element and the second input transformation result vector element respectively corresponding to each column vector.
In some embodiments, the peripheral unit may sort and combine the first input transformation result vector elements and the second input transformation result vector elements corresponding to the column vector, obtaining the input transformation result vector corresponding to the column vector. For example, in FIG. 4, the first input transformation result vector elements and D_3 and D_4 are sorted and combined to obtain the input transformation result vector corresponding to the column vector.
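Putting S502 to S508 together for a single column vector, the following is a behavioral sketch of the F(4, 3) target input transformation processing unit built from the F(2, 3) basic unit plus the peripheral unit. The matrices and the element ordering follow the recurrence described with FIG. 3, and the check is against the matrix obtained by applying that recurrence to the F(2, 3) rows (the exact arrangement in FIG. 4 and the signs of individual rows may differ):

```python
import numpy as np

# B^T of F(2, 3) as constructed by the rule described with FIG. 3 (an assumption
# about the exact matrix; published variants differ in row signs).
B_T_base = np.array([[1,  0, -1, 0],
                     [0,  1,  1, 0],
                     [0, -1,  1, 0],
                     [0, -1,  0, 1]])
base_coeff = B_T_base[-1]                    # basic coefficient vector [0 -1 0 1]
first_coeff = np.array([-4, 0, 1])           # product of the increment polynomials, x^2 - 4
second_coeffs = [np.array([-2, 1, 0]),       # increment polynomial x - 2
                 np.array([2, 1, 0])]        # increment polynomial x + 2

def front_transform_column(d):
    """Behavioral model of the F(4, 3) target PE applied to one column vector d of length 6."""
    corr1 = np.correlate(d, first_coeff, mode='valid')                 # S502: first cross-correlation result
    corr2 = [np.correlate(d, c, mode='valid') for c in second_coeffs]  # S502: second cross-correlation results
    first_elems = B_T_base @ corr1                                     # S504: basic unit output
    second_elems = [base_coeff @ c for c in corr2]                     # S506: peripheral unit output
    # S508: combine; per the recurrence, the second elements sit just before the
    # last first element, with the (x + 2) element preceding the (x - 2) element.
    return np.concatenate([first_elems[:-1],
                           [second_elems[1], second_elems[0]],
                           first_elems[-1:]])

# Rows of B^T for F(4, 3) derived from the F(2, 3) rows by the recurrence.
B_T_target = np.array([np.convolve(B_T_base[0], first_coeff),
                       np.convolve(B_T_base[1], first_coeff),
                       np.convolve(B_T_base[2], first_coeff),
                       np.convolve(B_T_base[3], second_coeffs[1]),   # last base row times (x + 2)
                       np.convolve(B_T_base[3], second_coeffs[0]),   # last base row times (x - 2)
                       np.convolve(B_T_base[3], first_coeff)])

d = np.random.rand(6)
assert np.allclose(front_transform_column(d), B_T_target @ d)
```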
In some embodiments, the hardware accelerator may determine the input transformation result corresponding to the input feature matrix according to the input transformation result vector corresponding to each column vector.
It will be appreciated that the input transformation result corresponding to the input feature matrix is the result of performing the front portion input transformation on the input feature matrix. The target input transformation processing unit corresponding to the rear part input transformation in the hardware accelerator can perform the rear part input transformation on the front part input transformation result of the input feature matrix to obtain the convolution input transformation result corresponding to the input feature matrix.
Here, the front-part input transform is the multiplication of the transposed matrix B^T of the input transformation parameter matrix B with the column vectors of the input feature matrix, i.e., D = B^T d. The rear-part input transform is the multiplication of the front-part input transform result D with the input transformation parameter matrix B, i.e., U = D B.
It can be seen that, in the embodiments of the present application, the target input transformation processing unit in the hardware accelerator for the convolution input transform of the specified convolution parameters can be obtained simply by configuring and combining the basic input transformation processing unit and the peripheral unit. The peripheral unit obtains a first cross-correlation result and a plurality of second cross-correlation results through cross-correlation operations and obtains the second input transformation result vector elements corresponding to a column vector from the products of the basic coefficient vector and each second cross-correlation result; the basic input transformation processing unit obtains the first input transformation result vector elements corresponding to the column vector from the product of the basic input transformation parameter matrix corresponding to the basic convolution polynomials and the first cross-correlation result. In this way, the first and second input transformation result vector elements corresponding to each column vector in the input feature matrix can be obtained and the convolution input transformation result corresponding to the input feature matrix can be determined. That is, the target input transformation processing unit in the hardware accelerator for the convolution input transform of the specified convolution parameters can be obtained by adding the peripheral unit on the basis of the basic input transformation processing unit, which improves the configurability, flexibility and applicability of the hardware accelerator for convolution input transformation.
In some embodiments, the first input transformation result vector element and the second input transformation result vector element respectively corresponding to each column vector are obtained by a target input transformation processing unit corresponding to the front part input transformation;
determining a convolution input transformation result corresponding to the input feature matrix according to a first input transformation result vector element and a second input transformation result vector element respectively corresponding to each column vector, including:
determining a front part input transformation result corresponding to the input feature matrix according to a first input transformation result vector element and a second input transformation result vector element which are respectively corresponding to each column vector;
each target input conversion processing unit corresponding to the rear part input conversion respectively carries out the rear part input conversion on each row vector in the front part input conversion result to obtain a third input conversion result vector element and a fourth input conversion result vector element respectively corresponding to each row vector;
and determining a convolution input transformation result corresponding to the input feature matrix according to the third input transformation result vector element and the fourth input transformation result vector element respectively corresponding to each row vector.
In some embodiments, the target input transformation processing unit corresponding to the front part input transformation in the hardware accelerator may perform the front part input transformation for each column vector, to obtain a first input transformation result vector element and a second input transformation result vector element corresponding to each column vector, and determine the front part input transformation result vector corresponding to the column vector according to the first input transformation result vector element and the second input transformation result vector element corresponding to the column vector.
In some embodiments, the target input transform processing unit corresponding to the front part input transform may output the front part input transform result vector corresponding to the column vector to an intermediate shift register in the hardware accelerator for storing, and the intermediate shift register may shift the stored front part input transform result vector to gradually store the front part input transform result vectors corresponding to the column vectors in the input feature matrix respectively, so as to obtain the front part input transform result corresponding to the input feature matrix.
In some embodiments, each target input conversion processing unit corresponding to the post-portion input conversion may read, from the intermediate shift register, a row vector in a front-portion input conversion result corresponding to the input feature matrix, each target input conversion processing unit corresponding to the post-portion input conversion performs post-portion input conversion on the read row vector, to obtain a third input conversion result vector element and a fourth input conversion result vector element corresponding to each row vector, and determine a post-portion input conversion result vector corresponding to each row vector according to the third input conversion result vector and the fourth input conversion result vector. The hardware accelerator may determine a convolution input transformation result corresponding to the input feature matrix according to the rear portion input transformation result vector corresponding to each row vector.
It can be appreciated that since D B = (B^T D^T)^T, the rear-part input transform is equivalent to applying, to the transpose of the front-part input transform result, the same processing as the front-part input transform and then taking the transpose of the processing result. That is, the rear-part input transform result vector corresponding to each row vector is obtained by applying to each row vector the same processing as the front-part input transform (which corresponds to operating on the transpose of the front-part input transform result), and each rear-part input transform result vector is taken as a row vector of the convolution input transformation result corresponding to the input feature matrix (which corresponds to taking the transpose of the processing result). Therefore, the processing performed by each target input transformation processing unit corresponding to the rear-part input transform is consistent with the processing performed by the target input transformation processing unit corresponding to the front-part input transform, and the specific processing steps are not described again here. The target input transformation processing units corresponding to the rear-part input transform may have the same structure as the target input transformation processing unit corresponding to the front-part input transform.
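The identity underlying this reuse can be checked directly; a short sketch (the F(2, 3) matrix is used only as an assumed example):

```python
import numpy as np

B_T = np.array([[1,  0, -1, 0],
                [0,  1,  1, 0],
                [0, -1,  1, 0],
                [0, -1,  0, 1]])     # front-part transform matrix (assumed F(2, 3) example)
B = B_T.T
D = np.random.rand(4, 4)             # front-part input transform result D = B^T d

# Rear part: D B equals applying B^T to every row of D (i.e. to the columns of
# D^T) and transposing the result, so the same column-transform circuit is reused.
assert np.allclose(D @ B, (B_T @ D.T).T)
```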
It can be seen that, in this embodiment, the front portion input transformation is performed by the target input transformation processing unit corresponding to the front portion input transformation to obtain the front portion input transformation result corresponding to the input feature matrix, and then the rear portion input transformation is performed by each target input transformation processing unit corresponding to the rear portion input transformation to each row vector in the front portion input transformation result, so that the processing performed by the target input transformation processing units corresponding to the front portion input transformation and the rear portion input transformation is consistent, and the target input transformation processing units corresponding to the front portion input transformation and the rear portion input transformation can be obtained by adding the peripheral unit on the basis of the basic input transformation processing unit, thereby improving the configurability of the hardware accelerator for convolution input transformation.
In some embodiments, the hardware accelerator further comprises an input register, an intermediate shift register, and an output register; the method further comprises the steps of:
acquiring and storing column vectors in an input feature matrix through an input register;
reading the stored column vectors from the input register through a target input conversion processing unit corresponding to the front part input conversion, performing the front part input conversion on the column vectors, and outputting front part input conversion result vectors corresponding to the column vectors to the intermediate shift register;
Shifting the stored front part input transformation result vectors through an intermediate shift register to gradually store the front part input transformation result vectors corresponding to each column vector in the input feature matrix respectively, so as to obtain front part input transformation results corresponding to the input feature matrix;
reading each row vector in the front part input conversion result from the middle shift register through a target input conversion processing unit corresponding to the rear part input conversion, carrying out the rear part input conversion on each row vector in parallel, and outputting a rear part input conversion result vector corresponding to each row vector to an output register;
and storing the input transformation result vectors of each rear part through the output register to obtain the convolution input transformation result corresponding to the input feature matrix.
In some embodiments, as shown in FIG. 6, a controller may also be included in the hardware accelerator. After receiving a start control signal transmitted by an external device, the controller can control the input register to write a column vector in the input feature matrix and provide the column vector to a target input conversion processing unit (PE) corresponding to the input conversion of the front part.
In some embodiments, the controller may send an enable signal to a target input transform processing unit (PE) corresponding to the front part input transform to cause the target input transform processing unit corresponding to the front part input transform to read the stored column vector from the input register, and perform the front part input transform on the column vector, and output a front part input transform result vector corresponding to the column vector to the intermediate shift register.
In some embodiments, a front-part input transform result vector output by a target input transform processing unit (PE) corresponding to the front-part input transform is input into an intermediate shift register. The intermediate shift register shifts the stored front part input transformation result vector by taking columns as units so as to gradually store the front part input transformation result vector corresponding to each column vector in the input feature matrix, and when the intermediate shift register is fully loaded, the front part input transformation result corresponding to the input feature matrix can be obtained. For example: the front input conversion result vector output from the target input conversion processing unit (PE) corresponding to the front input conversion may be stored to the rearmost side of the intermediate shift register in the direction indicated by the arrow in fig. 6, and the intermediate shift register may be shifted from right to left in units of columns.
In some embodiments, when the intermediate shift register is full, the controller may send an enable signal to each of the target input transform processing units (PEs) corresponding to the rear portion input transform connected to the intermediate shift register, so that each of the target input transform processing units (PEs) corresponding to the rear portion input transform performs the rear portion input transform on the front portion input transform result.
In some embodiments, as shown in fig. 6, each target input transformation processing unit (PE) corresponding to the post-partial input transformation may respectively read each row vector in the pre-partial input transformation result from the intermediate shift register according to the direction indicated by the arrow, perform the post-partial input transformation on each row vector in parallel, output the post-partial input transformation result vector corresponding to each row vector in parallel to the output register, and finally store the convolution input transformation result corresponding to the input feature matrix in the output register.
In some embodiments, after each target input transform processing unit (PE) corresponding to the post-partial input transform completes the post-partial input transform and outputs a post-partial input transform result vector to the output register, each target input transform processing unit (PE) corresponding to the post-partial input transform returns a calculation completion signal to the controller, and after the controller receives the calculation completion signal of each target input transform processing unit (PE) corresponding to the post-partial input transform, the controller may send the input transform completion signal to the external device to prompt the external device to read the convolution input transform result from the output register.
It can be seen that, in this embodiment, the convolution input transformation can be completed efficiently and accurately by storing the column vectors of the input feature matrix in the input register, storing the front-part input transform result vectors corresponding to the column vectors in the intermediate shift register and shifting them so that the front-part input transform result vectors of all column vectors are accumulated, and storing the convolution input transformation result in the output register. In addition, the target input transformation processing units corresponding to the rear-part input transform perform the rear-part input transform on the row vectors in parallel, which improves efficiency.
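A purely behavioral sketch of this dataflow follows (Python lists and arrays stand in for the registers, and the function and variable names are illustrative only; the actual design is a hardware circuit):

```python
import numpy as np

def simulate_accelerator(d, front_pe, rear_pes):
    """Behavioral model: one front-part PE streams columns, rear-part PEs work on rows."""
    w = d.shape[0]
    intermediate = []                      # stands in for the intermediate shift register
    for j in range(w):                     # input register supplies one column per step
        col = d[:, j]
        intermediate.append(front_pe(col))   # front-part PE writes its result vector
    D = np.stack(intermediate, axis=1)     # shift register now holds D = B^T d
    # Each rear-part PE reads one row of D; in hardware they run in parallel.
    rows = [pe(D[i, :]) for i, pe in enumerate(rear_pes)]
    return np.stack(rows, axis=0)          # output register: U = D B

B_T = np.array([[1,  0, -1, 0],
                [0,  1,  1, 0],
                [0, -1,  1, 0],
                [0, -1,  0, 1]])
pe = lambda v: B_T @ v                     # same column transform for front and rear parts
d = np.random.rand(4, 4)
U = simulate_accelerator(d, pe, [pe] * 4)
assert np.allclose(U, B_T @ d @ B_T.T)     # matches U = B^T d B
```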
In some embodiments, the peripheral unit includes a first cross-correlation unit and a plurality of second cross-correlation units; performing, through the peripheral unit, the cross-correlation operation on the column vector with the first coefficient vector and with each second coefficient vector to obtain a first cross-correlation result and a plurality of second cross-correlation results includes the following steps:
through a first cross-correlation unit, a shift operation and an addition and subtraction operation in a circuit are used for realizing cross-correlation operation between a column vector and a first coefficient vector, and a first cross-correlation result is obtained;
and through each second cross-correlation unit, performing cross-correlation operation between the column vector and each second coefficient vector by using shift operation and addition and subtraction operation in the circuit, so as to obtain a plurality of second cross-correlation results.
The first cross-correlation unit is a hardware circuit unit for realizing cross-correlation operation between the column vector and the first coefficient vector. The second cross-correlation unit is a hardware circuit unit for realizing cross-correlation operation between the column vector and the second coefficient vector.
In some embodiments, as shown in fig. 7, a schematic diagram of the structure of the target input transformation processing unit is provided. It will be appreciated that the letters in the figure represent data at the inputs or outputs of the various hardware circuit units and do not represent hardware structures. In the figure, 712 is the basic input transformation processing unit, and 702, 704, 706, 708, 710 and 714 are the peripheral units, where 702 is the first cross-correlation unit and 704 and 706 are the second cross-correlation units. The column vector is input into the target input transformation processing unit; the first cross-correlation unit in the target input transformation processing unit performs the cross-correlation operation between the column vector and the first coefficient vector to obtain the first cross-correlation result, and each second cross-correlation unit in the target input transformation processing unit performs the cross-correlation operation between the column vector and the corresponding second coefficient vector to obtain the plurality of second cross-correlation results (the intermediate result vectors labeled in fig. 7).
In some embodiments, because the constant terms of the increment polynomials are all integer powers of 2 when the polynomials are constructed (for example, the constant term of the polynomial (x² − 4) has magnitude 4, which is a power of 2), the cross-correlation operation can be converted into shift operations and addition-subtraction operations in the circuit. The first cross-correlation unit and each second cross-correlation unit can realize the multiplications in the cross-correlation operation through shift operations in the circuit, and can realize the additions and subtractions in the cross-correlation operation through addition and subtraction operations in the circuit.
For example: when the cross-correlation operation is performed between [−4 0 1] and [d0 d1 d2 d3 d4 d5]^T, the result is [d2 − 4·d0, d3 − 4·d1, d4 − 4·d2, d5 − 4·d3]. When the calculation is carried out in binary form in the circuit, since 4 is represented as binary 100, the multiplication by 4 can be converted into a shift of 2 bits toward the high bits with zero padding of the low bits, so the cross-correlation operation in this example is converted into shift operations and subtraction operations.
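A small Python sketch of this conversion follows; the helper name is hypothetical, and the only assumption is that every nonzero coefficient is ±1 or ± a power of two, as in the [−4 0 1] example above, so that each multiplication becomes a left shift.

```python
def cross_correlate_shift_add(coeffs, d):
    """Valid-mode cross-correlation where every nonzero coefficient is +/- 2^k.

    coeffs : integer taps, each 0 or +/- a power of two
    d      : integer input samples, len(d) >= len(coeffs)
    Each multiplication is replaced by a left shift; signs become additions or subtractions.
    """
    taps = len(coeffs)
    out = []
    for i in range(len(d) - taps + 1):
        acc = 0
        for j, c in enumerate(coeffs):
            if c == 0:
                continue
            shift = abs(c).bit_length() - 1          # c is +/- 2^shift
            term = d[i + j] << shift                 # multiply by 2^shift via a shift
            acc = acc + term if c > 0 else acc - term
        out.append(acc)
    return out

# Cross-correlating [-4, 0, 1] with [d0 ... d5], as in the example above:
d = [3, 1, 4, 1, 5, 9]
print(cross_correlate_shift_add([-4, 0, 1], d))
# [d2 - 4*d0, d3 - 4*d1, d4 - 4*d2, d5 - 4*d3] = [-8, -3, -11, 5]
```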
In some embodiments, since the basic input transformation processing unit can be obtained by adding peripheral units layer by layer to an initial input transformation processing unit, the multiplication of the basic input transformation parameter matrix with the first cross-correlation result inside the basic input transformation processing unit can also be converted, step by step and recursively, into shift operations and addition-subtraction operations. For example: the foregoing product [0 0 −5 0 1]×[d0 d1 d2 d3 d4 d5]^T can be rewritten as a cross-correlation, which can then be further converted into [0 1] multiplied by the result of the cross-correlation operation between [−1 0 1] and the first cross-correlation result.
Therefore, the entire process of multiplying the basic input transformation parameter matrix by the first cross-correlation result in the basic input transformation processing unit can likewise be converted into shift operations and addition-subtraction operations.
It can be seen that, in this embodiment, the first cross-correlation unit uses shift operations and addition-subtraction operations in the circuit to realize the cross-correlation between the column vector and the first coefficient vector, yielding the first cross-correlation result, and each second cross-correlation unit does the same for the column vector and each second coefficient vector, yielding the plurality of second cross-correlation results. This is equivalent to converting all multiplication operations into shift and addition-subtraction operations, so no multiplier is needed and hardware resource consumption is reduced.
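The recursive decomposition discussed above can be checked numerically. The sketch below assumes, for illustration, the two factor vectors [−4 0 1] and [−1 0 1]; it verifies that cross-correlating with the coefficient vector of their product polynomial (x² − 4)(x² − 1) = x⁴ − 5x² + 4 gives the same result as chaining the two short cross-correlations, which is what allows each stage to be realized with shifts and additions/subtractions.

```python
import numpy as np

def xcorr(taps, d):
    """Valid-mode cross-correlation: out[i] = sum_j taps[j] * d[i + j]."""
    taps, d = np.asarray(taps), np.asarray(d)
    n = len(d) - len(taps) + 1
    return np.array([int(np.dot(taps, d[i:i + len(taps)])) for i in range(n)])

d = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# Coefficients of the product polynomial (x^2 - 4)(x^2 - 1) = x^4 - 5x^2 + 4,
# written lowest degree first: [4, 0, -5, 0, 1].
direct = xcorr([4, 0, -5, 0, 1], d)

# The same values, obtained by chaining the two short cross-correlations with the factor vectors.
nested = xcorr([-1, 0, 1], xcorr([-4, 0, 1], d))

assert np.array_equal(direct, nested)
print(direct)  # [-3, 8, -7, -35] for the sample data above
```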
In some embodiments, the peripheral unit includes a plurality of vector multiplication units; obtaining, by the peripheral unit, second input transform result vector elements corresponding to the column vectors according to products of the base coefficient vectors and the second cross-correlation results, respectively, including:
And multiplying the basic coefficient vector with each second cross-correlation result by each vector multiplication unit respectively by using shift operation and addition and subtraction operation in the circuit to obtain second input transformation result vector elements corresponding to the column vectors.
The vector multiplication unit is a unit for realizing multiplication operation between vectors.
In some embodiments, 708 and 710 in FIG. 7 are vector multiplication units. The second cross-correlation results output by the respective second cross-correlation units are input to the respective vector multiplication units, and each vector multiplication unit multiplies the basic coefficient vector with the corresponding second cross-correlation result to obtain a second input transformation result vector element corresponding to the column vector (i.e., D3 and D4 in FIG. 7).
In some embodiments, the hardware structure of each vector multiplication unit may be identical, as the calculations performed by each vector multiplication unit are identical, differing only in the input data.
In some embodiments, the peripheral unit may further include an output reordering unit (i.e. 714 in fig. 7), where the output reordering unit may perform a sequence combination on the first input transformation result vector element and each second input transformation result vector element to obtain an input transformation result vector corresponding to the column vector.
It can be seen that, in this embodiment, each vector multiplication unit multiplies the basic coefficient vector with the corresponding second cross-correlation result using shift operations and addition-subtraction operations in the circuit to obtain the second input transformation result vector elements corresponding to the column vector. This is equivalent to converting all multiplication operations into shift and addition-subtraction operations, so no multiplier is needed and hardware resource consumption is low.
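As an illustration of what a single vector multiplication unit computes, the sketch below forms a dot product in which every coefficient is assumed to be 0 or ± a power of two, so each partial product is a shift; the function name and the example numbers are hypothetical. As noted above, the units share the same structure and differ only in their input data.

```python
def shift_add_dot(coeffs, vec):
    """Dot product in which every nonzero coefficient is +/- 2^k, so each product is a shift.

    coeffs : base-coefficient-vector entries (assumed 0 or +/- a power of two)
    vec    : a second cross-correlation result of the same length
    """
    acc = 0
    for c, x in zip(coeffs, vec):
        if c == 0:
            continue
        shift = abs(c).bit_length() - 1
        acc += (x << shift) if c > 0 else -(x << shift)
    return acc

# Two units with identical structure, differing only in their input data:
print(shift_add_dot([1, 2, 1], [5, -3, 7]))    # 5 + 2*(-3) + 7 = 6
print(shift_add_dot([1, 2, 1], [2, 4, -1]))    # 2 + 2*4 + (-1) = 9
```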
Based on the same inventive concept, an embodiment of the present application further provides a hardware accelerator for convolution input transformation. The implementation of the solution provided by the hardware accelerator is similar to the implementation described in the convolution input transformation method above, so for the specific limitations in the one or more embodiments of the hardware accelerator for convolution input transformation provided below, reference may be made to the limitations of the convolution input transformation method above; they are not repeated here.
As shown in fig. 8, an embodiment of the present application provides a hardware accelerator 800 for convolution input transformation, the hardware accelerator 800 including a plurality of target input transformation processing units 810 corresponding to specified convolution parameters; the target input transformation processing unit 810 is obtained by configuring and combining a basic input transformation processing unit 812 and a peripheral unit 814;
a peripheral unit 814, configured to perform the cross-correlation operation on the input column vector with the first coefficient vector and with each second coefficient vector respectively, to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first coefficient vector is determined by a plurality of increment polynomials; the second coefficient vectors are coefficient vectors corresponding to the increment polynomials respectively; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the specified convolution parameter has compared with the basic convolution polynomial;
a basic input transformation processing unit 812, configured to obtain a first input transformation result vector element corresponding to the column vector according to a product of the basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross-correlation result;
the peripheral unit 814 is further configured to obtain second input transformation result vector elements corresponding to the column vectors according to products of the base coefficient vectors and the second cross-correlation results respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials; and obtaining an input transformation result vector corresponding to the column vector according to the first input transformation result vector element and the second input transformation result vector element corresponding to the column vector.
In some embodiments, as shown in fig. 9, the target input transform processing unit 810 includes a target input transform processing unit 8101 corresponding to a front part input transform and a plurality of target input transform processing units 8102 corresponding to a rear part input transform;
a target input conversion processing unit 8101 corresponding to the front part input conversion, configured to perform front part input conversion on the column vectors in the input feature matrix, and output corresponding front part input conversion result vectors;
and a plurality of target input transformation processing units 8102 corresponding to the post-partial input transformation, configured to perform the post-partial input transformation on the front-partial input transformation result vectors corresponding to the column vectors in the input feature matrix in parallel, and output the convolution input transformation results corresponding to the column vectors in the input feature matrix.
In some embodiments, as shown in fig. 9, the hardware accelerator 800 further comprises: an input register 802, an intermediate shift register 804, and an output register 806;
an input register 802 for acquiring and storing column vectors in the input feature matrix;
the target input conversion processing unit 8101 corresponding to the front part input conversion is further configured to read the stored column vector from the input register, perform the front part input conversion on the column vector, and output the front part input conversion result vector corresponding to the column vector to the intermediate shift register;
The intermediate shift register 804 is configured to shift the stored front-part input transformation result vectors, so as to gradually store front-part input transformation result vectors corresponding to each column vector in the input feature matrix, and obtain front-part input transformation results corresponding to the input feature matrix;
the target input conversion processing unit 8102 corresponding to each rear part input conversion is further configured to read each row vector in the front part input conversion result from the intermediate shift register, perform the rear part input conversion on each row vector in parallel, and output the rear part input conversion result vector corresponding to each row vector to the output register;
and an output register 806, configured to store the input transformation result vectors of each rear portion, and obtain a convolution input transformation result corresponding to the input feature matrix.
Wherein the input register comprises (k+n-1) register cells, can store one column (i.e., one column vector) of the input feature matrix, and supports both whole-column reading and whole-column writing. The intermediate shift register comprises (k+n-1)×(k+n-1) register cells, supports simultaneous reading of all register cells, and each of its rows is connected to a target input transformation processing unit corresponding to the rear-part input transformation. The output register comprises (k+n-1)×(k+n-1) register cells and supports simultaneous reading and simultaneous writing of all register cells.
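A small helper makes the register sizing above concrete; it assumes that k is the kernel size and n the number of outputs produced per pass (the usual F(n, k) convention for such transforms, which is an assumption here, not stated in the text).

```python
def register_sizes(k, n):
    """Register cell counts implied by the description above.

    Returns (input_register, intermediate_shift_register, output_register) cell counts.
    """
    side = k + n - 1
    return side, side * side, side * side

# Example: a 3-tap kernel producing 4 outputs per pass needs a 6-cell input register
# and 36-cell intermediate and output registers.
print(register_sizes(3, 4))   # (6, 36, 36)
```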
As shown in fig. 10, an embodiment of the present application provides a structure determining method of a hardware accelerator for convolution input transformation, the hardware accelerator including a plurality of target input transformation processing units; the structure determination step of the target input transform processing unit includes:
s1002, acquiring input convolution parameters.
In some embodiments, a user may input specified convolution parameters and data bit widths through a computer device. The computer device may obtain the input convolution parameters and the data Bit Width (BW). The data bit width is the width of data processed by the hardware unit.
In some embodiments, the computer device may transmit the input convolution parameters to the reserved parameter interface of the RTL (Register Transfer Level) project of the convolution input transformation hardware accelerator, and perform step S1004 and the subsequent steps through the RTL project so as to determine the structure of the hardware accelerator corresponding to the input convolution parameters and perform placement and routing.
S1004, determining a first coefficient vector according to a plurality of increment polynomials, and determining each second coefficient vector according to each increment polynomial; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the input convolution parameter has compared with the basic convolution polynomial.
S1006, determining the structure of the peripheral unit according to the first coefficient vector, the second coefficient vector and a basic input transformation parameter matrix corresponding to the basic convolution polynomial; the peripheral unit is used for carrying out cross-correlation operation on column vectors in the input feature matrix and the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results, and obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and each second cross-correlation result respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials.
S1008, acquiring a structure of a basic input transformation processing unit according to the input convolution parameters; the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of the basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross-correlation result.
S1010, determining the structure of the target input transformation processing unit according to the structures of the basic input transformation processing unit and the peripheral unit.
Therefore, in the embodiment of the application, the structure of the hardware accelerator can be rapidly determined according to the input convolution parameters, the deployment efficiency of the hardware accelerator is improved, and the obtained hardware accelerator has higher configurability.
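The structure determination flow S1002 to S1010 can be summarized as the following Python sketch. All of the helper callables in `hooks` are assumed placeholders for the polynomial derivation and circuit-generation details that the RTL project would implement; none of them are APIs defined by this application.

```python
def determine_pe_structure(conv_params, bit_width, hooks):
    """Sketch of steps S1002-S1010 for one target input transformation processing unit."""
    # S1002: the input convolution parameters (and data bit width) have been received.
    # S1004: derive the increment polynomials, the first coefficient vector and the
    #        second coefficient vectors (one per increment polynomial).
    deltas = hooks["increment_polynomials"](conv_params)
    first_vec = hooks["first_coefficient_vector"](deltas)
    second_vecs = [hooks["second_coefficient_vector"](p) for p in deltas]

    # S1006: determine the peripheral-unit structure from the coefficient vectors and
    #        the basic input transformation parameter matrix.
    base_matrix = hooks["basic_input_transform_matrix"]()
    peripheral = hooks["build_peripheral_unit"](first_vec, second_vecs, base_matrix, bit_width)

    # S1008: obtain the basic input transformation processing unit structure for these parameters.
    base_unit = hooks["build_basic_unit"](conv_params, bit_width)

    # S1010: the target processing unit combines the basic unit with the peripheral unit.
    return {"basic_unit": base_unit, "peripheral_unit": peripheral}
```

The returned structure would then be handed to synthesis and to placement and routing, as in the overall flow described next.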
FIG. 11 is a schematic overall flowchart of the method in various embodiments of the present application. The input specified convolution parameters and data bit width are obtained; synthesis, placement and routing are performed through the RTL project according to the specified convolution parameters and data bit width to obtain a hardware accelerator for convolution input transformation that conforms to the specified convolution parameters; input data is then provided to the hardware accelerator and a start signal is sent; and the hardware accelerator performs the convolution input transformation and outputs the convolution input transformation result.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order and may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of execution of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or with sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the application further provides a structure determining apparatus of a hardware accelerator for convolution input transformation. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so for the specific limitations in the one or more embodiments of the structure determining apparatus of the hardware accelerator for convolution input transformation provided below, reference may be made to the limitations of the structure determining method of the hardware accelerator for convolution input transformation above; they are not repeated here.
As shown in fig. 12, an embodiment of the present application provides a structure determining apparatus 1200 of a hardware accelerator for convolution input transformation, including:
a first obtaining module 1202, configured to obtain an input convolution parameter;
a first determining module 1204, configured to determine a first coefficient vector according to a plurality of increment polynomials, and determine each second coefficient vector according to each increment polynomial, respectively; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the input convolution parameter has compared with the basic convolution polynomial;
a second determining module 1206, configured to determine a structure of the peripheral unit according to the first coefficient vector, the second coefficient vector, and a basic input transformation parameter matrix corresponding to the basic convolution polynomial; the peripheral unit is used for carrying out cross-correlation operation on column vectors in the input feature matrix and the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results, and obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and each second cross-correlation result respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
A second obtaining module 1208, configured to obtain a structure of the basic input transform processing unit according to the input convolution parameter; the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result;
the third determining module 1210 is configured to determine a structure of the target input transform processing unit according to structures of the basic input transform processing unit and the peripheral unit.
The respective modules in the above-described structure determining means of the hardware accelerator of the convolution input transformation may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In some embodiments, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements the steps in the structure determining method of the hardware accelerator for convolution input transformation described above. The display unit of the computer device is used for forming a visual picture and can be a display screen, a projection device or a virtual reality imaging device; the display screen can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen, can also be keys, a trackball or a touch pad arranged on the housing of the computer device, or can be an external keyboard, touch pad or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 13 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In some embodiments, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the computer program is executed.
In some embodiments, an internal structural diagram of a computer-readable storage medium is provided as shown in fig. 14, the computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method embodiments described above.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program, which may be stored on a non-transitory computer-readable storage medium and which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like, but are not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application; they are described in detail but are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the application, and all of these fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (12)

1. A convolution input conversion method, characterized in that the method is executed by a hardware accelerator in which a target input conversion processing unit corresponding to a specified convolution parameter is obtained by configuring and combining a basic input conversion processing unit and a peripheral unit, the method comprising:
for each column vector in the input feature matrix, performing, through the peripheral unit, the cross-correlation operation on the column vector with the first coefficient vector and with each second coefficient vector to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first coefficient vector is determined by a plurality of increment polynomials; each second coefficient vector is a coefficient vector corresponding to each increment polynomial; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the specified convolution parameter has compared with the basic convolution polynomial;
Obtaining a first input transformation result vector element corresponding to the column vector according to the product of the basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result by the basic input transformation processing unit;
obtaining second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and the second cross-correlation results through the peripheral units; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
and determining a convolution input transformation result corresponding to the input feature matrix according to the first input transformation result vector element and the second input transformation result vector element respectively corresponding to each column vector.
2. The method according to claim 1, wherein the first input transformation result vector element and the second input transformation result vector element, respectively corresponding to the respective column vectors, are obtained by a target input transformation processing unit corresponding to a front part input transformation;
the determining the convolution input transformation result corresponding to the input feature matrix according to the first input transformation result vector element and the second input transformation result vector element respectively corresponding to each column vector comprises:
Determining a front part input transformation result corresponding to the input feature matrix according to a first input transformation result vector element and a second input transformation result vector element respectively corresponding to each column vector;
performing the post-partial input transformation on each row vector in the front-partial input transformation result through each target input transformation processing unit corresponding to the post-partial input transformation to obtain a third input transformation result vector element and a fourth input transformation result vector element corresponding to each row vector;
and determining a convolution input transformation result corresponding to the input feature matrix according to a third input transformation result vector element and a fourth input transformation result vector element respectively corresponding to each row vector.
3. The method of claim 2, wherein the hardware accelerator further comprises an input register, an intermediate shift register, and an output register; the method further comprises the steps of:
acquiring and storing column vectors in the input feature matrix through the input register;
reading the stored column vector from the input register through a target input conversion processing unit corresponding to the front part input conversion, performing front part input conversion on the column vector, and outputting a front part input conversion result vector corresponding to the column vector to the intermediate shift register;
Shifting the stored front part input transformation result vector through the intermediate shift register to gradually store front part input transformation result vectors corresponding to each column vector in the input feature matrix, so as to obtain a front part input transformation result corresponding to the input feature matrix;
reading each row vector in the front part input conversion result from the intermediate shift register through a target input conversion processing unit corresponding to the rear part input conversion, carrying out rear part input conversion on each row vector in parallel, and outputting a rear part input conversion result vector corresponding to each row vector to the output register;
and storing each input transformation result vector of the rear part through the output register to obtain a convolution input transformation result corresponding to the input feature matrix.
4. A method according to any one of claims 1 to 3, wherein the peripheral units comprise a first cross-correlation unit and a plurality of second cross-correlation units; the cross-correlation operation is performed on the column vector and the first coefficient vector and each second coefficient vector through the peripheral unit, so as to obtain a first cross-correlation result and a plurality of second cross-correlation results, including:
Through the first cross-correlation unit, a shift operation and an addition and subtraction operation in a circuit are used for realizing the cross-correlation operation between the column vector and a first coefficient vector, so as to obtain a first cross-correlation result;
and respectively realizing the cross-correlation operation between the column vector and each second coefficient vector by each second cross-correlation unit through shift operation and addition and subtraction operation in a circuit, and obtaining a plurality of second cross-correlation results.
5. A method according to any one of claims 1 to 3, wherein the peripheral unit comprises a plurality of vector multiplication units; the obtaining, by the peripheral unit, second input transform result vector elements corresponding to the column vectors according to products of the base coefficient vectors and the second cross-correlation results, includes:
and multiplying the basic coefficient vector with each second cross-correlation result by each vector multiplication unit respectively by using shift operation and addition and subtraction operation in a circuit to obtain a second input transformation result vector element corresponding to the column vector.
6. A hardware accelerator for convolution input transformation, the hardware accelerator comprising a plurality of target input transformation processing units corresponding to specified convolution parameters; the target input transformation processing unit is obtained by configuring and combining a basic input transformation processing unit and a peripheral unit;
the peripheral unit is used for performing the cross-correlation operation on the input column vector with the first coefficient vector and with each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results; the first coefficient vector is determined by a plurality of increment polynomials; each second coefficient vector is a coefficient vector corresponding to each increment polynomial; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the specified convolution parameter has compared with the basic convolution polynomial;
the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result;
the peripheral unit is further configured to obtain second input transformation result vector elements corresponding to the column vectors according to products of the basic coefficient vectors and the second cross-correlation results respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials; and obtaining the input transformation result vector corresponding to the column vector according to the first input transformation result vector element and the second input transformation result vector element corresponding to the column vector.
7. The hardware accelerator of claim 6 wherein the target input transform processing unit comprises a target input transform processing unit corresponding to a front portion input transform and a plurality of target input transform processing units corresponding to a rear portion input transform;
the target input transformation processing unit corresponding to the front part input transformation is used for performing front part input transformation on column vectors in the input feature matrix and outputting corresponding front part input transformation result vectors;
and the target input transformation processing units are used for carrying out the rear part input transformation on the front part input transformation result vectors corresponding to the column vectors in the input feature matrix in parallel and outputting convolution input transformation results corresponding to the column vectors in the input feature matrix.
8. The hardware accelerator of claim 7, wherein the hardware accelerator further comprises: an input register, an intermediate shift register, and an output register;
the input register is used for acquiring and storing column vectors in the input feature matrix;
the target input conversion processing unit corresponding to the front part input conversion is further configured to read the stored column vector from the input register, perform front part input conversion on the column vector, and output a front part input conversion result vector corresponding to the column vector to the intermediate shift register;
The intermediate shift register is configured to shift the stored front input transformation result vectors, so as to gradually store front input transformation result vectors corresponding to each column vector in the input feature matrix, and obtain front input transformation results corresponding to the input feature matrix;
the target input conversion processing unit corresponding to each rear part input conversion is further configured to read each row vector in the front part input conversion result from the intermediate shift register, perform rear part input conversion on each row vector in parallel, and output a rear part input conversion result vector corresponding to each row vector to the output register;
and the output register is used for storing each input transformation result vector of the rear part to obtain a convolution input transformation result corresponding to the input feature matrix.
9. A structure determining method of a hardware accelerator for convolution input transformation, wherein:
the hardware accelerator comprises a plurality of target input transformation processing units;
the structure determining step of the target input transformation processing unit comprises:
acquiring input convolution parameters;
determining a first coefficient vector according to a plurality of increment polynomials, and determining each second coefficient vector according to each increment polynomial respectively; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the input convolution parameter has compared with the basic convolution polynomial;
determining the structure of a peripheral unit according to the first coefficient vector, the second coefficient vector and a basic input transformation parameter matrix corresponding to the basic convolution polynomial; the peripheral unit is used for carrying out cross-correlation operation on column vectors in an input feature matrix and the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results, and obtaining second input transformation result vector elements corresponding to the column vectors according to products of basic coefficient vectors and each second cross-correlation result respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
acquiring the structure of a basic input transformation processing unit according to the input convolution parameters; the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result;
And determining the structure of the target input transformation processing unit according to the structures of the basic input transformation processing unit and the peripheral unit.
10. A structure determining apparatus of a hardware accelerator for convolution input transformation, comprising:
the first acquisition module is used for acquiring input convolution parameters;
the first determining module is used for determining a first coefficient vector according to a plurality of increment polynomials and determining each second coefficient vector according to each increment polynomial respectively; the increment polynomial is an additional polynomial that the convolution polynomial corresponding to the input convolution parameter has compared with the basic convolution polynomial;
the second determining module is used for determining the structure of the peripheral unit according to the first coefficient vector, the second coefficient vector and a basic input transformation parameter matrix corresponding to the basic convolution polynomial; the peripheral unit is used for carrying out cross-correlation operation on column vectors in an input feature matrix and the first coefficient vector and each second coefficient vector respectively to obtain a first cross-correlation result and a plurality of second cross-correlation results, and obtaining second input transformation result vector elements corresponding to the column vectors according to products of basic coefficient vectors and each second cross-correlation result respectively; the base coefficient vector is a coefficient vector of the product of the respective base convolution polynomials;
The second acquisition module is used for acquiring the structure of the basic input transformation processing unit according to the input convolution parameters; the basic input transformation processing unit is used for obtaining a first input transformation result vector element corresponding to the column vector according to the product of a basic input transformation parameter matrix corresponding to the basic convolution polynomial and the first cross correlation result;
and the third determining module is used for determining the structure of the target input transformation processing unit according to the structures of the basic input transformation processing unit and the peripheral unit.
11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of claim 9 when executing the computer program.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 9.
CN202311101953.2A 2023-08-29 2023-08-29 Convolution input conversion method, hardware accelerator and accelerator structure determination method Pending CN117077734A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311101953.2A CN117077734A (en) 2023-08-29 2023-08-29 Convolution input conversion method, hardware accelerator and accelerator structure determination method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311101953.2A CN117077734A (en) 2023-08-29 2023-08-29 Convolution input conversion method, hardware accelerator and accelerator structure determination method

Publications (1)

Publication Number Publication Date
CN117077734A true CN117077734A (en) 2023-11-17

Family

ID=88705893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311101953.2A Pending CN117077734A (en) 2023-08-29 2023-08-29 Convolution input conversion method, hardware accelerator and accelerator structure determination method

Country Status (1)

Country Link
CN (1) CN117077734A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination