CN117349585A - Operator performance optimization method based on accelerator constraint - Google Patents

Operator performance optimization method based on accelerator constraint

Info

Publication number
CN117349585A
Authority
CN
China
Prior art keywords
matrix
constraint
value
array
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311644953.7A
Other languages
Chinese (zh)
Other versions
CN117349585B (en)
Inventor
钟阳宇
杜凯
刘忠新
温研
邓强
李解
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Linzhuo Information Technology Co Ltd
Original Assignee
Beijing Linzhuo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Linzhuo Information Technology Co Ltd
Priority to CN202311644953.7A
Publication of CN117349585A
Application granted
Publication of CN117349585B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/535 Dividing only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an operator performance optimization method based on accelerator constraints. The method determines a first constraint based on an acquired first parameter of the accelerator and establishes, according to the first constraint, a test matrix group set and the corresponding test matrix group combination set; it completes the matrix multiplication of all test sub-matrix groups on the target deep learning accelerator and records the operation times, takes the matrix group combination with the minimum combination operation time as the selected matrix group combination of each test matrix group, and uses it as a second constraint; it decomposes the input arrays of the operation to be executed to determine a target combination and, from that combination, a limited exploration space; finally, with the second constraint as the constraint, it performs optimization in the limited exploration space using a constraint-based simulated annealing algorithm to obtain an optimal decomposition combination, thereby effectively improving operator computation performance on the same accelerator.

Description

Operator performance optimization method based on accelerator constraint
Technical Field
The invention belongs to the technical field of deep learning accelerators, and particularly relates to an operator performance optimization method based on accelerator constraint.
Background
With the increasingly wide application of deep learning, the demand for high-performance computing keeps growing. Deep learning is accelerated mainly by improving the computational efficiency and energy efficiency of deep learning operators, and achieving optimal performance and energy efficiency by matching the computational characteristics of an algorithm to the characteristics of the hardware architecture remains challenging. Existing performance optimization approaches mainly consist of designing software libraries for CPUs and GPUs, optimizing the memory subsystem to implement matrix multiplication, and decomposing arrays to improve operator performance. These approaches do not focus on the characteristics of the accelerator itself, so they cannot effectively solve the matching problem between accelerator and operator, and their optimization effect on operator performance is poor.
Disclosure of Invention
In view of this, the invention provides an operator performance optimization method based on accelerator constraints, which realizes constraint-based array decomposition to obtain an optimal array decomposition mode and thus an optimal matrix combination.
The invention provides an operator performance optimization method based on accelerator constraint, which comprises the following steps:
step 1, obtaining a first parameter of a target deep learning accelerator, wherein the first parameter is the size of the minimum calculation unit of matrix multiplication supported by the target deep learning accelerator, recorded as (M, N, K); taking the first parameter together with the product of M, N and K as a first constraint; constructing a test matrix group set composed of test matrix groups; decomposing each test matrix group under the first constraint into test sub-matrix groups g(m, n, k) to form a plurality of matrix group combinations G(M, N, K), all matrix group combinations forming a test matrix group combination set corresponding to the test matrix group set; completing, with the target deep learning accelerator, the matrix multiplication of all test sub-matrix groups in each matrix group combination of each test matrix group and recording the operation time to obtain the combination operation time of each matrix group combination; taking the minimum combination operation time and the maximum set scale of the matrix group combination corresponding to the minimum combination operation time as a second constraint;
step 2, for the input arrays A and B to be operated on, the dimension of array A is d_A, the dimension of array B is d_B, and d_A and d_B are both integers not less than 2; if d_A and d_B are both equal to 2 and the numbers of rows or columns of array A and array B are equal, executing step 3; if one of d_A and d_B equals 2 and the other equals 3 and the numbers of rows or columns of array A and array B are equal, executing step 4; if d_A and d_B are both not less than 3, executing step 5;
step 3, representing the matrix multiplication of A and B as mma(num, O1, O2) with num = 1, O1 = A, O2 = B and O1_c = O2_r, wherein num is the number of groups participating in the matrix operation, O1 and O2 are the matrices participating in the operation, O1 is represented as (O1_r, O1_c) and O2 as (O2_r, O2_c); executing step 6;
step 4, decomposing the array with dimension 3 into a plurality of matrices and a vector, and executing step 6;
step 5, letting A be the input array A = (n, ci, wi, hi) and B be the convolution kernel B = (co, ci, kw, kh), where n is the number of convolution batches, ci the number of input channels, wi the input width, hi the input height, co the number of output channels, kw the convolution kernel width and kh the convolution kernel height; if the dimension of A or B is 2x+1, decomposing it into x−1 matrices and a matrix A_ch' or B_ch' obtained by dimension-reduction deformation of the array formed by the last three elements; if the dimension of A or B is 2x, decomposing it into x matrices and deforming the last matrix into A_ch' or B_ch'; then representing the matrix multiplication of A and B as mma(num, O1, O2) with num = n×ci, O1 = A_ch', O2 = B_ch' and O1_c = O2_r, and executing step 6; x is any natural number;
step 6, for the matrices O1 (O1_r, O1_c) and O2 (O2_r, O2_c) in mma(num, O1, O2), taking O1_r, O1_c and O2_r, O2_c as the values to be compared; if any value to be compared is greater than the lower threshold value, executing step 7; if none of the values to be compared is greater than the lower threshold value, adding the lower threshold value to the subset LX of the vector X corresponding to each value to be compared and then executing step 10;
step 7, executing step 8 if the value to be compared is larger than the upper threshold value, otherwise executing step 10;
step 8, performing modulo-2 division on the vector corresponding to the value to be compared; if the result is 0, setting the value to be compared to the division result, recording the number of modulo-2 divisions, and executing step 7; otherwise, executing step 9;
step 9, performing a zero padding operation on the vector corresponding to the value to be compared, setting the current value to be compared to the next value to be compared, and executing step 8;
step 10, decomposing the vector X corresponding to the value to be compared: letting 2^i be a value greater than or equal to the first minimum value and less than or equal to the first maximum value, applying X = X − 2^i until X is less than or equal to the lower threshold value, recording all the values of i, storing the results with base 2 and the recorded values of i as exponents into the subset LX of the vector X as dimensions, and executing step 6; after all vectors are decomposed, executing step 11;
step 11, arranging and combining the numerical values in all vector subsets LX to obtain num sets GA(M, N, K) of matrix group combinations; the num GA(M, N, K) define a limited exploration space, and optimization is performed in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint, obtaining the optimal decomposition combination of the input arrays A and B that satisfies the optimal calculation time.
Further, the method for constructing the test matrix group set composed of test matrix groups in step 1 is as follows: taking the maximum value among M, N and K in the first constraint as the minimum value of a single dimension, recorded as a first minimum value MIN, and setting the maximum value of a single dimension as a first maximum value MAX; selecting values that are powers of 2 between the first minimum value and the first maximum value to construct the test matrix groups (m_TA, n_TB, k_TA); all test matrix groups form the test matrix group set.
Further, the combination operation time in step 1 is the sum of the matrix multiplication operation times of all test sub-matrix groups in a matrix group combination.
Further, in step 4, the way to decompose an array with dimension 3 into a plurality of matrices and a vector is as follows: assuming A is a 3-dimensional array A = (a_1, a_2, a_3) and B is a 2-dimensional array B = (a_4, a_1), A is decomposed into A' and A'', where A' = (a_1, a_2) and A'' = (a_3), and the matrix multiplication of A and B is denoted mma(num, O1, O2) with num = a_3, O1 = A', O2 = B and O1_c = O2_r.
Further, the method of deforming the last matrix into A_ch' or B_ch' in step 5 is as follows:
Let the stride of the convolution operation be s = 1 and the convolution padding p = 0; the output length obtained by the two-dimensional convolution is then wo = (wi − kw + 2p)/s + 1 and the width is ho = (hi − kh + 2p)/s + 1. For each channel's calculation, n = 1, ci = 1, A_ch = (1, wi, hi), B_ch = (co, 1, kw, kh); A_ch is reduced to a matrix C = (wi, hi), C is further deformed into A_ch' = (kw×kh, wo×ho), and B_ch is reduced to B_ch' = (co, kw×kh).
Advantageous effects
The method determines a first constraint based on the acquired first parameter of the accelerator and establishes, according to the first constraint, a test matrix group set and the corresponding test matrix group combination set; it completes the matrix multiplication of all test sub-matrix groups on the target deep learning accelerator and records the operation times, takes the matrix group combination with the minimum combination operation time as the selected matrix group combination of each test matrix group, and derives the second constraint from the selected matrix group combination; it then decomposes the input arrays of the operation to be executed to determine a target combination, determines a limited exploration space based on that combination, and finally performs optimization in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint to obtain the optimal decomposition combination, thereby effectively improving operator computation performance on the same accelerator.
Drawings
FIG. 1 is a schematic diagram of an array decomposition process in an operator performance optimization method based on accelerator constraints.
Detailed Description
The present invention will be described in detail with reference to the following examples.
The invention provides an operator performance optimization method based on accelerator constraints, the core idea of which is as follows: determine a first constraint based on the acquired first parameter of the accelerator; establish, according to the first constraint, a test matrix group set and the corresponding test matrix group combination set; complete the matrix multiplication of all test sub-matrix groups on the target deep learning accelerator and record the operation times; take the matrix group combination with the minimum combination operation time as the selected matrix group combination of each test matrix group and use it as the second constraint; decompose the input arrays to be operated on to determine a target combination and a limited exploration space based on it; finally, with the second constraint as the constraint, perform optimization in the limited exploration space using the constraint-based simulated annealing algorithm to obtain the optimal decomposition combination.
The invention provides an operator performance optimization method based on accelerator constraint, which specifically comprises the following steps:
Step 1, obtaining the model number and a first parameter of the target deep learning accelerator, wherein the first parameter is the size of the minimum calculation unit of matrix multiplication supported by the target deep learning accelerator, recorded as (M, N, K); that is, the minimum calculation supported by the deep learning accelerator is the multiplication between a matrix A = M×K and a matrix B = K×N. The first parameter and the product of M, N and K are taken together as the first constraint.
Step 2, taking the maximum value among M, N and K in the first constraint as the minimum value of a single dimension, recorded as the first minimum value MIN, and setting the maximum value of a single dimension as the first maximum value MAX; selecting values that are powers of 2 between the first minimum value and the first maximum value to construct test matrix groups (m_TA, n_TB, k_TA); a test matrix group may be expressed as a matrix TA = m_TA × k_TA and a matrix TB = k_TB × n_TB, and all test matrix groups form the test matrix group set.
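For illustration only, the following Python sketch enumerates such a test matrix group set; the accelerator parameters (M, N, K) = (16, 16, 16) and the single-dimension maximum MAX = 1024 are assumed values for the example, not values given by the invention:

```python
import itertools

def build_test_matrix_groups(M, N, K, MAX):
    """Enumerate test matrix groups (m_TA, n_TB, k_TA) whose dimensions are
    powers of 2 between the first minimum MIN = max(M, N, K) and the first
    maximum MAX; each group stands for TA = m_TA x k_TA and TB = k_TA x n_TB."""
    MIN = max(M, N, K)                                       # first minimum value
    dims = [2 ** i for i in range(1, 31) if MIN <= 2 ** i <= MAX]
    return list(itertools.product(dims, repeat=3))

groups = build_test_matrix_groups(16, 16, 16, 1024)          # assumed values
print(len(groups))                                           # 7**3 = 343 groups
```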
Step 3, decomposing each test matrix group in the test matrix group set into a plurality of matrix group combinations G(M, N, K) under the first constraint, all matrix group combinations forming the test matrix group combination set corresponding to the test matrix group set. A matrix group combination comprises a plurality of test sub-matrix groups g(m, n, k) obtained by decomposing the test matrix group, and the sum of the matrix multiplication operation times of all test sub-matrix groups in a matrix group combination is recorded as its combination operation time.
The combination operation times of different matrix group combinations differ because the matrix multiplication times of different test sub-matrix groups differ.
Step 4, for each matrix group combination of a test matrix group, changing the data types of the elements in the matrix group combination, then completing the matrix multiplication of all test sub-matrix groups with the target deep learning accelerator and recording the operation time, thereby obtaining the combination operation time of each matrix group combination; taking the matrix group combination with the minimum combination operation time and the maximum set scale among the matrix group combinations as the selected matrix group combination of the test matrix group, and taking the combination operation time and the set scale of the selected matrix group combination as the second constraint.
A matrix group combination is in fact a set of test sub-matrix groups g(m, n, k), so the set scale refers to the number of elements in the matrix group combination. The second constraint established by the invention reflects the maximum amount of computation supported by the target deep learning accelerator within the relatively shortest operation time.
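A sketch of the timing in steps 3 and 4 might look as follows; run_on_accelerator is a placeholder assumed here (an ordinary NumPy multiplication timed on the CPU), standing in for dispatching one test sub-matrix group g(m, n, k) to the target deep learning accelerator:

```python
import time
import numpy as np

def run_on_accelerator(m, n, k):
    """Placeholder: time one g(m, n, k) matrix multiplication."""
    a = np.ones((m, k), dtype=np.float32)
    b = np.ones((k, n), dtype=np.float32)
    t0 = time.perf_counter()
    _ = a @ b
    return time.perf_counter() - t0

def select_second_constraint(combinations):
    """combinations: list of matrix group combinations, each a list of
    sub-matrix groups (m, n, k). Returns the (combination operation time,
    set scale) of the combination with minimum time, preferring the larger
    set scale among ties, as in step 4."""
    best_time, best_scale = None, None
    for combo in combinations:
        t = sum(run_on_accelerator(m, n, k) for (m, n, k) in combo)
        scale = len(combo)                 # set scale = number of g(m, n, k)
        if best_time is None or t < best_time or (t == best_time and scale > best_scale):
            best_time, best_scale = t, scale
    return best_time, best_scale           # the second constraint
```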
Step 5, for the input arrays A and B to be operated on, the dimension of array A is d_A, the dimension of array B is d_B, and d_A and d_B are both integers not less than 2; if d_A and d_B are both equal to 2 and the numbers of rows or columns of array A and array B are equal, executing step 6; if one of d_A and d_B equals 2 and the other equals 3 and the numbers of rows or columns of array A and array B are equal, executing step 7; if d_A and d_B are both not less than 3, executing step 8.
Specifically, in the present invention, the input arrays of the matrix multiplication may be two-dimensional, three-dimensional, four-dimensional, ..., N-dimensional; a two-dimensional array is a matrix, a three-dimensional array is an array composed of a plurality of matrices, a four-dimensional array is an array composed of a plurality of three-dimensional arrays, and similarly an N-dimensional array is an array composed of a plurality of (N−1)-dimensional arrays. For example, a three-dimensional array may be represented as (a_1, a_2, a_3), meaning that it is composed of a_3 matrices of (a_1, a_2); a four-dimensional array may be represented as (a_1, a_2, a_3, a_4), meaning that it is composed of a_4 three-dimensional arrays of (a_1, a_2, a_3); similarly, an N-dimensional array may be denoted (a_1, a_2, a_3, ..., a_(N−1), a_N), meaning that it is composed of a_N (N−1)-dimensional arrays of (a_1, a_2, a_3, ..., a_(N−1)), where a_1, a_2, a_3, ..., a_(N−1) are the elements of the array.
Step 6, for input arrays A and B, the matrix multiplication of A and B is denoted mma(num, O1, O2) with num = 1, O1 = A, O2 = B and O1_c = O2_r, i.e., one group of matrix A and matrix B performs the matrix multiplication; step 9 is then executed. Here mma(num, O1, O2) is the matrix multiplication function defined in the present invention, num is the number of groups participating in the matrix operation, O1 and O2 are the matrices participating in the operation, O1 is represented as (O1_r, O1_c) and O2 as (O2_r, O2_c).
Step 7, decomposing the array with dimension 3 into a plurality of matrices and a vector: assuming the input array A is a 3-dimensional array A = (a_1, a_2, a_3) and B is a 2-dimensional array B = (a_4, a_1), A is decomposed into A' and A'', where A' = (a_1, a_2) and A'' = (a_3), and the matrix multiplication of A and B is represented as mma(num, O1, O2) with num = a_3, O1 = A', O2 = B and O1_c = O2_r, that is, the a_3 matrices A' and the matrix B perform the matrix multiplication; step 9 is executed.
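As a concrete illustration of step 7, the decomposition amounts to a batched matrix multiplication over the a_3 slices. In the sketch below the sizes a1 = 4, a2 = 6, a3 = 3, a4 = 5 are assumptions of the example, and B is simply laid out so that its row count equals O1_c, as the constraint O1_c = O2_r requires:

```python
import numpy as np

a1, a2, a3, a4 = 4, 6, 3, 5
A = np.random.rand(a3, a1, a2)    # 3-D array: a_3 matrices A' of shape (a_1, a_2)
B = np.random.rand(a2, a4)        # 2-D array; O2_r must equal O1_c = a_2

# mma(num=a_3, O1=A', O2=B): one matrix multiplication per slice of A
out = np.stack([A[i] @ B for i in range(a3)])
assert out.shape == (a3, a1, a4)
```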
Step 8, this case usually corresponds to a convolution operation, so let A be the input array A = (n, ci, wi, hi) and B be the convolution kernel B = (co, ci, kw, kh), where n is the number of convolution batches, ci the number of input channels, wi the input width, hi the input height, co the number of output channels, kw the convolution kernel width and kh the convolution kernel height. If the dimension of A or B is odd, i.e., it can be expressed as 2x+1, it is decomposed into x−1 matrices and a matrix A_ch' or B_ch' obtained by dimension-reduction deformation of the array formed by the last three elements; if the dimension of A or B is even, i.e., it can be expressed as 2x, it is decomposed into x matrices and the last matrix is deformed into A_ch' or B_ch'. The matrix multiplication of A and B is further denoted mma(num, O1, O2) with num = n×ci, O1 = A_ch', O2 = B_ch' and O1_c = O2_r, i.e., n×ci groups of matrix A_ch' and matrix B_ch' perform the matrix multiplication; step 9 is executed.
The most commonly used and most fundamental operator in deep learning is the two-dimensional convolution; three-dimensional convolution, deconvolution, dilated convolution and the like are based on the two-dimensional convolution and differ from it only in dimension or size. For the convolution operation where A is the input array A = (n, ci, wi, hi) and B is the convolution kernel B = (co, ci, kw, kh), the above process of deforming the last matrix into A_ch' or B_ch' is as follows:
Assuming the stride of the convolution operation is s = 1 and the convolution padding is p = 0, the output length obtained by the two-dimensional convolution is wo = (wi − kw + 2p)/s + 1 and the width is ho = (hi − kh + 2p)/s + 1;
for each channel's calculation, n = 1, ci = 1, A_ch = (1, wi, hi), B_ch = (co, 1, kw, kh);
A_ch is reduced to C = (wi, hi), and C is then deformed into A_ch' = (kw×kh, wo×ho); B_ch is reduced to B_ch' = (co, kw×kh).
The above deformation process is shown in fig. 1 for the first channel, i.e., the input array A_1 = (1, 3, 5) and the convolution kernel B_1 = (2, 3); the following takes only the processing of the dark portion in the first channel as an example: at this time n = 1, ci = 1, A_1 = (1, wi, hi), B_1 = (co, 1, kw, kh); A_1 is reduced to C = (wi, hi), C is further deformed into A_1' = (kw×kh, wo×ho) and B_1 is reduced to B_1' = (co, kw×kh); after all channel calculations are completed, matrix multiplication is performed on the n×ci pairs of A_1' and B_1', i.e., F(A_1, B_1) = mma(n×ci, A_1', B_1').
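The dimension-reduction deformation above is, in effect, an im2col transform. A minimal NumPy sketch under the stated assumptions s = 1 and p = 0 follows; the concrete sizes wi = hi = 5, kw = kh = 3, co = 2 are chosen only for the example:

```python
import numpy as np

wi, hi, kw, kh, co = 5, 5, 3, 3, 2
wo, ho = wi - kw + 1, hi - kh + 1                           # s = 1, p = 0
C = np.arange(wi * hi, dtype=np.float32).reshape(wi, hi)    # A_ch reduced to (wi, hi)

# A_ch': each kw x kh window of C flattened into one column -> (kw*kh, wo*ho)
A_ch_p = np.stack([C[i:i + kw, j:j + kh].ravel()
                   for i in range(wo) for j in range(ho)], axis=1)

B_ch = np.random.rand(co, 1, kw, kh).astype(np.float32)
B_ch_p = B_ch.reshape(co, kw * kh)                          # B_ch' = (co, kw*kh)

out = B_ch_p @ A_ch_p            # one channel's convolution as a single matmul
assert out.shape == (co, wo * ho)
```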
Step 9, for the matrices O1 (O1_r, O1_c) and O2 (O2_r, O2_c) in mma(num, O1, O2), taking O1_r, O1_c and O2_r, O2_c as the values to be compared and comparing each of them in turn with 2^MIN and 2^MAX; if any value to be compared is greater than 2^MIN, step 10 is performed; if none of the values to be compared is greater than 2^MIN, 2^MIN is added to the subset LX of the vector X corresponding to each value to be compared, and step 13 is then performed.
Step 10, if the value to be compared is greater than 2^MAX, step 11 is performed; otherwise step 13 is performed.
Step 11, performing modulo-2 division on the vector corresponding to the value to be compared; if the result of the modulo-2 division is 0, indicating that the value is divisible by 2, the value to be compared is set to the division result, the number of modulo-2 divisions is recorded, and step 10 is executed; otherwise, step 12 is executed.
Step 12, performing a zero padding operation on the vector corresponding to the value to be compared, i.e., appending a 0 at the end of the vector; after the current value to be compared is set to the next value to be compared, step 11 is executed.
Step 13, decomposing the vector X corresponding to the value to be compared: letting 2^i be a value greater than or equal to MIN and less than or equal to MAX, X is decomposed by X = X − 2^i until X is less than or equal to 2^MIN; the values i1, i2, i3, ... of all i in this process are recorded, the results with base 2 and these values as exponents are stored as dimensions into the subset LX of the vector X, and step 9 is executed; when all vectors have completed decomposition, step 14 is performed.
For example, MIN = 2, MAX = 8, and i satisfies 1 ≤ i ≤ 3. When the value to be compared is 8, the vector X has 8 elements; when i takes 3, the vector X can be decomposed, and 2^3 is added to the subset LX, i.e., LX = {2^3}. When the value to be compared is 7, the vector X has 7 elements; first let i = 2, i.e., 2^2 elements are decomposed from the vector X to obtain the vector X'; then let i = 1, i.e., 2^1 elements are decomposed from the vector X', where the remaining dimension of the vector X' is less than 2^1 and is therefore recorded as 2^1; thus 2^2, 2^1 and 2^1 are added to the subset LX, i.e., LX = {2^2, 2^1, 2^1}.
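A Python sketch of the decomposition in steps 9 to 13 follows. It is a condensed reading rather than a literal transcription: MIN and MAX are treated as the threshold values of the worked example (MIN = 2, MAX = 8), the modulo-2 halving of steps 11 and 12 is folded into one loop, and any remainder is padded up to MIN:

```python
def decompose_dimension(X, MIN, MAX):
    """Split a dimension value X into powers of 2 within [MIN, MAX],
    zero-padding (rounding up) where needed, e.g. 7 -> [4, 2, 2]."""
    if X <= MIN:
        return [MIN]                  # nothing exceeds the lower threshold
    while X > MAX:                    # steps 10-12: halve, zero-padding odd values
        if X % 2:                     # modulo-2 division leaves a remainder
            X += 1                    # zero padding: append one element
        X //= 2
    LX = []
    while X > MIN:                    # step 13: peel off the largest fitting 2^i
        p = 1
        while p * 2 <= X and p * 2 <= MAX:
            p *= 2
        LX.append(p)
        X -= p
    if X > 0:
        LX.append(MIN)                # remainder below MIN is padded up to MIN
    return LX

print(decompose_dimension(8, 2, 8))   # [8], i.e. LX = {2^3}
print(decompose_dimension(7, 2, 8))   # [4, 2, 2], i.e. LX = {2^2, 2^1, 2^1}
```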
Step 14, the numerical values in all the vector subsets LX are arranged and combined to form the matrix group combination sets GA(M, N, K), of which there are num.
Step 15, a limited exploration space is obtained from the num GA(M, N, K) determined in step 14, and optimization is performed in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint, obtaining the optimal decomposition combination satisfying the optimal calculation time.
The specific steps of the constraint-based simulated annealing search in the limited exploration space are as follows:
step 15.1, initializing, setting an initial value of temperature to be t=100, and setting an initial solution state S to be: the input array is A, B, the size (M, N, K) of the minimum calculation unit of the matrix multiplication operation, the maximum value MAX (M, N, K) in M, N and K is defined as 2 14 The measured AB calculates the calculation Time (S).
Step 15.2, repeating steps 15.3 to 15.5 for 100 times.
Step 15.3, generating a new solution S' is: the input array is A, B, and the maximum value MAX (M, N, K) in the sizes (M, N, K), M, N and K of the minimum calculation unit of the matrix multiplication operation is the calculated Time of the AB calculation of the new solution actual measurement.
Step 15.4, calculating increment Δt '=time (S') -Time (S).
Step 15.5, if Δt '<0, accepting S' as the new current solution, otherwise accepting S 'as the new current solution with probability exp (- Δt'/T).
Step 15.6, if none of the continuous 20 new solutions S 'is accepted or when T < = 1, outputting the current solution as an optimal solution, where the optimal solution includes an optimal computation Time (S'), an operator optimal decomposition and a combination of GA (M, N, K), the number of GA and the number of modulo-2 divisions, and exiting the process. Wherein GA (M, N, K) comprises a matrix group G (M, N, K) and a sub-matrix group G (M, N, K).
Step 15.7, let t=0.95T, execute step 15.2.
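A condensed Python sketch of steps 15.1 to 15.7 follows; measure_time, perturb and within_second_constraint are problem-specific callables assumed by this sketch (measuring A×B on the accelerator, generating a neighbouring decomposition combination in the limited exploration space, and checking the second constraint, respectively):

```python
import math
import random

def anneal(S0, measure_time, perturb, within_second_constraint):
    """Constraint-based simulated annealing over the limited exploration space."""
    T = 100.0                                     # step 15.1: initial temperature
    S, t_S = S0, measure_time(S0)
    rejected = 0                                  # consecutive rejected solutions
    while T > 1.0:                                # step 15.6: stop when T <= 1
        for _ in range(100):                      # step 15.2: 100 trials per temperature
            S_new = perturb(S)                    # step 15.3: generate a new solution
            if not within_second_constraint(S_new):
                continue                          # keep the search inside the constraint
            t_new = measure_time(S_new)
            dT = t_new - t_S                      # step 15.4: increment
            # step 15.5: accept improvements; accept worse moves with exp(-dT/T)
            if dT < 0 or random.random() < math.exp(-dT / T):
                S, t_S = S_new, t_new
                rejected = 0
            else:
                rejected += 1
                if rejected >= 20:                # step 15.6: 20 consecutive rejections
                    return S, t_S
        T *= 0.95                                 # step 15.7: cool down
    return S, t_S
```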
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An operator performance optimization method based on accelerator constraint is characterized by comprising the following steps:
step 1, obtaining a first parameter of a target deep learning accelerator, wherein the first parameter is the size of the minimum calculation unit of matrix multiplication supported by the target deep learning accelerator, recorded as (M, N, K); taking the first parameter together with the product of M, N and K as a first constraint; constructing a test matrix group set composed of test matrix groups; decomposing each test matrix group under the first constraint into test sub-matrix groups g(m, n, k) to form a plurality of matrix group combinations G(M, N, K), all matrix group combinations forming a test matrix group combination set corresponding to the test matrix group set; completing, with the target deep learning accelerator, the matrix multiplication of all test sub-matrix groups in each matrix group combination of each test matrix group and recording the operation time to obtain the combination operation time of each matrix group combination; taking the minimum combination operation time and the maximum set scale of the matrix group combination corresponding to the minimum combination operation time as a second constraint;
step 2, for the input arrays A and B to be operated on, the dimension of array A is d_A, the dimension of array B is d_B, and d_A and d_B are both integers not less than 2; if d_A and d_B are both equal to 2 and the numbers of rows or columns of array A and array B are equal, executing step 3; if one of d_A and d_B equals 2 and the other equals 3 and the numbers of rows or columns of array A and array B are equal, executing step 4; if d_A and d_B are both not less than 3, executing step 5;
step 3, representing the matrix multiplication of A and B as mma(num, O1, O2) with num = 1, O1 = A, O2 = B and O1_c = O2_r, wherein num is the number of groups participating in the matrix operation, O1 and O2 are the matrices participating in the operation, O1 is represented as (O1_r, O1_c) and O2 as (O2_r, O2_c); executing step 6;
step 4, decomposing the array with dimension 3 into a plurality of matrices and a vector, and executing step 6;
step 5, letting A be the input array A = (n, ci, wi, hi) and B be the convolution kernel B = (co, ci, kw, kh), where n is the number of convolution batches, ci the number of input channels, wi the input width, hi the input height, co the number of output channels, kw the convolution kernel width and kh the convolution kernel height; if the dimension of A or B is 2x+1, decomposing it into x−1 matrices and a matrix A_ch' or B_ch' obtained by dimension-reduction deformation of the array formed by the last three elements; if the dimension of A or B is 2x, decomposing it into x matrices and deforming the last matrix into A_ch' or B_ch'; then representing the matrix multiplication of A and B as mma(num, O1, O2) with num = n×ci, O1 = A_ch', O2 = B_ch' and O1_c = O2_r, and executing step 6; x is any natural number;
step 6, for the matrices O1 (O1_r, O1_c) and O2 (O2_r, O2_c) in mma(num, O1, O2), taking O1_r, O1_c and O2_r, O2_c as the values to be compared; if any value to be compared is greater than the lower threshold value, executing step 7; if none of the values to be compared is greater than the lower threshold value, adding the lower threshold value to the subset LX of the vector X corresponding to each value to be compared and then executing step 10;
step 7, executing step 8 if the value to be compared is larger than the upper threshold value, otherwise executing step 10;
step 8, performing modulo-2 division on the vector corresponding to the value to be compared; if the result is 0, setting the value to be compared to the division result, recording the number of modulo-2 divisions, and executing step 7; otherwise, executing step 9;
step 9, performing a zero padding operation on the vector corresponding to the value to be compared, setting the current value to be compared to the next value to be compared, and executing step 8;
step 10, decomposing the vector X corresponding to the value to be compared: letting 2^i be a value greater than or equal to the first minimum value and less than or equal to the first maximum value, applying X = X − 2^i until X is less than or equal to the lower threshold value, recording all the values of i, storing the results with base 2 and the values of i as exponents into the subset LX of the vector X as dimensions, and executing step 6; after all vectors are decomposed, executing step 11;
step 11, arranging and combining the numerical values in all vector subsets LX to obtain num sets GA(M, N, K) of matrix group combinations; the num GA(M, N, K) define a limited exploration space, and optimization is performed in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint, obtaining the optimal decomposition combination of the input arrays A and B that satisfies the optimal calculation time.
2. The operator performance optimization method according to claim 1, wherein the method for constructing the test matrix group set composed of test matrix groups in step 1 is as follows: taking the maximum value among M, N and K in the first constraint as the minimum value of a single dimension, recorded as a first minimum value MIN, and setting the maximum value of a single dimension as a first maximum value MAX; selecting values that are powers of 2 between the first minimum value and the first maximum value to construct the test matrix groups (m_TA, n_TB, k_TA); all test matrix groups form the test matrix group set.
3. The method of optimizing operator performance according to claim 1, wherein the combination operation time in step 1 is a sum of matrix multiplication operation times of all test sub-matrix groups in a matrix group combination.
4. The method for optimizing operator performance according to claim 1, wherein the decomposition of the array with dimension 3 into a plurality of matrices and a vector in step 4 is: assuming A is a 3-dimensional array A = (a_1, a_2, a_3) and B is a 2-dimensional array B = (a_4, a_1), A is decomposed into A' and A'', where A' = (a_1, a_2) and A'' = (a_3), and the matrix multiplication of A and B is denoted mma(num, O1, O2) with num = a_3, O1 = A', O2 = B and O1_c = O2_r.
5. The method for optimizing operator performance according to claim 1, wherein the method of deforming the last matrix into A_ch' or B_ch' in step 5 is as follows:
Let the stride of the convolution operation be s = 1 and the convolution padding p = 0; the output length obtained by the two-dimensional convolution is then wo = (wi − kw + 2p)/s + 1 and the width is ho = (hi − kh + 2p)/s + 1. For each channel's calculation, n = 1, ci = 1, A_ch = (1, wi, hi), B_ch = (co, 1, kw, kh); A_ch is reduced to a matrix C = (wi, hi), C is further deformed into A_ch' = (kw×kh, wo×ho), and B_ch is reduced to B_ch' = (co, kw×kh).
CN202311644953.7A 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint Active CN117349585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311644953.7A CN117349585B (en) 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311644953.7A CN117349585B (en) 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint

Publications (2)

Publication Number Publication Date
CN117349585A true CN117349585A (en) 2024-01-05
CN117349585B CN117349585B (en) 2024-02-23

Family

ID=89361719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311644953.7A Active CN117349585B (en) 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint

Country Status (1)

Country Link
CN (1) CN117349585B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
US20200026992A1 (en) * 2016-09-29 2020-01-23 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN111667051A (en) * 2020-05-27 2020-09-15 上海赛昉科技有限公司 Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN114781629A (en) * 2022-04-06 2022-07-22 合肥工业大学 Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN114925823A (en) * 2022-05-12 2022-08-19 南京航空航天大学 Convolutional neural network compression method and edge side FPGA accelerator
CN115994861A (en) * 2021-10-20 2023-04-21 珠海一微半导体股份有限公司 Image processing method and chip based on convolution algorithm

Also Published As

Publication number Publication date
CN117349585B (en) 2024-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant