CN117349585A - Operator performance optimization method based on accelerator constraint - Google Patents

Operator performance optimization method based on accelerator constraint

Info

Publication number
CN117349585A
Authority
CN
China
Prior art keywords
matrix
constraint
value
array
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311644953.7A
Other languages
Chinese (zh)
Other versions
CN117349585B (en)
Inventor
钟阳宇
杜凯
刘忠新
温研
邓强
李解
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Linzhuo Information Technology Co Ltd
Original Assignee
Beijing Linzhuo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Linzhuo Information Technology Co Ltd
Priority to CN202311644953.7A
Publication of CN117349585A
Application granted
Publication of CN117349585B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/535 Dividing only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an operator performance optimization method based on accelerator constraints. The method determines a first constraint based on an acquired first parameter of the accelerator and establishes, according to the first constraint, a test matrix group set and the corresponding test matrix group combination set; it completes the matrix multiplication of all test sub-matrix groups on the target deep learning accelerator and records the operation times, takes the matrix group combination with the minimum combination operation time as the selected matrix group combination of each test matrix group, and uses it as a second constraint; it decomposes the input arrays of the operation to be executed to determine a target combination and, from that combination, a limited exploration space; finally, with the second constraint as the constraint, it performs optimization in the limited exploration space using a constraint-based simulated annealing algorithm to obtain an optimal decomposition combination, thereby effectively improving operator computation performance on the same accelerator.

Description

Operator performance optimization method based on accelerator constraint
Technical Field
The invention belongs to the technical field of deep learning accelerators, and particularly relates to an operator performance optimization method based on accelerator constraint.
Background
With the increasingly wide application of deep learning, the demand for high-performance computing keeps growing. Deep learning is accelerated mainly by improving the computational efficiency and energy efficiency of deep learning operators, and achieving optimal performance and energy efficiency by matching the computational characteristics of an algorithm to the characteristics of the hardware architecture remains challenging. Existing performance optimization approaches mainly consist of designing software libraries for CPUs and GPUs, optimizing the memory subsystem to implement matrix multiplication, and decomposing arrays to improve operator performance. These approaches do not focus on the characteristics of the accelerator itself, so they cannot effectively solve the matching problem between accelerator and operator, and their optimization effect on operator performance is poor.
Disclosure of Invention
In view of this, the invention provides an operator performance optimization method based on accelerator constraints, which realizes constraint-based array decomposition to obtain an optimal array decomposition mode and thus an optimal matrix combination.
The invention provides an operator performance optimization method based on accelerator constraint, which comprises the following steps:
step 1, obtaining a first parameter of a target deep learning accelerator, wherein the first parameter is the size of the minimum calculation unit of matrix multiplication supported by the target deep learning accelerator, recorded as (M, N, K); taking the first parameter together with the product of M, N and K as a first constraint; constructing a test matrix group set composed of test matrix groups; decomposing each test matrix group under the first constraint into test sub-matrix groups g(m, n, k) to form a plurality of matrix group combinations G(M, N, K), all matrix group combinations forming a test matrix group combination set corresponding to the test matrix group set; completing, with the target deep learning accelerator, the matrix multiplication of all test sub-matrix groups in each matrix group combination of each test matrix group and recording the operation time to obtain the combination operation time of each matrix group combination; taking the minimum combination operation time and the maximum set scale of the matrix group combination corresponding to the minimum combination operation time as a second constraint;
step 2, for the input arrays A and B to be operated on, the dimension of array A is d_A, the dimension of array B is d_B, and d_A and d_B are both integers not less than 2; if d_A and d_B are both equal to 2 and the numbers of rows or columns of array A and array B are equal, executing step 3; if one of d_A and d_B equals 2 and the other equals 3 and the numbers of rows or columns of array A and array B are equal, executing step 4; if d_A and d_B are both not less than 3, executing step 5;
step 3, representing the matrix multiplication of A and B as mma(num, O1, O2) with num = 1, O1 = A, O2 = B and O1_c = O2_r, wherein num is the number of groups participating in the matrix operation, O1 and O2 are the matrices participating in the operation, O1 is represented as (O1_r, O1_c) and O2 as (O2_r, O2_c); executing step 6;
step 4, decomposing the array with dimension 3 into a plurality of matrices and a vector, and executing step 6;
step 5, letting A be the input array A = (n, ci, wi, hi) and B be the convolution kernel B = (co, ci, kw, kh), where n is the number of convolution batches, ci the number of input channels, wi the input width, hi the input height, co the number of output channels, kw the convolution kernel width and kh the convolution kernel height; if the dimension of A or B is 2x+1, decomposing it into x−1 matrices and a matrix A_ch' or B_ch' obtained by dimension-reduction deformation of the array formed by the last three elements; if the dimension of A or B is 2x, decomposing it into x matrices and deforming the last matrix into A_ch' or B_ch'; then representing the matrix multiplication of A and B as mma(num, O1, O2) with num = n×ci, O1 = A_ch', O2 = B_ch' and O1_c = O2_r, and executing step 6; x is any natural number;
step 6, for the matrices O1 (O1_r, O1_c) and O2 (O2_r, O2_c) in mma(num, O1, O2), taking O1_r, O1_c and O2_r, O2_c as the values to be compared; if any value to be compared is greater than the lower threshold value, executing step 7; if none of the values to be compared is greater than the lower threshold value, adding the lower threshold value to the subset LX of the vector X corresponding to each value to be compared and then executing step 10;
step 7, executing step 8 if the value to be compared is larger than the upper threshold value, otherwise executing step 10;
step 8, performing modulo-2 division on the vector corresponding to the value to be compared; if the result is 0, setting the value to be compared to the division result, recording the number of modulo-2 divisions, and executing step 7; otherwise, executing step 9;
step 9, performing a zero padding operation on the vector corresponding to the value to be compared, setting the current value to be compared to the next value to be compared, and executing step 8;
step 10, decomposing the vector X corresponding to the value to be compared: letting 2^i be a value greater than or equal to the first minimum value and less than or equal to the first maximum value, applying X = X − 2^i until X is less than or equal to the lower threshold value, recording all the values of i, storing the results with base 2 and the recorded values of i as exponents into the subset LX of the vector X as dimensions, and executing step 6; after all vectors are decomposed, executing step 11;
step 11, arranging and combining the numerical values in all vector subsets LX to obtain num sets GA(M, N, K) of matrix group combinations; the num GA(M, N, K) define a limited exploration space, and optimization is performed in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint, obtaining the optimal decomposition combination of the input arrays A and B that satisfies the optimal calculation time.
Further, the method for constructing the test matrix group set composed of test matrix groups in step 1 is as follows: taking the maximum value among M, N and K in the first constraint as the minimum value of a single dimension, recorded as a first minimum value MIN, and setting the maximum value of a single dimension as a first maximum value MAX; selecting values that are powers of 2 between the first minimum value and the first maximum value to construct the test matrix groups (m_TA, n_TB, k_TA); all test matrix groups form the test matrix group set.
Further, the combination operation time in step 1 is the sum of the matrix multiplication operation times of all test sub-matrix groups in a matrix group combination.
Further, in step 4, the way to decompose an array with dimension 3 into a plurality of matrices and a vector is as follows: assuming A is a 3-dimensional array A = (a_1, a_2, a_3) and B is a 2-dimensional array B = (a_4, a_1), A is decomposed into A' and A'', where A' = (a_1, a_2) and A'' = (a_3), and the matrix multiplication of A and B is denoted mma(num, O1, O2) with num = a_3, O1 = A', O2 = B and O1_c = O2_r.
Further, the method of deforming the last matrix into A_ch' or B_ch' in step 5 is as follows:
Let the stride of the convolution operation be s = 1 and the convolution padding p = 0; the output length obtained by the two-dimensional convolution is then wo = (wi − kw + 2p)/s + 1 and the width is ho = (hi − kh + 2p)/s + 1. For each channel's calculation, n = 1, ci = 1, A_ch = (1, wi, hi), B_ch = (co, 1, kw, kh); A_ch is reduced to a matrix C = (wi, hi), C is further deformed into A_ch' = (kw×kh, wo×ho), and B_ch is reduced to B_ch' = (co, kw×kh).
Advantageous effects
The method determines a first constraint based on the acquired first parameter of the accelerator and establishes, according to the first constraint, a test matrix group set and the corresponding test matrix group combination set; it completes the matrix multiplication of all test sub-matrix groups on the target deep learning accelerator and records the operation times, takes the matrix group combination with the minimum combination operation time as the selected matrix group combination of each test matrix group, and derives the second constraint from the selected matrix group combination; it then decomposes the input arrays of the operation to be executed to determine a target combination, determines a limited exploration space based on that combination, and finally performs optimization in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint to obtain the optimal decomposition combination, thereby effectively improving operator computation performance on the same accelerator.
Drawings
FIG. 1 is a schematic diagram of an array decomposition process in an operator performance optimization method based on accelerator constraints.
Detailed Description
The present invention will be described in detail with reference to the following examples.
The invention provides an operator performance optimization method based on accelerator constraints, the core idea of which is as follows: determine a first constraint based on the acquired first parameter of the accelerator; establish, according to the first constraint, a test matrix group set and the corresponding test matrix group combination set; complete the matrix multiplication of all test sub-matrix groups on the target deep learning accelerator and record the operation times; take the matrix group combination with the minimum combination operation time as the selected matrix group combination of each test matrix group and use it as the second constraint; decompose the input arrays to be operated on to determine a target combination and a limited exploration space based on it; finally, with the second constraint as the constraint, perform optimization in the limited exploration space using the constraint-based simulated annealing algorithm to obtain the optimal decomposition combination.
The invention provides an operator performance optimization method based on accelerator constraint, which specifically comprises the following steps:
Step 1, obtaining the model number and a first parameter of the target deep learning accelerator, wherein the first parameter is the size of the minimum calculation unit of matrix multiplication supported by the target deep learning accelerator, recorded as (M, N, K); that is, the minimum calculation supported by the deep learning accelerator is the multiplication between a matrix A = M×K and a matrix B = K×N. The first parameter and the product of M, N and K are taken together as the first constraint.
Step 2, taking the maximum value among M, N and K in the first constraint as the minimum value of a single dimension, recorded as the first minimum value MIN, and setting the maximum value of a single dimension as the first maximum value MAX; selecting values that are powers of 2 between the first minimum value and the first maximum value to construct test matrix groups (m_TA, n_TB, k_TA); a test matrix group may be expressed as a matrix TA = m_TA × k_TA and a matrix TB = k_TB × n_TB, and all test matrix groups form the test matrix group set.
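For illustration only, the following Python sketch enumerates such a test matrix group set; the accelerator parameters (M, N, K) = (16, 16, 16) and the single-dimension maximum MAX = 1024 are assumed values for the example, not values given by the invention:

```python
import itertools

def build_test_matrix_groups(M, N, K, MAX):
    """Enumerate test matrix groups (m_TA, n_TB, k_TA) whose dimensions are
    powers of 2 between the first minimum MIN = max(M, N, K) and the first
    maximum MAX; each group stands for TA = m_TA x k_TA and TB = k_TA x n_TB."""
    MIN = max(M, N, K)                                       # first minimum value
    dims = [2 ** i for i in range(1, 31) if MIN <= 2 ** i <= MAX]
    return list(itertools.product(dims, repeat=3))

groups = build_test_matrix_groups(16, 16, 16, 1024)          # assumed values
print(len(groups))                                           # 7**3 = 343 groups
```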
Step 3, decomposing each test matrix group in the test matrix group set into a plurality of matrix group combinations G(M, N, K) under the first constraint, all matrix group combinations forming the test matrix group combination set corresponding to the test matrix group set. A matrix group combination comprises a plurality of test sub-matrix groups g(m, n, k) obtained by decomposing the test matrix group, and the sum of the matrix multiplication operation times of all test sub-matrix groups in a matrix group combination is recorded as its combination operation time.
The combination operation times of different matrix group combinations differ because the matrix multiplication times of different test sub-matrix groups differ.
Step 4, for each matrix group combination of a test matrix group, changing the data types of the elements in the matrix group combination, then completing the matrix multiplication of all test sub-matrix groups with the target deep learning accelerator and recording the operation time, thereby obtaining the combination operation time of each matrix group combination; taking the matrix group combination with the minimum combination operation time and the maximum set scale among the matrix group combinations as the selected matrix group combination of the test matrix group, and taking the combination operation time and the set scale of the selected matrix group combination as the second constraint.
A matrix group combination is in fact a set of test sub-matrix groups g(m, n, k), so the set scale refers to the number of elements in the matrix group combination. The second constraint established by the invention reflects the maximum amount of computation supported by the target deep learning accelerator within the relatively shortest operation time.
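A sketch of the timing in steps 3 and 4 might look as follows; run_on_accelerator is a placeholder assumed here (an ordinary NumPy multiplication timed on the CPU), standing in for dispatching one test sub-matrix group g(m, n, k) to the target deep learning accelerator:

```python
import time
import numpy as np

def run_on_accelerator(m, n, k):
    """Placeholder: time one g(m, n, k) matrix multiplication."""
    a = np.ones((m, k), dtype=np.float32)
    b = np.ones((k, n), dtype=np.float32)
    t0 = time.perf_counter()
    _ = a @ b
    return time.perf_counter() - t0

def select_second_constraint(combinations):
    """combinations: list of matrix group combinations, each a list of
    sub-matrix groups (m, n, k). Returns the (combination operation time,
    set scale) of the combination with minimum time, preferring the larger
    set scale among ties, as in step 4."""
    best_time, best_scale = None, None
    for combo in combinations:
        t = sum(run_on_accelerator(m, n, k) for (m, n, k) in combo)
        scale = len(combo)                 # set scale = number of g(m, n, k)
        if best_time is None or t < best_time or (t == best_time and scale > best_scale):
            best_time, best_scale = t, scale
    return best_time, best_scale           # the second constraint
```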
Step 5, for the input arrays A and B to be operated on, the dimension of array A is d_A, the dimension of array B is d_B, and d_A and d_B are both integers not less than 2; if d_A and d_B are both equal to 2 and the numbers of rows or columns of array A and array B are equal, executing step 6; if one of d_A and d_B equals 2 and the other equals 3 and the numbers of rows or columns of array A and array B are equal, executing step 7; if d_A and d_B are both not less than 3, executing step 8.
Specifically, in the present invention, the input arrays of the matrix multiplication may be two-dimensional, three-dimensional, four-dimensional, ..., N-dimensional; a two-dimensional array is a matrix, a three-dimensional array is an array composed of a plurality of matrices, a four-dimensional array is an array composed of a plurality of three-dimensional arrays, and similarly an N-dimensional array is an array composed of a plurality of (N−1)-dimensional arrays. For example, a three-dimensional array may be represented as (a_1, a_2, a_3), meaning that it is composed of a_3 matrices of (a_1, a_2); a four-dimensional array may be represented as (a_1, a_2, a_3, a_4), meaning that it is composed of a_4 three-dimensional arrays of (a_1, a_2, a_3); similarly, an N-dimensional array may be denoted (a_1, a_2, a_3, ..., a_(N−1), a_N), meaning that it is composed of a_N (N−1)-dimensional arrays of (a_1, a_2, a_3, ..., a_(N−1)), where a_1, a_2, a_3, ..., a_(N−1) are the elements of the array.
Step 6, for input arrays A and B, the matrix multiplication of A and B is denoted mma(num, O1, O2) with num = 1, O1 = A, O2 = B and O1_c = O2_r, i.e., one group of matrix A and matrix B performs the matrix multiplication; step 9 is then executed. Here mma(num, O1, O2) is the matrix multiplication function defined in the present invention, num is the number of groups participating in the matrix operation, O1 and O2 are the matrices participating in the operation, O1 is represented as (O1_r, O1_c) and O2 as (O2_r, O2_c).
Step 7, decomposing the array with dimension 3 into a plurality of matrices and a vector: assuming the input array A is a 3-dimensional array A = (a_1, a_2, a_3) and B is a 2-dimensional array B = (a_4, a_1), A is decomposed into A' and A'', where A' = (a_1, a_2) and A'' = (a_3), and the matrix multiplication of A and B is represented as mma(num, O1, O2) with num = a_3, O1 = A', O2 = B and O1_c = O2_r, that is, the a_3 matrices A' and the matrix B perform the matrix multiplication; step 9 is executed.
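As a concrete illustration of step 7, the decomposition amounts to a batched matrix multiplication over the a_3 slices. In the sketch below the sizes a1 = 4, a2 = 6, a3 = 3, a4 = 5 are assumptions of the example, and B is simply laid out so that its row count equals O1_c, as the constraint O1_c = O2_r requires:

```python
import numpy as np

a1, a2, a3, a4 = 4, 6, 3, 5
A = np.random.rand(a3, a1, a2)    # 3-D array: a_3 matrices A' of shape (a_1, a_2)
B = np.random.rand(a2, a4)        # 2-D array; O2_r must equal O1_c = a_2

# mma(num=a_3, O1=A', O2=B): one matrix multiplication per slice of A
out = np.stack([A[i] @ B for i in range(a3)])
assert out.shape == (a3, a1, a4)
```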
Step 8, this case usually corresponds to a convolution operation, so let A be the input array A = (n, ci, wi, hi) and B be the convolution kernel B = (co, ci, kw, kh), where n is the number of convolution batches, ci the number of input channels, wi the input width, hi the input height, co the number of output channels, kw the convolution kernel width and kh the convolution kernel height. If the dimension of A or B is odd, i.e., it can be expressed as 2x+1, it is decomposed into x−1 matrices and a matrix A_ch' or B_ch' obtained by dimension-reduction deformation of the array formed by the last three elements; if the dimension of A or B is even, i.e., it can be expressed as 2x, it is decomposed into x matrices and the last matrix is deformed into A_ch' or B_ch'. The matrix multiplication of A and B is further denoted mma(num, O1, O2) with num = n×ci, O1 = A_ch', O2 = B_ch' and O1_c = O2_r, i.e., n×ci groups of matrix A_ch' and matrix B_ch' perform the matrix multiplication; step 9 is executed.
The most commonly used and most fundamental operator in deep learning is the two-dimensional convolution; three-dimensional convolution, deconvolution, dilated convolution and the like are based on the two-dimensional convolution and differ from it only in dimension or size. For the convolution operation where A is the input array A = (n, ci, wi, hi) and B is the convolution kernel B = (co, ci, kw, kh), the above process of deforming the last matrix into A_ch' or B_ch' is as follows:
Assuming the stride of the convolution operation is s = 1 and the convolution padding is p = 0, the output length obtained by the two-dimensional convolution is wo = (wi − kw + 2p)/s + 1 and the width is ho = (hi − kh + 2p)/s + 1;
for each channel's calculation, n = 1, ci = 1, A_ch = (1, wi, hi), B_ch = (co, 1, kw, kh);
A_ch is reduced to C = (wi, hi), and C is then deformed into A_ch' = (kw×kh, wo×ho); B_ch is reduced to B_ch' = (co, kw×kh).
The above deformation process is shown in fig. 1 for the first channel, i.e., the input array A_1 = (1, 3, 5) and the convolution kernel B_1 = (2, 3); the following takes only the processing of the dark portion in the first channel as an example: at this time n = 1, ci = 1, A_1 = (1, wi, hi), B_1 = (co, 1, kw, kh); A_1 is reduced to C = (wi, hi), C is further deformed into A_1' = (kw×kh, wo×ho) and B_1 is reduced to B_1' = (co, kw×kh); after all channel calculations are completed, matrix multiplication is performed on the n×ci pairs of A_1' and B_1', i.e., F(A_1, B_1) = mma(n×ci, A_1', B_1').
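The dimension-reduction deformation above is, in effect, an im2col transform. A minimal NumPy sketch under the stated assumptions s = 1 and p = 0 follows; the concrete sizes wi = hi = 5, kw = kh = 3, co = 2 are chosen only for the example:

```python
import numpy as np

wi, hi, kw, kh, co = 5, 5, 3, 3, 2
wo, ho = wi - kw + 1, hi - kh + 1                           # s = 1, p = 0
C = np.arange(wi * hi, dtype=np.float32).reshape(wi, hi)    # A_ch reduced to (wi, hi)

# A_ch': each kw x kh window of C flattened into one column -> (kw*kh, wo*ho)
A_ch_p = np.stack([C[i:i + kw, j:j + kh].ravel()
                   for i in range(wo) for j in range(ho)], axis=1)

B_ch = np.random.rand(co, 1, kw, kh).astype(np.float32)
B_ch_p = B_ch.reshape(co, kw * kh)                          # B_ch' = (co, kw*kh)

out = B_ch_p @ A_ch_p            # one channel's convolution as a single matmul
assert out.shape == (co, wo * ho)
```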
Step 9, for the matrices O1 (O1_r, O1_c) and O2 (O2_r, O2_c) in mma(num, O1, O2), taking O1_r, O1_c and O2_r, O2_c as the values to be compared and comparing each of them in turn with 2^MIN and 2^MAX; if any value to be compared is greater than 2^MIN, step 10 is performed; if none of the values to be compared is greater than 2^MIN, 2^MIN is added to the subset LX of the vector X corresponding to each value to be compared, and step 13 is then performed.
Step 10, if the value to be compared is greater than 2^MAX, step 11 is performed; otherwise step 13 is performed.
Step 11, performing modulo-2 division on the vector corresponding to the value to be compared; if the result of the modulo-2 division is 0, indicating that the value is divisible by 2, the value to be compared is set to the division result, the number of modulo-2 divisions is recorded, and step 10 is executed; otherwise, step 12 is executed.
Step 12, performing a zero padding operation on the vector corresponding to the value to be compared, i.e., appending a 0 at the end of the vector; after the current value to be compared is set to the next value to be compared, step 11 is executed.
Step 13, decomposing the vector X corresponding to the value to be compared: letting 2^i be a value greater than or equal to MIN and less than or equal to MAX, X is decomposed by X = X − 2^i until X is less than or equal to 2^MIN; the values i1, i2, i3, ... of all i in this process are recorded, the results with base 2 and these values as exponents are stored as dimensions into the subset LX of the vector X, and step 9 is executed; when all vectors have completed decomposition, step 14 is performed.
For example, MIN = 2, MAX = 8, and i satisfies 1 ≤ i ≤ 3. When the value to be compared is 8, the vector X has 8 elements; when i takes 3, the vector X can be decomposed, and 2^3 is added to the subset LX, i.e., LX = {2^3}. When the value to be compared is 7, the vector X has 7 elements; first let i = 2, i.e., 2^2 elements are decomposed from the vector X to obtain the vector X'; then let i = 1, i.e., 2^1 elements are decomposed from the vector X', where the remaining dimension of the vector X' is less than 2^1 and is therefore recorded as 2^1; thus 2^2, 2^1 and 2^1 are added to the subset LX, i.e., LX = {2^2, 2^1, 2^1}.
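A Python sketch of the decomposition in steps 9 to 13 follows. It is a condensed reading rather than a literal transcription: MIN and MAX are treated as the threshold values of the worked example (MIN = 2, MAX = 8), the modulo-2 halving of steps 11 and 12 is folded into one loop, and any remainder is padded up to MIN:

```python
def decompose_dimension(X, MIN, MAX):
    """Split a dimension value X into powers of 2 within [MIN, MAX],
    zero-padding (rounding up) where needed, e.g. 7 -> [4, 2, 2]."""
    if X <= MIN:
        return [MIN]                  # nothing exceeds the lower threshold
    while X > MAX:                    # steps 10-12: halve, zero-padding odd values
        if X % 2:                     # modulo-2 division leaves a remainder
            X += 1                    # zero padding: append one element
        X //= 2
    LX = []
    while X > MIN:                    # step 13: peel off the largest fitting 2^i
        p = 1
        while p * 2 <= X and p * 2 <= MAX:
            p *= 2
        LX.append(p)
        X -= p
    if X > 0:
        LX.append(MIN)                # remainder below MIN is padded up to MIN
    return LX

print(decompose_dimension(8, 2, 8))   # [8], i.e. LX = {2^3}
print(decompose_dimension(7, 2, 8))   # [4, 2, 2], i.e. LX = {2^2, 2^1, 2^1}
```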
Step 14, the numerical values in all the vector subsets LX are arranged and combined to form the matrix group combination sets GA(M, N, K), of which there are num.
Step 15, a limited exploration space is obtained from the num GA(M, N, K) determined in step 14, and optimization is performed in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint, obtaining the optimal decomposition combination satisfying the optimal calculation time.
The specific steps of the constraint-based simulated annealing search in the limited exploration space are as follows:
step 15.1, initializing, setting an initial value of temperature to be t=100, and setting an initial solution state S to be: the input array is A, B, the size (M, N, K) of the minimum calculation unit of the matrix multiplication operation, the maximum value MAX (M, N, K) in M, N and K is defined as 2 14 The measured AB calculates the calculation Time (S).
Step 15.2, repeating steps 15.3 to 15.5 for 100 times.
Step 15.3, generating a new solution S' is: the input array is A, B, and the maximum value MAX (M, N, K) in the sizes (M, N, K), M, N and K of the minimum calculation unit of the matrix multiplication operation is the calculated Time of the AB calculation of the new solution actual measurement.
Step 15.4, calculating increment Δt '=time (S') -Time (S).
Step 15.5, if Δt '<0, accepting S' as the new current solution, otherwise accepting S 'as the new current solution with probability exp (- Δt'/T).
Step 15.6, if none of the continuous 20 new solutions S 'is accepted or when T < = 1, outputting the current solution as an optimal solution, where the optimal solution includes an optimal computation Time (S'), an operator optimal decomposition and a combination of GA (M, N, K), the number of GA and the number of modulo-2 divisions, and exiting the process. Wherein GA (M, N, K) comprises a matrix group G (M, N, K) and a sub-matrix group G (M, N, K).
Step 15.7, let t=0.95T, execute step 15.2.
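A condensed Python sketch of steps 15.1 to 15.7 follows; measure_time, perturb and within_second_constraint are problem-specific callables assumed by this sketch (measuring A×B on the accelerator, generating a neighbouring decomposition combination in the limited exploration space, and checking the second constraint, respectively):

```python
import math
import random

def anneal(S0, measure_time, perturb, within_second_constraint):
    """Constraint-based simulated annealing over the limited exploration space."""
    T = 100.0                                     # step 15.1: initial temperature
    S, t_S = S0, measure_time(S0)
    rejected = 0                                  # consecutive rejected solutions
    while T > 1.0:                                # step 15.6: stop when T <= 1
        for _ in range(100):                      # step 15.2: 100 trials per temperature
            S_new = perturb(S)                    # step 15.3: generate a new solution
            if not within_second_constraint(S_new):
                continue                          # keep the search inside the constraint
            t_new = measure_time(S_new)
            dT = t_new - t_S                      # step 15.4: increment
            # step 15.5: accept improvements; accept worse moves with exp(-dT/T)
            if dT < 0 or random.random() < math.exp(-dT / T):
                S, t_S = S_new, t_new
                rejected = 0
            else:
                rejected += 1
                if rejected >= 20:                # step 15.6: 20 consecutive rejections
                    return S, t_S
        T *= 0.95                                 # step 15.7: cool down
    return S, t_S
```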
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An operator performance optimization method based on accelerator constraint is characterized by comprising the following steps:
step 1, obtaining a first parameter of a target deep learning accelerator, wherein the first parameter is the size of the minimum calculation unit of matrix multiplication supported by the target deep learning accelerator, recorded as (M, N, K); taking the first parameter together with the product of M, N and K as a first constraint; constructing a test matrix group set composed of test matrix groups; decomposing each test matrix group under the first constraint into test sub-matrix groups g(m, n, k) to form a plurality of matrix group combinations G(M, N, K), all matrix group combinations forming a test matrix group combination set corresponding to the test matrix group set; completing, with the target deep learning accelerator, the matrix multiplication of all test sub-matrix groups in each matrix group combination of each test matrix group and recording the operation time to obtain the combination operation time of each matrix group combination; taking the minimum combination operation time and the maximum set scale of the matrix group combination corresponding to the minimum combination operation time as a second constraint;
step 2, for the input arrays A and B to be operated on, the dimension of array A is d_A, the dimension of array B is d_B, and d_A and d_B are both integers not less than 2; if d_A and d_B are both equal to 2 and the numbers of rows or columns of array A and array B are equal, executing step 3; if one of d_A and d_B equals 2 and the other equals 3 and the numbers of rows or columns of array A and array B are equal, executing step 4; if d_A and d_B are both not less than 3, executing step 5;
step 3, representing the matrix multiplication of A and B as mma(num, O1, O2) with num = 1, O1 = A, O2 = B and O1_c = O2_r, wherein num is the number of groups participating in the matrix operation, O1 and O2 are the matrices participating in the operation, O1 is represented as (O1_r, O1_c) and O2 as (O2_r, O2_c); executing step 6;
step 4, decomposing the array with dimension 3 into a plurality of matrices and a vector, and executing step 6;
step 5, letting A be the input array A = (n, ci, wi, hi) and B be the convolution kernel B = (co, ci, kw, kh), where n is the number of convolution batches, ci the number of input channels, wi the input width, hi the input height, co the number of output channels, kw the convolution kernel width and kh the convolution kernel height; if the dimension of A or B is 2x+1, decomposing it into x−1 matrices and a matrix A_ch' or B_ch' obtained by dimension-reduction deformation of the array formed by the last three elements; if the dimension of A or B is 2x, decomposing it into x matrices and deforming the last matrix into A_ch' or B_ch'; then representing the matrix multiplication of A and B as mma(num, O1, O2) with num = n×ci, O1 = A_ch', O2 = B_ch' and O1_c = O2_r, and executing step 6; x is any natural number;
step 6, for the matrices O1 (O1_r, O1_c) and O2 (O2_r, O2_c) in mma(num, O1, O2), taking O1_r, O1_c and O2_r, O2_c as the values to be compared; if any value to be compared is greater than the lower threshold value, executing step 7; if none of the values to be compared is greater than the lower threshold value, adding the lower threshold value to the subset LX of the vector X corresponding to each value to be compared and then executing step 10;
step 7, executing step 8 if the value to be compared is larger than the upper threshold value, otherwise executing step 10;
step 8, performing modulo-2 division on the vector corresponding to the value to be compared; if the result is 0, setting the value to be compared to the division result, recording the number of modulo-2 divisions, and executing step 7; otherwise, executing step 9;
step 9, performing a zero padding operation on the vector corresponding to the value to be compared, setting the current value to be compared to the next value to be compared, and executing step 8;
step 10, decomposing the vector X corresponding to the value to be compared: letting 2^i be a value greater than or equal to the first minimum value and less than or equal to the first maximum value, applying X = X − 2^i until X is less than or equal to the lower threshold value, recording all the values of i, storing the results with base 2 and the values of i as exponents into the subset LX of the vector X as dimensions, and executing step 6; after all vectors are decomposed, executing step 11;
step 11, arranging and combining the numerical values in all vector subsets LX to obtain num sets GA(M, N, K) of matrix group combinations; the num GA(M, N, K) define a limited exploration space, and optimization is performed in the limited exploration space using the constraint-based simulated annealing algorithm with the second constraint as the constraint, obtaining the optimal decomposition combination of the input arrays A and B that satisfies the optimal calculation time.
2. The operator performance optimization method according to claim 1, wherein the method for constructing the test matrix group set composed of test matrix groups in step 1 is as follows: taking the maximum value among M, N and K in the first constraint as the minimum value of a single dimension, recorded as a first minimum value MIN, and setting the maximum value of a single dimension as a first maximum value MAX; selecting values that are powers of 2 between the first minimum value and the first maximum value to construct the test matrix groups (m_TA, n_TB, k_TA); all test matrix groups form the test matrix group set.
3. The method of optimizing operator performance according to claim 1, wherein the combination operation time in step 1 is a sum of matrix multiplication operation times of all test sub-matrix groups in a matrix group combination.
4. The method for optimizing operator performance according to claim 1, wherein the decomposition of the array with dimension 3 into a plurality of matrices and a vector in step 4 is: assuming A is a 3-dimensional array A = (a_1, a_2, a_3) and B is a 2-dimensional array B = (a_4, a_1), A is decomposed into A' and A'', where A' = (a_1, a_2) and A'' = (a_3), and the matrix multiplication of A and B is denoted mma(num, O1, O2) with num = a_3, O1 = A', O2 = B and O1_c = O2_r.
5. The method for optimizing operator performance according to claim 1, wherein the method of deforming the last matrix into A_ch' or B_ch' in step 5 is as follows:
Let the stride of the convolution operation be s = 1 and the convolution padding p = 0; the output length obtained by the two-dimensional convolution is then wo = (wi − kw + 2p)/s + 1 and the width is ho = (hi − kh + 2p)/s + 1. For each channel's calculation, n = 1, ci = 1, A_ch = (1, wi, hi), B_ch = (co, 1, kw, kh); A_ch is reduced to a matrix C = (wi, hi), C is further deformed into A_ch' = (kw×kh, wo×ho), and B_ch is reduced to B_ch' = (co, kw×kh).
CN202311644953.7A 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint Active CN117349585B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311644953.7A CN117349585B (en) 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311644953.7A CN117349585B (en) 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint

Publications (2)

Publication Number Publication Date
CN117349585A true CN117349585A (en) 2024-01-05
CN117349585B CN117349585B (en) 2024-02-23

Family

ID=89361719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311644953.7A Active CN117349585B (en) 2023-12-04 2023-12-04 Operator performance optimization method based on accelerator constraint

Country Status (1)

Country Link
CN (1) CN117349585B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
US20200026992A1 (en) * 2016-09-29 2020-01-23 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
CN111667051A (en) * 2020-05-27 2020-09-15 上海赛昉科技有限公司 Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN114781629A (en) * 2022-04-06 2022-07-22 合肥工业大学 Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN114925823A (en) * 2022-05-12 2022-08-19 南京航空航天大学 Convolutional neural network compression method and edge side FPGA accelerator
CN115994861A (en) * 2021-10-20 2023-04-21 珠海一微半导体股份有限公司 Image processing method and chip based on convolution algorithm

Also Published As

Publication number Publication date
CN117349585B (en) 2024-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant