CN114707114A - Blocking method and device, convolution operation method and device, and storage medium - Google Patents

Blocking method and device, convolution operation method and device, and storage medium

Info

Publication number
CN114707114A
Authority
CN
China
Prior art keywords
matrix
dimension
size
parameter
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210440010.1A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202210440010.1A priority Critical patent/CN114707114A/en
Publication of CN114707114A publication Critical patent/CN114707114A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/153 Multidimensional correlation or convolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/06 Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
    • G06F5/08 Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor having a sequence of storage locations, the intermediate ones not being accessible for either enqueue or dequeue operations, e.g. using a shift register
    • G06F5/085 Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor having a sequence of storage locations, the intermediate ones not being accessible for either enqueue or dequeue operations, e.g. using a shift register in which the data is recirculated
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/02 Comparing digital values
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A blocking method applied to a matrix multiplication operation, a method for a convolution operation, a blocking apparatus applied to a matrix multiplication operation, an apparatus for a convolution operation, and a computer-readable storage medium. The blocking method is applied to a matrix multiplication operation that implements a multiplication between a first matrix and a second matrix, and includes the following steps: determining input parameters based on the first matrix and the second matrix, wherein the input parameters include the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size of each data element in the first matrix and in the second matrix, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, and the size of an input buffer; and obtaining output parameters based on the input parameters, wherein the output parameters include a cyclic order and blocking parameters corresponding to the first matrix and the second matrix.

Description

Blocking method and device, convolution operation method and device, and storage medium
Technical Field
Embodiments of the present disclosure relate to a blocking method applied to a matrix multiplication operation, a method for a convolution operation, a blocking apparatus applied to a matrix multiplication operation, an apparatus for a convolution operation, and a computer-readable storage medium.
Background
Neural networks (NNs), a class of machine learning models in the field of artificial intelligence (AI), have attracted much attention in recent years. AI accelerators are a specialized class of hardware accelerators or computer systems aimed at accelerating artificial intelligence applications, particularly artificial neural networks, machine vision, and machine learning. Typical applications of AI accelerators include robotics, the Internet of Things, and other data-intensive or sensor-driven tasks. Currently, in addition to general-purpose processors such as the central processing unit (CPU) and the graphics processing unit (GPU), AI accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been developed. FPGA-based AI accelerators offer a short development cycle and flexible programmability together with high computational parallelism and relatively moderate power consumption; ASIC-based AI accelerators demand a long development time and high labor and material costs, but can best realize a customized accelerator that meets performance and power-consumption requirements.
Disclosure of Invention
At least one embodiment of the present disclosure provides a blocking method applied to a matrix multiplication operation, where the matrix multiplication operation implements a multiplication between a first matrix and a second matrix, and the blocking method includes: determining input parameters based on the first matrix and the second matrix, wherein the input parameters include the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size of each data element in the first matrix and in the second matrix, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, and the size of an input buffer; and obtaining output parameters based on the input parameters, wherein the output parameters include a cyclic order and blocking parameters corresponding to the first matrix and the second matrix.
For example, in the blocking method provided by an embodiment of the present disclosure, the dimension parameters of the first matrix include a first matrix dimension and a second matrix dimension, the dimension parameters of the second matrix include the second matrix dimension and a third matrix dimension, and the blocking parameters include a first dimension outer layer blocking parameter related to the first matrix dimension, a first dimension inner layer blocking parameter related to the first dimension outer layer blocking parameter, a second dimension blocking parameter related to the second matrix dimension, a third dimension outer layer blocking parameter related to the third matrix dimension, and a third dimension inner layer blocking parameter related to the third dimension outer layer blocking parameter; the cyclic order indicates the order of the first dimension outer layer blocking parameter, the first dimension inner layer blocking parameter, the second dimension blocking parameter, the third dimension outer layer blocking parameter, and the third dimension inner layer blocking parameter.
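The five blocking parameters and the cyclic order described above correspond to a five-level blocked loop nest around the matrix multiplication. The sketch below is only illustrative: the parameter names, the particular loop order, and the tail handling are assumptions, not the scheme fixed by the disclosure.

```python
import numpy as np

def tiled_matmul(A, B, m_outer, m_inner, k_block, n_outer, n_inner):
    """Blocked multiplication of A (M x K) by B (K x N).

    m_outer / m_inner play the role of the first dimension outer/inner
    layer blocking parameters, k_block the second dimension blocking
    parameter, and n_outer / n_inner the third dimension outer/inner
    layer blocking parameters. The loop order shown is one possible
    cyclic order.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for mo in range(0, M, m_outer):
        for no in range(0, N, n_outer):
            for k0 in range(0, K, k_block):
                k1 = min(k0 + k_block, K)
                for mi in range(mo, min(mo + m_outer, M), m_inner):
                    m1 = min(mi + m_inner, mo + m_outer, M)
                    for ni in range(no, min(no + n_outer, N), n_inner):
                        n1 = min(ni + n_inner, no + n_outer, N)
                        # accumulate one inner tile of C
                        C[mi:m1, ni:n1] += A[mi:m1, k0:k1] @ B[k0:k1, ni:n1]
    return C
```

Reordering the five loops changes which operand tiles stay resident in the input buffers, which is what the cyclic order output parameter controls.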
For example, in a blocking method provided by an embodiment of the present disclosure, the input buffer includes a first input buffer for buffering data in the first matrix and a second input buffer for buffering data in the second matrix, the size of the input buffer includes the size of the first input buffer and the size of the second input buffer, and obtaining the output parameters based on the input parameters includes: determining a first matrix size corresponding to the first matrix based on the dimension parameters of the first matrix and the data size; determining a second matrix size corresponding to the second matrix based on the dimension parameters of the second matrix and the data size; comparing the first matrix size with the size of the first input buffer and comparing the second matrix size with the size of the second input buffer to determine a first comparison result; and determining the output parameters based on the first comparison result.
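The size comparison in this step can be sketched as follows; treating all sizes as byte counts and returning a pair of booleans are assumptions made for illustration.

```python
def first_comparison(dims_a, dims_b, data_size, buf_a_size, buf_b_size):
    """Determine whether each matrix fits wholly in its input buffer.

    dims_a = (first matrix dimension, second matrix dimension),
    dims_b = (second matrix dimension, third matrix dimension);
    data_size and the buffer sizes are in bytes (assumed units).
    """
    first_matrix_size = dims_a[0] * dims_a[1] * data_size
    second_matrix_size = dims_b[0] * dims_b[1] * data_size
    return (first_matrix_size <= buf_a_size,
            second_matrix_size <= buf_b_size)
```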
For example, in a blocking method provided by an embodiment of the present disclosure, determining the output parameter based on the first comparison result includes: determining the second dimension blocking parameter as the second matrix dimension in response to the first comparison result indicating that the first matrix size is equal to or smaller than the size of the first input buffer and/or the second matrix size is equal to or smaller than the size of the second input buffer; comparing the first matrix dimension and the third matrix dimension to determine a second comparison result; determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the second comparison result.
For example, in the blocking method provided by an embodiment of the present disclosure, determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the second comparison result includes: in response to the second comparison result indicating that the first matrix dimension is less than the third matrix dimension: determining the first dimension outer layer blocking parameter as the first matrix dimension, and determining the third dimension outer layer blocking parameter based on the data size, the dimension parameters of the second matrix, and the size of the second input buffer; in response to the second comparison result indicating that the first matrix dimension is greater than or equal to the third matrix dimension: determining the third dimension outer layer blocking parameter as the third matrix dimension, and determining the first dimension outer layer blocking parameter based on the data size, the dimension parameters of the first matrix, and the size of the first input buffer.
For example, in a blocking method provided by an embodiment of the present disclosure, determining the first dimension outer layer blocking parameter based on the data size, the dimension parameters of the first matrix, and the size of the first input buffer includes: determining a first intermediate blocking parameter based on the data size, the second matrix dimension, and the size of the first input buffer; determining the first dimension outer layer blocking parameter as the first intermediate blocking parameter in response to the first intermediate blocking parameter being less than the first matrix dimension, and determining the first dimension outer layer blocking parameter as the first matrix dimension in response to the first intermediate blocking parameter being greater than or equal to the first matrix dimension.
For example, in a blocking method provided by an embodiment of the present disclosure, determining the third dimension outer layer blocking parameter based on the data size, the dimension parameters of the second matrix, and the size of the second input buffer includes: determining a second intermediate blocking parameter based on the data size, the second matrix dimension, and the size of the second input buffer; determining the third dimension outer layer blocking parameter as the second intermediate blocking parameter in response to the second intermediate blocking parameter being less than the third matrix dimension, and determining the third dimension outer layer blocking parameter as the third matrix dimension in response to the second intermediate blocking parameter being greater than or equal to the third matrix dimension.
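The claims do not spell out the arithmetic behind the intermediate blocking parameters. One natural reading, assumed here for illustration, is the number of rows (or columns) of length equal to the second matrix dimension that fit in the corresponding input buffer, clamped to the matrix dimension:

```python
def outer_blocking_param(matrix_dim, shared_dim, data_size, buffer_size):
    """Sketch of the intermediate-parameter rule (an assumption): count
    how many length-`shared_dim` vectors of `data_size`-byte elements fit
    in the buffer, then clamp that count to the matrix dimension, as the
    two claims above require."""
    intermediate = buffer_size // (shared_dim * data_size)
    return intermediate if intermediate < matrix_dim else matrix_dim
```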
For example, in a blocking method provided by an embodiment of the present disclosure, determining the output parameter based on the first comparison result includes: comparing the first matrix dimension and the third matrix dimension to determine a second comparison result in response to the first comparison result indicating that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer; and determining an outer loop flag based on the second comparison result.
For example, in the blocking method provided by an embodiment of the present disclosure, determining the outer loop flag based on the second comparison result includes: in response to the second comparison result indicating that the first matrix dimension is less than the third matrix dimension, determining the outer loop flag as a first flag, wherein the outer loop flag indicates the order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the cyclic order, and the first flag indicates that the first dimension outer layer blocking parameter precedes the third dimension outer layer blocking parameter in the cyclic order; and in response to the second comparison result indicating that the first matrix dimension is greater than or equal to the third matrix dimension, determining the outer loop flag as a second flag, wherein the second flag indicates that the first dimension outer layer blocking parameter follows the third dimension outer layer blocking parameter in the cyclic order.
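The flag selection reduces to a single comparison between the two output dimensions; the flag values below are illustrative placeholders, not names from the disclosure.

```python
FIRST_FLAG = "first_dim_outer_first"   # first dimension outer block precedes
SECOND_FLAG = "third_dim_outer_first"  # third dimension outer block precedes

def outer_loop_flag(first_matrix_dim, third_matrix_dim):
    """Pick which outer blocking loop runs first in the cyclic order:
    per the claim, the smaller of the two output dimensions goes first."""
    if first_matrix_dim < third_matrix_dim:
        return FIRST_FLAG
    return SECOND_FLAG
```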
For example, in the blocking method provided by an embodiment of the present disclosure, determining the output parameter based on the first comparison result further includes: determining the order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the cyclic order based on the outer loop flag.
For example, in the blocking method provided by an embodiment of the present disclosure, the input parameters further include a maximum size of an accumulation buffer and a hardware computing power, and determining the output parameter based on the first comparison result includes: in response to the first comparison result indicating that the first matrix size is larger than the size of the first input buffer and the second matrix size is larger than the size of the second input buffer: in response to not splitting the second matrix dimension, calculating a first utilization based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, and the size of the second input buffer; in response to splitting the second matrix dimension, taking the maximum size of the accumulation buffer as a current calculation size, and calculating a first splitting utilization based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, the size of the second input buffer, and the current calculation size; and, when the first utilization is greater than or equal to the first splitting utilization: determining the second dimension blocking parameter as the second matrix dimension; determining the first dimension outer layer blocking parameter based on the second matrix dimension, the data size, and the size of the first input buffer; and determining the third dimension outer layer blocking parameter based on the second matrix dimension, the data size, and the size of the second input buffer.
For example, in the blocking method provided by an embodiment of the present disclosure, determining the output parameter based on the first comparison result further includes: when the first utilization is less than the first splitting utilization: determining an adjusted size based on an adjustment coefficient and the current calculation size, and iteratively calculating a utilization by taking the adjusted size as the current calculation size, until a utilization smaller than the first splitting utilization is obtained as a target splitting utilization; determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, and the current calculation size corresponding to the target splitting utilization; and determining the second dimension blocking parameter based on the size of the first input buffer, the size of the second input buffer, the first dimension outer layer blocking parameter, the third dimension outer layer blocking parameter, and the data size.
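The iterative shrinking of the current calculation size can be sketched as below. The utilization model itself is not given in the claims, so it is passed in as a function, and the halving adjustment coefficient is an assumption.

```python
def choose_split_size(max_accum_size, first_split_util, util_fn, adjust=0.5):
    """Start from the maximum accumulation-buffer size and repeatedly
    apply the adjustment coefficient until the recomputed splitting
    utilization falls below `first_split_util`; that utilization is the
    target splitting utilization, and the size that produced it is the
    current calculation size used for the outer blocking parameters."""
    size = max_accum_size
    util = util_fn(size)
    while util >= first_split_util:
        size = max(1, int(size * adjust))
        util = util_fn(size)
    return size, util
```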
For example, in the blocking method provided by an embodiment of the present disclosure, determining the output parameter based on the first comparison result further includes: when the first utilization is greater than or equal to the first splitting utilization: determining an outer loop flag as a first flag in response to the loading bandwidth corresponding to the first matrix being smaller than the loading bandwidth corresponding to the second matrix, wherein the outer loop flag indicates the order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the cyclic order, and the first flag indicates that the first dimension outer layer blocking parameter precedes the third dimension outer layer blocking parameter in the cyclic order; and determining the outer loop flag as a second flag in response to the loading bandwidth corresponding to the first matrix being greater than or equal to the loading bandwidth corresponding to the second matrix, wherein the second flag indicates that the first dimension outer layer blocking parameter follows the third dimension outer layer blocking parameter in the cyclic order.
For example, in the blocking method provided by an embodiment of the present disclosure, determining the output parameter based on the first comparison result further includes: when the first utilization is smaller than the first splitting utilization, determining the outer loop flag as an outer layer preset flag, wherein the outer loop flag indicates the order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the cyclic order.
For example, in the blocking method provided by an embodiment of the present disclosure, the input parameters further include a synchronization granularity and a first block parameter and a second block parameter corresponding to a minimum data block, and obtaining the output parameters based on the input parameters includes: determining the first dimension inner layer blocking parameter and the third dimension inner layer blocking parameter based on the first block parameter, the second block parameter, the first dimension outer layer blocking parameter, the synchronization granularity, and the data size.
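How the inner layer blocking parameters are derived is not made explicit here. The sketch below assumes one plausible reading: the inner parameter is the largest multiple of the minimum data-block dimension that fits under both the outer parameter and one synchronization unit. Both the rule and the names are hypothetical.

```python
def inner_blocking_param(outer_param, min_block_dim, sync_granularity, data_size):
    """Hypothetical rule: take the largest multiple of the minimum
    data-block dimension `min_block_dim` that does not exceed either the
    outer layer blocking parameter or the number of elements in one
    synchronization unit (sync_granularity bytes / data_size bytes per
    element); at least one minimum block is always used."""
    limit = min(outer_param, sync_granularity // data_size)
    multiples = max(1, limit // min_block_dim)
    return multiples * min_block_dim
```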
For example, in the blocking method provided by an embodiment of the present disclosure, the input parameters further include a hardware computing power, and the output parameters further include at least one of: a dimension splitting identifier, an accumulation buffer identifier, a target size of the accumulation buffer, a first matrix loading count corresponding to the first matrix, a second matrix loading count corresponding to the second matrix, and a target utilization of the matrix multiplication operation, wherein the dimension splitting identifier indicates whether the second matrix dimension needs to be split, the accumulation buffer identifier indicates whether an accumulation buffer needs to be added, and the target size of the accumulation buffer represents the size of the accumulation buffer required when the accumulation buffer needs to be added; determining the output parameter based on the first comparison result includes: in response to the first comparison result indicating that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer: determining that the first matrix loading count and the second matrix loading count are both 1; determining the dimension splitting identifier as a first splitting sub-identifier, the accumulation buffer identifier as a first buffer sub-identifier, and the target size of the accumulation buffer as 0, wherein the first splitting sub-identifier indicates that the second matrix dimension does not need to be split, and the first buffer sub-identifier indicates that no accumulation buffer needs to be added; and determining the target utilization of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix.
For example, in the blocking method provided by an embodiment of the present disclosure, the output parameters further include at least one of: a dimension splitting identifier, an accumulation buffer identifier, a target size of the accumulation buffer, a first matrix loading count corresponding to the first matrix, a second matrix loading count corresponding to the second matrix, and a target utilization of the matrix multiplication operation, wherein the dimension splitting identifier indicates whether the second matrix dimension needs to be split, the accumulation buffer identifier indicates whether an accumulation buffer needs to be added, and the target size of the accumulation buffer represents the size of the accumulation buffer required when the accumulation buffer needs to be added; determining the output parameter based on the first comparison result includes: when the first utilization is greater than or equal to the first splitting utilization: determining the dimension splitting identifier as a first splitting sub-identifier, the accumulation buffer identifier as a first buffer sub-identifier, and the target size of the accumulation buffer as 0, wherein the first splitting sub-identifier indicates that the second matrix dimension does not need to be split, and the first buffer sub-identifier indicates that no accumulation buffer needs to be added; in response to the loading bandwidth corresponding to the first matrix being less than the loading bandwidth corresponding to the second matrix, determining the first matrix loading count as 1 and determining the second matrix loading count based on the first dimension outer layer blocking parameter and the first matrix dimension; in response to the loading bandwidth corresponding to the first matrix being greater than or equal to the loading bandwidth corresponding to the second matrix, determining the first matrix loading count based on the third dimension outer layer blocking parameter and the third matrix dimension, and determining the second matrix loading count as 1; and determining the target utilization of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix; and when the first utilization is less than the first splitting utilization: determining the dimension splitting identifier as a second splitting sub-identifier, the accumulation buffer identifier as a second buffer sub-identifier, and the target size of the accumulation buffer as the current calculation size corresponding to the target splitting utilization, wherein the second splitting sub-identifier indicates that the second matrix dimension needs to be split, and the second buffer sub-identifier indicates that an accumulation buffer needs to be added; determining the first matrix loading count and the second matrix loading count based on the current calculation size corresponding to the target splitting utilization, the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix; and determining the target utilization of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the first matrix loading count, and the second matrix loading count.
At least one embodiment of the present disclosure also provides a method for convolution operation, wherein the convolution operation represents an operation on an input image by using a convolution kernel, the method including: determining a first convolution operation matrix and a second convolution operation matrix based on convolution input parameters of the convolution operation, wherein the convolution input parameters comprise parameters corresponding to the convolution kernel and parameters corresponding to the input image; the blocking method according to any embodiment of the present disclosure is performed with the first convolution operation matrix as a first matrix and the second convolution operation matrix as a second matrix.
For example, in a method provided by an embodiment of the present disclosure, the parameters corresponding to the convolution kernel include a first kernel size, a second kernel size, a number of input channels, and a number of convolution kernels; the parameters corresponding to the input image include a first image size, a second image size, and a number of input images; the first convolution operation matrix has a first convolution matrix dimension and a second convolution matrix dimension, and the second convolution operation matrix has the second convolution matrix dimension and a third convolution matrix dimension; and determining the first convolution operation matrix and the second convolution operation matrix based on the convolution input parameters of the convolution operation includes: determining the first convolution matrix dimension based on the number of convolution kernels; determining the second convolution matrix dimension based on the first kernel size, the second kernel size, and the number of input channels; and determining the third convolution matrix dimension based on the number of input images, the first image size, and the second image size.
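This mapping is the standard im2col-style lowering of a convolution to a matrix multiplication. A minimal sketch, assuming (as the claim's wording suggests but does not fix) that the dimensions combine by simple products:

```python
def conv_matrix_dims(num_kernels, kernel_h, kernel_w, in_channels,
                     num_images, image_h, image_w):
    """First convolution matrix: (number of kernels) x (kernel area x
    input channels); second convolution matrix: (kernel area x input
    channels) x (image count x image area). The two matrices share the
    second convolution matrix dimension, so their product is defined."""
    m = num_kernels                        # first convolution matrix dimension
    k = kernel_h * kernel_w * in_channels  # second (shared) matrix dimension
    n = num_images * image_h * image_w     # third convolution matrix dimension
    return (m, k), (k, n)
```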
At least one embodiment of the present disclosure further provides a blocking apparatus applied to a matrix multiplication operation, and including a memory and a processor, wherein the memory stores computer-executable instructions adapted to be executed by the processor, and the computer-executable instructions, when executed by the processor, perform one or more steps of a blocking method according to any one of the embodiments of the present disclosure.
At least one embodiment of the present disclosure also provides an apparatus for convolution operations, comprising a memory and a processor, wherein the memory stores computer-executable instructions adapted to be executed by the processor, and the computer-executable instructions, when executed by the processor, perform one or more steps of a method for convolution operations according to any one of the embodiments of the present disclosure.
At least one embodiment of the present disclosure also provides a computer-readable storage medium non-transitorily storing computer-executable instructions that, when executed by a computer, perform one or more steps of a blocking method according to any one of the embodiments of the present disclosure, or one or more steps of a method for convolution operations according to any one of the embodiments of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic diagram of a system model with an AI accelerator provided by an embodiment of the disclosure;
fig. 2 is a schematic flow chart of a blocking method according to at least one embodiment of the disclosure;
fig. 3A is a schematic diagram of a first matrix and a second matrix provided by some embodiments of the present disclosure;
FIG. 3B is a schematic diagram of the matrix block C0 shown in FIG. 3A;
FIG. 4 is a schematic diagram of another first matrix and second matrix provided by some embodiments of the present disclosure;
fig. 5 is a schematic flow chart of a blocking method according to some embodiments of the present disclosure;
FIG. 6 is a schematic flow chart diagram of a method for convolution operations provided by some embodiments of the present disclosure;
fig. 7 is a schematic block diagram of a block device according to at least one embodiment of the present disclosure;
fig. 8 is a schematic block diagram of an apparatus for convolution operations according to at least one embodiment of the present disclosure;
fig. 9 is a schematic diagram of a computer-readable storage medium according to at least one embodiment of the disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of some known functions and components have been omitted from the present disclosure.
Fig. 1 is a schematic diagram of a system model with an AI accelerator according to an embodiment of the present disclosure. As shown in fig. 1, the system model with the AI accelerator includes an AI accelerator, an External memory, an Internal memory, and a processor core (e.g., a GPU/GPGPU (General-Purpose Graphics Processing Unit) core). The AI accelerator includes an input buffer (e.g., a near-core input buffer), a Matrix Multiplication Unit, and an Accumulator. The processor core includes a Register File and a Single Instruction Multiple Data (SIMD) processor. The register file may also be accessed by the AI core (i.e., the AI accelerator). A SIMD processor can control multiple threads to perform the same operation simultaneously by executing one instruction.
For matrix multiplication, data (the data in the two matrices matrixA and matrixB of the matrix multiplication) is loaded into the input buffer from the external memory or the internal memory. Because the storage space of the input buffer is limited, all data in the two matrices matrixA and matrixB cannot be loaded at the same time; the data can only be loaded in batches (blocks). Matrix multiplication is then performed block by block in the Matrix Multiplication Unit, the results computed by the Matrix Multiplication Unit are accumulated by the Accumulator, and a complete result of the operation on the two matrices matrixA and matrixB is obtained after multiple cycles. Finally, the computation result is transmitted to the processor core.
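The block-wise load/multiply/accumulate flow described above can be sketched as follows (a minimal pure-Python sketch; the block sizes and loop structure are illustrative assumptions, not the specific scheme claimed by this disclosure):

```python
def matmul_blocked(A, B, bm, bn, bk):
    """Multiply A (M x K) by B (K x N) block by block, accumulating
    partial products the way an accumulator unit would."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, bm):              # output-row blocks
        for n0 in range(0, N, bn):          # output-column blocks
            for k0 in range(0, K, bk):      # accumulate along K
                # Only one (bm x bk) tile of A and one (bk x bn) tile of B
                # need to reside in the input buffer at a time.
                for m in range(m0, min(m0 + bm, M)):
                    for n in range(n0, min(n0 + bn, N)):
                        C[m][n] += sum(A[m][k] * B[k][n]
                                       for k in range(k0, min(k0 + bk, K)))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_blocked(A, B, 1, 1, 1))  # [[19.0, 22.0], [43.0, 50.0]]
```

Whatever block sizes are chosen, the accumulated result equals the unblocked product; the blocking only changes how much data must be resident at once and in what order it is loaded.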
Different blocking (Tiling) methods result in different cycle costs (i.e., performance) and sometimes result in repeatedly loading data, even making loading data a performance bottleneck. Therefore, it is necessary to find a suitable chunk size (tiling size) and cycle order (loop sequence) to obtain the optimal performance; the optimal tiling size and loop sequence will also differ for matrices of different shapes and where the data of the matrices are located in different locations (external memory or internal memory).
Currently, methods for determining the tiling size and loop sequence include: first, manual case-by-case tuning for a specific input condition; second, fixing the loop sequence and then searching for an optimal tiling shape; third, direct brute-force search, that is, traversing all possible tiling shapes and loop sequences and calculating the cycle cost of each combination to find the optimal one.
The case-by-case manual optimization method is a customized method and cannot obtain the tiling shape and loop sequence with optimal performance for all different input scenarios; fixing the loop sequence and then searching for the optimal tiling shape reduces the search space, so the tiling shape and loop sequence with optimal performance cannot always be found; and the brute-force method of traversing all possible combinations may incur a large, even unacceptable, time overhead.
At least one embodiment of the present disclosure provides a blocking method applied to a matrix multiplication operation. The matrix multiplication operation is used for realizing the multiplication operation between a first matrix and a second matrix, and the blocking method comprises the following steps: determining input parameters based on the first matrix and the second matrix, wherein the input parameters comprise dimension parameters of the first matrix, dimension parameters of the second matrix, data size of each data in each of the first matrix and the second matrix, loading bandwidth corresponding to the first matrix, loading bandwidth corresponding to the second matrix and size of an input buffer; based on the input parameters, output parameters are obtained, wherein the output parameters comprise a cyclic order and blocking parameters corresponding to the first matrix and the second matrix.
In the blocking method provided by the embodiments of the present disclosure, for any input scenario (different input scenarios have different input parameters), the blocking parameters (tiling shape) and loop sequence with optimal performance can be determined based on the input parameters of that scenario, which saves the time overhead of determining the blocking parameters and the loop sequence and improves efficiency and operational performance.
The blocking method provided by the present disclosure can be applied to the field of deep neural networks or high-performance computing. For computation scenarios of matrix multiplication or convolution operations based on an AI accelerator, the blocking method provided by the embodiments of the present disclosure can fully exploit the hardware computing power and improve computational performance across various input scenarios and data of different sizes.
At least one embodiment of the present disclosure also provides a method for convolution operation, a blocking apparatus applied to matrix multiplication operation, an apparatus for convolution operation, and a computer-readable storage medium.
The blocking method provided by the embodiment of the disclosure can be applied to the blocking device provided by the embodiment of the disclosure, and the blocking device can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, and the like, and the mobile terminal may be a hardware device such as a smart phone and a tablet computer.
Note that, in the present disclosure, the unit of "size" may be a bit (bit), but the present disclosure is not limited thereto, and the unit of "size" may be set according to actual circumstances.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 2 is a schematic flow chart of a blocking method according to at least one embodiment of the present disclosure.
For example, the blocking method may be applied to a matrix multiplication operation, e.g., a matrix multiplication operation for implementing a multiplication operation between a first matrix and a second matrix. For example, the matrix multiplication is used to perform multiplication on the first matrix and the second matrix to obtain a third matrix, where data in the third matrix is a plurality of operation results of the matrix multiplication.
For example, the blocking method can be applied to each matrix multiplication operation in the BERT Large network; compared with the case-by-case manual optimization method, the blocking method achieves the same optimal theoretical performance for all input scenarios (cases) while using fewer local registers.
As shown in fig. 2, the blocking method provided by the embodiments of the present disclosure includes steps S10 to S20. In step S10, input parameters are determined based on the first matrix and the second matrix; in step S20, output parameters are obtained based on the input parameters.
For example, in step S10, the input parameters include a dimension parameter of the first matrix, a dimension parameter of the second matrix, a data size of each data in each of the first matrix and the second matrix, a loading bandwidth corresponding to the first matrix, a loading bandwidth corresponding to the second matrix, and a size of the input buffer.
For example, in step S20, the output parameters include the round robin order and the blocking parameters corresponding to the first matrix and the second matrix.
For example, the dimension parameters of the first matrix include a first matrix dimension and a second matrix dimension, and the dimension parameters of the second matrix include the second matrix dimension and a third matrix dimension. The first matrix dimension, the second matrix dimension, and the third matrix dimension are positive integers; the type of the first matrix dimension is different from that of the second matrix dimension, and the type of the second matrix dimension is different from that of the third matrix dimension. For the first matrix, the first matrix dimension represents the number of rows and the second matrix dimension represents the number of columns; for the second matrix, the second matrix dimension represents the number of rows and the third matrix dimension represents the number of columns. For example, if the first matrix is an M × K matrix, the first matrix dimension of the first matrix is M (e.g., M = 8 in some examples) and the second matrix dimension of the first matrix is K (e.g., K = 4 in some examples); if the second matrix is a K × N matrix, the second matrix dimension of the second matrix is K (e.g., K = 4 in some examples) and the third matrix dimension of the second matrix is N (e.g., N = 8 in some examples). It should be noted that, although the second matrix dimension represents the number of columns of the first matrix and the number of rows of the second matrix, the second matrix dimension is the same for both matrices; that is, the number of columns of the first matrix and the number of rows of the second matrix are equal.
For example, the blocking parameters include a first dimension outer blocking parameter related to a first matrix dimension, a first dimension inner blocking parameter related to the first dimension outer blocking parameter, a second dimension blocking parameter related to a second matrix dimension, a third dimension outer blocking parameter related to a third matrix dimension, and a third dimension inner blocking parameter related to the third dimension outer blocking parameter.
For example, the first dimension outer layer blocking parameter, the first dimension inner layer blocking parameter, the second dimension blocking parameter, the third dimension outer layer blocking parameter, and the third dimension inner layer blocking parameter are all positive integers.
For example, the cycle order is used to indicate the order of the first dimension outer layer blocking parameter, the first dimension inner layer blocking parameter, the second dimension blocking parameter, the third dimension outer layer blocking parameter, and the third dimension inner layer blocking parameter. For example, in some embodiments, the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter are located before the second dimension blocking parameter and the first dimension inner layer blocking parameter and the third dimension inner layer blocking parameter are located after the second dimension blocking parameter in a round robin order, which in one example is: a first dimension outer layer blocking parameter, a third dimension outer layer blocking parameter, a second dimension blocking parameter, a first dimension inner layer blocking parameter, and a third dimension inner layer blocking parameter.
It should be noted that the blocking method provided by the embodiments of the present disclosure follows the following guidelines: first, try not to load matrix data repeatedly, to prevent repeated loading from becoming a performance bottleneck; second, avoid splitting the second matrix dimension (i.e., the K dimension) when possible: if doing so causes no reloading of data, the accumulation should be performed in the General Matrix Multiplication (GEMM) unit of the AI accelerator; third, if data accumulation can be realized with an accumulation buffer outside the GEMM unit, split the second matrix dimension so as to avoid or minimize repeated loading of matrix data; fourth, when accumulation buffer resources are available, prioritize theoretical performance over synchronization overhead.
Fig. 3A is a schematic diagram of a first matrix and a second matrix provided in some embodiments of the present disclosure, and fig. 3B is a schematic diagram of a matrix block C0 shown in fig. 3A.
The chunking parameters and the round robin order provided by embodiments of the present disclosure are explained below in conjunction with fig. 3A.
As shown in fig. 3A, in some examples, the first matrix is of size M × K and the second matrix is of size K × N, where M is the first matrix dimension, K is the second matrix dimension, N is the third matrix dimension, and K, M, and N are all positive integers. Multiplying the first matrix by the second matrix yields the third matrix, which is of size M × N.
For example, as shown in fig. 3A, in this example, during loading of data, the first matrix dimension M needs to be sliced, the second matrix dimension K needs to be sliced, and the third matrix dimension N needs to be sliced. Briefly, the calculation of the matrix block C0 in the third matrix proceeds as follows. The matrix block M00 in the first matrix is multiplied by the matrix block N00 in the second matrix to obtain the first calculated value of each element (data) in the matrix block C0, which is the result of the first operation. The matrix block M01 in the first matrix is then multiplied by the matrix block N01 in the second matrix, and the result is accumulated with the first calculated value to obtain the second calculated value of each element in the matrix block C0. The matrix block M02 in the first matrix is then multiplied by the matrix block N02 in the second matrix, and the result is accumulated with the second calculated value to obtain the third calculated value of each element in the matrix block C0. The result of the third operation is the final operation result of each element of the matrix block C0.
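The three-step accumulation for C0 can be checked with a small sketch (pure Python; the block contents are made-up illustrative values, not taken from the figure):

```python
def mat_mul(X, Y):
    """Plain dense multiply of two list-of-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# First matrix split along K into M00 | M01 | M02 (one row block here);
# second matrix split along K into N00 / N01 / N02 stacked vertically.
M00, M01, M02 = [[1, 2]], [[3, 4]], [[5, 6]]
N00, N01, N02 = [[1, 0], [0, 1]], [[2, 0], [0, 2]], [[1, 1], [1, 1]]

# Accumulate partial products: C0 = M00*N00 + M01*N01 + M02*N02.
C0 = mat_mul(M00, N00)
C0 = mat_add(C0, mat_mul(M01, N01))
C0 = mat_add(C0, mat_mul(M02, N02))

# The same result comes from multiplying the unsplit row and column blocks.
A_row = [[1, 2, 3, 4, 5, 6]]
B_col = [[1, 0], [0, 1], [2, 0], [0, 2], [1, 1], [1, 1]]
assert C0 == mat_mul(A_row, B_col)
print(C0)  # [[18, 21]]
```

The final assertion confirms that accumulating per-K-slice partial products reproduces the unsplit product, which is why only the last accumulated value of C0 is the final result.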
As shown in fig. 3A, the blocking parameters include a first dimension outer layer blocking parameter PARTITION_M, a third dimension outer layer blocking parameter PARTITION_N, and a second dimension blocking parameter PARTITION_K. The first dimension outer layer blocking parameter PARTITION_M represents the step size of the first matrix dimension M of the first matrix in the outer loop, the third dimension outer layer blocking parameter PARTITION_N represents the step size of the third matrix dimension N of the second matrix in the outer loop, and the second dimension blocking parameter PARTITION_K represents the step size of the second matrix dimension K in the outer loop (the K dimension is split identically in the first matrix and the second matrix).
For example, as shown in fig. 3A and 3B, the third matrix includes a plurality of matrix blocks C0 to C7, all of the same size. Taking matrix block C0 as an example, the first dimension outer layer blocking parameter PARTITION_M and the third dimension outer layer blocking parameter PARTITION_N correspond to the height (number of rows) and the width (number of columns) of matrix block C0 in the third matrix; that is, matrix block C0 may be represented as PARTITION_M × PARTITION_N.
For example, each matrix block of the third matrix may also be sliced. As shown in fig. 3B, the blocking parameters further include a first dimension inner layer blocking parameter TILE_M and a third dimension inner layer blocking parameter TILE_N. The matrix block C0 is partitioned to obtain sub-matrix blocks C0B0, C0B1, C0B2, and C0B3. The first dimension inner layer blocking parameter TILE_M represents the step size by which the height of matrix block C0 is split in the inner loop, that is, the step size by which PARTITION_M is split in the inner loop; the third dimension inner layer blocking parameter TILE_N represents the step size by which the width of matrix block C0 is split in the inner loop, that is, the step size by which PARTITION_N is split in the inner loop. The matrix block C0 is divided into a plurality of sub-matrix blocks C0B0 to C0B3, all of the same size. Taking sub-matrix block C0B0 as an example, the first dimension inner layer blocking parameter TILE_M and the third dimension inner layer blocking parameter TILE_N correspond to the height and width of sub-matrix block C0B0; that is, sub-matrix block C0B0 may be represented as TILE_M × TILE_N.
It should be noted that the outer loop corresponds to the splitting of the first matrix and the second matrix, and the inner loop corresponds to the splitting of each matrix block in the third matrix.
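The two-level split (outer PARTITION blocks of the third matrix, then inner TILE sub-blocks) can be enumerated with a short sketch (Python; the concrete sizes are illustrative assumptions):

```python
def two_level_blocks(M, N, part_m, part_n, tile_m, tile_n):
    """Yield (outer_block_origin, inner_tile_origin) pairs for an
    M x N output split first into PARTITION blocks, then into TILEs."""
    for om in range(0, M, part_m):          # outer loop: matrix blocks
        for on in range(0, N, part_n):
            for im in range(om, min(om + part_m, M), tile_m):   # inner loop:
                for in_ in range(on, min(on + part_n, N), tile_n):  # tiles
                    yield (om, on), (im, in_)

# 4 x 4 output, 2 x 2 PARTITION blocks, 1 x 2 TILE sub-blocks.
pairs = list(two_level_blocks(4, 4, 2, 2, 1, 2))
print(len(pairs))  # 8  (4 outer blocks x 2 tiles per block)
```

Every output element is visited exactly once: the outer pair of loops picks a matrix block, and the inner pair walks the tiles inside that block before moving on.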
It should be noted that, if the second matrix dimension K is not split, the second dimension blocking parameter PARTITION_K is equal to the second matrix dimension K.
An example of a final loop sequence is as follows:
for (out_m = 0; out_m < M; out_m += PARTITION_M)
    for (out_n = 0; out_n < N; out_n += PARTITION_N)
        for (k = 0; k < K; k += PARTITION_K)
            for (in_m = 0; in_m < PARTITION_M; in_m += TILE_M)
                for (in_n = 0; in_n < PARTITION_N; in_n += TILE_N)
                {
                    // AI Core process
                    // GPU Core process
                }
For example, in the above example, in the loop order, the first dimension outer layer blocking parameter PARTITION_M is located before the third dimension outer layer blocking parameter PARTITION_N, which indicates that the direction of the third matrix dimension N is traversed first and then the direction of the first matrix dimension M is traversed; the first dimension inner layer blocking parameter TILE_M is located before the third dimension inner layer blocking parameter TILE_N, which indicates that the PARTITION_N direction of a matrix block in the third matrix is traversed first and then the PARTITION_M direction of the matrix block is traversed.
As shown in fig. 3A, for the first matrix and the second matrix, if the direction of the third matrix dimension N is traversed first and then the direction of the first matrix dimension M, the matrix blocks are calculated in the order C0, C1, C2, C3, C4, C5, C6, C7; if the direction of the first matrix dimension M is traversed first and then the direction of the third matrix dimension N, the matrix blocks are calculated in the order C0, C2, C4, C6, C1, C3, C5, C7.
Similarly, as shown in fig. 3B, for the matrix block C0, if the PARTITION_N direction of matrix block C0 is traversed first and then the PARTITION_M direction, the sub-matrix blocks are calculated in the order C0B0, C0B2, C0B1, C0B3; if the PARTITION_M direction is traversed first and then the PARTITION_N direction, the sub-matrix blocks are calculated in the order C0B0, C0B1, C0B2, C0B3.
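The two outer traversal orders over the blocks C0 to C7 can be enumerated directly (Python sketch; the 4 x 2 block grid, 4 blocks along M and 2 along N with row-major numbering, is an assumption consistent with the orders stated above):

```python
ROWS, COLS = 4, 2  # assumed block grid of the third matrix: 4 along M, 2 along N

def order_n_first():
    """Traverse the N direction first: walk each block row left to right."""
    return [r * COLS + c for r in range(ROWS) for c in range(COLS)]

def order_m_first():
    """Traverse the M direction first: walk each block column top to bottom."""
    return [r * COLS + c for c in range(COLS) for r in range(ROWS)]

print(order_n_first())  # [0, 1, 2, 3, 4, 5, 6, 7]  -> C0, C1, ..., C7
print(order_m_first())  # [0, 2, 4, 6, 1, 3, 5, 7]  -> C0, C2, C4, C6, C1, ...
```

Both orders visit every block exactly once; they differ only in which input-matrix data stays resident longest, which is what the loop-sequence choice optimizes.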
For example, in some embodiments, as shown in fig. 1, the AI accelerator includes an input buffer including a first input buffer for buffering data in a first matrix and a second input buffer for buffering data in a second matrix, the size of the input buffer including the size of the first input buffer and the size of the second input buffer.
It should be noted that the first input buffer and the second input buffer may be two independent buffers, and the first input buffer and the second input buffer may also be two different buffer areas in the same buffer.
For example, the data size of each data in the first matrix is the same as the data size of each data in the second matrix.
For example, in some embodiments, in step S10, based on the first matrix, the dimension parameters of the first matrix, the storage location corresponding to the first matrix, the first input buffer for buffering the data of the first matrix, and the data size may be determined; then, based on the first input buffer, the size of the first input buffer may be determined, and based on the storage location corresponding to the first matrix, the loading bandwidth corresponding to the first matrix may be determined. Likewise, based on the second matrix, the dimension parameters of the second matrix, the storage location corresponding to the second matrix, the second input buffer for buffering the data of the second matrix, and the data size may be determined; then, based on the second input buffer, the size of the second input buffer may be determined, and based on the storage location corresponding to the second matrix, the loading bandwidth corresponding to the second matrix may be determined.
For example, as shown in fig. 1, when the first matrix is stored in the external memory, the loading bandwidth corresponding to the first matrix is a first loading bandwidth related to the external memory; when the second matrix is stored in the internal memory, the loading bandwidth corresponding to the second matrix is a second loading bandwidth related to the internal memory. The first loading bandwidth is generally smaller than the second loading bandwidth. The loading bandwidth affects the speed at which data is loaded: the greater the loading bandwidth, the faster the data is loaded.
For example, the loading bandwidth corresponding to the first matrix and the loading bandwidth corresponding to the second matrix may be the same or different, and are specifically determined according to the storage locations of the first matrix and the second matrix.
For example, the size of the first input buffer may or may not be the same as the size of the second input buffer. In the embodiments of the present disclosure, the description is made by taking as an example that the size of the first input buffer is the same as the size of the second input buffer.
For example, in some embodiments, step S20 may include: determining a first matrix size corresponding to the first matrix based on the dimension parameter and the data size of the first matrix; determining a second matrix size corresponding to the second matrix based on the dimension parameter and the data size of the second matrix; comparing the first matrix size with the size of the first input buffer and comparing the second matrix size with the size of the second input buffer to determine a first comparison result; based on the first comparison result, an output parameter is determined.
For example, the first matrix size corresponding to the first matrix is expressed as: matA_size = M × K × d_s, where matA_size denotes the first matrix size, M denotes the first matrix dimension, K denotes the second matrix dimension, and d_s denotes the data size. The second matrix size corresponding to the second matrix is expressed as: matB_size = N × K × d_s, where matB_size denotes the second matrix size and N denotes the third matrix dimension.
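As a quick numerical check of these formulas (the data size and buffer size below are made-up illustrative values):

```python
def matrix_sizes(M, K, N, d_s):
    """matA_size = M * K * d_s, matB_size = N * K * d_s
    (all sizes in the same unit as d_s, e.g., bits)."""
    return M * K * d_s, N * K * d_s

# Example: M = 8, K = 4, N = 8, 16-bit data.
matA_size, matB_size = matrix_sizes(8, 4, 8, 16)
print(matA_size, matB_size)  # 512 512  -> both fit in a 1024-bit input buffer
```

These two sizes are what step S20 compares against the first and second input-buffer sizes to form the first comparison result.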
First, according to the shape information of the matrices (i.e., the dimension parameters of the first matrix and the dimension parameters of the second matrix), it is determined whether the first input buffer can buffer all data in the first matrix and whether the second input buffer can buffer all data in the second matrix; that is, the first matrix size is compared with the size of the first input buffer and the second matrix size is compared with the size of the second input buffer to determine the first comparison result. The output parameters are then determined based on the first comparison result.
For example, in some embodiments, in step S20, based on the first comparison result, the determining of the output parameter includes: determining a second dimension blocking parameter as a second matrix dimension in response to the first comparison result indicating that the first matrix size is smaller than or equal to the size of the first input buffer and/or the second matrix size is smaller than or equal to the size of the second input buffer; comparing the first matrix dimension and the third matrix dimension to determine a second comparison result; based on the second comparison result, a first dimension outer layer blocking parameter and a third dimension outer layer blocking parameter are determined.
For example, in some embodiments, in step S20, based on the first comparison result, the determining of the output parameter includes: comparing the first matrix dimension and the third matrix dimension to determine a second comparison result in response to the first comparison result indicating that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer; based on the second comparison, an outer loop flag is determined.
For example, if the first matrix size is smaller than or equal to the size of the first input buffer, or the second matrix size is smaller than or equal to the size of the second input buffer, one of the two matrices can be loaded into its input buffer once and kept resident through a suitable loop sequence; regardless of whether all data of the other matrix can be completely buffered by its corresponding input buffer, the data of the other matrix will be loaded linearly (streamed) into its input buffer, so no matrix data needs to be loaded repeatedly. If the first matrix size is smaller than or equal to the size of the first input buffer and the second matrix size is smaller than or equal to the size of the second input buffer, both matrices can be loaded into the input buffers once and kept resident through a suitable loop sequence, and again no data is loaded repeatedly. Therefore, in these cases, repeatedly loaded data cannot become a performance bottleneck, and the K dimension does not need to be split, which avoids the extra performance overhead of splitting.
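The fit check and the resulting decision not to split K can be sketched as follows (a simplified Python sketch of the logic just described, not the disclosure's exact procedure):

```python
def plan_k_split(matA_size, matB_size, bufA_size, bufB_size, K):
    """Return PARTITION_K: keep K whole when at least one matrix fits
    entirely in its input buffer, since then no data must be reloaded."""
    a_fits = matA_size <= bufA_size
    b_fits = matB_size <= bufB_size
    if a_fits or b_fits:
        return K      # guideline: do not split K; GEMM unit accumulates
    return None       # otherwise a K split must be chosen (not sketched here)

print(plan_k_split(512, 512, 1024, 1024, K=4))  # 4: K is left unsplit
```

The `None` branch stands in for the later embodiments, in which the K dimension is split to avoid or minimize repeated loading when neither matrix fits.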
For example, in some embodiments of the present disclosure, the order of the first dimension outer layer blocking parameter PARTITION_M and the third dimension outer layer blocking parameter PARTITION_N in the loop order may be indicated by the outer loop flag.
It should be noted that "repeatedly loading data" means that all data in the matrix is loaded at least twice.
Since repeated loading of data occurs in neither the first matrix nor the second matrix, the order of the outer loop (i.e., traversing the M direction first and then the N direction (column-major), or traversing the N direction first and then the M direction (row-major)) does not affect the final GEMM utilization. In this case, the order of the outer loop may be determined based on the comparison result of the first matrix dimension and the third matrix dimension.
Note that, in the embodiment of the present disclosure, the "M direction" indicates the row direction of the matrix, and the "N direction" indicates the column direction of the matrix.
For example, in some embodiments, determining the first-dimension outer-layer blocking parameter and the third-dimension outer-layer blocking parameter based on the second comparison result comprises: in response to the second comparison indicating that the first matrix dimension is less than the third matrix dimension: determining the outer layer blocking parameter of the first dimension as a first matrix dimension; determining a third dimension outer layer blocking parameter based on the data size, the dimension parameter of the second matrix, and the size of the second input buffer.
For example, the first dimension outer layer blocking parameter PARTITION_M being the first matrix dimension M means that the first matrix dimension M is not split in the outer loop.
For example, in some embodiments, determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the second comparison result further comprises: in response to the second comparison result indicating that the first matrix dimension is equal to or greater than the third matrix dimension: determining the third dimension outer layer blocking parameter as the third matrix dimension; and determining the first dimension outer layer blocking parameter based on the data size, the dimension parameters of the first matrix, and the size of the first input buffer.
For example, the third dimension outer layer blocking parameter PARTITION_N being the third matrix dimension N means that the third matrix dimension N is not split in the outer loop.
For example, in some embodiments, determining the first dimension outer layer blocking parameter based on the data size, the dimension parameters of the first matrix, and the size of the first input buffer comprises: determining a first intermediate blocking parameter based on the data size, the second matrix dimension, and the size of the first input buffer; determining the first dimension outer layer blocking parameter as the first intermediate blocking parameter in response to the first intermediate blocking parameter being smaller than the first matrix dimension; and determining the first dimension outer layer blocking parameter as the first matrix dimension in response to the first intermediate blocking parameter being greater than or equal to the first matrix dimension. That is, the first dimension outer layer blocking parameter PARTITION_M is the smaller of the first intermediate blocking parameter and the first matrix dimension M.
For example, in some embodiments, the first intermediate block parameter is expressed as:
P_Md1=b_s1/(K*d_s)
where P _ Md1 denotes a first intermediate blocking parameter, b _ s1 denotes the size of the first input buffer, K denotes the second matrix dimension, and d _ s denotes the data size.
For example, if the first matrix dimension M is less than or equal to the first intermediate blocking parameter, the first dimension outer layer blocking parameter PARTITION_M is the first matrix dimension M. In this case, all data in the first matrix can be completely buffered by the first input buffer, and all data in the second matrix can also be completely buffered by the second input buffer; in the outer loop, data loading is performed only once in both the M direction and the N direction, which is equivalent to no outer layer partitioning.
For example, if the dimension M of the first matrix is greater than the first intermediate blocking parameter, the first input buffer cannot buffer all the data in the first matrix, and the data of the first matrix needs to be loaded into the first input buffer multiple times, but the data is not repeatedly loaded.
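The selection of PARTITION_M described above can be sketched in Python; this is an illustrative sketch in which the function name, the example values, and the use of integer division are assumptions, not part of the disclosure:

```python
def partition_m(b_s1, K, M, d_s):
    """First dimension outer layer blocking parameter PARTITION_M:
    the smaller of the first intermediate blocking parameter
    P_Md1 = b_s1 / (K * d_s) and the first matrix dimension M."""
    p_md1 = b_s1 // (K * d_s)  # rows of the first matrix that fit in the first input buffer
    return min(p_md1, M)

# M <= P_Md1: the whole first matrix fits, so PARTITION_M = M.
print(partition_m(b_s1=4096, K=32, M=16, d_s=2))   # P_Md1 = 64, result 16
# M > P_Md1: PARTITION_M is capped by the buffer capacity.
print(partition_m(b_s1=4096, K=32, M=128, d_s=2))  # result 64
```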
For example, in some embodiments, determining the third dimension outer layer blocking parameter based on the data size, the dimension parameters of the second matrix, and the size of the second input buffer comprises: determining a second intermediate blocking parameter based on the data size, the second matrix dimension, and the size of the second input buffer; determining the third dimension outer layer blocking parameter as the second intermediate blocking parameter in response to the second intermediate blocking parameter being smaller than the third matrix dimension; and determining the third dimension outer layer blocking parameter as the third matrix dimension in response to the second intermediate blocking parameter being greater than or equal to the third matrix dimension. That is, the third dimension outer layer blocking parameter PARTITION_N is the smaller of the second intermediate blocking parameter and the third matrix dimension N.
For example, in some embodiments, the second intermediate blocking parameter is expressed as:
P_Md2=b_s2/(K*d_s)
where P _ Md2 denotes a second intermediate blocking parameter, b _ s2 denotes a size of the second input buffer, K denotes a second matrix dimension, and d _ s denotes a data size.
For example, if the third matrix dimension N is less than or equal to the second intermediate blocking parameter, the third dimension outer layer blocking parameter PARTITION_N is the third matrix dimension N. In this case, all data in the first matrix can be completely buffered by the first input buffer, and all data in the second matrix can also be completely buffered by the second input buffer; in the outer loop, data loading is performed only once in both the M direction and the N direction, which is equivalent to no outer layer partitioning.
For example, if the third matrix dimension N is greater than the second intermediate blocking parameter, the second input buffer cannot buffer all the data in the second matrix, and the data of the second matrix needs to be loaded into the second input buffer multiple times, but no data is repeatedly loaded.
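Symmetrically, the selection of PARTITION_N admits the same kind of sketch; the function name, example values, and integer division are illustrative assumptions:

```python
def partition_n(b_s2, K, N, d_s):
    """Third dimension outer layer blocking parameter PARTITION_N:
    the smaller of the second intermediate blocking parameter
    P_Md2 = b_s2 / (K * d_s) and the third matrix dimension N."""
    p_md2 = b_s2 // (K * d_s)  # columns of the second matrix that fit in the second input buffer
    return min(p_md2, N)

# N <= P_Md2: the whole second matrix fits, so PARTITION_N = N.
print(partition_n(b_s2=8192, K=32, N=64, d_s=2))    # P_Md2 = 128, result 64
# N > P_Md2: PARTITION_N is capped by the buffer capacity.
print(partition_n(b_s2=8192, K=32, N=512, d_s=2))   # result 128
```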
For example, in some embodiments, determining the outer loop flag based on the second comparison result comprises: in response to the second comparison result indicating that the first matrix dimension is less than the third matrix dimension, determining the outer loop flag as a first flag; and in response to the second comparison result indicating that the first matrix dimension is equal to or greater than the third matrix dimension, determining the outer loop flag as a second flag.
For example, the outer-loop flag is used to indicate the order of the first dimension outer-layer blocking parameter and the third dimension outer-layer blocking parameter in the loop order.
For example, the first flag indicates that the first dimension outer layer blocking parameter precedes the third dimension outer layer blocking parameter in the loop order, in which case the N direction is traversed first and then the M direction.
For example, the second flag indicates that the first dimension outer layer blocking parameter follows the third dimension outer layer blocking parameter in the loop order, in which case the M direction is traversed first and then the N direction.
For example, in some embodiments, the first flag and the second flag may each be a binary number; the first flag may be, for example, 1, and the second flag may be 0.
For example, in some embodiments, in step S20, determining the output parameter based on the first comparison result may further include: and determining the sequence of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the circulation sequence based on the outer layer circulation mark.
For example, when the outer-layer loop flag is the first flag, in the loop sequence, the first dimension outer-layer block parameter is located before the third dimension outer-layer block parameter, and at this time, the N direction is traversed first, and then the M direction is traversed.
For example, when the outer-layer loop flag is the second flag, in the loop sequence, the first dimension outer-layer block parameter is located after the third dimension outer-layer block parameter, and at this time, the M direction is traversed first, and then the N direction is traversed.
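The outer loop flag logic above admits a direct sketch; the flag values 1 and 0 follow the binary encoding suggested earlier, and the function name is illustrative:

```python
def outer_loop_flag(M, N):
    """Return the first flag (1) when M < N, so that PARTITION_M precedes
    PARTITION_N in the loop order and the N direction is traversed first;
    otherwise return the second flag (0), traversing the M direction first."""
    return 1 if M < N else 0

print(outer_loop_flag(64, 256))   # M < N  -> first flag, 1
print(outer_loop_flag(256, 64))   # M >= N -> second flag, 0
```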
For example, in some embodiments, the input parameters further include the hardware computing power, and the output parameters further include at least one of: a dimension segmentation flag, an accumulation buffer flag, a target size of the accumulation buffer, first matrix loading times corresponding to the first matrix, second matrix loading times corresponding to the second matrix, and a target utilization rate of the matrix multiplication operation.
The dimension segmentation flag is used to indicate whether the second matrix dimension needs to be split, the accumulation buffer flag is used to indicate whether an accumulation buffer is needed, and the target size of the accumulation buffer represents the size of the accumulation buffer required when one is needed.
For example, in some embodiments, in step S20, determining the output parameter based on the first comparison result may further include: in response to the first comparison result indicating that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer: determining that the first matrix loading times and the second matrix loading times are both 1; determining that the dimension segmentation flag is a first segmentation flag, the accumulation buffer flag is a first buffer flag, and the target size of the accumulation buffer is 0; and determining the target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix.
For example, in some embodiments, the dimension segmentation flag being the first segmentation flag indicates that the second matrix dimension does not need to be split; for example, when the second matrix dimension is not split, the second dimension blocking parameter PARTITION_K is the second matrix dimension K.
For example, in some embodiments, the accumulation buffer flag being the first buffer flag indicates that no accumulation buffer is needed, i.e., no accumulation buffer is required for the accumulation calculation in the matrix multiplication process.
For example, the first segmentation flag and the first buffer flag may both be binary numbers; the first segmentation flag may be 0, and the first buffer flag may be 0.
For example, in a case where the first comparison result indicates that the first matrix size is equal to or smaller than the size of the first input buffer and/or the second matrix size is equal to or smaller than the size of the second input buffer, a target utilization rate of the matrix multiplication may be calculated, and the target utilization rate of the matrix multiplication may indicate the performance evaluation index, and in this case, the target utilization rate of the matrix multiplication may be the first GEMM utilization rate.
For example, in some embodiments, determining the target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix comprises: calculating the general matrix multiplication cycle number (GEMM cycle number) based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, and the hardware computing power; calculating the single loading cycle number corresponding to the first matrix based on the first matrix dimension, the second matrix dimension, the data size, and the loading bandwidth corresponding to the first matrix; calculating the single loading cycle number corresponding to the second matrix based on the second matrix dimension, the third matrix dimension, the data size, and the loading bandwidth corresponding to the second matrix; and determining the target utilization rate of the matrix multiplication operation based on the general matrix multiplication cycle number, the single loading cycle number corresponding to the first matrix, and the single loading cycle number corresponding to the second matrix.
For example, in some embodiments, where the first comparison result indicates that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer, the first GEMM utilization is calculated as follows.
The calculation of the GEMM cycle number (denoted as GEMM_cycle) involves the hardware computing power of the AI core, which is a hardware characteristic; the hardware computing power is a constant, denoted as hw_power, in units of operations per cycle, where an operation denotes a multiply-add, i.e., hw_power indicates how many multiply-add operations can be performed per cycle.
For example, the GEMM _ cycle may be calculated by the following equation (1):
GEMM_cycle = M × K × N × d_s/hw_power    formula (1)
For example, the single loading cycle number corresponding to the first matrix represents the number of cycles required to load all data in the first matrix once, and it can be calculated by using the following formula (2):
load_bufferA_cycle = M × K × d_s/matA_bw    formula (2),
where load_bufferA_cycle represents the single loading cycle number corresponding to the first matrix, and matA_bw represents the loading bandwidth corresponding to the first matrix.
For example, the single loading cycle number corresponding to the second matrix represents the number of cycles required to load all data in the second matrix once, and it can be calculated by using the following formula (3):
load_bufferB_cycle = N × K × d_s/matB_bw    formula (3),
where load_bufferB_cycle represents the single loading cycle number corresponding to the second matrix, and matB_bw represents the loading bandwidth corresponding to the second matrix.
For example, the first GEMM utilization may be calculated using equation (4) below:
GEMM_Util = GEMM_cycle/max(GEMM_cycle, load_bufferA_cycle, load_bufferB_cycle) = min(1, N × matA_bw/hw_power, M × matB_bw/hw_power)    formula (4),
where GEMM_Util indicates the first GEMM utilization, and max(*) indicates taking the maximum value. For example, the maximum value of GEMM_Util is 1.
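Formulas (1) through (4) can be combined into one sketch; the closed form min(1, …) follows by dividing GEMM_cycle by each load term. The function name and example values are illustrative assumptions:

```python
def first_gemm_util(M, K, N, d_s, hw_power, matA_bw, matB_bw):
    gemm_cycle = M * K * N * d_s / hw_power        # formula (1)
    load_bufferA_cycle = M * K * d_s / matA_bw     # formula (2)
    load_bufferB_cycle = N * K * d_s / matB_bw     # formula (3)
    # formula (4): compute fully hides loading only when it dominates both load streams
    return gemm_cycle / max(gemm_cycle, load_bufferA_cycle, load_bufferB_cycle)

u = first_gemm_util(M=64, K=32, N=16, d_s=2, hw_power=1024, matA_bw=32, matB_bw=128)
print(u)  # equals min(1, N*matA_bw/hw_power, M*matB_bw/hw_power) = 0.5
```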
For example, in some embodiments, the input parameters further include: the maximum size of the buffer and the hardware effort are accumulated. As shown in fig. 1, the processor core includes a register file that includes an accumulation buffer.
For example, in some embodiments, in step S20, determining the output parameter based on the first comparison result may include: in response to the first comparison result indicating that the first matrix size is larger than the size of the first input buffer and the second matrix size is larger than the size of the second input buffer: in response to not splitting the second matrix dimension, calculating a first utilization rate based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, and the size of the second input buffer; in response to splitting the second matrix dimension, taking the maximum size of the accumulation buffer as the current calculated size, and calculating a first split utilization rate based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, the size of the second input buffer, and the current calculated size; and, when the first utilization rate is greater than or equal to the first split utilization rate: determining the second dimension blocking parameter as the second matrix dimension; determining the first dimension outer layer blocking parameter based on the second matrix dimension, the data size, and the size of the first input buffer; and determining the third dimension outer layer blocking parameter based on the second matrix dimension, the data size, and the size of the second input buffer.
For example, in the case where the first matrix size is larger than the size of the first input buffer and the second matrix size is larger than the size of the second input buffer, repeated loading of at least one of the first matrix and the second matrix is inevitable. In this case, the performance with and without split_k needs to be compared by calculation to decide whether to use the split_k strategy to implement the data accumulation process outside the GEMM. Using split_k means that the second matrix dimension is split, and not using split_k means that the second matrix dimension is not split.
Fig. 4 is a schematic diagram of another first matrix and a second matrix provided in some embodiments of the present disclosure.
The process of split _ k is described below in conjunction with fig. 4.
As shown in fig. 4, the first matrix is denoted by M × K and the second matrix is denoted by K × N. In this example, PARTITION_M = M, i.e., the first matrix dimension M is not partitioned; PARTITION_N = N/2, i.e., the third matrix dimension N is partitioned into 2; and PARTITION_K = K/n, i.e., the second matrix dimension K is partitioned into n. The first matrix is thus split into n matrix blocks, i.e., matrix block A1, matrix block A2, …, matrix block An, and the second matrix is split into 2n matrix blocks, i.e., matrix block B1, matrix block B2, …, matrix block Bn, matrix block Bn+1, matrix block Bn+2, …, matrix block B2n. First, when calculating the matrix block C00, it is necessary to load matrix block A1, matrix block A2, …, matrix block An of the first matrix, and also to load matrix block B1, matrix block B2, …, matrix block Bn of the second matrix. Matrix block A1 is multiplied by matrix block B1 to obtain the first calculated value (A1 × B1) of matrix block C00, which is stored in the accumulation buffer; matrix block A2 is multiplied by matrix block B2 to obtain the second calculated value (A2 × B2) of matrix block C00; the first calculated value (A1 × B1) and the second calculated value (A2 × B2) are accumulated, and the accumulated result is stored in the accumulation buffer; and so on, until matrix block An is multiplied by matrix block Bn to obtain the n-th calculated value (An × Bn) of matrix block C00. When the n-th calculated value (An × Bn) is superimposed on the current value stored in the accumulation buffer (the sum of the previous n-1 calculated values), the final calculated values of the elements in matrix block C00 are obtained. At this point, all the data in the first matrix has been loaded once.
Then, when calculating the matrix block C10, matrix block A1, matrix block A2, …, matrix block An of the first matrix need to be loaded again, and matrix block Bn+1, matrix block Bn+2, …, matrix block B2n of the second matrix need to be loaded; at this time, all data in the first matrix is repeatedly loaded once. Matrix block A1 is multiplied by matrix block Bn+1 to obtain the first calculated value (A1 × Bn+1) of matrix block C10, which is stored in the accumulation buffer; matrix block A2 is multiplied by matrix block Bn+2 to obtain the second calculated value (A2 × Bn+2) of matrix block C10; the first calculated value (A1 × Bn+1) and the second calculated value (A2 × Bn+2) are accumulated, and the accumulated result is stored in the accumulation buffer; and so on, until matrix block An is multiplied by matrix block B2n to obtain the n-th calculated value (An × B2n) of matrix block C10. When the n-th calculated value (An × B2n) is superimposed on the current value stored in the accumulation buffer (the sum of the previous n-1 calculated values), the final calculated values of the elements in matrix block C10 are obtained.
In this example, the first matrix loading times corresponding to the first matrix is N/PARTITION_N = 2, and the second matrix is not repeatedly loaded, so the second matrix loading times corresponding to the second matrix is M/PARTITION_M = 1. The product PARTITION_M × PARTITION_N is the size corresponding to matrix block C00, and the size corresponding to matrix block C00 is the maximum size Acc_buffer_max of the accumulation buffer.
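The accumulation pattern of the fig. 4 walkthrough can be sketched in plain Python, with the accumulation buffer modelled as a nested list; the block size and the matrices are illustrative only:

```python
def matmul_split_k(A, B, part_k):
    """Compute one output block by splitting K into chunks of part_k and
    accumulating each partial product (A_i x B_i) into an accumulation
    buffer, as in the C00/C10 description above."""
    M, K, N = len(A), len(A[0]), len(B[0])
    acc = [[0] * N for _ in range(M)]     # accumulation buffer for one C block
    for k0 in range(0, K, part_k):        # one (A_i, B_i) block pair per step
        for i in range(M):
            for j in range(N):
                partial = sum(A[i][k] * B[k][j] for k in range(k0, min(k0 + part_k, K)))
                acc[i][j] += partial      # superimpose onto the stored running sum
    return acc

A = [[1, 2, 3, 4], [5, 6, 7, 8]]          # M=2, K=4
B = [[1, 0], [0, 1], [1, 1], [2, 2]]      # K=4, N=2
print(matmul_split_k(A, B, part_k=2))     # [[12, 13], [28, 29]]
```

Splitting K changes only the order of accumulation, so the result matches the unsplit product.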
For example, when the first matrix size is larger than the size of the first input buffer and the second matrix size is larger than the size of the second input buffer, all data of the first matrix and the second matrix cannot be loaded into the input buffers at once, and repeated loading cannot be avoided. Whether the split_k strategy is worthwhile then needs to be determined by calculation.
For example, in some embodiments, in response to splitting the second matrix dimension, taking the maximum size of the accumulation buffer as the current calculated size, calculating the first split utilization rate based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, the size of the second input buffer, and the current calculated size comprises: calculating the first matrix loading times and the second matrix loading times based on the current calculated size, the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix; calculating the general matrix multiplication cycle number based on the first matrix dimension, the second matrix dimension, the third matrix dimension, the data size, and the hardware computing power; calculating the total loading cycle number corresponding to the first matrix based on the first matrix dimension, the second matrix dimension, the first matrix loading times, the data size, and the loading bandwidth corresponding to the first matrix; calculating the total loading cycle number corresponding to the second matrix based on the second matrix dimension, the third matrix dimension, the second matrix loading times, the data size, and the loading bandwidth corresponding to the second matrix; and calculating the first split utilization rate based on the general matrix multiplication cycle number, the total loading cycle number corresponding to the first matrix, and the total loading cycle number corresponding to the second matrix.
For example, the first split utilization rate corresponding to the matrix multiplication operation when split_k is used may be calculated. In the case of using split_k, the first matrix loading times corresponding to the first matrix is denoted as loadA_spk, the second matrix loading times corresponding to the second matrix is denoted as loadB_spk, and the product of loadA_spk and loadB_spk satisfies a first condition regarding the first matrix dimension M, the third matrix dimension N, the first dimension outer layer blocking parameter PARTITION_M, and the third dimension outer layer blocking parameter PARTITION_N. For example, in some embodiments, the first condition is expressed as: loadA_spk × loadB_spk = (M/PARTITION_M) × (N/PARTITION_N), where the product of PARTITION_M and PARTITION_N is Acc_buffer/d_s, and Acc_buffer represents the current calculated size corresponding to the accumulation buffer; when the current calculated size is the maximum size of the accumulation buffer, Acc_buffer = Acc_buffer_max and Acc_buffer/d_s is a fixed value, i.e., loadA_spk × loadB_spk = (M × N × d_s)/Acc_buffer.
Meanwhile, the ratio of the first matrix loading times loadA_spk to the second matrix loading times loadB_spk should satisfy, as far as possible, a second condition related to the single loading cycle number load_bufferA_cycle corresponding to the first matrix and the single loading cycle number load_bufferB_cycle corresponding to the second matrix. In some examples, the second condition is that loadA_spk/loadB_spk is the inverse ratio of the single loading cycle numbers, i.e., loadA_spk/loadB_spk should be as close as possible to load_bufferB_cycle/load_bufferA_cycle, so that the total loading cycle number loadA_cycle corresponding to the first matrix and the total loading cycle number loadB_cycle corresponding to the second matrix are as close as possible. For example, the maximum allowable ratio loadA_cycle/loadB_cycle ranges from 0.5 to 2; when loadA_cycle/loadB_cycle is less than 0.5 or greater than 2, the loading times corresponding to one of the first matrix and the second matrix becomes 2 times its previous value and the loading times corresponding to the other becomes 0.5 times its previous value, which can satisfy the constraint condition. As described above, load_bufferA_cycle = M × K × d_s/matA_bw and load_bufferB_cycle = N × K × d_s/matB_bw.
From the product relationship of loadA_spk and loadB_spk and the ratio relationship between loadA_spk and loadB_spk, the loadA_spk and loadB_spk yielding the optimal GEMM utilization can be calculated, so that the performance data of the GEMM utilization in this case can be further obtained. The GEMM utilization in the case of using split_k is the first split utilization rate described above.
It should be noted that, in addition to the first condition and the second condition, the loadA _ spk and the loadB _ spk need to satisfy a third condition, where the third condition is that the loadA _ spk and the loadB _ spk are both positive integers.
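Under the three conditions above, candidate loading times can be searched exhaustively; this sketch assumes the product from the first condition is an integer and simply scores each factor pair against the target ratio. All names and values are illustrative:

```python
def pick_load_counts(product, target_ratio):
    """Among positive-integer pairs (loadA_spk, loadB_spk) with
    loadA_spk * loadB_spk == product (first and third conditions),
    pick the pair whose ratio loadA_spk / loadB_spk is closest to
    target_ratio = load_bufferB_cycle / load_bufferA_cycle (second condition)."""
    best = None
    for a in range(1, product + 1):
        if product % a != 0:
            continue
        b = product // a
        score = abs(a / b - target_ratio)
        if best is None or score < best[0]:
            best = (score, a, b)
    return best[1], best[2]

# product = M * N * d_s / Acc_buffer (illustrative value 8), target ratio 2
print(pick_load_counts(8, 2.0))   # (4, 2): ratio exactly 2
```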
For example, when split_k is used, based on the above formula (1), the general matrix multiplication cycle number GEMM_cycle is expressed as GEMM_cycle = M × K × N × d_s/hw_power.
The total number of loading cycles for the first matrix can be calculated using the following equation (5):
loadA_cycle = M × K × d_s × loadA_spk/matA_bw    formula (5),
wherein, the loadA _ cycle represents the total loading cycle number corresponding to the first matrix.
The total number of loading cycles for the second matrix can be calculated using the following equation (6):
loadB_cycle = N × K × d_s × loadB_spk/matB_bw    formula (6),
and the loadB _ cycle represents the total loading cycle number corresponding to the second matrix.
For example, the first split utilization rate may be calculated using the following formula (7):
Util_split = GEMM_cycle/max(GEMM_cycle, loadA_cycle, loadB_cycle)    formula (7),
where Util_split represents the first split utilization rate, and max(*) represents taking the maximum value.
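Formulas (5) through (7) can be sketched together; the function name and example values are illustrative assumptions:

```python
def util_split(M, K, N, d_s, hw_power, matA_bw, matB_bw, loadA_spk, loadB_spk):
    gemm_cycle = M * K * N * d_s / hw_power                        # formula (1)
    loadA_cycle = M * K * d_s * loadA_spk / matA_bw                # formula (5)
    loadB_cycle = N * K * d_s * loadB_spk / matB_bw                # formula (6)
    return gemm_cycle / max(gemm_cycle, loadA_cycle, loadB_cycle)  # formula (7)

print(util_split(M=128, K=64, N=128, d_s=2, hw_power=4096,
                 matA_bw=64, matB_bw=64, loadA_spk=2, loadB_spk=1))  # 1.0
```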
For example, in some embodiments, in response to not splitting the dimension of the second matrix, calculating a first utilization based on the dimension parameter of the first matrix, the dimension parameter of the second matrix, the data size, the hardware effort, the load bandwidth corresponding to the first matrix, the load bandwidth corresponding to the second matrix, the size of the first input buffer, and the size of the second input buffer may include: determining the loading times of the first matrix and the loading times of the second matrix based on the first matrix dimension, the second matrix dimension, the third matrix dimension, the data size, the size of the first input buffer and the size of the second input buffer; calculating the multiplication period number of the universal matrix based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size and the hardware calculation force; calculating the total loading period number corresponding to the first matrix based on the first matrix dimension, the second matrix dimension, the loading times of the first matrix, the data size and the loading bandwidth corresponding to the first matrix; calculating the total loading period number corresponding to the second matrix based on the second matrix dimension, the third matrix dimension, the second matrix loading frequency, the data size and the loading bandwidth corresponding to the second matrix; and calculating the first utilization rate based on the multiplication period number of the universal matrix, the total loading period number corresponding to the first matrix and the total loading period number corresponding to the second matrix.
For example, in some embodiments, determining the first matrix loading times and the second matrix loading times without splitting the second matrix dimension comprises: in response to traversing the M direction first and then the N direction in the outer loop, determining the second matrix loading times to be 1, and determining the first matrix loading times based on the second matrix dimension, the third matrix dimension, the data size, and the size of the second input buffer. At this time, the first matrix loading times loadA_nspk is expressed as: loadA_nspk = N × K × d_s/b_s2.
For example, in some embodiments, determining the first matrix loading times and the second matrix loading times without splitting the second matrix dimension comprises: in response to traversing the N direction first and then the M direction in the outer loop, determining the first matrix loading times to be 1, and determining the second matrix loading times based on the first matrix dimension, the second matrix dimension, the data size, and the size of the first input buffer. At this time, the second matrix loading times loadB_nspk is expressed as: loadB_nspk = M × K × d_s/b_s1.
For example, the first utilization rate corresponding to the matrix multiplication operation when split_k is not used may be calculated. In some embodiments, in the outer loop without split_k, if the partitioning proceeds in the M direction and then in the N direction (i.e., the M direction is traversed first and then the N direction), the data of the second matrix does not need to be repeatedly loaded, and the first matrix loading times is loadA_nspk = N × K × d_s/b_s2; at this time, the total loading cycle number corresponding to the first matrix is loadA_cycle = load_bufferA_cycle × loadA_nspk, the total loading cycle number corresponding to the second matrix is loadB_cycle = load_bufferB_cycle, and the utilization rate Util_1 corresponding to the matrix multiplication operation may be expressed as:
Util_1=GEMM_cycle/(max(GEMM_cycle,load_bufferA_cycle*loadA_nspk,load_bufferB_cycle))。
In the outer loop, if the slicing is performed in the N direction and then in the M direction (i.e., the N direction is traversed first and then the M direction), the data of the first matrix does not need to be repeatedly loaded, and the second matrix loading times loadB_nspk corresponding to the second matrix is M × K × d_s/b_s1. In this case, the total loading cycle number loadA_cycle corresponding to the first matrix is load_bufferA_cycle, the total loading cycle number loadB_cycle corresponding to the second matrix is load_bufferB_cycle × loadB_nspk, and the utilization rate Util_2 corresponding to the matrix multiplication operation may be expressed as:
Util_2=GEMM_cycle/(max(GEMM_cycle,load_bufferA_cycle,load_bufferB_cycle*loadB_nspk)).
Then, the better of the above two cases is selected, that is, the larger of the utilization rate Util_1 and the utilization rate Util_2 is taken, so that the first utilization rate is the larger of Util_1 and Util_2, whereby the first utilization rate is expressed as the following formula (8):
Util_normal = GEMM_cycle/(max(GEMM_cycle, min(load_bufferA_cycle × loadA_nspk, load_bufferB_cycle × loadB_nspk))), (8)
wherein Util_normal represents the first utilization rate, and min(·) represents taking the minimum value.
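The selection between the two traversal orders can be illustrated with a short sketch. This is a hedged, minimal Python rendering of formula (8), not the patented implementation; the function name and the concrete numbers are assumptions for demonstration.

```python
# Minimal sketch (not the patented implementation) of choosing the better
# outer-loop traversal order when split_k is NOT used, following formula (8).
# Parameter names follow the text: M/N/K matrix dimensions, d_s data size,
# b_s1/b_s2 input buffer sizes, matA_bw/matB_bw loading bandwidths,
# hw_power hardware computing power.
def util_without_split_k(M, N, K, d_s, b_s1, b_s2, matA_bw, matB_bw, hw_power):
    gemm_cycle = M * N * K * d_s / hw_power
    load_bufferA_cycle = M * K * d_s / matA_bw  # cycles for one full pass over A
    load_bufferB_cycle = N * K * d_s / matB_bw  # cycles for one full pass over B
    # Case 1: M traversed first; A reloaded once per N block, B loaded once.
    loadA_nspk = N * K * d_s / b_s2
    util_1 = gemm_cycle / max(gemm_cycle,
                              load_bufferA_cycle * loadA_nspk,
                              load_bufferB_cycle)
    # Case 2: N traversed first; B reloaded once per M block, A loaded once.
    loadB_nspk = M * K * d_s / b_s1
    util_2 = gemm_cycle / max(gemm_cycle,
                              load_bufferA_cycle,
                              load_bufferB_cycle * loadB_nspk)
    return max(util_1, util_2)  # formula (8): Util_normal
```

Under these assumptions, a compute-bound case yields 1.0, while a bandwidth-bound case returns the better of the two orders.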
For example, if the GEMM utilization rate obtained without split_k (i.e., the first utilization rate) is higher than or equal to the GEMM utilization rate obtained with split_k (i.e., the first split utilization rate), there is no need to use split_k and incur the additional resource consumption of the accumulation buffer. When the first utilization rate is greater than or equal to the first split utilization rate, the second dimension outer-layer blocking parameter PARTITION_K is the second matrix dimension K, and the first dimension outer-layer blocking parameter PARTITION_M and the third dimension outer-layer blocking parameter PARTITION_N are determined on the basis that the first input buffer and the second input buffer are fully loaded. In this case, the first dimension outer-layer blocking parameter PARTITION_M is determined based on the second matrix dimension, the data size, and the size of the first input buffer, e.g., PARTITION_M = b_s1/(K × d_s); the third dimension outer-layer blocking parameter PARTITION_N is determined based on the second matrix dimension, the data size, and the size of the second input buffer, e.g., PARTITION_N = b_s2/(K × d_s).
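A hedged sketch of the full-buffer outer partitioning when split_k is not used (integer division is assumed, as is the divisibility of the buffer sizes by K × d_s; the function name is illustrative, not from the patent):

```python
def outer_partitions_no_split_k(K, d_s, b_s1, b_s2):
    # No split along K: PARTITION_K is the full second matrix dimension.
    partition_k = K
    # First/second input buffers fully loaded with a K-deep slab.
    partition_m = b_s1 // (K * d_s)  # PARTITION_M = b_s1/(K*d_s)
    partition_n = b_s2 // (K * d_s)  # PARTITION_N = b_s2/(K*d_s)
    return partition_m, partition_k, partition_n
```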
For example, in some embodiments, in step S20, determining the output parameters based on the first comparison result further includes: when the first utilization rate is greater than or equal to the first split utilization rate, determining the outer loop flag as a first flag in response to the loading bandwidth corresponding to the first matrix being smaller than the loading bandwidth corresponding to the second matrix, and determining the outer loop flag as a second flag in response to the loading bandwidth corresponding to the first matrix being greater than or equal to the loading bandwidth corresponding to the second matrix.
It should be noted that, for the meanings of the outer loop flag, the first flag, and the second flag, reference may be made to the above related description, and repeated descriptions are omitted.
For example, in some embodiments, in step S20, determining the output parameters based on the first comparison result further includes: when the first utilization rate is greater than or equal to the first split utilization rate, determining the first matrix loading times as 1 and determining the second matrix loading times based on the first dimension outer-layer blocking parameter and the first matrix dimension, in response to the loading bandwidth corresponding to the first matrix being smaller than the loading bandwidth corresponding to the second matrix; and determining the first matrix loading times based on the third dimension outer-layer blocking parameter and the third matrix dimension and determining the second matrix loading times as 1, in response to the loading bandwidth corresponding to the first matrix being greater than or equal to the loading bandwidth corresponding to the second matrix.
For example, when the first utilization rate is greater than or equal to the first split utilization rate, it is necessary to determine whether the M direction or the N direction is traversed first in the outer loop. For example, the order of the first dimension outer-layer blocking parameter and the third dimension outer-layer blocking parameter in the loop order may be determined based on the relationship between the loading bandwidth matA_bw corresponding to the first matrix and the loading bandwidth matB_bw corresponding to the second matrix. When matA_bw is greater than or equal to matB_bw, it is determined that the first dimension outer-layer blocking parameter is located after the third dimension outer-layer blocking parameter in the loop order, that is, the M direction is traversed first and then the N direction; therefore, the first matrix needs to be repeatedly loaded, the first matrix loading times loadA_nspk corresponding to the first matrix is N/PARTITION_N (in this case, PARTITION_N = b_s2/(K × d_s)), and the data of the second matrix does not need to be repeatedly loaded, that is, the second matrix loading times corresponding to the second matrix is 1. When matA_bw is smaller than matB_bw, it is determined that the first dimension outer-layer blocking parameter is located before the third dimension outer-layer blocking parameter in the loop order, that is, the N direction is traversed first and then the M direction; therefore, the data of the first matrix does not need to be repeatedly loaded, that is, the first matrix loading times corresponding to the first matrix is 1, the second matrix needs to be repeatedly loaded, and the second matrix loading times loadB_nspk corresponding to the second matrix is M/PARTITION_M (in this case, PARTITION_M = b_s1/(K × d_s)).
For example, in some embodiments, in step S20, determining the output parameters based on the first comparison result includes: when the first utilization rate is greater than or equal to the first split utilization rate, determining that the dimension split identifier is a first split sub-identifier, the accumulation buffer identifier is a first buffer sub-identifier, and the target size of the accumulation buffer is 0; and determining the target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix.
For example, when the loading bandwidth corresponding to the first matrix is smaller than the loading bandwidth corresponding to the second matrix, the target utilization rate of the matrix multiplication operation is expressed as: Util = min(N × matA_bw, M × matB_bw/loadB_nspk)/hw_power, where loadB_nspk = M/PARTITION_M.
For example, when the loading bandwidth corresponding to the first matrix is greater than or equal to the loading bandwidth corresponding to the second matrix, the target utilization rate of the matrix multiplication operation is expressed as: Util = min(N × matA_bw/loadA_nspk, M × matB_bw)/hw_power, where loadA_nspk = N/PARTITION_N.
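The bandwidth-based branch described above can be sketched as follows; this is an illustrative rendering of the stated formulas, with the flag convention (1 for the N-first order) taken from steps S508/S509 in the flow description, and all numbers in the test assumed for demonstration.

```python
def target_util_no_split_k(M, N, K, d_s, b_s1, b_s2, matA_bw, matB_bw, hw_power):
    # Full-buffer outer partitions (split_k not used).
    partition_m = b_s1 // (K * d_s)
    partition_n = b_s2 // (K * d_s)
    if matA_bw < matB_bw:
        # N-first order: A loaded once, B reloaded M/PARTITION_M times.
        outloop_rowmajor_flag = 1
        loadB_nspk = M / partition_m
        util = min(N * matA_bw, M * matB_bw / loadB_nspk) / hw_power
    else:
        # M-first order: B loaded once, A reloaded N/PARTITION_N times.
        outloop_rowmajor_flag = 0
        loadA_nspk = N / partition_n
        util = min(N * matA_bw / loadA_nspk, M * matB_bw) / hw_power
    return outloop_rowmajor_flag, util
```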
For example, in some embodiments, in step S20, determining the output parameters based on the first comparison result further includes: when the first utilization rate is smaller than the first split utilization rate, determining an adjusted size based on the adjustment coefficient and the current calculation size, and iteratively calculating the split utilization rate with the adjusted size as the new current calculation size until a split utilization rate smaller than the first split utilization rate is obtained as the target split utilization rate; determining the first dimension outer-layer blocking parameter and the third dimension outer-layer blocking parameter based on the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, and the current calculation size corresponding to the target split utilization rate; and determining the second dimension blocking parameter based on the size of the first input buffer, the size of the second input buffer, the first dimension outer-layer blocking parameter, the third dimension outer-layer blocking parameter, and the data size.
For example, the adjustment coefficient is less than 1; in some embodiments the adjustment coefficient may be 0.5, and the adjusted size may be the product of the current calculation size and the adjustment coefficient. It should be noted that the adjustment coefficient may be set according to specific situations, and the present disclosure does not limit this.
For example, if the GEMM utilization rate obtained without split_k (i.e., the first utilization rate) is lower than the GEMM utilization rate obtained with split_k (i.e., the first split utilization rate), it is necessary to determine whether an accumulation buffer of smaller size can achieve the same result. For example, the size Acc_buffer of the accumulation buffer used for accumulation calculation (i.e., the current calculation size) may be set to half of the maximum size Acc_buffer_max of the accumulation buffer, and the utilization rate may then be recalculated based on the current calculation size Acc_buffer; if the utilization rate calculated based on the current calculation size Acc_buffer is not lower than the first split utilization rate, the accumulation buffer used may be further reduced, and the calculation is repeated in this way until the obtained utilization rate is lower than the first split utilization rate, which is the target split utilization rate.
For example, when the split utilization rate is calculated for the first time, the maximum size of the accumulation buffer is taken as the current calculation size used for calculating the split utilization rate (hereinafter referred to as the first current calculation size), that is, the first current calculation size Acc_buffer1 = Acc_buffer_max. When the split utilization rate obtained based on the first current calculation size is not smaller than the first split utilization rate, an adjusted size (hereinafter referred to as the first adjusted size) is determined based on the adjustment coefficient (e.g., 0.5) and the first current calculation size, and the first adjusted size is taken as the current calculation size used for calculating the split utilization rate (hereinafter referred to as the second current calculation size); that is, when the split utilization rate is calculated for the second time, the second current calculation size Acc_buffer2 = 0.5 × Acc_buffer_max. When the split utilization rate obtained based on the second current calculation size is still not smaller than the first split utilization rate, an adjusted size (hereinafter referred to as the second adjusted size) is determined based on the adjustment coefficient and the second current calculation size, and the second adjusted size is taken as the current calculation size used for calculating the split utilization rate (hereinafter referred to as the third current calculation size); that is, when the split utilization rate is calculated for the third time, the third current calculation size Acc_buffer3 = 0.5 × 0.5 × Acc_buffer_max. The process is repeated in this way until a split utilization rate smaller than the first split utilization rate is finally obtained, and this split utilization rate is the target split utilization rate.
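The iterative halving described above can be sketched as a short loop; `calc_util` stands in for the split-utilization procedure and is an assumed callable, not a patent API, and the loop sketch assumes the utilization eventually drops below the threshold as the buffer shrinks.

```python
def find_target_split_util(acc_buffer_max, first_split_util, calc_util, adj=0.5):
    # Start from the maximum accumulation buffer size and repeatedly shrink it
    # by the adjustment coefficient until the computed split utilization rate
    # drops below the first split utilization rate.
    acc_buffer = acc_buffer_max
    util = calc_util(acc_buffer)
    while util >= first_split_util:
        acc_buffer *= adj           # adjusted size = coefficient * current size
        util = calc_util(acc_buffer)
    return acc_buffer, util         # Acc_buffer_new and the target split utilization
```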
For example, in some embodiments, determining the first dimension outer-layer blocking parameter and the third dimension outer-layer blocking parameter based on the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, and the current calculation size corresponding to the target split utilization rate includes: determining the first matrix loading times and the second matrix loading times based on the current calculation size corresponding to the target split utilization rate, the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix; determining the first dimension outer-layer blocking parameter based on the second matrix loading times and the first matrix dimension; and determining the third dimension outer-layer blocking parameter based on the first matrix loading times and the third matrix dimension.
For example, when the first utilization rate is smaller than the first split utilization rate, it is determined that split_k needs to be used. In this case, the first dimension outer-layer blocking parameter PARTITION_M is expressed as: PARTITION_M = M/loadB_spk; the third dimension outer-layer blocking parameter PARTITION_N is expressed as: PARTITION_N = N/loadA_spk; and the second dimension blocking parameter PARTITION_K = min(b_s1/(PARTITION_M × d_s), b_s2/(PARTITION_N × d_s)). For example, when b_s1 and b_s2 are equal, the second dimension blocking parameter PARTITION_K is expressed as PARTITION_K = b_s1/(max(PARTITION_M, PARTITION_N) × d_s).
For example, the first matrix loading times loadA_spk and the second matrix loading times loadB_spk satisfy the following conditions: loadA_spk × loadB_spk = M × N × d_s/Acc_buffer_new, and loadA_spk/loadB_spk = (M × matB_bw)/(N × matA_bw), where Acc_buffer_new is the current calculation size corresponding to the target split utilization rate. It should be understood that Acc_buffer_new is necessarily smaller than or equal to Acc_buffer_max.
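Under the two conditions just stated, loadA_spk and loadB_spk have a closed-form (continuous) solution, from which the outer partitions of the split_k case follow. This sketch is an assumption-laden illustration: it ignores the rounding to hardware-friendly values that a real implementation would need, and both function names are invented for demonstration.

```python
import math

def split_load_counts(M, N, d_s, matA_bw, matB_bw, acc_buffer_new):
    # Solve loadA_spk * loadB_spk = M*N*d_s/Acc_buffer_new together with
    # loadA_spk / loadB_spk = (M*matB_bw)/(N*matA_bw) in closed form.
    prod = M * N * d_s / acc_buffer_new
    ratio = (M * matB_bw) / (N * matA_bw)
    return math.sqrt(prod * ratio), math.sqrt(prod / ratio)

def split_k_partitions(M, N, d_s, b_s1, b_s2, loadA_spk, loadB_spk):
    partition_m = M / loadB_spk                    # PARTITION_M = M/loadB_spk
    partition_n = N / loadA_spk                    # PARTITION_N = N/loadA_spk
    partition_k = min(b_s1 / (partition_m * d_s),  # PARTITION_K from both buffers
                      b_s2 / (partition_n * d_s))
    return partition_m, partition_k, partition_n
```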
For example, in some embodiments, in step S20, determining the output parameters based on the first comparison result includes: when the first utilization rate is smaller than the first split utilization rate, determining the dimension split identifier as a second split sub-identifier, the accumulation buffer identifier as a second buffer sub-identifier, and the target size of the accumulation buffer as the current calculation size corresponding to the target split utilization rate; and determining the target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the first matrix loading times, and the second matrix loading times.
For example, the dimension split identifier being the second split sub-identifier indicates that the second matrix dimension needs to be split, and the accumulation buffer identifier being the second buffer sub-identifier indicates that an accumulation buffer needs to be set.
For example, the second split sub-identifier and the second buffer sub-identifier may both be binary numbers, the first split sub-identifier is different from the second split sub-identifier, and the first buffer sub-identifier is different from the second buffer sub-identifier; for example, the second split sub-identifier may be 1, and the second buffer sub-identifier may be 1.
It should be noted that the specific values of the first split sub-identifier, the second split sub-identifier, the first buffer sub-identifier, and the second buffer sub-identifier may be set according to actual situations, which is not limited in the present disclosure; for example, the first split sub-identifier and the first buffer sub-identifier may also be different from each other, e.g., the first split sub-identifier is 0 and the first buffer sub-identifier is 2.
For example, the current calculation size corresponding to the target split utilization rate is the required final size acc_buffer_needed of the accumulation buffer.
For example, the dimension split identifier is associated with both the accumulation buffer identifier and the final size acc_buffer_needed of the accumulation buffer; the accumulation buffer is required only when the dimension split identifier is the second split sub-identifier, and in this case the current calculation size corresponding to the target split utilization rate may be output as the final size acc_buffer_needed of the accumulation buffer.
For example, only when split_k is used is the accumulation buffer required and its final size acc_buffer_needed determined; in practice, the data accumulation outside the AI accelerator can be performed in the accumulation buffer.
For example, when the first utilization rate is smaller than the first split utilization rate, the target utilization rate of the matrix multiplication operation is the target split utilization rate, which can be calculated by using the above formula (7), except that, for calculating the target split utilization rate, loadA_spk and loadB_spk in formula (7) are calculated based on the current calculation size corresponding to the target split utilization rate.
For example, in some embodiments, determining the output parameters based on the first comparison result further comprises: when the first utilization rate is smaller than the first split utilization rate, determining the outer loop flag as an outer preset flag. For example, in some embodiments, the outer preset flag indicates that the first dimension outer-layer blocking parameter precedes the third dimension outer-layer blocking parameter in the loop order, that is, the N direction is traversed before the M direction.
For example, the outer preset flag may be preset by a user according to the actual application scenario; the present disclosure does not particularly limit the outer preset flag.
For example, in some embodiments, the input parameters further include a synchronization granularity, and a first block parameter and a second block parameter corresponding to the minimum data block.
For example, the synchronization granularity may represent the synchronization granularity between the AI accelerator and the processor core.
For example, in some embodiments, the first block parameter and the second block parameter may be the same, e.g., both 64. Of course, the present disclosure is not limited thereto, and the first block parameter and the second block parameter may also be different.
For example, the first and second block parameters may be a width and a length of the minimum data block, respectively.
For example, in a GPU, a kernel corresponds to a thread grid (grid) which in turn contains a number of minimum data blocks (blocks), each of which contains a number of threads (threads).
For example, in some embodiments, obtaining the output parameters based on the input parameters includes: determining a first dimension inner-layer blocking parameter and a third dimension inner-layer blocking parameter based on the first block parameter, the second block parameter, the first dimension outer-layer blocking parameter, the synchronization granularity, and the data size.
For the inner loop, for example, as shown in fig. 3A, the inner loop represents the loop over each matrix block in the third matrix; for example, each matrix block in the third matrix may be of size PARTITION_M × PARTITION_N. The synchronization granularity sync_granularity in the input parameters determines the amount of data processed by each loop body in the inner loop, so that the step size of the inner loop can be determined.
For example, the unit of the synchronization granularity sync _ granularity may be a minimum data block.
For example, in some embodiments, determining the first dimension inner-layer blocking parameter and the third dimension inner-layer blocking parameter based on the first block parameter, the second block parameter, the first dimension outer-layer blocking parameter, the synchronization granularity, and the data size includes: determining a first data block number based on the first block parameter and the first dimension outer-layer blocking parameter; determining a second data block number based on the second block parameter and the third dimension outer-layer blocking parameter; and determining the first dimension inner-layer blocking parameter and the third dimension inner-layer blocking parameter based on the first block parameter, the second block parameter, the synchronization granularity, the data size, and the first data block number.
For example, in some embodiments, determining the first dimension inner-layer blocking parameter and the third dimension inner-layer blocking parameter based on the first block parameter, the second block parameter, the synchronization granularity, the data size, and the first data block number includes: determining a first intermediate data block number based on the synchronization granularity and the data size; determining a third data block number as the first intermediate data block number in response to the first intermediate data block number being smaller than the first data block number; determining the third data block number as the first data block number in response to the first intermediate data block number being greater than or equal to the first data block number; determining a second intermediate data block number based on the first intermediate data block number and the third data block number; determining a fourth data block number as 1 in response to the second intermediate data block number being smaller than 1; determining the fourth data block number as the second intermediate data block number in response to the second intermediate data block number being greater than or equal to 1; determining the first dimension inner-layer blocking parameter based on the third data block number and the first block parameter; and determining the third dimension inner-layer blocking parameter based on the fourth data block number and the second block parameter.
For example, for the inner loop, first, the number of minimum data blocks (the minimum data block is the smallest data block allowed by the hardware) corresponding to one matrix block in the third matrix (this matrix block may be referred to as a PARTITION block) is calculated, and the size of the PARTITION block is the size of a single outer-loop step.
For example, the first data block number block_num_M may be PARTITION_M/block_size_M, where block_size_M is the first block parameter, and the second data block number block_num_N may be PARTITION_N/block_size_N, where block_size_N is the second block parameter.
For example, in some embodiments, the blocking method may further include: determining the inner loop flag as an inner preset flag. The inner loop flag is used to indicate the order of the first dimension inner-layer blocking parameter and the third dimension inner-layer blocking parameter in the loop order.
For example, the inner preset flag may be preset by a user according to the actual application scenario; the present disclosure does not particularly limit the inner preset flag.
For example, to ensure consistency with the hardware architecture, the accumulation in the inner loop may be performed column by column; in this case, the inner preset flag indicates that, in the loop order, the third dimension inner-layer blocking parameter TILE_N precedes the first dimension inner-layer blocking parameter TILE_M.
For each accumulation, the number of minimum data blocks corresponding to the PARTITION_M direction is the third data block number TILE_M_block, the number of minimum data blocks corresponding to the PARTITION_N direction is the fourth data block number TILE_N_block, and TILE_M_block = sync_granularity/(2 × d_s); in this case, the corresponding TILE_N_block is 1.
Meanwhile, the number of minimum data blocks corresponding to each column needs to be constrained: if the number of minimum data blocks corresponding to sync_granularity exceeds the first data block number block_num_M in each loop, the third data block number TILE_M_block = block_num_M, and the fourth data block number TILE_N_block = sync_granularity/(2 × d_s)/TILE_M_block. Finally, the first dimension inner-layer blocking parameter TILE_M = TILE_M_block × block_size_M, and the third dimension inner-layer blocking parameter TILE_N = TILE_N_block × block_size_N.
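The inner-loop tile derivation described above can be sketched as follows. Integer division is assumed, the function name is illustrative, and no clamping of TILE_N_block by the second data block number is applied because the text does not state one.

```python
def inner_tiles(partition_m, block_size_m, block_size_n, sync_granularity, d_s):
    # First data block number: minimum data blocks per PARTITION block in M.
    block_num_m = partition_m // block_size_m
    # First intermediate data block number: minimum data blocks covered by
    # one synchronization step, TILE_M_block = sync_granularity/(2*d_s).
    first_intermediate = sync_granularity // (2 * d_s)
    # Third data block number, constrained by block_num_m per column.
    tile_m_block = first_intermediate if first_intermediate < block_num_m else block_num_m
    # Second intermediate / fourth data block number (at least 1).
    tile_n_block = max(1, first_intermediate // tile_m_block)
    # Inner-layer blocking parameters.
    tile_m = tile_m_block * block_size_m  # TILE_M
    tile_n = tile_n_block * block_size_n  # TILE_N
    return tile_m, tile_n
```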
For example, after all the blocking parameters are obtained through calculation, the calculated blocking parameters are combined in an outside-in order to obtain the loop order, which is expressed as "outer loop -> split-k loop -> inner loop"; in the outer loop, the order of the first dimension outer-layer blocking parameter and the third dimension outer-layer blocking parameter is determined based on the outer loop flag, and in the inner loop, the order of the first dimension inner-layer blocking parameter and the third dimension inner-layer blocking parameter is determined based on the inner loop flag.
It should be noted that the first dimension outer-layer blocking parameter PARTITION_M represents the step size of the outer loop in the M direction of the first matrix, and the third dimension outer-layer blocking parameter PARTITION_N represents the step size of the outer loop in the N direction of the second matrix; the first dimension inner-layer blocking parameter TILE_M represents the step size of the inner loop in the PARTITION_M direction of each PARTITION block, and the third dimension inner-layer blocking parameter TILE_N represents the step size of the inner loop in the PARTITION_N direction of each PARTITION block.
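Putting the parameters together, an assumed rendering of the "outer loop -> split-k loop -> inner loop" nest is sketched below. The mapping of outloop_rowmajor_flag = 1 to "M blocks outermost, N direction traversed first" is inferred from steps S508/S509 and should be treated as an assumption; the inner loop follows the column-accumulation order (TILE_N before TILE_M).

```python
def block_schedule(M, N, K, pm, pn, pk, tm, tn, outloop_rowmajor_flag=1):
    """Yield (i, j, k) tile origins in 'outer loop -> split-k loop -> inner loop' order."""
    outer_m, outer_n = range(0, M, pm), range(0, N, pn)
    # Outer loop order from the outer loop flag (assumed convention, see lead-in).
    outer = ((m, n) for m in outer_m for n in outer_n) if outloop_rowmajor_flag \
        else ((m, n) for n in outer_n for m in outer_m)
    for m, n in outer:
        for k in range(0, K, pk):                       # split-k loop over PARTITION_K
            # Inner loop: column accumulation, TILE_N precedes TILE_M.
            for j in range(n, min(n + pn, N), tn):
                for i in range(m, min(m + pm, M), tm):
                    yield (i, j, k)
```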
From the overall structure, the flow of the blocking method provided by the embodiments of the present disclosure covers two aspects: the necessity of splitting the K dimension, and the specific division of the M dimension and the N dimension. Specifically, this includes: determining whether to use the split_k strategy (splitting in the K dimension and using an accumulation buffer to complete the external data accumulation of the general matrix multiplication), and determining, according to the maximum size of the accumulation buffer and the performance data corresponding to its different sizes, the final size of the accumulation buffer used when the split_k strategy is adopted and the corresponding GEMM utilization rate.
Meanwhile, the M/N dimension division mainly includes two levels (namely the outer loop and the inner loop): in the outer loop, the first dimension outer-layer blocking parameter and the third dimension outer-layer blocking parameter are determined with the full-load condition of the first matrix and the second matrix as the slicing granularity; in the inner loop, the first dimension inner-layer blocking parameter and the third dimension inner-layer blocking parameter are determined with the set synchronization granularity between the AI accelerator and the processor core as the slicing granularity. The loop order (row-major or column-major) is determined according to the different cases described above.
Fig. 5 is a schematic flow chart of a blocking method according to some embodiments of the present disclosure.
The complete flow of the blocking method provided by the embodiment of the present disclosure is briefly described below with reference to fig. 5.
In this blocking method, as shown in fig. 5, first, in step S501, based on the input parameters, a first matrix size matA_size corresponding to the first matrix and a second matrix size matB_size corresponding to the second matrix are calculated: matA_size = M × K × d_s, and matB_size = N × K × d_s.
As shown in fig. 5, the input parameters include a first matrix dimension M, a second matrix dimension K, a third matrix dimension N, a data size d _ s, a loading bandwidth matA _ bw corresponding to the first matrix, a loading bandwidth matB _ bw corresponding to the second matrix, a size b _ s1 of the first input buffer, and a size b _ s2 of the second input buffer.
As shown in fig. 5, in step S502, the first matrix size matA _ size and the size b _ S1 of the first input buffer are compared and the second matrix size matB _ size and the size b _ S2 of the second input buffer are compared.
When the first matrix size matA _ size is larger than the size b _ S1 of the first input buffer and the second matrix size matB _ size is larger than the size b _ S2 of the second input buffer (Y), steps S503 and S504 are performed. Step S503 shows a case where split _ k is not used, and step S504 shows a case where split _ k is used.
In step S503, the first matrix loading times loadA_nspk, the second matrix loading times loadB_nspk, the total loading cycle number loadA_cycle corresponding to the first matrix, the total loading cycle number loadB_cycle corresponding to the second matrix, and the first utilization rate are calculated.
For example, if the M direction is traversed before the N direction, then loadA_nspk = N × K × d_s/b_s2, loadB_nspk = 1, loadA_cycle = load_bufferA_cycle × loadA_nspk = M × K × loadA_nspk × d_s/matA_bw, and loadB_cycle = load_bufferB_cycle = N × K × d_s/matB_bw.
For example, if the N direction is traversed before the M direction, then loadA_nspk = 1, loadB_nspk = M × K × d_s/b_s1, loadA_cycle = load_bufferA_cycle = M × K × d_s/matA_bw, and loadB_cycle = load_bufferB_cycle × loadB_nspk = N × K × loadB_nspk × d_s/matB_bw.
For example, the first utilization rate Util_normal is: Util_normal = GEMM_cycle/(max(GEMM_cycle, min(loadA_cycle, loadB_cycle))), where loadA_cycle = M × K × loadA_nspk × d_s/matA_bw and loadB_cycle = N × K × loadB_nspk × d_s/matB_bw.
As shown in fig. 5, in step S504, with the current calculation size Acc_buffer set to the maximum size Acc_buffer_max of the accumulation buffer, the first split utilization rate is calculated, i.e., the procedure "calc_util_split" is executed. When the procedure "calc_util_split" is executed, the first matrix loading times loadA_spk and the second matrix loading times loadB_spk are first calculated, and loadA_spk and loadB_spk satisfy the following formulas:
(a) loadA_spk*loadB_spk = M*N*d_s/Acc_buffer;
(b) loadA_spk/loadB_spk = (M*matB_bw)/(N*matA_bw).
Based on the above formulas (a) and (b), loadA_spk and loadB_spk can be calculated. Then, the general matrix multiplication cycle number GEMM_cycle, the total loading cycle number loadA_cycle corresponding to the first matrix, and the total loading cycle number loadB_cycle corresponding to the second matrix are calculated: GEMM_cycle = M × N × K × d_s/hw_power, loadA_cycle = M × K × loadA_spk × d_s/matA_bw, and loadB_cycle = N × K × loadB_spk × d_s/matB_bw. Finally, the first split utilization rate Util_split is calculated, e.g., Util_split = GEMM_cycle/max(GEMM_cycle, loadA_cycle, loadB_cycle).
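The procedure "calc_util_split" can be sketched directly from formulas (a) and (b). The closed-form square-root solution is an assumption (the text only states the two conditions), and no rounding of the load counts is applied; the test values are illustrative.

```python
import math

def calc_util_split(M, N, K, d_s, matA_bw, matB_bw, hw_power, acc_buffer):
    # Solve (a) loadA_spk*loadB_spk = M*N*d_s/Acc_buffer and
    #       (b) loadA_spk/loadB_spk = (M*matB_bw)/(N*matA_bw).
    prod = M * N * d_s / acc_buffer
    ratio = (M * matB_bw) / (N * matA_bw)
    loadA_spk = math.sqrt(prod * ratio)
    loadB_spk = math.sqrt(prod / ratio)
    # Cycle counts, then the first split utilization rate Util_split.
    gemm_cycle = M * N * K * d_s / hw_power
    loadA_cycle = M * K * loadA_spk * d_s / matA_bw
    loadB_cycle = N * K * loadB_spk * d_s / matB_bw
    return gemm_cycle / max(gemm_cycle, loadA_cycle, loadB_cycle)
```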
For example, in step S504, the maximum size Acc_buffer_max of the accumulation buffer (i.e., the maximum available size of the accumulation buffer) is obtained from the input parameters.
As shown in fig. 5, in step S505, the first utilization rate Util_normal and the first cut utilization rate Util_split are compared. When the first utilization rate Util_normal is greater than the first cut utilization rate Util_split (Y), step S506 is performed.
In step S506, a first dimension outer layer PARTITION parameter PARTITION_M, a second dimension PARTITION parameter PARTITION_K, and a third dimension outer layer PARTITION parameter PARTITION_N are calculated, where PARTITION_M = b_s1/(K×d_s), PARTITION_N = b_s2/(K×d_s), and PARTITION_K = K.
Then, in step S507, the loading bandwidth matA_bw corresponding to the first matrix and the loading bandwidth matB_bw corresponding to the second matrix are compared. When the loading bandwidth matA_bw corresponding to the first matrix is smaller than the loading bandwidth matB_bw corresponding to the second matrix (Y), step S508 is executed.
In step S508, the outer loop flag outloop_rowmajor_flag is determined to be 1, and the first matrix loading count loadA_nspk, the second matrix loading count loadB_nspk, and the target utilization rate of the matrix multiplication operation are calculated, where loadA_nspk = 1 (not shown), loadB_nspk = M/PARTITION_M, and the target utilization rate of the matrix multiplication operation is expressed as: Util = min(N×matA_bw, M×matB_bw/loadB_nspk)/hw_power.
When the loading bandwidth matA_bw corresponding to the first matrix is greater than or equal to the loading bandwidth matB_bw corresponding to the second matrix (N), step S509 is executed.
In step S509, the outer loop flag outloop_rowmajor_flag is determined to be 0, and the first matrix loading count loadA_nspk, the second matrix loading count loadB_nspk, and the target utilization rate of the matrix multiplication operation are calculated, where loadA_nspk = N/PARTITION_N, loadB_nspk = 1 (not shown), and the target utilization rate Util is expressed as: Util = min(N×matA_bw/loadA_nspk, M×matB_bw)/hw_power.
As shown in fig. 5, when the loading bandwidth matA_bw corresponding to the first matrix is smaller than the loading bandwidth matB_bw corresponding to the second matrix, in step S510, the output parameters include: the first dimension outer layer PARTITION parameter PARTITION_M = b_s1/(K×d_s), the second dimension PARTITION parameter PARTITION_K = K, the third dimension outer layer PARTITION parameter PARTITION_N = b_s2/(K×d_s), the first matrix loading count loadA_nspk = 1, the second matrix loading count loadB_nspk = M/PARTITION_M, the dimension split flag Need_split_flag = 0, the accumulation buffer flag Acc_buffer_flag = 0 (not shown), the target size Acc_buffer_need of the accumulation buffer = 0, and the target utilization rate of the matrix multiplication operation gemm_utilization = Util (the Util calculated in step S508).
As shown in fig. 5, when the loading bandwidth matA_bw corresponding to the first matrix is greater than or equal to the loading bandwidth matB_bw corresponding to the second matrix, in step S511, the output parameters include: the first dimension outer layer PARTITION parameter PARTITION_M = b_s1/(K×d_s), the second dimension PARTITION parameter PARTITION_K = K, the third dimension outer layer PARTITION parameter PARTITION_N = b_s2/(K×d_s), the first matrix loading count loadA_nspk = N/PARTITION_N, the second matrix loading count loadB_nspk = 1, the dimension split flag Need_split_flag = 0, the accumulation buffer flag Acc_buffer_flag = 0 (not shown), the target size Acc_buffer_need of the accumulation buffer = 0, and the target utilization rate of the matrix multiplication operation gemm_utilization = Util (the Util calculated in step S509).
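The bandwidth comparison of steps S506 to S511 can be summarized in a short sketch. This is an illustrative Python transcription of the formulas above (names mirror the text; integer rounding of the partition parameters and loading counts is omitted):

```python
def partition_no_split(M, N, K, d_s, b_s1, b_s2, matA_bw, matB_bw, hw_power):
    """Outer partition parameters and loop order when K is not split."""
    PARTITION_M = b_s1 / (K * d_s)               # S506
    PARTITION_N = b_s2 / (K * d_s)
    PARTITION_K = K
    if matA_bw < matB_bw:                        # S507 -> S508
        outloop_rowmajor_flag = 1                # A loaded once, B streamed
        loadA_nspk, loadB_nspk = 1, M / PARTITION_M
        util = min(N * matA_bw, M * matB_bw / loadB_nspk) / hw_power
    else:                                        # S507 -> S509
        outloop_rowmajor_flag = 0                # B loaded once, A streamed
        loadA_nspk, loadB_nspk = N / PARTITION_N, 1
        util = min(N * matA_bw / loadA_nspk, M * matB_bw) / hw_power
    return {
        "PARTITION_M": PARTITION_M, "PARTITION_N": PARTITION_N,
        "PARTITION_K": PARTITION_K, "loadA_nspk": loadA_nspk,
        "loadB_nspk": loadB_nspk,
        "outloop_rowmajor_flag": outloop_rowmajor_flag,
        "gemm_utilization": util,
    }
```
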
For example, as shown in fig. 5, in step S512, a first data block number block_num_M and a second data block number block_num_N are calculated, where block_num_M = PARTITION_M/block_size_M and block_num_N = PARTITION_N/block_size_N.
For example, as shown in fig. 5, in step S513, based on the synchronization granularity sync_granularity, a third data block number TILE_M_block and a fourth data block number TILE_N_block are calculated, where TILE_M_block = min(sync_granularity/(2×d_s), block_num_M) and TILE_N_block = max(1, sync_granularity/(2×d_s)/TILE_M_block).
For example, as shown in fig. 5, in step S514, the output parameters further include a first dimension inner layer sub-block parameter TILE_M and a third dimension inner layer sub-block parameter TILE_N, where TILE_M = TILE_M_block×block_size_M and TILE_N = TILE_N_block×block_size_N.
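Steps S512 to S514 can be transcribed directly. The sketch below is illustrative only; block_size_M, block_size_N, and sync_granularity are taken as given input parameters, as in the text:

```python
def tile_params(PARTITION_M, PARTITION_N, block_size_M, block_size_N,
                sync_granularity, d_s):
    """Inner sub-block (TILE) parameters from steps S512-S514."""
    block_num_M = PARTITION_M / block_size_M                      # S512
    block_num_N = PARTITION_N / block_size_N
    TILE_M_block = min(sync_granularity / (2 * d_s), block_num_M)  # S513
    TILE_N_block = max(1, sync_granularity / (2 * d_s) / TILE_M_block)
    TILE_M = TILE_M_block * block_size_M                          # S514
    TILE_N = TILE_N_block * block_size_N
    return TILE_M, TILE_N
```
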
As shown in fig. 5, in step S505, when the first utilization rate Util_normal is equal to or less than the first cut utilization rate Util_split (N), steps S515 to S517 are executed.
For example, in step S515, the current calculation size Acc_buffer is set to Acc_buffer_max/2, and then the above Procedure "calc_util_split" is executed iteratively (halving Acc_buffer each time) to calculate a new split utilization rate util_split_new, until util_split_new = util_split; at that point, util_split_new is the target split utilization rate.
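The iteration of step S515 can be sketched as follows. Here calc_util_split stands for a hypothetical callable returning the split utilization rate for a given accumulation-buffer size; the convergence test follows the text, and the sketch assumes (as the text does) that the utilization rate stabilizes:

```python
def find_target_split(calc_util_split, acc_buffer_max):
    """Halve the accumulation-buffer size until the split utilization
    rate stops changing; return the target rate and the matching size."""
    acc_buffer = acc_buffer_max
    util_split = calc_util_split(acc_buffer)
    while True:
        acc_buffer /= 2                          # Acc_buffer_max/2, /4, ...
        util_split_new = calc_util_split(acc_buffer)
        if util_split_new == util_split:         # converged: target reached
            return util_split_new, acc_buffer
        util_split = util_split_new
```
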
For example, in step S516, a first dimension outer layer blocking parameter PARTITION_M, a second dimension blocking parameter PARTITION_K, and a third dimension outer layer blocking parameter PARTITION_N are calculated, where PARTITION_M = M/loadB_spk, PARTITION_N = N/loadA_spk, and PARTITION_K = min(b_s1/(PARTITION_M×d_s), b_s2/(PARTITION_N×d_s)). For example, in step S516, loadA_spk and loadB_spk are calculated based on the current calculation size corresponding to the target split utilization rate.
For example, in step S517, the output parameters include: the first dimension outer layer blocking parameter PARTITION_M, the second dimension blocking parameter PARTITION_K, the third dimension outer layer blocking parameter PARTITION_N, the first matrix loading count loadA_spk, the second matrix loading count loadB_spk, the dimension split flag Need_split_flag = 1, the accumulation buffer flag Acc_buffer_flag = 1 (not shown), and the target size Acc_buffer_need of the accumulation buffer, where Acc_buffer_need is the current calculation size corresponding to the target split utilization rate, and the target utilization rate of the matrix multiplication operation gemm_utilization = util_split_new. Then, after step S517 is executed, steps S512 to S514 may be executed; reference is made to the above description, which is not repeated here.
As shown in fig. 5, in step S502, when the first matrix size matA_size is equal to or smaller than the size b_s1 of the first input buffer and/or the second matrix size matB_size is equal to or smaller than the size b_s2 of the second input buffer (N), steps S518 to S522 are performed.
For example, in step S518, the target utilization rate util of the matrix multiplication operation is calculated, where util = min(1, (N×matA_bw)/hw_power, (M×matB_bw)/hw_power).
For example, in step S519, the first matrix dimension M and the third matrix dimension N are compared. When the first matrix dimension M is smaller than the third matrix dimension N (Y), step S520 is performed; when the first matrix dimension M is equal to or greater than the third matrix dimension N (N), step S521 is performed.
In step S520, a first dimension outer layer PARTITION parameter PARTITION_M, a second dimension PARTITION parameter PARTITION_K, and a third dimension outer layer PARTITION parameter PARTITION_N are calculated, where PARTITION_M = M, PARTITION_N = min(b_s2/(K×d_s), N), and PARTITION_K = K.
In step S521, a first dimension outer layer PARTITION parameter PARTITION_M, a second dimension PARTITION parameter PARTITION_K, and a third dimension outer layer PARTITION parameter PARTITION_N are calculated, where PARTITION_M = min(b_s1/(K×d_s), M), PARTITION_N = N, and PARTITION_K = K.
Then, after steps S520 and S521 are performed, step S512 is performed, followed by step S522.
For example, in step S522, the output parameters include: the first dimension outer layer blocking parameter PARTITION_M, the second dimension blocking parameter PARTITION_K, the third dimension outer layer blocking parameter PARTITION_N, the first matrix loading count loadA_nspk = 1, the second matrix loading count loadB_nspk = 1, the dimension split flag Need_split_flag = 0, the accumulation buffer flag Acc_buffer_flag = 0 (not shown), the target size Acc_buffer_need of the accumulation buffer = 0, and the target utilization rate of the matrix multiplication operation gemm_utilization = util (the util calculated in step S518). Then, after step S522 is performed, steps S513 to S514 may be executed; reference is made to the above description, which is not repeated here.
In step S522, when the first matrix dimension M is smaller than the third matrix dimension N, PARTITION_N = min(b_s2/(K×d_s), N), PARTITION_M = M, and PARTITION_K = K; when the first matrix dimension M is equal to or greater than the third matrix dimension N, PARTITION_M = min(b_s1/(K×d_s), M), PARTITION_N = N, and PARTITION_K = K.
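Steps S518 to S521 reduce to the following sketch (an illustrative transcription of the formulas above; rounding of the partition parameters to hardware block sizes is omitted):

```python
def partition_fits_buffer(M, N, K, d_s, b_s1, b_s2, matA_bw, matB_bw,
                          hw_power):
    """Partition parameters when at least one matrix fits its buffer."""
    # S518: target utilization rate
    util = min(1, N * matA_bw / hw_power, M * matB_bw / hw_power)
    if M < N:                                    # S519 -> S520
        PARTITION_M, PARTITION_K = M, K
        PARTITION_N = min(b_s2 / (K * d_s), N)
    else:                                        # S519 -> S521
        PARTITION_N, PARTITION_K = N, K
        PARTITION_M = min(b_s1 / (K * d_s), M)
    return PARTITION_M, PARTITION_K, PARTITION_N, util
```
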
At least one embodiment of the present disclosure also provides a method for convolution operations. Fig. 6 is a schematic flow chart of a method for convolution operation according to some embodiments of the present disclosure.
For example, in some embodiments, a convolution operation refers to operating on an input image with a convolution kernel.
As shown in fig. 6, the method for convolution operation includes the following steps S30 and S40.
Step S30, determining a first convolution operation matrix and a second convolution operation matrix based on convolution input parameters of convolution operation;
step S40, taking the first convolution operation matrix as the first matrix and the second convolution operation matrix as the second matrix, executing the blocking method according to any embodiment of the present disclosure.
The blocking method provided by the embodiments of the present disclosure can be applied to the convolution operation of a neural network by performing only a simple dimension mapping.
For example, in step S30, the convolution input parameter may be expressed as an input tensor (tensor) of the convolution operation. Convolution input parameters of the convolution operation include parameters corresponding to the convolution kernel and parameters corresponding to the input image.
For example, the parameters corresponding to the convolution kernels may include a first kernel size, a second kernel size, the number of input channels, and the number of convolution kernels for each convolution kernel, and the parameters corresponding to the input image may include a first image size, a second image size, and the number of input images.
For example, the first kernel size and the second kernel size may be the same, e.g. 3 each, i.e. each convolution kernel is a 3 x 3 convolution kernel. Embodiments of the present disclosure are not limited thereto, and the first core size and the second core size may not be the same.
For example, the first convolution operation matrix has a first convolution matrix dimension and a second convolution matrix dimension, and the second convolution operation matrix has a second convolution matrix dimension and a third convolution matrix dimension. When the first convolution operation matrix is used as the first matrix and the second convolution operation matrix is used as the second matrix, the first convolution matrix dimension is the first matrix dimension, the second convolution matrix dimension is the second matrix dimension, and the third convolution matrix dimension is the third matrix dimension.
For example, in some embodiments, step S30 includes: determining a first convolution matrix dimension based on the number of convolution kernels; determining a second convolution matrix dimension based on the first kernel size, the second kernel size and the number of input channels; a third convolution matrix dimension is determined based on the number of input images, the first image size, and the second image size.
For example, a first convolution matrix dimension corresponds to the number of convolution kernels (i.e., the number of output channels), a second convolution matrix dimension corresponds to the product of the first kernel size, the second kernel size, and the number of input channels, and a third convolution matrix dimension corresponds to the product of the first image size, the second image size, and the number of input images.
When the blocking method provided by the embodiments of the present disclosure is applied to the convolution operation of a neural network, dimension mapping needs to be performed. For example, the first kernel size is represented by KH, the second kernel size by KW, the number of input channels by IC, the first image size by H, the second image size by W, the number of input images by SN, and the number of convolution kernels by OC (output channels). The mapping relationship is as follows: the number of convolution kernels OC is mapped to the first convolution matrix dimension, i.e., the first matrix dimension M = OC; IC×KH×KW is mapped to the second convolution matrix dimension, i.e., the second matrix dimension K = IC×KH×KW; and SN×H×W is mapped to the third convolution matrix dimension, i.e., the third matrix dimension N = SN×H×W. That is, the first matrix is a mapping of the weight in the convolution operation, and the second matrix is a mapping of the activation in the convolution operation.
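The dimension mapping can be stated in a few lines. This is a direct transcription of the mapping relationship above (the function name is illustrative):

```python
def conv_to_gemm_dims(OC, IC, KH, KW, SN, H, W):
    """Map convolution parameters to the three GEMM dimensions."""
    M = OC              # number of convolution kernels (output channels)
    K = IC * KH * KW    # input channels x kernel height x kernel width
    N = SN * H * W      # number of images x image height x image width
    return M, K, N
```
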
For the convolution operation, the related operations may be executed according to the mapping relationship, that is, by using the blocking method provided by the embodiment of the present disclosure, to obtain the blocking parameters and the cyclic order.
It should be noted that, for convolution operations, when calculating the number of loading cycles of data (the loading cost), the values in the KH and KW dimensions are reusable due to the reusability of the activation data. Therefore, the calculation of each cycle count becomes:
GEMM_cycle = OC×(SN×H×W)×(IC×KH×KW)×d_s/hw_power,
loadA_cycle = IC×OC×KH×KW×d_s/matA_bw,
loadB_cycle = SN×H×W×IC×d_s/matB_bw.
Here, d_s is the data size of each data in each of the convolution kernel and the input image.
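The three cycle counts can be sketched as follows (a transcription of the formulas above; note that loadB_cycle carries no KH×KW factor because the activation data is reused across kernel offsets):

```python
def conv_cycles(OC, IC, KH, KW, SN, H, W, d_s, hw_power, matA_bw, matB_bw):
    """Compute and loading cycle counts for the convolution mapping."""
    GEMM_cycle = OC * (SN * H * W) * (IC * KH * KW) * d_s / hw_power
    loadA_cycle = IC * OC * KH * KW * d_s / matA_bw   # weights
    loadB_cycle = SN * H * W * IC * d_s / matB_bw     # activation, reused
    return GEMM_cycle, loadA_cycle, loadB_cycle
```
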
The rest of the calculation is the same as the description in the above blocking method, and is not described herein again.
At least one embodiment of the present disclosure also provides a block apparatus applied to a matrix multiplication operation. Fig. 7 is a schematic block diagram of a block device according to at least one embodiment of the present disclosure.
For example, as shown in fig. 7, the blocking device 70 may include a memory 705 and a processor 710. The memory 705 is used to non-transitorily store computer-executable instructions adapted to be executed by the processor; the processor 710 is configured to execute the computer-executable instructions, which, when executed by the processor 710, may cause the processor 710 to perform one or more steps of the blocking method according to any embodiment of the present disclosure. For specific implementation and related explanation of each step of the blocking method, reference may be made to the above embodiments of the blocking method, which are not repeated here.
It should be noted that the components of the block device 70 shown in fig. 7 are only exemplary and not limiting, and the block device 70 may have other components according to the actual application.
For example, in some embodiments, the chunking device 70 may further include an input module configured to obtain input parameters. For example, the computer-executable instructions, when executed by the processor 710, may cause the processor 710 to control the input module to obtain input parameters and perform one or more steps of the above-described blocking method based on the input parameters.
For example, the processor 710 and the memory 705 may be in direct or indirect communication with each other.
For example, the processor 710 and the memory 705 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks, and the present disclosure is not limited as to the type and functionality of the network. Also for example, the processor 710 and the memory 705 may communicate via a bus connection. The bus may be a peripheral component interconnect standard (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, or the like.
For example, the processor 710 and the memory 705 may be disposed on a server side (or a cloud side), or may be disposed on a client side (e.g., a mobile device such as a mobile phone).
For example, the processor 710 may be a device having data processing and/or instruction execution capabilities, such as a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU), and may control other components in the blocking device 70 to perform desired functions. The Central Processing Unit (CPU) may be of an X86 or ARM architecture, etc.
For example, the memory 705 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory, or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-executable instructions may be stored on the computer-readable storage medium and executed by the processor 710 to implement the various functions of the blocking device 70. Various applications and data, as well as various data used and/or generated by the applications, may also be stored in the memory 705.
It should be noted that the blocking device 70 can achieve similar technical effects to the blocking method described above, and repeated descriptions are omitted.
At least one embodiment of the present disclosure also provides an apparatus for convolution operations. Fig. 8 is a schematic block diagram of an apparatus for convolution operation according to at least one embodiment of the present disclosure.
For example, as shown in fig. 8, the apparatus 80 for convolution operation may include a memory 805 and a processor 810. The memory 805 is used to non-transitorily store computer-executable instructions adapted to be executed by the processor; the processor 810 is configured to execute the computer-executable instructions, which, when executed by the processor 810, may cause the processor 810 to perform one or more steps of the method for convolution operation according to any embodiment of the present disclosure. For specific implementation and related explanation of each step of the method for convolution operation, reference may be made to the above embodiment of the method for convolution operation, which is not repeated here.
It should be noted that the components of the apparatus 80 for convolution operations shown in fig. 8 are only exemplary and not limiting, and the apparatus 80 for convolution operations may have other components according to the practical application.
For example, in some embodiments, the apparatus 80 for convolution operations may further include an input module configured to obtain convolution input parameters for the convolution operations. For example, the computer executable instructions, when executed by the processor 810, may cause the processor 810 to control the input module to obtain convolution input parameters for a convolution operation, and to perform one or more steps of the above-described method for convolution operations based on the convolution input parameters for the convolution operation.
For example, the processor 810 and the memory 805 may be in direct or indirect communication with each other.
For example, the processor 810 and the memory 805 may communicate over a network connection. The network may comprise a wireless network, a wired network, and/or any combination of wireless and wired networks, and the present disclosure is not limited as to the type and functionality of the network. As another example, the processor 810 and the memory 805 may also communicate via a bus connection. The bus may be a peripheral component interconnect standard (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, or the like.
For example, the processor 810 and the memory 805 may be disposed on a server side (or a cloud side) or a client side (e.g., a mobile device such as a mobile phone).
For example, the processor 810 may be a device having data processing capability and/or instruction execution capability, such as a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU), and may control other components in the apparatus 80 for convolution operation to perform desired functions. The Central Processing Unit (CPU) may be of an X86 or ARM architecture, etc.
For example, the memory 805 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory, or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-executable instructions may be stored on the computer-readable storage medium and executed by the processor 810 to implement the various functions of the apparatus 80 for convolution operation. Various applications and data, as well as various data used and/or generated by the applications, may also be stored in the memory 805.
It should be noted that the apparatus 80 for convolution operation can achieve similar technical effects to the foregoing method for convolution operation, and repeated descriptions are omitted here.
At least one embodiment of the present disclosure also provides a computer-readable storage medium. Fig. 9 is a schematic diagram of a computer-readable storage medium according to at least one embodiment of the present disclosure.
For example, as shown in fig. 9, one or more computer-executable instructions 1001 may be stored non-transitory on a computer-readable storage medium 1000.
For example, the computer-executable instructions 1001, when executed by a computer, may perform one or more steps of the blocking method according to any embodiment of the present disclosure.
Also for example, the computer-executable instructions 1001, when executed by a computer, may also perform one or more steps of a method for convolution operations according to any embodiment of the present disclosure.
For example, the computer-readable storage medium 1000 may be applied to the above-mentioned blocking device 70 or the apparatus 80 for convolution operation; for example, it may be the memory 705 in the blocking device 70 or the memory 805 in the apparatus 80 for convolution operation.
For example, the computer-readable storage medium 1000 may be a non-transitory computer-readable storage medium.
For example, the description of the computer-readable storage medium 1000 may refer to the description of the memory 705 in the embodiment of the block device 70 or refer to the memory 805 in the embodiment of the device 80 for performing convolution operation, and repeated descriptions are omitted.
For the present disclosure, there are also the following points to be explained:
(1) the drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to the common design.
(2) Thicknesses and dimensions of layers or structures may be exaggerated in the drawings used to describe embodiments of the present disclosure for clarity. It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "directly under" the other element, or intervening elements may be present.
(3) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims (22)

1. A blocking method for use in a matrix multiplication operation, wherein the matrix multiplication operation is used to implement a multiplication operation between a first matrix and a second matrix,
the blocking method comprises the following steps:
determining input parameters based on the first matrix and the second matrix, wherein the input parameters comprise dimension parameters of the first matrix, dimension parameters of the second matrix, data size of each data in each of the first matrix and the second matrix, loading bandwidth corresponding to the first matrix, loading bandwidth corresponding to the second matrix, and size of an input buffer;
obtaining output parameters based on the input parameters, wherein the output parameters comprise a cyclic order and blocking parameters corresponding to the first matrix and the second matrix.
2. The blocking method according to claim 1, wherein the dimension parameters of the first matrix comprise a first matrix dimension and a second matrix dimension, the dimension parameters of the second matrix comprise the second matrix dimension and a third matrix dimension,
the blocking parameters include a first-dimension outer-layer blocking parameter related to the first matrix dimension, a first-dimension inner-layer blocking parameter related to the first-dimension outer-layer blocking parameter, a second-dimension blocking parameter related to the second matrix dimension, a third-dimension outer-layer blocking parameter related to the third matrix dimension, and a third-dimension inner-layer blocking parameter related to the third-dimension outer-layer blocking parameter,
the cyclic order is used to indicate an order of the first dimension outer layer blocking parameter, the first dimension inner layer blocking parameter, the second dimension blocking parameter, the third dimension outer layer blocking parameter, and the third dimension inner layer blocking parameter.
3. The blocking method according to claim 2, wherein the input buffers include a first input buffer for buffering data in the first matrix and a second input buffer for buffering data in the second matrix, the sizes of the input buffers include a size of the first input buffer and a size of the second input buffer,
obtaining an output parameter based on the input parameter, including:
determining a first matrix size corresponding to the first matrix based on the dimension parameter of the first matrix and the data size;
determining a second matrix size corresponding to the second matrix based on the dimension parameter of the second matrix and the data size;
comparing the first matrix size with the size of the first input buffer and comparing the second matrix size with the size of the second input buffer to determine a first comparison result;
determining the output parameter based on the first comparison result.
4. The blocking method of claim 3, wherein determining the output parameter based on the first comparison result comprises:
in response to the first comparison result indicating that the first matrix size is equal to or smaller than the size of the first input buffer and/or the second matrix size is equal to or smaller than the size of the second input buffer,
determining the second dimension blocking parameter as the second matrix dimension;
comparing the first matrix dimension and the third matrix dimension to determine a second comparison result;
determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the second comparison result.
5. The blocking method of claim 4, wherein determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the second comparison result comprises:
in response to the second comparison result indicating that the first matrix dimension is less than the third matrix dimension:
determining the first dimension outer layer blocking parameter as the first matrix dimension;
determining the third dimension outer layer blocking parameter based on the data size, the dimension parameter of the second matrix, and the size of the second input buffer;
in response to the second comparison result indicating that the first matrix dimension is greater than or equal to the third matrix dimension:
determining the third dimension outer layer blocking parameter as the third matrix dimension;
determining the first dimension outer layer blocking parameter based on the data size, the dimension parameter of the first matrix, and the size of the first input buffer.
6. The blocking method of claim 5, wherein determining the first dimension outer layer blocking parameter based on the data size, the dimension parameter of the first matrix, and the size of the first input buffer comprises:
determining a first intermediate blocking parameter based on the data size, the second matrix dimension, and the size of the first input buffer;
determining the first dimension outer layer blocking parameter as the first intermediate blocking parameter in response to the first intermediate blocking parameter being less than the first matrix dimension,
determining the first dimension outer layer blocking parameter as the first matrix dimension in response to the first intermediate blocking parameter being greater than or equal to the first matrix dimension.
7. The blocking method according to claim 5 or 6, wherein determining the third dimension outer layer blocking parameter based on the data size, the dimension parameter of the second matrix, and the size of the second input buffer comprises:
determining a second intermediate blocking parameter based on the data size, the second matrix dimension, and the size of the second input buffer;
determining the third dimension outer layer blocking parameter as the second intermediate blocking parameter in response to the second intermediate blocking parameter being less than the third matrix dimension,
determining the third dimension outer layer blocking parameter as the third matrix dimension in response to the second intermediate blocking parameter being greater than or equal to the third matrix dimension.
8. The blocking method according to any one of claims 3 to 6, wherein determining the output parameter based on the first comparison result comprises:
in response to the first comparison result indicating that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer,
comparing the first matrix dimension and the third matrix dimension to determine a second comparison result;
determining an outer loop flag based on the second comparison result.
9. The blocking method of claim 8, wherein determining the outer loop flag based on the second comparison result comprises:
in response to the second comparison result indicating that the first matrix dimension is less than the third matrix dimension:
determining the outer loop flag as a first flag, wherein the outer loop flag is used to indicate an order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the loop order, and the first flag indicates that the first dimension outer layer blocking parameter is located before the third dimension outer layer blocking parameter in the loop order;
in response to the second comparison result indicating that the first matrix dimension is greater than or equal to the third matrix dimension:
determining the outer loop flag as a second flag, wherein the second flag indicates that the first dimension outer layer blocking parameter is located after the third dimension outer layer blocking parameter in the loop order.
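The loop-order choice in claim 9 reduces to one comparison. A minimal sketch, with flag values that are purely illustrative (the claims do not assign concrete values to the first and second flags):

```python
FIRST_FLAG = 0   # first dimension outer block comes first in the loop order
SECOND_FLAG = 1  # third dimension outer block comes first in the loop order

def choose_outer_loop_flag(first_matrix_dim, third_matrix_dim):
    # Claim 9: the first dimension leads the loop order only when it is
    # strictly smaller than the third dimension; ties fall to the second flag.
    if first_matrix_dim < third_matrix_dim:
        return FIRST_FLAG
    return SECOND_FLAG
```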
10. The blocking method of claim 8, wherein determining the output parameter based on the first comparison result further comprises:
determining an order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the loop order based on the outer loop flag.
11. The blocking method of any of claims 3-6, wherein the input parameters further comprise: the maximum size of the accumulation buffer and the hardware computing power,
determining the output parameter based on the first comparison result, including:
in response to the first comparison result indicating that the first matrix size is greater than the size of the first input buffer and the second matrix size is greater than the size of the second input buffer,
in response to not splitting the second matrix dimension, calculating a first utilization rate based on the dimension parameter of the first matrix, the dimension parameter of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, and the size of the second input buffer;
in response to splitting the second matrix dimension, taking the maximum size of the accumulation buffer as a current calculation size, and calculating a first splitting utilization rate based on the dimension parameter of the first matrix, the dimension parameter of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the size of the first input buffer, the size of the second input buffer, and the current calculation size;
when the first utilization rate is greater than or equal to the first splitting utilization rate:
determining the second dimension blocking parameter as the second matrix dimension;
determining the first dimension outer layer blocking parameter based on the second matrix dimension, the data size, and a size of the first input buffer;
determining the third dimension outer layer blocking parameter based on the second matrix dimension, the data size, and a size of the second input buffer.
12. The blocking method of claim 11, wherein determining the output parameter based on the first comparison result further comprises:
when the first utilization rate is smaller than the first splitting utilization rate:
determining an adjusted size based on an adjustment coefficient and the current calculation size, and iteratively calculating a utilization rate by taking the adjusted size as the current calculation size, until a utilization rate smaller than the first splitting utilization rate is obtained as a target splitting utilization rate;
determining the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter based on the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, and the current calculation size corresponding to the target splitting utilization rate;
determining the second dimension blocking parameter based on a size of the first input buffer, a size of the second input buffer, the first dimension outer layer blocking parameter, the third dimension outer layer blocking parameter, and the data size.
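Claim 12's iterative shrinking of the current calculation size can be sketched as below. The halving adjustment coefficient and the utilization function are assumptions; the claims only require that the size is repeatedly adjusted, starting from the maximum accumulation buffer size of claim 11, until the computed utilization falls below the first splitting utilization rate.

```python
def find_target_split_size(max_accum_buffer_size, first_split_util,
                           utilization_fn, adjust_coeff=0.5):
    # Start from the maximum accumulation buffer size (claim 11) and
    # repeatedly apply the adjustment coefficient (claim 12) until the
    # utilization computed at the current size drops below the first
    # splitting utilization rate; that utilization is the target one.
    current = max_accum_buffer_size
    util = utilization_fn(current)
    while util >= first_split_util and current > 1:
        current = max(1, int(current * adjust_coeff))
        util = utilization_fn(current)
    return current, util
```

The returned size is what claim 12 then feeds into the derivation of the outer layer blocking parameters.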
13. The blocking method of claim 11, wherein determining the output parameter based on the first comparison result further comprises:
when the first utilization rate is greater than or equal to the first splitting utilization rate:
in response to the loading bandwidth corresponding to the first matrix being smaller than the loading bandwidth corresponding to the second matrix, determining an outer loop flag as a first flag, wherein the outer loop flag is used to indicate an order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the loop order, and the first flag indicates that the first dimension outer layer blocking parameter is located before the third dimension outer layer blocking parameter in the loop order,
and in response to the loading bandwidth corresponding to the first matrix being greater than or equal to the loading bandwidth corresponding to the second matrix, determining the outer loop flag to be a second flag, wherein the second flag indicates that the first dimension outer layer blocking parameter is located after the third dimension outer layer blocking parameter in the loop order.
14. The blocking method of claim 11, wherein determining the output parameter based on the first comparison result further comprises:
when the first utilization rate is smaller than the first splitting utilization rate, determining an outer loop flag to be an outer layer preset flag, wherein the outer loop flag is used to indicate the order of the first dimension outer layer blocking parameter and the third dimension outer layer blocking parameter in the loop order.
15. The blocking method of any of claims 2-6, wherein the input parameters further comprise a first block parameter and a second block parameter corresponding to a synchronization granularity and a smallest data block,
obtaining an output parameter based on the input parameter, including:
determining the first dimension inner layer blocking parameter and the third dimension inner layer blocking parameter based on the first block parameter, the second block parameter, the first dimension outer layer blocking parameter, the synchronization granularity, and the data size.
16. The blocking method of any of claims 3-6, wherein the input parameters further comprise hardware computing power,
the output parameters further comprise at least one of: a dimension splitting identifier, an accumulation buffer identifier, a target size of the accumulation buffer, first matrix loading times corresponding to the first matrix, second matrix loading times corresponding to the second matrix, and a target utilization rate of the matrix multiplication operation,
wherein the dimension splitting identifier is used to indicate whether the second matrix dimension needs to be split, the accumulation buffer identifier is used to indicate whether an accumulation buffer is needed, and the target size of the accumulation buffer represents the size of the accumulation buffer needed when an accumulation buffer is needed,
determining the output parameter based on the first comparison result, including:
in response to the first comparison result indicating that the first matrix size is equal to or less than the size of the first input buffer and/or the second matrix size is equal to or less than the size of the second input buffer:
determining that the loading times of the first matrix and the loading times of the second matrix are both 1;
determining the dimension splitting identifier to be a first splitting sub-identifier, the accumulation buffer identifier to be a first buffer sub-identifier, and the target size of the accumulation buffer to be 0, wherein the first splitting sub-identifier indicates that the second matrix dimension does not need to be split, and the first buffer sub-identifier indicates that no accumulation buffer is needed;
and determining the target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix.
17. The blocking method of claim 12, wherein the output parameters further comprise at least one of: a dimension splitting identifier, an accumulation buffer identifier, a target size of the accumulation buffer, first matrix loading times corresponding to the first matrix, second matrix loading times corresponding to the second matrix, and a target utilization rate of the matrix multiplication operation,
wherein the dimension splitting identifier is used to indicate whether the second matrix dimension needs to be split, the accumulation buffer identifier is used to indicate whether an accumulation buffer needs to be added, and the target size of the accumulation buffer represents the size of the accumulation buffer needed when one needs to be added,
determining the output parameter based on the first comparison result, including:
when the first utilization rate is greater than or equal to the first splitting utilization rate:
determining the dimension splitting identifier to be a first splitting sub-identifier, the accumulation buffer identifier to be a first buffer sub-identifier, and the target size of the accumulation buffer to be 0, wherein the first splitting sub-identifier indicates that the second matrix dimension does not need to be split, and the first buffer sub-identifier indicates that no accumulation buffer is needed,
in response to the loading bandwidth corresponding to the first matrix being smaller than the loading bandwidth corresponding to the second matrix, determining the first matrix loading times to be 1, and determining the second matrix loading times based on the first dimension outer layer blocking parameter and the first matrix dimension,
in response to the loading bandwidth corresponding to the first matrix being greater than or equal to the loading bandwidth corresponding to the second matrix, determining the first matrix loading times based on the third dimension outer layer blocking parameter and the third matrix dimension, and determining the second matrix loading times to be 1,
determining a target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix; and
when the first utilization rate is smaller than the first splitting utilization rate:
determining the dimension splitting identifier to be a second splitting sub-identifier, the accumulation buffer identifier to be a second buffer sub-identifier, and the target size of the accumulation buffer to be the current calculation size corresponding to the target splitting utilization rate, wherein the second splitting sub-identifier indicates that the second matrix dimension needs to be split, and the second buffer sub-identifier indicates that an accumulation buffer is needed,
determining the first matrix loading times and the second matrix loading times based on the current calculation size corresponding to the target splitting utilization rate, the first matrix dimension, the third matrix dimension, the data size, the loading bandwidth corresponding to the first matrix, and the loading bandwidth corresponding to the second matrix,
and determining the target utilization rate of the matrix multiplication operation based on the dimension parameters of the first matrix, the dimension parameters of the second matrix, the data size, the hardware computing power, the loading bandwidth corresponding to the first matrix, the loading bandwidth corresponding to the second matrix, the first matrix loading times, and the second matrix loading times.
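The loading-count rule of claim 17 (the no-splitting branch) keeps the matrix on the slower-loading path resident and reloads the other once per outer block. A sketch under an assumed ceiling-division count, which the claims do not state explicitly — they only name the inputs:

```python
import math

def loading_times(first_matrix_dim, third_matrix_dim,
                  first_outer_block, third_outer_block,
                  bw_first, bw_second):
    # Claim 17 sketch: when the first matrix loads more slowly, load it
    # once and reload the second matrix once per outer block of the first
    # dimension; otherwise the roles are swapped.
    if bw_first < bw_second:
        return 1, math.ceil(first_matrix_dim / first_outer_block)
    return math.ceil(third_matrix_dim / third_outer_block), 1
```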
18. A method for convolution operation, wherein the convolution operation operates on an input image with a convolution kernel,
the method comprises the following steps:
determining a first convolution operation matrix and a second convolution operation matrix based on convolution input parameters of the convolution operation, wherein the convolution input parameters comprise parameters corresponding to the convolution kernel and parameters corresponding to the input image;
performing the blocking method of any one of claims 1-17 with the first convolution operation matrix as the first matrix and the second convolution operation matrix as the second matrix.
19. The method of claim 18, wherein the parameters corresponding to the convolution kernel comprise a first kernel size and a second kernel size of each convolution kernel, a number of input channels, and a number of convolution kernels, and the parameters corresponding to the input image comprise a first image size, a second image size, and a number of input images,
the first convolution operation matrix having a first convolution matrix dimension and a second convolution matrix dimension, the second convolution operation matrix having the second convolution matrix dimension and a third convolution matrix dimension,
determining a first convolution operation matrix and a second convolution operation matrix based on convolution input parameters of the convolution operation, including:
determining the first convolution matrix dimension based on the number of convolution kernels;
determining the second convolution matrix dimension based on the first kernel size, the second kernel size, and the number of input channels;
determining the third convolution matrix dimension based on the number of input images, the first image size, and the second image size.
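Claim 19's mapping from convolution parameters to the three matrix dimensions is an im2col-style lowering of convolution onto matrix multiplication. A sketch, taking the output spatial size equal to the input image size — an assumption, since the claim states no padding or stride:

```python
def conv_to_gemm_dims(num_kernels, kernel_h, kernel_w, in_channels,
                      num_images, image_h, image_w):
    first_dim = num_kernels                         # claim 19: number of kernels
    second_dim = kernel_h * kernel_w * in_channels  # claim 19: shared dimension
    third_dim = num_images * image_h * image_w      # claim 19: image volume
    return first_dim, second_dim, third_dim
```

The resulting triple plays the role of the first, second, and third matrix dimensions consumed by the blocking method of claims 1-17.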
20. A blocking apparatus for use in matrix multiplication operations, comprising a memory and a processor,
wherein the memory stores computer-executable instructions adapted to be executed by the processor, and the computer-executable instructions, when executed by the processor, perform one or more steps of the blocking method of any one of claims 1-17.
21. An apparatus for convolution operations, comprising a memory and a processor,
wherein the memory stores computer-executable instructions adapted to be executed by the processor, and the computer-executable instructions, when executed by the processor, perform one or more steps of the method for convolution operations of claim 18 or 19.
22. A computer-readable storage medium having non-transitory computer-executable instructions stored thereon,
wherein the computer-executable instructions, when executed by a computer, perform one or more steps of the blocking method of any one of claims 1-17 or one or more steps of the method for convolution operations of claim 18 or 19.
CN202210440010.1A 2022-04-25 2022-04-25 Blocking method and device, convolution operation method and device, and storage medium Pending CN114707114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210440010.1A CN114707114A (en) 2022-04-25 2022-04-25 Blocking method and device, convolution operation method and device, and storage medium

Publications (1)

Publication Number Publication Date
CN114707114A true CN114707114A (en) 2022-07-05

Family

ID=82173960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210440010.1A Pending CN114707114A (en) 2022-04-25 2022-04-25 Blocking method and device, convolution operation method and device, and storage medium

Country Status (1)

Country Link
CN (1) CN114707114A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600062A (en) * 2022-12-14 2023-01-13 深圳思谋信息科技有限公司 Convolution processing method, circuit, electronic device and computer readable storage medium
CN115600062B (en) * 2022-12-14 2023-04-07 深圳思谋信息科技有限公司 Convolution processing method, circuit, electronic device and computer readable storage medium
CN116088773A (en) * 2023-04-11 2023-05-09 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution
CN116088773B (en) * 2023-04-11 2023-06-16 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution
CN116881618A (en) * 2023-08-25 2023-10-13 之江实验室 General matrix multiplication calculation optimization method, device and processor
CN116842307A (en) * 2023-08-28 2023-10-03 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium
CN116842307B (en) * 2023-08-28 2023-11-28 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium
CN116861149A (en) * 2023-09-05 2023-10-10 之江实验室 Convolution operation optimization method, device and processor
CN116861149B (en) * 2023-09-05 2024-01-09 之江实验室 Convolution operation optimization method, device and processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China