CN113468469A - Convolution processing method and device of feature graph executed by computer and electronic equipment - Google Patents


Info

Publication number
CN113468469A
CN113468469A
Authority
CN
China
Prior art keywords
calculated
offset
memory access
iter
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110616970.4A
Other languages
Chinese (zh)
Inventor
章晓
曾平
马骏
王彪
柳俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48: Computations using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/544: Computations using such devices for evaluating functions by calculation


Abstract

The invention provides a computer-implemented convolution processing method and apparatus for a feature map, and an electronic device. The convolution processing method comprises the following steps: determining the current memory access offset based on the initial memory access offset and the offset increment; acquiring, according to the current memory access offset, a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated; performing matrix multiplication on the feature map block matrix to be calculated and the weight block matrix to be calculated to obtain a block calculation result of the feature map tensor to be calculated, and traversing all the feature map block matrices to be calculated in the feature map tensor to be calculated to obtain a plurality of block calculation results; and performing an accumulation operation on the plurality of block calculation results and adding a bias term to the accumulated result to obtain the convolution calculation result of the feature map tensor to be calculated. A convolution operator template implemented with this method can reach optimal performance.

Description

Convolution processing method and device of feature graph executed by computer and electronic equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a convolution processing method and apparatus for a feature map executed by a computer, and an electronic device.
Background
There are three main ways to implement a convolution operator on a GPU. The first is based on rearranging the image into a matrix (Im2col), which converts the convolution operator into a matrix multiplication operator; this method can reuse an optimized matrix multiplication operator library, but introduces extra temporary space and memory copy operations. The second is the Winograd convolution algorithm, which reduces the actual amount of computation algorithmically, but also introduces extra temporary space and, due to precision issues, cannot fully utilize the tensor core computing units. The third is a convolution algorithm based on implicit matrix multiplication (Implicit GEMM), which does not need to explicitly convert the feature map into a matrix that is actually stored and therefore introduces no extra temporary space; it can complete the convolution operator with only one compute kernel, and compared with the other two methods it can better utilize the 8-bit and 4-bit integer tensor core computing units.
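To make the first method concrete, the following pure-Python sketch (illustrative names and sizes, not from the patent text) rearranges a small CHW feature map with Im2col so that the convolution reduces to a matrix-vector product:

```python
def im2col(x, R, S):
    """Rearrange a C x H x W feature map into an (H_out*W_out) x (C*R*S)
    matrix so that convolution becomes a single matrix multiplication.
    Stride 1, no padding, for illustration only."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    H_out, W_out = H - R + 1, W - S + 1
    rows = []
    for h in range(H_out):
        for w in range(W_out):
            rows.append([x[c][h + r][w + s]
                         for c in range(C) for r in range(R) for s in range(S)])
    return rows

def matmul_vec(A, v):
    """Multiply the Im2col matrix by a flattened convolution kernel."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

# 1x3x3 input, 1x2x2 kernel
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [1, 0, 0, 1]  # flattened kernel: picks top-left + bottom-right of each patch
cols = im2col(x, 2, 2)
out = matmul_vec(cols, w)  # flat 2x2 convolution output: [6, 8, 12, 14]
```

The extra `cols` matrix is exactly the temporary storage and memory copy that the Implicit GEMM approach described above avoids.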
For a long time there was no implementation of the convolution algorithm based on implicit matrix multiplication (Implicit GEMM), although recently some research has been carried out on 8-bit and 4-bit integer tensor core convolution operator implementations based on the Implicit GEMM algorithm. However, current convolution operators still have the following defect: when the operator calculates the memory access address of the input feature map, branch judgments and additional integer multiply-add operations are introduced, so that the performance of the implemented convolution operator template is not optimal.
In conclusion, conventional convolution processing algorithms for feature maps cannot make the performance of the convolution operator template optimal.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a computer-implemented convolution processing method and apparatus for a feature map, and an electronic device, so as to alleviate the technical problem that conventional feature map convolution processing algorithms cannot make the performance of the convolution operator template optimal.
In a first aspect, an embodiment of the present invention provides a convolution processing method for a feature map, executed by a computer, comprising: determining the current memory access offset based on the initial memory access offset and the offset increment; acquiring, according to the current memory access offset, a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated; performing a matrix multiplication operation on the feature map block matrix to be calculated and the weight block matrix to be calculated to obtain a block calculation result of the feature map tensor to be calculated; updating the current memory access offset according to the offset increment and repeating the step of acquiring, according to the current memory access offset, the feature map block matrix to be calculated and the weight block matrix to be calculated, until all feature map block matrices to be calculated in the feature map tensor to be calculated have been traversed, obtaining a plurality of block calculation results; and performing an accumulation operation on the plurality of block calculation results and adding a bias term to the accumulated result to obtain the convolution calculation result of the feature map tensor to be calculated.
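The claimed loop can be sketched as follows. This is a minimal sketch, assuming a precomputed increment table that repeats with period R·S; `fetch_block_product` and the other names are illustrative stand-ins, not the patent's API:

```python
def implicit_gemm_loop(fetch_block_product, initial_offset, increments, n_rounds, bias):
    """Sketch of the claimed method: each round reads the feature-map and
    weight block matrices at the current memory access offset, multiplies
    them, and advances the offset by a precomputed increment instead of
    recomputing the offset with integer multiply-add operations."""
    offset = initial_offset          # initial memory access offset
    block_results = []
    for rnd in range(n_rounds):
        block_results.append(fetch_block_product(offset))  # one block matmul result
        offset += increments[rnd % len(increments)]        # precomputed, periodic table
    return sum(block_results) + bias  # accumulate, then add the bias term
```

For example, with a toy `fetch_block_product` that just echoes the offset, `implicit_gemm_loop(lambda o: o, 10, [1, 2], 4, 5)` visits offsets 10, 11, 13 and 14 without ever recomputing them from coordinates.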
Further, the determining the current memory access offset based on the initial memory access offset and the offset increment includes: acquiring the initial memory access offset and the offset increment; and calculating the current memory access offset according to the initial memory access offset and the offset increment.
Further, when the current memory access offset is the initial memory access offset, calculating the current memory access offset according to the initial memory access offset and the offset increment includes: determining that the offset increment is zero; and taking the initial memory access offset as the current memory access offset.
Further, the method further comprises: calculating the initial memory access offset based on the initial input channel number, the initial r-dimension coordinate of the convolution kernel and the initial s-dimension coordinate of the convolution kernel; and calculating the memory access offset of each round, and calculating the offset increment based on the memory access offset of the (iter+1)-th round and the memory access offset of the iter-th round.
Further, the initial memory access offset is calculated according to the memory access offset formula

F_iter = C_iter · C_stride + r_iter · H_stride + s_iter · W_stride,

where C_iter = ⌊(k + iter · BK) / (R · S)⌋, R_iter = (k + iter · BK) mod (R · S), r_iter = ⌊R_iter / S⌋ and s_iter = R_iter mod S; F_iter represents the memory access offset of the iter-th round, C_iter represents the input channel coordinate of the iter-th round, k represents the initial coordinate read by the thread in the thread block, R and S represent the size of the convolution kernel, R_iter represents an intermediate quantity of the iter-th round, r_iter represents the r-dimension coordinate of the convolution kernel in the iter-th round, s_iter represents the s-dimension coordinate of the convolution kernel in the iter-th round, and C_stride, H_stride and W_stride represent the strides (weights) corresponding to C_iter, r_iter and s_iter respectively. The offset increment is calculated according to the formula

ΔF_iter = F_(iter+1) − F_iter,

where ΔF_iter represents the offset increment of the iter-th round, F_(iter+1) represents the memory access offset of the (iter+1)-th round, F_iter represents the memory access offset of the iter-th round, and BK represents one parameter of the block matrix size BM × BN × BK processed by each thread block in each round.
Further, acquiring, according to the current memory access offset, the feature map block matrix to be calculated in the feature map tensor to be calculated and the weight block matrix to be calculated in the weight tensor to be calculated includes: determining, according to the current memory access offset, a first position of the feature map block matrix to be calculated in the feature map tensor to be calculated and a second position of the weight block matrix to be calculated in the weight tensor to be calculated; and acquiring the feature map block matrix to be calculated and the weight block matrix to be calculated according to the first position and the second position.
Further, traversing all the feature map block matrices to be calculated in the feature map tensor to be calculated includes: judging whether any feature map block matrix to be calculated in the feature map tensor to be calculated has not yet participated in the calculation; and if not, determining that all feature map block matrices to be calculated in the feature map tensor to be calculated have been traversed.
Further, the initial memory access offset and the offset increments are stored in a parameter buffer of the device kernel, and the offset increments repeat with a period of R × S, where R and S represent the size of the convolution kernel.
In a second aspect, an embodiment of the present invention further provides a convolution processing apparatus for a feature map, executed by a computer, comprising: a determining unit, configured to determine the current memory access offset based on the initial memory access offset and the offset increment; an acquiring unit, configured to acquire, according to the current memory access offset, a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated; a matrix multiplication unit, configured to perform a matrix multiplication operation on the feature map block matrix to be calculated and the weight block matrix to be calculated to obtain a block calculation result of the feature map tensor to be calculated; a traversal calculation unit, configured to update the current memory access offset according to the offset increment and repeat the acquiring step until all feature map block matrices to be calculated in the feature map tensor to be calculated have been traversed, obtaining a plurality of block calculation results; and an accumulation and bias unit, configured to perform an accumulation operation on the plurality of block calculation results and add a bias term to the accumulated result to obtain the convolution calculation result of the feature map tensor to be calculated.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to any one of the above first aspects when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to execute the steps of the method according to any one of the above first aspects.
In the embodiment of the present invention, the current memory access offset is determined based on the initial memory access offset and the offset increment; a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated are acquired according to the current memory access offset; matrix multiplication is then performed on the two block matrices to obtain a block calculation result of the feature map tensor to be calculated, and a plurality of such block calculation results are obtained; finally, an accumulation operation is performed on the plurality of block calculation results and a bias term is added to the accumulated result, thereby obtaining the convolution calculation result of the feature map tensor to be calculated. As can be seen from the above description, in the computer-implemented convolution processing method of the embodiment of the present invention, the current memory access offset is determined from the initial memory access offset and the offset increment, so no additional multiply-add operations are needed to calculate the memory access offset, which saves the overhead introduced by the offset calculation and improves the processing performance of the computer. The performance of a convolution operator template implemented based on the method of the embodiment of the present invention can reach the optimum, which solves the technical problem that conventional feature map convolution processing algorithms cannot make the performance of the convolution operator template optimal.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 is a flowchart of a convolution processing method for a feature map executed by a computer according to an embodiment of the present invention;
FIG. 3 is a flow chart of another method for convolution processing of a feature map executed by a computer according to an embodiment of the present invention;
fig. 4 is a flowchart of a method for acquiring, according to the current memory access offset, a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated, according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a convolution operator converted to a matrix multiplication operation according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the offset increment in a cycle of R × S according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a calculation process of matrix multiplication according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a convolution processing apparatus for a feature map executed by a computer according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
First, an electronic device 100 for implementing an embodiment of the present invention, which can be used to execute the convolution processing method of a feature map of embodiments of the present invention, is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memories 104, an input device 106, an output device 108, and a camera 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), a Programmable Logic Array (PLA) and an Application-Specific Integrated Circuit (ASIC), and may be a Central Processing Unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client functionality (implemented by the processor) and/or other desired functionality in the embodiments of the invention described below. Various applications and data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The camera 110 is configured to acquire an image. Feature extraction is performed on the acquired image to obtain a feature map to be calculated, which is then processed by the feature map convolution processing method to obtain the convolution calculation result of the feature map tensor to be calculated. For example, the camera may capture an image (e.g., a photograph or a video) desired by a user, features of the image are extracted to obtain the feature map to be calculated, and the feature map is processed by the convolution processing method as described above. The camera may further store the captured image in the memory 104 for use by other components.
Exemplarily, an electronic device for implementing a convolution processing method of a feature map executed by a computer according to an embodiment of the present invention may be implemented as a smart mobile terminal such as a smartphone, a tablet computer, or the like.
Example 2:
In accordance with an embodiment of the present invention, an embodiment of a convolution processing method of a feature map executed by a computer is provided. It is noted that the steps shown in the flowchart of the figure may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that shown here.
Fig. 2 is a flowchart of a convolution processing method of a feature map executed by a computer according to an embodiment of the present invention, as shown in fig. 2, the method including the steps of:
step S202, determining the current memory access offset based on the initial memory access offset and the offset increment;
in the embodiment of the present invention, the initial memory access offset and the offset increment are obtained by pre-calculating, and may be obtained by calculating in the host in advance, and then storing the initial memory access offset and the offset increment obtained by calculation into a parameter buffer of a device convolution kernel, where the size of the offset increment is in a cycle of R × S, and R, S represents the size of the convolution kernel.
The current memory access offset is used to determine, in the current iteration, the positions of the feature map block matrix and the weight block matrix within the feature map tensor and the weight tensor.
Step S204, acquiring, according to the current memory access offset, a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated;
step S206, matrix multiplication is carried out on the feature map block matrix to be calculated and the weight block matrix to be calculated, and block calculation results of the feature map tensor to be calculated are obtained;
when matrix multiplication operation is executed, each thread is executed in a mode that rows of the feature map block matrix to be calculated are multiplied by columns of the weight block matrix to be calculated.
Step S208, judging whether all feature map block matrices to be calculated in the feature map tensor to be calculated have been traversed; if not, updating the current memory access offset according to the offset increment and returning to the step of acquiring, according to the current memory access offset, the feature map block matrix to be calculated and the weight block matrix to be calculated, so as to obtain a plurality of block calculation results including the block calculation result obtained in step S206; if so, executing step S210;
and step S210, performing accumulation operation on the multiple block calculation results, and adding an offset item to the result after the accumulation operation to obtain a convolution calculation result of the feature map tensor to be calculated.
Performing an accumulation operation on the plurality of block calculation results refers to adding up the block calculation results obtained by each thread, and adding a bias term to the accumulated result refers to obtaining the bias term of each thread and then adding it to the accumulated result.
In the embodiment of the present invention, the current memory access offset is determined based on the initial memory access offset and the offset increment; a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated are acquired according to the current memory access offset; matrix multiplication is then performed on the two block matrices to obtain a block calculation result, and all feature map block matrices to be calculated in the feature map tensor to be calculated are traversed to obtain a plurality of block calculation results; finally, an accumulation operation is performed on the plurality of block calculation results and a bias term is added to the accumulated result, thereby obtaining the convolution calculation result of the feature map tensor to be calculated. As can be seen from the above description, the computer-implemented convolution processing method of the embodiment of the present invention determines the current memory access offset from the initial memory access offset and the offset increment, so no additional multiply-add operations are needed to calculate the memory access offset, which saves the overhead introduced by the offset calculation and improves computer performance. The performance of a convolution operator template implemented based on the method of the embodiment of the present invention can reach the optimum, which solves the technical problem that conventional feature map convolution processing algorithms cannot make the performance of the convolution operator template optimal.
The foregoing briefly introduces the convolution processing method of a feature map according to an embodiment of the present invention; the related details are described below.
In another alternative embodiment of the present invention, referring to fig. 3, a convolution processing method of a feature map executed by a computer according to an embodiment of the present invention includes the steps of:
step S202, determining the current memory access offset based on the initial memory access offset and the offset increment;
Step S204, acquiring, according to the current memory access offset, a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated;
step S205, updating the current memory access offset based on the initial memory access offset and the offset increment to obtain the updated memory access offset; acquiring a next eigen map block matrix to be calculated in the eigen map tensor to be calculated and a next weight block matrix to be calculated in the weight tensor to be calculated according to the updated memory access offset, and taking the updated memory access offset as the current memory access offset;
step S206, matrix multiplication is carried out on the feature map block matrix to be calculated and the weight block matrix to be calculated, and block calculation results of the feature map tensor to be calculated are obtained;
step S207, taking the next feature map block matrix block to be calculated as a feature map block matrix to be calculated, and taking the next weight block matrix block to be calculated as a block weight matrix to be calculated;
step S208, judging whether all feature map block matrixes to be calculated in the feature map tensor to be calculated are traversed, and returning to the step S205 if the feature map block matrixes to be calculated are not traversed; if the traversal is performed, executing step S210;
and step S210, performing accumulation operation on the multiple block calculation results, and adding an offset item to the result after the accumulation operation to obtain a convolution calculation result of the feature map tensor to be calculated.
In an optional embodiment of the present invention, determining a current memory access offset based on the initial memory access offset and the offset increment includes: obtaining an initial memory access offset and an offset increment; and calculating the current memory access offset according to the initial memory access offset and the offset increment.
In an optional embodiment of the present invention, when the current memory access offset is the initial memory access offset, calculating the current memory access offset according to the initial memory access offset and the offset increment includes: determining that the offset increment is zero, and taking the initial memory access offset as the current memory access offset.
In the embodiment of the present invention, the initial memory access offset is calculated based on the initial input channel number, the initial r-dimension coordinate of the convolution kernel and the initial s-dimension coordinate of the convolution kernel; the memory access offset of each round is then calculated, and the offset increment is calculated based on the memory access offset of the (iter+1)-th round and the memory access offset of the iter-th round.
Specifically, the memory access offset is calculated according to the memory access offset formula

C_iter = ⌊(iter × BK + k) / (R × S)⌋, R_iter = (iter × BK + k) mod (R × S), r_iter = ⌊R_iter / S⌋, s_iter = R_iter mod S,

F_iter = C_iter × C_stride + r_iter × H_stride + s_iter × W_stride,

wherein F_iter represents the memory access offset of the iter-th iteration, C_iter represents the input channel number of the iter-th iteration, k represents the coordinate initially read by the thread of the thread block, R, S represent the size of the convolution kernel, R_iter represents an intermediate quantity of the iter-th iteration, r_iter represents the r-dimension coordinate of the convolution kernel of the iter-th iteration, s_iter represents the s-dimension coordinate of the convolution kernel of the iter-th iteration, C_stride represents the weight corresponding to C_iter, H_stride represents the weight corresponding to r_iter, and W_stride represents the weight corresponding to s_iter;

the offset increment is calculated according to the offset increment formula

ΔF_iter = F_(iter+1) − F_iter,

wherein ΔF_iter represents the offset increment of the iter-th round, F_(iter+1) represents the memory access offset of the (iter+1)-th round, F_iter represents the memory access offset of the iter-th round, and BK represents one parameter of the size BM × BN × BK of the block matrix processed by the thread block in each round.
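The offset decomposition can be sketched in a few lines of Python. The assumption (not stated verbatim in the patent) is that in the iter-th round a thread with initial coordinate k addresses column iter × BK + k of the (C × R × S)-wide feature map matrix, and that the strides correspond to an NCHW layout (C_stride = H × W, H_stride = W, W_stride = 1); all concrete sizes used below are illustrative.

```python
# Sketch of the access-offset computation. Assumptions (not verbatim from the
# patent): column index of the iter-th round is it*BK + k, NCHW strides.

def access_offset(it, k, BK, R, S, C_stride, H_stride, W_stride):
    col = it * BK + k                 # column index of this round
    c, rem = divmod(col, R * S)       # C_iter and the intermediate quantity R_iter
    r, s = divmod(rem, S)             # r_iter and s_iter kernel coordinates
    return c * C_stride + r * H_stride + s * W_stride   # F_iter

def offset_increment(it, k, BK, R, S, C_stride, H_stride, W_stride):
    # Delta F_iter = F_(iter+1) - F_iter
    args = (k, BK, R, S, C_stride, H_stride, W_stride)
    return access_offset(it + 1, *args) - access_offset(it, *args)
```

For example, with R = S = 3 and BK = 32 the increments ΔF_iter repeat with period R × S = 9 in iter, matching the periodicity observation made in the text.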
In an optional embodiment of the present invention, referring to fig. 4, acquiring a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset includes:
step S401, determining a first position of the feature map block matrix to be calculated in the feature map tensor to be calculated and a second position of the weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset;
and step S402, acquiring the feature map block matrix to be calculated and the weight block matrix to be calculated according to the first position and the second position.
The following describes the convolution processing method of the feature map executed by the computer according to the present invention in a specific embodiment:
let the dimensions of the input feature map be (N, C, H, W), where N represents the batch size, C represents the number of input channels, and H, W represent the height and width of the feature map. The dimensions of the weight tensor are (K, C, R, S), where K represents the number of output channels, C represents the number of input channels, and R, S represent the size of the convolution kernel. When the method according to the embodiment of the present invention is used to implement a convolution operator, the specific scheme is as follows:
(1) calculating the initial memory access offset and the offset increment at the host end. When the convolution is converted into a matrix multiplication, the feature map matrix corresponding to the feature map tensor to be calculated has size (N × H × W) × (C × R × S), and the weight matrix corresponding to the weight tensor to be calculated has size (C × R × S) × K. Assuming that the size of the block matrix processed by the thread block in each round is BM × BN × BK, and the coordinate initially read by a thread of the thread block is k, the initial memory access offset is:
C_0 = ⌊k / (R × S)⌋, R_0 = k mod (R × S), r_0 = ⌊R_0 / S⌋, s_0 = R_0 mod S,

F_0 = C_0 × C_stride + r_0 × H_stride + s_0 × W_stride,

wherein F_0 represents the memory access offset of iteration 0 (i.e., the initial memory access offset), C_0 represents the input channel number of iteration 0, k represents the coordinate initially read by the thread of the thread block, R, S represent the size of the convolution kernel, R_0 represents an intermediate quantity of iteration 0, r_0 represents the r-dimension coordinate of the convolution kernel of iteration 0, s_0 represents the s-dimension coordinate of the convolution kernel of iteration 0, C_stride represents the weight corresponding to C_0, H_stride represents the weight corresponding to r_0, and W_stride represents the weight corresponding to s_0.
The memory access offset of the iter-th iteration is:

C_iter = ⌊(iter × BK + k) / (R × S)⌋, R_iter = (iter × BK + k) mod (R × S), r_iter = ⌊R_iter / S⌋, s_iter = R_iter mod S,

F_iter = C_iter × C_stride + r_iter × H_stride + s_iter × W_stride,

wherein F_iter represents the memory access offset of the iter-th iteration, C_iter represents the input channel number of the iter-th iteration, k represents the coordinate initially read by the thread of the thread block, R, S represent the size of the convolution kernel, R_iter represents an intermediate quantity of the iter-th iteration, r_iter represents the r-dimension coordinate of the convolution kernel of the iter-th iteration, s_iter represents the s-dimension coordinate of the convolution kernel of the iter-th iteration, C_stride represents the weight corresponding to C_iter, H_stride represents the weight corresponding to r_iter, and W_stride represents the weight corresponding to s_iter.
The offset increment is:

ΔF_iter = F_(iter+1) − F_iter,

wherein ΔF_iter represents the offset increment of the iter-th round, F_(iter+1) represents the memory access offset of the (iter+1)-th round, F_iter represents the memory access offset of the iter-th round, and BK represents one parameter of the size BM × BN × BK of the block matrix processed by the thread block in each round.
We find that the offset increments repeat with a period of R × S, where R, S represent the size of the convolution kernel. The initial memory access offsets read by the threads in the same thread block are related to the size of the block matrix: when the dimension of a block along K is BK, there are BK different initial memory access offsets. Thus there are BK initial memory access offsets and R × S offset increments in total, and when reading the initial memory access offsets, each thread reads its own initial memory access offset.
(2) packing the parameters of the convolution operator (including the weight tensor and the feature map tensor) together with the initial memory access offset and the offset increment calculated in step (1) into the parameters of the device kernel, storing them in the parameter buffer, and then calling the convolution kernel of the device to carry out the convolution calculation;
(3) each thread in the thread block first reads its corresponding initial memory access offset from the parameter buffer of the kernel, then reads, according to the initial memory access offset, one feature map block matrix to be calculated in the feature map tensor to be calculated and one weight block matrix to be calculated in the weight tensor to be calculated from the global memory into the register file, and then writes the feature map block matrix to be calculated and the weight block matrix to be calculated in the register file into the shared memory. Each thread then swaps the shared memory pointers for double-buffer optimization.
(4) each thread of the thread block reads the offset increment from the parameter buffer, updates the previous memory access offset (in the first iteration, the previous memory access offset is the initial memory access offset) with the offset increment to obtain the updated memory access offset, and reads, according to the updated memory access offset, the next feature map block matrix to be calculated in the feature map tensor to be calculated and the next weight block matrix to be calculated in the weight tensor to be calculated from the global memory into the register file;
(5) each thread in the thread block reads one row of the feature map block matrix to be calculated and one column of the weight block matrix to be calculated from the shared memory, then uses int4 TensorCore (4-bit integer tensor core) calculation instructions to execute the matrix multiplication operation on the read data; while the calculation is executed, the data of the next row of the feature map block matrix to be calculated and the data of the next column of the weight block matrix to be calculated are read from the shared memory into registers; step (5) is repeated until all the data of the feature map block matrix to be calculated and the weight block matrix to be calculated are consumed;
(6) each thread takes the next feature map block matrix to be calculated and the next weight block matrix to be calculated that were read from the global memory as the current feature map block matrix to be calculated and weight block matrix to be calculated, writes them into the shared memory, and then exchanges the shared memory pointers; returning to step (4) to start the next iteration until all feature map block matrices to be calculated in the feature map tensor to be calculated are traversed;
(7) each thread accumulates the results obtained in each iteration and writes the accumulator results into the shared memory; the order of the threads is then rearranged, and each thread reads the accumulator results from the shared memory into registers; finally, each thread executes the operation of adding the bias term (each thread is mapped to a specific channel, reads the bias term of that channel, and adds it to the result).
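The loop structure of steps (3)–(7) reduces, ignoring all GPU specifics, to a tiled matrix multiplication followed by a bias addition. A minimal single-threaded sketch (illustrative only; the real kernel distributes the work across threads and uses TensorCore instructions):

```python
# Much-simplified, single-threaded model of steps (3)-(7): tile the reduction
# dimension (C*R*S) by BK, accumulate one partial product per tile, then add
# the per-output-channel bias term. Only the loop structure is modeled; the
# GPU memory hierarchy, threads and TensorCore instructions are omitted.

def blocked_matmul_bias(fm, w, bias, BK):
    """fm: M x Kdim feature-map matrix, w: Kdim x N weight matrix,
    bias: N bias terms (one per output channel)."""
    M, Kdim, N = len(fm), len(fm[0]), len(w[0])
    acc = [[0] * N for _ in range(M)]              # accumulators, step (7)
    for k0 in range(0, Kdim, BK):                  # one iteration per BK tile
        for i in range(M):
            for j in range(N):
                acc[i][j] += sum(fm[i][kk] * w[kk][j]
                                 for kk in range(k0, min(k0 + BK, Kdim)))
    return [[acc[i][j] + bias[j] for j in range(N)] for i in range(M)]
```

The per-tile partial products are mathematically identical to the single full dot product, which is why the block-wise accumulation in step (7) reproduces the exact convolution result.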
Fig. 5 shows how the convolution operator is converted into a matrix multiplication operation when N is 1, C is 3, H and W are 3, K is 2, and R and S are 2. The feature map matrix has 4 rows and 12 columns (i.e., C × R × S): the figure only shows the convolution expanded into a matrix operation, each row of the feature map matrix corresponds to one convolution position, and the padding process is not shown, so there are 4 convolution positions and the number of rows is 4 instead of 9. The weight matrix has 12 rows (i.e., C × R × S) and 2 columns (i.e., K), and the output tensor in the figure corresponds to the convolution calculation result. The method of the embodiment of the invention does not explicitly store the expanded feature map matrix, but reads the block matrices of the feature map tensor from the global memory and stores them into the shared memory.
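A minimal sketch of the im2col expansion underlying Fig. 5, using the stated sizes (W = 3 and S = 2 follow from the stated 4 × 12 matrix shape; stride 1, no padding; the function name and data values are illustrative):

```python
# im2col sketch for the Fig. 5 example: N=1, C=3, H=W=3, R=S=2, stride 1,
# no padding -> 4 output positions (rows) and C*R*S = 12 columns.

def im2col(x, C, H, W, R, S):
    """x[c][h][w] -> one row per output position, C*R*S columns per row."""
    rows = []
    for oh in range(H - R + 1):
        for ow in range(W - S + 1):
            rows.append([x[c][oh + r][ow + s]
                         for c in range(C) for r in range(R) for s in range(S)])
    return rows

# Illustrative input: x[c][h][w] = c*9 + h*3 + w.
x = [[[c * 9 + h * 3 + w for w in range(3)] for h in range(3)] for c in range(3)]
m = im2col(x, C=3, H=3, W=3, R=2, S=2)
print(len(m), len(m[0]))   # prints: 4 12
```

As in the patent, a real kernel would not materialize this matrix; the sketch only makes the 4 × 12 shape of Fig. 5 concrete.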
The offset increments of the feature map tensor, which repeat with a period of R × S, are shown in fig. 6; the figure illustrates the case of a single thread.
Fig. 7 shows the calculation process of the matrix multiplication. In the block matrix multiplication diagram and the thread-block-level block diagram in fig. 7, the thread block reads a BK × BN small block matrix from the feature map tensor into the shared memory, and reads a BM × BK small block matrix from the weight tensor into the shared memory. Then the threads in the whole thread block cooperate to complete the multiplication of the BM × BK matrix and the BK × BN matrix. Specifically, each Warp in the thread block reads a small block of the BM × BK weight matrix and the corresponding small block of the BK × BN feature map matrix into local registers and performs the matrix multiplication operation, as shown in the Warp-level block diagram. Further, as shown in the TensorCore (on GPU) instruction diagram, each thread in the Warp (thread bundle) reads two even smaller block matrices from the matrices in the Warp-level block diagram, performs the matrix multiplication operation, and then accumulates the result into a local accumulation register.
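As a small numerical illustration of this tiling hierarchy, the number of thread-block tiles and of per-block rounds follows directly from the matrix and block sizes (all concrete sizes below are illustrative, not taken from the text):

```python
import math

def tiling_summary(M, N, Kdim, BM, BN, BK):
    """Number of BM x BN thread-block tiles covering the M x N output, and the
    number of BK-wide rounds each thread block iterates over the reduction
    dimension Kdim (= C*R*S)."""
    blocks = math.ceil(M / BM) * math.ceil(N / BN)   # output tiles
    rounds = math.ceil(Kdim / BK)                    # iterations per thread block
    return blocks, rounds
```

For instance, an output of 4096 × 128 with a reduction dimension of 576, tiled with BM = BN = 128 and BK = 32, yields 32 thread-block tiles of 18 rounds each.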
The convolution processing method provided by the embodiment of the invention can fully utilize the int4 TensorCore computing units of the GPU, exploit the computing capability of the GPU, and reduce the latency of model inference deployment on the GPU. By utilizing a high-performance memory access component, the method of the embodiment of the invention can achieve shared memory access on the GPU without bank conflicts; it does not need to calculate the memory access offsets through extra multiplication and addition operations, but stores the pre-calculated memory access offsets in the parameter buffer and reads them with a small memory access overhead, saving the overhead introduced by memory access offset calculation. The convolution operator template realized by the method of the embodiment of the invention outperforms current open source convolution operators and can exceed the official operator library, reaching for example 1.5 times the performance of the cudnn int8 TensorCore convolution operator, and in some cases 2 times. In addition, through automatic performance optimization of the convolution template, the embodiment of the invention can realize an int4 TensorCore convolution operator that obtains good performance in an actual production model; that is, the convolution algorithm provided by the embodiment of the invention can improve the inference deployment performance of an actual production model and the processing capability of the computer.
Example 3:
the embodiment of the present invention further provides a computer-executed convolution processing apparatus for a feature map, which is mainly used for executing the computer-executed convolution processing method for a feature map provided in the foregoing content of the embodiment of the present invention. The computer-executed convolution processing apparatus for a feature map provided in the embodiment of the present invention is specifically described below.
Fig. 8 is a schematic diagram of a convolution processing apparatus of a feature map executed by a computer according to an embodiment of the present invention. As shown in fig. 8, the convolution processing apparatus of the feature map mainly includes: a determination unit 10, an acquisition unit 20, a matrix multiplication unit 30, a traversal calculation unit 40, and an accumulation and bias-term addition unit 50, wherein:
the determining unit is used for determining the current memory access offset based on the initial memory access offset and the offset increment;
the acquisition unit is used for acquiring a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset;
the matrix multiplication unit is used for executing matrix multiplication operation on the feature map block matrix to be calculated and the weight block matrix to be calculated to obtain a block calculation result of the feature map tensor to be calculated;
the traversal calculation unit is used for updating the current memory access offset according to the offset increment, and executing the step of obtaining the feature map blocking matrix to be calculated in the feature map tensor to be calculated and the weight blocking matrix to be calculated in the weight tensor to be calculated according to the current memory access offset until all feature map blocking matrices to be calculated in the feature map tensor to be calculated are traversed, so that a plurality of blocking calculation results are obtained;
and the accumulation and bias-term addition unit is used for performing an accumulation operation on the multiple block calculation results and adding a bias term to the accumulated result to obtain the convolution calculation result of the feature map tensor to be calculated.
In the embodiment of the invention, the current memory access offset is determined based on the initial memory access offset and the offset increment; a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated are acquired according to the current memory access offset; further, a matrix multiplication operation is performed on the feature map block matrix to be calculated and the weight block matrix to be calculated to obtain a block calculation result of the feature map tensor to be calculated, and all feature map block matrices to be calculated in the feature map tensor to be calculated are traversed to obtain multiple block calculation results; an accumulation operation is performed on the multiple block calculation results, and a bias term is added to the accumulated result, thereby obtaining the convolution calculation result of the feature map tensor to be calculated. As can be seen from the above description, the computer-executed convolution processing method for a feature map according to the embodiment of the present invention determines the current memory access offset from the initial memory access offset and the offset increment, and does not need to calculate the memory access offset through extra multiplication and addition operations, thereby saving the overhead introduced by memory access offset calculation and improving computer performance. The performance of the convolution operator template realized based on the method of the embodiment of the invention can reach the optimum, which solves the technical problem that conventional convolution processing algorithms for feature maps cannot make the performance of the convolution operator template optimal.
Optionally, the determining unit is further configured to: obtaining an initial memory access offset and an offset increment; and calculating the current memory access offset according to the initial memory access offset and the offset increment.
Optionally, the determining unit is further configured to: determining an offset increment to be zero; and taking the initial memory access offset as the current memory access offset.
Optionally, the apparatus is further configured to: calculate the initial memory access offset based on the initial input channel number, the initial r-dimension coordinate of the convolution kernel and the initial s-dimension coordinate of the convolution kernel; and calculate the memory access offset of each round, and calculate the offset increment based on the memory access offset of the (iter+1)-th round and the memory access offset of the iter-th round.
Optionally, the memory access offset is calculated according to the memory access offset formula

C_iter = ⌊(iter × BK + k) / (R × S)⌋, R_iter = (iter × BK + k) mod (R × S), r_iter = ⌊R_iter / S⌋, s_iter = R_iter mod S,

F_iter = C_iter × C_stride + r_iter × H_stride + s_iter × W_stride,

wherein F_iter represents the memory access offset of the iter-th iteration, C_iter represents the input channel number of the iter-th iteration, k represents the coordinate initially read by the thread of the thread block, R, S represent the size of the convolution kernel, R_iter represents an intermediate quantity of the iter-th iteration, r_iter represents the r-dimension coordinate of the convolution kernel of the iter-th iteration, s_iter represents the s-dimension coordinate of the convolution kernel of the iter-th iteration, C_stride represents the weight corresponding to C_iter, H_stride represents the weight corresponding to r_iter, and W_stride represents the weight corresponding to s_iter; the offset increment is calculated according to the offset increment formula

ΔF_iter = F_(iter+1) − F_iter,

wherein ΔF_iter represents the offset increment of the iter-th round, F_(iter+1) represents the memory access offset of the (iter+1)-th round, F_iter represents the memory access offset of the iter-th round, and BK represents one parameter of the size BM × BN × BK of the block matrix processed by the thread block in each round.
Optionally, the obtaining unit is further configured to: determine a first position of the feature map block matrix to be calculated in the feature map tensor to be calculated and a second position of the weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset; and acquire the feature map block matrix to be calculated and the weight block matrix to be calculated according to the first position and the second position.
Optionally, the traversal calculation unit is further configured to: judge whether a feature map block matrix that has not participated in the calculation exists in the feature map tensor to be calculated; and if not, determine that all feature map block matrices to be calculated in the feature map tensor to be calculated have been traversed.
Optionally, the initial memory access offset and the offset increment are stored in the parameter buffer of the device kernel, and the offset increments repeat with a period of R × S, where R, S represent the size of the convolution kernel.
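A host-side precomputation of the BK initial access offsets and the R × S offset increments stored in the parameter buffer might look as follows; the column index iter × BK + k and the NCHW strides (C_stride = H × W, H_stride = W, W_stride = 1) are illustrative assumptions, not taken verbatim from the patent.

```python
# Host-side precomputation sketch: BK initial access offsets (one per thread
# coordinate k) and one R*S-long period of offset increments (shown for k = 0,
# since for a fixed k the increments repeat with period R*S in the round index).

def precompute_offsets(BK, R, S, C_stride, H_stride, W_stride):
    def F(col):
        c, rem = divmod(col, R * S)   # channel index and intermediate quantity
        r, s = divmod(rem, S)         # kernel r- and s-coordinates
        return c * C_stride + r * H_stride + s * W_stride

    initial = [F(k) for k in range(BK)]                               # BK offsets
    increments = [F((i + 1) * BK) - F(i * BK) for i in range(R * S)]  # one period
    return initial, increments
```

The device kernel then only needs table lookups (one initial offset per thread, then cyclic reads of the increments), replacing the per-round division/modulo arithmetic described in the formulas above.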
The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the method embodiments, and for the sake of brief description, reference may be made to the corresponding contents in the method embodiments without reference to the device embodiments.
In another embodiment of the present invention, a computer storage medium is also provided, on which a computer program is stored, which when executed by a computer performs the steps of the method of the above-described method embodiment.
In another embodiment of the present invention, a computer program is also provided, which may be stored on a storage medium in the cloud or locally. When executed by a computer or a processor, the computer program performs the respective steps of the method of an embodiment of the invention and implements the respective modules in the computer-executed convolution processing apparatus of a feature map according to an embodiment of the invention.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; a mechanical or electrical connection; a direct connection or an indirect connection through intervening media; or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific cases.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present invention, which are used for illustrating the technical solutions of the present invention and not for limiting the same, and the protection scope of the present invention is not limited thereto, although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A convolution processing method of a feature map executed by a computer, comprising:
determining the current memory access offset based on the initial memory access offset and the offset increment;
acquiring a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset;
performing matrix multiplication operation on the feature map block matrix to be calculated and the weight block matrix to be calculated to obtain a block calculation result of the feature map tensor to be calculated;
updating the current memory access offset according to the offset increment, and returning to execute the step of obtaining the feature map blocking matrix to be calculated in the feature map tensor to be calculated and the weight blocking matrix to be calculated in the weight tensor to be calculated according to the current memory access offset to obtain a plurality of blocking calculation results;
and performing accumulation operation on the multiple block calculation results, and adding an offset term to the result after the accumulation operation to obtain a convolution calculation result of the feature map tensor to be calculated.
2. The method of claim 1, wherein determining a current memory access offset based on an initial memory access offset and an offset increment comprises:
acquiring the initial memory access offset and the offset increment;
and calculating the current memory access offset according to the initial memory access offset and the offset increment.
3. The method of claim 2, wherein when the current memory access offset is the initial memory access offset, the calculating the current memory access offset according to the initial memory access offset and the offset increment comprises:
determining that the offset increment is zero;
and taking the initial memory access offset as the current memory access offset.
4. The method according to any one of claims 1-3, further comprising:
and calculating the initial memory access offset based on the initial input channel number, the initial r-dimensional coordinate of the convolution kernel and the initial s-dimensional coordinate of the convolution kernel.
5. The method according to any one of claims 1-4, further comprising:
and calculating the offset increment according to the dimensionality of the convolution kernel, the memory access offset of the (iter+1)-th round and the memory access offset of the iter-th round, wherein iter is a positive integer.
6. The method of claim 4 or 5, wherein the memory access offset is calculated according to the memory access offset formula
C_iter = ⌊(iter × BK + k) / (R × S)⌋, R_iter = (iter × BK + k) mod (R × S), r_iter = ⌊R_iter / S⌋, s_iter = R_iter mod S,
F_iter = C_iter × C_stride + r_iter × H_stride + s_iter × W_stride,
wherein F_iter represents the memory access offset of the iter-th iteration, C_iter represents the input channel number of the iter-th iteration, k represents the coordinate initially read by the thread of the thread block, R, S represent the size of the convolution kernel, R_iter represents an intermediate quantity of the iter-th iteration, r_iter represents the r-dimension coordinate of the convolution kernel of the iter-th iteration, s_iter represents the s-dimension coordinate of the convolution kernel of the iter-th iteration, C_stride represents the weight corresponding to C_iter, H_stride represents the weight corresponding to r_iter, and W_stride represents the weight corresponding to s_iter;
the offset increment is calculated according to the offset increment formula ΔF_iter = F_(iter+1) − F_iter, wherein ΔF_iter represents the offset increment of the iter-th round, F_(iter+1) represents the memory access offset of the (iter+1)-th round, F_iter represents the memory access offset of the iter-th round, and BK represents one parameter of the size BM × BN × BK of the block matrix processed by the thread block in each round.
7. The method according to any one of claims 1 to 6, wherein acquiring a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset comprises:
determining a first position of the feature map blocking matrix to be calculated in the feature map tensor to be calculated and a second position of the weight blocking matrix to be calculated in the weight tensor to be calculated according to the current memory access offset;
and acquiring the feature map block matrix to be calculated and the weight block matrix to be calculated according to the first position and the second position.
8. The method according to any one of claims 1 to 7, wherein the plurality of block calculation results are obtained by traversing all feature map block matrices to be calculated in the feature map tensor to be calculated.
9. The method of any one of claims 1-8, wherein the initial memory access offset and the offset increment are stored in a parameter buffer of a device kernel, and the offset increments repeat with a period of R × S, wherein R, S represent the size of the convolution kernel.
10. An apparatus for convolution processing of a feature map executed by a computer, comprising:
the determining unit is used for determining the current memory access offset based on the initial memory access offset and the offset increment;
the acquisition unit is used for acquiring a feature map block matrix to be calculated in the feature map tensor to be calculated and a weight block matrix to be calculated in the weight tensor to be calculated according to the current memory access offset;
the matrix multiplication unit is used for executing matrix multiplication operation on the feature map blocking matrix to be calculated and the weight blocking matrix to be calculated to obtain a blocking calculation result of the feature map tensor to be calculated;
the traversal calculation unit is used for updating the current memory access offset according to the offset increment and returning to the step of acquiring the feature map blocking matrix to be calculated in the feature map tensor to be calculated and the weight blocking matrix to be calculated in the weight tensor to be calculated according to the current memory access offset to obtain a plurality of blocking calculation results;
and the accumulation and bias adding unit is used for performing an accumulation operation on the plurality of blocking calculation results and adding a bias term to the accumulated result to obtain the convolution calculation result of the feature map tensor to be calculated.
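The unit structure of claim 10 (fetch block matrices, multiply, traverse and accumulate, then add a bias term) can be sketched in NumPy. This is an illustrative realization under assumed tensor shapes, not the patented device implementation; the function name and layout are assumptions:

```python
import numpy as np

def conv2d_blocked(x, w, b):
    """Convolution as accumulated block matrix multiplies (illustrative sketch).

    x: (C, H, W) feature map tensor, w: (K, C, R, S) weight tensor, b: (K,) bias.
    Each (r, s) pair selects one feature-map blocking matrix and one weight
    blocking matrix; their products are accumulated, and the bias term is
    added last.
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1          # valid-convolution output size
    acc = np.zeros((K, OH * OW))
    for r in range(R):                      # traverse all blocking matrices;
        for s in range(S):                  # the access offset advances one step each time
            xb = x[:, r:r + OH, s:s + OW].reshape(C, -1)  # feature-map blocking matrix
            wb = w[:, :, r, s]                            # weight blocking matrix
            acc += wb @ xb                                # block matmul + accumulation
    out = acc + b[:, None]                                # add bias term
    return out.reshape(K, OH, OW)
```

On hardware, the slice `x[:, r:r+OH, s:s+OW]` would instead be addressed through the precomputed memory access offsets of claim 9.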
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of the preceding claims 1 to 9 are implemented when the computer program is executed by the processor.
12. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to perform the steps of the method of any of the preceding claims 1 to 9.
CN202110616970.4A 2021-06-02 2021-06-02 Convolution processing method and device of feature graph executed by computer and electronic equipment Pending CN113468469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616970.4A CN113468469A (en) 2021-06-02 2021-06-02 Convolution processing method and device of feature graph executed by computer and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110616970.4A CN113468469A (en) 2021-06-02 2021-06-02 Convolution processing method and device of feature graph executed by computer and electronic equipment

Publications (1)

Publication Number Publication Date
CN113468469A true CN113468469A (en) 2021-10-01

Family

ID=77872182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616970.4A Pending CN113468469A (en) 2021-06-02 2021-06-02 Convolution processing method and device of feature graph executed by computer and electronic equipment

Country Status (1)

Country Link
CN (1) CN113468469A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023125838A1 (en) * 2021-12-30 2023-07-06 深圳云天励飞技术股份有限公司 Data processing method and apparatus, terminal device, and computer readable storage medium

Similar Documents

Publication Publication Date Title
JP4339381B2 (en) Shared memory multiprocessor system and information processing method thereof
CN108573305B (en) Data processing method, equipment and device
CN111767508B (en) Method, device, medium and equipment for computing tensor data by computer
EP2997539B1 (en) Method and device for processing input image data
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
CN112991142B (en) Matrix operation method, device, equipment and storage medium for image data
CN111274999B (en) Data processing method, image processing device and electronic equipment
CN111476718A (en) Image amplification method and device, storage medium and terminal equipment
CN111105017A (en) Neural network quantization method and device and electronic equipment
WO2021088569A1 (en) Convolution method and device, electronic device
CN112825199B (en) Collision detection method, device, equipment and storage medium
CN113468469A (en) Convolution processing method and device of feature graph executed by computer and electronic equipment
US20200327638A1 (en) Connected component detection method, circuit, device and computer-readable storage medium
CN111192302A (en) Feature matching method based on motion smoothness and RANSAC algorithm
EP2453404A1 (en) Computing apparatus and method using X-Y stack memory
CN113112084A (en) Training plane rear body research and development flow optimization method and device
JP2022074442A (en) Arithmetic device and arithmetic method
CN113490954A (en) Neural network, operation method, and program
CN111797972A (en) Method, device and electronic system for processing data by using convolutional neural network
CN115063299B (en) Image preprocessing method and device, electronic equipment and storage medium
CN116055003B (en) Data optimal transmission method, device, computer equipment and storage medium
WO2024100709A1 (en) Optimization device, optimization method, and program
CN117710235A (en) Image target enhancement method, device, computer equipment and storage medium
CN115705391A (en) Program, data processing method, data processing apparatus, and computer readable medium
CN115965799A (en) Image recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination