CN111797985B - Convolution operation memory access optimization method based on GPU - Google Patents

Convolution operation memory access optimization method based on GPU

Info

Publication number
CN111797985B
CN111797985B
Authority
CN
China
Prior art keywords
data
convolution
thread
row
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010710031.1A
Other languages
Chinese (zh)
Other versions
CN111797985A (en)
Inventor
张伟哲 (Zhang Weizhe)
鲁刚钊 (Lu Gangzhao)
王峥 (Wang Zheng)
李克勤 (Li Keqin)
孙广中 (Sun Guangzhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202010710031.1A priority Critical patent/CN111797985B/en
Publication of CN111797985A publication Critical patent/CN111797985A/en
Application granted granted Critical
Publication of CN111797985B publication Critical patent/CN111797985B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A GPU-based memory access optimization method for convolution operations, relating to memory access optimization techniques for convolution. The invention addresses the high memory-access overhead of convolution operations in the prior art. The key technical points are: loading the convolution kernel data into shared memory; dividing the convolution output into sub-blocks of 32 columns, obtaining a plurality of sub-blocks containing 32 columns of data and 1 sub-block containing fewer than 32 columns of data; each thread calculating the index of the first data element it requires; each thread acquiring the remaining required input data, starting from that first index, through a column reuse algorithm and passing the acquired input data to a row reuse algorithm; calculating the output results through the row reuse algorithm and storing them in the register array sum; writing sum into global memory; and calculating the remaining data to be calculated in the convolution output. The method is used for memory access optimization of convolution operations in the fields of image processing, video processing and machine learning.

Description

Convolution operation memory access optimization method based on GPU
Technical Field
The invention relates to a convolution operation memory access optimization technology, in particular to a convolution operation memory access optimization method based on a GPU.
Background
Convolution has become a core computational pattern in the fields of image processing, video processing and machine learning. 2D convolution is widely used in image filtering and frame differencing, depth-wise convolution is common in mobile neural networks, and multi-channel 2D convolution is a core operation in neural networks. However, convolution operations consume a large amount of computational and memory resources and can take up 90% of the execution time in image processing and machine learning. Many optimization methods for convolution operations have been proposed, with GEMM (matrix multiplication), FFT and Winograd based methods being the most widely used. However, these methods require converting the input and output data into matrices of a specific layout before the operation, which increases the memory-access cost. There is therefore a need for an optimization technique that reduces memory accesses to address the deficiencies of the prior art.
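As an illustration of that conversion cost, the following is a minimal im2col-style expansion (single channel, stride 1, no padding). It is a generic background sketch, not code from the patent or from any particular library; all identifiers are illustrative.

```cuda
// Minimal im2col-style expansion (single channel, stride 1, no padding).
// Each output position receives its own copy of a K x K input patch, so the
// input is duplicated roughly K*K times before the matrix multiply -- this
// duplication is the extra memory traffic that GEMM-based convolution pays.
void im2colExpand(const float* in, float* colBuf,
                  int inW, int outW, int outH, int K) {
    for (int row = 0; row < outH; ++row)
        for (int col = 0; col < outW; ++col)
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    // one row of colBuf per output position, one entry per kernel element
                    colBuf[((row * outW + col) * K + i) * K + j] =
                        in[(row + i) * inW + (col + j)];
}
```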
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to solve the problems that the access and storage cost of the convolution operation in the prior art is high, and the access and storage times of the convolution are large, so that the performance of the convolution operation is reduced.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a convolution operation memory access optimization method based on a GPU, which comprises the following steps: loading the convolution kernel data into a shared memory; dividing the convolution output into subblocks by 32 columns to obtain a plurality of subblocks containing 32 columns of data and 1 subblock less than 32 columns of data; n threads for processing the sub-blocks are set; each thread calculates the index of the first data required by the thread; each thread acquires the residual required input data from the index of the first data through a column reuse algorithm and transmits the acquired input data to a row reuse algorithm; calculating an output result through a row reuse algorithm and storing the output result in register data sum; writing sum into the global memory; and calculating the rest data to be calculated in the convolution output.
Preferably, the convolution kernel is of arbitrary size.
Preferably, the convolution operation is a 2D convolution, a depth-wise convolution or a multi-channel 2D convolution.
Preferably, the process of the column reuse algorithm is as follows: each thread loads, from global memory, the first and last data elements it requires; each thread acquires the required third data element from the thread at a distance of 2; each thread acquires the required second and fourth data elements from the thread at a distance of 1.
Preferably, the method of the present invention further comprises: after each thread has finished loading its required first and last data elements, combining them into one 64-bit value and storing it in a first variable array, with the required last data element in the upper 32 bits and the required first data element in the lower 32 bits; right-shifting by 32 bits the variable values of those threads, among all the threads, that need to provide their upper 32-bit data to other threads, right-shifting the other threads by 0 bits, and splitting the resulting 64-bit variable array, with the upper 32 bits used as the fourth data element and the lower 32 bits used as the second data element.
Preferably, the process of the row reuse algorithm is: each time a row of the input is loaded, all outputs that can be computed from that row are computed using it.
Preferably, each thread fetches the required data from a thread with interval 1 or 2 via a CUDA shuffle instruction.
Preferably, all outputs that can be computed from the row are determined by the calculation formula of the convolution algorithm.
Preferably, the convolution kernel size is 3 or 5.
Preferably, the remaining data includes edge data and unprocessed internal data.
The beneficial effects of the invention are as follows: the number of data copies performed by convolution can be reduced, improving the performance of convolution operations. The invention reduces the memory-access overhead of convolution, and by reducing the number of memory accesses it greatly improves convolution performance. One application of the invention is memory access optimization of convolution operations in the fields of image processing, video processing and machine learning; since convolution is a core computational pattern in these fields, applying the method can greatly improve their speed and efficiency. The invention can also be applied to other technical fields that use convolution as a core computational pattern.
In one embodiment, a significant speed-up is obtained over other algorithms and the number of memory-block transfers is greatly reduced.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of a 2D convolution operation;
FIG. 2 is a schematic diagram of a column reuse algorithm according to one embodiment of the present invention; FIG. 2 (a) shows a loading method in the prior art; FIG. 2 (b) is a diagram illustrating each thread obtaining a third data according to an embodiment of the present invention; FIG. 2 (c) is a diagram illustrating the acquisition of second and fourth data by each thread in one embodiment of the invention;
FIG. 3 is a diagram illustrating conversion of a dynamic index into a static index according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a row reuse algorithm according to one embodiment of the present invention; wherein FIG. 4 (a) is a schematic illustration of an input; FIG. 4 (b) is a schematic diagram of a convolution kernel; FIG. 4 (c) is a schematic of the output;
FIG. 5 is a diagram illustrating a thread to output mapping relationship, according to an embodiment;
FIG. 6 is a flow chart of a method of one embodiment of the present invention;
FIG. 7 is a performance comparison of 2D convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 7(a) is a speed-up comparison for 2D 3×3 convolution and FIG. 7(b) is a speed-up comparison for 2D 5×5 convolution;
FIG. 8 is a performance comparison of depth-wise convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 8(a) is a speed-up comparison for depth-wise 3×3 convolution and FIG. 8(b) is a speed-up comparison for depth-wise 5×5 convolution.
FIG. 9 is a performance comparison of multi-channel 2D convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 9(a) is a speed-up comparison for multi-channel 2D convolution with a convolution depth of 1 and FIG. 9(b) is a speed-up comparison for multi-channel 2D convolution with a convolution depth of 3.
FIG. 10 is a memory-access performance comparison of depth-wise convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 10(a) compares memory-access throughput and FIG. 10(b) compares maximum memory-access bandwidth.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be considered a part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as exemplary only and not as limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
It is an object of the present invention to improve the performance of convolution operations by reducing the number of accesses to memory for convolution. The design concept of the present invention is illustrated by an example.
< example >
Fig. 1 is a schematic diagram of this example, which reduces the number of memory accesses in the convolution operation mainly through two algorithms. Fig. 1 shows a simple 2D convolution with an input picture of size 6×11, a convolution kernel of size 5×5 and an output of size 2×6, with each CUDA thread computing one column of outputs.
As can be seen from the figure, there are 4 columns of repeated data between the input data processed by thread 0 and thread 1, and 4 rows of repeated data in the input data loaded by thread 6. For these two forms of repeated data, the invention provides two optimization methods: (1) a column reuse algorithm, in which each thread loads only the first and last columns it needs and then uses CUDA shuffle instructions to obtain the remaining columns from other threads; (2) a row reuse algorithm, in which each thread loads each repeated row only once and then computes multiple outputs by multiplying each row of data with multiple rows of the convolution kernel. A serious performance problem arises when a dynamically indexed array is used inside a shuffle instruction: CUDA places such arrays in the GPU's local memory, whose access latency is the same as that of global memory, causing severe performance degradation. Pack and unpack operations are used to solve this performance problem.
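For reference, a minimal sketch of the baseline that Fig. 1 depicts, with each thread computing one column of outputs by direct convolution, could look as follows. This is an illustrative sketch, not the patent's kernel, and all parameter names are assumptions.

```cuda
// Direct 2D convolution baseline: one thread per output column, as in Fig. 1.
// Every thread re-reads all K*K inputs for every output element, which is the
// redundant loading that the column reuse and row reuse algorithms remove.
__global__ void directConv2d(const float* in, const float* filt, float* out,
                             int inW, int outW, int outH, int K) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column owned by this thread
    if (col >= outW) return;
    for (int row = 0; row < outH; ++row) {             // walk down the output column
        float sum = 0.f;
        for (int i = 0; i < K; ++i)                    // kernel height
            for (int j = 0; j < K; ++j)                // kernel width
                sum += in[(row + i) * inW + (col + j)] * filt[i * K + j];
        out[row * outW + col] = sum;
    }
}
```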
1. Column reuse algorithm:
Taking thread 0 and thread 1 in fig. 1 as an example, the column reuse algorithm is illustrated in fig. 2. Fig. 2(a) shows how the direct convolution algorithm loads data. In the first step, each thread loads the first data element it requires from global memory. In the second step, each thread loads the second data element it requires from global memory, and so on until each thread has loaded 5 data elements. It can be seen that part of the data loaded by each thread in the second step was already loaded in the first step, so data is loaded repeatedly. To solve this problem, the invention proposes a column reuse algorithm, as shown in fig. 2(b) and (c).
In the first step, each thread loads its first data element from global memory. In the second step, each thread loads its last data element. In the third step, each thread acquires the data it needs from the neighbour at a distance of 2 using the shuffle instruction __shfl_xor_sync(0xFFFFFFFF, iTemp[2], 2). Thread 0 and thread 1 obtain their required third data element from thread 2 and thread 3, respectively, and provide data to thread 2 and thread 3. At the same time, thread 2 and thread 3 obtain their required third data element from thread 0 and thread 1, respectively, and provide data to thread 0 and thread 1. In the fourth and fifth steps, each thread acquires the required data from the neighbours at a distance of 1, in a manner similar to the third step.
iTemp[i] in the shuffle instruction is a dynamically indexed array; the CUDA compiler cannot determine its address at compile time, so the variable is placed in local memory. However, the access latency of local memory is the same as that of global memory, which degrades the program's performance. The dynamic-indexing problem is solved with Algorithm 1 and Algorithm 2, where Algorithm 1 handles the third step of fig. 2(b) and Algorithm 2 handles the fourth and fifth steps of fig. 2(c).
(Algorithm 1 and Algorithm 2 appear as image listings in the original publication.)
The following uses Algorithm 1 and fig. 3 to illustrate how the dynamic index is eliminated. Fig. 3 shows the process of lines 4-7 of Algorithm 1. Each thread first loads the first and last data elements it must process into registers (lines 2-3). The two 32-bit values are then merged into one 64-bit value and stored in the variable exchange (line 4), with iTemp[4] in the upper 32 bits and iTemp[0] in the lower 32 bits. Thread 2 and thread 3 in fig. 3 each need to provide their first data element, i.e. the lower 32 bits of the exchange variable, so the exchange variable in thread 2 and thread 3 is shifted right by 0 bits (threads that need to provide their last data element instead shift right by 32 bits). After shifting, the exchange variable is split, with the upper 32 bits stored in iTemp[2] and the lower 32 bits stored in iTemp[1] (line 7). At this point, iTemp[1] holds the data that each thread needs to provide. Finally, the shuffle instruction is used to exchange iTemp[1] between threads (line 8).
The third step in fig. 2(b) and the fourth and fifth steps in fig. 2(c) follow similar processes, so Algorithm 1 can be modified slightly to obtain Algorithm 2. In fig. 2(c), adjacent threads need to exchange data, so in Algorithm 2 the per-thread shift amount (line 3) and the exchange pattern (line 6) are modified.
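Since Algorithms 1 and 2 are published only as images, the following is an interpretive sketch of the pack-shift-split-shuffle exchange for the distance-2 case (the third step of fig. 2(b), for a 5-wide kernel). The lane arithmetic and all identifiers are assumptions derived from the description of fig. 3, not the patent's own listing.

```cuda
// Interpretive sketch of the Algorithm 1 exchange: each lane has already loaded
// its first and last columns; the third column is obtained from the lane two
// away without any dynamically indexed array.
__device__ float exchangeThirdColumn(float first, float last) {
    int lane = threadIdx.x & 31;
    // Pack the two 32-bit values into one 64-bit word: last in the high half,
    // first in the low half (the "exchange" variable of fig. 3).
    unsigned long long packed =
        ((unsigned long long)__float_as_uint(last) << 32) | __float_as_uint(first);
    // Lanes that must hand over their last column shift it down into the low
    // half (shift by 32); lanes that hand over their first column shift by 0.
    int shift = (lane & 2) ? 0 : 32;
    unsigned int toSend = (unsigned int)(packed >> shift);    // static, stays in a register
    // Exchange with the lane whose id differs in bit 1 (distance 2).
    unsigned int received = __shfl_xor_sync(0xFFFFFFFFu, toSend, 2);
    return __uint_as_float(received);                          // the third column this lane needs
}
```

The distance-1 exchanges of the fourth and fifth steps would follow the same pattern with a different shift selection and a laneMask of 1, as Algorithm 2 does according to the text above.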
2. Row reuse algorithm
FIG. 4 illustrates the process of convolving an input with a convolution kernel along the height direction to generate a column of outputs, where row denotes a row of data. The direct convolution algorithm computes:
out_1 = row_i1·row_f1 + row_i2·row_f2 + row_i3·row_f3
out_2 = row_i2·row_f1 + row_i3·row_f2 + row_i4·row_f3
out_3 = row_i3·row_f1 + row_i4·row_f2 + row_i5·row_f3
From the above equations, row_i2 and row_i4 are each loaded twice and row_i3 is loaded three times. To reduce this repeated loading, after each input row is loaded it is used to compute as many different outputs as possible. For example, row_i1 is used to compute out_1, and row_i2 is used to compute out_1 and out_2. Reorganising the computation in this way means that each input row needs to be loaded only once, as follows:
load row_i1:  out_1 = row_i1·row_f1
load row_i2:  out_1 = out_1 + row_i2·row_f2
              out_2 = row_i2·row_f1
load row_i3:  out_1 = out_1 + row_i3·row_f3
              out_2 = out_2 + row_i3·row_f2
              out_3 = row_i3·row_f1
load row_i4:  out_2 = out_2 + row_i4·row_f3
              out_3 = out_3 + row_i4·row_f2
load row_i5:  out_3 = out_3 + row_i5·row_f3
As the above formulas show, each input row needs to be loaded only once. A general way of organising this load-and-compute pattern is shown in Algorithm 3.
(Algorithm 3 appears as an image listing in the original publication.)
In Algorithm 3, row denotes a line of data loaded from the input, index denotes that line's row number within the input, and filter denotes the set of rows of the convolution kernel. Lines 3-5 of the algorithm process the first F_H - 1 rows of the input, for which the number of outputs to be computed is less than F_H. Rows that are covered by exactly F_H outputs are processed in lines 3-11 of the algorithm. Lines 3-17 of the algorithm process the last F_H - 1 rows of the input.
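Since Algorithm 3 is likewise published only as an image, the following minimal sketch reproduces the accumulation pattern of the formulas above with the width dimension reduced to a single element (in[0..4] stand for row_i1..row_i5 and f[0..2] for row_f1..row_f3; all identifiers are illustrative, not the patent's listing).

```cuda
// Row reuse with a 3-row filter: each input row is read exactly once and
// immediately contributes to every output row it covers.
__device__ void rowReuseExample(const float in[5], const float f[3], float out[3]) {
    out[0] = out[1] = out[2] = 0.f;
    for (int r = 0; r < 5; ++r) {            // load each input row exactly once
        float row = in[r];
        // update every output that this input row contributes to:
        // output o uses filter row (r - o) for input row r
        for (int o = 0; o < 3; ++o) {
            int k = r - o;
            if (k >= 0 && k < 3) out[o] += row * f[k];
        }
    }
}
```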
3. Final access optimization algorithm
Taking 2D convolution as an example, this section explains how the column reuse and row reuse algorithms are applied to the convolution operation. In this implementation, the output is first partitioned into sub-blocks, each containing 32 columns of data; the last sub-block may contain fewer than 32 columns. Each CUDA thread block processes one or more sub-blocks, and each warp processes one sub-block. The partitioning is shown in fig. 5.
Edge data and internal data are processed in different ways. In fig. 5, edge data is drawn as shaded squares and internal data as dotted squares. Assume each warp contains 4 threads. The internal data is split into two sub-blocks: sub-block 0 contains 4 columns, which can be handled by exactly one warp; sub-block 1 contains only two columns, so to make full use of the threads these two columns are divided evenly into 4 parts and distributed to the 4 threads. Algorithm 4 shows the overall procedure.
(Algorithm 4 appears as an image listing in the original publication.)
In Algorithm 4, the convolution kernel data is first loaded into shared memory (lines 1-2). The sub-blocks containing exactly 32 columns are then processed in lines 3-13, and the last sub-block is processed in lines 14-17. Each thread computes one column of outputs as follows: first, each thread computes the address of the first input data element it needs (lines 4-6); then, each thread acquires the remaining required input data with Algorithm 1 and Algorithm 2 (line 8); next, each thread passes the acquired input data to Algorithm 3, which computes several outputs and stores the results in the register array sum (line 9); finally, the sum array is written to global memory (line 12).
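Because the Algorithm 4 listing is only available as an image, the following structural outline is an interpretation of how such a kernel might be organised; it combines the column reuse load with the row reuse accumulation, leaves the two reuse steps as comments, and uses illustrative identifiers throughout.

```cuda
// Structural outline (interpretation, not the patent's code): one warp per
// 32-column sub-block, one thread per output column.
__global__ void conv2dReuseOutline(const float* in, const float* filt, float* out,
                                   int inW, int inH, int outW, int outH, int K) {
    extern __shared__ float sFilt[];
    // lines 1-2: stage the convolution kernel in shared memory
    for (int i = threadIdx.x; i < K * K; i += blockDim.x) sFilt[i] = filt[i];
    __syncthreads();

    int lane = threadIdx.x & 31;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int col  = warp * 32 + lane;          // output column owned by this thread
    if (col >= outW) return;              // the ragged last sub-block and the edge
                                          // data are handled separately (lines 14-17)
    for (int r = 0; r < inH; ++r) {
        // lines 4-8: column reuse -- load the first/last element of input row r
        //            and obtain the middle elements from neighbouring lanes
        //            (Algorithms 1 and 2)
        // line 9:    row reuse -- update every output row that input row r
        //            contributes to, accumulating into the register array sum
        //            (Algorithm 3)
    }
    // line 12: write the accumulated sum array back to global memory
}
```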
< embodiment >
An embodiment of the present invention is described below on the basis of the foregoing example.
For the purpose of memory access optimization, one embodiment of the present invention is shown in fig. 6, and includes:
s1: and loading the convolution kernel data into the shared memory.
S2: dividing the convolution output into sub-blocks of 32 columns, obtaining a plurality of sub-blocks containing 32 columns of data and 1 sub-block containing fewer than 32 columns of data, i.e. the partitioning shown in fig. 5.
S3: setting N threads for processing the sub-blocks; each thread computes the index of the first data element it needs. The index of the first data is that of the first (leftmost) data element required by each thread as shown in fig. 2; the other required data can be located by offsetting from this first index.
S4: each thread acquires the remaining required input data, starting from the index of the first data, through the column reuse algorithm, and passes the acquired input data to the row reuse algorithm. Acquiring the remaining required input data corresponds to fetching data from the neighbouring threads at a distance of 1 or 2, as in Algorithm 1 and Algorithm 2.
One embodiment of the column reuse algorithm is: each thread loads the first and last data elements it requires from global memory; each thread acquires the required third data element from the thread at a distance of 2; each thread acquires the required second and fourth data elements from the thread at a distance of 1. Each thread may fetch the required data from a thread at a distance of 1 or 2 via a CUDA shuffle instruction.
Further, in order to solve the dynamic-indexing problem, the column reuse algorithm exchanges data as follows: after each thread has finished loading its required first and last data elements, they are combined into one 64-bit value and stored in a first variable array, with the required last data element in the upper 32 bits and the required first data element in the lower 32 bits; the variable values of those threads, among all the threads, that need to provide their upper 32-bit data to other threads are right-shifted by 32 bits, the other threads are right-shifted by 0 bits, and the resulting 64-bit variable array is split, with the upper 32 bits used as the fourth data element and the lower 32 bits used as the second data element.
It should be noted that the above embodiment describes the case where the convolution kernel size is 5; those skilled in the art can derive the corresponding implementation for a kernel size of 3. Likewise, the 2D convolution procedure is similar to depth-wise convolution and multi-channel 2D convolution, and those skilled in the art can derive the corresponding adaptations for depth-wise and multi-channel 2D convolution from the examples of the present invention.
S5: calculating the output results through the row reuse algorithm and storing them in the register array sum; then writing sum into global memory.
One embodiment of the row reuse algorithm is: each time a row of the input is loaded, all outputs that can be computed from that row are computed using it. Which outputs a loaded row contributes to is determined by the convolution formula, such as the formulas used for fig. 4; the skilled person can make a specific choice based on the picture, the convolution kernel and common knowledge.
S6: calculating the remaining data to be calculated in the convolution output. In fig. 5, the remaining data to be calculated includes the edge data and the unprocessed internal data.
< Experimental Effect >
The method was compared against 5 implementations of 2D convolution: cuDNN, GEMM-im2col, GEMM-im2row, ArrayFire and NPP. The experiments were performed on an NVIDIA RTX 2080 Ti GPU.
(1) 2D convolution experiment
The experimental results for 2D convolution are shown in fig. 7, which gives the speed-up of the method of the invention relative to the other methods for different picture sizes and GPU hardware. It can be seen from the figure that cuDNN, im2col and im2row are not well suited to 2D convolution, while ArrayFire, NPP and the method of the invention achieve very good results. Relative to cuDNN, im2col and im2row, the method of the invention achieved average speed-ups of 5.9, 5.9 and 5.8 times on both platforms. Figs. 7(a) and (b) show the results on the RTX 2080 Ti, where the method of the invention achieves average speed-ups of 3.1 and 1.3 times.
To verify that the method of the invention actually reduces the number of memory accesses, nvprof was used to count the number of memory-block transfers performed by each 2D convolution implementation in the test; the results are shown in Table 1.
Table 1. Memory-block transfer counts
(Table 1 appears as an image in the original publication.)
It can be seen from the table that the method greatly reduces the number of memory-block transfers, which brings the improvement in performance.
(2) Depth-wise convolution experiment
The speed-up of the method of the invention and of cuDNN relative to im2col is shown in fig. 8. On the RTX 2080 Ti, the method reaches average speed-ups of 1.4 and 4 times relative to the fastest cuDNN algorithm on the 3×3 and 5×5 convolution kernels. The memory-access throughput and maximum bandwidth of the method and of the fastest cuDNN algorithm are shown in figs. 10(a) and 10(b); the method matches the fastest cuDNN algorithm on both memory-access performance metrics.
(3) Multichannel 2D convolution experiment
For the multi-channel 2D convolution performance test, different convolution configurations were extracted from common neural networks, with convolution depths of 1 and 3 and a batch size of 128. The speed-up of the method of the invention relative to the other methods is shown in fig. 9. The method achieves average speed-ups of 17.9 and 18.8 times relative to im2col and im2row. On the RTX 2080 Ti, the method achieves an average speed-up of 1.2 times relative to the fastest cuDNN algorithm.
The test also counted the memory-block transfers of the multi-channel 2D convolution, as shown in Table 2; only the results for a depth of 1 are shown. It can be seen that the method of the invention achieves the smallest number of transfers.
Table 2. Memory-block transfer counts for multi-channel 2D convolution
(Table 2 appears as an image in the original publication.)
Although some specific embodiments of the present invention have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (8)

1. A convolution operation memory access optimization method based on a GPU is characterized by comprising the following steps:
loading the convolution kernel data into a shared memory;
dividing the convolution output into sub-blocks of 32 columns to obtain a plurality of sub-blocks containing 32 columns of data and 1 sub-block containing fewer than 32 columns of data;
setting N threads for processing the sub-blocks; each thread calculates the index of the first data required by the thread;
each thread acquires the remaining required input data, starting from the index of the first data, through a column reuse algorithm, and transmits the acquired input data to a row reuse algorithm;
calculating the output results through the row reuse algorithm and storing them in the register array sum; writing sum into the global memory;
calculating the remaining data to be calculated in the convolution output;
the process of the column reuse algorithm is as follows:
each thread loads the first data and the last data required by the thread from the global memory;
each thread acquires required third data from the thread with the interval of 2;
each thread acquires the required second and fourth data from the threads with interval 1;
the process of the row reuse algorithm is as follows:
each time a row of inputs is loaded, all outputs that can be calculated by the row are calculated using the row inputs.
2. The GPU-based convolution operation memory access optimization method of claim 1, wherein a column reuse algorithm can be used for convolution kernels of any size.
3. The GPU-based convolution memory access optimization method according to claim 1 or 2, wherein the convolution operations are 2D convolution, depth-wise convolution and multi-channel 2D convolution.
4. The GPU-based convolutional arithmetic memory access optimization method of claim 1, further comprising:
after each thread finishes loading the required first data and the last data, combining the first data and the last data into 64-bit data, and storing the 64-bit data into a first variable array; the required last data is stored in the upper 32 bits, and the required first data is stored in the lower 32 bits;
and right shifting the variable values corresponding to the threads needing to provide the high-order 32-bit data to other threads in all the threads by 32 bits, right shifting the other threads by 0 bit, and splitting the obtained 64-bit variable array, wherein the high-order 32 bits are used as the fourth data, and the low-order 32 bits are used as the second data.
5. The GPU-based convolutional arithmetic memory access optimization method of claim 1, wherein each thread acquires required data from threads with interval 1 or 2 through a CUDA shuffle instruction.
6. The GPU-based convolution operation memory access optimization method of claim 1, wherein all outputs that can be computed from the row are determined by a computation formula of a convolution algorithm.
7. The GPU-based convolution operation memory access optimization method of claim 1, wherein a column reuse algorithm can be used for convolution kernels with a size of 3 or 5.
8. The GPU-based convolution operation memory access optimization method of claim 1, wherein the remaining data to be computed includes edge data and unprocessed internal data.
CN202010710031.1A 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU Active CN111797985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710031.1A CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710031.1A CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Publications (2)

Publication Number Publication Date
CN111797985A CN111797985A (en) 2020-10-20
CN111797985B true CN111797985B (en) 2022-11-22

Family

ID=72827265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710031.1A Active CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Country Status (1)

Country Link
CN (1) CN111797985B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091299B (en) * 2023-04-07 2023-06-23 南京砺算科技有限公司 Implicit GEMM convolution calculation method, device, equipment and medium based on GPU
CN116088773B (en) * 2023-04-11 2023-06-16 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11803377B2 (en) * 2017-09-08 2023-10-31 Oracle International Corporation Efficient direct convolution using SIMD instructions
AU2017279610A1 (en) * 2017-12-19 2019-07-04 Canon Kabushiki Kaisha Memory access optimisation using per-layer computational mapping and memory allocation for CNN application
CN110321999B (en) * 2018-03-30 2021-10-01 赛灵思电子科技(北京)有限公司 Neural network computational graph optimization method
US10755469B2 (en) * 2018-12-28 2020-08-25 Intel Corporation Apparatus and method for ray tracing instruction processing and execution

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
Gangzhao Lu et al. Optimizing Depthwise Separable Convolution Operations on GPUs. IEEE Transactions on Parallel and Distributed Systems. 2022, Vol. 33, No. 01, 70-87. *
Gopalakrishnan Elango. Convolutional Neural Network Acceleration on GPU by Exploiting Data Reuse. San Jose State University Master's Theses. 2017, 1-67. *
Liu Lei et al. A GPU-based acceleration method for two-dimensional discrete multi-resolution wavelet transform. Journal of Jilin University (Science Edition). 2015, Vol. 53, No. 02, 267-272. *
Zhang Junyang et al. Parallel computing method for two-dimensional matrix convolution. Journal of Zhejiang University (Engineering Science). 2018, Vol. 52, No. 03, 515-523. *
Wang Kaiyu et al. FPGA implementation and optimization of convolutional neural networks. Laboratory Science. 2018, Vol. 21, No. 04, 79-84. *
Xie Genshuan et al. Research on thread placement optimization strategies for CUDA programs. Intelligent Computer and Applications. 2020, Vol. 10, No. 02, 341-345. *
Zou Hong et al. CNN algorithm acceleration based on FPGA. Electronics World. 2019, No. 03, 82-83. *
Chen Peng et al. An optimization method for FPGA convolutional neural network accelerators based on improved dynamic configuration. High Technology Letters. 2020, Vol. 30, No. 03, 240-247. *
Han Bo et al. GPGPU performance model and analysis of application examples. Journal of Computer-Aided Design & Computer Graphics. 2009, Vol. 21, No. 09, 1219-1226. *
Ma Longfei. Research on performance optimization of two-dimensional convolution computation on the CUDA GPU architecture. Electronics World. 2018, No. 02, 56-57. *

Also Published As

Publication number Publication date
CN111797985A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
EP3499428A1 (en) Method and electronic device for convolution calculation in neutral network
CN111797985B (en) Convolution operation memory access optimization method based on GPU
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN102436438B (en) Sparse matrix data storage method based on ground power unit (GPU)
US20110107060A1 (en) Transposing array data on simd multi-core processor architectures
CN110555516B (en) Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
CN108897716B (en) Data processing device and method for reducing calculation amount through memory read-write operation
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
US20230049471A1 (en) Method and apparatus for operating image data
US20230068450A1 (en) Method and apparatus for processing sparse data
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20200364289A1 (en) Data processing method and apparatus
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN115390788A (en) Sparse matrix multiplication distribution system of graph convolution neural network based on FPGA
CN117785480B (en) Processor, reduction calculation method and electronic equipment
US11210090B2 (en) Register-based complex number processing
CN116382617B (en) Singular value decomposition accelerator with parallel ordering function based on FPGA
CN110580675A (en) Matrix storage and calculation method suitable for GPU hardware
Lu et al. Optimizing GPU memory transactions for convolution operations
CN108198128A (en) A kind of method and device of alpha channel boundary corrosions
Zeng et al. Optimizing frequency domain implementation of CNNs on FPGAs
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system
CN112446004B (en) Non-structural grid DILU preconditioned sub-many-core parallel optimization method
Honda et al. A warp-synchronous implementation for multiple-length multiplication on the GPU
CN113011563A (en) Convolutional neural network batch normalization processing method based on GPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant