CN111797985B - Convolution operation memory access optimization method based on GPU - Google Patents

Convolution operation memory access optimization method based on GPU

Info

Publication number
CN111797985B
CN111797985B
Authority
CN
China
Prior art keywords
data
convolution
thread
row
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010710031.1A
Other languages
Chinese (zh)
Other versions
CN111797985A (en)
Inventor
张伟哲 (Zhang Weizhe)
鲁刚钊 (Lu Gangzhao)
王峥 (Wang Zheng)
李克勤 (Li Keqin)
孙广中 (Sun Guangzhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202010710031.1A priority Critical patent/CN111797985B/en
Publication of CN111797985A publication Critical patent/CN111797985A/en
Application granted granted Critical
Publication of CN111797985B publication Critical patent/CN111797985B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A GPU-based memory access optimization method for convolution operations, relating to memory access optimization techniques for convolution. The invention addresses the high memory-access overhead of convolution operations in the prior art. The key technical points are: loading the convolution kernel data into shared memory; dividing the convolution output into sub-blocks of 32 columns, obtaining a plurality of sub-blocks containing 32 columns of data and 1 sub-block containing fewer than 32 columns of data; each thread calculating the index of the first data element it requires; each thread acquiring the remaining required input data, starting from that first index, through a column reuse algorithm and passing the acquired input data to a row reuse algorithm; calculating the output results through the row reuse algorithm and storing them in the register array sum; writing sum into global memory; and calculating the remaining data to be calculated in the convolution output. The method is used for memory access optimization of convolution operations in the fields of image processing, video processing and machine learning.

Description

Convolution operation memory access optimization method based on GPU
Technical Field
The invention relates to a convolution operation memory access optimization technology, in particular to a convolution operation memory access optimization method based on a GPU.
Background
Convolution has become a core computational pattern in the fields of image processing, video processing and machine learning. 2D convolution is widely used in image filtering and frame differencing, depth-wise convolution is common in mobile neural networks, and multi-channel 2D convolution is a core operation in neural networks. However, convolution operations consume a large amount of computational and memory resources and can take up 90% of the execution time in image processing and machine learning. Many optimization methods for convolution operations have been proposed, with GEMM (matrix multiplication), FFT and Winograd based methods being the most widely used. However, these methods require converting the input and output data into matrices of a specific layout before the operation, which increases the memory-access cost. There is therefore a need for an optimization technique that reduces memory accesses to address the deficiencies of the prior art.
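As an illustration of that conversion cost, the following is a minimal im2col-style expansion (single channel, stride 1, no padding). It is a generic background sketch, not code from the patent or from any particular library; all identifiers are illustrative.

```cuda
// Minimal im2col-style expansion (single channel, stride 1, no padding).
// Each output position receives its own copy of a K x K input patch, so the
// input is duplicated roughly K*K times before the matrix multiply -- this
// duplication is the extra memory traffic that GEMM-based convolution pays.
void im2colExpand(const float* in, float* colBuf,
                  int inW, int outW, int outH, int K) {
    for (int row = 0; row < outH; ++row)
        for (int col = 0; col < outW; ++col)
            for (int i = 0; i < K; ++i)
                for (int j = 0; j < K; ++j)
                    // one row of colBuf per output position, one entry per kernel element
                    colBuf[((row * outW + col) * K + i) * K + j] =
                        in[(row + i) * inW + (col + j)];
}
```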
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to solve the problems that the access and storage cost of the convolution operation in the prior art is high, and the access and storage times of the convolution are large, so that the performance of the convolution operation is reduced.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a convolution operation memory access optimization method based on a GPU, which comprises the following steps: loading the convolution kernel data into a shared memory; dividing the convolution output into subblocks by 32 columns to obtain a plurality of subblocks containing 32 columns of data and 1 subblock less than 32 columns of data; n threads for processing the sub-blocks are set; each thread calculates the index of the first data required by the thread; each thread acquires the residual required input data from the index of the first data through a column reuse algorithm and transmits the acquired input data to a row reuse algorithm; calculating an output result through a row reuse algorithm and storing the output result in register data sum; writing sum into the global memory; and calculating the rest data to be calculated in the convolution output.
Preferably, the convolution kernel is of arbitrary size.
Preferably, the convolution operation is a 2D convolution, a depth-wise convolution or a multi-channel 2D convolution.
Preferably, the process of the column reuse algorithm is as follows: each thread loads, from global memory, the first and last data elements it requires; each thread acquires the required third data element from the thread at a distance of 2; each thread acquires the required second and fourth data elements from the thread at a distance of 1.
Preferably, the method of the present invention further comprises: after each thread has finished loading its required first and last data elements, combining them into one 64-bit value and storing it in a first variable array, with the required last data element in the upper 32 bits and the required first data element in the lower 32 bits; right-shifting by 32 bits the variable values of those threads, among all the threads, that need to provide their upper 32-bit data to other threads, right-shifting the other threads by 0 bits, and splitting the resulting 64-bit variable array, with the upper 32 bits used as the fourth data element and the lower 32 bits used as the second data element.
Preferably, the process of the row reuse algorithm is: each time a row of the input is loaded, all outputs that can be computed from that row are computed using it.
Preferably, each thread fetches the required data from a thread with interval 1 or 2 via a CUDA shuffle instruction.
Preferably, all outputs that can be computed from the row are determined by the calculation formula of the convolution algorithm.
Preferably, the convolution kernel size is 3 or 5.
Preferably, the remaining data includes edge data and unprocessed internal data.
The beneficial effects of the invention are as follows: the number of data copies performed by convolution can be reduced, improving the performance of convolution operations. The invention reduces the memory-access overhead of convolution, and by reducing the number of memory accesses it greatly improves convolution performance. One application of the invention is memory access optimization of convolution operations in the fields of image processing, video processing and machine learning; since convolution is a core computational pattern in these fields, applying the method can greatly improve their speed and efficiency. The invention can also be applied to other technical fields that use convolution as a core computational pattern.
In one embodiment, a significant speed-up is obtained over other algorithms and the number of memory-block transfers is greatly reduced.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of a 2D convolution operation;
FIG. 2 is a schematic diagram of a column reuse algorithm according to one embodiment of the present invention; FIG. 2 (a) shows a loading method in the prior art; FIG. 2 (b) is a diagram illustrating each thread obtaining a third data according to an embodiment of the present invention; FIG. 2 (c) is a diagram illustrating the acquisition of second and fourth data by each thread in one embodiment of the invention;
FIG. 3 is a diagram illustrating conversion of a dynamic index into a static index according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a row reuse algorithm according to one embodiment of the present invention; wherein FIG. 4 (a) is a schematic illustration of an input; FIG. 4 (b) is a schematic diagram of a convolution kernel; FIG. 4 (c) is a schematic of the output;
FIG. 5 is a diagram illustrating a thread to output mapping relationship, according to an embodiment;
FIG. 6 is a flow chart of a method of one embodiment of the present invention;
FIG. 7 is a performance comparison of 2D convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 7(a) is a speed-up comparison for 2D 3×3 convolution and FIG. 7(b) is a speed-up comparison for 2D 5×5 convolution;
FIG. 8 is a performance comparison of depth-wise convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 8(a) is a speed-up comparison for depth-wise 3×3 convolution and FIG. 8(b) is a speed-up comparison for depth-wise 5×5 convolution.
FIG. 9 is a performance comparison of multi-channel 2D convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 9(a) is a speed-up comparison for multi-channel 2D convolution with a convolution depth of 1 and FIG. 9(b) is a speed-up comparison for multi-channel 2D convolution with a convolution depth of 3.
FIG. 10 is a memory-access performance comparison of depth-wise convolution on an NVIDIA RTX 2080 Ti GPU; wherein FIG. 10(a) compares memory-access throughput and FIG. 10(b) compares maximum memory-access bandwidth.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be considered a part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as exemplary only and not as limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
It is an object of the present invention to improve the performance of convolution operations by reducing the number of accesses to memory for convolution. The design concept of the present invention is illustrated by an example.
< example >
Fig. 1 is a schematic diagram of this example, which reduces the number of memory accesses in the convolution operation mainly through two algorithms. Fig. 1 shows a simple 2D convolution with an input picture of size 6×11, a convolution kernel of size 5×5 and an output of size 2×6, with each CUDA thread computing one column of outputs.
As can be seen from the figure, there are 4 columns of repeated data between the input data processed by thread 0 and thread 1, and 4 rows of repeated data in the input data loaded by thread 6. For these two forms of repeated data, the invention provides two optimization methods: (1) a column reuse algorithm, in which each thread loads only the first and last columns it needs and then uses CUDA shuffle instructions to obtain the remaining columns from other threads; (2) a row reuse algorithm, in which each thread loads each repeated row only once and then computes multiple outputs by multiplying each row of data with multiple rows of the convolution kernel. A serious performance problem arises when a dynamically indexed array is used inside a shuffle instruction: CUDA places such arrays in the GPU's local memory, whose access latency is the same as that of global memory, causing severe performance degradation. Pack and unpack operations are used to solve this performance problem.
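For reference, a minimal sketch of the baseline that Fig. 1 depicts, with each thread computing one column of outputs by direct convolution, could look as follows. This is an illustrative sketch, not the patent's kernel, and all parameter names are assumptions.

```cuda
// Direct 2D convolution baseline: one thread per output column, as in Fig. 1.
// Every thread re-reads all K*K inputs for every output element, which is the
// redundant loading that the column reuse and row reuse algorithms remove.
__global__ void directConv2d(const float* in, const float* filt, float* out,
                             int inW, int outW, int outH, int K) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column owned by this thread
    if (col >= outW) return;
    for (int row = 0; row < outH; ++row) {             // walk down the output column
        float sum = 0.f;
        for (int i = 0; i < K; ++i)                    // kernel height
            for (int j = 0; j < K; ++j)                // kernel width
                sum += in[(row + i) * inW + (col + j)] * filt[i * K + j];
        out[row * outW + col] = sum;
    }
}
```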
1. Column reuse algorithm:
Taking thread 0 and thread 1 in fig. 1 as an example, the column reuse algorithm is illustrated in fig. 2. Fig. 2(a) shows how the direct convolution algorithm loads data. In the first step, each thread loads the first data element it requires from global memory. In the second step, each thread loads the second data element it requires from global memory, and so on until each thread has loaded 5 data elements. It can be seen that part of the data loaded by each thread in the second step was already loaded in the first step, so data is loaded repeatedly. To solve this problem, the invention proposes a column reuse algorithm, as shown in fig. 2(b) and (c).
In the first step, each thread loads its first data element from global memory. In the second step, each thread loads its last data element. In the third step, each thread acquires the data it needs from the neighbour at a distance of 2 using the shuffle instruction __shfl_xor_sync(0xFFFFFFFF, iTemp[2], 2). Thread 0 and thread 1 obtain their required third data element from thread 2 and thread 3, respectively, and provide data to thread 2 and thread 3. At the same time, thread 2 and thread 3 obtain their required third data element from thread 0 and thread 1, respectively, and provide data to thread 0 and thread 1. In the fourth and fifth steps, each thread acquires the required data from the neighbours at a distance of 1, in a manner similar to the third step.
iTemp[i] in the shuffle instruction is a dynamically indexed array; the CUDA compiler cannot determine its address at compile time, so the variable is placed in local memory. However, the access latency of local memory is the same as that of global memory, which degrades the program's performance. The dynamic-indexing problem is solved with Algorithm 1 and Algorithm 2, where Algorithm 1 handles the third step of fig. 2(b) and Algorithm 2 handles the fourth and fifth steps of fig. 2(c).
(Algorithm 1 and Algorithm 2 appear as image listings in the original publication.)
The following uses Algorithm 1 and fig. 3 to illustrate how the dynamic index is eliminated. Fig. 3 shows the process of lines 4-7 of Algorithm 1. Each thread first loads the first and last data elements it must process into registers (lines 2-3). The two 32-bit values are then merged into one 64-bit value and stored in the variable exchange (line 4), with iTemp[4] in the upper 32 bits and iTemp[0] in the lower 32 bits. Thread 2 and thread 3 in fig. 3 each need to provide their first data element, i.e. the lower 32 bits of the exchange variable, so the exchange variable in thread 2 and thread 3 is shifted right by 0 bits (threads that need to provide their last data element instead shift right by 32 bits). After shifting, the exchange variable is split, with the upper 32 bits stored in iTemp[2] and the lower 32 bits stored in iTemp[1] (line 7). At this point, iTemp[1] holds the data that each thread needs to provide. Finally, the shuffle instruction is used to exchange iTemp[1] between threads (line 8).
The third step in fig. 2(b) and the fourth and fifth steps in fig. 2(c) follow similar processes, so Algorithm 1 can be modified slightly to obtain Algorithm 2. In fig. 2(c), adjacent threads need to exchange data, so in Algorithm 2 the per-thread shift amount (line 3) and the exchange pattern (line 6) are modified.
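Since Algorithms 1 and 2 are published only as images, the following is an interpretive sketch of the pack-shift-split-shuffle exchange for the distance-2 case (the third step of fig. 2(b), for a 5-wide kernel). The lane arithmetic and all identifiers are assumptions derived from the description of fig. 3, not the patent's own listing.

```cuda
// Interpretive sketch of the Algorithm 1 exchange: each lane has already loaded
// its first and last columns; the third column is obtained from the lane two
// away without any dynamically indexed array.
__device__ float exchangeThirdColumn(float first, float last) {
    int lane = threadIdx.x & 31;
    // Pack the two 32-bit values into one 64-bit word: last in the high half,
    // first in the low half (the "exchange" variable of fig. 3).
    unsigned long long packed =
        ((unsigned long long)__float_as_uint(last) << 32) | __float_as_uint(first);
    // Lanes that must hand over their last column shift it down into the low
    // half (shift by 32); lanes that hand over their first column shift by 0.
    int shift = (lane & 2) ? 0 : 32;
    unsigned int toSend = (unsigned int)(packed >> shift);    // static, stays in a register
    // Exchange with the lane whose id differs in bit 1 (distance 2).
    unsigned int received = __shfl_xor_sync(0xFFFFFFFFu, toSend, 2);
    return __uint_as_float(received);                          // the third column this lane needs
}
```

The distance-1 exchanges of the fourth and fifth steps would follow the same pattern with a different shift selection and a laneMask of 1, as Algorithm 2 does according to the text above.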
2. Row reuse algorithm
FIG. 4 illustrates the process of convolving an input with a convolution kernel along the height direction to generate a column of outputs, where row denotes a row of data. The direct convolution algorithm computes:
out_1 = row_i1·row_f1 + row_i2·row_f2 + row_i3·row_f3
out_2 = row_i2·row_f1 + row_i3·row_f2 + row_i4·row_f3
out_3 = row_i3·row_f1 + row_i4·row_f2 + row_i5·row_f3
From the above equations, row_i2 and row_i4 are each loaded twice and row_i3 is loaded three times. To reduce this repeated loading, after each input row is loaded it is used to compute as many different outputs as possible. For example, row_i1 is used to compute out_1, and row_i2 is used to compute out_1 and out_2. Reorganising the computation in this way means that each input row needs to be loaded only once, as follows:
load row_i1:  out_1 = row_i1·row_f1
load row_i2:  out_1 = out_1 + row_i2·row_f2
              out_2 = row_i2·row_f1
load row_i3:  out_1 = out_1 + row_i3·row_f3
              out_2 = out_2 + row_i3·row_f2
              out_3 = row_i3·row_f1
load row_i4:  out_2 = out_2 + row_i4·row_f3
              out_3 = out_3 + row_i4·row_f2
load row_i5:  out_3 = out_3 + row_i5·row_f3
As the above formulas show, each input row needs to be loaded only once. A general way of organising this load-and-compute pattern is shown in Algorithm 3.
(Algorithm 3 appears as an image listing in the original publication.)
In Algorithm 3, row denotes a line of data loaded from the input, index denotes that line's row number within the input, and filter denotes the set of rows of the convolution kernel. Lines 3-5 of the algorithm process the first F_H - 1 rows of the input, for which the number of outputs to be computed is less than F_H. Rows that are covered by exactly F_H outputs are processed in lines 3-11 of the algorithm. Lines 3-17 of the algorithm process the last F_H - 1 rows of the input.
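Since Algorithm 3 is likewise published only as an image, the following minimal sketch reproduces the accumulation pattern of the formulas above with the width dimension reduced to a single element (in[0..4] stand for row_i1..row_i5 and f[0..2] for row_f1..row_f3; all identifiers are illustrative, not the patent's listing).

```cuda
// Row reuse with a 3-row filter: each input row is read exactly once and
// immediately contributes to every output row it covers.
__device__ void rowReuseExample(const float in[5], const float f[3], float out[3]) {
    out[0] = out[1] = out[2] = 0.f;
    for (int r = 0; r < 5; ++r) {            // load each input row exactly once
        float row = in[r];
        // update every output that this input row contributes to:
        // output o uses filter row (r - o) for input row r
        for (int o = 0; o < 3; ++o) {
            int k = r - o;
            if (k >= 0 && k < 3) out[o] += row * f[k];
        }
    }
}
```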
3. Final access optimization algorithm
Taking 2D convolution as an example, this section explains how the column reuse and row reuse algorithms are applied to the convolution operation. In this implementation, the output is first partitioned into sub-blocks, each containing 32 columns of data; the last sub-block may contain fewer than 32 columns. Each CUDA thread block processes one or more sub-blocks, and each warp processes one sub-block. The partitioning is shown in fig. 5.
Edge data and internal data are processed in different ways. In fig. 5, edge data is drawn as shaded squares and internal data as dotted squares. Assume each warp contains 4 threads. The internal data is split into two sub-blocks: sub-block 0 contains 4 columns, which can be handled by exactly one warp; sub-block 1 contains only two columns, so to make full use of the threads these two columns are divided evenly into 4 parts and distributed to the 4 threads. Algorithm 4 shows the overall procedure.
(Algorithm 4 appears as an image listing in the original publication.)
In Algorithm 4, the convolution kernel data is first loaded into shared memory (lines 1-2). The sub-blocks containing exactly 32 columns are then processed in lines 3-13, and the last sub-block is processed in lines 14-17. Each thread computes one column of outputs as follows: first, each thread computes the address of the first input data element it needs (lines 4-6); then, each thread acquires the remaining required input data with Algorithm 1 and Algorithm 2 (line 8); next, each thread passes the acquired input data to Algorithm 3, which computes several outputs and stores the results in the register array sum (line 9); finally, the sum array is written to global memory (line 12).
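Because the Algorithm 4 listing is only available as an image, the following structural outline is an interpretation of how such a kernel might be organised; it combines the column reuse load with the row reuse accumulation, leaves the two reuse steps as comments, and uses illustrative identifiers throughout.

```cuda
// Structural outline (interpretation, not the patent's code): one warp per
// 32-column sub-block, one thread per output column.
__global__ void conv2dReuseOutline(const float* in, const float* filt, float* out,
                                   int inW, int inH, int outW, int outH, int K) {
    extern __shared__ float sFilt[];
    // lines 1-2: stage the convolution kernel in shared memory
    for (int i = threadIdx.x; i < K * K; i += blockDim.x) sFilt[i] = filt[i];
    __syncthreads();

    int lane = threadIdx.x & 31;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int col  = warp * 32 + lane;          // output column owned by this thread
    if (col >= outW) return;              // the ragged last sub-block and the edge
                                          // data are handled separately (lines 14-17)
    for (int r = 0; r < inH; ++r) {
        // lines 4-8: column reuse -- load the first/last element of input row r
        //            and obtain the middle elements from neighbouring lanes
        //            (Algorithms 1 and 2)
        // line 9:    row reuse -- update every output row that input row r
        //            contributes to, accumulating into the register array sum
        //            (Algorithm 3)
    }
    // line 12: write the accumulated sum array back to global memory
}
```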
< embodiment >
An embodiment of the present invention is described below on the basis of the foregoing example.
For the purpose of memory access optimization, one embodiment of the present invention is shown in fig. 6, and includes:
s1: and loading the convolution kernel data into the shared memory.
S2: dividing the convolution output into sub-blocks of 32 columns, obtaining a plurality of sub-blocks containing 32 columns of data and 1 sub-block containing fewer than 32 columns of data, i.e. the partitioning shown in fig. 5.
S3: setting N threads for processing the sub-blocks; each thread computes the index of the first data element it needs. The index of the first data is that of the first (leftmost) data element required by each thread as shown in fig. 2; the other required data can be located by offsetting from this first index.
S4: each thread acquires the remaining required input data, starting from the index of the first data, through the column reuse algorithm, and passes the acquired input data to the row reuse algorithm. Acquiring the remaining required input data corresponds to fetching data from the neighbouring threads at a distance of 1 or 2, as in Algorithm 1 and Algorithm 2.
One embodiment of the column reuse algorithm is: each thread loads the first and last data elements it requires from global memory; each thread acquires the required third data element from the thread at a distance of 2; each thread acquires the required second and fourth data elements from the thread at a distance of 1. Each thread may fetch the required data from a thread at a distance of 1 or 2 via a CUDA shuffle instruction.
Further, in order to solve the dynamic-indexing problem, the column reuse algorithm exchanges data as follows: after each thread has finished loading its required first and last data elements, they are combined into one 64-bit value and stored in a first variable array, with the required last data element in the upper 32 bits and the required first data element in the lower 32 bits; the variable values of those threads, among all the threads, that need to provide their upper 32-bit data to other threads are right-shifted by 32 bits, the other threads are right-shifted by 0 bits, and the resulting 64-bit variable array is split, with the upper 32 bits used as the fourth data element and the lower 32 bits used as the second data element.
It should be noted that the above embodiment describes the case where the convolution kernel size is 5; those skilled in the art can derive the corresponding implementation for a kernel size of 3. Likewise, the 2D convolution procedure is similar to depth-wise convolution and multi-channel 2D convolution, and those skilled in the art can derive the corresponding adaptations for depth-wise and multi-channel 2D convolution from the examples of the present invention.
S5: calculating the output results through the row reuse algorithm and storing them in the register array sum; then writing sum into global memory.
One embodiment of the row reuse algorithm is: each time a row of the input is loaded, all outputs that can be computed from that row are computed using it. Which outputs a loaded row contributes to is determined by the convolution formula, such as the formulas used for fig. 4; the skilled person can make a specific choice based on the picture, the convolution kernel and common knowledge.
S6: calculating the remaining data to be calculated in the convolution output. In fig. 5, the remaining data to be calculated includes the edge data and the unprocessed internal data.
< Experimental Effect >
The method was compared against 5 implementations of 2D convolution: cuDNN, GEMM-im2col, GEMM-im2row, ArrayFire and NPP. The experiments were performed on an NVIDIA RTX 2080 Ti GPU.
(1) 2D convolution experiment
The experimental results for 2D convolution are shown in fig. 7, which gives the speed-up of the method of the invention relative to the other methods for different picture sizes and GPU hardware. It can be seen from the figure that cuDNN, im2col and im2row are not well suited to 2D convolution, while ArrayFire, NPP and the method of the invention achieve very good results. Relative to cuDNN, im2col and im2row, the method of the invention achieved average speed-ups of 5.9, 5.9 and 5.8 times on both platforms. Figs. 7(a) and (b) show the results on the RTX 2080 Ti, where the method of the invention achieves average speed-ups of 3.1 and 1.3 times.
To verify that the method of the invention actually reduces the number of memory accesses, nvprof was used to count the number of memory-block transfers performed by each 2D convolution implementation in the test; the results are shown in Table 1.
Table 1. Memory-block transfer counts
(Table 1 appears as an image in the original publication.)
It can be seen from the table that the method greatly reduces the number of memory-block transfers, which brings the improvement in performance.
(2) Depth-wise convolution experiment
The speed-up of the method of the invention and of cuDNN relative to im2col is shown in fig. 8. On the RTX 2080 Ti, the method reaches average speed-ups of 1.4 and 4 times relative to the fastest cuDNN algorithm on the 3×3 and 5×5 convolution kernels. The memory-access throughput and maximum bandwidth of the method and of the fastest cuDNN algorithm are shown in figs. 10(a) and 10(b); the method matches the fastest cuDNN algorithm on both memory-access performance metrics.
(3) Multichannel 2D convolution experiment
For the multi-channel 2D convolution performance test, different convolution configurations were extracted from common neural networks, with convolution depths of 1 and 3 and a batch size of 128. The speed-up of the method of the invention relative to the other methods is shown in fig. 9. The method achieves average speed-ups of 17.9 and 18.8 times relative to im2col and im2row. On the RTX 2080 Ti, the method achieves an average speed-up of 1.2 times relative to the fastest cuDNN algorithm.
The test also counted the memory-block transfers of the multi-channel 2D convolution, as shown in Table 2; only the results for a depth of 1 are shown. It can be seen that the method of the invention achieves the smallest number of transfers.
Table 2. Memory-block transfer counts for multi-channel 2D convolution
(Table 2 appears as an image in the original publication.)
Although some specific embodiments of the present invention have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (8)

1. A convolution operation memory access optimization method based on a GPU is characterized by comprising the following steps:
loading the convolution kernel data into a shared memory;
dividing the convolution output into sub-blocks of 32 columns to obtain a plurality of sub-blocks containing 32 columns of data and 1 sub-block containing fewer than 32 columns of data;
setting N threads for processing the sub-blocks; each thread calculates the index of the first data required by the thread;
each thread acquires the remaining required input data, starting from the index of the first data, through a column reuse algorithm, and transmits the acquired input data to a row reuse algorithm;
calculating the output results through the row reuse algorithm and storing them in the register array sum; writing sum into the global memory;
calculating the remaining data to be calculated in the convolution output;
the process of the column reuse algorithm is as follows:
each thread loads the first data and the last data required by the thread from the global memory;
each thread acquires required third data from the thread with the interval of 2;
each thread acquires the required second and fourth data from the threads with interval 1;
the process of the row reuse algorithm is as follows:
each time a row of inputs is loaded, all outputs that can be calculated by the row are calculated using the row inputs.
2. The GPU-based convolution operation memory access optimization method of claim 1, wherein a column reuse algorithm can be used for convolution kernels of any size.
3. The GPU-based convolution memory access optimization method according to claim 1 or 2, wherein the convolution operations are 2D convolution, depth-wise convolution and multi-channel 2D convolution.
4. The GPU-based convolutional arithmetic memory access optimization method of claim 1, further comprising:
after each thread finishes loading the required first data and the last data, combining the first data and the last data into 64-bit data, and storing the 64-bit data into a first variable array; the required last data is stored in the upper 32 bits, and the required first data is stored in the lower 32 bits;
and right shifting the variable values corresponding to the threads needing to provide the high-order 32-bit data to other threads in all the threads by 32 bits, right shifting the other threads by 0 bit, and splitting the obtained 64-bit variable array, wherein the high-order 32 bits are used as the fourth data, and the low-order 32 bits are used as the second data.
5. The GPU-based convolutional arithmetic memory access optimization method of claim 1, wherein each thread acquires required data from threads with interval 1 or 2 through a CUDA shuffle instruction.
6. The GPU-based convolution operation memory access optimization method of claim 1, wherein all outputs that can be computed from the row are determined by a computation formula of a convolution algorithm.
7. The GPU-based convolution operation memory access optimization method of claim 1, wherein a column reuse algorithm can be used for convolution kernels with a size of 3 or 5.
8. The GPU-based convolution operation memory access optimization method of claim 1, wherein the remaining data to be computed includes edge data and unprocessed internal data.
CN202010710031.1A 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU Active CN111797985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710031.1A CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710031.1A CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Publications (2)

Publication Number Publication Date
CN111797985A CN111797985A (en) 2020-10-20
CN111797985B true CN111797985B (en) 2022-11-22

Family

ID=72827265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710031.1A Active CN111797985B (en) 2020-07-22 2020-07-22 Convolution operation memory access optimization method based on GPU

Country Status (1)

Country Link
CN (1) CN111797985B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091299B (en) * 2023-04-07 2023-06-23 南京砺算科技有限公司 Implicit GEMM convolution calculation method, device, equipment and medium based on GPU
CN116088773B (en) * 2023-04-11 2023-06-16 南京砺算科技有限公司 Data loading method, device, equipment and medium based on implicit GEMM convolution

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11803377B2 (en) * 2017-09-08 2023-10-31 Oracle International Corporation Efficient direct convolution using SIMD instructions
AU2017279610A1 (en) * 2017-12-19 2019-07-04 Canon Kabushiki Kaisha Memory access optimisation using per-layer computational mapping and memory allocation for CNN application
CN110321999B (en) * 2018-03-30 2021-10-01 赛灵思电子科技(北京)有限公司 Neural network computational graph optimization method
US10755469B2 (en) * 2018-12-28 2020-08-25 Intel Corporation Apparatus and method for ray tracing instruction processing and execution

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846235A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN110308982A (en) * 2018-03-20 2019-10-08 华为技术有限公司 A kind of shared drive multiplexing method and device
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN110348574A (en) * 2019-07-17 2019-10-18 哈尔滨理工大学 A kind of general convolutional neural networks accelerating structure and design method based on ZYNQ
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
Gangzhao Lu et al. Optimizing Depthwise Separable Convolution Operations on GPUs. IEEE Transactions on Parallel and Distributed Systems. 2022, Vol. 33, No. 01, 70-87. *
Gopalakrishnan Elango. Convolutional Neural Network Acceleration on GPU by Exploiting Data Reuse. San Jose State University Master's Theses. 2017, 1-67. *
Liu Lei et al. A GPU-based acceleration method for two-dimensional discrete multi-resolution wavelet transform. Journal of Jilin University (Science Edition). 2015, Vol. 53, No. 02, 267-272. *
Zhang Junyang et al. Parallel computing method for two-dimensional matrix convolution. Journal of Zhejiang University (Engineering Science). 2018, Vol. 52, No. 03, 515-523. *
Wang Kaiyu et al. FPGA implementation and optimization of convolutional neural networks. Laboratory Science. 2018, Vol. 21, No. 04, 79-84. *
Xie Genshuan et al. Research on thread placement optimization strategies for CUDA programs. Intelligent Computer and Applications. 2020, Vol. 10, No. 02, 341-345. *
Zou Hong et al. CNN algorithm acceleration based on FPGA. Electronics World. 2019, No. 03, 82-83. *
Chen Peng et al. An optimization method for FPGA convolutional neural network accelerators based on improved dynamic configuration. High Technology Letters. 2020, Vol. 30, No. 03, 240-247. *
Han Bo et al. GPGPU performance model and analysis of application examples. Journal of Computer-Aided Design & Computer Graphics. 2009, Vol. 21, No. 09, 1219-1226. *
Ma Longfei. Research on performance optimization of two-dimensional convolution computation on the CUDA GPU architecture. Electronics World. 2018, No. 02, 56-57. *

Also Published As

Publication number Publication date
CN111797985A (en) 2020-10-20

Similar Documents

Publication Publication Date Title
EP3499428A1 (en) Method and electronic device for convolution calculation in neutral network
CN111797985B (en) Convolution operation memory access optimization method based on GPU
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
CN102436438B (en) Sparse matrix data storage method based on ground power unit (GPU)
US20110107060A1 (en) Transposing array data on simd multi-core processor architectures
CN110555516B (en) Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
CN108897716B (en) Data processing device and method for reducing calculation amount through memory read-write operation
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
US20230049471A1 (en) Method and apparatus for operating image data
US20230068450A1 (en) Method and apparatus for processing sparse data
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20200364289A1 (en) Data processing method and apparatus
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN115390788A (en) Sparse matrix multiplication distribution system of graph convolution neural network based on FPGA
CN117785480B (en) Processor, reduction calculation method and electronic equipment
US11210090B2 (en) Register-based complex number processing
CN116382617B (en) Singular value decomposition accelerator with parallel ordering function based on FPGA
CN110580675A (en) Matrix storage and calculation method suitable for GPU hardware
Lu et al. Optimizing GPU memory transactions for convolution operations
CN108198128A (en) A kind of method and device of alpha channel boundary corrosions
Zeng et al. Optimizing frequency domain implementation of CNNs on FPGAs
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system
CN112446004B (en) Non-structural grid DILU preconditioned sub-many-core parallel optimization method
Honda et al. A warp-synchronous implementation for multiple-length multiplication on the GPU
CN113011563A (en) Convolutional neural network batch normalization processing method based on GPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant