CN109522127B - Fluid machinery simulation program heterogeneous acceleration method based on GPU


Info

Publication number
CN109522127B
Authority
CN
China
Prior art keywords
gpu
memory
acceleration
data
program
Prior art date
Legal status
Active
Application number
CN201811378843.XA
Other languages
Chinese (zh)
Other versions
CN109522127A (en)
Inventor
张兴军
赵文强
董小社
李靖波
雷雨
鲁晨欣
周剑锋
伍卫国
邹年俊
何峰
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811378843.XA
Publication of CN109522127A
Application granted
Publication of CN109522127B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a GPU-based heterogeneous acceleration method for fluid machinery simulation programs, which comprises: performing hotspot analysis to find the subprograms with acceleration potential; reducing data transfer between host and device; combining memory-access coalescing with the use of several memory types to raise the effective memory bandwidth and the compute-to-memory-access ratio; restructuring code to expose data parallelism, since explicit global synchronization of GPU kernels loses data parallelism and should be avoided as much as possible; adapting serial algorithms to the GPU, i.e. replacing a serial algorithm with a functionally equivalent parallel algorithm; and tuning thread-allocation parameters so that memory-access latency is fully hidden and computational throughput improves. If these steps achieve the desired effect, the acceleration is complete; otherwise a new iteration starts from the hotspot analysis until a satisfactory effect is reached. The invention provides a GPU acceleration method tailored to the characteristics of fluid machinery simulation programs, and the modified program can achieve an ideal acceleration effect.

Description

Fluid machinery simulation program heterogeneous acceleration method based on GPU
Technical Field
The invention belongs to the cross field of fluid mechanics and high-performance computing, and particularly relates to a GPU-based heterogeneous acceleration method for fluid machinery simulation programs.
Background
Computational fluid dynamics is one of the important techniques in the field of fluid mechanics: the behavior of a flow field can be predicted by solving the governing equations of fluid mechanics numerically on a computer. As computing power increases, increasingly sophisticated fluid mechanics models can be constructed. At the same time, in order to compute flow field changes more accurately and make the simulated flow more realistic, fluid mechanics models are becoming ever more refined, which places higher demands on computing capacity.
With single-core chip transistor density and frequency reaching a bottleneck, multi-core designs have become the dominant way to increase computing power. NVIDIA, AMD, Intel and others have also introduced a series of specialized hardware for processing intensive computations, with NVIDIA GPUs showing excellent performance and power efficiency. Traditional fluid machinery programs are optimized for the CPU (Central Processing Unit) architecture; with the rise of heterogeneous computing, the gap between coprocessor architectures and the CPU architecture keeps widening, and the programming model and optimization schemes of traditional fluid machinery simulation programs cannot be directly adapted to the new hardware architectures. Whether writing a fluid machinery simulation program for a new hardware platform or porting an existing CPU-oriented fluid machinery simulation program to the GPU, a new guidance scheme is required.
Explicit solvers in fluid machinery simulation programs mostly use the finite volume method as the basis for discretizing the fluid: the governing equations are iterated on a grid until a convergence condition is met, and the physical attributes in the grid cells are updated in each iteration step, so the application is compute-intensive. When writing a new CPU/GPU heterogeneous program or porting an existing serial program to the GPU, attention must be paid to matching the characteristics of the new hardware architecture.
Disclosure of Invention
The invention aims to provide a GPU-based heterogeneous acceleration method for fluid machinery simulation programs, so as to solve the above problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
a fluid machinery simulation program heterogeneous acceleration method based on a GPU comprises the following steps:
step 1, performing hotspot analysis on a fluid mechanical simulation program in a mode of combining static analysis and dynamic analysis; and carrying out preliminary parallelization realization on the selected hot spots;
step 2, aiming at the primary parallelization program in the step 1, for the part with intensive data transmission among the host devices, transplanting an intermediate result to a GPU (graphics processing unit) end, and reducing the data transmission among the host devices;
step 3, aiming at the program in the step 2, through the memory layout of the distributed grid data, the threads in the same thread bundle read and update the adjacent grid data in the memory, and utilize the shared memory, the constant memory and the texture memory according to the program characteristics, so as to fully play the parallelism;
step 4, reconstructing the exposed data parallelism through codes;
step 5, GPU adaptation of a serial algorithm; for the program in the step 4, if the running time of an individual serial algorithm is large and the individual serial algorithm cannot run at the GPU end, an iterative method is used for replacing a primitive elimination method when an equation is solved;
step 6, adjusting thread distribution parameters, determining thread distribution capable of maximizing hidden memory access delay and improving calculation throughput;
step 7, testing the acceleration effect of the programs in the steps 1 to 6, and if the acceleration ratio is in accordance with the expectation, finishing the acceleration; otherwise, starting a new round of accelerated iteration from the step 1; until a satisfactory acceleration effect is achieved.
Further, in step 1, by collecting and analyzing the share of running time taken by each hotspot part, the parts of the program with a large share of the running time are identified, and parallel optimization is performed on those parts to approach the theoretical upper limit of acceleration.
Further, in step 2, the entire multigrid computation is placed on the GPU side for acceleration, and all intermediate results are kept on the GPU side.
Further, in step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, improving the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the pressure the threads place on global memory bandwidth, improving the speed-up ratio.
Further, in step 4, for the parts that would produce data races after being ported to the parallel version, data races are avoided by adjusting the computation pattern and computation order; for the parts where this is not possible, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
Further, in step 5, serial algorithms in the program that cannot be parallelized are replaced with functionally equivalent algorithms that possess data parallelism, and the parallel algorithms are implemented on the GPU.
Further, in step 6, after the above steps are completed, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Compared with the prior art, the invention has the following technical effects:
Through a reasonable memory layout, the invention increases the effective bandwidth of GPU global memory access and fully exploits the computational advantages of the GPU; by replacing non-parallelizable serial algorithms, it enlarges the accelerated region of the program and raises the speed-up ratio; and by avoiding global synchronization wherever possible, it minimizes the damage global synchronization does to program parallelism.
Compared with the original CPU program, the GPU version achieves a remarkable speed-up ratio.
Drawings
FIG. 1 is a flow chart of a method provided by the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a fluid machinery simulation program heterogeneous acceleration method based on a GPU includes the following steps:
Step 1, performing hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and producing a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminary parallel program of step 1, migrating intermediate results to the GPU (graphics processing unit) side in the parts with intensive host-device data transfer, thereby reducing data transfer between host and device;
Step 3, for the program of step 2, laying out the grid data in memory so that threads in the same warp read and update grid data that are adjacent in memory, and using shared memory, constant memory and texture memory according to the program's characteristics, so as to fully exploit parallelism;
Step 4, exposing data parallelism through code restructuring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if an individual serial algorithm takes a large share of the running time and cannot run on the GPU side, replacing the original elimination method with an iterative method when solving equations;
Step 6, tuning thread-allocation parameters to determine the thread configuration that best hides memory-access latency and improves computational throughput;
Step 7, testing the acceleration achieved by the programs of steps 1 to 6; if the speed-up ratio meets expectations, the acceleration is complete; otherwise, a new round of acceleration iteration starts from step 1, until a satisfactory acceleration effect is reached.
In step 1, by collecting and analyzing the share of running time taken by each hotspot part, the parts of the program with a large share of the running time are identified, and parallel optimization is performed on those parts to approach the theoretical upper limit of acceleration.
In step 2, the entire multigrid computation is placed on the GPU side for acceleration, and all intermediate results are kept on the GPU side.
In step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, improving the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the pressure the threads place on global memory bandwidth, improving the speed-up ratio.
In step 4, for the parts that would produce data races after being ported to the parallel version, data races are avoided by adjusting the computation pattern and computation order; for the parts where this is not possible, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
In step 5, serial algorithms in the program that cannot be parallelized are replaced with functionally equivalent algorithms that possess data parallelism, and the parallel algorithms are implemented on the GPU.
In step 6, after the above steps are completed, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Hotspot analysis. Amdahl's law states that the upper limit of the speed-up obtainable by accelerating a program with parallel computing is bounded by the fraction of the original serial running time occupied by the accelerated portion. According to this law, the first step of parallel optimization is to find the hotspots: only by identifying the parts of the program with a large share of the running time and optimizing them in parallel can an ideal acceleration effect be obtained. If non-hotspots are chosen as acceleration targets, the parallelization effort necessarily yields little. Program hotspots are generally analyzed in two ways, static and dynamic. The static approach finds hotspots by analyzing the code structure of the program. The dynamic approach uses a tool such as gprof to collect information, for example the time spent in each function call while the program runs, and identifies the hotspots from that information. Owing to the characteristics of the model, static analysis of a fluid machinery simulation program shows that the hotspot generally lies in the multigrid iterative solution part, and GPU acceleration is generally applied to that part.
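For reference, Amdahl's law can be written out concretely (a standard formulation, not text reproduced from this patent): if a fraction p of the running time is accelerated by a factor s, the overall speed-up is

```latex
S(p, s) = \frac{1}{(1 - p) + p/s},
\qquad
\lim_{s \to \infty} S(p, s) = \frac{1}{1 - p}
```

so even with unbounded s, the speed-up is capped at 1/(1-p); with p = 0.9, for instance, it can never exceed 10. This is why the multigrid solution part, which dominates the running time, is the natural acceleration target.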
Reducing data transfer between host and device. Compared with the strong computing power of the GPU, data transfer between the CPU and the GPU is extremely inefficient; it greatly reduces acceleration performance and should be avoided as much as possible. In the multigrid solution process, if the GPU-computed part cannot completely cover the iteration loop, the computation results must be transferred between the CPU and the GPU in every iteration step of the loop, and such transfers are extremely time-consuming compared with the GPU's computing performance. If data transfer is needed in every iteration step, GPU acceleration performance drops sharply; the entire multigrid computation is therefore placed on the GPU side for acceleration and all intermediate results are kept on the GPU side, so that the per-step transfer is converted into two transfers, one before the iteration starts and one after it ends, and the time consumed by data transfer becomes a constant instead of growing with the iteration count. Since the time occupied by data transfer before and after the iterative computation is negligible compared with the time of the iterative computation itself, its influence on acceleration performance can be ignored. A very few subprograms without parallelism, such as the interpolation process, cannot be computed on the GPU side and can only be computed on the CPU side; in that case, data transfer between the GPU and the CPU inside the iteration loop cannot be avoided. Although only the computation results of these serial subprograms need to be transferred, and the amount of data is not large, the number of transfers is proportional to the number of loop iterations, so program performance is still affected to some extent and optimization measures are required. Pinned (page-locked) memory can be employed to optimize this portion of the data transfer. By default, transferring data between a CPU process and the GPU first requires a copy into an operating-system kernel-space buffer, and then a copy from the kernel buffer to the destination. After the memory pages holding the array to be transferred are set as pinned memory, the copy through the operating-system kernel buffer is avoided and the transfer proceeds directly between the CPU process and the GPU. Over many iteration cycles, the reduced transfer cost noticeably improves program performance.
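As an illustration of the pinned-memory optimization, the following minimal CUDA sketch assumes a hypothetical result array of n floats exchanged with a CPU-side serial subprogram in each iteration; the names and sizes are illustrative, not code from the patent.

```cuda
// Pinned-memory transfer sketch: cudaMallocHost allocates page-locked
// host memory, so cudaMemcpy can DMA directly between host and device
// without the intermediate OS kernel-buffer copy.
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;          // illustrative array size
    float *h_result, *d_result;

    cudaMallocHost(&h_result, n * sizeof(float));  // page-locked host
    cudaMalloc(&d_result, n * sizeof(float));      // device buffer

    for (int iter = 0; iter < 100; ++iter) {
        // ... GPU kernels for this iteration step run here ...
        cudaMemcpy(h_result, d_result, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        // ... CPU-side serial subprogram (e.g. interpolation) ...
        cudaMemcpy(d_result, h_result, n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_result);
    cudaFree(d_result);
    return 0;
}
```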
Memory access coalescing and use of multiple memory types. Fluid machinery simulation programs involve multidimensional array operations; when each element of an array is assigned to a thread, the lowest dimension of the array should map one-to-one to the lowest dimension of the thread index, so that adjacent data are updated by consecutive threads. GPU threads execute in units of warps, each warp comprising 32 hardware threads. Memory is read in units of cache lines, a cache line generally being 32 or 128 contiguous, aligned bytes. If the memory accessed by the threads in one warp is contiguous, the GPU bandwidth can be fully utilized; if it is not, bandwidth is wasted and the effective bandwidth drops. For example, if only 4 bytes of a 32-byte cache-line transfer are used, effective throughput falls by a factor of eight, the powerful computing capability of the GPU is hard to exploit, and computation slows down significantly. Memory access coalescing is the basic means of increasing effective memory bandwidth and should be considered first when optimizing. When performing GPU acceleration on a three-dimensional grid, consecutive threads should access data laid out contiguously in memory as far as possible. With other mapping schemes, the reads of the threads in one warp change from one contiguous read to strided reads, lowering the effective memory bandwidth; this situation should be avoided. At the same time, memory types such as shared memory can be used where possible: shared memory has access latency and size comparable to a cache, but the stored data is managed by software, which can greatly improve acceleration performance when data accesses follow a specific pattern.
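The mapping described above can be made concrete with a minimal CUDA sketch; the flattened indexing, the kernel name and the trivial update are illustrative assumptions rather than the patent's code.

```cuda
// Coalesced mapping for a 3-D grid flattened as idx = (k*ny + j)*nx + i,
// with i (the x index) varying fastest in memory.
__global__ void update_grid(float* __restrict__ q, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z;

    if (i < nx && j < ny && k < nz) {
        // Consecutive lanes of a warp (threadIdx.x) touch consecutive
        // addresses, so each warp's 32 accesses coalesce into a few
        // cache-line transactions.
        size_t idx = ((size_t)k * ny + j) * nx + i;
        q[idx] = 0.99f * q[idx];                    // placeholder update
    }
}
```

If threadIdx.x were instead mapped to the outer k index, each warp would read with a stride of nx*ny elements, every lane would occupy its own cache line, and the effective bandwidth would collapse as described above.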
Exposing data parallelism through code restructuring. The multigrid iterative solver is structured as a loop with a large iteration count, inside which the flow of physical quantities between adjacent three-dimensional grid cells is simulated. When a physical quantity flows between two adjacent cells, the values of both cells change at the same time. If the same thread is responsible for the inflow and outflow of physical quantities between its own array element and its neighbors in the three dimensions, each thread must change both its own data element and the three dimensionally adjacent elements, and these four elements are also modified by other threads. This pattern is unproblematic in the serial case, but in parallel computing it produces read and write conflicts if left untreated. To guarantee program correctness, global synchronization would then have to be introduced by terminating the current kernel and launching a new one. If global synchronization is introduced once for the update of each dimension, the redundant global synchronization splits the original subroutine into six smaller GPU kernel functions. Introducing global synchronization significantly loses parallelism and brings additional kernel launch overhead, so the acceleration effect is greatly reduced. This defect can be remedied by rearranging the computation order to avoid data races: each thread is made responsible only for its own cell, handling the outflow of physical quantities toward its positive neighbor in each of the three dimensions and the inflow from its negative neighbor in each of the three dimensions. With this arrangement, each thread updates only the data of its own cell, data races are avoided, and the program achieves ideal acceleration performance.
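A minimal one-dimensional CUDA sketch of this owner-writes restructuring follows; the flux function, the double-buffered arrays q_old/q_new and all names are illustrative assumptions (the patent's solver works on a three-dimensional grid).

```cuda
// Owner-writes restructuring: thread i gathers the flux across both of
// its faces and writes only its own cell, so no two threads ever write
// the same element and no global synchronization is needed.
__device__ float face_flux(float left, float right)
{
    return 0.5f * (left - right);   // illustrative placeholder flux
}

__global__ void exchange(const float* __restrict__ q_old,
                         float* __restrict__ q_new, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        float in  = face_flux(q_old[i - 1], q_old[i]);   // from i-1
        float out = face_flux(q_old[i], q_old[i + 1]);   // toward i+1
        q_new[i] = q_old[i] + in - out;  // only q_new[i] is written
    }
}
```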
GPU adaptation of serial algorithms. For certain algorithms, such as the solution of the tridiagonal systems often used in equation solving, data parallelism is not as evident as in other algorithms. Because solving a system of equations by elimination is inherently serial, the traditional CPU solution process cannot be ported to the GPU directly, and a corresponding algorithm adaptation is needed before porting. Classic algorithms for solving tridiagonal systems on the GPU include PCR (Parallel Cyclic Reduction) and the like; NVIDIA's official cuSPARSE library currently provides a set of tridiagonal solvers based on such algorithms, and this library can be used to parallelize the tridiagonal solution on the GPU. At the same time, the algorithm can be optimized by exploiting the characteristics of the program. In an explicit solver, for example, the coefficient matrix of such a tridiagonal system does not change during the iteration, while the number of iterations is huge. Given these characteristics, the coefficient matrix can be inverted before the iteration starts, converting the solution of the tridiagonal system into the multiplication of the precomputed inverse matrix with a vector. Matrix multiplication has good data parallelism and can fully exploit GPU performance. The above illustrates the idea of parallel adaptation for serial algorithms: a serial algorithm running on the CPU side can be replaced with a functionally equivalent parallel algorithm with good data parallelism. For example, the elimination solution of a system of equations may be replaced by an iterative method. By calling algorithms with good data parallelism, the architectural characteristics of the GPU are better matched and acceleration performance improves.
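The precomputed-inverse idea can be sketched as a dense matrix-vector product in CUDA; Ainv, b, x and the row-major layout are illustrative assumptions.

```cuda
// Once A^{-1} of the fixed tridiagonal coefficient matrix has been
// computed before the iteration loop, each per-step solve x = A^{-1} b
// becomes one independent dot product per thread: fully data-parallel,
// unlike the sequential forward/backward sweeps of elimination.
__global__ void solve_by_inverse(const float* __restrict__ Ainv,
                                 const float* __restrict__ b,
                                 float* __restrict__ x, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float acc = 0.0f;
        for (int col = 0; col < n; ++col)
            acc += Ainv[(size_t)row * n + col] * b[col];
        x[row] = acc;
    }
}
```

In practice one would likely prefer a library routine (a cuBLAS matrix-vector product, or the cuSPARSE tridiagonal solvers mentioned above), but the sketch shows why the reformulated problem maps well onto the GPU.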
Tuning thread-allocation parameters. Determine the thread configuration that best hides memory-access latency and improves computational throughput by trying different thread-block and thread-grid sizes, and find through testing the kernel launch configuration that, for the specific program and problem scale, maximizes hardware acceleration performance.
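A minimal CUDA sketch of such an experiment, timing one candidate block size at a time with CUDA events, follows; the kernel, the candidate sizes and the problem size are illustrative assumptions.

```cuda
// Empirical block-size sweep: launch the same kernel with several
// launch configurations and report the elapsed time of each.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] *= 0.99f;               // placeholder workload
}

int main()
{
    const int n = 1 << 24;
    float* d_q;
    cudaMalloc(&d_q, n * sizeof(float));

    const int candidates[] = {64, 128, 256, 512, 1024};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        touch<<<grid, block>>>(d_q, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block=%4d grid=%7d time=%.3f ms\n", block, grid, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_q);
    return 0;
}
```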

Claims (6)

1. A fluid machinery simulation program heterogeneous acceleration method based on a GPU, characterized by comprising the following steps:
step 1, performing hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and producing a preliminary parallel implementation of the selected hotspots;
step 2, for the preliminary parallel program of step 1, migrating intermediate results to the GPU (graphics processing unit) side in the parts with intensive host-device data transfer, thereby reducing data transfer between host and device;
step 3, for the program of step 2, laying out the grid data in memory so that threads in the same warp read and update grid data that are adjacent in memory, and using shared memory, constant memory and texture memory according to the program's characteristics, so as to fully exploit parallelism;
step 4, exposing data parallelism through code restructuring;
step 5, GPU adaptation of serial algorithms: in step 4, if an individual serial algorithm takes a large share of the running time and cannot run on the GPU side, replacing the original elimination method with an iterative method when solving equations;
step 6, tuning thread-allocation parameters to determine the thread configuration that best hides memory-access latency and improves computational throughput;
step 7, testing the acceleration achieved by the programs of steps 1 to 6; if the speed-up ratio meets expectations, the acceleration is complete; otherwise, a new round of acceleration iteration starts from step 1, until a satisfactory acceleration effect is reached;
in step 4, for the parts that would produce data races after being ported to the parallel version, data races are avoided by adjusting the computation pattern and computation order; for the parts where this is not possible, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
2. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 1, by collecting and analyzing the share of running time taken by each hotspot part, the parts of the program with a large share of the running time are identified, and parallel optimization is performed on those parts to approach the theoretical upper limit of acceleration.
3. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 2, the entire multigrid computation is placed on the GPU side for acceleration, and all intermediate results are kept on the GPU side.
4. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 3, the effective bandwidth of memory access is increased by adjusting the layout of data in global memory so that adjacent hardware threads process data that are contiguous in memory; and shared memory, texture memory and constant memory are used to reduce the pressure the threads place on global memory bandwidth, improving the speed-up ratio.
5. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 5, for a serial algorithm that takes a large share of the running time and cannot run on the GPU side, the serial algorithm is replaced with a functionally equivalent algorithm that possesses data parallelism, and the parallel algorithm is implemented on the GPU.
6. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 6, after the above steps are completed, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
CN201811378843.XA 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU Active CN109522127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811378843.XA CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU


Publications (2)

Publication Number Publication Date
CN109522127A CN109522127A (en) 2019-03-26
CN109522127B true CN109522127B (en) 2021-01-19

Family

ID=65776286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811378843.XA Active CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN109522127B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148361B (en) * 2020-08-27 2022-03-04 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729180A (en) * 2013-12-25 2014-04-16 浪潮电子信息产业股份有限公司 Method for quickly developing CUDA (compute unified device architecture) parallel programs
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN106250349A (en) * 2016-08-08 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of high energy efficiency heterogeneous computing system
CN108388537A (en) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 A kind of convolutional neural networks accelerator and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GPU Acceleration Smoothed Particle Hydrodynamics for the Navier-Stokes Equations; Yingrui Wang; 2016 24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing; 2016-12-31; full text *
Research and application of GPU-based tensor decomposition and reconstruction methods; Li Ming; China Masters' Theses Full-text Database, Basic Sciences; 2018-09-15; vol. 2018, no. 9; full text *
An asymptotic fitting optimization method for CPU-GPU source-to-source compilation systems; Wei Hongchang; Computer Engineering and Applications; 2016-12-31; vol. 52, no. 21; full text *

Also Published As

Publication number Publication date
CN109522127A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
Gómez-Luna et al. Benchmarking a new paradigm: An experimental analysis of a real processing-in-memory architecture
Gómez-Luna et al. Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system
US20210201124A1 (en) Systems and methods for neural network convolutional layer matrix multiplication using cache memory
Yang et al. Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining
CN106383695B (en) The acceleration system and its design method of clustering algorithm based on FPGA
Nukada et al. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
US20120331278A1 (en) Branch removal by data shuffling
Economon et al. Towards high-performance optimizations of the unstructured open-source SU2 suite
CN109522127B (en) Fluid machinery simulation program heterogeneous acceleration method based on GPU
Li et al. 3-D parallel fault simulation with GPGPU
Liu Parallel and scalable sparse basic linear algebra subprograms
CN103996216A (en) Power efficient attribute handling for tessellation and geometry shaders
Kijsipongse et al. Dynamic load balancing on GPU clusters for large-scale K-Means clustering
Arkhipov et al. Sorting with gpus: A survey
Tolmachev VkFFT-a performant, cross-platform and open-source GPU FFT library
Hesam et al. Gpu acceleration of 3d agent-based biological simulations
CN114116208A (en) Short wave radiation transmission mode three-dimensional acceleration method based on GPU
CN105573834B (en) A kind of higher-dimension vocabulary tree constructing method based on heterogeneous platform
Segura et al. Energy-efficient stream compaction through filtering and coalescing accesses in gpgpu memory partitions
CN113010316A (en) Multi-target group intelligent algorithm parallel optimization method based on cloud computing
Dudnik et al. Cuda architecture analysis as the driving force Of parallel calculation organization
Tomiyama et al. Automatic parameter optimization for edit distance algorithm on GPU
CN105302577B (en) Drive the machine code generation method and device of execution unit
Xue et al. Multi-GPU performance optimization of a CFD code using OpenACC on different platforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant