CN109522127A - Fluid machinery simulation program heterogeneous acceleration method based on GPU - Google Patents

Fluid machinery simulation program heterogeneous acceleration method based on GPU

Info

Publication number
CN109522127A
CN109522127A (application CN201811378843.XA; granted as CN109522127B)
Authority
CN
China
Prior art keywords
gpu
program
memory
thread
data
Prior art date
Legal status
Granted
Application number
CN201811378843.XA
Other languages
Chinese (zh)
Other versions
CN109522127B (en)
Inventor
张兴军
赵文强
董小社
李靖波
雷雨
鲁晨欣
周剑锋
伍卫国
邹年俊
何峰
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811378843.XA
Publication of CN109522127A
Application granted
Publication of CN109522127B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a GPU-based heterogeneous acceleration method for fluid machinery simulation programs. The steps include: hotspot analysis, to find the subprograms with acceleration potential; reduction and elimination of data transfer between host and device; coalescing of memory accesses and use of multiple memory types, to raise effective memory bandwidth utilization and the compute-to-memory-access ratio; code refactoring to expose data parallelism, since explicit global synchronization of GPU kernels harms data parallelism and should be avoided wherever possible; GPU adaptation of serial algorithms, by replacing a serial algorithm with a parallel algorithm of equivalent function; and adjustment of thread-allocation parameters, so that thread computation latency is sufficiently hidden and computational throughput is improved. If the above steps achieve the desired effect, the acceleration is complete; otherwise a new round of iteration starts from hotspot analysis until a satisfactory result is reached. The invention provides a GPU acceleration method tailored to the characteristics of fluid machinery simulation programs; the modified program can achieve the desired acceleration effect.

Description

GPU-based heterogeneous acceleration method for fluid machinery simulation programs
Technical field
The invention belongs to the intersection of fluid mechanics and high-performance computing, and in particular relates to a GPU-based heterogeneous acceleration method for fluid machinery simulation programs.
Background art
Computational fluid dynamics is one of the important technologies of fluid mechanics: the governing equations of fluid mechanics are solved on a computer using numerical methods, so that the flow in a flow field can be predicted. As computer processing power grows, ever finer fluid mechanics models can be constructed. At the same time, to compute flow-field changes more accurately and make the simulated flow "truer" to reality, fluid mechanics models keep becoming finer, which in turn places higher demands on computing capability.
As the transistor density and frequency of single-core chips reach their bottleneck, multicore has become the main way of improving computing capability today. Companies such as NVIDIA, AMD and Intel have proposed a series of specialized hardware for compute-intensive workloads, among which NVIDIA GPUs show outstanding performance in both speed and energy consumption. Traditional fluid machinery programs are optimized for CPU architectures, but with the rise of heterogeneous computing the difference between coprocessor architectures and CPU architectures keeps growing, and neither the programming style nor the optimization schemes of traditional fluid machinery simulation programs can be adapted directly to the new hardware architectures. Whether writing fluid machinery simulation programs for a new hardware platform, or porting an existing CPU-oriented fluid machinery simulation program to the GPU, new guidance is required.
In the explicit solvers of fluid machinery simulation programs, the fluid is mostly discretized with the finite volume method; the governing equations are iterated continuously on the grid until the convergence condition is met, and in each iteration step the physical attributes on the grid are updated. Such applications are compute-intensive. When writing a new CPU/GPU heterogeneous program, or porting an existing serial program to the GPU, attention must be paid to how to match the characteristics of the new hardware architecture.
Summary of the invention
The purpose of the present invention is to provide a GPU-based heterogeneous acceleration method for fluid machinery simulation programs, so as to solve the above problems.
To achieve the above object, the invention adopts the following technical scheme:
A GPU-based heterogeneous acceleration method for fluid machinery simulation programs, comprising the following steps:
Step 1, perform hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and carry out a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminarily parallelized program of step 1, in the parts with intensive host-device data transfer, move the intermediate results to the GPU side so as to reduce data transfer between host and device;
Step 3, for the program of step 2, adjust the memory layout of the grid data so that the threads in a same warp read and update grid data that are adjacent in memory, and use shared memory, constant memory and texture memory according to the program's characteristics to give full play to parallelism;
Step 4, expose data parallelism through code refactoring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if a serial algorithm accounts for a large share of the running time and cannot run on the GPU, replace elimination with an iterative method when solving the equations;
Step 6, adjust the thread-allocation parameters to determine the thread allocation that maximally hides memory-access latency and improves computational throughput;
Step 7, test the acceleration effect of the program after steps 1 to 6; if the speedup meets expectations, the acceleration is complete; otherwise start a new round of acceleration iteration from step 1, until a satisfactory acceleration effect is reached.
Further, in step 1, by collecting and analyzing the running-time share of the hotspot parts, the parts of the program with a large share of the running time are found and parallel optimization is carried out on them; this choice determines the theoretical upper limit of acceleration.
Further, in step 2, the entire multigrid solve is placed on the GPU for acceleration, and all intermediate results are kept on the GPU.
Further, in step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, raising the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the threads' pressure on global-memory bandwidth and improve the speedup.
Further, in step 4, for the parts that produce data races after porting to the parallel version, the races are avoided by adjusting the computation scheme and the order of computation; for parts that cannot be avoided in this way, global synchronization should be introduced by terminating the current GPU kernel and launching a new GPU kernel.
Further, in step 5, serial algorithms in the program that cannot be parallelized are replaced by algorithms of identical function that possess data parallelism, and the parallel algorithms are implemented on the GPU.
Further, in step 6, after the above steps are finished, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Compared with the prior art, the present invention has the following technical effects:
Through a reasonable memory layout, the effective bandwidth of GPU global-memory access is raised, giving full play to the GPU's computational advantage; by replacing serial algorithms that cannot be parallelized, the acceleratable region of the program is enlarged and the speedup improved; by avoiding global synchronization as far as possible, the damage of global synchronization to the program's parallelism is minimized.
Compared with the original CPU program, the GPU version achieves a significant speedup.
Brief description of the drawings
Fig. 1 is the flow chart of the method provided by the invention.
Specific embodiment
The present invention is further described below in conjunction with the accompanying drawing:
Referring to Fig. 1, a GPU-based heterogeneous acceleration method for fluid machinery simulation programs comprises the following steps:
Step 1, perform hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and carry out a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminarily parallelized program of step 1, in the parts with intensive host-device data transfer, move the intermediate results to the GPU side so as to reduce data transfer between host and device;
Step 3, for the program of step 2, adjust the memory layout of the grid data so that the threads in a same warp read and update grid data that are adjacent in memory, and use shared memory, constant memory and texture memory according to the program's characteristics to give full play to parallelism;
Step 4, expose data parallelism through code refactoring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if a serial algorithm accounts for a large share of the running time and cannot run on the GPU, replace elimination with an iterative method when solving the equations;
Step 6, adjust the thread-allocation parameters to determine the thread allocation that maximally hides memory-access latency and improves computational throughput;
Step 7, test the acceleration effect of the program after steps 1 to 6; if the speedup meets expectations, the acceleration is complete; otherwise start a new round of acceleration iteration from step 1, until a satisfactory acceleration effect is reached.
In step 1, by collecting and analyzing the running-time share of the hotspot parts, the parts of the program with a large share of the running time are found and parallel optimization is carried out on them; this choice determines the theoretical upper limit of acceleration.
In step 2, the entire multigrid solve is placed on the GPU for acceleration, and all intermediate results are kept on the GPU.
In step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, raising the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the threads' pressure on global-memory bandwidth and improve the speedup.
In step 4, for the parts that produce data races after porting to the parallel version, the races are avoided by adjusting the computation scheme and the order of computation; for parts that cannot be avoided in this way, global synchronization should be introduced by terminating the current GPU kernel and launching a new GPU kernel.
In step 5, serial algorithms in the program that cannot be parallelized are replaced by algorithms of identical function that possess data parallelism, and the parallel algorithms are implemented on the GPU.
In step 6, after the above steps are finished, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Hotspot analysis. Amdahl's law shows that the upper limit of the speedup obtainable by accelerating a program in parallel computing is bounded by the fraction of the original serial program taken up by the accelerated part. According to this law, the first step of parallel optimization should be to find the hotspots: only by finding the parts of the program with a large share of the running time, and optimizing those parts in parallel, can an ideal acceleration effect be obtained. If a non-hotspot is chosen as the acceleration target, the parallelization effort is bound to yield little. Hotspot analysis is generally done in two ways, static and dynamic. The static way finds the program's hotspots by analyzing the program code. The dynamic way uses tools such as gprof to collect information such as the time spent in each function call while the program runs, and finds the hotspots from that information. Owing to the characteristics of the model, static analysis shows that the hotspot of a fluid machinery simulation program generally lies in the multigrid iterative solve, and GPU acceleration is generally applied to that part.
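As a minimal sketch of the dynamic approach when a profiler such as gprof is unavailable, the program's phases can also be timed directly; the phase names below (solve_multigrid, write_output) are illustrative placeholders, not functions from the patent:

    #include <chrono>
    #include <cstdio>

    // Illustrative stand-ins for the program's phases; only their
    // measured time shares matter for hotspot selection.
    static void solve_multigrid() { for (volatile long i = 0; i < 100000000L; ++i) {} }
    static void write_output()    { for (volatile long i = 0; i < 1000000L; ++i) {} }

    int main() {
        using clk = std::chrono::steady_clock;
        auto t0 = clk::now(); solve_multigrid();
        auto t1 = clk::now(); write_output();
        auto t2 = clk::now();
        double solve = std::chrono::duration<double>(t1 - t0).count();
        double io    = std::chrono::duration<double>(t2 - t1).count();
        // Amdahl's law: accelerating a fraction p of the runtime by a
        // factor s bounds the overall speedup by 1 / ((1 - p) + p / s).
        printf("solve: %.1f%%  output: %.1f%%\n",
               100.0 * solve / (solve + io), 100.0 * io / (solve + io));
        return 0;
    }

For instance, if the multigrid solve accounts for 95% of the runtime, then even an infinitely fast GPU kernel bounds the overall speedup at 1/(1-0.95) = 20x, which is why parallelizing non-hotspots yields little.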
Reducing data transfer between host and device. Compared with the GPU's powerful computing capability, the efficiency of data transfer between CPU and GPU is extremely low; it can drag acceleration performance down badly and should be avoided as much as possible. In the multigrid solve, if the GPU-computed part fails to cover the whole iteration loop, calculation results must be transferred between CPU and GPU in every iteration step, and such transfers are extremely time-consuming and inefficient compared with the GPU's computing performance, so GPU acceleration suffers greatly. Therefore the entire multigrid solve is placed on the GPU for acceleration and all intermediate results are kept on the GPU, converting the per-iteration data transfers into two transfers, one before and one after the iteration loop; the time consumed by data transfer thus becomes a constant instead of being proportional to the number of iterations. The time taken by these two transfers is negligible compared with the time of the iterative computation, so their influence on acceleration performance can also be ignored. Subprograms without parallelism, such as certain individual differencing processes, cannot be computed on the GPU and can only stay on the CPU; in that case, data transfer between GPU and CPU inside the iteration loop is unavoidable. Although only the calculation results of these serial parts need to be transferred and the data volume is not large, the number of transfers is proportional to the number of loop iterations, so this still influences the program's performance and also deserves optimization. Pinned (page-locked) memory can be used to optimize this part of the transfer. By default, a transfer between a CPU process and the GPU must first be copied into a staging buffer in operating-system kernel space and then copied from that buffer to the destination; once the pages holding the arrays to be transferred are pinned, the copy through the kernel-space buffer is avoided and the data move directly between the CPU process and the GPU. After many loop iterations, the reduction in data transfer improves the program's performance quite noticeably.
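A minimal CUDA sketch of both measures, assuming a grid of N floats that stays resident on the device for the whole loop; the kernel body and sizes are illustrative, not taken from the patent:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define N (1 << 20)

    __global__ void iterate(float* grid) {          // stand-in for one smoothing step
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) grid[i] = 0.5f * (grid[i] + 1.0f);
    }

    int main() {
        float *d_grid, *h_result;
        cudaMalloc(&d_grid, N * sizeof(float));
        cudaMemset(d_grid, 0, N * sizeof(float));
        // Pinned (page-locked) host memory: the transfer bypasses the
        // staging copy through the operating-system kernel-space buffer.
        cudaMallocHost(&h_result, N * sizeof(float));

        // The grid stays resident on the GPU across the whole loop, so
        // transfer cost is a constant (one upload, one download) rather
        // than proportional to the iteration count.
        for (int step = 0; step < 10000; ++step)
            iterate<<<(N + 255) / 256, 256>>>(d_grid);

        cudaMemcpy(h_result, d_grid, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("grid[0] = %f\n", h_result[0]);
        cudaFreeHost(h_result);
        cudaFree(d_grid);
        return 0;
    }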
Memory-access coalescing and use of multiple memory types. Fluid machinery simulation programs operate on multidimensional arrays. When assigning array elements to threads, the lowest dimension of the array should be mapped one-to-one to the lowest-varying thread index, so that consecutive threads update adjacent data. GPU threads execute in units of warps, a warp comprising 32 hardware threads; memory is read in units of cache lines, a cache line generally being 32 or 128 contiguous, aligned bytes. If the memory accessed by the threads of a warp is contiguous, GPU bandwidth can be fully utilized; if it is not, bandwidth is wasted and the effective bandwidth falls. For example, if only 4 bytes of a 32-byte cache-line transfer are actually used, throughput is reduced by a factor of 8, the GPU's powerful computing capability cannot be brought to bear, and computation slows markedly. Memory-access coalescing is the basic means of raising effective memory bandwidth and should be attended to first during optimization. When accelerating a three-dimensional grid on the GPU, consecutive threads should as far as possible access data laid out contiguously in memory; any other mapping turns the contiguous reads of a warp into strided reads, lowering the effective memory bandwidth, and should be avoided. Where possible, memory types such as shared memory should also be used: the access latency and size of shared memory are comparable to cache, but the data it holds are managed by software, which greatly improves acceleration performance when the data access has a specific pattern.
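A sketch of the mapping rule for a three-dimensional grid stored with x as the fastest-varying index (the dimensions are illustrative): in the first kernel the 32 threads of a warp touch 32 consecutive floats, so the accesses coalesce into full cache lines; the second kernel shows the strided pattern to be avoided:

    #include <cuda_runtime.h>

    #define NX 128
    #define NY 128
    #define NZ 128
    #define IDX(x, y, z) (((z) * NY + (y)) * NX + (x))

    // Coalesced: threadIdx.x maps to the lowest (fastest-varying) array
    // dimension, so consecutive threads read consecutive addresses.
    __global__ void update_coalesced(float* u) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x < NX) u[IDX(x, blockIdx.y, blockIdx.z)] += 1.0f;
    }

    // Strided: threadIdx.x maps to y, so consecutive threads are NX
    // floats apart and most of each fetched cache line is wasted.
    __global__ void update_strided(float* u) {
        int y = blockIdx.x * blockDim.x + threadIdx.x;
        if (y < NY) u[IDX(blockIdx.y, y, blockIdx.z)] += 1.0f;
    }

    int main() {
        float* d_u;
        cudaMalloc(&d_u, NX * NY * NZ * sizeof(float));
        cudaMemset(d_u, 0, NX * NY * NZ * sizeof(float));
        update_coalesced<<<dim3(NX / 128, NY, NZ), 128>>>(d_u);
        update_strided <<<dim3(NY / 128, NX, NZ), 128>>>(d_u);
        cudaDeviceSynchronize();
        cudaFree(d_u);
        return 0;
    }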
Exposing data parallelism through code refactoring. The multigrid iterative solver is structured as a loop with a huge iteration count, inside which simulated physical quantities flow between adjacent grid cells along the three dimensions. When a physical quantity flows between adjacent cells, the values of both cells change at the same time. If the same thread were responsible for the inflow and outflow of physical quantities between one array element and its adjacent elements in the three dimensions, every thread would have to modify the data element it owns as well as the three elements adjacent to it in the three dimensions, while other threads could simultaneously modify these same four elements. This poses no problem in the serial case, but left untreated in parallel computation it causes read/write conflicts. To guarantee program correctness, global synchronization would have to be introduced by terminating the current kernel and launching a new one; if a global synchronization were introduced for the update in each dimension, what was originally one subprogram would be split into six smaller GPU kernel functions because of the extra synchronizations. Introducing global synchronization loses a great deal of parallelism and adds extra GPU kernel launch overhead, so the acceleration effect drops sharply. This defect can be avoided by rearranging the order of computation to avoid data races: each thread is made responsible for the outflow of physical quantities from the cell it owns to its positive-direction neighbors in the three dimensions, and for the inflow from its negative-direction neighbors into its own cell. Arranged this way, each thread updates only the data of the cell it owns, data races are avoided, and the program obtains ideal acceleration performance.
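A one-dimensional sketch of the race-free arrangement (the flux function and sizes are illustrative): each thread gathers the fluxes across the two faces of the cell it owns and writes only that cell, so no two threads ever write the same element and no extra kernel-split global synchronization is needed:

    #include <cuda_runtime.h>

    #define N 1024

    __device__ float flux(float a, float b) { return 0.5f * (a - b); }

    // Gather form: thread i reads u_old[i-1], u_old[i] and u_old[i+1],
    // but writes only u_new[i], the element it owns -- no data race.
    __global__ void step_gather(const float* u_old, float* u_new) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i == 0 || i >= N - 1) return;              // boundary cells fixed
        float inflow  = flux(u_old[i - 1], u_old[i]);  // across the left face
        float outflow = flux(u_old[i], u_old[i + 1]);  // across the right face
        u_new[i] = u_old[i] + inflow - outflow;
    }

    int main() {
        float *u_old, *u_new;
        cudaMalloc(&u_old, N * sizeof(float));
        cudaMalloc(&u_new, N * sizeof(float));
        cudaMemset(u_old, 0, N * sizeof(float));
        cudaMemset(u_new, 0, N * sizeof(float));
        step_gather<<<N / 256, 256>>>(u_old, u_new);
        cudaDeviceSynchronize();
        cudaFree(u_old);
        cudaFree(u_new);
        return 0;
    }

The scatter form, in which thread i would also add its outflow into u_new[i+1], requires atomics or a kernel split per dimension; the gather form trades one duplicate flux evaluation for the removal of all write conflicts.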
GPU adaptation of serial algorithms. For certain algorithms, such as the solution of the tridiagonal systems commonly encountered in equation solving, the data parallelism is not as obvious as in other algorithms. Because elimination on a system of equations is inherently serial, the traditional CPU solve cannot be ported directly to the GPU; a corresponding algorithmic adaptation is needed before porting. Classical GPU algorithms for solving tridiagonal systems include PCR (Parallel Cyclic Reduction) and others; NVIDIA's official cuSPARSE library provides a set of tridiagonal solvers based on such algorithms, which can be used to parallelize the tridiagonal solves on the GPU. Such algorithms can also be optimized by exploiting the program's own characteristics. For example, in an explicit solver the parameter matrix of these tridiagonal systems does not change across iterations while the iteration count is huge; given this characteristic, the parameter matrix can be inverted before the iterations begin, converting the tridiagonal solve into a multiplication of the inverse parameter matrix with a vector. Matrix multiplication has good data parallelism and can give full play to the GPU's performance. The above illustrates one line of thought for adapting a serial algorithm to parallel execution. A serial algorithm running on the CPU can be replaced by an algorithm of equivalent function that has good data parallelism; for example, elimination in equation solving can be replaced by an iterative method. By calling algorithms with good data parallelism, the architectural characteristics of the GPU are matched better and acceleration performance improves.
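A sketch of a GPU tridiagonal solve using the gtsv2 routines of NVIDIA's cuSPARSE library (the system, 512 unknowns with a constant [-1, 2, -1] stencil, is illustrative; compile with -lcusparse):

    #include <cuda_runtime.h>
    #include <cusparse.h>

    #define M 512

    __global__ void fill(float* dl, float* d, float* du, float* b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= M) return;
        dl[i] = (i == 0)     ? 0.0f : -1.0f;  // first lower-diagonal entry unused
        du[i] = (i == M - 1) ? 0.0f : -1.0f;  // last upper-diagonal entry unused
        d[i]  = 2.0f;                         // diagonally dominant system
        b[i]  = 1.0f;                         // right-hand side
    }

    int main() {
        float *dl, *d, *du, *b;
        cudaMalloc(&dl, M * sizeof(float));
        cudaMalloc(&d,  M * sizeof(float));
        cudaMalloc(&du, M * sizeof(float));
        cudaMalloc(&b,  M * sizeof(float));
        fill<<<(M + 255) / 256, 256>>>(dl, d, du, b);

        cusparseHandle_t handle;
        cusparseCreate(&handle);
        size_t bytes = 0;
        void* buffer = nullptr;
        cusparseSgtsv2_bufferSizeExt(handle, M, 1, dl, d, du, b, M, &bytes);
        cudaMalloc(&buffer, bytes);
        cusparseSgtsv2(handle, M, 1, dl, d, du, b, M, buffer);  // b now holds the solution
        cudaDeviceSynchronize();

        cusparseDestroy(handle);
        cudaFree(buffer); cudaFree(dl); cudaFree(d); cudaFree(du); cudaFree(b);
        return 0;
    }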
Thread-allocation parameter adjustment. By trying different thread-block and thread-grid sizes, the thread configuration that maximally hides memory-access latency and improves computational throughput is determined: for the specific program and problem scale, the kernel launch parameters that maximize the hardware's acceleration performance are found experimentally.
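As one way to seed that search, the CUDA runtime's occupancy API can suggest a starting block size, which is then refined by timing real runs; a sketch, where iterate stands for any kernel of the program:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void iterate(float* u, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) u[i] *= 0.999f;
    }

    int main() {
        int min_grid = 0, block = 0;
        // Suggests the block size maximizing theoretical occupancy for
        // this kernel on the present device; neighboring configurations
        // are then timed on the real workload to pick the best one.
        cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, iterate, 0, 0);
        printf("suggested block size: %d (minimum grid size: %d)\n", block, min_grid);
        return 0;
    }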

Claims (7)

1. A GPU-based heterogeneous acceleration method for fluid machinery simulation programs, characterized by comprising the following steps:
Step 1, performing hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and carrying out a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminarily parallelized program of step 1, in the parts with intensive host-device data transfer, moving the intermediate results to the GPU side so as to reduce data transfer between host and device;
Step 3, for the program of step 2, adjusting the memory layout of the grid data so that the threads in a same warp read and update grid data that are adjacent in memory, and using shared memory, constant memory and texture memory according to the program's characteristics to give full play to parallelism;
Step 4, exposing data parallelism through code refactoring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if a serial algorithm accounts for a large share of the running time and cannot run on the GPU, replacing elimination with an iterative method when solving the equations;
Step 6, adjusting the thread-allocation parameters to determine the thread allocation that maximally hides memory-access latency and improves computational throughput;
Step 7, testing the acceleration effect of the program after steps 1 to 6; if the speedup meets expectations, the acceleration is complete; otherwise starting a new round of acceleration iteration from step 1, until a satisfactory acceleration effect is reached.
2. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 1, by collecting and analyzing the running-time share of the hotspot parts, the parts of the program with a large share of the running time are found and parallel optimization is carried out on them, which determines the theoretical upper limit of acceleration.
3. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 2, the entire multigrid solve is placed on the GPU for acceleration, and all intermediate results are kept on the GPU.
4. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, raising the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the threads' pressure on global-memory bandwidth and improve the speedup.
5. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 4, for the parts that produce data races after porting to the parallel version, the races are avoided by adjusting the computation scheme and the order of computation; for parts that cannot be avoided in this way, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
6. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 5, serial algorithms in the program that cannot be parallelized are replaced by algorithms of identical function that possess data parallelism, and the parallel algorithms are implemented on the GPU.
7. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 6, after the above steps are finished, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
CN201811378843.XA 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU Active CN109522127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811378843.XA CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811378843.XA CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Publications (2)

Publication Number Publication Date
CN109522127A (en) 2019-03-26
CN109522127B (en) 2021-01-19

Family

ID=65776286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811378843.XA Active CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN109522127B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148361A (en) * 2020-08-27 2020-12-29 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729180A (en) * 2013-12-25 2014-04-16 浪潮电子信息产业股份有限公司 Method for quickly developing CUDA (compute unified device architecture) parallel programs
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN106250349A (en) * 2016-08-08 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of high energy efficiency heterogeneous computing system
CN108388537A (en) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 A kind of convolutional neural networks accelerator and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729180A (en) * 2013-12-25 2014-04-16 浪潮电子信息产业股份有限公司 Method for quickly developing CUDA (compute unified device architecture) parallel programs
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN106250349A (en) * 2016-08-08 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of high energy efficiency heterogeneous computing system
CN108388537A (en) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 A kind of convolutional neural networks accelerator and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YINGRUI WANG: "GPU Acceleration Smoothed Particle Hydrodynamics for the Navier-Stokes Equations", 2016 24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing
LI MING: "Research and application of GPU-based tensor decomposition and reconstruction methods", China Masters' Theses Full-text Database, Basic Sciences
WEI HONGCHANG: "An asymptotic fitting optimization method for CPU-GPU source-to-source compilation ***", Computer Engineering and Applications

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148361A (en) * 2020-08-27 2020-12-29 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112148361B (en) * 2020-08-27 2022-03-04 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU

Also Published As

Publication number Publication date
CN109522127B (en) 2021-01-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant