CN109522127B - Fluid machinery simulation program heterogeneous acceleration method based on GPU


Info

Publication number
CN109522127B
Authority
CN
China
Prior art keywords
gpu
memory
acceleration
data
program
Prior art date
Legal status
Active
Application number
CN201811378843.XA
Other languages
Chinese (zh)
Other versions
CN109522127A (en)
Inventor
张兴军
赵文强
董小社
李靖波
雷雨
鲁晨欣
周剑锋
伍卫国
邹年俊
何峰
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811378843.XA
Publication of CN109522127A
Application granted
Publication of CN109522127B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a GPU-based heterogeneous acceleration method for fluid machinery simulation programs, which comprises: performing hotspot analysis to find the subprograms with acceleration potential; reducing data transfer between host and device; combining memory-access coalescing with the use of several memory types to raise the effective memory bandwidth and the compute-to-memory-access ratio; restructuring code to expose data parallelism, since explicit global synchronization of GPU kernels loses data parallelism and should be avoided as much as possible; adapting serial algorithms to the GPU, i.e. replacing a serial algorithm with a functionally equivalent parallel algorithm; and tuning thread-allocation parameters so that memory-access latency is fully hidden and computational throughput improves. If these steps achieve the desired effect, the acceleration is complete; otherwise a new iteration starts from the hotspot analysis until a satisfactory effect is reached. The invention provides a GPU acceleration method tailored to the characteristics of fluid machinery simulation programs, and the modified program can achieve an ideal acceleration effect.

Description

Fluid machinery simulation program heterogeneous acceleration method based on GPU
Technical Field
The invention belongs to the cross field of fluid mechanics and high-performance computing, and particularly relates to a GPU-based heterogeneous acceleration method for fluid machinery simulation programs.
Background
Computational fluid dynamics is one of the important techniques in the field of fluid mechanics: the behavior of a flow field can be predicted by solving the governing equations of fluid mechanics numerically on a computer. As computing power increases, increasingly sophisticated fluid mechanics models can be constructed. At the same time, in order to compute flow field changes more accurately and make the simulated flow more realistic, fluid mechanics models are becoming ever more refined, which places higher demands on computing capacity.
With single-core chip transistor density and frequency reaching a bottleneck, multi-core designs have become the dominant way to increase computing power. NVIDIA, AMD, Intel and others have also introduced a series of specialized hardware for processing intensive computations, with NVIDIA GPUs showing excellent performance and power efficiency. Traditional fluid machinery programs are optimized for the CPU (Central Processing Unit) architecture; with the rise of heterogeneous computing, the gap between coprocessor architectures and the CPU architecture keeps widening, and the programming model and optimization schemes of traditional fluid machinery simulation programs cannot be directly adapted to the new hardware architectures. Whether writing a fluid machinery simulation program for a new hardware platform or porting an existing CPU-oriented fluid machinery simulation program to the GPU, a new guidance scheme is required.
Explicit solvers in fluid machinery simulation programs mostly use the finite volume method as the basis for discretizing the fluid: the governing equations are iterated on a grid until a convergence condition is met, and the physical attributes in the grid cells are updated in each iteration step, so the application is compute-intensive. When writing a new CPU/GPU heterogeneous program or porting an existing serial program to the GPU, attention must be paid to matching the characteristics of the new hardware architecture.
Disclosure of Invention
The invention aims to provide a GPU-based heterogeneous acceleration method for fluid machinery simulation programs, so as to solve the above problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
a fluid machinery simulation program heterogeneous acceleration method based on a GPU comprises the following steps:
step 1, performing hotspot analysis on a fluid mechanical simulation program in a mode of combining static analysis and dynamic analysis; and carrying out preliminary parallelization realization on the selected hot spots;
step 2, aiming at the primary parallelization program in the step 1, for the part with intensive data transmission among the host devices, transplanting an intermediate result to a GPU (graphics processing unit) end, and reducing the data transmission among the host devices;
step 3, aiming at the program in the step 2, through the memory layout of the distributed grid data, the threads in the same thread bundle read and update the adjacent grid data in the memory, and utilize the shared memory, the constant memory and the texture memory according to the program characteristics, so as to fully play the parallelism;
step 4, reconstructing the exposed data parallelism through codes;
step 5, GPU adaptation of a serial algorithm; for the program in the step 4, if the running time of an individual serial algorithm is large and the individual serial algorithm cannot run at the GPU end, an iterative method is used for replacing a primitive elimination method when an equation is solved;
step 6, adjusting thread distribution parameters, determining thread distribution capable of maximizing hidden memory access delay and improving calculation throughput;
step 7, testing the acceleration effect of the programs in the steps 1 to 6, and if the acceleration ratio is in accordance with the expectation, finishing the acceleration; otherwise, starting a new round of accelerated iteration from the step 1; until a satisfactory acceleration effect is achieved.
Further, in step 1, by collecting and analyzing the share of running time taken by each hotspot part, the parts of the program with a large share of the running time are identified, and parallel optimization is performed on those parts to approach the theoretical upper limit of acceleration.
Further, in step 2, the entire multigrid computation is placed on the GPU side for acceleration, and all intermediate results are kept on the GPU side.
Further, in step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, improving the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the pressure the threads place on global memory bandwidth, improving the speed-up ratio.
Further, in step 4, for the parts that would produce data races after being ported to the parallel version, data races are avoided by adjusting the computation pattern and computation order; for the parts where this is not possible, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
Further, in step 5, serial algorithms in the program that cannot be parallelized are replaced with functionally equivalent algorithms that possess data parallelism, and the parallel algorithms are implemented on the GPU.
Further, in step 6, after the above steps are completed, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Compared with the prior art, the invention has the following technical effects:
Through a reasonable memory layout, the invention increases the effective bandwidth of GPU global memory access and fully exploits the computational advantages of the GPU; by replacing non-parallelizable serial algorithms, it enlarges the accelerated region of the program and raises the speed-up ratio; and by avoiding global synchronization wherever possible, it minimizes the damage global synchronization does to program parallelism.
Compared with the original CPU program, the GPU version achieves a remarkable speed-up ratio.
Drawings
FIG. 1 is a flow chart of a method provided by the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a fluid machinery simulation program heterogeneous acceleration method based on a GPU includes the following steps:
Step 1, performing hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and producing a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminary parallel program of step 1, migrating intermediate results to the GPU (graphics processing unit) side in the parts with intensive host-device data transfer, thereby reducing data transfer between host and device;
Step 3, for the program of step 2, laying out the grid data in memory so that threads in the same warp read and update grid data that are adjacent in memory, and using shared memory, constant memory and texture memory according to the program's characteristics, so as to fully exploit parallelism;
Step 4, exposing data parallelism through code restructuring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if an individual serial algorithm takes a large share of the running time and cannot run on the GPU side, replacing the original elimination method with an iterative method when solving equations;
Step 6, tuning thread-allocation parameters to determine the thread configuration that best hides memory-access latency and improves computational throughput;
Step 7, testing the acceleration achieved by the programs of steps 1 to 6; if the speed-up ratio meets expectations, the acceleration is complete; otherwise, a new round of acceleration iteration starts from step 1, until a satisfactory acceleration effect is reached.
In step 1, by collecting and analyzing the share of running time taken by each hotspot part, the parts of the program with a large share of the running time are identified, and parallel optimization is performed on those parts to approach the theoretical upper limit of acceleration.
In step 2, the entire multigrid computation is placed on the GPU side for acceleration, and all intermediate results are kept on the GPU side.
In step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, improving the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the pressure the threads place on global memory bandwidth, improving the speed-up ratio.
In step 4, for the parts that would produce data races after being ported to the parallel version, data races are avoided by adjusting the computation pattern and computation order; for the parts where this is not possible, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
In step 5, serial algorithms in the program that cannot be parallelized are replaced with functionally equivalent algorithms that possess data parallelism, and the parallel algorithms are implemented on the GPU.
In step 6, after the above steps are completed, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Hotspot analysis. Amdahl's law states that the upper limit of the speed-up obtainable by accelerating a program with parallel computing is bounded by the fraction of the original serial running time occupied by the accelerated portion. According to this law, the first step of parallel optimization is to find the hotspots: only by identifying the parts of the program with a large share of the running time and optimizing them in parallel can an ideal acceleration effect be obtained. If non-hotspots are chosen as acceleration targets, the parallelization effort necessarily yields little. Program hotspots are generally analyzed in two ways, static and dynamic. The static approach finds hotspots by analyzing the code structure of the program. The dynamic approach uses a tool such as gprof to collect information, for example the time spent in each function call while the program runs, and identifies the hotspots from that information. Owing to the characteristics of the model, static analysis of a fluid machinery simulation program shows that the hotspot generally lies in the multigrid iterative solution part, and GPU acceleration is generally applied to that part.
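For reference, Amdahl's law can be written out concretely (a standard formulation, not text reproduced from this patent): if a fraction p of the running time is accelerated by a factor s, the overall speed-up is

```latex
S(p, s) = \frac{1}{(1 - p) + p/s},
\qquad
\lim_{s \to \infty} S(p, s) = \frac{1}{1 - p}
```

so even with unbounded s, the speed-up is capped at 1/(1-p); with p = 0.9, for instance, it can never exceed 10. This is why the multigrid solution part, which dominates the running time, is the natural acceleration target.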
Reducing data transfer between host and device. Compared with the strong computing power of the GPU, data transfer between the CPU and the GPU is extremely inefficient; it greatly reduces acceleration performance and should be avoided as much as possible. In the multigrid solution process, if the GPU-computed part cannot completely cover the iteration loop, the computation results must be transferred between the CPU and the GPU in every iteration step of the loop, and such transfers are extremely time-consuming compared with the GPU's computing performance. If data transfer is needed in every iteration step, GPU acceleration performance drops sharply; the entire multigrid computation is therefore placed on the GPU side for acceleration and all intermediate results are kept on the GPU side, so that the per-step transfer is converted into two transfers, one before the iteration starts and one after it ends, and the time consumed by data transfer becomes a constant instead of growing with the iteration count. Since the time occupied by data transfer before and after the iterative computation is negligible compared with the time of the iterative computation itself, its influence on acceleration performance can be ignored. A very few subprograms without parallelism, such as the interpolation process, cannot be computed on the GPU side and can only be computed on the CPU side; in that case, data transfer between the GPU and the CPU inside the iteration loop cannot be avoided. Although only the computation results of these serial subprograms need to be transferred, and the amount of data is not large, the number of transfers is proportional to the number of loop iterations, so program performance is still affected to some extent and optimization measures are required. Pinned (page-locked) memory can be employed to optimize this portion of the data transfer. By default, transferring data between a CPU process and the GPU first requires a copy into an operating-system kernel-space buffer, and then a copy from the kernel buffer to the destination. After the memory pages holding the array to be transferred are set as pinned memory, the copy through the operating-system kernel buffer is avoided and the transfer proceeds directly between the CPU process and the GPU. Over many iteration cycles, the reduced transfer cost noticeably improves program performance.
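As an illustration of the pinned-memory optimization, the following minimal CUDA sketch assumes a hypothetical result array of n floats exchanged with a CPU-side serial subprogram in each iteration; the names and sizes are illustrative, not code from the patent.

```cuda
// Pinned-memory transfer sketch: cudaMallocHost allocates page-locked
// host memory, so cudaMemcpy can DMA directly between host and device
// without the intermediate OS kernel-buffer copy.
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;          // illustrative array size
    float *h_result, *d_result;

    cudaMallocHost(&h_result, n * sizeof(float));  // page-locked host
    cudaMalloc(&d_result, n * sizeof(float));      // device buffer

    for (int iter = 0; iter < 100; ++iter) {
        // ... GPU kernels for this iteration step run here ...
        cudaMemcpy(h_result, d_result, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        // ... CPU-side serial subprogram (e.g. interpolation) ...
        cudaMemcpy(d_result, h_result, n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_result);
    cudaFree(d_result);
    return 0;
}
```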
Memory access coalescing and use of multiple memory types. Fluid machinery simulation programs involve multidimensional array operations; when each element of an array is assigned to a thread, the lowest dimension of the array should map one-to-one to the lowest dimension of the thread index, so that adjacent data are updated by consecutive threads. GPU threads execute in units of warps, each warp comprising 32 hardware threads. Memory is read in units of cache lines, a cache line generally being 32 or 128 contiguous, aligned bytes. If the memory accessed by the threads in one warp is contiguous, the GPU bandwidth can be fully utilized; if it is not, bandwidth is wasted and the effective bandwidth drops. For example, if only 4 bytes of a 32-byte cache-line transfer are used, effective throughput falls by a factor of eight, the powerful computing capability of the GPU is hard to exploit, and computation slows down significantly. Memory access coalescing is the basic means of increasing effective memory bandwidth and should be considered first when optimizing. When performing GPU acceleration on a three-dimensional grid, consecutive threads should access data laid out contiguously in memory as far as possible. With other mapping schemes, the reads of the threads in one warp change from one contiguous read to strided reads, lowering the effective memory bandwidth; this situation should be avoided. At the same time, memory types such as shared memory can be used where possible: shared memory has access latency and size comparable to a cache, but the stored data is managed by software, which can greatly improve acceleration performance when data accesses follow a specific pattern.
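The mapping described above can be made concrete with a minimal CUDA sketch; the flattened indexing, the kernel name and the trivial update are illustrative assumptions rather than the patent's code.

```cuda
// Coalesced mapping for a 3-D grid flattened as idx = (k*ny + j)*nx + i,
// with i (the x index) varying fastest in memory.
__global__ void update_grid(float* __restrict__ q, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z;

    if (i < nx && j < ny && k < nz) {
        // Consecutive lanes of a warp (threadIdx.x) touch consecutive
        // addresses, so each warp's 32 accesses coalesce into a few
        // cache-line transactions.
        size_t idx = ((size_t)k * ny + j) * nx + i;
        q[idx] = 0.99f * q[idx];                    // placeholder update
    }
}
```

If threadIdx.x were instead mapped to the outer k index, each warp would read with a stride of nx*ny elements, every lane would occupy its own cache line, and the effective bandwidth would collapse as described above.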
Exposing data parallelism through code restructuring. The multigrid iterative solver is structured as a loop with a large iteration count, inside which the flow of physical quantities between adjacent three-dimensional grid cells is simulated. When a physical quantity flows between two adjacent cells, the values of both cells change at the same time. If the same thread is responsible for the inflow and outflow of physical quantities between its own array element and its neighbors in the three dimensions, each thread must change both its own data element and the three dimensionally adjacent elements, and these four elements are also modified by other threads. This pattern is unproblematic in the serial case, but in parallel computing it produces read and write conflicts if left untreated. To guarantee program correctness, global synchronization would then have to be introduced by terminating the current kernel and launching a new one. If global synchronization is introduced once for the update of each dimension, the redundant global synchronization splits the original subroutine into six smaller GPU kernel functions. Introducing global synchronization significantly loses parallelism and brings additional kernel launch overhead, so the acceleration effect is greatly reduced. This defect can be remedied by rearranging the computation order to avoid data races: each thread is made responsible only for its own cell, handling the outflow of physical quantities toward its positive neighbor in each of the three dimensions and the inflow from its negative neighbor in each of the three dimensions. With this arrangement, each thread updates only the data of its own cell, data races are avoided, and the program achieves ideal acceleration performance.
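A minimal one-dimensional CUDA sketch of this owner-writes restructuring follows; the flux function, the double-buffered arrays q_old/q_new and all names are illustrative assumptions (the patent's solver works on a three-dimensional grid).

```cuda
// Owner-writes restructuring: thread i gathers the flux across both of
// its faces and writes only its own cell, so no two threads ever write
// the same element and no global synchronization is needed.
__device__ float face_flux(float left, float right)
{
    return 0.5f * (left - right);   // illustrative placeholder flux
}

__global__ void exchange(const float* __restrict__ q_old,
                         float* __restrict__ q_new, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        float in  = face_flux(q_old[i - 1], q_old[i]);   // from i-1
        float out = face_flux(q_old[i], q_old[i + 1]);   // toward i+1
        q_new[i] = q_old[i] + in - out;  // only q_new[i] is written
    }
}
```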
GPU adaptation of serial algorithms. For certain algorithms, such as the solution of the tridiagonal systems often used in equation solving, data parallelism is not as evident as in other algorithms. Because solving a system of equations by elimination is inherently serial, the traditional CPU solution process cannot be ported to the GPU directly, and a corresponding algorithm adaptation is needed before porting. Classic algorithms for solving tridiagonal systems on the GPU include PCR (Parallel Cyclic Reduction) and the like; NVIDIA's official cuSPARSE library currently provides a set of tridiagonal solvers based on such algorithms, and this library can be used to parallelize the tridiagonal solution on the GPU. At the same time, the algorithm can be optimized by exploiting the characteristics of the program. In an explicit solver, for example, the coefficient matrix of such a tridiagonal system does not change during the iteration, while the number of iterations is huge. Given these characteristics, the coefficient matrix can be inverted before the iteration starts, converting the solution of the tridiagonal system into the multiplication of the precomputed inverse matrix with a vector. Matrix multiplication has good data parallelism and can fully exploit GPU performance. The above illustrates the idea of parallel adaptation for serial algorithms: a serial algorithm running on the CPU side can be replaced with a functionally equivalent parallel algorithm with good data parallelism. For example, the elimination solution of a system of equations may be replaced by an iterative method. By calling algorithms with good data parallelism, the architectural characteristics of the GPU are better matched and acceleration performance improves.
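The precomputed-inverse idea can be sketched as a dense matrix-vector product in CUDA; Ainv, b, x and the row-major layout are illustrative assumptions.

```cuda
// Once A^{-1} of the fixed tridiagonal coefficient matrix has been
// computed before the iteration loop, each per-step solve x = A^{-1} b
// becomes one independent dot product per thread: fully data-parallel,
// unlike the sequential forward/backward sweeps of elimination.
__global__ void solve_by_inverse(const float* __restrict__ Ainv,
                                 const float* __restrict__ b,
                                 float* __restrict__ x, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float acc = 0.0f;
        for (int col = 0; col < n; ++col)
            acc += Ainv[(size_t)row * n + col] * b[col];
        x[row] = acc;
    }
}
```

In practice one would likely prefer a library routine (a cuBLAS matrix-vector product, or the cuSPARSE tridiagonal solvers mentioned above), but the sketch shows why the reformulated problem maps well onto the GPU.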
Tuning thread-allocation parameters. Determine the thread configuration that best hides memory-access latency and improves computational throughput by trying different thread-block and thread-grid sizes, and find through testing the kernel launch configuration that, for the specific program and problem scale, maximizes hardware acceleration performance.
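A minimal CUDA sketch of such an experiment, timing one candidate block size at a time with CUDA events, follows; the kernel, the candidate sizes and the problem size are illustrative assumptions.

```cuda
// Empirical block-size sweep: launch the same kernel with several
// launch configurations and report the elapsed time of each.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* q, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] *= 0.99f;               // placeholder workload
}

int main()
{
    const int n = 1 << 24;
    float* d_q;
    cudaMalloc(&d_q, n * sizeof(float));

    const int candidates[] = {64, 128, 256, 512, 1024};
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        touch<<<grid, block>>>(d_q, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block=%4d grid=%7d time=%.3f ms\n", block, grid, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_q);
    return 0;
}
```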

Claims (6)

1. A fluid machinery simulation program heterogeneous acceleration method based on a GPU, characterized by comprising the following steps:
step 1, performing hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and producing a preliminary parallel implementation of the selected hotspots;
step 2, for the preliminary parallel program of step 1, migrating intermediate results to the GPU (graphics processing unit) side in the parts with intensive host-device data transfer, thereby reducing data transfer between host and device;
step 3, for the program of step 2, laying out the grid data in memory so that threads in the same warp read and update grid data that are adjacent in memory, and using shared memory, constant memory and texture memory according to the program's characteristics, so as to fully exploit parallelism;
step 4, exposing data parallelism through code restructuring;
step 5, GPU adaptation of serial algorithms: in step 4, if an individual serial algorithm takes a large share of the running time and cannot run on the GPU side, replacing the original elimination method with an iterative method when solving equations;
step 6, tuning thread-allocation parameters to determine the thread configuration that best hides memory-access latency and improves computational throughput;
step 7, testing the acceleration achieved by the programs of steps 1 to 6; if the speed-up ratio meets expectations, the acceleration is complete; otherwise, a new round of acceleration iteration starts from step 1, until a satisfactory acceleration effect is reached;
in step 4, for the parts that would produce data races after being ported to the parallel version, data races are avoided by adjusting the computation pattern and computation order; for the parts where this is not possible, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
2. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 1, by collecting and analyzing the share of running time taken by each hotspot part, the parts of the program with a large share of the running time are identified, and parallel optimization is performed on those parts to approach the theoretical upper limit of acceleration.
3. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 2, the entire multigrid computation is placed on the GPU side for acceleration, and all intermediate results are kept on the GPU side.
4. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 3, the effective bandwidth of memory access is increased by adjusting the layout of data in global memory so that adjacent hardware threads process data that are contiguous in memory; and shared memory, texture memory and constant memory are used to reduce the pressure the threads place on global memory bandwidth, improving the speed-up ratio.
5. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 5, for a serial algorithm that takes a large share of the running time and cannot run on the GPU side, the serial algorithm is replaced with a functionally equivalent algorithm that possesses data parallelism, and the parallel algorithm is implemented on the GPU.
6. The GPU-based fluid machinery simulation program heterogeneous acceleration method of claim 1, wherein in step 6, after the above steps are completed, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
CN201811378843.XA 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU Active CN109522127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811378843.XA CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU


Publications (2)

Publication Number Publication Date
CN109522127A CN109522127A (en) 2019-03-26
CN109522127B true CN109522127B (en) 2021-01-19

Family

ID=65776286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811378843.XA Active CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN109522127B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148361B (en) * 2020-08-27 2022-03-04 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729180A (en) * 2013-12-25 2014-04-16 浪潮电子信息产业股份有限公司 Method for quickly developing CUDA (compute unified device architecture) parallel programs
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN106250349A (en) * 2016-08-08 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of high energy efficiency heterogeneous computing system
CN108388537A (en) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 A kind of convolutional neural networks accelerator and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GPU Acceleration Smoothed Particle Hydrodynamics for the Navier-Stokes Equations; Yingrui Wang; 2016 24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing; 2016-12-31; full text *
Research and application of GPU-based tensor decomposition and reconstruction methods; Li Ming; China Masters' Theses Full-text Database, Basic Sciences; 2018-09-15; vol. 2018, no. 9; full text *
An asymptotic fitting optimization method for CPU-GPU source-to-source compilation systems; Wei Hongchang; Computer Engineering and Applications; 2016-12-31; vol. 52, no. 21; full text *

Also Published As

Publication number Publication date
CN109522127A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
Gómez-Luna et al. Benchmarking a new paradigm: An experimental analysis of a real processing-in-memory architecture
Gómez-Luna et al. Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system
US20210201124A1 (en) Systems and methods for neural network convolutional layer matrix multiplication using cache memory
Yang et al. Fast sparse matrix-vector multiplication on GPUs: Implications for graph mining
CN106383695B (en) The acceleration system and its design method of clustering algorithm based on FPGA
Nukada et al. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
US20120331278A1 (en) Branch removal by data shuffling
Economon et al. Towards high-performance optimizations of the unstructured open-source SU2 suite
CN109522127B (en) Fluid machinery simulation program heterogeneous acceleration method based on GPU
Li et al. 3-D parallel fault simulation with GPGPU
Liu Parallel and scalable sparse basic linear algebra subprograms
CN103996216A (en) Power efficient attribute handling for tessellation and geometry shaders
Kijsipongse et al. Dynamic load balancing on GPU clusters for large-scale K-Means clustering
Arkhipov et al. Sorting with gpus: A survey
Tolmachev VkFFT-a performant, cross-platform and open-source GPU FFT library
Hesam et al. Gpu acceleration of 3d agent-based biological simulations
CN114116208A (en) Short wave radiation transmission mode three-dimensional acceleration method based on GPU
CN105573834B (en) A kind of higher-dimension vocabulary tree constructing method based on heterogeneous platform
Segura et al. Energy-efficient stream compaction through filtering and coalescing accesses in gpgpu memory partitions
CN113010316A (en) Multi-target group intelligent algorithm parallel optimization method based on cloud computing
Dudnik et al. Cuda architecture analysis as the driving force Of parallel calculation organization
Tomiyama et al. Automatic parameter optimization for edit distance algorithm on GPU
CN105302577B (en) Drive the machine code generation method and device of execution unit
Xue et al. Multi-GPU performance optimization of a CFD code using OpenACC on different platforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant