CN109522127A - Fluid machinery simulation program heterogeneous acceleration method based on GPU - Google Patents

Fluid machinery simulation program heterogeneous acceleration method based on GPU

Info

Publication number
CN109522127A
CN109522127A (application CN201811378843.XA; granted as CN109522127B)
Authority
CN
China
Prior art keywords
gpu
program
memory
thread
data
Prior art date
Legal status
Granted
Application number
CN201811378843.XA
Other languages
Chinese (zh)
Other versions
CN109522127B (en)
Inventor
张兴军
赵文强
董小社
李靖波
雷雨
鲁晨欣
周剑锋
伍卫国
邹年俊
何峰
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201811378843.XA
Publication of CN109522127A
Application granted
Publication of CN109522127B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a GPU-based heterogeneous acceleration method for fluid machinery simulation programs. The steps include: hotspot analysis, to find the subprograms with acceleration potential; reduction and elimination of data transfer between host and device; coalescing of memory accesses and use of multiple memory types, to raise effective memory bandwidth utilization and the compute-to-memory-access ratio; code refactoring to expose data parallelism, since explicit global synchronization of GPU kernels harms data parallelism and should be avoided wherever possible; GPU adaptation of serial algorithms, by replacing a serial algorithm with a parallel algorithm of equivalent function; and adjustment of thread-allocation parameters, so that thread computation latency is sufficiently hidden and computational throughput is improved. If the above steps achieve the desired effect, the acceleration is complete; otherwise a new round of iteration starts from hotspot analysis until a satisfactory result is reached. The invention provides a GPU acceleration method tailored to the characteristics of fluid machinery simulation programs; the modified program can achieve the desired acceleration effect.

Description

GPU-based heterogeneous acceleration method for fluid machinery simulation programs
Technical field
The invention belongs to the intersection of fluid mechanics and high-performance computing, and in particular relates to a GPU-based heterogeneous acceleration method for fluid machinery simulation programs.
Background art
Computational fluid dynamics is one of the important technologies of fluid mechanics: the governing equations of fluid mechanics are solved on a computer using numerical methods, so that the flow in a flow field can be predicted. As computer processing power grows, ever finer fluid mechanics models can be constructed. At the same time, to compute flow-field changes more accurately and make the simulated flow "truer" to reality, fluid mechanics models keep becoming finer, which in turn places higher demands on computing capability.
As the transistor density and frequency of single-core chips reach their bottleneck, multicore has become the main way of improving computing capability today. Companies such as NVIDIA, AMD and Intel have proposed a series of specialized hardware for compute-intensive workloads, among which NVIDIA GPUs show outstanding performance in both speed and energy consumption. Traditional fluid machinery programs are optimized for CPU architectures, but with the rise of heterogeneous computing the difference between coprocessor architectures and CPU architectures keeps growing, and neither the programming style nor the optimization schemes of traditional fluid machinery simulation programs can be adapted directly to the new hardware architectures. Whether writing fluid machinery simulation programs for a new hardware platform, or porting an existing CPU-oriented fluid machinery simulation program to the GPU, new guidance is required.
In the explicit solvers of fluid machinery simulation programs, the fluid is mostly discretized with the finite volume method; the governing equations are iterated continuously on the grid until the convergence condition is met, and in each iteration step the physical attributes on the grid are updated. Such applications are compute-intensive. When writing a new CPU/GPU heterogeneous program, or porting an existing serial program to the GPU, attention must be paid to how to match the characteristics of the new hardware architecture.
Summary of the invention
The purpose of the present invention is to provide a GPU-based heterogeneous acceleration method for fluid machinery simulation programs, so as to solve the above problems.
To achieve the above object, the invention adopts the following technical scheme:
A GPU-based heterogeneous acceleration method for fluid machinery simulation programs, comprising the following steps:
Step 1, perform hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and carry out a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminarily parallelized program of step 1, in the parts with intensive host-device data transfer, move the intermediate results to the GPU side so as to reduce data transfer between host and device;
Step 3, for the program of step 2, adjust the memory layout of the grid data so that the threads in a same warp read and update grid data that are adjacent in memory, and use shared memory, constant memory and texture memory according to the program's characteristics to give full play to parallelism;
Step 4, expose data parallelism through code refactoring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if a serial algorithm accounts for a large share of the running time and cannot run on the GPU, replace elimination with an iterative method when solving the equations;
Step 6, adjust the thread-allocation parameters to determine the thread allocation that maximally hides memory-access latency and improves computational throughput;
Step 7, test the acceleration effect of the program after steps 1 to 6; if the speedup meets expectations, the acceleration is complete; otherwise start a new round of acceleration iteration from step 1, until a satisfactory acceleration effect is reached.
Further, in step 1, by collecting and analyzing the running-time share of the hotspot parts, the parts of the program with a large share of the running time are found and parallel optimization is carried out on them; this choice determines the theoretical upper limit of acceleration.
Further, in step 2, the entire multigrid solve is placed on the GPU for acceleration, and all intermediate results are kept on the GPU.
Further, in step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, raising the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the threads' pressure on global-memory bandwidth and improve the speedup.
Further, in step 4, for the parts that produce data races after porting to the parallel version, the races are avoided by adjusting the computation scheme and the order of computation; for parts that cannot be avoided in this way, global synchronization should be introduced by terminating the current GPU kernel and launching a new GPU kernel.
Further, in step 5, serial algorithms in the program that cannot be parallelized are replaced by algorithms of identical function that possess data parallelism, and the parallel algorithms are implemented on the GPU.
Further, in step 6, after the above steps are finished, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Compared with the prior art, the present invention has the following technical effects:
Through a reasonable memory layout, the effective bandwidth of GPU global-memory access is raised, giving full play to the GPU's computational advantage; by replacing serial algorithms that cannot be parallelized, the acceleratable region of the program is enlarged and the speedup improved; by avoiding global synchronization as far as possible, the damage of global synchronization to the program's parallelism is minimized.
Compared with the original CPU program, the GPU version achieves a significant speedup.
Brief description of the drawings
Fig. 1 is the flow chart of the method provided by the invention.
Specific embodiment
The present invention is further described below in conjunction with the accompanying drawing:
Referring to Fig. 1, a GPU-based heterogeneous acceleration method for fluid machinery simulation programs comprises the following steps:
Step 1, perform hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and carry out a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminarily parallelized program of step 1, in the parts with intensive host-device data transfer, move the intermediate results to the GPU side so as to reduce data transfer between host and device;
Step 3, for the program of step 2, adjust the memory layout of the grid data so that the threads in a same warp read and update grid data that are adjacent in memory, and use shared memory, constant memory and texture memory according to the program's characteristics to give full play to parallelism;
Step 4, expose data parallelism through code refactoring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if a serial algorithm accounts for a large share of the running time and cannot run on the GPU, replace elimination with an iterative method when solving the equations;
Step 6, adjust the thread-allocation parameters to determine the thread allocation that maximally hides memory-access latency and improves computational throughput;
Step 7, test the acceleration effect of the program after steps 1 to 6; if the speedup meets expectations, the acceleration is complete; otherwise start a new round of acceleration iteration from step 1, until a satisfactory acceleration effect is reached.
In step 1, by collecting and analyzing the running-time share of the hotspot parts, the parts of the program with a large share of the running time are found and parallel optimization is carried out on them; this choice determines the theoretical upper limit of acceleration.
In step 2, the entire multigrid solve is placed on the GPU for acceleration, and all intermediate results are kept on the GPU.
In step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, raising the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the threads' pressure on global-memory bandwidth and improve the speedup.
In step 4, for the parts that produce data races after porting to the parallel version, the races are avoided by adjusting the computation scheme and the order of computation; for parts that cannot be avoided in this way, global synchronization should be introduced by terminating the current GPU kernel and launching a new GPU kernel.
In step 5, serial algorithms in the program that cannot be parallelized are replaced by algorithms of identical function that possess data parallelism, and the parallel algorithms are implemented on the GPU.
In step 6, after the above steps are finished, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
Hotspot analysis. Amdahl's law shows that the upper limit of the speedup obtainable by accelerating a program in parallel computing is bounded by the fraction of the original serial program taken up by the accelerated part. According to this law, the first step of parallel optimization should be to find the hotspots: only by finding the parts of the program with a large share of the running time, and optimizing those parts in parallel, can an ideal acceleration effect be obtained. If a non-hotspot is chosen as the acceleration target, the parallelization effort is bound to yield little. Hotspot analysis is generally done in two ways, static and dynamic. The static way finds the program's hotspots by analyzing the program code. The dynamic way uses tools such as gprof to collect information such as the time spent in each function call while the program runs, and finds the hotspots from that information. Owing to the characteristics of the model, static analysis shows that the hotspot of a fluid machinery simulation program generally lies in the multigrid iterative solve, and GPU acceleration is generally applied to that part.
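As a minimal sketch of the dynamic approach when a profiler such as gprof is unavailable, the program's phases can also be timed directly; the phase names below (solve_multigrid, write_output) are illustrative placeholders, not functions from the patent:

    #include <chrono>
    #include <cstdio>

    // Illustrative stand-ins for the program's phases; only their
    // measured time shares matter for hotspot selection.
    static void solve_multigrid() { for (volatile long i = 0; i < 100000000L; ++i) {} }
    static void write_output()    { for (volatile long i = 0; i < 1000000L; ++i) {} }

    int main() {
        using clk = std::chrono::steady_clock;
        auto t0 = clk::now(); solve_multigrid();
        auto t1 = clk::now(); write_output();
        auto t2 = clk::now();
        double solve = std::chrono::duration<double>(t1 - t0).count();
        double io    = std::chrono::duration<double>(t2 - t1).count();
        // Amdahl's law: accelerating a fraction p of the runtime by a
        // factor s bounds the overall speedup by 1 / ((1 - p) + p / s).
        printf("solve: %.1f%%  output: %.1f%%\n",
               100.0 * solve / (solve + io), 100.0 * io / (solve + io));
        return 0;
    }

For instance, if the multigrid solve accounts for 95% of the runtime, then even an infinitely fast GPU kernel bounds the overall speedup at 1/(1-0.95) = 20x, which is why parallelizing non-hotspots yields little.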
Reducing data transfer between host and device. Compared with the GPU's powerful computing capability, the efficiency of data transfer between CPU and GPU is extremely low; it can drag acceleration performance down badly and should be avoided as much as possible. In the multigrid solve, if the GPU-computed part fails to cover the whole iteration loop, calculation results must be transferred between CPU and GPU in every iteration step, and such transfers are extremely time-consuming and inefficient compared with the GPU's computing performance, so GPU acceleration suffers greatly. Therefore the entire multigrid solve is placed on the GPU for acceleration and all intermediate results are kept on the GPU, converting the per-iteration data transfers into two transfers, one before and one after the iteration loop; the time consumed by data transfer thus becomes a constant instead of being proportional to the number of iterations. The time taken by these two transfers is negligible compared with the time of the iterative computation, so their influence on acceleration performance can also be ignored. Subprograms without parallelism, such as certain individual differencing processes, cannot be computed on the GPU and can only stay on the CPU; in that case, data transfer between GPU and CPU inside the iteration loop is unavoidable. Although only the calculation results of these serial parts need to be transferred and the data volume is not large, the number of transfers is proportional to the number of loop iterations, so this still influences the program's performance and also deserves optimization. Pinned (page-locked) memory can be used to optimize this part of the transfer. By default, a transfer between a CPU process and the GPU must first be copied into a staging buffer in operating-system kernel space and then copied from that buffer to the destination; once the pages holding the arrays to be transferred are pinned, the copy through the kernel-space buffer is avoided and the data move directly between the CPU process and the GPU. After many loop iterations, the reduction in data transfer improves the program's performance quite noticeably.
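A minimal CUDA sketch of both measures, assuming a grid of N floats that stays resident on the device for the whole loop; the kernel body and sizes are illustrative, not taken from the patent:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define N (1 << 20)

    __global__ void iterate(float* grid) {          // stand-in for one smoothing step
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) grid[i] = 0.5f * (grid[i] + 1.0f);
    }

    int main() {
        float *d_grid, *h_result;
        cudaMalloc(&d_grid, N * sizeof(float));
        cudaMemset(d_grid, 0, N * sizeof(float));
        // Pinned (page-locked) host memory: the transfer bypasses the
        // staging copy through the operating-system kernel-space buffer.
        cudaMallocHost(&h_result, N * sizeof(float));

        // The grid stays resident on the GPU across the whole loop, so
        // transfer cost is a constant (one upload, one download) rather
        // than proportional to the iteration count.
        for (int step = 0; step < 10000; ++step)
            iterate<<<(N + 255) / 256, 256>>>(d_grid);

        cudaMemcpy(h_result, d_grid, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("grid[0] = %f\n", h_result[0]);
        cudaFreeHost(h_result);
        cudaFree(d_grid);
        return 0;
    }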
Memory-access coalescing and use of multiple memory types. Fluid machinery simulation programs operate on multidimensional arrays. When assigning array elements to threads, the lowest dimension of the array should be mapped one-to-one to the lowest-varying thread index, so that consecutive threads update adjacent data. GPU threads execute in units of warps, a warp comprising 32 hardware threads; memory is read in units of cache lines, a cache line generally being 32 or 128 contiguous, aligned bytes. If the memory accessed by the threads of a warp is contiguous, GPU bandwidth can be fully utilized; if it is not, bandwidth is wasted and the effective bandwidth falls. For example, if only 4 bytes of a 32-byte cache-line transfer are actually used, throughput is reduced by a factor of 8, the GPU's powerful computing capability cannot be brought to bear, and computation slows markedly. Memory-access coalescing is the basic means of raising effective memory bandwidth and should be attended to first during optimization. When accelerating a three-dimensional grid on the GPU, consecutive threads should as far as possible access data laid out contiguously in memory; any other mapping turns the contiguous reads of a warp into strided reads, lowering the effective memory bandwidth, and should be avoided. Where possible, memory types such as shared memory should also be used: the access latency and size of shared memory are comparable to cache, but the data it holds are managed by software, which greatly improves acceleration performance when the data access has a specific pattern.
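A sketch of the mapping rule for a three-dimensional grid stored with x as the fastest-varying index (the dimensions are illustrative): in the first kernel the 32 threads of a warp touch 32 consecutive floats, so the accesses coalesce into full cache lines; the second kernel shows the strided pattern to be avoided:

    #include <cuda_runtime.h>

    #define NX 128
    #define NY 128
    #define NZ 128
    #define IDX(x, y, z) (((z) * NY + (y)) * NX + (x))

    // Coalesced: threadIdx.x maps to the lowest (fastest-varying) array
    // dimension, so consecutive threads read consecutive addresses.
    __global__ void update_coalesced(float* u) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x < NX) u[IDX(x, blockIdx.y, blockIdx.z)] += 1.0f;
    }

    // Strided: threadIdx.x maps to y, so consecutive threads are NX
    // floats apart and most of each fetched cache line is wasted.
    __global__ void update_strided(float* u) {
        int y = blockIdx.x * blockDim.x + threadIdx.x;
        if (y < NY) u[IDX(blockIdx.y, y, blockIdx.z)] += 1.0f;
    }

    int main() {
        float* d_u;
        cudaMalloc(&d_u, NX * NY * NZ * sizeof(float));
        cudaMemset(d_u, 0, NX * NY * NZ * sizeof(float));
        update_coalesced<<<dim3(NX / 128, NY, NZ), 128>>>(d_u);
        update_strided <<<dim3(NY / 128, NX, NZ), 128>>>(d_u);
        cudaDeviceSynchronize();
        cudaFree(d_u);
        return 0;
    }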
Exposing data parallelism through code refactoring. The multigrid iterative solver is structured as a loop with a huge iteration count, inside which simulated physical quantities flow between adjacent grid cells along the three dimensions. When a physical quantity flows between adjacent cells, the values of both cells change at the same time. If the same thread were responsible for the inflow and outflow of physical quantities between one array element and its adjacent elements in the three dimensions, every thread would have to modify the data element it owns as well as the three elements adjacent to it in the three dimensions, while other threads could simultaneously modify these same four elements. This poses no problem in the serial case, but left untreated in parallel computation it causes read/write conflicts. To guarantee program correctness, global synchronization would have to be introduced by terminating the current kernel and launching a new one; if a global synchronization were introduced for the update in each dimension, what was originally one subprogram would be split into six smaller GPU kernel functions because of the extra synchronizations. Introducing global synchronization loses a great deal of parallelism and adds extra GPU kernel launch overhead, so the acceleration effect drops sharply. This defect can be avoided by rearranging the order of computation to avoid data races: each thread is made responsible for the outflow of physical quantities from the cell it owns to its positive-direction neighbors in the three dimensions, and for the inflow from its negative-direction neighbors into its own cell. Arranged this way, each thread updates only the data of the cell it owns, data races are avoided, and the program obtains ideal acceleration performance.
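A one-dimensional sketch of the race-free arrangement (the flux function and sizes are illustrative): each thread gathers the fluxes across the two faces of the cell it owns and writes only that cell, so no two threads ever write the same element and no extra kernel-split global synchronization is needed:

    #include <cuda_runtime.h>

    #define N 1024

    __device__ float flux(float a, float b) { return 0.5f * (a - b); }

    // Gather form: thread i reads u_old[i-1], u_old[i] and u_old[i+1],
    // but writes only u_new[i], the element it owns -- no data race.
    __global__ void step_gather(const float* u_old, float* u_new) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i == 0 || i >= N - 1) return;              // boundary cells fixed
        float inflow  = flux(u_old[i - 1], u_old[i]);  // across the left face
        float outflow = flux(u_old[i], u_old[i + 1]);  // across the right face
        u_new[i] = u_old[i] + inflow - outflow;
    }

    int main() {
        float *u_old, *u_new;
        cudaMalloc(&u_old, N * sizeof(float));
        cudaMalloc(&u_new, N * sizeof(float));
        cudaMemset(u_old, 0, N * sizeof(float));
        cudaMemset(u_new, 0, N * sizeof(float));
        step_gather<<<N / 256, 256>>>(u_old, u_new);
        cudaDeviceSynchronize();
        cudaFree(u_old);
        cudaFree(u_new);
        return 0;
    }

The scatter form, in which thread i would also add its outflow into u_new[i+1], requires atomics or a kernel split per dimension; the gather form trades one duplicate flux evaluation for the removal of all write conflicts.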
GPU adaptation of serial algorithms. For certain algorithms, such as the solution of the tridiagonal systems commonly encountered in equation solving, the data parallelism is not as obvious as in other algorithms. Because elimination on a system of equations is inherently serial, the traditional CPU solve cannot be ported directly to the GPU; a corresponding algorithmic adaptation is needed before porting. Classical GPU algorithms for solving tridiagonal systems include PCR (Parallel Cyclic Reduction) and others; NVIDIA's official cuSPARSE library provides a set of tridiagonal solvers based on such algorithms, which can be used to parallelize the tridiagonal solves on the GPU. Such algorithms can also be optimized by exploiting the program's own characteristics. For example, in an explicit solver the parameter matrix of these tridiagonal systems does not change across iterations while the iteration count is huge; given this characteristic, the parameter matrix can be inverted before the iterations begin, converting the tridiagonal solve into a multiplication of the inverse parameter matrix with a vector. Matrix multiplication has good data parallelism and can give full play to the GPU's performance. The above illustrates one line of thought for adapting a serial algorithm to parallel execution. A serial algorithm running on the CPU can be replaced by an algorithm of equivalent function that has good data parallelism; for example, elimination in equation solving can be replaced by an iterative method. By calling algorithms with good data parallelism, the architectural characteristics of the GPU are matched better and acceleration performance improves.
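A sketch of a GPU tridiagonal solve using the gtsv2 routines of NVIDIA's cuSPARSE library (the system, 512 unknowns with a constant [-1, 2, -1] stencil, is illustrative; compile with -lcusparse):

    #include <cuda_runtime.h>
    #include <cusparse.h>

    #define M 512

    __global__ void fill(float* dl, float* d, float* du, float* b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= M) return;
        dl[i] = (i == 0)     ? 0.0f : -1.0f;  // first lower-diagonal entry unused
        du[i] = (i == M - 1) ? 0.0f : -1.0f;  // last upper-diagonal entry unused
        d[i]  = 2.0f;                         // diagonally dominant system
        b[i]  = 1.0f;                         // right-hand side
    }

    int main() {
        float *dl, *d, *du, *b;
        cudaMalloc(&dl, M * sizeof(float));
        cudaMalloc(&d,  M * sizeof(float));
        cudaMalloc(&du, M * sizeof(float));
        cudaMalloc(&b,  M * sizeof(float));
        fill<<<(M + 255) / 256, 256>>>(dl, d, du, b);

        cusparseHandle_t handle;
        cusparseCreate(&handle);
        size_t bytes = 0;
        void* buffer = nullptr;
        cusparseSgtsv2_bufferSizeExt(handle, M, 1, dl, d, du, b, M, &bytes);
        cudaMalloc(&buffer, bytes);
        cusparseSgtsv2(handle, M, 1, dl, d, du, b, M, buffer);  // b now holds the solution
        cudaDeviceSynchronize();

        cusparseDestroy(handle);
        cudaFree(buffer); cudaFree(dl); cudaFree(d); cudaFree(du); cudaFree(b);
        return 0;
    }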
Thread-allocation parameter adjustment. By trying different thread-block and thread-grid sizes, the thread configuration that maximally hides memory-access latency and improves computational throughput is determined: for the specific program and problem scale, the kernel launch parameters that maximize the hardware's acceleration performance are found experimentally.
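As one way to seed that search, the CUDA runtime's occupancy API can suggest a starting block size, which is then refined by timing real runs; a sketch, where iterate stands for any kernel of the program:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void iterate(float* u, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) u[i] *= 0.999f;
    }

    int main() {
        int min_grid = 0, block = 0;
        // Suggests the block size maximizing theoretical occupancy for
        // this kernel on the present device; neighboring configurations
        // are then timed on the real workload to pick the best one.
        cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, iterate, 0, 0);
        printf("suggested block size: %d (minimum grid size: %d)\n", block, min_grid);
        return 0;
    }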

Claims (7)

1. A GPU-based heterogeneous acceleration method for fluid machinery simulation programs, characterized by comprising the following steps:
Step 1, performing hotspot analysis on the fluid machinery simulation program by combining static analysis and dynamic analysis, and carrying out a preliminary parallel implementation of the selected hotspots;
Step 2, for the preliminarily parallelized program of step 1, in the parts with intensive host-device data transfer, moving the intermediate results to the GPU side so as to reduce data transfer between host and device;
Step 3, for the program of step 2, adjusting the memory layout of the grid data so that the threads in a same warp read and update grid data that are adjacent in memory, and using shared memory, constant memory and texture memory according to the program's characteristics to give full play to parallelism;
Step 4, exposing data parallelism through code refactoring;
Step 5, GPU adaptation of serial algorithms: for the program of step 4, if a serial algorithm accounts for a large share of the running time and cannot run on the GPU, replacing elimination with an iterative method when solving the equations;
Step 6, adjusting the thread-allocation parameters to determine the thread allocation that maximally hides memory-access latency and improves computational throughput;
Step 7, testing the acceleration effect of the program after steps 1 to 6; if the speedup meets expectations, the acceleration is complete; otherwise starting a new round of acceleration iteration from step 1, until a satisfactory acceleration effect is reached.
2. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 1, by collecting and analyzing the running-time share of the hotspot parts, the parts of the program with a large share of the running time are found and parallel optimization is carried out on them, which determines the theoretical upper limit of acceleration.
3. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 2, the entire multigrid solve is placed on the GPU for acceleration, and all intermediate results are kept on the GPU.
4. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 3, the layout of the data in global memory is adjusted so that adjacent hardware threads process data that are contiguous in memory, raising the effective bandwidth of memory access; and shared memory, texture memory and constant memory are used to reduce the threads' pressure on global-memory bandwidth and improve the speedup.
5. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 4, for the parts that produce data races after porting to the parallel version, the races are avoided by adjusting the computation scheme and the order of computation; for parts that cannot be avoided in this way, global synchronization is introduced by terminating the current GPU kernel and launching a new GPU kernel.
6. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 5, serial algorithms in the program that cannot be parallelized are replaced by algorithms of identical function that possess data parallelism, and the parallel algorithms are implemented on the GPU.
7. The GPU-based heterogeneous acceleration method for fluid machinery simulation programs according to claim 1, characterized in that, in step 6, after the above steps are finished, suitable thread-grid and thread-block allocation parameters are found through repeated experiments, determining the best match between the program and the hardware.
CN201811378843.XA 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU Active CN109522127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811378843.XA CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811378843.XA CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Publications (2)

Publication Number Publication Date
CN109522127A (en) 2019-03-26
CN109522127B (en) 2021-01-19

Family

ID=65776286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811378843.XA Active CN109522127B (en) 2018-11-19 2018-11-19 Fluid machinery simulation program heterogeneous acceleration method based on GPU

Country Status (1)

Country Link
CN (1) CN109522127B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148361A (en) * 2020-08-27 2020-12-29 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729180A (en) * 2013-12-25 2014-04-16 浪潮电子信息产业股份有限公司 Method for quickly developing CUDA (compute unified device architecture) parallel programs
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN106250349A (en) * 2016-08-08 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of high energy efficiency heterogeneous computing system
CN108388537A (en) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 A kind of convolutional neural networks accelerator and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729180A (en) * 2013-12-25 2014-04-16 浪潮电子信息产业股份有限公司 Method for quickly developing CUDA (compute unified device architecture) parallel programs
CN104035781A (en) * 2014-06-27 2014-09-10 北京航空航天大学 Method for quickly developing heterogeneous parallel program
CN104408019A (en) * 2014-10-29 2015-03-11 浪潮电子信息产业股份有限公司 Method for realizing GMRES (generalized minimum residual) algorithm parallel acceleration on basis of MIC (many integrated cores) platform
CN106250349A (en) * 2016-08-08 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of high energy efficiency heterogeneous computing system
CN108388537A (en) * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 A kind of convolutional neural networks accelerator and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YINGRUI WANG: "GPU Acceleration Smoothed Particle Hydrodynamics for the Navier-Stokes Equations", 2016 24th Euromicro International Conference on Parallel, Distributed and Network-Based Processing
LI MING: "Research and application of GPU-based tensor decomposition and reconstruction methods", China Masters' Theses Full-text Database, Basic Sciences
WEI HONGCHANG: "An asymptotic fitting optimization method for CPU-GPU source-to-source compilation ***", Computer Engineering and Applications

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148361A (en) * 2020-08-27 2020-12-29 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112148361B (en) * 2020-08-27 2022-03-04 中国海洋大学 Method and system for transplanting encryption algorithm of processor
CN112612476A (en) * 2020-12-28 2021-04-06 吉林大学 SLAM control method, equipment and storage medium based on GPU

Also Published As

Publication number Publication date
CN109522127B (en) 2021-01-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant