CN115756605A - Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs - Google Patents

Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs

Info

Publication number
CN115756605A
Authority
CN
China
Prior art keywords
gpu
shallow
calculation
uwshcu
cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211390967.6A
Other languages
Chinese (zh)
Inventor
王玉柱
李菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences Beijing
Original Assignee
China University of Geosciences Beijing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences Beijing filed Critical China University of Geosciences Beijing
Priority to CN202211390967.6A priority Critical patent/CN115756605A/en
Publication of CN115756605A publication Critical patent/CN115756605A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a multi-GPU heterogeneous computing method for a shallow cloud convection parameterization scheme. It accelerates the shallow convection parameterization scheme UWshcu in an atmospheric general circulation model on multiple GPU cards in order to improve the computational efficiency of the shallow cloud convection physical process. The method comprises the following steps: defining and initializing the data required by the shallow cloud convection physical process on the host side; distributing the data to the nodes of the host side through MPI for computation; starting a GPU on each node with the CUDA API; launching the kernel function and performing thread-level parallel computation of the kernel on the GPU; and transferring the results back to the CPU and gathering them through MPI. The beneficial effect of the invention is that the shallow convection parameterization scheme UWshcu is, for the first time, accelerated in a CPU + GPU heterogeneous computing mode, which greatly improves the computational efficiency of the shallow cloud convection physical process in the atmospheric general circulation model.

Description

Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs
Technical Field
The invention relates to the technical field of high-performance computing and numerical weather prediction, in particular to a shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs.
Background
The accuracy and timeliness of weather forecasts are matters of national economy and people's livelihood. In the meteorological field, numerical simulation of weather and climate has always guided weather forecasting. In recent years, numerical weather prediction has been moving towards kilometer-scale resolution globally and hectometer-scale resolution over key regions; as model resolution keeps increasing, so does the demand for computing resources. With computing resources limited, how to develop efficient weather and climate models has become a problem of growing concern in both the meteorological and computing communities.
A General Circulation Model (GCM) is a mathematical model describing a planetary atmosphere or ocean and is widely used in weather forecasting, climate simulation, climate change projection and related applications. It mainly consists of two parts: the dynamical core and the physical processes. Among the atmospheric physical processes, cloud convection is particularly critical, and its numerical simulation is so computationally demanding that it severely limits the simulation efficiency of the whole general circulation model. Mainstream cloud convection parameterization schemes are divided into deep convection schemes and shallow convection schemes; the University of Washington shallow convection scheme (UWshcu) is currently used in a number of climate models, such as the Geophysical Fluid Dynamics Laboratory (GFDL) coupled model CM3, the Community Atmosphere Model CAM5, and the atmospheric general circulation model IAP AGCM developed by the Institute of Atmospheric Physics of the Chinese Academy of Sciences.
In modern supercomputers, hybrid nodes combining central processing units and graphics processing units (CPU + GPU) have become mainstream. Because the clock rate of conventional CPUs has nearly stalled, accelerating computation with GPUs has attracted increasing interest. In a high-resolution weather and climate model, the number of global horizontal grid points reaches millions or even tens of millions; accelerating the computation with CPUs alone would consume enormous computing resources, whereas the many-core architecture of the GPU fits this computational pattern well and can deliver million- to ten-million-way parallelism at very low cost. For the shallow convection parameterization scheme UWshcu, researchers in the meteorological field have made many improvements to the scheme itself, but accelerating its computation has largely been neglected, so that when it is coupled into weather models it consumes a great deal of computing resources and computing time. The aim of the invention is to accelerate the shallow convection scheme on a modern CPU + GPU heterogeneous computing system and to develop an efficient heterogeneous computing method for the shallow convection parameterization scheme, thereby resolving the computational bottleneck of the cloud convection physical process in atmospheric models.
Disclosure of Invention
The invention aims to provide a shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs, so as to improve the computational efficiency of the shallow cloud convection physical process in the atmospheric general circulation model.
To achieve this purpose, the technical solution of the invention comprises the following steps:
step 1, rewriting a serial Fortran code in a UWshcu mode by using a C language. The structure and variable precision of the source code are reserved during rewriting, and only the syntax difference of different programming languages is considered.
Step 2, drive the different CPU cores and GPU devices through the MPI interface. The process rank and the total number of processes are obtained with MPI_Comm_rank and MPI_Comm_size; the number of GPU devices on a node is obtained with cudaGetDeviceCount; and the GPU device used by each MPI process is selected with cudaSetDevice.
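A minimal sketch of this step is given below; it assumes one MPI process per GPU and a simple round-robin mapping of ranks to the devices of a node, a policy that is not spelled out in the text above.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* process number            */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */

        int ndev;
        cudaGetDeviceCount(&ndev);               /* GPU devices on this node   */
        cudaSetDevice(rank % ndev);              /* bind this process to a GPU */

        /* ... per-process UWshcu computation (steps 3-5) goes here ... */

        MPI_Finalize();
        return 0;
    }

With this mapping, each MPI process drives exactly one GPU card independently, which matches the coarse-grained MPI decomposition used in the following steps.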
Step 3, data transfer and kernel launch. GPU device memory is allocated with the CUDA API cudaMalloc; data are transferred between the host and the device with cudaMemcpy; the kernel is launched for computation in the form compute_uwshcu<<<dim3(blocknum), dim3(blocksize)>>>(parameters); and the GPU device memory is released with cudaFree().
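The host-side driver for this step could look like the following sketch; the two-argument kernel signature, the helper name run_uwshcu_chunk and the buffer sizes are simplifications introduced here for illustration and are not the actual UWshcu interface.

    #include <cuda_runtime.h>

    /* Simplified kernel declaration used only for this illustration; the
     * real compute_uwshcu takes a much longer parameter list.            */
    __global__ void compute_uwshcu(const double *in, double *out);

    /* Sketch of step 3: allocate device memory, copy the inputs, launch
     * the kernel, copy the results back and release the device memory.   */
    void run_uwshcu_chunk(const double *h_in, double *h_out,
                          size_t nin, size_t nout,
                          int blocknum, int blocksize) {
        double *d_in = NULL, *d_out = NULL;
        cudaMalloc((void **)&d_in,  nin  * sizeof(double));
        cudaMalloc((void **)&d_out, nout * sizeof(double));

        cudaMemcpy(d_in, h_in, nin * sizeof(double), cudaMemcpyHostToDevice);

        compute_uwshcu<<<dim3(blocknum), dim3(blocksize)>>>(d_in, d_out);
        cudaDeviceSynchronize();

        cudaMemcpy(h_out, d_out, nout * sizeof(double), cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
    }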
Step 4, writing the kernel function. The solver part of the UWshcu code is encapsulated in the compute_uwshcu subroutine, and compute_uwshcu_inv calls compute_uwshcu and initializes the input data. Under the CUDA programming model, the invention uses compute_uwshcu_inv as the host function and compute_uwshcu as the kernel function. compute_uwshcu_inv is responsible for data transfer between the host and the device and for launching the kernel compute_uwshcu, which performs one-dimensional parallel computation on the GPU. Before compute_uwshcu starts computing the horizontal column data, it calls the findsp function to compute the wet-bulb temperature and specific humidity; findsp has a small computational cost and, because of its data dependencies, is not suitable for parallel execution, so it is rewritten as a host-side function called by compute_uwshcu_inv. The remaining 15 subfunctions called by compute_uwshcu are all implemented as device functions. In the serial algorithm a large number of temporary arrays are defined inside compute_uwshcu, and these arrays would suffer read-write conflicts between different threads; cudaMalloc is therefore used to allocate them in global memory, and a horizontal dimension (pcols) is added to each array to eliminate the conflicts.
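The structure just described can be sketched as follows; the argument lists, the temporary array tmp and the exact thread-to-column mapping are assumptions made for illustration rather than the real compute_uwshcu interface.

    /* Sketch of step 4: one thread handles one horizontal column, and the
     * temporary array carries an extra pcols dimension so that each thread
     * writes only its own slice (no read-write conflict between threads). */
    __global__ void compute_uwshcu(int pcols, int mkx,
                                   const double *tw,  /* wet-bulb temperature from host-side findsp */
                                   const double *qw,  /* specific humidity from host-side findsp    */
                                   double *tmp,       /* temporary array of size mkx * pcols         */
                                   double *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* horizontal column index */
        if (i >= pcols) return;
        for (int k = 0; k < mkx; k++) {
            tmp[k * pcols + i] = tw[k * pcols + i];     /* thread-private slice */
            /* ... column-wise shallow convection computation via device functions ... */
        }
        out[i] = tmp[(mkx - 1) * pcols + i];            /* illustrative per-column result */
    }

    /* Host function: initializes the inputs, runs findsp on the CPU,
     * transfers the data and launches the kernel (details omitted here). */
    void compute_uwshcu_inv(void) {
        /* findsp(...)  -> wet-bulb temperature and specific humidity on the host */
        /* cudaMemcpy the inputs to the device, launch compute_uwshcu<<<...>>>,   */
        /* then cudaMemcpy the results back and release the device memory.        */
    }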
Step 5, optimizing the kernel function. Memory access in the kernel is optimized: for the variables in compute_uwshcu, constants are stored in GPU constant memory, and two- and three-dimensional arrays that occupy too much storage are placed in global memory. For the one-dimensional arrays, memory pressure is relieved by combining shared memory and registers: one-dimensional arrays without write conflicts between threads are kept in shared memory, and the remaining arrays are stored in registers.
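The memory placement described above may be sketched as follows; the names g_cp, s_col and r_val and the fixed shared-array size are hypothetical and only indicate where each class of data is assumed to reside.

    /* Sketch of step 5 memory placement (shared-array size assumes
     * blockDim.x <= 256).                                               */
    __constant__ double g_cp;                 /* constant read by every thread: constant memory */

    __global__ void compute_uwshcu_opt(int pcols,
                                       double *work2d) {  /* large 2-D/3-D arrays stay in global memory */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        /* 1-D array with no inter-thread write conflict: shared memory,
         * one slot per thread of the block.                             */
        __shared__ double s_col[256];
        s_col[threadIdx.x] = 0.0;

        /* remaining per-thread scalars and small arrays live in registers */
        double r_val = g_cp + s_col[threadIdx.x];

        if (i < pcols)
            work2d[i] = r_val;                /* illustrative write back to global memory */
    }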
The invention has the following advantages:
the invention combines and applies the CPU parallel technology MPI and the GPU parallel technology CUDA, and realizes the purpose of accelerating the computation of UWshcu by using a large-scale GPU cluster. Compared with the single Intel Xeon E5-2680 CPU (10 core), the acceleration method realizes 153.15 x acceleration on 16 NVIDIA Tesla V100 GPUs, and compared with the acceleration ratio of a serial algorithm running on the single Intel Xeon E5-2680 CPU core, the acceleration ratio is as high as 1042.30 x, and the calculation efficiency of a shallow cloud convection parameterization scheme UWshcu is greatly improved.
Drawings
FIG. 1 shows a schematic structure of UWshcu.
Fig. 2 is a schematic diagram of UWshcu multi-GPU heterogeneous computation based on MPI + CUDA.
FIG. 3 shows a three-dimensional subdivision of UWshcu based on MPI + CUDA.
Fig. 4 shows the runtime of UWshcu on three GPU cards and multiple CPU cores (10 cores).
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings and tables of the embodiments.
The method of converting the UWshcu serial Fortran code into parallel CUDA C is shown in Algorithm 1 and Algorithm 2:
[Algorithm 1 and Algorithm 2: pseudocode listings, available only as figures in the original patent document.]
The MPI + CUDA parallel programming model is built on multiple CPU cores and multiple GPU cards: MPI handles the communication between different nodes and partitions the computation at coarse granularity, while the CUDA API drives the GPU devices and parallelizes the computation at finer granularity. Taking a GCM at 0.25° resolution as an example, the global horizontal grid has 1152 × 768 points and the number of horizontal columns handled per kernel call is pcols. FIG. 2 is a schematic diagram of the UWshcu physical structure three-dimensionally decomposed with the MPI + CUDA two-level hybrid programming model. When the GCM is run with n MPI processes, each MPI process is responsible for the computation of 1152 × 768/n horizontal grid points and starts one GPU accelerator card, so n GPU cards in total perform the fine-grained parallel computation and each card must complete 1152 × 768/n horizontal grid points; each GPU card therefore has to call the UWshcu scheme 1152 × 768/n/pcols times to finish its share of the global horizontal grid points. Algorithm 3 describes the method of realizing the two-level parallelism of UWshcu with the MPI + CUDA hybrid programming model:
[Algorithm 3: pseudocode listing, available only as a figure in the original patent document.]
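In the spirit of Algorithm 3 (which is available only as a figure), the two-level decomposition might be organized as in the following sketch; the helper name run_uwshcu_two_level, the chunking arithmetic and the commented kernel launch are assumptions made for illustration.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Sketch of the two-level parallel driver: MPI splits the 1152 x 768
     * global horizontal grid over nprocs processes; each process drives
     * one GPU and calls the UWshcu kernel once per chunk of pcols columns. */
    void run_uwshcu_two_level(void) {
        const int pcols = 16;                          /* columns per kernel call (illustrative) */
        int rank, nprocs, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);                    /* one GPU card per MPI process */

        const int ncol_global = 1152 * 768;            /* 0.25-degree global grid      */
        const int ncol_local  = ncol_global / nprocs;  /* grid points per MPI process  */
        const int nchunks     = ncol_local / pcols;    /* UWshcu kernel calls per card */

        for (int c = 0; c < nchunks; c++) {
            /* copy the c-th chunk of pcols columns to the GPU ...              */
            /* compute_uwshcu<<<dim3(blocknum), dim3(blocksize)>>>(chunk args); */
            /* ... copy this chunk's results back to the host                   */
        }
        /* finally, the per-process results are gathered via MPI (e.g. MPI_Gather). */
    }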
To compare the acceleration of a single GPU with that of a single CPU, a 10-thread OpenMP version of UWshcu was tested on an Intel Xeon E5-2680 v2 CPU and compared with UWshcu accelerated on a single GPU card; the result is shown in fig. 3. The experiment shows that the performance of the UWshcu heterogeneous acceleration scheme on a single GPU far exceeds that of a CPU with a small number of compute cores: even with data transfer time included, the overall runtime of the algorithm on a Tesla V100 GPU is 11.05 times faster than on the single CPU (10 cores), so the UWshcu GPU acceleration scheme is superior to the CPU.
For the multi-GPU acceleration algorithm, multi-node tests were first carried out on a K20 GPU cluster and a V100 GPU cluster; each node of the K20 cluster uses 2 Tesla K20 accelerator cards and each node of the V100 cluster uses 4 Tesla V100 accelerator cards. The results were compared with the single-CPU OpenMP-accelerated version of UWshcu and are shown in Table 3 and Table 4. On 16 V100 GPUs, the multi-GPU acceleration algorithm of UWshcu is 153.15 times faster than the single CPU, and the speedup over the serial algorithm running on a single CPU core reaches 1042.30×, which demonstrates the effectiveness of the multi-GPU acceleration algorithm. In addition, because MPI performs the first-level decomposition of the UWshcu computation over the global horizontal grid points and each process computes independently, the algorithm has high parallel efficiency: according to the experimental results on the V100 GPUs, the average parallel efficiency reaches 91%. It follows that this method is very effective in improving the computational efficiency of the shallow convection parameterization scheme.
TABLE 3. Speedup of the UWshcu multi-GPU acceleration algorithm on the K20 cluster compared with the single-CPU accelerated algorithm. [Table data available only as a figure in the original patent document.]
TABLE 4. Speedup of the UWshcu multi-GPU acceleration algorithm on the V100 cluster compared with the single-CPU accelerated algorithm. [Table data available only as a figure in the original patent document.]
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (5)

1. A shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs, characterized in that: the shallow convection parameterization scheme UWshcu in the atmospheric general circulation model is accelerated mainly through an MPI and CUDA hybrid programming model, so as to improve the computational efficiency of the shallow cloud convection physical process.
2. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 1, characterized in that: the parameters required by the UWshcu computation are partitioned through MPI, and the partitioned data are dispatched to designated nodes for computation.
3. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 2, characterized in that: each MPI process starts one CPU core to perform the computation, and each CPU core is responsible for the following operations: allocating GPU device memory, transferring data between the host and the GPU, launching the kernel function, and releasing the GPU device memory after the kernel computation completes.
4. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 3, characterized in that: the subroutine compute_uwshcu(parameters) in the serial code is rewritten as a GPU-side kernel function in the form __global__ void compute_uwshcu(parameters).
5. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 4, characterized in that: according to the physical characteristics of the UWshcu scheme, its computations in the horizontal dimension have no dependencies, so the advantage of the GPU's many CUDA cores is exploited to parallelize the UWshcu computation over the horizontal dimension at the thread level; after the computation finishes, the results needed by the host side are transferred from the GPU back to the CPU.
CN202211390967.6A 2022-11-07 2022-11-07 Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs Pending CN115756605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211390967.6A CN115756605A (en) 2022-11-07 2022-11-07 Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211390967.6A CN115756605A (en) 2022-11-07 2022-11-07 Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs

Publications (1)

Publication Number Publication Date
CN115756605A true CN115756605A (en) 2023-03-07

Family

ID=85357462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211390967.6A Pending CN115756605A (en) 2022-11-07 2022-11-07 Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs

Country Status (1)

Country Link
CN (1) CN115756605A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117032841A (en) * 2023-08-04 2023-11-10 太初(无锡)电子科技有限公司 High-performance transfer method of kernel function parameters in heterogeneous computing and heterogeneous computing system
CN117032841B (en) * 2023-08-04 2024-04-26 太初(无锡)电子科技有限公司 High-performance transfer method of kernel function parameters in heterogeneous computing and heterogeneous computing system


Legal Events

Code        Title
PB01        Publication
SE01        Entry into force of request for substantive examination