CN115756605A - Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs - Google Patents
Info
- Publication number: CN115756605A
- Application number: CN202211390967.6A
- Authority: CN (China)
- Prior art keywords: gpu, shallow, calculation, uwshcu, cloud
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
  - Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    - Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
      - Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
        - Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
    - Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
      - Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a multi-GPU heterogeneous computing method for a shallow cloud convection parameterization scheme. It accelerates the shallow convection parameterization scheme UWshcu in a general circulation model with multiple GPU cards, improving the computational efficiency of the shallow cloud convection physical process. The method comprises the following steps: defining and initializing, on the host side, the data required by the shallow cloud convection physical process; distributing the data through MPI to the host-side nodes for computation; starting a GPU on each node with the CUDA API; launching the kernel function and performing thread-level parallel computation of the kernel on the GPU; and transferring the results back to the CPU and gathering them through MPI. The beneficial effect of the invention is that the shallow convection parameterization scheme UWshcu is accelerated with a CPU + GPU heterogeneous computing mode for the first time, greatly improving the computational efficiency of the shallow cloud convection physical process in the general circulation model.
Description
Technical Field
The invention relates to the technical fields of high-performance computing and numerical weather prediction, and in particular to a multi-GPU heterogeneous computing method for a shallow cloud convection parameterization scheme.
Background
The accuracy and timeliness of weather forecasts are matters of national importance and public livelihood. In meteorology, numerical simulation of weather and climate has long driven weather forecasting. In recent years, numerical weather prediction has been moving toward kilometer-scale resolution globally and hectometer-scale resolution over key regions. As model resolution rises, so does the demand for computing resources; with limited resources, how to build efficient weather and climate models has become a growing concern in both the meteorological and computing communities.
A General Circulation Model (GCM) is a mathematical model describing a planetary atmosphere or ocean, widely used in weather forecasting, climate simulation, climate change projection, and other applications; it consists mainly of two parts, the dynamical core and the physical processes. Among the atmospheric physical processes, cloud convection is critical, and its numerical simulation is so computationally demanding that it greatly limits the simulation efficiency of the whole general circulation model. Mainstream cloud convection parameterization schemes divide into deep convection and shallow convection schemes; the shallow convection (UWshcu) scheme developed at the University of Washington is currently applied in many climate models, such as the Geophysical Fluid Dynamics Laboratory (GFDL) coupled model CM3, the Community Atmosphere Model CAM5, and the atmospheric general circulation model IAP AGCM developed by the Institute of Atmospheric Physics, Chinese Academy of Sciences.
In modern supercomputers, hybrid nodes combining central processing units and graphics processing units (CPU + GPU) have become mainstream. Because the clock rates of conventional CPUs have nearly stalled, GPU-accelerated computation is drawing increasing interest. In a high-resolution weather and climate model, the number of global horizontal grid points reaches millions or even tens of millions; parallelizing with CPUs alone would consume enormous computing resources, whereas the many cores of a GPU fit this computational demand well and deliver million-way or greater parallelism at very low cost. For the shallow convection parameterization scheme UWshcu, meteorological researchers have greatly improved the scheme itself, but accelerating its computation has long been neglected, so that when coupled into weather models it consumes considerable computing resources and time. The invention aims to use a modern CPU + GPU heterogeneous computing system to accelerate the shallow convection scheme, to develop an efficient heterogeneous computing method for the shallow convection parameterization scheme, and to resolve the computational bottleneck of the cloud convection physical process in atmospheric models.
Disclosure of Invention
The invention aims to provide a multi-GPU heterogeneous computing method for a shallow cloud convection parameterization scheme, so as to improve the computational efficiency of the shallow cloud convection physical process in a general circulation model.
To achieve this purpose, the technical scheme of the invention is as follows:
Step 1: define and initialize, on the host side, the data required by the shallow cloud convection physical process, and distribute the data through MPI to the host-side nodes for computation.
Step 2: drive the different CPU cores and GPU devices through the MPI interface. Obtain the process rank and the total number of processes with MPI_Comm_rank and MPI_Comm_size; obtain the number of GPU devices on a node with cudaGetDeviceCount; and assign the GPU device on which each MPI process computes with cudaSetDevice.
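The device binding of Step 2 can be sketched as follows. This is a minimal illustration, not the exact code of the invention: the round-robin mapping `rank % ndevices` and the function name `bind_process_to_gpu` are assumptions.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Bind the calling MPI process to one GPU on its node.
// Assumes MPI_Init has already been called; error checking elided.
void bind_process_to_gpu(void)
{
    int rank, nprocs, ndevices;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  // total number of processes
    cudaGetDeviceCount(&ndevices);           // GPUs visible on this node
    cudaSetDevice(rank % ndevices);          // one GPU per process, round-robin
}
```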
Step 3: data transfer and kernel launch. Allocate GPU device memory with the CUDA API cudaMalloc; transfer data between host and device with cudaMemcpy; launch the kernel as compute_uwshcu<<<dim3(blocknum), dim3(blocksize)>>>(parameters) to perform the computation; and release the GPU device memory with cudaFree().
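Step 3 can be sketched as below. The flat array of `n` doubles, the in-place output, and the wrapper name `run_uwshcu` are assumptions for illustration, not the invention's exact interface.

```cuda
#include <cuda_runtime.h>

__global__ void compute_uwshcu(double *state, int n);  // kernel of Step 4

// Host-side wrapper: allocate, copy in, launch, copy out, free.
void run_uwshcu(const double *host_in, double *host_out, int n,
                int blocknum, int blocksize)
{
    double *dev;
    size_t bytes = (size_t)n * sizeof(double);
    cudaMalloc(&dev, bytes);                                  // GPU memory
    cudaMemcpy(dev, host_in, bytes, cudaMemcpyHostToDevice);  // host -> device
    compute_uwshcu<<<dim3(blocknum), dim3(blocksize)>>>(dev, n);
    cudaMemcpy(host_out, dev, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(dev);                                            // release memory
}
```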
Step 4: writing the kernel function. The solver part of the UWshcu code is encapsulated in the compute_uwshcu subroutine, and compute_uwshcu_inv calls compute_uwshcu and initializes the input data. Under the CUDA programming model, the invention makes compute_uwshcu_inv a host function and compute_uwshcu a kernel function. compute_uwshcu_inv is responsible for data transfer between the host and the device and for launching the kernel compute_uwshcu to perform one-dimensional parallel computation on the GPU. Before compute_uwshcu begins computing the horizontal column data, a findsp function is called to compute wet-bulb temperature and specific humidity; findsp has a small computational load and data dependences that make it unsuitable for parallelization, so it is rewritten as a host-side function called by compute_uwshcu_inv. The remaining 15 subfunctions called by compute_uwshcu are all made device functions. In the serial algorithm, compute_uwshcu defines a large number of temporary arrays that would suffer read-write conflicts between threads, so cudaMalloc is used to place these arrays in global memory and the horizontal dimension (pcols) is added to each array to eliminate the conflicts.
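The conflict-elimination trick at the end of Step 4 can be sketched as follows: a temporary that was a single array of length mkx in the serial Fortran gains a leading pcols dimension, so each thread (one per horizontal column) writes only its own row. The array and variable names here are illustrative, not the invention's actual identifiers.

```cuda
// tmp was a single tmp(mkx) array in the serial code; on the GPU it is
// allocated with cudaMalloc as pcols x mkx so each thread owns a private slice.
__global__ void use_private_temporaries(double *tmp, int pcols, int mkx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // horizontal column index
    if (i >= pcols) return;
    double *my_tmp = tmp + (size_t)i * mkx;  // this thread's row only
    for (int k = 0; k < mkx; ++k)
        my_tmp[k] = 0.0;  // safe: no other thread reads or writes row i
}
```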
Step 5: optimizing the kernel function. Kernel memory access is optimized: among the variables of compute_uwshcu, constants are stored in GPU constant memory, and the two- and three-dimensional arrays whose storage footprint is too large are placed in global memory. For the one-dimensional arrays, access pressure is relieved by combining shared memory and registers: one-dimensional arrays with no write conflicts between threads are kept in shared memory, and the remaining arrays are held in registers.
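The memory placement of Step 5 can be sketched as follows, with illustrative names and sizes: the vertical extent MKX, the block size of 256 threads, and the constant c_grav are all assumptions for the sketch.

```cuda
#define MKX 32     // assumed number of vertical levels
#define BLOCK 256  // assumed threads per block

__constant__ double c_grav;  // read-only physical constant in constant memory

__global__ void uwshcu_mem_layout(const double *big3d,   // large 2-D/3-D
                                  double *out, int pcols) // arrays: global mem
{
    __shared__ double s_buf[BLOCK];  // 1-D array without inter-thread write
                                     // conflicts: shared memory
    double r_col[MKX];               // small per-thread 1-D array: registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= pcols) return;
    s_buf[threadIdx.x] = big3d[i];
    for (int k = 0; k < MKX; ++k)
        r_col[k] = s_buf[threadIdx.x] * c_grav;
    out[i] = r_col[MKX - 1];
}
```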
The invention has the following advantages:
The invention combines the CPU parallel technology MPI with the GPU parallel technology CUDA to accelerate the computation of UWshcu on a large-scale GPU cluster. On 16 NVIDIA Tesla V100 GPUs, the method achieves a 153.15x speedup over a single Intel Xeon E5-2680 CPU (10 cores) and a speedup as high as 1042.30x over the serial algorithm running on a single Intel Xeon E5-2680 CPU core, greatly improving the computational efficiency of the shallow cloud convection parameterization scheme UWshcu.
Drawings
FIG. 1 shows a schematic structure of UWshcu.
Fig. 2 is a schematic diagram of UWshcu multi-GPU heterogeneous computation based on MPI + CUDA.
FIG. 3 shows a three-dimensional subdivision of UWshcu based on MPI + CUDA.
Fig. 4 shows the runtime of UWshcu on three GPU cards and multiple CPU cores (10 cores).
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings and tables of the embodiments.
The method for porting the serial UWshcu Fortran code to parallel CUDA C is shown in Algorithm 1 and Algorithm 2:
the MPI + CUDA parallel programming model is based on multiple CPU cores and multiple GPU cards, wherein the MPI is responsible for communication operation among different nodes, the calculation tasks are divided in a coarse granularity mode, CUDA API is used for driving GPU equipment, and the calculation tasks are parallel in a finer granularity mode. For the example of a 0.25 ° resolution GCM, the horizontal column parameter is pcols. FIG. 2 is a schematic diagram of a UWshcu physical structure three-dimensionally subdivided using an MPI + CUDA two-level hybrid programming model. When n MPI processes are used for operating GCM, each MPI process is responsible for the calculation tasks of 1152 x 768/n horizontal grid points, one MPI process starts one GPU acceleration card, n GPU cards are used for performing fine-grained parallel calculation in total, each card needs to complete the calculation of 1152 x 768/n horizontal grid points in total, and therefore each GPU card can complete the calculation of one global horizontal grid point only by repeatedly calling the UWshcu mode for 1152 x 768/n/pcs for times. Algorithm 3 describes a method for implementing two-level parallelism of UWshcu using an MPI + CUDA hybrid programming model:
in order to compare the acceleration effects of a single GPU and a single CPU, a 10-thread version of OpenMP of UWshcu is used on an Intel Xeon E5-2680 v2 CPU for testing, and performance comparison is performed with UWshcu using acceleration of a single GPU card, with the result shown in fig. 3. The experimental result shows that the performance release of the UWshcu heterogeneous acceleration scheme on a single GPU is far better than that of a CPU with a small number of computational cores, the data transmission time is calculated on a Tesla V100 GPU, the overall operation time of the algorithm is 11.05 times faster than that of the single CPU (10 cores), and the UWshcu GPU acceleration scheme is superior to that of the CPU.
The multi-GPU acceleration algorithm was first tested on multiple nodes of a K20 GPU cluster and a V100 GPU cluster; each node of the K20 cluster uses 2 Tesla K20 accelerator cards and each node of the V100 cluster uses 4 Tesla V100 accelerator cards. It was compared with the single-CPU OpenMP-accelerated version of UWshcu, with the experimental results shown in Tables 1 and 2. On 16 V100 GPUs, the multi-GPU acceleration algorithm of UWshcu is 153.15 times faster than a single CPU, and its speedup over the serial algorithm running on a single CPU core is as high as 1042.30x, which demonstrates the effectiveness of the multi-GPU acceleration algorithm. In addition, because MPI performs the first-level decomposition of the UWshcu workload over the global horizontal grid points and each process computes independently, the algorithm has high parallel efficiency: in the experiments on the V100 GPUs, the average parallel efficiency reaches 91%. It follows that this method is very effective in improving the computational efficiency of shallow convection parameterization schemes.
TABLE 3 Speedup ratio of the UWshcu multi-GPU acceleration algorithm on the K20 cluster compared with the single-CPU acceleration algorithm
TABLE 4 Speedup ratio of the UWshcu multi-GPU acceleration algorithm on the V100 cluster compared with the single-CPU acceleration algorithm
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Claims (5)
1. A shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs, characterized in that: the shallow convection parameterization scheme UWshcu in a general circulation model is accelerated mainly through an MPI and CUDA hybrid programming model, so that the computational efficiency of the shallow cloud convection physical process is improved.
2. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 1, characterized in that: the parameters required by the UWshcu computation are partitioned through MPI, and the partitioned data are dispatched to designated nodes for computation.
3. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 2, characterized in that: each MPI process starts one CPU core for computation, and each CPU core is responsible for the following operations: allocating GPU device memory, transferring data between the host and the GPU, launching the kernel function, and releasing the GPU device memory after the kernel computation completes.
4. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 3, characterized in that: the subroutine compute_uwshcu(parameters) in the serial code is rewritten as a GPU-side kernel function of the form __global__ void compute_uwshcu(parameters).
5. The multi-GPU based shallow cloud convection parameterization scheme heterogeneous computing method according to claim 4, characterized in that: according to the physical characteristics of the UWshcu scheme, its computation has no dependences in the horizontal dimension, so the many CUDA cores of the GPU are used to parallelize the horizontal-dimension computation at thread level; after the computation finishes, the results required by the host side are transferred from the GPU back to the CPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211390967.6A CN115756605A (en) | 2022-11-07 | 2022-11-07 | Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115756605A true CN115756605A (en) | 2023-03-07 |
Family
ID=85357462
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117032841A (en) * | 2023-08-04 | 2023-11-10 | 太初(无锡)电子科技有限公司 | High-performance transfer method of kernel function parameters in heterogeneous computing and heterogeneous computing system
CN117032841B (en) * | 2023-08-04 | 2024-04-26 | 太初(无锡)电子科技有限公司 | High-performance transfer method of kernel function parameters in heterogeneous computing and heterogeneous computing system
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |