CN112733401B - Finite element tearing butt joint method and system for numerical simulation of reactor core assembly - Google Patents
Finite element tearing butt joint method and system for numerical simulation of reactor core assembly
- Publication number
- CN112733401B CN112733401B CN202011607981.8A CN202011607981A CN112733401B CN 112733401 B CN112733401 B CN 112733401B CN 202011607981 A CN202011607981 A CN 202011607981A CN 112733401 B CN112733401 B CN 112733401B
- Authority
- CN
- China
- Prior art keywords
- finite element
- matrix
- dense
- dense matrix
- strategy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 162
- 210000001503 joint Anatomy 0.000 title claims abstract description 26
- 238000004088 simulation Methods 0.000 title claims abstract description 18
- 239000011159 matrix material Substances 0.000 claims abstract description 100
- 239000013598 vector Substances 0.000 claims abstract description 52
- 230000015654 memory Effects 0.000 claims abstract description 43
- 238000004364 calculation method Methods 0.000 claims abstract description 34
- 238000004891 communication Methods 0.000 claims abstract description 25
- 230000001133 acceleration Effects 0.000 claims abstract description 12
- 239000008358 core component Substances 0.000 claims description 15
- 238000006073 displacement reaction Methods 0.000 claims description 14
- 230000001105 regulatory effect Effects 0.000 claims description 3
- 238000004590 computer program Methods 0.000 claims description 2
- 230000000712 assembly Effects 0.000 claims 1
- 238000000429 assembly Methods 0.000 claims 1
- 230000000875 corresponding effect Effects 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 2
- 238000002939 conjugate gradient method Methods 0.000 description 2
- 238000000354 decomposition reaction Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000000463 material Substances 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 239000000306 component Substances 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 239000012530 fluid Substances 0.000 description 1
- 239000000446 fuel Substances 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/23—Design optimisation, verification or simulation using finite element methods [FEM] or finite difference methods [FDM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02E—REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
- Y02E30/00—Energy generation of nuclear origin
- Y02E30/30—Nuclear fission reactors
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- Geometry (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Monitoring And Testing Of Nuclear Reactors (AREA)
Abstract
The invention discloses a finite element tearing butt joint method and system for numerical simulation of a reactor core assembly. Each of the n computing nodes is provided with the finite element tearing butt joint system, and each computing node is provided with g GPU accelerators. The invention adopts a load balancing strategy so that the dense matrix memory size of each process tends toward the average value, cluster resources are fully utilized, and the solving speed is increased. HIP programming is employed so that the finite element tearing butt joint method can run on both the Nvidia CUDA platform and the AMD ROCm platform. In the dense matrix-vector multiplication stage of the iterative solution, a dynamic matrix allocation strategy assigns a suitable amount of computation to each processor, so that computing resources are fully utilized and the solution is accelerated. In the vector inner-product stage, a vector inner-product acceleration strategy and a communication-computation overlap strategy are adopted; by introducing communication threads, communication waiting time is reduced and the vector inner product is accelerated.
Description
Technical Field
The invention relates to finite element tearing butt joint processing technology, and in particular to a finite element tearing butt joint method and system for numerical simulation of reactor core assemblies.
Background
The core components in a nuclear reactor can deform and wear the fuel rods under conditions of high temperature, irradiation, fluid flow, pressure and the like, causing a series of problems such as difficulty in loading and unloading, component damage and fatigue failure, which affect the safe operation of the reactor. Because of the special arrangement of the core components and other factors, theoretical analysis is very difficult, so numerical simulation of the core components must be carried out using the finite element method.
The finite element tearing butt joint method (Finite Element Tearing and Interconnecting, FETI) is an effective scheme for solving reactor structural mechanics problems. It is mainly used to treat the large-scale problems obtained by discretizing partial differential equations, is an important method for large-scale numerical simulation of reactor core assemblies, and is also applicable to fields such as electromagnetics, aviation technology and mechanical manufacturing. The FETI method was originally proposed by C. Farhat and F.-X. Roux in the field of structural mechanics as a non-overlapping domain decomposition method: the model is divided into a number of non-overlapping subdomains, and each subdomain is independent. To ensure continuity between subdomains, the FETI method adds a set of unknowns (Lagrange multipliers, LM); in the actual solution, a Krylov subspace iterative method is generally adopted to solve for the LM, and then a subdomain equation is solved in each subdomain.
However, the original FETI method is not computationally efficient. To address this, Farhat et al. proposed the FETI-DP method (dual-primal unified FETI method) in 2001, which eliminates the need for a second set of Lagrange multipliers and unifies all previously developed one-level and two-level FETI methods into a single dual-primal framework. FETI-DP is more robust than the FETI method, has higher computational efficiency, and is suitable for solving second-order and fourth-order problems. In 2006, the TFETI (Total FETI) method was proposed by Dostál et al. This method is a variant of the FETI method in which the Dirichlet boundary conditions are also enforced through the LM (Lagrange multipliers); however, the coarse problem remains an important factor limiting the scalability of the FETI method.
To reduce the impact of the coarse problem and improve scalability, Klawonn and Rheinbach proposed the HFETI (Hybrid FETI) method in 2010. The method combines the FETI and FETI-DP methods and groups several subdomains into one cluster, so it can be regarded as a three-level domain decomposition method. First, a FETI-DP system is set up to handle all clusters. Each cluster is then composed of multiple subdomains, which are processed using the conventional FETI method. Similarly, in 2012, Kozubek et al. proposed a related method, the HTFETI (Hybrid Total FETI) method. It combines the FETI and TFETI methods, using the TFETI method for the subdomains within each cluster and the FETI method with projection for the clusters. The HTFETI method can effectively reduce the coarse problem.
However, in the iterative solution of FETI the sparse matrix-vector operations consume a lot of time, so Riha et al. proposed the LSC (Local Schur Complement) method in 2016, replacing the sparse matrix-vector operations with more efficient dense matrix-vector multiplication (GEMV), a strategy that trades memory space for time. This dense BLAS level-2 operation has contiguous memory access patterns and therefore performs better for memory-bound applications. At the same time, dense matrix-vector multiplication is well suited to processing on a GPU accelerator. Thus Vavrik et al. used CUDA programming in 2018 to offload the dense matrix-vector multiplication to the GPU.
However, the existing finite element tearing butt joint method still has the following problems to be solved urgently: 1) the known finite element tearing butt joint solvers that support heterogeneous parallelism use CUDA programming; however, a solver implemented with CUDA programming can only run on the Nvidia CUDA platform and does not support other types of GPU accelerators; 2) when the GPU is used to compute the dense matrix-vector multiplication, the CPU is idle, and the computing resources of the cluster are not fully utilized; 3) in actual numerical simulation of the reactor core assembly, the dense matrix memory sizes assembled by the processes differ greatly (the computation time and memory size can differ by a factor of 6), so processes with less computation spend a great amount of time waiting for the other processes, which increases the solving time.
Disclosure of Invention
The invention aims to solve the problems of the existing finite element tearing butt joint method and provides a finite element tearing butt joint system for numerical simulation of reactor core assemblies, which fully utilizes cluster resources, accelerates the solving speed, reduces communication waiting time and improves portability.
A finite element tearing butt joint system for numerical simulation of a reactor core assembly comprises an input module, a region dividing module, a matrix assembly module, a resource collection module, a load balancing module, an iterative solving module and a local solving module.
The input module is used for acquiring the grid file data and carrying out initialization parameter setting.
The region dividing module is used for dividing the grid into a plurality of regions and dividing each region into a plurality of sub-regions.
The matrix assembly module is used for generating a corresponding finite element matrix in each subarea.
The resource collection module is used for collecting the dense matrix size information of each process and comparing occupied memories.
The load balancing module is used for calling a load balancing strategy and reallocating the dense matrix of each process.
The iteration solving module is used for solving the displacement of the boundary node of each region by adopting the existing iteration method; and invoking a vector inner product acceleration policy and a communication computation overlap policy.
The local solving module is used for solving the displacement of the internal nodes of each region.
Each of the n computing nodes is provided with the finite element tearing butt joint system, and each computing node is provided with g GPU accelerators.
The invention further aims to provide a finite element tearing butt joint method for numerical simulation of a reactor core assembly, which comprises the following specific steps:
step 1: and obtaining geometric model data of the reactor core assembly, and meshing the geometric model data through the existing software to generate a mesh file.
Step 2: each computing node acquires a grid file of the reactor core assembly through an input module, and initializes related parameters: finite element method, iterative method, maximum iteration number, iterative accuracy, core component material parameters, core component boundary conditions, etc.
The finite element method may be a FETI or HTFETI.
Step 3: each computing node in the n computing nodes starts g processes, each process starts T threads, the grid obtained by the input module is divided into g x n areas through the area dividing module, and each area is allocated with one process; while each region is further divided into s sub-regions.
Step 4: and each process generates a corresponding finite element matrix in each subdomain through a matrix assembly module according to the allocated region and the selected finite element method, and each subdomain generates a dense matrix. Thus, each process generates s dense matrices.
Step 5: the resource collection module is utilized to collect dense matrix information of each process, and the occupied memory size of the dense matrix of the process i is L i Let L min =min{L 1 ,L 2 ,L 3 ...L n*g },L max =max{L 1 ,L 2 ,L 3 ...L n*g }. If it isX represents a threshold value, and the phenomenon of unbalanced load of the reactor core assembly occurs in the finite element processing process, and the reactor core assembly is regulated by adopting a load balancing strategy and enters a step 6; otherwise, the load balancing in the finite element processing process is considered, and the step 7 is directly carried out.
Step 6: enabling a load balancing strategy through a load balancing module, and adjusting the size of the memory occupied by the matrix of each process to be near the average value, wherein the method specifically comprises the following steps:
6-1, calculating the average memory size of the dense matrix according to the memory size of the dense matrix of each process;
6-2, comparing the dense matrix memory size of each process with the average value; if it is larger than the average value, the computation load of the process is considered large and it needs help from other processes, so the process is set as a helped; if it is smaller than the average value, the computation load is considered small and the process can help others, so it is set as a helper;
6-3, dividing the processes into two groups, the helpers in one group and the helped in the other, sorting each group according to the dense matrix memory size, and correspondingly selecting one helper and one helped;
6-4 the helped sends 1 dense matrix to the helper;
6-5 repeating step 6-4 until either the dense matrix memory of the current helped is smaller than the average value, in which case the next helped is selected, or the dense matrix memory of the current helper is larger than the average value, in which case the next helper is selected; then returning to step 6-4;
6-6 repeating steps 6-4 to 6-5 until the dense matrix memory of all helped processes is less than the average value, or the dense matrix memory of all helpers is greater than the average value.
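The pairing procedure of steps 6-1 to 6-6 can be sketched serially as follows (an illustrative model of the distributed exchange; for simplicity it assumes every dense matrix occupies the same amount of memory, which the method does not require):

```python
# Serial sketch of the load-balancing strategy: processes above the
# average dense-matrix memory become "helped", the others "helpers";
# one matrix at a time moves from a helped process to a helper until
# the helped drops below the average or the helper rises above it.

def balance(loads, matrix_size):
    """loads: per-process dense-matrix memory; matrix_size: memory of
    one dense matrix. Returns (matrices_sent_per_process, new_loads);
    negative entries in the first list mean matrices received."""
    avg = sum(loads) / len(loads)
    cur = list(loads)
    sent = [0] * len(loads)
    helped = sorted((i for i, l in enumerate(loads) if l > avg),
                    key=lambda i: -loads[i])       # largest first
    helpers = sorted((i for i, l in enumerate(loads) if l <= avg),
                     key=lambda i: loads[i])       # smallest first
    h = 0
    for src in helped:
        # steps 6-4/6-5: keep sending until this helped is below average
        while cur[src] > avg and h < len(helpers):
            dst = helpers[h]
            cur[src] -= matrix_size
            cur[dst] += matrix_size
            sent[src] += 1
            sent[dst] -= 1
            if cur[dst] > avg:        # this helper is now full: next one
                h += 1
        if h == len(helpers):         # step 6-6 stop condition
            break
    return sent, cur
```

On the loads [10, 2, 6, 2] with unit matrix size 2, the sketch moves every process to within one matrix of the average of 5.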
Step 7: and (3) carrying out iterative solution on each process by an iterative solution module, wherein in each step of iterative solution, vector inner product operation adopts a vector inner product acceleration strategy and a communication calculation overlap strategy, dense matrix vector multiplication adopts HIP (heterogeneous calculation portable interface) programming to calculate on a similar GPU (graphic processing unit) accelerator, and adopts a dynamic matrix allocation strategy.
The vector inner product acceleration strategy is to solve the local vector inner products of all the processes in parallel by multiple threads.
The communication-computation overlap strategy is that each process uses 1 thread for communication while the remaining T-1 threads continue the local vector inner-product computation; the communication thread rejoins the local vector inner-product computation after completing its communication.
The dynamic matrix allocation strategy is that, when performing dense matrix-vector multiplication, each process uses 1 thread to call the hipBLAS library and perform dense matrix-vector multiplication on the GPU-like accelerator, while the other T-1 threads call the Intel MKL library and perform dense matrix-vector multiplication on the CPU. At each iteration, the number of matrices is dynamically distributed to the CPU and the GPU-like accelerator according to their dense matrix-vector multiplication times in the previous iteration. The specific formulas are:

x_tmp = N * (x_c / t_c) / (x_c / t_c + x_d / t_d)
x_c_sub = ceil((N - x_tmp) / (T - 1))
x_d_next = x_c_sub * (T - 1)
x_c_next = N - x_d_next

where N represents the total number of dense matrices that the current process needs to process; x_c_next and x_d_next represent the numbers of dense matrices assigned to the GPU-like accelerator and to the CPU in the next iteration; x_c and x_d represent the numbers assigned in the previous iteration; t_c and t_d represent the previous iteration's computation times on the GPU-like accelerator and on the CPU; x_c_sub represents the number of dense matrices assigned to a single CPU core; and x_tmp is a temporary variable.
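The redistribution can be sketched as follows, under the reading that matrices are reassigned in proportion to the throughput each device achieved in the previous iteration (function and variable names are illustrative, not from the patent):

```python
# Dynamic matrix-allocation sketch: x_gpu/x_cpu are the matrix counts of
# the previous iteration, t_gpu/t_cpu the measured times; the CPU share
# is rounded up so each of the T-1 CPU threads gets a whole matrix count.
import math

def redistribute(n_total, x_gpu, x_cpu, t_gpu, t_cpu, cpu_threads):
    v_gpu = x_gpu / t_gpu            # matrices per unit time on the GPU
    v_cpu = x_cpu / t_cpu            # matrices per unit time on the CPU
    x_tmp = n_total * v_gpu / (v_gpu + v_cpu)             # ideal GPU share
    x_c_sub = math.ceil((n_total - x_tmp) / cpu_threads)  # per CPU core
    x_cpu_next = min(n_total, x_c_sub * cpu_threads)
    return n_total - x_cpu_next, x_cpu_next   # (GPU share, CPU share)
```

For example, if the GPU processed 50 matrices in 1 s while the CPU threads processed 50 matrices in 3 s, the next iteration shifts most of the work toward the GPU.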
Step 8: and each process obtains the displacement of the internal nodes through the local solving module according to the iteration solving result, so as to obtain the displacement of all the nodes.
It is a further object of the present invention to provide a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the above-mentioned method.
It is a further object of the present invention to provide a computing device comprising a memory having executable code stored therein and a processor which, when executing the executable code, implements the method described above.
The beneficial effects of the invention are as follows:
when the finite element tearing butt joint method is used for numerical simulation of the reactor core assembly, a load balancing strategy is adopted, so that the memory size of the dense matrix of each process tends to be the average value, cluster resources can be fully utilized, and the solving speed is increased. Meanwhile, the invention adopts HIP programming, so that the finite element tearing butt joint method can run on an Nvidia CUDA platform and an AMD ROCm platform, and the portability of codes is improved. In the dense matrix vector multiplication stage of the iterative solving process, a dynamic matrix allocation strategy is adopted, so that different processors are allocated to proper calculated quantities, and the calculating resources are fully utilized, so that the solving speed is increased. In the vector inner product stage, a vector inner product acceleration strategy and a communication calculation overlapping strategy are adopted, and communication waiting time is reduced and the vector inner product speed is accelerated by introducing communication threads.
Drawings
FIG. 1 is a flow chart of a finite element tear butt acceleration method;
FIG. 2 is a dense matrix memory occupancy map;
FIG. 3 is a dense matrix vector multiplication computation time contrast graph.
Detailed Description
The invention is further analyzed in connection with the following specific examples.
The finite element tearing butt joint method for numerical simulation of the reactor core assembly is applied to deformation prediction of the reactor core assembly under the high temperature condition.
A deformation prediction method of a reactor core component in a nuclear reactor comprises a finite element tearing butt-joint acceleration method facing numerical simulation of the reactor core component; the deformation condition of the current reactor core assembly can be obtained through the obtained solving result (namely the displacement of all nodes), so that a basis is provided for the analysis design of the reactor core assembly.
The following steps and descriptions are specific:
a finite element tearing butt joint system for numerical simulation of reactor core components comprises an input module, a region dividing module, a matrix assembling module, a resource collecting module, a load balancing module, an iteration solving module and a local solving module.
The input module is used for acquiring the grid file data and carrying out initialization parameter setting.
The region dividing module is used for dividing the grid into a plurality of regions and dividing each region into a plurality of sub-regions.
The matrix assembly module is used for generating a corresponding finite element matrix in each subarea.
The resource collection module is used for collecting the dense matrix size information of each process and comparing occupied memories.
The load balancing module is used for calling a load balancing strategy and reallocating the dense matrix of each process.
The iteration solving module is used for solving the displacement of the boundary node of each region by adopting the existing iteration method; and invoking a vector inner product acceleration policy and a communication computation overlap policy.
The local solving module is used for solving the displacement of the internal nodes of each region.
Each of the n computing nodes is provided with the finite element tearing butt joint system, and each computing node is provided with g GPU accelerators.
A finite element tearing butt joint method for numerical simulation of reactor core components is shown in fig. 1, and comprises the following specific steps:
step 1: and obtaining geometric model data of the reactor core assembly, and meshing the geometric model data through the existing software to generate a mesh file.
Step 2: each computing node acquires a grid file of the reactor core assembly through an input module, and initializes related parameters: finite element method, iterative method, maximum iteration number, iterative accuracy, core component material parameters, core component boundary conditions, etc.
The finite element method may be a FETI or HTFETI.
The iterative method is the preconditioned conjugate gradient method, used to solve the following dual interface problem of the finite element system:

F λ - G α = d
G^T λ = e

wherein:

F = B K^+ B^T
G = B R
d = B K^+ f
e = R^T f

Matrix B is the displacement compatibility matrix, which makes the node displacements on adjacent subdomain interfaces equal. Matrix K is the finite element stiffness matrix and K^+ is its generalized inverse. Matrix R is a basis of the null space of the stiffness matrix K, and the vector f is the load vector. λ and α are both unknowns. The variable α is eliminated by the projection matrix P = I - G(G^T G)^-1 G^T, and solving with the preconditioned conjugate gradient method yields λ, from which the displacements of the subdomain boundary nodes are obtained.
The specific algorithm is described as follows:
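A minimal serial NumPy sketch of a projected preconditioned conjugate gradient iteration of this kind (dense arrays stand in for the distributed FETI operators; no preconditioner beyond the projector P is applied, so this is an illustrative model rather than the patent's parallel implementation):

```python
# Projected CG for F·lam = d subject to G^T·lam = e, using the projector
# P = I - G (G^T G)^{-1} G^T to keep iterates in the constraint space.
import numpy as np

def projected_cg(F, G, d, e, tol=1e-10, max_iter=200):
    gtg_inv = np.linalg.inv(G.T @ G)
    P = np.eye(len(d)) - G @ gtg_inv @ G.T
    lam = G @ (gtg_inv @ e)          # feasible start: G^T lam = e
    r = P @ (d - F @ lam)            # projected residual
    w = r.copy()                     # search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Fw = F @ w
        alpha = (r @ r) / (w @ Fw)
        lam = lam + alpha * w
        r_new = r - alpha * (P @ Fw)  # project to stay in range(P)
        beta = (r_new @ r_new) / (r @ r)
        w = r_new + beta * w
        r = r_new
    return lam
```

Because every search direction lies in the range of P, the constraint G^T λ = e set by the feasible starting point is preserved throughout the iteration.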
step 3: n computing nodes, each computing node starts g processes, each process starts T threads, the grid obtained by the input module is divided into g x n areas through the area dividing module, and each area is allocated with one process; while each region is further divided into s sub-regions.
Step 4: and each process generates a corresponding finite element matrix in each subdomain through a matrix assembly module according to the allocated region and the selected finite element method, and each subdomain generates a dense matrix. Thus, each process generates s dense matrices.
Step 5: the resource collection module is utilized to collect dense matrix information of each process, and the occupied memory size of the dense matrix of the process i is L i Let L min =min{L 1 ,L 2 ,L 3 …L n*g },L max =max{L 1 ,L 2 ,L 3 …L n*g }. If it isThe reactor core component is considered to have the load imbalance phenomenon in the finite element processing process, and is required to be adjusted by adopting a load balancing strategy, and the step 6 is entered; otherwise, the load balancing in the finite element processing process is considered, and the step 7 is directly carried out.
Step 6: and starting a load balancing strategy through a load balancing module, and adjusting the size of the memory occupied by the matrix of each process to be near the average value.
The load balancing strategy specifically comprises the following steps:
a) Calculating the average memory size of the dense matrix according to the memory size of the dense matrix of each process;
b) Comparing the dense matrix memory size of each process with the average value; if it is larger than the average value, the computation load of the process is considered large and it needs help from other processes, so the process is set as a helped; if it is smaller than the average value, the computation load is considered small and the process can help others, so it is set as a helper;
c) Dividing the processes into two groups, the helpers in one group and the helped in the other, sorting each group according to the dense matrix memory size, and correspondingly selecting one helper and one helped;
d) The helped sends 1 dense matrix to the helper;
e) Repeating step d) until either the dense matrix memory of the current helped is smaller than the average value, in which case the next helped is selected, or the dense matrix memory of the current helper is larger than the average value, in which case the next helper is selected; then returning to step d);
f) Repeating steps d) and e) until the dense matrix memory of all helped processes is less than the average value, or the dense matrix memory of all helpers is greater than the average value.
Step 7: and (3) carrying out iterative solution on each process by an iterative solution module, wherein in each step of iterative solution, vector inner product operation adopts a vector inner product acceleration strategy and a communication calculation overlap strategy, dense matrix vector multiplication adopts HIP programming to calculate on a similar GPU accelerator, and adopts a dynamic matrix allocation strategy.
The vector inner product acceleration strategy is to solve the local vector inner products of all the processes in parallel by multiple threads.
The communication-computation overlap strategy is that each process uses 1 thread for communication while the remaining T-1 threads continue the local vector inner-product computation; when the communication thread finishes its communication, it rejoins the local vector inner-product computation. The specific algorithm is described as follows:
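The overlap can be modeled with plain Python threads as follows (time.sleep stands in for the inter-process reduction that the communication thread performs, which in the real solver would be an MPI call; names are illustrative):

```python
# One thread "communicates" first, the other T-1 threads compute local
# partial sums immediately; the communication thread rejoins the local
# inner-product computation as soon as its exchange completes.
import threading, time

def overlapped_local_dot(x, y, n_threads=4):
    partial = [0.0] * n_threads
    lock = threading.Lock()
    next_lo = [0]
    chunk = max(1, len(x) // (4 * n_threads))   # dynamic chunking

    def work(tid):
        while True:
            with lock:                          # grab the next chunk
                lo = next_lo[0]
                next_lo[0] += chunk
            if lo >= len(x):
                return
            hi = min(lo + chunk, len(x))
            partial[tid] += sum(a * b for a, b in zip(x[lo:hi], y[lo:hi]))

    def comm_then_work(tid):
        time.sleep(0.01)   # simulated communication (e.g. a reduction)
        work(tid)          # rejoin the local computation afterwards

    threads = [threading.Thread(target=comm_then_work if t == 0 else work,
                                args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)
```

Dynamic chunking lets the late-arriving communication thread pick up whatever work remains, so the result is identical regardless of how long the communication takes.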
the dynamic allocation matrix strategy is that when dense matrix vector multiplication is carried out, each process uses 1 thread to call a hipBLAS library, uses a block type GPU accelerator to carry out dense matrix vector multiplication calculation, and the other T-1 threads call an Intel MKL library, and uses a CPU to carry out dense matrix vector multiplication calculation. And dynamically distributing matrix quantity to the CPU and the GPU-like accelerator according to the computation time of multiplying the dense matrix vector by the CPU and the GPU-like accelerator during each iteration. The specific formula is as follows:
where N represents the total number of dense matrices that the current process needs to process,representing the number of dense matrices assigned to the class GPU accelerator in the next iteration, < >>Represents the number of dense matrices allocated by the CPU in the next iteration,/->Representing the number of dense matrices allocated to the GPU accelerator of the last iteration class, +.>Representing the number of dense matrices allocated by the CPU in the last iteration, t c Representing the calculation time of the GPU accelerator of the last iteration class, t d Representing the calculation time of CPU in last iteration, x c_sub Representing the number of dense matrices, x, to which a single CPU core is assigned tmp Is a temporary variable.
Step 8: and each process obtains the displacement of the internal nodes through the iteration solution result and the local solution module, thereby obtaining the displacement of all the nodes.
Fig. 2 compares the dense matrix memory sizes before and after load balancing, showing that the load balancing strategy adjusts the load of each process toward the average value, avoiding the excessive communication waiting caused by load imbalance and making full use of cluster resources. Fig. 3 shows the computation time of the dense matrix-vector multiplication before and after load balancing; the comparison shows that the load balancing strategy effectively accelerates the solution.
Claims (5)
1. A finite element tear butt joint method for numerical simulation of a reactor core assembly, comprising the steps of:
step 1: obtaining geometric model data of a reactor core assembly, and carrying out grid division on the geometric model data to generate a grid file;
step 2: each computing node acquires a grid file of a reactor core assembly and initializes related parameters;
step 3: each computing node in the n computing nodes starts g processes, each process starts T threads, grids of the reactor core assembly are divided into g x n areas, and each area is allocated with one process; while each region is further divided into s sub-regions;
step 4: each process generates a corresponding finite element matrix in each subdomain according to the allocated region and the selected finite element method, and each subdomain generates a dense matrix;
step 5: collecting dense matrix information of each process, and judging a load balancing phenomenon in a finite element processing process after comparison; if the load is considered to be unbalanced, the step 6 is carried out, otherwise, the step 7 is carried out;
step 6: starting a load balancing strategy, and adjusting the size of a matrix occupied memory of each process to be near the average value; the method specifically comprises the following steps:
6-1, calculating the average dense-matrix memory size from the dense-matrix memory size of each process;
6-2, comparing the dense-matrix memory size of each process with the average value; if it is larger than the average value, the computational load of the process is considered large and the process is designated a helped process; if it is smaller than the average value, the computational load is considered small and the process is designated a helper;
6-3, dividing the processes into two groups, the helpers in one group and the helped processes in the other, sorting each group by dense-matrix memory size, and pairing one helper with one helped process;
6-4, the helped process sends 1 dense matrix to the helper;
6-5, repeating step 6-4 until either the dense-matrix memory of the current helped process falls below the average value, in which case the next helped process is selected, or the dense-matrix memory of the current helper rises above the average value, in which case the next helper is selected; then returning to step 6-4;
6-6, repeating steps 6-4 to 6-5 until the dense-matrix memory of every helped process is below the average value or the dense-matrix memory of every helper is above the average value;
step 7: each process carries out iterative solution; in each iteration, a vector inner-product acceleration strategy and a communication-computation overlap strategy are adopted for the vector inner-product operations, and dense matrix-vector multiplication is computed on a GPU-like accelerator using HIP programming together with a dynamic matrix allocation strategy; the vector inner-product acceleration strategy solves the local vector inner product of each process in parallel with multiple threads; the communication-computation overlap strategy has each process use 1 thread for communication while the remaining T-1 threads continue the local vector inner-product computation, the communication thread rejoining the local computation after its communication completes; the dynamic matrix allocation strategy means that, during dense matrix-vector multiplication, each process uses 1 thread to call the hipBLAS library and perform the computation on the GPU-like accelerator in a blocking manner, while the other T-1 threads call the Intel MKL library and perform the computation on the CPU; at each iteration, the numbers of matrices allocated to the CPU and to the GPU-like accelerator are adjusted dynamically according to their respective dense matrix-vector multiplication times;
step 8: each process obtains the displacement of its internal nodes from the iterative solution result, thereby obtaining the displacements of all nodes.
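The load balancing procedure of steps 6-1 to 6-6 above can be sketched as a single-machine Python illustration. The function name `balance_dense_matrices`, the list-of-lists representation of per-process matrix memory sizes, and the choice of always sending the last matrix in a list are assumptions for illustration; the actual method exchanges matrices between MPI processes, which is omitted here.

```python
def balance_dense_matrices(matrices_per_process):
    """Redistribute dense matrices so each process's total memory
    approaches the average (steps 6-1 to 6-6, simplified sketch)."""
    # 6-1: average dense-matrix memory over all processes
    totals = [sum(m) for m in matrices_per_process]
    avg = sum(totals) / len(totals)
    # 6-2: over-average processes are "helped", under-average are helpers
    helped = sorted((i for i, t in enumerate(totals) if t > avg),
                    key=lambda i: -totals[i])
    helpers = sorted((i for i, t in enumerate(totals) if t < avg),
                     key=lambda i: totals[i])
    # 6-3 to 6-6: pair them up and move one matrix at a time
    hi, lo = 0, 0
    while hi < len(helped) and lo < len(helpers):
        src, dst = helped[hi], helpers[lo]
        if totals[src] <= avg:   # helped process is light enough now
            hi += 1
            continue
        if totals[dst] >= avg:   # helper process is full now
            lo += 1
            continue
        # 6-4: the helped process sends one dense matrix to the helper
        m = matrices_per_process[src].pop()
        matrices_per_process[dst].append(m)
        totals[src] -= m
        totals[dst] += m
    return matrices_per_process
```

The moved quantity here is only the matrix's memory size; in the real method the matrix data itself would be transferred and the receiving process would take over the corresponding computation.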
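The dynamic matrix allocation strategy of step 7 can be illustrated by a proportional-throughput split between the CPU and the GPU-like accelerator. The exact formula of the claims is not reproduced in the text, so the sketch below is only an approximation: `next_split` and its signature are invented names, and the proportional rule is an assumption consistent with the stated goal of redistributing matrices according to the previous iteration's computation times.

```python
def next_split(n_total, x_gpu_prev, x_cpu_prev, t_gpu, t_cpu):
    """Rebalance the dense-matrix count between the GPU-like accelerator
    and the CPU using the previous iteration's measured times; the device
    that processed matrices faster receives proportionally more work."""
    # per-matrix throughput observed in the last iteration
    rate_gpu = x_gpu_prev / t_gpu
    rate_cpu = x_cpu_prev / t_cpu
    # split the total in proportion to the observed throughputs
    x_gpu = round(n_total * rate_gpu / (rate_gpu + rate_cpu))
    x_gpu = max(0, min(n_total, x_gpu))
    return x_gpu, n_total - x_gpu
```

For example, if the accelerator handled 50 matrices in 1 s while the CPU handled 50 matrices in 4 s, the accelerator is four times faster per matrix and receives four fifths of the next batch.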
2. The finite element tearing butt joint method for numerical simulation of a reactor core assembly according to claim 1, wherein judging load imbalance in the finite element processing in step 5 specifically comprises:
Let the dense matrix of process i occupy memory of size L_i; let L_min = min{L_1, L_2, L_3, ..., L_{n*g}} and L_max = max{L_1, L_2, L_3, ..., L_{n*g}}. If the imbalance between L_max and L_min exceeds the threshold X, load imbalance occurs in the finite element processing of the reactor core assembly; the load balancing strategy is adopted for adjustment and the method enters step 6. Otherwise the load in the finite element processing is considered balanced and the method proceeds directly to step 7.
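The imbalance test of claim 2 compares a quantity derived from L_min and L_max against the threshold X. The exact inequality is not reproduced in the text; the relative spread (L_max - L_min)/L_max used below is an illustrative assumption.

```python
def is_unbalanced(loads, threshold):
    """Assumed imbalance test: the spread between the largest and
    smallest per-process dense-matrix memory, relative to the largest,
    exceeds the threshold X. The exact inequality in the published
    claim is not reproduced here, so this ratio is an assumption."""
    l_min, l_max = min(loads), max(loads)
    return (l_max - l_min) / l_max > threshold
```

When the test returns true, the load balancing strategy of step 6 is started; otherwise the method proceeds directly to the iterative solution of step 7.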
3. The finite element tearing butt joint method for numerical simulation of a reactor core assembly according to claim 2, wherein the specific formula of step 7 is as follows:
where N represents the total number of dense matrices the current process needs to process; the quantities in the formula are the number of dense matrices allocated to the GPU-like accelerator in the next iteration, the number of dense matrices allocated to the CPU in the next iteration, the number of dense matrices allocated to the GPU-like accelerator in the last iteration, and the number of dense matrices allocated to the CPU in the last iteration; t_c represents the computation time of the GPU-like accelerator in the last iteration, t_d represents the computation time of the CPU in the last iteration, x_c_sub represents the number of dense matrices allocated to a single CPU core, and x_tmp is a temporary variable.
4. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-3.
5. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011607981.8A CN112733401B (en) | 2020-12-30 | 2020-12-30 | Finite element tearing butt joint method and system for numerical simulation of reactor core assembly |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112733401A CN112733401A (en) | 2021-04-30 |
CN112733401B true CN112733401B (en) | 2024-03-12 |
Family
ID=75610898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011607981.8A Active CN112733401B (en) | 2020-12-30 | 2020-12-30 | Finite element tearing butt joint method and system for numerical simulation of reactor core assembly |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112733401B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102232282A (en) * | 2010-10-29 | 2011-11-02 | Huawei Technologies Co., Ltd. | Method and apparatus for realizing load balance of resources in data center
CN103731498A (en) * | 2013-12-31 | 2014-04-16 | Zhejiang Hongcheng Computer Systems Co., Ltd. | Big data real-time query system load balancing method based on replica selection
CN105045670A (en) * | 2015-09-01 | 2015-11-11 | Inspur (Beijing) Electronic Information Industry Co., Ltd. | Method and system for balancing loads of central processing units and graphics processing units
CN110472187A (en) * | 2019-08-06 | 2019-11-19 | China Institute of Atomic Energy | Load-balanced parallel method for the three-dimensional neutron transport method of characteristics
CN112016232A (en) * | 2020-08-31 | 2020-12-01 | China Institute of Atomic Energy | Tearing finite element process processing method and system
Non-Patent Citations (3)
Title |
---|
Acceleration Techniques for FETI Solvers for GPU Accelerators; Radim Vavřík et al.; International Conference on High Performance Computing & Simulation; full text *
A greedy algorithm for Reduce load balancing on the Hadoop platform; Liu Duo et al.; Application Research of Computers; Vol. 33, No. 9; p. 2658 *
Analysis of electromagnetic scattering by the finite element-boundary integral method combined with the tearing and interconnecting method; Wan Ting et al.; Systems Engineering and Electronics; Vol. 32, No. 9; full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Azad et al. | Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication | |
Lastovetsky et al. | Model-based optimization of EULAG kernel on Intel Xeon Phi through load imbalancing | |
Peng et al. | GLU3.0: Fast GPU-based parallel sparse LU factorization for circuit simulation | |
Balaprakash et al. | Active-learning-based surrogate models for empirical performance tuning | |
Ida | Lattice H-matrices on distributed-memory systems | |
Lastovetsky et al. | Data distribution for dense factorization on computers with memory heterogeneity | |
Rico-Garcia et al. | Comparison of high performance parallel implementations of TLBO and Jaya optimization methods on manycore GPU | |
CN108879691B (en) | Large-scale continuous power flow calculation method and device | |
Kopysov et al. | Hybrid Multi-GPU solver based on Schur complement method | |
CN112035995A (en) | Nonstructural grid tidal current numerical simulation method based on GPU (graphics processing Unit) computing technology | |
CN112733401B (en) | Finite element tearing butt joint method and system for numerical simulation of reactor core assembly | |
CN109101708B (en) | Implicit finite element parallel method based on two-stage region decomposition | |
Yang et al. | Dynamic partitioning of loop iterations on heterogeneous PC clusters | |
CN112016232A (en) | Tear finite element process processing method and system | |
CN108599173B (en) | Method and device for solving batch power flows | |
Biswas et al. | Portable parallel programming for the dynamic load balancing of unstructured grid applications | |
Kuźnik et al. | Graph grammar-based multi-frontal parallel direct solver for two-dimensional isogeometric analysis | |
Kaur et al. | Genetic algorithm solution for scheduling jobs in multiprocessor environment | |
Ghale et al. | Task-based parallel computation of the density matrix in quantum-based molecular dynamics using graph partitioning | |
Marrakchi et al. | Static scheduling with load balancing for solving triangular band linear systems on multicore processors | |
CN116595691B (en) | RDPRQCG large-scale structure topology frequency optimization method | |
Coleman et al. | Enhancing asynchronous linear solvers through randomization | |
CN117435308B (en) | Modelica model simulation method and system based on parallel computing algorithm | |
Singh et al. | Heterogeneous computing with graphical processing unit: improvised back-propagation algorithm for water level prediction | |
Korch et al. | Implementation and Optimization of a 1D2V PIC Method for Nonlinear Kinetic Models on GPUs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||