CN113239620A - Improved particle swarm method for parameter identification of geotechnical material constitutive model based on GPU acceleration

Info

Publication number: CN113239620A
Application number: CN202110509992.0A
Authority: CN
Inventors: 康恒一; 闫自海; 王紫娟; 刘世明; 甘鹏路; 严佳佳
Original assignee: PowerChina Huadong Engineering Corp Ltd
Current assignee: PowerChina Huadong Engineering Corp Ltd
Priority date: 2021-05-11
Filing date: 2021-05-11
Publication date: 2021-08-10
Anticipated expiration: 2041-05-11
Also published as: CN113239620B

Abstract

The invention discloses an improved, GPU-accelerated particle swarm method for identifying the parameters of constitutive models of geotechnical materials. Because the identification process is computationally expensive, the method addresses computational efficiency from two directions: algorithm optimization and the use of high-performance hardware. On the one hand, an early termination mechanism is introduced that exploits the cumulative nature of the error function, which saves computation; on the other hand, a GPU is used as the computing device, and the program structure is optimized for the characteristics of its instruction set. The beneficial effect of the invention is an improvement in computational efficiency of more than an order of magnitude.

Description

Improved particle swarm method for parameter identification of geotechnical material constitutive model based on GPU acceleration

Technical Field

The invention relates to a method for identifying the parameters of geotechnical constitutive models and belongs to the field of laboratory geotechnical testing and material mechanics. It provides an auxiliary method for calibrating the parameters of a mechanical constitutive model from laboratory triaxial test data. In particular, it relates to a calculation method that improves computational efficiency by introducing an early termination mechanism into the error function accumulation of the conventional particle swarm optimization method and by parallelizing the calculation on a GPU (graphics processing unit).

Background

In numerical simulation in geotechnical engineering, a suitable constitutive model and suitable parameter values must be selected to reflect the nonlinear behaviour of the material at the element scale, so that reasonably accurate results can be obtained at the macroscopic scale. As a preliminary step of numerical simulation, it is therefore very important to perform laboratory triaxial tests and to calibrate the mechanical constitutive model against the test data. Generally speaking, the parameter calibration of a simple constitutive model relies on the physical meaning of the model itself and on a geometric interpretation of the stress-strain curve. For example, in a Von Mises model with only two model parameters, elastic modulus and yield stress, the slope of the elastic portion of the stress-strain curve corresponds to the elastic modulus, and the inflection point at yield corresponds to the yield stress. For more complex constitutive models, however, the geometric features of the stress-strain curve no longer correspond one-to-one to the model parameters. Some advanced constitutive models, such as the hypoplastic model, also introduce several hyper-parameters that have no clear physical meaning and serve only to fit the stress-strain curve smoothly. Such parameters can generally be obtained only by repeated trial and error. When the number of model parameters is large, it is very difficult to adjust several hyper-parameters by hand while making several calculated stress-strain curves agree with the test data at the same time. Therefore, the calibration of the mechanical parameters of geotechnical materials must be treated as an optimization problem.

Particle Swarm Optimization (PSO) is a bio-inspired heuristic method in the field of computational intelligence and belongs to the class of swarm intelligence optimization algorithms. The optimization process does not require gradient information of the objective function, is easy to implement, is broadly applicable, and has strong local and global search capabilities. It is widely applied to optimization problems in many scientific and engineering fields, such as function optimization (CN201610282691.8), wind power prediction (CN201710415906.3) and customer classification (CN201911088225.6). In the invention, the calibration problem is treated from the viewpoint of optimization: the material parameters are the independent variables, the deviation between the geotechnical test data and the data calculated by the constitutive model is the error function, and the particle swarm algorithm is used to find the material model parameters that minimize the error function.

When the material constitutive model is complex, applying the particle swarm algorithm to parameter identification requires a large amount of computation. In general, the optimization problem to be solved is non-convex. The more model parameters there are, the stronger the non-convexity, and the more particles are needed to find the global optimum and to avoid premature convergence to a local optimum. In addition, the stress integration of a complex constitutive model is itself computationally expensive, which makes the evaluation of the error function expensive. The evolution of the specimen stress and strain during the loading test is known. When the error function is evaluated, the strain increment at each step is treated as a known quantity, stress integration is performed, and the stress state is updated so that it can be compared with the experimental stress state. The stress-strain curve is integrated in this way, and the squared errors between the calculated stress and the experimental stress at each step are summed to give the error function. According to the particle swarm mechanism, a complete stress-strain curve must be calculated for every particle in every iteration, which is time-consuming. Therefore, the computational efficiency of this method needs to be improved.

The invention aims to solve the problem of computational efficiency from two directions: algorithm optimization and the use of high-performance hardware. As described above, the calculation of the error function is a monotonically increasing accumulation process. Exploiting this property, an early termination mechanism can be applied to particles whose error function has already exceeded their own historical best during accumulation, which saves computation. In addition, the present disclosure introduces parallelization on a graphics processing unit (GPU). Compared with a conventional central processing unit (CPU), a GPU has very high floating-point throughput and data throughput and is suited to problems with high arithmetic intensity. However, if the program is not structured properly, the Single Instruction Multiple Data (SIMD) execution model of the GPU incurs a severe branching penalty; the invention adjusts the program structure on the GPU and optimizes performance with respect to this problem.

Disclosure of Invention

The invention aims to provide an improved particle swarm method for identifying parameters of a geotechnical material constitutive model based on GPU acceleration, which solves the problem of computational efficiency from two aspects of algorithm optimization and high-performance hardware application, and adopts the following technical scheme:

Using the strain in the experimental data {ε_exp,k} as a known quantity, stress integration (denoted by the Update function) is performed to update the stress state {σ_cal,k} step by step, and the squared deviation between the calculated stress σ_cal and the experimental stress σ_exp is accumulated to compute the error function. The accumulation of the error function E is defined as follows:

calculating the strain increment: $\Delta\varepsilon_{k+1} = \varepsilon_{\mathrm{exp},k+1} - \varepsilon_{\mathrm{exp},k}$ (1)

updating the calculated value of the stress state: $\sigma_{\mathrm{cal},k+1} = \mathrm{Update}(\sigma_{\mathrm{cal},k}, \Delta\varepsilon_{k+1})$ (2)

accumulating the error function: $E_{k+1} = E_{k} + \lVert \sigma_{\mathrm{cal},k+1} - \sigma_{\mathrm{exp},k+1} \rVert^{2}$ (3)

Taking the population size N_par of the particle swarm as the total number of threads, the GPU performs single-instruction multiple-data (SIMD) parallel computation of formulas (1) to (3) at the granularity of a single stress integration; this is defined as the basic parallel step;
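As an illustration of the accumulation in formulas (1) to (3), the following is a minimal serial sketch in Python. It is not the patented implementation: `update_stress` is a hypothetical one-dimensional linear-elastic stand-in for the Update function, the squared-difference form of the error follows the description of kernel K1, and the data values are invented so that the arithmetic is exact.

```python
def update_stress(sigma, d_eps, modulus):
    # Hypothetical stand-in for Update in formula (2): 1D linear elasticity.
    return sigma + modulus * d_eps


def basic_step(sigma_cal, error, k, eps_exp, sig_exp, modulus):
    """One basic step for one particle: formulas (1), (2) and (3)."""
    d_eps = eps_exp[k + 1] - eps_exp[k]                    # formula (1)
    sigma_cal = update_stress(sigma_cal, d_eps, modulus)   # formula (2)
    error += (sigma_cal - sig_exp[k + 1]) ** 2             # formula (3)
    return sigma_cal, error


# Invented "experimental" data, generated with a modulus of 2.
eps_exp = [0.0, 0.5, 1.0, 1.5]
sig_exp = [0.0, 1.0, 2.0, 3.0]


def total_error(modulus):
    """Accumulate the error over the whole curve for one trial parameter."""
    sigma, err = sig_exp[0], 0.0
    for k in range(len(eps_exp) - 1):
        sigma, err = basic_step(sigma, err, k, eps_exp, sig_exp, modulus)
    return err


print(total_error(2.0), total_error(1.0))
```

Because every step adds a non-negative term, the running error can only grow, which is the property the early termination mechanism relies on.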

based on the above definition of the error function calculation and the basic parallel steps, the specific execution steps of the calculation of the invention are as follows:

step S1: after a predetermined number N_trial of basic parallel steps have been trial-calculated, the running state of each particle is judged in SIMD parallel by checking whether the current error function value E exceeds the particle's historical best (pbest). The logic of the state judgment is as follows:

if E ≤ pbest and the particle has not completed the calculation of all data, the particle is defined to be in the accumulating state A0,

if E ≤ pbest and the particle has completed the calculation of all data, the particle is defined to be in the normal termination state A1,

if E > pbest, the particle is defined to be in the abort state A2;
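The three-way state judgment of step S1 can be sketched as a small function. The names `classify`, `steps_done` and `n_data_steps` are illustrative assumptions, not identifiers from the patent; the numeric codes 0, 1 and 2 mirror the runstat values used later for kernels K1 to K4.

```python
A0_ACCUMULATING = 0   # E <= pbest, data not yet exhausted
A1_FINISHED = 1       # E <= pbest, all data processed
A2_ABORTED = 2        # E > pbest, further accumulation is pointless


def classify(error, pbest, steps_done, n_data_steps):
    """Return the running state of one particle after a trial block."""
    if error > pbest:
        return A2_ABORTED
    if steps_done >= n_data_steps:
        return A1_FINISHED
    return A0_ACCUMULATING


print(classify(0.5, 1.0, 10, 399),
      classify(0.5, 1.0, 399, 399),
      classify(1.5, 1.0, 10, 399))
```

The abort test is placed first, so a particle whose error has exceeded pbest is classified A2 even if it happens to have reached the end of the data.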

step S2: for all particles in state A1, the error function value did not exceed the particle's own best pbest after all data had been accumulated, so pbest and the corresponding parameter values are updated in SIMD parallel;

step S3: serially traversing all particles in state A1, and if the historical best pbest of a particle is better than the global best gbest of all particles, updating gbest and its corresponding parameter values;

step S4: performing the SIMD-parallel particle swarm update operation on the particles in states A1 and A2, wherein i = 1, 2, 3, …, N_par; d = 1, 2, 3, …, D; t = 1, 2, 3, …. Here i is the index of the particle, t is the current iteration step, and D is the dimension of the particle. In this example, D equals the number of parameters of the constitutive model. ω is the inertia weight factor, whose value is non-negative: when it is large, the global search capability is strong and the local search capability is weak; when it is small, the global search capability is weak and the local search capability is strong. The balance between global and local search can therefore be tuned by adjusting ω. c₁ and c₂ are called acceleration constants; the former is the individual learning factor of each particle and the latter is the social learning factor of each particle. In general c₁ and c₂ are taken as 2, but other values between 0 and 4 may be tried. r₁ and r₂ are random numbers in the interval [0, 1], which introduce randomness and increase the chance of finding the global optimum. In each iteration, the velocity v and position x of each particle are updated according to the particle's own best parameters p_id and the swarm's global best parameters p_gd; the specific formulas are as follows:

updating the particle velocity: $v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1 (p_{id} - x_{id}^{t}) + c_2 r_2 (p_{gd} - x_{id}^{t})$ (4)

updating the particle position: $x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}$ (5)
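A minimal sketch of the velocity and position update of step S4 for a single particle, assuming the standard particle swarm form (inertia term plus individual and social attraction terms). The default ω = 0.7 is an arbitrary illustrative value that does not come from the source; c₁ = c₂ = 2 follows the text.

```python
import random


def pso_update(x, v, p_best, g_best, omega=0.7, c1=2.0, c2=2.0, rng=random):
    """Return the new position and velocity of one particle.

    x, v, p_best and g_best are lists of length D (one entry per parameter).
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()       # uniform in [0, 1)
        vd = (omega * v[d]
              + c1 * r1 * (p_best[d] - x[d])      # pull toward own best
              + c2 * r2 * (g_best[d] - x[d]))     # pull toward global best
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v


# A particle sitting exactly on both optima only keeps its damped velocity.
x, v = pso_update([1.0, 4.0], [1.0, -2.0], [1.0, 4.0], [1.0, 4.0], omega=0.5)
print(x, v)
```

In the degenerate usage above both attraction terms vanish, so the result is deterministic despite the random factors.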

step S5: for the particles in states A1 and A2, performing SIMD-parallel particle state processing: the error function is reset to 0 and the state is reset to A0.

The checking step Check: one completed pass through steps S1 to S5 is defined as a loop step, and the number of loop steps performed is denoted T. After every specified number of loop steps, whether the current global best parameter values meet the convergence requirement is checked; if so, the calculation ends, otherwise it continues.

Compared with the prior art, the invention has the following advantages and beneficial effects:

The algorithmic improvement of the invention is that, during the calculation of the error function, accumulation is terminated early for particles whose error has already exceeded their own historical best, which removes unnecessary stress integration steps and saves calculation time. In fact, if the error function value of a particle is not better than its historical best, that value does not take part in the subsequent particle swarm calculation, so its exact value does not need to be known. The conventional particle swarm algorithm always computes the exact value of the error function, and its computational efficiency is correspondingly lower.

The improvement of the invention with respect to high-performance hardware is that the GPU performs single-instruction multiple-data (SIMD) parallelization at the fine-grained level of a single stress integration, and that the running state of the particles is judged and processed promptly during the calculation so that the particles return to the accumulating state. All GPU threads can thus execute identical instruction streams, which reduces idle waiting in SIMD parallelization and improves GPU utilization. In conventional parallel particle swarm optimization, parallelization is usually performed at the granularity of whole particles, and if a logic branch occurs while the error function is being calculated, severe idle waiting results.

Drawings

FIG. 1 is a flow chart of the calculation;

FIG. 2 shows the relationship between the proportion of integration steps actually performed and N_trial;

FIG. 3 shows the number of integration steps performed per second.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment. In particular, to better illustrate the efficiency gain provided by the present invention, the acceleration is demonstrated using the von Wolffersdorff hypoplastic model. In addition, the GPU computing terminology used in this description follows OpenCL, but the algorithm can equally be implemented on the CUDA platform. It should be understood that the embodiments described herein are merely illustrative and are not intended to limit the invention.

The memory layout of the algorithm is as follows:

The main variables allocated in the Global Memory of the GPU are as follows:

runstat[N_par]: running state of the particles;
pos[N_par*D]: position array of the particles;
vel[N_par*D]: velocity array of the particles;
curobj[N_par]: current error function values of the particles;
runSig1[N_par], runSig2[N_par], runSig3[N_par]: current stress state σ_1,cal to σ_3,cal of the particles;
runvoidRt[N_par]: current void ratio of the particles;
pbest[N_par]: historical best error function values of the particles;
bestpos[N_par*D]: historical best positions of the particles;
gbpos[D]: global best position;
gbest: global best error function value.

In addition, variables for storing the experimental data should be allocated in global memory. With the number of experimental integration steps denoted N_data, these are the experimental stress values σ_1,exp to σ_3,exp (datSig1[N_data], datSig2[N_data], datSig3[N_data]) and the experimental strain values ε_1,exp to ε_3,exp (datEps1[N_data], datEps2[N_data], datEps3[N_data]).

Array variables should be laid out in column-major order. Taking pos[N_par*D] as an example, the m-th parameter of particle id is stored at pos[id + N_par*m]. In this way, adjacent threads access strictly contiguous addresses during the calculation, and coalesced access improves the memory bandwidth.
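The column-major layout can be illustrated with a short sketch. The indexing rule id + N_par*m is taken from the addressing used for bestpos and pos in kernels K2 and K3; the helper names `param_index` and `pack_column_major` are hypothetical.

```python
def param_index(particle_id, m, n_par):
    """Flat index of the m-th parameter (0-based) of a particle."""
    return particle_id + n_par * m


def pack_column_major(particles):
    """Flatten a list of per-particle parameter lists into one flat array."""
    n_par, dim = len(particles), len(particles[0])
    flat = [0.0] * (n_par * dim)
    for pid, params in enumerate(particles):
        for m, value in enumerate(params):
            flat[param_index(pid, m, n_par)] = value
    return flat


# Three particles with two parameters each.
flat = pack_column_major([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(flat)
# Threads 0, 1, 2 reading parameter m = 1 touch consecutive addresses 3, 4, 5.
print([param_index(tid, 1, 3) for tid in range(3)])
```

With this layout the same parameter of neighbouring particles sits in neighbouring memory cells, which is what allows neighbouring threads to read contiguous addresses.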

The general parameters ω, c₁ and c₂ of the particle swarm algorithm do not change during the calculation and occupy little space, so they should be placed in Constant Memory. The hardware actively caches this memory, which increases the access bandwidth for repeated reads.

The core of the algorithm, steps S1 to S5, should be deployed on the GPU in the form of kernel functions. According to the functional and parallelization requirements of S1 to S5, four GPU kernels K1 to K4 are defined; their functions and parallelization are described as follows:

K1: this function corresponds exactly to step S1 described in the Disclosure of Invention. The total number of threads equals the population size N_par of the particle swarm. Each thread tid reads the parameter values of its particle from the pos array in parallel and enters a for loop that performs N_trial stress integrations and error function accumulations. In each step k of the loop, the strain and stress are read from the datEps and datSig arrays and, following formulas (1) to (3), the strain increment (datEps1[k+1]-datEps1[k], datEps2[k+1]-datEps2[k], datEps3[k+1]-datEps3[k]) is calculated, the calculated stress of thread tid (runSig1[tid], runSig2[tid], runSig3[tid]) is updated, and the squared difference between the calculated stress and the experimental stress, (runSig1[tid]-datSig1[k+1])² + (runSig2[tid]-datSig2[k+1])² + (runSig3[tid]-datSig3[k+1])², is accumulated into the error function curobj[tid] of the thread. curobj[tid] is then compared with pbest[tid], and k is checked against N_data. If curobj[tid] ≤ pbest[tid] and k < N_data, the running state runstat[tid] of the particle is set to 0 and the loop continues; if curobj[tid] ≤ pbest[tid] and k = N_data, runstat[tid] is set to 1 and the loop is exited; if curobj[tid] > pbest[tid], runstat[tid] is set to 2 and the loop is exited.
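The behaviour of kernel K1 can be emulated serially to make the early exit concrete. This is only a sketch under several assumptions: each pass of the outer loop stands in for one GPU thread, the stress update is a hypothetical one-parameter linear-elastic model (so D = 1 and pos[tid] is the modulus), and the per-particle `cursor` array that remembers how far each particle has progressed through the data is an invented helper that the source does not list among its variables.

```python
def k1_trial(n_trial, pos, curobj, pbest, run_sig, cursor, runstat,
             dat_eps, dat_sig):
    """Emulate K1: up to n_trial integration steps per accumulating particle."""
    n_steps = len(dat_eps) - 1                  # number of integration steps
    for tid in range(len(pos)):                 # one iteration = one thread
        if runstat[tid] != 0:                   # only state A0 keeps integrating
            continue
        for _ in range(n_trial):
            k = cursor[tid]
            d_eps = dat_eps[k + 1] - dat_eps[k]                  # formula (1)
            run_sig[tid] += pos[tid] * d_eps                     # toy Update (2)
            curobj[tid] += (run_sig[tid] - dat_sig[k + 1]) ** 2  # formula (3)
            cursor[tid] = k + 1
            if curobj[tid] > pbest[tid]:
                runstat[tid] = 2                # abort: already worse than pbest
                break
            if cursor[tid] == n_steps:
                runstat[tid] = 1                # normal end: all data consumed
                break


# Invented data generated with modulus 2; particle 0 is exact, particle 1 is not.
dat_eps = [0.0, 0.5, 1.0, 1.5]
dat_sig = [0.0, 1.0, 2.0, 3.0]
pos, pbest = [2.0, 1.0], [1.0, 1.0]
curobj, run_sig = [0.0, 0.0], [0.0, 0.0]
cursor, runstat = [0, 0], [0, 0]
k1_trial(3, pos, curobj, pbest, run_sig, cursor, runstat, dat_eps, dat_sig)
print(runstat, cursor, curobj)
```

In this run the second particle stops after two of the three steps because its accumulated error has already passed pbest, so the third stress integration is never performed for it.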

K2: this function corresponds exactly to step S2 described in the Disclosure of Invention. The total number of threads equals the population size N_par of the particle swarm. Each thread tid reads the current state of its particle from the runstat array in parallel; if runstat[tid] = 1, it sets pbest[tid] = curobj[tid] and updates the particle's historical best parameters in bestpos from the current position in the pos array: bestpos[tid] = pos[tid], bestpos[tid+N_par] = pos[tid+N_par], …, bestpos[tid+(D-1)*N_par] = pos[tid+(D-1)*N_par].

K3: this function corresponds exactly to step S3 described in the Disclosure of Invention. The number of threads equals 1, that is, this is a serial operation. All particles are traversed serially; letting the currently processed particle be i, pbest[i] is checked, and if it is less than gbest, the global best error function value gbest is updated to pbest[i] and the global best parameters are updated: gbpos[0] = bestpos[i], gbpos[1] = bestpos[i+N_par], …, gbpos[D-1] = bestpos[i+(D-1)*N_par].

K4: this function corresponds to steps S4 and S5 described in the Disclosure of Invention. The total number of threads equals the population size N_par of the particle swarm. Each thread tid reads the current state of its particle from the runstat array in parallel; for particles with runstat[tid] equal to 1 or 2, the velocity vel and position pos of the particle are updated according to formulas (4) and (5). At the same time, curobj[tid] is set to 0 to await a new round of accumulation.
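The bookkeeping done by K2 to K4 can be sketched as three plain functions operating on the column-major arrays. This is a serial emulation, not OpenCL code: each loop over tid stands in for the parallel threads, the velocity update assumes the standard particle swarm form, and resetting runstat to 0 inside the K4 sketch follows step S5 rather than an explicit statement about K4.

```python
import random


def k2_update_pbest(runstat, curobj, pbest, pos, bestpos, n_par, dim):
    for tid in range(n_par):
        if runstat[tid] == 1:                       # finished and not worse
            pbest[tid] = curobj[tid]
            for m in range(dim):
                bestpos[tid + m * n_par] = pos[tid + m * n_par]


def k3_update_gbest(pbest, bestpos, gbest, gbpos, n_par, dim):
    for i in range(n_par):                          # serial scan, one thread
        if pbest[i] < gbest:
            gbest = pbest[i]
            for m in range(dim):
                gbpos[m] = bestpos[i + m * n_par]
    return gbest


def k4_move_and_reset(runstat, curobj, pos, vel, bestpos, gbpos,
                      n_par, dim, omega, c1, c2, rng):
    for tid in range(n_par):
        if runstat[tid] in (1, 2):                  # finished or aborted
            for m in range(dim):
                j = tid + m * n_par
                r1, r2 = rng.random(), rng.random()
                vel[j] = (omega * vel[j]
                          + c1 * r1 * (bestpos[j] - pos[j])
                          + c2 * r2 * (gbpos[m] - pos[j]))
                pos[j] += vel[j]
            curobj[tid] = 0.0                       # new accumulation round
            runstat[tid] = 0                        # back to state A0
```

A particle still in state 0 is left untouched by all three functions, so it simply continues accumulating in the next K1 launch.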

The checking step Check reads the values of gbest and gbpos back to the CPU using the clEnqueueReadBuffer function; the data size is only (1+D) floating-point numbers. If the change between two successive gbpos values is small, or the gbest value is below a control value, the calculation may be terminated.
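One possible host-side form of this convergence test is sketched below. The source does not say how "not large" is measured, so the maximum absolute change per parameter and the tolerance names `pos_tol` and `obj_tol` are assumptions made for illustration.

```python
def converged(prev_gbpos, gbpos, gbest, pos_tol, obj_tol):
    """True if gbpos has barely moved since the last check or gbest is small."""
    max_change = max(abs(a - b) for a, b in zip(prev_gbpos, gbpos))
    return max_change < pos_tol or gbest < obj_tol


print(converged([1.0, 2.0], [1.0, 2.0005], 5.0, pos_tol=1e-3, obj_tol=1e-6))
print(converged([1.0, 2.0], [1.5, 2.0], 5.0, pos_tol=1e-3, obj_tol=1e-6))
```

In practice a relative change per parameter may be preferable, since constitutive parameters can differ by several orders of magnitude.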

Before the above core calculation step and check step are performed, an initialization operation I is required. The initialization operation includes the following:

GPU platform setup I1: the basic OpenCL objects (Platform, Device, Context, Program and so on) are set up for the GPU, the kernel function arguments are set, and clCreateBuffer is used to allocate the GPU address space needed for the calculation.

Particle swarm initial state setup I2: the initial positions of the particle swarm are initialized on the CPU and copied to the pos array on the GPU using clEnqueueWriteBuffer. The initial positions are chosen randomly within a reasonable range of the constitutive parameters.

Import of experimental data I3: the experimental stress and strain data are read on the CPU and copied to the datSig and datEps arrays on the GPU using clEnqueueWriteBuffer.

Combining the above information, the flow chart can be summarized as fig. 1, which includes the flow of the running program and the flow of the core calculation step.

The invention is applied to the parameter inversion of the von Wolffersdorff model, which has 8 parameters in total: φ_c, h_s, n, e_d0, e_c0, e_i0, α and β. The stress integration algorithm calculates the stress increment from the strain increment, corresponding to Update in formula (2), and involves a large number of floating-point operations. In the related formulas, formula (7) gives the fourth-order unit tensor and formula (12) gives the second-order unit tensor I.

The experimental data used for the inversion comprise two stress paths (drained/undrained), three confining pressures (100 kPa/300 kPa/500 kPa) and three void ratios (0.8/0.9/1.0), for a total of 18 data sets. Each data set contains 400 data points, that is, 399 stress integration steps, and each is loaded to 30% axial strain, which is sufficient to capture the nonlinear loading of the specimen up to the critical state. The number of particles used in the test is N_par = 20000. To ensure completeness of the data, all 18 data sets are used in this section, giving a total of 7182 stress integration steps. The influence of different numbers of trial integration steps N_trial is tested, with N_trial = 7182, 3591, 1596, 798, 399, 228, 57, 6 and 1. To make the calculations comparable, the total number of trial integration steps is fixed at 1436400, so that the numbers of loop steps T corresponding to the above N_trial values are 200, 400, 900, 1800, 3600, 6300, 25200, 239400 and 1436400, respectively. In the calculation, the GPU computing device is an Nvidia GTX 1050 and the CPU computing device is an Intel i7-7700.
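The bookkeeping in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative; all numbers are taken from the text above.

```python
n_sets = 2 * 3 * 3                     # stress paths x confining pressures x void ratios
steps_per_set = 400 - 1                # 400 data points give 399 integration steps
total_data_steps = n_sets * steps_per_set

total_trial_steps = 1436400            # fixed budget of trial integration steps
n_trial_values = [7182, 3591, 1596, 798, 399, 228, 57, 6, 1]

# Number of loop steps T for each N_trial so that T * N_trial equals the budget.
loop_steps = [total_trial_steps // n for n in n_trial_values]

print(n_sets, total_data_steps)
print(loop_steps)
```

Every listed N_trial divides the budget exactly, so each test configuration attempts the same number of trial integration steps per particle.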

FIG. 2 illustrates the relationship between the actual integration ratio and N_trial. During the calculation, because of the early termination mechanism, the number of integration steps actually performed is smaller than the specified total number of trial steps. When N_trial = 7182, the early-termination check is performed only after all the data have been tried, so the proportion of stress integration steps actually performed is low. Reducing N_trial places more check points in the calculation, so that particles that have terminated early return to stress integration sooner; the proportion of stress integration steps actually performed therefore increases as N_trial decreases. When N_trial = 1, whether a particle needs to be terminated early is checked after every single stress integration. In that case all particles are always performing stress integration, and the proportion of stress integration steps actually performed is 1. FIG. 2 shows this gradual increase in the proportion of stress integration steps actually performed as N_trial decreases. For N_trial of 400 or less, the actual integration ratios are all high; for N_trial above 400, the actual integration ratio falls more rapidly as N_trial increases.

FIG. 3 further shows the number of integration steps performed per second. When N_trial is small, the proportion of integration actually performed is high, but the state-processing functions K2 to K4 incur additional overhead. In particular, K3 is a serial function, and because the single-core performance of a GPU is low, performance degrades markedly when K3 is executed too many times. Therefore, the calculation is not very efficient when N_trial is very small. On the other hand, when N_trial is large, the proportion of integration actually performed is low: because of the SIMD characteristics of the GPU, particles that have stopped accumulating early can only wait idle and cannot be put into subsequent calculation. Therefore, the calculation efficiency is also low when N_trial is large. At N_trial = 798 the calculation efficiency reaches its maximum of 13366 steps/s, which is 21.28 times that of the CPU.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims

1. An improved particle swarm method for parameter identification of a constitutive model of a geotechnical material based on GPU acceleration is characterized in that stress integration is carried out by taking strain in experimental data as a known quantity, an accumulated value of a deviation square of a calculated value of the stress and an experimental value is taken as an error function, and the particle swarm method is used for searching for optimal parameters;

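calculating the strain increment: $\Delta\varepsilon_{k+1} = \varepsilon_{\mathrm{exp},k+1} - \varepsilon_{\mathrm{exp},k}$ (1);

updating the calculated value of the stress state: $\sigma_{\mathrm{cal},k+1} = \mathrm{Update}(\sigma_{\mathrm{cal},k}, \Delta\varepsilon_{k+1})$ (2);

accumulating the error function: $E_{k+1} = E_{k} + \lVert \sigma_{\mathrm{cal},k+1} - \sigma_{\mathrm{exp},k+1} \rVert^{2}$ (3).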

2. The improved particle swarm method for parameter identification of the geotechnical material constitutive model based on GPU acceleration as claimed in claim 1, wherein the method for calculating the error function in the particle swarm method takes into account the requirements of the Single Instruction Multiple Data (SIMD) instruction set of the GPU: taking the population size N_par of the particle swarm as the total number of threads, the GPU performs SIMD parallel computation of formulas (1) to (3) at the fine-grained level of a single stress integration, which is defined as the basic parallel step.

3. The improved particle swarm method for the parametric identification of the geotechnical material constitutive model based on GPU acceleration as claimed in claim 1, characterized by comprising the following steps:

step S1: after a predetermined number N_trial of basic parallel steps have been trial-calculated, the running state of each particle is judged in SIMD parallel by checking whether the current error function value E exceeds the particle's historical best (pbest), wherein the logic of the state judgment is as follows: if E ≤ pbest and the particle has not completed the calculation of all data, the particle is defined to be in the accumulating state A0; if E ≤ pbest and the particle has completed the calculation of all data, the particle is defined to be in the normal termination state A1; if E > pbest, the particle is defined to be in the abort state A2;

step S3: serially traversing all particles in state A1, and if the historical best pbest of a particle is better than the global best gbest of all particles, updating gbest and its corresponding parameter values;

step S4: performing the SIMD-parallel particle swarm update operation on the particles in states A1 and A2.

4. The improved particle swarm method for parameter identification of the geotechnical material constitutive model based on GPU acceleration as claimed in claim 3, wherein in step S4, the particle swarm updating operation has the following specific formula:

updating the particle velocity: $v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1 (p_{id} - x_{id}^{t}) + c_2 r_2 (p_{gd} - x_{id}^{t})$ (4);

updating the particle position: $x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}$ (5);

wherein i = 1, 2, 3, …, N_par; d = 1, 2, 3, …, D; t = 1, 2, 3, …; i is the index of the particle, t is the current iteration step, and D is the dimension of the particle; in this example, D equals the number of parameters of the constitutive model; ω is the inertia weight factor, whose value is non-negative, and the global and local search performance can be tuned by adjusting ω; c₁ and c₂ are called acceleration constants, and r₁ and r₂ are random numbers in the interval [0, 1]; in each iteration, the velocity v and position x of each particle are updated according to the particle's own best parameters p_id and the swarm's global best parameters p_gd.

5. The improved particle swarm method for parameter identification of geotechnical material constitutive models based on GPU acceleration as claimed in claim 3, further comprising the checking step Check: one completed pass through steps S1 to S5 is defined as a loop step, and the number of loop steps performed is denoted T; after every specified number of loop steps, whether the current global best parameter values meet the convergence requirement is checked; if so, the calculation ends, otherwise it continues.

6. The improved particle swarm method for parameter identification of the geotechnical material constitutive model based on GPU acceleration as claimed in claim 3, wherein the calculation of the error function is a monotonic accumulation process, and an early termination mechanism is applied to particles whose error function has exceeded their own historical best during the accumulation, so as to save computation.

7. The improved particle swarm method for parameter identification of the geotechnical material constitutive model based on GPU acceleration as claimed in claim 3, wherein the judgment and processing of the particle running state are performed in time during the calculation process, so that the particles return to the accumulation state, each thread of the GPU can execute a completely consistent instruction set, idle waiting time in SIMD parallelization is reduced, and the utilization efficiency of GPU calculation is improved.