CN106502632A - A GPU parallel particle swarm optimization method based on adaptive thread warps (self-adaptive thread beam) - Google Patents

A GPU parallel particle swarm optimization method based on adaptive thread warps (self-adaptive thread beam) Download PDF

Info

Publication number
CN106502632A
Authority
CN
China
Prior art keywords
particle
thread
population
self
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610976893.2A
Other languages
Chinese (zh)
Other versions
CN106502632B (en)
Inventor
何发智 (He Fazhi)
张硕 (Zhang Shuo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Beidou innovation and Application Technology Research Institute Co.,Ltd.
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201610976893.2A priority Critical patent/CN106502632B/en
Publication of CN106502632A publication Critical patent/CN106502632A/en
Application granted granted Critical
Publication of CN106502632B publication Critical patent/CN106502632B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a GPU parallel particle swarm optimization method based on adaptive thread warps, comprising the following steps. Step 1: initialize the problem-function parameters and the particle swarm parameters. Step 2: define three CUDA kernel functions, used respectively to compute in parallel the velocity and position of the next generation of particles, the fitness value of each particle together with the best fitness value found so far by the particle itself and its corresponding solution, and the best fitness value found so far by the whole swarm and its corresponding solution. Step 3: compute and initialize the Block and Grid parameters of each kernel function according to the adaptive thread-warp algorithm. Step 4: call the kernel functions to update the velocity and position of the swarm iteratively in parallel and obtain the best fitness value found so far and its corresponding solution. Step 5: repeat step 4 until the preset termination condition is reached, then the GPU outputs the computed result. The invention can significantly shorten the parallel solving time of the particle swarm algorithm on a GPU, reduce power consumption and save hardware cost.

Description

A GPU parallel particle swarm optimization method based on adaptive thread warps (self-adaptive thread beam)
Technical field
The present invention relates to a particle swarm optimization method, belongs to the field of computer data processing, and in particular relates to a GPU parallel particle swarm optimization method based on adaptive thread warps.
Background art
Particle Swarm Optimization (PSO) is an evolutionary computation technique. Because its concept is simple and easy to implement, while it also offers strong global search and convergence capabilities, it has developed rapidly and been widely applied. Various parallel PSO versions exist at present; for the CUDA parallel architecture, two thread-allocation schemes are mainly used: 1) one thread corresponds to one particle; 2) one thread corresponds to one dimension and one Block corresponds to one particle. The first, coarse-grained scheme already achieves a good speed-up ratio, but because the dimensions of the particle handled by each thread are still processed serially, its degree of parallelism is not high. The second, fine-grained scheme improves on the first: each particle is mapped to a Block and each thread within the Block is mapped to one dimension of that particle. This undoubtedly increases the degree of parallelism; it should be noted, however, that in a CUDA program all Blocks are assigned to the streaming multiprocessors serially, so the degree of parallelism can still be improved further.
The GPU was originally a dedicated graphics rendering device, i.e., hardware devoted exclusively to graphics processing. Since 2006, however, more and more researchers have studied GPGPU, the use of GPUs for general-purpose computation, and the major vendors have introduced dedicated GPGPU languages such as CUDA and OpenCL.
Summary of the invention
The purpose of the present invention is to optimize the existing GPU-based computation method by adjusting its parallel architecture so that its parallel efficiency becomes higher. An improved CUDA parallel architecture is designed and executed on a graphics processor (GPU), so that the degree of parallelism of the particle swarm algorithm on a single host is further improved; compared with the two methods above, the speed-up ratio over the CPU is improved by a factor of more than 40.
In order to solve the above problems, the solution of the present invention is as follows:
A GPU parallel particle swarm optimization method based on adaptive thread warps, in which
the dimensions of each particle are divided into several thread warps, and a thread block is used to contain these warps, so that one thread block corresponds to one or more particles;
wherein the thread warp is the basic unit of scheduling and execution on a streaming multiprocessor (SM).
Preferably, in the above GPU parallel particle swarm optimization method based on adaptive thread warps, the number of thread warps WarpNum corresponding to a particle and the number of particles ParticleNum corresponding to a thread block are adjusted based on the following equations:
WarpNum = DivUp(D, WarpSize)   (8)
ThreadNum = WarpNum * WarpSize   (9)
ParticleNum = DivDown(BlockSize, ThreadNum)   (10)
where D is the dimension of the problem to be solved and WarpSize is the size of one thread warp in the CUDA architecture; the DivUp function divides D by WarpSize and rounds the quotient up, giving the number of warps WarpNum corresponding to a particle; ThreadNum is the total number of threads actually used by each particle; BlockSize is the size of one Block in the CUDA architecture; the DivDown function divides BlockSize by ThreadNum and rounds the quotient down, giving the number of particles ParticleNum corresponding to a Block.
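A minimal host-side sketch of equations (8)-(10) is given below; the helper names DivUp and DivDown follow the text, while the example values in the trailing comment (D = 50, BlockSize = 256) are only illustrative assumptions.
// Host-side sketch of equations (8)-(10); the example values in the last comment are assumptions.
inline int DivUp(int a, int b)   { return (a + b - 1) / b; }  // quotient rounded up
inline int DivDown(int a, int b) { return a / b; }            // quotient rounded down
void adaptiveWarpSizing(int D, int BlockSize, int WarpSize,
                        int &WarpNum, int &ThreadNum, int &ParticleNum)
{
    WarpNum     = DivUp(D, WarpSize);              // (8) warps per particle
    ThreadNum   = WarpNum * WarpSize;              // (9) threads actually used per particle
    ParticleNum = DivDown(BlockSize, ThreadNum);   // (10) particles per Block
}
// Example: D = 50, WarpSize = 32, BlockSize = 256  ->  WarpNum = 2, ThreadNum = 64, ParticleNum = 4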
Preferably, in the above GPU parallel particle swarm optimization method based on adaptive thread warps, before the kernel functions are called,
the thread-block size BlockNum (threads per Block) and the grid size GridNum (Blocks per Grid) of each kernel function are computed and initialized, based on the adaptive thread-warp algorithm, using the following equations:
BlockNum = ThreadNum * ParticleNum;
GridNum = DivUp(N, ParticleNum);
where ThreadNum is the total number of threads actually used by each particle, ParticleNum is the number of particles corresponding to a thread block, and N is the total number of particles in the swarm.
Preferably, in the above GPU parallel particle swarm optimization method based on adaptive thread warps,
three CUDA kernel functions are defined, used respectively to compute in parallel the velocity and position of the particles, the fitness value of each particle together with the best fitness value found so far by the particle itself and its corresponding solution, and the best fitness value found so far by the whole swarm and its corresponding solution.
Preferably, the above GPU parallel particle swarm optimization method based on adaptive thread warps specifically includes the following steps:
Step 2.1: the kernel that computes the velocity and position of the particles: each GPU thread, according to the allocated thread-block size BlockNum and grid size GridNum, computes the velocity and position of its corresponding problem dimension using the update formulas of the particle swarm algorithm;
Step 2.2: the kernel that computes the fitness value of the next generation of particles and the best fitness value found by each particle itself together with its corresponding solution: the fitness value of every dimension of every particle is computed in parallel according to the allocated BlockNum and GridNum, the per-dimension fitness values are then combined by a parallel reduction algorithm to obtain the fitness value of each particle, and finally the particle's best fitness value and its corresponding solution are updated according to the fitness value obtained;
Step 2.3: the kernel that computes the best fitness value found so far by the whole swarm and its corresponding solution: the cublasI<t>amin() function of CUBLAS (where t is the data type of the operands) is used to obtain, on the GPU, the best fitness value found so far by the whole swarm and its corresponding solution.
Preferably, in the above GPU parallel particle swarm optimization method based on adaptive thread warps, the problem functions are initialized based on the following equations:
f_{Sphere}(x) = \sum_{d=1}^{D} x_d^2,  x_d \in [-100, 100];   (1)
f_{Rastrigrin}(x) = \sum_{d=1}^{D} [x_d^2 - 10\cos(2\pi x_d) + 10],  x_d \in [-5.12, 5.12];   (2)
f_{Rosenbrock}(x) = \sum_{d=1}^{D-1} [100(x_{d+1} - x_d^2)^2 + (x_d - 1)^2],  x_d \in [-10, 10];   (3)
where f_{Sphere} is the formula of the problem function Sphere, f_{Rastrigrin} is the formula of the problem function Rastrigrin, f_{Rosenbrock} is the formula of the problem function Rosenbrock, x is the problem-function variable and D is the dimension of the problem function.
Preferably, in the above GPU parallel particle swarm optimization method based on adaptive thread warps, the particle swarm is updated based on the following equations:
V_{id}(t+1) = w V_{id}(t) + c_1 r_1 (P^b_{id}(t) - X_{id}(t)) + c_2 r_2 (P^{gb}_d(t) - X_{id}(t));   (4)
X_{id}(t+1) = X_{id}(t) + V_{id}(t);   (5)
where V_{id} is the velocity of each particle, t is the current iteration number, w is the inertia weight coefficient of the swarm, c_1 and c_2 are the acceleration factors of the swarm, r_1 and r_2 are random numbers uniformly distributed in the interval [0, 1], P^b_{id} is the individual (personal) best of the particle, P^{gb}_d is the global best of the whole swarm, and X_{id} is the current position (solution) of the particle. The particle swarm parameters w, c_1 and c_2 are updated based on the following equations:
w = 1/(2 ln 2);   (6)
c_1 = c_2 = 0.5 + ln 2;   (7)
Therefore, the invention has the following advantages:
(1) The method provided by the invention can significantly shorten the time required by the PSO algorithm to solve a problem and improve the response speed of related application software;
(2) With the method provided by the invention, a low-end CPU can be chosen as the host while a mid- or high-end GPU performs the computation, reaching the performance of multiple CPUs or even a cluster, thereby reducing power consumption and saving hardware cost.
Description of the drawings
Fig. 1 is a flow chart of the particle swarm optimization method of the embodiment of the present invention.
Fig. 2 shows the CUDA parallel computing model.
Fig. 3 is the GPU-side update architecture diagram of the embodiment of the present invention.
Fig. 4 is a flow chart of the particle swarm optimization method of the embodiment, including the GPU-side update architecture.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below through an embodiment in combination with the accompanying drawings.
Embodiment:
As shown in Fig. 1, the GPU parallel particle swarm optimization method based on adaptive thread warps of this embodiment includes the following steps:
Step 1: initialize the problem-function parameters and the particle swarm parameters;
Step 2: define three CUDA kernel functions, used respectively to compute in parallel the velocity and position of the particles, the fitness value of each particle together with the best fitness value found so far by the particle itself and its corresponding solution, and the best fitness value found so far by the whole swarm and its corresponding solution;
Step 3: compute and initialize BlockNum and GridNum of each kernel function according to the adaptive thread-warp algorithm;
Step 4: call the kernel functions to update the velocity and position of the swarm iteratively in parallel, and obtain the best fitness value found so far and its corresponding solution;
Step 5: repeat step 4 until the preset termination condition is reached, then the GPU outputs the computed result.
The problem functions in step 1 are defined based on the following equations (1)-(3):
f_{Sphere}(x) = \sum_{d=1}^{D} x_d^2,  x_d \in [-100, 100];   (1)
f_{Rastrigrin}(x) = \sum_{d=1}^{D} [x_d^2 - 10\cos(2\pi x_d) + 10],  x_d \in [-5.12, 5.12];   (2)
f_{Rosenbrock}(x) = \sum_{d=1}^{D-1} [100(x_{d+1} - x_d^2)^2 + (x_d - 1)^2],  x_d \in [-10, 10];   (3)
where f_{Sphere} is the formula of the problem function Sphere, f_{Rastrigrin} is the formula of the problem function Rastrigrin, f_{Rosenbrock} is the formula of the problem function Rosenbrock, x is the problem-function variable and D is the dimension of the problem function.
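As an illustration only (these helpers are not part of the patent text), the per-dimension contribution of each benchmark in equations (1)-(3) can be written as a CUDA __device__ function; summing the contributions over all dimensions, as done by the parallel reduction of step 2.2 below, gives the particle fitness. Indices are 0-based in the code and the function names are assumptions.
// Per-dimension terms of equations (1)-(3); function names are illustrative.
__device__ float sphereTerm(const float *x, int d)        // term of (1)
{
    return x[d] * x[d];
}
__device__ float rastrigrinTerm(const float *x, int d)    // term of (2)
{
    const float PI = 3.14159265358979f;
    return x[d] * x[d] - 10.0f * cosf(2.0f * PI * x[d]) + 10.0f;
}
__device__ float rosenbrockTerm(const float *x, int d)    // term of (3), valid for d = 0 .. D-2
{
    float a = x[d + 1] - x[d] * x[d];
    float b = x[d] - 1.0f;
    return 100.0f * a * a + b * b;
}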
The update formulas of the particle swarm are given by equations (4)-(5):
V_{id}(t+1) = w V_{id}(t) + c_1 r_1 (P^b_{id}(t) - X_{id}(t)) + c_2 r_2 (P^{gb}_d(t) - X_{id}(t));   (4)
X_{id}(t+1) = X_{id}(t) + V_{id}(t);   (5)
where V_{id} is the velocity of each particle, t is the current iteration number, w is the inertia weight coefficient of the swarm, c_1 and c_2 are the acceleration factors of the swarm, r_1 and r_2 are random numbers uniformly distributed in the interval [0, 1], P^b_{id} is the individual (personal) best of the particle, P^{gb}_d is the global best of the whole swarm, and X_{id} is the current position (solution) of the particle. The update formulas of the parameters w, c_1 and c_2 are given by equations (6)-(7):
w = 1/(2 ln 2);   (6)
c_1 = c_2 = 0.5 + ln 2;   (7)
The three CUDA kernel functions defined in step 2 of this embodiment are used respectively to compute in parallel the velocity and position of the particles, the fitness value of each particle together with the best fitness value found by the particle itself and its corresponding solution, and the best fitness value found so far by the whole swarm and its corresponding solution.
The GPU parallel computation of this algorithm is implemented on the CUDA platform. Referring to Fig. 2, the CUDA parallel computing model is a SIMD (single-instruction multiple-data) parallel computation model in which the GPU, acting as a coprocessor, spawns a large number of threads and can help the CPU complete a large amount of highly parallel, simple computation work. CUDA adopts a multi-level memory architecture with three levels: thread (Thread), thread block (Block) and block grid (Grid). A Thread runs on an SP (Streaming Processor) and is the most basic execution unit; each Thread has private registers, and several Threads executing the same instructions form a Block. A Block runs on an SM (Streaming Multiprocessor); all Threads within a Block can communicate and share data through the Block's shared memory (Shared Memory) and can be synchronized, and several Blocks performing the same function form a Grid. A Grid runs on the SPA (Scalable Streaming Processor Array); Blocks within the same Grid need not communicate with each other, and execution between Grids is serial. That is, when the program is loaded and a Grid is placed on the GPU, all of its Blocks are assigned to the streaming multiprocessors serially. Therefore, the number of threads in each Block must be allocated reasonably in order to improve the degree of parallelism.
At run time, a Block is further divided into smaller thread warps (Warp). On an SM, the threads within each Block are grouped sequentially according to their unique IDs, and every 32 adjacent threads form a Warp. All threads are parallel logically, but from the hardware point of view not all threads execute at the same moment: the Warp is the basic unit of scheduling and execution on an SM. The notion of a Warp is not exposed in the CUDA programming model; it is determined by the GPU hardware, yet it has a large influence on performance. Threads in the same Warp can be regarded as executing "simultaneously"; they can communicate through shared memory without explicit synchronization, which further saves the time consumed by calling __syncthreads() to synchronize threads. In summary, by increasing the number of threads used per Block and by allocating the threads of a Block in units of Warps, higher performance can be obtained.
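Under this warp-level allocation, each thread can derive the particle and the dimension it is responsible for directly from its indices. The short sketch below is my illustration of that mapping (variable names are not from the patent); ThreadNum and ParticleNum are the quantities defined by equations (9)-(10).
// Illustrative thread-to-(particle, dimension) mapping, written inside a kernel,
// for one Block that holds ParticleNum particles of ThreadNum = WarpNum * WarpSize threads each.
int localParticle = threadIdx.x / ThreadNum;                  // particle slot inside this Block
int dim           = threadIdx.x % ThreadNum;                  // dimension handled by this thread
int particle      = blockIdx.x * ParticleNum + localParticle; // global particle index
bool active       = (dim < D) && (particle < N);              // padding threads do no work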
Step 2 is implemented through the following sub-steps.
Step 2.1: the kernel that computes the velocity and position of the particles: according to the allocated Block size BlockNum and Grid size GridNum, each GPU thread computes the velocity and position of its corresponding problem dimension using the update formulas of the particle swarm algorithm. The kernel function is declared as follows:
__global__ void ParticleFly_VP_kernel(float* Particle_X, float* Particle_V, int* GBestIndex, float* Particle_XBest, float* Particle_FitBest, curandState* s)
The parameters of the function represent, in order: the position array of all particles (length = number of particles * dimension), the velocity array of all particles (length = number of particles * dimension), the index of the best particle, the best fitness value, and the array of the solution corresponding to the best fitness (length = dimension of the problem function), followed by the cuRAND state array s.
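The patent gives only the declaration of ParticleFly_VP_kernel above. As a rough, simplified sketch of what the body of such a kernel could look like under the warp-level mapping, equations (4)-(5) can be implemented as shown below; the parameter names, the personal-best and global-best arrays, and the one-cuRAND-state-per-thread layout are my assumptions, not the patent's exact interface.
#include <curand_kernel.h>
// Simplified velocity/position update implementing equations (4)-(5).
// Layout assumption: X, V and PBest are N*D arrays (row = particle), GBest is the
// D-dimensional global-best position, 'state' holds one curandState per launched thread.
__global__ void updateVelocityPosition(float *X, float *V,
                                       const float *PBest, const float *GBest,
                                       curandState *state,
                                       int N, int D, int ThreadNum, int ParticleNum,
                                       float w, float c1, float c2)
{
    int localParticle = threadIdx.x / ThreadNum;
    int dim           = threadIdx.x % ThreadNum;
    int particle      = blockIdx.x * ParticleNum + localParticle;
    if (particle >= N || dim >= D) return;                 // padding threads exit

    int idx = particle * D + dim;                          // this thread's (particle, dimension) slot
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float r1 = curand_uniform(&state[tid]);                // r1, r2 ~ U[0, 1]
    float r2 = curand_uniform(&state[tid]);

    float v = w * V[idx]
            + c1 * r1 * (PBest[idx] - X[idx])              // cognitive term of (4)
            + c2 * r2 * (GBest[dim] - X[idx]);             // social term of (4)
    V[idx] = v;
    X[idx] = X[idx] + v;                                   // (5), using the newly updated velocity
}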
Step 2.2: the kernel that computes the fitness value of the next generation of particles and the best fitness value found by each particle itself together with its corresponding solution: the fitness value of every dimension of every particle is computed in parallel according to the allocated BlockNum and GridNum, the per-dimension fitness values are then combined by a parallel reduction algorithm to obtain the fitness value of each particle, and finally the particle's best fitness value and its corresponding solution are updated according to the fitness value obtained. The kernel function is declared as follows:
__global__ void ParticleFly_Fit_kernel(float* Particle_X, float* Particle_XBest, float* Particle_Fit, float* Particle_FitBest)
The parameters of the function represent, in order: the position array of all particles (length = number of particles * dimension), the array of the solution corresponding to the best fitness (length = dimension of the problem function), the fitness array of all particles (length = number of particles), and the best fitness value.
It should be noted that this kernel uses a parallel reduction to obtain the fitness values, so shared memory must be allocated for it; the amount of shared memory to allocate is Block_Size * sizeof(float), where Block_Size is the number of threads in each Block.
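A minimal sketch of the per-particle reduction described above follows. It is not the patent's exact ParticleFly_Fit_kernel (which also updates the personal best); it only shows how the per-dimension terms can be summed inside the dynamically allocated shared memory of Block_Size * sizeof(float), with the Sphere term of equation (1) used as an example, and the kernel and array names are assumptions.
// Sketch of step 2.2's reduction: each particle sums its per-dimension terms inside
// its own ThreadNum-wide segment of shared memory.
// Launch with BlockNum * sizeof(float) dynamic shared bytes.
__global__ void particleFitnessReduce(const float *X, float *Fit,
                                      int N, int D, int ThreadNum, int ParticleNum)
{
    extern __shared__ float sdata[];                       // Block_Size floats
    int localParticle = threadIdx.x / ThreadNum;
    int dim           = threadIdx.x % ThreadNum;
    int particle      = blockIdx.x * ParticleNum + localParticle;

    float term = 0.0f;                                     // padding threads contribute 0
    if (particle < N && dim < D) {
        float xd = X[particle * D + dim];
        term = xd * xd;                                    // Sphere term of (1) as an example
    }
    sdata[threadIdx.x] = term;
    __syncthreads();

    // Tree reduction over this particle's segment; handles ThreadNum values that
    // are not powers of two. Every thread of the Block reaches every barrier.
    int base = localParticle * ThreadNum;
    for (int s = ThreadNum; s > 1; ) {
        int half = (s + 1) / 2;
        if (dim + half < s)
            sdata[base + dim] += sdata[base + dim + half];
        __syncthreads();
        s = half;
    }
    if (dim == 0 && particle < N)
        Fit[particle] = sdata[base];                       // fitness of this particle
}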
Step 2.3: the kernel step that computes the best fitness value found so far by the whole swarm and its corresponding solution: the cublasI<t>amin() function of CUBLAS (where t is the data type of the operands) is used to obtain, on the GPU, the best fitness value found so far by the whole swarm and its corresponding solution;
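For step 2.3, a possible host-side call is sketched below (the array and function names are illustrative assumptions). cublasIsamin() returns the 1-based index of the element with the smallest absolute value, which here coincides with the smallest fitness because the benchmark fitness values of equations (1)-(3) are non-negative.
#include <cublas_v2.h>
// d_fitBest: device array of length N holding each particle's best fitness.
// Returns the 0-based index of the particle whose fitness is currently best.
int findGlobalBestIndex(cublasHandle_t handle, const float *d_fitBest, int N)
{
    int best1 = 0;                               // CUBLAS uses 1-based indexing
    cublasIsamin(handle, N, d_fitBest, 1, &best1);
    return best1 - 1;
}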
The schematic diagram of step 3 is shown in Fig. 3: BlockNum and GridNum of each kernel function are computed and initialized according to the adaptive thread-warp algorithm.
This is implemented through the following sub-steps.
Step 3.1: based on the characteristics of the CUDA computation model, the dimensions of each particle are divided into one or more Warps, and a Block is then used to contain these Warps, so that one Block corresponds to one or more particles; this increases the number of threads used per Block and achieves Warp-level parallelism.
Step 3.2: the number of Warps corresponding to a particle and the number of particles corresponding to a Block are both adjusted adaptively according to the particle dimension. The specific adaptive process follows equations (8)-(10):
WarpNum = DivUp(D, WarpSize)   (8)
ThreadNum = WarpNum * WarpSize   (9)
ParticleNum = DivDown(BlockSize, ThreadNum)   (10)
Step 3.3: D is the dimension of the problem to be solved and WarpSize is the size of one Warp in the CUDA architecture. The DivUp function divides D by WarpSize and rounds the quotient up, giving the number of Warps WarpNum corresponding to a particle. ThreadNum is the total number of threads actually used by each particle. BlockSize is the size of one Block in the CUDA architecture. The DivDown function divides BlockSize by ThreadNum and rounds the quotient down, giving the number of particles ParticleNum corresponding to a Block.
Step 3.4: before the kernel functions are called, the Block size BlockNum and the Grid size GridNum of CUDA must first be determined. To embody the characteristics of the adaptive thread-warp method, they are given by equations (11)-(12):
BlockNum = ThreadNum * ParticleNum   (11)
GridNum = DivUp(N, ParticleNum)   (12)
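Putting equations (8)-(12) together, a possible launch sequence is sketched below, reusing the illustrative helpers and kernels from the earlier sketches; the device arrays (d_X, d_V, ...) are assumed to be allocated elsewhere, and the shared-memory argument is only needed by the fitness/reduction kernel of step 2.2.
// Illustrative launch configuration derived from equations (8)-(12).
int WarpNum, ThreadNum, ParticleNum;
adaptiveWarpSizing(D, BlockSize, 32, WarpNum, ThreadNum, ParticleNum);   // (8)-(10)
int BlockNum = ThreadNum * ParticleNum;                                  // (11) threads per Block
int GridNum  = DivUp(N, ParticleNum);                                    // (12) Blocks per Grid

updateVelocityPosition<<<GridNum, BlockNum>>>(d_X, d_V, d_PBest, d_GBest, d_states,
                                              N, D, ThreadNum, ParticleNum, w, c1, c2);
particleFitnessReduce<<<GridNum, BlockNum, BlockNum * sizeof(float)>>>(d_X, d_Fit,
                                              N, D, ThreadNum, ParticleNum);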
This algorithm optimizes the existing GPU-based computation method by adjusting the parallel architecture so that the parallel efficiency becomes higher: an improved CUDA parallel architecture is designed and executed on a graphics processor (GPU). By mapping one thread to one dimension, one or more Warps to one particle, and one or more particles to one Block, the degree of parallelism of the particle swarm algorithm on a single host is further improved, and the speed-up ratio over the CPU is improved by a factor of more than 40 compared with the two preceding methods.
It can be seen from the above that this embodiment has the following advantages:
(1) The method provided by the invention can significantly shorten the time required by the PSO algorithm to solve a problem and improve the response speed of related application software; (2) with the method provided by the invention, a low-end CPU can be chosen as the host while a mid- or high-end GPU performs the computation, reaching the performance of multiple CPUs or even a cluster, thereby reducing power consumption and saving hardware cost.
The method described in this embodiment can be used in fields such as automatic path finding in games and image processing.
The specific embodiments described herein are merely illustrative of the spirit of the present invention. Those skilled in the art to which the invention belongs can make various modifications, supplements or substitutions to the described embodiments in a similar manner without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (7)

1. A GPU parallel particle swarm optimization method based on adaptive thread warps, characterized in that
the dimensions of each particle are divided into several thread warps, and a thread block is used to contain these warps, so that one thread block corresponds to one or more particles;
wherein the thread warp is the basic unit of scheduling and execution on a streaming multiprocessor (SM).
2. The GPU parallel particle swarm optimization method based on adaptive thread warps according to claim 1, characterized in that the number of thread warps WarpNum corresponding to a particle and the number of particles ParticleNum corresponding to a thread block are adjusted based on the following equations:
WarpNum = DivUp(D, WarpSize)   (8)
ThreadNum = WarpNum * WarpSize   (9)
ParticleNum = DivDown(BlockSize, ThreadNum)   (10)
where D is the dimension of the problem to be solved and WarpSize is the size of one thread warp in the CUDA architecture; the DivUp function divides D by WarpSize and rounds the quotient up, giving the number of warps WarpNum corresponding to a particle; ThreadNum is the total number of threads actually used by each particle; BlockSize is the size of one Block in the CUDA architecture; the DivDown function divides BlockSize by ThreadNum and rounds the quotient down, giving the number of particles ParticleNum corresponding to a Block.
3. The GPU parallel particle swarm optimization method based on adaptive thread warps according to claim 1, characterized in that, before the kernel functions are called,
the thread-block size BlockNum and the grid size GridNum of each kernel function are computed and initialized, based on the adaptive thread-warp algorithm, using the following equations:
BlockNum = ThreadNum * ParticleNum;
GridNum = DivUp(N, ParticleNum);
where ThreadNum is the total number of threads actually used by each particle, ParticleNum is the number of particles corresponding to a thread block, and N is the total number of particles in the swarm.
4. The GPU parallel particle swarm optimization method based on adaptive thread warps according to claim 1, characterized in that
three CUDA kernel functions are defined, used respectively to compute in parallel the velocity and position of the particles, the fitness value of each particle together with the best fitness value found so far by the particle itself and its corresponding solution, and the best fitness value found so far by the whole swarm and its corresponding solution.
5. The GPU parallel particle swarm optimization method based on adaptive thread warps according to claim 4, characterized in that it specifically includes the following steps:
Step 2.1: the kernel that computes the velocity and position of the particles: each GPU thread, according to the allocated thread-block size BlockNum and grid size GridNum, computes the velocity and position of its corresponding problem dimension using the update formulas of the particle swarm algorithm;
Step 2.2: the kernel that computes the fitness value of the next generation of particles and the best fitness value found by each particle itself together with its corresponding solution: the fitness value of every dimension of every particle is computed in parallel according to the allocated BlockNum and GridNum, the per-dimension fitness values are then combined by a parallel reduction algorithm to obtain the fitness value of each particle, and finally the particle's best fitness value and its corresponding solution are updated according to the fitness value obtained;
Step 2.3: the kernel that computes the best fitness value found so far by the whole swarm and its corresponding solution: the cublasI<t>amin() function of CUBLAS (where t is the data type of the operands) is used to obtain, on the GPU, the best fitness value found so far by the whole swarm and its corresponding solution.
6. The GPU parallel particle swarm optimization method based on adaptive thread warps according to claim 1, characterized in that the problem functions are initialized based on the following equations:
f_{Sphere}(x) = \sum_{d=1}^{D} x_d^2,  x_d \in [-100, 100];   (1)
f_{Rastrigrin}(x) = \sum_{d=1}^{D} [x_d^2 - 10\cos(2\pi x_d) + 10],  x_d \in [-5.12, 5.12];   (2)
f_{Rosenbrock}(x) = \sum_{d=1}^{D-1} [100(x_{d+1} - x_d^2)^2 + (x_d - 1)^2],  x_d \in [-10, 10];   (3)
where f_{Sphere} is the formula of the problem function Sphere, f_{Rastrigrin} is the formula of the problem function Rastrigrin, f_{Rosenbrock} is the formula of the problem function Rosenbrock, x is the problem-function variable and D is the dimension of the problem function.
7. The GPU parallel particle swarm optimization method based on adaptive thread warps according to claim 6, characterized in that the particle swarm is updated based on the following equations:
V_{id}(t+1) = w V_{id}(t) + c_1 r_1 (P^b_{id}(t) - X_{id}(t)) + c_2 r_2 (P^{gb}_d(t) - X_{id}(t));   (4)
X_{id}(t+1) = X_{id}(t) + V_{id}(t);   (5)
where V_{id} is the velocity of each particle, t is the current iteration number, w is the inertia weight coefficient of the swarm, c_1 and c_2 are the acceleration factors of the swarm, r_1 and r_2 are random numbers uniformly distributed in the interval [0, 1], P^b_{id} is the individual (personal) best of the particle, P^{gb}_d is the global best of the whole swarm, and X_{id} is the current position (solution) of the particle;
the particle swarm parameters w, c_1 and c_2 are updated based on the following equations:
w = 1/(2 ln 2);   (6)
c_1 = c_2 = 0.5 + ln 2;   (7).
CN201610976893.2A 2016-10-28 2016-10-28 A kind of GPU parallel particle swarm optimization method based on self-adaptive thread beam Active CN106502632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610976893.2A CN106502632B (en) 2016-10-28 2016-10-28 A kind of GPU parallel particle swarm optimization method based on self-adaptive thread beam

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610976893.2A CN106502632B (en) 2016-10-28 2016-10-28 A kind of GPU parallel particle swarm optimization method based on self-adaptive thread beam

Publications (2)

Publication Number Publication Date
CN106502632A true CN106502632A (en) 2017-03-15
CN106502632B CN106502632B (en) 2019-01-18

Family

ID=58323248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610976893.2A Active CN106502632B (en) 2016-10-28 2016-10-28 A kind of GPU parallel particle swarm optimization method based on self-adaptive thread beam

Country Status (1)

Country Link
CN (1) CN106502632B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578090A (en) * 2017-08-31 2018-01-12 上海爱优威软件开发有限公司 A kind of FPA realization method and systems based on CUDA platforms
CN108959810A (en) * 2018-07-24 2018-12-07 东北大学 A kind of Fast Identification Method, device and the continuous casting installation for casting of slab heat transfer parameter
CN109634830A (en) * 2018-12-19 2019-04-16 哈尔滨工业大学 A kind of CUDA program integration performance prediction method based on multiple features coupling
CN109741796A (en) * 2019-01-07 2019-05-10 厦门大学 A kind of Parallel Particle Swarm Optimization alloy nano particle structural optimization method and system
CN109992385A (en) * 2019-03-19 2019-07-09 四川大学 A kind of inside GPU energy consumption optimization method of task based access control balance dispatching
CN114138449A (en) * 2021-12-14 2022-03-04 河南省儿童医院郑州儿童医院 Rehabilitation training system based on virtual reality
CN114880082A (en) * 2022-03-21 2022-08-09 西安电子科技大学 Multithreading beam warp dynamic scheduling system and method based on sampling state

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604418A (en) * 2009-06-29 2009-12-16 浙江工业大学 Chemical enterprise intelligent production plan control system based on quanta particle swarm optimization
CN101819651A (en) * 2010-04-16 2010-09-01 浙江大学 Method for parallel execution of particle swarm optimization algorithm on multiple computers
CN102999756A (en) * 2012-11-09 2013-03-27 重庆邮电大学 Method for recognizing road signs by PSO-SVM (particle swarm optimization-support vector machine) based on GPU (graphics processing unit)
CN104680235A (en) * 2015-03-03 2015-06-03 江苏科技大学 Design method of resonance frequency of circular microstrip antenna
US20150201910A1 (en) * 2014-01-17 2015-07-23 Centre For Imaging Technology Commercialization (Cimtec) 2d-3d rigid registration method to compensate for organ motion during an interventional procedure
CN105718998A (en) * 2016-01-21 2016-06-29 上海斐讯数据通信技术有限公司 Particle swarm optimization method based on mobile terminal GPU operation and system thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604418A (en) * 2009-06-29 2009-12-16 浙江工业大学 Chemical enterprise intelligent production plan control system based on quanta particle swarm optimization
CN101819651A (en) * 2010-04-16 2010-09-01 浙江大学 Method for parallel execution of particle swarm optimization algorithm on multiple computers
CN102999756A (en) * 2012-11-09 2013-03-27 重庆邮电大学 Method for recognizing road signs by PSO-SVM (particle swarm optimization-support vector machine) based on GPU (graphics processing unit)
US20150201910A1 (en) * 2014-01-17 2015-07-23 Centre For Imaging Technology Commercialization (Cimtec) 2d-3d rigid registration method to compensate for organ motion during an interventional procedure
CN104680235A (en) * 2015-03-03 2015-06-03 江苏科技大学 Design method of resonance frequency of circular microstrip antenna
CN105718998A (en) * 2016-01-21 2016-06-29 上海斐讯数据通信技术有限公司 Particle swarm optimization method based on mobile terminal GPU operation and system thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈风等: "基于CUDA的并行粒子群优化算法研究及实现" [Chen Feng et al., "Research and implementation of a CUDA-based parallel particle swarm optimization algorithm"], 《计算机科学》 [Computer Science] *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578090A (en) * 2017-08-31 2018-01-12 上海爱优威软件开发有限公司 A kind of FPA realization method and systems based on CUDA platforms
CN108959810A (en) * 2018-07-24 2018-12-07 东北大学 A kind of Fast Identification Method, device and the continuous casting installation for casting of slab heat transfer parameter
CN108959810B (en) * 2018-07-24 2020-11-03 东北大学 Method and device for rapidly identifying heat transfer parameters of casting blank and continuous casting equipment
CN109634830A (en) * 2018-12-19 2019-04-16 哈尔滨工业大学 A kind of CUDA program integration performance prediction method based on multiple features coupling
CN109634830B (en) * 2018-12-19 2022-06-07 哈尔滨工业大学 CUDA program integration performance prediction method based on multi-feature coupling
CN109741796A (en) * 2019-01-07 2019-05-10 厦门大学 A kind of Parallel Particle Swarm Optimization alloy nano particle structural optimization method and system
CN109741796B (en) * 2019-01-07 2020-06-30 厦门大学 Parallel particle swarm alloy nanoparticle structure optimization method and system
CN109992385A (en) * 2019-03-19 2019-07-09 四川大学 A kind of inside GPU energy consumption optimization method of task based access control balance dispatching
CN109992385B (en) * 2019-03-19 2021-05-14 四川大学 GPU internal energy consumption optimization method based on task balance scheduling
CN114138449A (en) * 2021-12-14 2022-03-04 河南省儿童医院郑州儿童医院 Rehabilitation training system based on virtual reality
CN114880082A (en) * 2022-03-21 2022-08-09 西安电子科技大学 Multithreading beam warp dynamic scheduling system and method based on sampling state
CN114880082B (en) * 2022-03-21 2024-06-04 西安电子科技大学 Multithreading beam warp dynamic scheduling system and method based on sampling state

Also Published As

Publication number Publication date
CN106502632B (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN106502632A (en) A kind of GPU parallel particle swarm optimization methods based on self-adaptive thread beam
Klöckner et al. Nodal discontinuous Galerkin methods on graphics processors
Nageswaran et al. A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors
Yudanov et al. GPU-based simulation of spiking neural networks with real-time performance & high accuracy
CN106650925A (en) Deep learning framework Caffe system and algorithm based on MIC cluster
CN108564213A (en) Parallel reservoir group flood control optimal scheduling method based on GPU acceleration
CN114490011B (en) Parallel acceleration realization method of N-body simulation in heterogeneous architecture
Jeon et al. Parallel exact inference on a CPU-GPGPU heterogenous system
CN108984483A (en) The electric system sparse matrix method for solving and system reset based on DAG and matrix
CN105183562B (en) A method of rasterizing data are carried out based on CUDA technologies to take out rank
CN109165734A (en) Matrix local response normalization vectorization implementation method
Nagaoka et al. Multi-GPU accelerated three-dimensional FDTD method for electromagnetic simulation
Kohek et al. Interactive synthesis of self-organizing tree models on the GPU
Zhang et al. Ftsgd: An adaptive stochastic gradient descent algorithm for spark mllib
Pratas et al. Accelerating the computation of induced dipoles for molecular mechanics with dataflow engines
Akyol et al. Multi-machine earliness and tardiness scheduling problem: an interconnected neural network approach
Topa Cellular automata model tuned for efficient computation on GPU with global memory cache
Rajeswar et al. Scaling up the training of deep CNNs for human action recognition
Nazarifard et al. Efficient implementation of the Bellman-Ford algorithm on GPU
Kuźnik et al. Graph grammar-based multi-frontal parallel direct solver for two-dimensional isogeometric analysis
Ohmura et al. Multi-gpu acceleration of optical flow computation in visual functional simulation
Ward et al. Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster
Jain et al. Value Iteration on Multicore Processors
CN108985622A (en) A kind of electric system sparse matrix Parallel implementation method and system based on DAG
Kumar et al. Efficient training of convolutional neural nets on large distributed systems

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210406

Address after: No.8, Huyue East Road, Longchi street, Liuhe District, Nanjing City, Jiangsu Province

Patentee after: Nanjing Beidou innovation and Application Technology Research Institute Co.,Ltd.

Address before: 430072 Hubei Province, Wuhan city Wuchang District of Wuhan University Luojiashan

Patentee before: WUHAN University