Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The algorithm flow chart is shown in Figure 1:
A) For large-scale, high-density data computations, decompose the parallelizable computation process into independent computation sequences;
B) Assign the decomposed computation sequences to the computing cores of a multi-core processor for execution; during execution, a scheduling thread achieves load balancing and dynamically manages memory;
C) After the computation sequences finish executing, collect the result fragments and combine them into the complete computation result.
The following embodiment describes the GPGPU-based parallel acceleration of the SSTA (Statistical Static Timing Analysis) algorithm.
A) Decomposition according to the input and the computation process
First, description of the data input and the computation process: constructing the circuit timing graph
The present invention represents data dependency with a graph. The graph consists of a node set V and an edge set E: the nodes in V represent computing units and input/output ports, and the edges in E represent the dependencies between the computing units. Computing units with little or no data dependency on each other correspond to few or no connecting edges between them.
In block-based SSTA computation, the circuit netlist must first be converted into a timing graph that reflects the circuit topology. The timing graph consists of a node set V and an edge set E: the nodes in V represent the gates and the inputs and outputs in the circuit netlist, and the edges in E represent the connections between the gates and the inputs and outputs in the netlist. After the timing graph is generated, the SSTA method traverses it and performs SUM (summation) and MAX (maximum) operations on the different configurations of each node to obtain the analysis result. To parallelize the SUM and MAX operations over the configurations of each node using the GPGPU development environment CUDA (Compute Unified Device Architecture), the timing graph of the circuit must be placed in the video memory of the GPU, where it is accessed by lookup to obtain the data needed by the SUM and MAX operations. After the main program reads in the circuit netlist, the CPU side generates the timing graph; the next operation required is to port this CPU-side timing graph to the GPU side.
The timing graph is normally a directed acyclic graph. Because the video memory of the GPU was optimized for image rendering, it still cannot effectively support custom data structures. Under the CUDA model, however, the video memory of the GPU can be used as generic arrays, which supports more effective data structures than earlier GPGPU approaches. The present invention describes the timing graph with adjacency arrays of space complexity O(V+E). The adjacency-array representation of the timing graph generally consists of two arrays: a node array containing the node information and an edge array containing the edge information. Converting the CPU-side timing graph into the adjacency-array form used by the CUDA model on the GPU side takes two steps:
1. Perform a breadth-first traversal of the nodes in the CPU-side timing graph, numbering and layering the nodes during the traversal;
2. Visit the nodes of the CPU-side timing graph level by level, adding the node information and edge information to the node array Va and the edge array Ea.
When visiting a node v of the timing graph, the following operations add its information to Va and Ea:
i. Obtain the number of v in the timing graph and set it as the index of v in Va;
ii. Traverse the predecessor nodes of v and append their numbers to the edge array Ea;
iii. Set the value of v in Va to the index in Ea of the first predecessor of v; if v has no predecessors, set it to -1.
During the conversion, either the predecessor node numbers or the driven node numbers can be recorded in the edge array. In block-based SSTA, the SUM and MAX operations need the information of the predecessor nodes, so the predecessor node numbers are recorded in the edge array. After the CPU-side timing graph has been processed, the resulting adjacency-array representation of the timing graph is transferred into the video memory on the GPU side, and block-based SSTA computation can then be carried out on the GPU.
Second, decomposition of the arrival time computation
In SSTA, the arrival time (AT) at the output of a gate is computed as follows: first, for each input pin i, sum (SUM) the arrival time at input pin i and the delay from input pin i to the output; then take the maximum (MAX) over all the sums to obtain the arrival time at the output of the current gate:

AT = MAX{ AT_in_i + Delay_in_i->out }, i = 0, ..., n-1

where n denotes the number of inputs. To obtain the statistical distribution of the arrival time, the arrival time must be computed for multiple configurations. A large number of SUM and MAX operations must therefore be performed on the node configurations; since the different configurations are data-independent, the SUM and MAX operations of the individual configurations can be separated and computed in parallel, exploiting the parallelism of the GPU to accelerate the process. If the gate currently being computed has n inputs, each configuration requires a total of n summations and n-1 maximum operations.
B) Assigning each computation sequence to the computing cores of the multi-core processor
Each computing unit obtained in the previous step is assigned to a computing core of the multi-core processor for parallel execution. During execution, a scheduling thread achieves load balancing, and memory is managed dynamically, achieving memory alignment. This specifically comprises:
First, the computation steps of each computing core
Below is the pseudocode for performing the SUM and MAX computation on the configurations of each node in CUDA; one thread corresponds to the computation of one configuration of one node. The whole process needs five input parameters: the pointer float* AT to the global memory location, allocated in video memory, that stores the arrival time of each node; the pointers int* Va and int* Ea to the node set and edge set of the timing graph stored in video memory; the pointer int* type to the global memory location that stores the gate type corresponding to each node; and the texture memory textureMem DEL that stores the pin-to-pin delays of all gate types. The latter four data items are copied from CPU memory into the video memory on the GPU side. Once these parameters have been obtained, the computation proceeds in the following steps.
ssta_kernel(float* AT, int* Va, int* Ea, int* type, textureMem DEL) {
    thread_id   = get_thread_id();
    virtex_id   = get_vid(Va, thread_id);
    sample_id   = get_sid(thread_id);
    pre_vids[]  = get_pre_ids(Ea, virtex_id);
    virtex_type = get_type(type, virtex_id);
    del[]       = get_del(DEL, virtex_type, sample_id);
    pre_at[]    = get_at(AT, pre_vids[], sample_id);
    sum[]       = perform_SUM(del[], pre_at[]);
    cur_at      = perform_MAX(sum[]);
    store_at_to_AT(cur_at, AT, sample_id, virtex_id);
}
1. get_thread_id(): obtains the ID of the thread. CUDA provides the built-in variables blockIdx and threadIdx, which give the index of the block (computing core) and the index of the thread within the block, from which the thread ID can be computed very easily.
2. get_vid(Va, thread_id): from the thread ID, finds in the node array Va the index in the timing graph of the node corresponding to the current thread, and the start position in Ea of that node's incoming edges.
3. get_sid(thread_id): from the thread ID, finds the index of the configuration computed by the current thread. Because each node requires the computation of multiple configurations and each thread computes a different one, the current thread must obtain the index of its configuration to decide which configuration's data to compute.
4. get_pre_ids(Ea, virtex_id): from the node index, finds in the edge array Ea the indices in the timing graph of the predecessor nodes of the current thread's node, because in STA computation the arrival time of the current node is computed from the arrival times of its predecessor nodes. Since the number of predecessor nodes may be greater than 1, the result is an array. Note that what Va stores is the start position in Ea of the current node's incoming edges, so the number of predecessor nodes can be obtained by subtracting the current node's start position in Ea from the next node's start position in Ea. By appending to the end of Va a value pointing to the position one past the last incoming edge of the last node in Ea, the predecessor count of the last node can be obtained in the same way.
5. get_type(type, virtex_id): from the node index, finds in the type array type the gate type corresponding to the node.
6. get_del(DEL, virtex_type, sample_id): from the node type and configuration index, obtains from the texture memory DEL, which stores the pin-to-pin delays of the gates, the pin-to-pin delay data of the current thread's gate in the corresponding configuration. Texture memory is used to store the gate delay data because these data are read-only during the computation, and read-only texture memory reduces the access overhead of the threads. The delay covers the delays from each input to the output, so it is an array.
7. get_at(AT, pre_vids[], sample_id): from the predecessor node indices and the configuration index, obtains from the arrival time array AT the arrival times, in the current configuration, of the predecessor nodes of the current thread's node. Obviously, since the number of predecessor nodes may be greater than 1, the predecessor arrival times also form an array.
8. perform_SUM(del[], pre_at[]): the summation, adding the pin-to-pin delays of the gate to the arrival times of the predecessor nodes.
9. perform_MAX(sum[]): the maximum operation, comparing the sums from the previous step and taking the maximum, which is the arrival time of the current configuration.
10. store_at_to_AT(cur_at, AT, sample_id, virtex_id): stores the computed result back into the arrival time array AT.
Second, load balancing during the computation
In CUDA, each block can hold at most 512 threads, so computing 64k configurations would keep all running blocks fully loaded without any need to consider thread scheduling on the multiprocessors. The present invention, however, uses the sparse grid method to reduce the number of configurations, making the number of computations much smaller than that of the Monte Carlo method; the number of configurations per node can be as low as 11, far below the per-block thread limit. Since the best working state of a block is to have at least 256 threads running, keeping the blocks fully loaded becomes the key factor in improving operating efficiency. Computing the configurations of several nodes simultaneously in one block greatly improves the utilization of the blocks and the multiprocessors and raises the efficiency of the computation.
In CUDA, the dimension and size of the block can be specified by blockDim and the dimension and size of the grid by gridDim when the kernel is called. The present invention sets the block to two dimensions with size (m, n), where m × n ≤ 512, and the grid to one dimension with size (a, 1), where a × m ≥ the node count. Here n is the number of configurations per node, m is the number of nodes handled by each block, and a is the number of blocks in the grid. In the kernel, the index of the node currently being computed is obtained as blockIdx.x × blockDim.y + threadIdx.y, and threadIdx.x is the index of the configuration currently being computed. The benefit of computing this way is that when the configuration count changes but remains no greater than 512, adjusting only m, n, and a adapts to the change; and when the configuration count exceeds 512, gridDim can be set so that the thread grid is (a, b), with n × b ≥ the configuration count, and the configuration index then becomes blockIdx.y × blockDim.x + threadIdx.x. This makes adjusting the configuration count convenient and improves computational efficiency.
Third, dynamic memory management during the computation
The method of the invention uses global memory to store the arrival time of each node. The management of global memory has a large influence on operating efficiency, because a read or write of global memory usually takes 400-600 clock cycles. Using aligned access patterns makes accesses to global memory reach top efficiency. Since CUDA threads read and write global memory in units of a half-warp, i.e. 16 threads, reading or writing an aligned global memory location requires only one global memory access, whereas an unaligned one requires 16 and causes a huge waste of overhead.
To access global memory in an aligned manner, the present invention uses the CUDA built-in function cudaMallocPitch(devPtr, pitch, width, height) to allocate global memory. The space allocated by cudaMallocPitch can be thought of as two-dimensional: width is the configuration count n multiplied by the corresponding data width, height is the node count m, and pitch is the size actually allocated along the width dimension, so the allocated space is aligned with pitch as its effective width; the function returns, through devPtr, the pointer to the allocated space. In the implementation of the present invention the call is cudaMallocPitch(&AT, &pitch, n*sizeof(float), m). In the kernel, pitch can then be used to find the addresses of aligned global memory very conveniently, so the contiguous section of global memory accessed by the threads is aligned, improving computational efficiency.
The present invention uses texture memory to store the gate delays, because the gate delays are read-only and need not be written during kernel execution. Texture memory is read-only and cached, so the overhead of accessing it is minimal, as little as one clock cycle, making it a very efficient access pattern. Constant memory could realize the same function equally well, with identical effect.
C) Collecting the individual results and recombining them into the complete result
Because of the SIMD (Single Instruction Multiple Data) programming characteristic of the GPU, all threads execute the same instructions; only the data they operate on differ. In the block-based SSTA computation, there are no data or control dependencies between the repeatedly invoked SUM and MAX computations. Therefore, the individual results obtained in the previous step are recombined into the complete result according to the correspondences in the graph representation: the complete SUM result is obtained simply by adding up the SUM results of the individual computing cores, and the complete MAX result is obtained by a simple comparison taking the maximum of the MAX results of the individual computing cores. This yields the longest path delay of the circuit and completes the SSTA computation.