Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The algorithm flow chart is shown in Figure 1:
A) For large-scale, high-density data computations, decompose the parallelizable computation process into independent computation sequences;
B) Assign the decomposed computation sequences to the computing cores of a multi-core processor for execution; during execution, a scheduling thread achieves load balancing and dynamically manages memory;
C) After the computation sequences finish executing, collect the result fragments and combine them into the complete computation result.
The following embodiment describes the GPGPU-based parallel acceleration of the SSTA (Statistical Static Timing Analysis) algorithm.
A) Decomposition according to the input and the computation process
First, description of the data input and the computation process: constructing the circuit timing graph
The present invention represents data dependency with a graph. The graph consists of a node set V and an edge set E: the nodes in V represent computing units and input/output ports, and the edges in E represent the dependencies between the computing units. Computing units with little or no data dependency on each other correspond to few or no connecting edges between them.
In block-based SSTA computation, the circuit netlist must first be converted into a timing graph that reflects the circuit topology. The timing graph consists of a node set V and an edge set E: the nodes in V represent the gates and the inputs and outputs in the circuit netlist, and the edges in E represent the connections between the gates and the inputs and outputs in the netlist. After the timing graph is generated, the SSTA method traverses it and performs SUM (summation) and MAX (maximum) operations on the different configurations of each node to obtain the analysis result. To parallelize the SUM and MAX operations over the configurations of each node using the GPGPU development environment CUDA (Compute Unified Device Architecture), the timing graph of the circuit must be placed in the video memory of the GPU, where it is accessed by lookup to obtain the data needed by the SUM and MAX operations. After the main program reads in the circuit netlist, the CPU side generates the timing graph; the next operation required is to port this CPU-side timing graph to the GPU side.
The timing graph is normally a directed acyclic graph. Because the video memory of the GPU was optimized for image rendering, it still cannot effectively support custom data structures. Under the CUDA model, however, the video memory of the GPU can be used as generic arrays, which supports more effective data structures than earlier GPGPU approaches. The present invention describes the timing graph with adjacency arrays of space complexity O(V+E). The adjacency-array representation of the timing graph generally consists of two arrays: a node array containing the node information and an edge array containing the edge information. Converting the CPU-side timing graph into the adjacency-array form used by the CUDA model on the GPU side takes two steps:
1. Perform a breadth-first traversal of the nodes in the CPU-side timing graph, numbering and layering the nodes during the traversal;
2. Visit the nodes of the CPU-side timing graph level by level, adding the node information and edge information to the node array Va and the edge array Ea.
When visiting a node v of the timing graph, the following operations add its information to Va and Ea:
i. Obtain the number of v in the timing graph and set it as the index of v in Va;
ii. Traverse the predecessor nodes of v and append their numbers to the edge array Ea;
iii. Set the value of v in Va to the index in Ea of the first predecessor of v; if v has no predecessors, set it to -1.
During the conversion, either the predecessor node numbers or the driven node numbers can be recorded in the edge array. In block-based SSTA, the SUM and MAX operations need the information of the predecessor nodes, so the predecessor node numbers are recorded in the edge array. After the CPU-side timing graph has been processed, the resulting adjacency-array representation of the timing graph is transferred into the video memory on the GPU side, and block-based SSTA computation can then be carried out on the GPU.
Second, decomposition of the arrival time computation
In SSTA, the arrival time (AT) at the output of a gate is computed as follows: first, for each input pin i, sum (SUM) the arrival time at input pin i and the delay from input pin i to the output; then take the maximum (MAX) over all the sums to obtain the arrival time at the output of the current gate:

AT = MAX{ AT_in_i + Delay_in_i->out }, i = 0, ..., n-1

where n denotes the number of inputs. To obtain the statistical distribution of the arrival time, the arrival time must be computed for multiple configurations. A large number of SUM and MAX operations must therefore be performed on the node configurations; since the different configurations are data-independent, the SUM and MAX operations of the individual configurations can be separated and computed in parallel, exploiting the parallelism of the GPU to accelerate the process. If the gate currently being computed has n inputs, each configuration requires a total of n summations and n-1 maximum operations.
B) Assigning each computation sequence to the computing cores of the multi-core processor
Each computing unit obtained in the previous step is assigned to a computing core of the multi-core processor for parallel execution. During execution, a scheduling thread achieves load balancing, and memory is managed dynamically, achieving memory alignment. This specifically comprises:
First, the computation steps of each computing core
Below is the pseudocode for performing the SUM and MAX computation on the configurations of each node in CUDA; one thread corresponds to the computation of one configuration of one node. The whole process needs five input parameters: the pointer float* AT to the global memory location, allocated in video memory, that stores the arrival time of each node; the pointers int* Va and int* Ea to the node set and edge set of the timing graph stored in video memory; the pointer int* type to the global memory location that stores the gate type corresponding to each node; and the texture memory textureMem DEL that stores the pin-to-pin delays of all gate types. The latter four data items are copied from CPU memory into the video memory on the GPU side. Once these parameters have been obtained, the computation proceeds in the following steps.
ssta_kernel(float* AT, int* Va, int* Ea, int* type, textureMem DEL) {
    thread_id   = get_thread_id();
    virtex_id   = get_vid(Va, thread_id);
    sample_id   = get_sid(thread_id);
    pre_vids[]  = get_pre_ids(Ea, virtex_id);
    virtex_type = get_type(type, virtex_id);
    del[]       = get_del(DEL, virtex_type, sample_id);
    pre_at[]    = get_at(AT, pre_vids[], sample_id);
    sum[]       = perform_SUM(del[], pre_at[]);
    cur_at      = perform_MAX(sum[]);
    store_at_to_AT(cur_at, AT, sample_id, virtex_id);
}
1. get_thread_id(): obtains the ID of the thread. CUDA provides the built-in variables blockIdx and threadIdx, which give the index of the block (computing core) and the index of the thread within the block, from which the thread ID can be computed very easily.
2. get_vid(Va, thread_id): from the thread ID, finds in the node array Va the index in the timing graph of the node corresponding to the current thread, and the start position in Ea of that node's incoming edges.
3. get_sid(thread_id): from the thread ID, finds the index of the configuration computed by the current thread. Because each node requires the computation of multiple configurations and each thread computes a different one, the current thread must obtain the index of its configuration to decide which configuration's data to compute.
4. get_pre_ids(Ea, virtex_id): from the node index, finds in the edge array Ea the indices in the timing graph of the predecessor nodes of the current thread's node, because in STA computation the arrival time of the current node is computed from the arrival times of its predecessor nodes. Since the number of predecessor nodes may be greater than 1, the result is an array. Note that what Va stores is the start position in Ea of the current node's incoming edges, so the number of predecessor nodes can be obtained by subtracting the current node's start position in Ea from the next node's start position in Ea. By appending to the end of Va a value pointing to the position one past the last incoming edge of the last node in Ea, the predecessor count of the last node can be obtained in the same way.
5. get_type(type, virtex_id): from the node index, finds in the type array type the gate type corresponding to the node.
6. get_del(DEL, virtex_type, sample_id): from the node type and configuration index, obtains from the texture memory DEL, which stores the pin-to-pin delays of the gates, the pin-to-pin delay data of the current thread's gate in the corresponding configuration. Texture memory is used to store the gate delay data because these data are read-only during the computation, and read-only texture memory reduces the access overhead of the threads. The delay covers the delays from each input to the output, so it is an array.
7. get_at(AT, pre_vids[], sample_id): from the predecessor node indices and the configuration index, obtains from the arrival time array AT the arrival times, in the current configuration, of the predecessor nodes of the current thread's node. Obviously, since the number of predecessor nodes may be greater than 1, the predecessor arrival times also form an array.
8. perform_SUM(del[], pre_at[]): the summation, adding the pin-to-pin delays of the gate to the arrival times of the predecessor nodes.
9. perform_MAX(sum[]): the maximum operation, comparing the sums from the previous step and taking the maximum, which is the arrival time of the current configuration.
10. store_at_to_AT(cur_at, AT, sample_id, virtex_id): stores the computed result back into the arrival time array AT.
Second, load balancing during the computation
In CUDA, each block can hold at most 512 threads, so computing 64k configurations would keep all running blocks fully loaded without any need to consider thread scheduling on the multiprocessors. The present invention, however, uses the sparse grid method to reduce the number of configurations, making the number of computations much smaller than that of the Monte Carlo method; the number of configurations per node can be as low as 11, far below the per-block thread limit. Since the best working state of a block is to have at least 256 threads running, keeping the blocks fully loaded becomes the key factor in improving operating efficiency. Computing the configurations of several nodes simultaneously in one block greatly improves the utilization of the blocks and the multiprocessors and raises the efficiency of the computation.
In CUDA, the dimension and size of the block can be specified by blockDim and the dimension and size of the grid by gridDim when the kernel is called. The present invention sets the block to two dimensions with size (m, n), where m × n ≤ 512, and the grid to one dimension with size (a, 1), where a × m ≥ the node count. Here n is the number of configurations per node, m is the number of nodes handled by each block, and a is the number of blocks in the grid. In the kernel, the index of the node currently being computed is obtained as blockIdx.x × blockDim.y + threadIdx.y, and threadIdx.x is the index of the configuration currently being computed. The benefit of computing this way is that when the configuration count changes but remains no greater than 512, adjusting only m, n, and a adapts to the change; and when the configuration count exceeds 512, gridDim can be set so that the thread grid is (a, b), with n × b ≥ the configuration count, and the configuration index then becomes blockIdx.y × blockDim.x + threadIdx.x. This makes adjusting the configuration count convenient and improves computational efficiency.
Third, dynamic memory management during the computation
The method of the invention uses global memory to store the arrival time of each node. The management of global memory has a large influence on operating efficiency, because a read or write of global memory usually takes 400-600 clock cycles. Using aligned access patterns makes accesses to global memory reach top efficiency. Since CUDA threads read and write global memory in units of a half-warp, i.e. 16 threads, reading or writing an aligned global memory location requires only one global memory access, whereas an unaligned one requires 16 and causes a huge waste of overhead.
To access global memory in an aligned manner, the present invention uses the CUDA built-in function cudaMallocPitch(devPtr, pitch, width, height) to allocate global memory. The space allocated by cudaMallocPitch can be thought of as two-dimensional: width is the configuration count n multiplied by the corresponding data width, height is the node count m, and pitch is the size actually allocated along the width dimension, so the allocated space is aligned with pitch as its effective width; the function returns, through devPtr, the pointer to the allocated space. In the implementation of the present invention the call is cudaMallocPitch(&AT, &pitch, n*sizeof(float), m). In the kernel, pitch can then be used to find the addresses of aligned global memory very conveniently, so the contiguous section of global memory accessed by the threads is aligned, improving computational efficiency.
The present invention uses texture memory to store the gate delays, because the gate delays are read-only and need not be written during kernel execution. Texture memory is read-only and cached, so the overhead of accessing it is minimal, as little as one clock cycle, making it a very efficient access pattern. Constant memory could realize the same function equally well, with identical effect.
C) Collecting the individual results and recombining them into the complete result
Because of the SIMD (Single Instruction Multiple Data) programming characteristic of the GPU, all threads execute the same instructions; only the data they operate on differ. In the block-based SSTA computation, there are no data or control dependencies between the repeatedly invoked SUM and MAX computations. Therefore, the individual results obtained in the previous step are recombined into the complete result according to the correspondences in the graph representation: the complete SUM result is obtained simply by adding up the SUM results of the individual computing cores, and the complete MAX result is obtained by a simple comparison taking the maximum of the MAX results of the individual computing cores. This yields the longest path delay of the circuit and completes the SSTA computation.