CN101840329B - Data parallel processing method based on graph topological structure - Google Patents

Data parallel processing method based on graph topological structure

Info

Publication number
CN101840329B
CN101840329B CN201010150568A
Authority
CN
China
Prior art keywords
data
task
execution
application program
topological structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010150568A
Other languages
Chinese (zh)
Other versions
CN101840329A (en)
Inventor
许端清
杨鑫
赵磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201010150568A priority Critical patent/CN101840329B/en
Publication of CN101840329A publication Critical patent/CN101840329A/en
Application granted granted Critical
Publication of CN101840329B publication Critical patent/CN101840329B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a data parallel processing method based on a graph topological structure, which comprises the following steps: (1) dividing an application program for data processing into a number of execution behaviors; (2) dividing all execution behaviors into several tasks according to the basic operation type that each execution behavior performs on data; (3) dividing the data to be processed by the application program into static data and dynamic data; (4) establishing a queue for each task of step (2), and storing the data to be processed by each task in the corresponding data queue in the form of data packets; and (5) establishing the graph topological structure according to the flow direction of the data generated when the tasks execute and the location of the data that needs to be scheduled, and executing the application program according to the graph topological structure. The invention has the advantages that a program can make the most effective use of the computational and storage resources of the hardware, high-performance parallel application programs can be developed quickly and conveniently, the progress and efficiency of program development are greatly accelerated, and development and research expenses are saved.

Description

Data parallel processing method based on a graph topological structure
Technical field
The present invention relates to the field of parallel computing, and in particular to a general data parallel processing method based on a heterogeneous multi-core architecture.
Background art
With the rapid development of science and technology, high-performance computing has become a research means of strategic significance in scientific and technological progress. Together with traditional theoretical research and laboratory experiment, it forms a set of complementary, interrelated research methods in modern science and engineering design, and the three are known worldwide as the "pillars" of 21st-century scientific research. Applications of high-performance computers are concentrated in scientific research and development, telecommunications, finance, government and other fields, and their contribution is considerable; to accelerate informatization, a growing number of fields are applying high-performance computing technology. High-performance computing greatly increases the speed of computation and shortens development and production cycles. Its application has broadened research capabilities and promoted the development of modern science and engineering. Accelerating the development of high-performance computing is of crucial strategic importance for promoting China's capability for independent scientific and technological innovation, enhancing national competitiveness, safeguarding national security, promoting economic development, and building an innovation-oriented country.
In the evolution of high-performance computing, minicomputers based on the RISC architecture once dominated the market; later, with the development of the x86 architecture, x86 machines with an overwhelming price advantage finally replaced minicomputers in the form of clusters. Although part of the mass-computation problem can be solved by building distributed systems, distributed systems suffer from large communication overheads, high failure rates, complex and expensive data access structures, and difficulties in controlling data security and confidentiality. With the rapid improvement in the computing power of processors, in particular the GPU (Graphics Processing Unit), and their low price, high-performance computing is gradually entering the desktop (low-end) field, so that every researcher, scientist and engineer can own a supercomputer, solve problems faster, and accelerate the pace of scientific development. A present-day GPU contains hundreds of processing units, reaches a performance of 1 TFLOPS for single-precision floating-point operations and more than 80 GFLOPS for double-precision floating-point operations, and can have 4 GB of video memory with a bandwidth exceeding 100 GB/s. Although the GPU was originally designed specifically for graphics computation, GPUs, which are particularly suitable for large-scale parallel computation, have rapidly appeared in many non-graphics high-performance computing fields thanks to their powerful computing performance, low energy consumption, low price and small footprint. Nowadays many important scientific and engineering projects are attempting to add GPU computing power to their codes, and software engineers eagerly expect their work to obtain remarkable performance through the GPU.
However, most existing application programs cannot obtain an immediate performance improvement when ported directly to the GPU, and performance may even decrease. This is mainly because these programs and their structures were not designed for the architectural characteristics of the GPU and cannot exploit its full computing power. Making an application program carry out data processing efficiently in parallel is usually a complicated and time-consuming task.
Summary of the invention
The invention provides a data parallel processing method based on a graph topological structure, to solve the problem that the uncertain, dynamically irregular execution behaviors and irregular data structures which a complicated application program may exhibit during execution are difficult to run efficiently on current parallel hardware architectures, so that the application program can make the most effective use of the computational and storage resources of the hardware while performing data parallel processing.
The uncertain, dynamically irregular behaviors referred to above mean that the execution of many algorithms is non-deterministic. For example, in recursive function execution, each iteration must be analysed dynamically according to the result of the preceding iteration; an iteration may spawn new tasks or may terminate, so its execution is uncertain and cannot be determined statically before the program runs. Moreover, providing a recursion space for thousands of active threads requires an enormous amount of storage, which is also difficult to support efficiently in existing data-parallel programming models. In addition, for irregular data structures such as trees and linked lists, the distribution and storage of the data cannot be determined at compile time, and the layout of data-local storage is difficult to determine at run time, so existing programming models have difficulty handling such irregular data structures efficiently.
A data parallel processing method based on a graph topological structure, executed in a computer having a GPU and a CPU processor:
(1) The application program that performs data processing is divided into a number of execution behaviors;
Each execution behavior performs at least one basic operation on data, for example a data access or a data store, or one calculation operation;
(2) According to the basic operation type and calculation type of the execution behaviors on the data, all execution behaviors are divided into several tasks, i.e. similar execution behaviors are assigned to the same calculation task;
Similar execution behaviors are behaviors that have the same calculation operation or similar storage operations; similar storage operations are operations whose data accesses remain within a local range of the storage area. This division satisfies the SIMD (Single Instruction, Multiple Data) execution characteristic of the hardware and the local access characteristic of the storage.
Each task accomplishes a specified computation and, during division, is kept as short and single-purpose as possible; depending on the circumstances, tasks can be executed in parallel or serially.
(3) The data to be processed by the application program is divided into static data and dynamic data; a storage space (storage pool) is allocated in the video memory of the computer that executes the application program, and within this storage space separate storage regions are allocated for static data and dynamic data, i.e. one part of the storage space is used to store static data and the remaining space is used to store dynamic data.
Static data means data that does not change during the execution of the application program, and dynamic data means new data produced during the execution of the application program. All of this information is recorded in a configuration file in advance;
(4) According to the way they process data, the tasks of step (2) are divided into calculation-type tasks and logic-judgement-type tasks; calculation-type tasks run on the GPU, and logic-judgement-type tasks run on the CPU.
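As a concrete illustration of this task classification, the following minimal C++ sketch (the type and function names are hypothetical; the patent does not prescribe any particular API) tags each task with its processing mode and partitions the task set into a GPU group and a CPU group before execution:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical task descriptor: calculation-type tasks are meant for the GPU,
// logic-judgement-type tasks for the CPU, as in step (4) of the method.
enum class TaskKind { Calculation, LogicJudgement };

struct Task {
    std::string           name;  // e.g. "ray generation", "traversal", "ray sorting"
    TaskKind              kind;
    std::function<void()> body;  // kernel launch (GPU) or host routine (CPU)
};

// Split the task set according to its processing mode on the data.
void partition(const std::vector<Task>& all,
               std::vector<Task>& gpuTasks,
               std::vector<Task>& cpuTasks) {
    for (const Task& t : all) {
        (t.kind == TaskKind::Calculation ? gpuTasks : cpuTasks).push_back(t);
    }
}
```

In a full implementation the calculation-type bodies would be GPU kernel launches (for example through CUDA or PTX, as in the embodiment described later), while the logic-judgement bodies remain ordinary host code.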
In step (4), a data queue is established for each calculation-type task of step (2), and the data to be processed by each task is stored in the corresponding data queue in the form of data packets;
Since a single task cannot access all the static data in the storage pool, in order to reduce the number of times data is moved within the storage area while a task executes, the data packets can be subdivided into several sub-queues according to the position distribution of the static data in the storage space, such that the static data corresponding to the packets of the same sub-queue are adjacent in the storage space. In addition, other classification principles can be used for further subdivision according to the concrete characteristics of different applications, so that SIMD operations and local memory accesses are carried out more effectively. Some application programs (for example, some graphics algorithms) need to process packets in order; therefore a mark is placed in each data queue, and packets that require sequential processing are handled in first-in-first-out (FIFO) order, while other packets are handled out of order.
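The following C++ sketch illustrates, under assumed types and names (Packet and DataQueue are hypothetical), how such a per-task data queue might keep sub-queues keyed by static-data region and honour the FIFO mark:

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <utility>
#include <vector>

// Hypothetical packet of work items waiting for one calculation task.
struct Packet {
    std::vector<int> items;        // indices of the data elements to process
    std::size_t      staticRegion; // region of the static-data area the packet touches
};

// Per-task data queue.  When the application needs ordered processing the
// queue behaves as plain FIFO; otherwise packets are grouped into sub-queues
// keyed by the static-data region they reference, so packets served together
// access neighbouring memory (local access, SIMD-friendly).
class DataQueue {
public:
    explicit DataQueue(bool fifo) : fifo_(fifo) {}

    void push(Packet p) {
        if (fifo_) ordered_.push_back(std::move(p));
        else       sub_[p.staticRegion].push_back(std::move(p));
    }

    bool pop(Packet& out) {
        if (fifo_) {
            if (ordered_.empty()) return false;
            out = std::move(ordered_.front());
            ordered_.pop_front();
            return true;
        }
        if (sub_.empty()) return false;
        auto it = sub_.begin();               // any region will do; take the first
        out = std::move(it->second.front());
        it->second.pop_front();
        if (it->second.empty()) sub_.erase(it);
        return true;
    }

private:
    bool fifo_;
    std::deque<Packet>                        ordered_; // used in FIFO mode
    std::map<std::size_t, std::deque<Packet>> sub_;     // sub-queues by region, out-of-order mode
};
```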
In summary, by using queues, the data required or produced by a task can be organised dynamically according to a similarity principle, where similarity specifically means storage similarity, structural similarity and processing similarity; as a result, a task can obtain data as needed and expensive data transfer operations are reduced, which is particularly suitable for application programs that need to access data frequently or access massive amounts of data.
(5) According to the flow direction of the data generated when the tasks execute, and the location (the data queue in which the data resides) of the data that needs to be called, the graph topological structure is established, and the application program is executed according to the graph topological structure. While the application program executes, the tasks communicate and exchange data with one another through their corresponding data queues.
By using this graph topological structure, when a calculation task with dynamic execution behaviors encounters recursive behavior, the data can be transferred, along the flow direction of the data in the graph topological structure, to a calculation task that can carry out this recursive behavior, and the current calculation task is terminated at the same time. This avoids the recursion itself: the threads executing in parallel can all finish with the same instruction, execution branches are avoided, and the parallel computation resources of the hardware are used effectively.
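A minimal sketch of this idea, with hypothetical types (Node, Packet, runGraph), is given below: calculation tasks are graph nodes, each node owns a data queue, and a task that would otherwise recurse instead emits a packet along an edge of the graph (possibly back to its own queue) and terminates:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Data packet flowing along the edges of the task graph.
struct Packet { std::vector<int> items; };

struct Node {
    // Process one packet; for every output, report which outgoing edge it
    // flows along and the packet it carries via emit(edgeIndex, packet).
    std::function<void(const Packet&,
                       const std::function<void(std::size_t, Packet)>&)> run;
    std::vector<std::size_t> out;   // indices of successor nodes
    std::deque<Packet>       inbox; // this node's data queue
};

// Execute the graph until every queue drains.  A task that would recurse
// instead emits a packet back into the graph (possibly to itself), so all
// threads of one launch can retire together instead of branching.
void runGraph(std::vector<Node>& g) {
    bool busy = true;
    while (busy) {
        busy = false;
        for (Node& n : g) {
            while (!n.inbox.empty()) {
                busy = true;
                Packet p = std::move(n.inbox.front());
                n.inbox.pop_front();
                n.run(p, [&](std::size_t edge, Packet q) {
                    g[n.out[edge]].inbox.push_back(std::move(q));
                });
            }
        }
    }
}
```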
The present invention is specially optimised for complicated algorithms with dynamically changing execution behaviors and irregular data structures; data can be managed dynamically at run time according to the principle of storage locality and the SIMD operation mechanism, so that a program can make the most effective use of the computational and storage resources of the hardware. Using the method of the invention, high-performance parallel application programs can be developed quickly and conveniently, which greatly accelerates the progress and efficiency of program development and saves research and development expenses.
The invention can be implemented on relatively mature heterogeneous multi-core architectures, such as the Fermi architecture recently released by NVIDIA, or the Larrabee architecture that Intel was preparing to release; such architectures generally offer a floating-point capability above 1 TFLOPS, more than 20 cores, hundreds of hardware threads and a complicated memory hierarchy.
Description of drawings
Fig. 1 compares the computation cost and the management cost required when the Bunny, Fairy and BART Kitchen scenes are rendered using CUDA and using the model of the present invention, respectively.
Embodiment
A PC equipped with a quad-core Intel Xeon 3.7 GHz CPU and an Nvidia GTX 285 (1 GB video memory) was selected to verify the feasibility of the invention. A set of dynamic link libraries implementing the above method was realised on the basis of the PTX instruction set; a ray tracing algorithm from graphics, which contains a large amount of dynamically irregular behavior, was redesigned and rewritten according to the method proposed by the invention, the result was compared with that of code written with Nvidia's CUDA programming model, and the following analysis was made.
The application program is divided into several calculation tasks. In order to satisfy the SIMD/SIMT operation and local memory access characteristics of the hardware, computations with similar execution behaviors or similar memory access behaviors are encapsulated in the same calculation task so that they can be handled effectively; each calculation task is kept as short and single-purpose as possible, and depending on the circumstances, calculation tasks can be executed in parallel or serially. Inside a calculation task, computation proceeds in a data-parallel manner, while between calculation tasks computation proceeds asynchronously in a task-parallel manner. Each calculation task is provided with a state, in order to handle the execution order between calculation tasks that may have dependence relations.
According to the characteristics of the calculation tasks in the ray tracing algorithm, six calculation tasks were created in the application program's execution pipeline, carrying out ray generation, acceleration structure traversal, primitive intersection, shading, shadowing and other calculation tasks, while ray sorting and ray packet creation are performed in the data task scheduling pipeline. These tasks all have good parallel execution capability, i.e. a wide SIMD execution width, but the recursive nature of rays is very likely to make the SIMD utilisation drop sharply as recursion proceeds. In addition, a deferred-computation technique is used in the implementation to further improve SIMD utilisation: if the shading task cannot produce enough rays to form a complete ray packet after its computation, the intersection computation is postponed until a complete ray packet has been formed; likewise, if the intersection calculation task cannot produce sufficient rays for shading, the shading computation is also postponed.
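The deferred-computation technique described above can be sketched as follows (the names are hypothetical and PACKET_SIZE stands for an assumed SIMD execution width): rays produced by one task are buffered, and the consuming task is only scheduled once a complete ray packet has accumulated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ray record and packet buffer.  The shading task appends rays
// one by one; the intersection task is launched only once a full packet of
// PACKET_SIZE rays has accumulated, which keeps the SIMD lanes busy.
struct Ray { float origin[3]; float dir[3]; };

constexpr std::size_t PACKET_SIZE = 32;   // assumed SIMD execution width

class RayPacketBuffer {
public:
    // Returns true when the buffer has just become a complete packet and the
    // consumer (e.g. the intersection task) should be scheduled; otherwise
    // the consuming computation is postponed.
    bool add(const Ray& r) {
        pending_.push_back(r);
        return pending_.size() >= PACKET_SIZE;
    }

    // Hand the completed packet to the consuming task and start refilling.
    std::vector<Ray> take() {
        std::vector<Ray> full;
        full.swap(pending_);
        return full;
    }

private:
    std::vector<Ray> pending_;
};
```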
The data is divided into static data and dynamic data, where static data means data that does not change during execution of the application program, and dynamic data means the constantly changing new data produced during execution. A storage pool is set up at initialisation; according to the concrete application program, a certain space in video memory is allocated for static data, and the remaining space is occupied by dynamic data. All of this information is recorded in a configuration file.
The static data required by some application programs may exceed the size of the video memory, so static data may have to be scheduled dynamically during execution, and the amount of data imported each time is not necessarily the same as the previous time, so fragments may appear between the static data region and the dynamic data region. To avoid fragmentation and use the video memory effectively, a two-way allocation method can be adopted in video memory: static data is stored from the low-address end of the storage pool, and dynamic data is stored from the high-address end of the storage pool.
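A minimal sketch of this two-way allocation, assuming the pool is a single pre-allocated block of video memory and ignoring alignment, is given below (StoragePool and its method names are hypothetical):

```cpp
#include <cstddef>
#include <new>

// Two-ended bump allocator over one pre-allocated pool: static data grows
// upward from the low-address end, dynamic data grows downward from the
// high-address end, so the two regions never interleave and leave fragments.
class StoragePool {
public:
    StoragePool(void* base, std::size_t bytes)
        : base_(static_cast<char*>(base)), size_(bytes), low_(0), high_(bytes) {}

    void* allocStatic(std::size_t n) {   // low-address end
        if (low_ + n > high_) throw std::bad_alloc();
        void* p = base_ + low_;
        low_ += n;
        return p;
    }

    void* allocDynamic(std::size_t n) {  // high-address end
        if (high_ - low_ < n) throw std::bad_alloc();
        high_ -= n;
        return base_ + high_;
    }

    void resetDynamic() { high_ = size_; }  // reclaim all dynamic allocations

private:
    char*       base_;
    std::size_t size_;
    std::size_t low_;   // next free byte for static data
    std::size_t high_;  // one past the last free byte for dynamic data
};
```

The base pointer would be the pool obtained from the device memory allocator (for example a single cudaMalloc'd block); the sketch itself is device-agnostic.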
In order to handle the uncertain, dynamically irregular execution behaviors that a complicated application program may exhibit during execution, a dynamically expandable graph topological structure is used as the execution pipeline of the application program. Its core idea is: the calculation tasks serve as the nodes of the graph structure; according to the dynamic execution behaviors that may occur at run time, the graph topological structure between calculation tasks is established in advance in the configuration file before the program runs, while the parts with determinable, regular execution behaviors are fixed at static compile time.
According to the flow direction of the data generated when the tasks execute, and the location (the data queue in which the data resides) of the data that needs to be called, the graph topological structure is established and the application program is executed according to it. While the application program executes, the tasks communicate and exchange data with one another through their corresponding data queues.
As stated above, a graph structure is used to express the execution relations between these calculation tasks; each calculation task has a dedicated data queue for its own use, and the enqueue primitive provided by the method is used to continually reorganise the data in the data queues. Because much of the computation in a ray tracing algorithm is spent on ray traversal and intersection operations, in order to reduce invalid computation as far as possible, the corresponding graph topological structure is established according to the structure of the acceleration structure, so that computation can remain local to a particular node and need not start again from the root node every time. When a calculation task with dynamic execution behaviors encounters recursive behavior, the data can be transferred, along the flow direction of the data in the graph topological structure, to a calculation task that can carry out this recursive behavior, and the current calculation task is terminated at the same time; this avoids the recursion, lets the threads executing in parallel finish together with the same instruction, avoids execution branches, and therefore uses the parallel computation resources of the hardware effectively.
According to the way they process data, the tasks are divided into calculation-type tasks and logic-judgement-type tasks; calculation-type tasks run on the GPU and logic-judgement-type tasks run on the CPU. Each calculation task has a corresponding queue holding the data waiting to be processed by that task. Data is stored in the queue in the form of packets, and the packet size is chosen according to the SIMD execution width of the multiprocessor. The data queues are used for asynchronous communication and dynamically collect the data needed during execution of the calculation task, so as to satisfy the local-similarity execution characteristic of the hardware. Although some recursive execution problems can be solved by unrolling loops, the recursion depth of many irregular algorithms cannot be determined in advance, whereas the cyclic property of the graph allows low-overhead feedback between calculation tasks an arbitrary number of times without advance determination, and the queues handle well the large amounts of irregularly stored data produced dynamically while the calculation tasks cycle. A calculation task can communicate and exchange data with other calculation tasks only through its corresponding data queue.
Each execution of a calculation task may consume some data and may also produce a large amount of new data, causing irregular changes in the elements of the data queue corresponding to the calculation task; in addition, some data may be shared and reused among several calculation tasks. Therefore, in order to satisfy the local-similarity principle and the SIMD operation characteristic and to concentrate computation on locally clustered data as far as possible, it is necessary to reorganise, repack and regroup these data: data flows into the data queue of the corresponding calculation task according to the topological structure of the graph, and new packets are produced in the data queue to wait for the calculation task to process them, so that computation can continue to be carried out effectively on the hardware. Three programming primitives (fetch, commit and enqueue) are provided, through which the calculation tasks produce or consume packets in the corresponding data queues conveniently and efficiently in an asynchronous manner, controlling the dynamic flow of the data and enhancing the operability of programming. The fetch operation reserves a contiguous region of packets in a specified data queue, indicating that these packets are about to be processed; the commit operation indicates that these packets have been processed and other calculation tasks may use them; and the enqueue operation organises dynamically generated new data elements into the form of packets.
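The three primitives (rendered above as fetch, commit and enqueue; the names and types below are illustrative, not the patent's API) can be sketched for a single queue as follows, assuming packets become reusable by other tasks once committed:

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

// Single-queue view of the three programming primitives.  "fetch" reserves a
// contiguous run of packets that are about to be processed, "commit" marks
// previously fetched packets as processed so other tasks may reuse them, and
// "enqueue" packs newly produced elements into fresh packets in the queue.
template <typename Packet>
class TaskQueue {
public:
    // fetch: reserve up to maxCount consecutive packets for processing.
    std::vector<Packet> fetch(std::size_t maxCount) {
        std::lock_guard<std::mutex> lock(m_);
        std::size_t n = std::min(maxCount, ready_.size());
        std::vector<Packet> out(ready_.end() - n, ready_.end());
        ready_.resize(ready_.size() - n);
        inFlight_ += n;
        return out;
    }

    // commit: declare that n previously fetched packets have been processed.
    void commit(std::size_t n) {
        std::lock_guard<std::mutex> lock(m_);
        inFlight_ -= std::min(n, inFlight_);
    }

    // enqueue: organise newly generated data into a packet waiting in the queue.
    void enqueue(Packet p) {
        std::lock_guard<std::mutex> lock(m_);
        ready_.push_back(std::move(p));
    }

private:
    std::mutex          m_;
    std::vector<Packet> ready_;     // packets waiting to be fetched
    std::size_t         inFlight_ = 0; // packets fetched but not yet committed
};
```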
Since a single calculation task cannot access all the static data in the storage pool, in order to reduce expensive data movement, the packets in a data queue are subdivided into several sub-queues according to the range of static data they require. In addition, other classification principles can be used for further subdivision according to the characteristics of different applications, so that SIMD operations and local memory accesses are carried out more effectively. Some application programs (for example, some graphics algorithms) need to process packets in order; therefore a mark is placed in each data queue, and packets that require sequential processing are handled in first-in-first-out (FIFO) order, while other packets are handled out of order. In summary, by using queues, the data required or produced by a calculation task can be organised dynamically according to the similarity principle, where similarity specifically means storage similarity, structural similarity and processing similarity; as a result, a calculation task can obtain data as needed and expensive data transfer operations are reduced, which is particularly suitable for application programs that need to access data frequently or access massive (out-of-core) data.
Test scenes of different geometric complexity (Bunny, Fairy and BART Kitchen) were selected as test model files, with a rendering resolution of 1024x1024. The graph topological structure and the corresponding data queue mechanism can easily express the dynamically irregular behavior in the ray tracing algorithm, continually collecting similar data and organising local data into the corresponding data queues, ensuring that the calculation tasks always run efficiently within the graph topological structure. The scenes were tested with the method of the invention and with the CUDA programming model, respectively; the results are shown in Table 1, from which it can be seen that the method of the invention achieves better performance than CUDA.
Table 1
              SIMD      CUDA      This model
Bunny         0.458 s   0.097 s   0.041 s
Fairy         1.317 s   0.128 s   0.085 s
BART Kitchen  0.727 s   0.103 s   0.065 s
Comparison of acceleration structure construction time (in seconds) using CPU SIMD instructions, CUDA and the present method, respectively.
To further illustrate that the method of the invention can effectively handle the dynamically irregular behavior in the algorithm, Table 1 compares the acceleration structure construction time obtained with CPU SIMD instructions, with CUDA and with this method. A KD-tree is used as the acceleration structure; it can be seen that the method of the invention obtains nearly a 10-fold performance improvement over the SIMD method and also reduces the time by nearly half compared with CUDA.
As stated above, in order to guarantee the effective utilisation of the hardware's computational resources, the method of the invention makes maximum reasonable use of the computational and storage resources and controls data processing through data reorganisation according to the similarity principle. This method improves the performance of the algorithm significantly, but it also introduces a certain overhead. To assess whether this overhead is worthwhile, the computation cost and the management cost required when rendering the scene Fairy using CUDA and using this model, respectively, were measured (in the figure, the left-hand value of each bar represents the computation cost and the right-hand value represents the management cost). As shown in Fig. 1, although the management overhead introduced by the method of the invention clearly exceeds the management cost produced by CUDA, from the point of view of overall cost it is precisely because the method manages the complicated data and tasks that the effective computation rate of the hardware is higher and the computation cost drops markedly, so the overall time tends to decrease.

Claims (4)

1. A data parallel processing method based on a graph topological structure, characterised in that it is carried out in a computer having a GPU and a CPU processor:
(1) the application program that performs data processing is divided into a number of execution behaviors;
(2) according to the basic operation type of the execution behaviors on the data, all execution behaviors are divided into calculation-type tasks and logic-judgement-type tasks, the calculation-type tasks running on the GPU and the logic-judgement-type tasks running on the CPU;
(3) the data to be processed by the application program is divided into static data and dynamic data; a storage space is allocated in the video memory of the computer that executes the application program, and within this storage space separate storage regions are allocated for static data and dynamic data, respectively;
(4) a queue is established for each task of step (2), and the data to be processed by each task is stored in the corresponding data queue in the form of data packets;
(5) according to the flow direction of the data generated when the tasks execute, and the location of the data that needs to be called, the graph topological structure is established, and the application program is executed according to the graph topological structure; while the application program executes, the tasks communicate and exchange data with one another through their corresponding data queues.
2. The data parallel processing method according to claim 1, characterised in that each execution behavior performs at least one basic operation or one calculation operation on data.
3. The data parallel processing method according to claim 1, characterised in that said static data is data that does not change during the execution of the application program, and said dynamic data is data produced during the execution of the application program.
4. The data parallel processing method according to claim 1, characterised in that the data queue described in step (4) is further divided into several sub-queues, the static data corresponding to the same sub-queue being adjacent in position in the storage space.
CN201010150568A 2010-04-19 2010-04-19 Data parallel processing method based on graph topological structure Expired - Fee Related CN101840329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010150568A CN101840329B (en) 2010-04-19 2010-04-19 Data parallel processing method based on graph topological structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010150568A CN101840329B (en) 2010-04-19 2010-04-19 Data parallel processing method based on graph topological structure

Publications (2)

Publication Number Publication Date
CN101840329A CN101840329A (en) 2010-09-22
CN101840329B (en) 2012-10-03

Family

ID=42743716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010150568A Expired - Fee Related CN101840329B (en) 2010-04-19 2010-04-19 Data parallel processing method based on graph topological structure

Country Status (1)

Country Link
CN (1) CN101840329B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193826B (en) * 2011-05-24 2012-12-19 哈尔滨工程大学 Method for high-efficiency task scheduling of heterogeneous multi-core processor
CN103440173B (en) * 2013-08-23 2016-09-21 华为技术有限公司 The dispatching method of a kind of polycaryon processor and relevant apparatus
CN105787771A (en) * 2014-12-16 2016-07-20 航天信息股份有限公司 Method and system for improving stability of data interaction in network invoice system
CN107239334B (en) 2017-05-31 2019-03-12 清华大学无锡应用技术研究院 Handle the method and device irregularly applied
CN109634748A (en) * 2018-12-12 2019-04-16 深圳前海微众银行股份有限公司 Cluster resource dispatching method, device, equipment and computer readable storage medium
CN109710314B (en) * 2018-12-20 2019-11-12 四川新网银行股份有限公司 A method of based on graph structure distributed parallel mode construction figure

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101354780A (en) * 2007-07-26 2009-01-28 LG Electronics Inc. Graphic data processing apparatus and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8276164B2 (en) * 2007-05-03 2012-09-25 Apple Inc. Data parallel computing on multiple processors

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101354780A (en) * 2007-07-26 2009-01-28 LG Electronics Inc. Graphic data processing apparatus and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨鑫 et al. "Realtime Ray Tracing on a Hibrid Parallel Architecture." 2009 First International Workshop on Education Technology and Computer Science, 2009, vol. 3, pp. 894-898. *

Also Published As

Publication number Publication date
CN101840329A (en) 2010-09-22

Similar Documents

Publication Publication Date Title
CN101840329B (en) Data parallel processing method based on graph topological structure
CN101833438A (en) General data processing method based on multiple parallel
CN102902512B (en) A kind of multi-threading parallel process method based on multi-thread programming and message queue
CN103336718A (en) GPU thread scheduling optimization method
CN107247628B (en) Data flow program task dividing and scheduling method for multi-core system
CN101387952A (en) Single-chip multi-processor task scheduling and managing method
CN102193830A (en) Many-core environment-oriented division mapping/reduction parallel programming model
CN104850866A (en) SoC-FPGA-based self-reconstruction K-means cluster technology realization method
CN110032450B (en) Large-scale deep learning method and system based on solid-state disk extended memory
CN110297661A (en) Parallel computing method, system and medium based on AMP framework DSP operating system
CN101655783B (en) Forward-looking multithreading partitioning method
Xie et al. CuMF_SGD: Fast and scalable matrix factorization
CN114968374A (en) Multilayer cycle process level and thread level collaborative automatic optimization method based on new generation Shenwei super computer
Wang et al. Infinity Stream: enabling transparent and automated in-memory computing
Moustafa et al. 3D cartesian transport sweep for massively parallel architectures with PARSEC
Wu et al. An Improved Dijkstra’s algorithm application to multi-core processors
CN102023846B (en) Shared front-end assembly line structure based on monolithic multiprocessor system
Xu A petri net-based hybrid heuristic scheduling algorithm for flexible manufacturing system
CN113723931B (en) Workflow modeling method suitable for multi-scale high-flux material calculation
Wang et al. A study of hybrid parallel genetic algorithm model
CN105045646B (en) A kind of cluster structured partial predicate is realized and compiling optimization method
CN102902511A (en) Parallel information processing system
Ni et al. Agglomerative memory and thread scheduling for high-performance ray-tracing on GPUs
Wu Research on the Development and Application of Parallel Programming Technology in Heterogeneous Systems
Ismail et al. A parallel and concurrent implementation of Lin-Kernighan heuristic (LKH-2) for solving traveling salesman problem for multi-core processors using SPC 3 programming model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121003

Termination date: 20150419

EXPY Termination of patent right or utility model