CN105700933A

CN105700933A - Parallelization and loop optimization method and system for a high-level language of a reconfigurable processor

Info

Publication number: CN105700933A
Application number: CN201610018726.7A
Authority: CN
Inventors: 田***; 绳伟光; 何卫锋
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2016-01-12
Filing date: 2016-01-12
Publication date: 2016-06-22

Abstract

The present invention provides a parallelization and loop optimization method and system for a high-level language of a reconfigurable processor, and proposes an end-to-end language conversion system for a general-purpose reconfigurable processor. On a reconfigurable processor, the core loops of a compute-intensive application must be computed in parallel on the reconfigurable part, and plain C cannot express this parallelism. The serial and parallel parts of the application therefore have to be encapsulated separately and optimized according to the characteristics of the system, finally producing code in a new language. The input and output data types and lengths of a kernel function are determined from a programmer-written decls.h file, which reduces the complexity of the system and greatly improves its applicability. Loop optimization is performed with a polyhedral model, which broadens the applicability of the system further and makes it easier to port to different architectures.

Description

Parallelization and loop optimization method and system for a high-level language of a reconfigurable processor

Technical field

The present invention relates to a method and system for automatic parallelization and loop optimization of a high-level language for a general-purpose reconfigurable processor.

Background technology

As the number of operands and operators computed on a single chip keeps growing, parallel multi-core processor architectures have become mainstream. Conventional processor computing modes fall into two categories. Traditional general-purpose computing on a von Neumann processor is extremely flexible, but its instruction-stream-driven execution, limited arithmetic units and limited memory bandwidth make its overall performance and power consumption unsatisfactory. Dedicated computing can optimize structure and circuits for a specific application and needs no instruction set, so it executes quickly with low power consumption. Dedicated computing systems have a fatal flaw, however: their flexibility and scalability are very poor, and increasingly complex applications often cannot be supported by simple extension. Reconfigurable computing emerged against this background as a computing mode that combines the flexibility of software with the efficiency of hardware.

A reconfigurable processor comprises a general-purpose processor and several reconfigurable computing units. The general-purpose processor controls the computation of the reconfigurable computing units, schedules parallel tasks and executes serial tasks. The reconfigurable computing units reconfigure themselves according to configuration information and perform computation. A reconfigurable architecture can thus adapt to the computation of different algorithms and applications. Although many reconfigurable computing architectures now exist, plain C cannot meet the needs of parallel computation, and how to program efficiently so as to exploit the efficiency of the PEA as fully as possible has become an evident difficulty and challenge.

A reconfigurable architecture needs a matching task compiler and a parallel extension of C. CUDA lets developers use graphics processing units (GPUs) for special-purpose programming, but programming with it is cumbersome: programmers must handle memory allocation and loop optimization themselves. OpenMP supports multi-core programming in C, C++ and Fortran. Although it adds many parallelization directives to a program, it can only optimize and compute with multiple threads. Neither of these parallel languages can schedule tasks on a reconfigurable architecture.

Summary of the invention

The object of the invention is to provide a parallelization and loop optimization method and system for a high-level language of a reconfigurable processor that can automatically convert C into GR-C and perform loop optimization targeted at a specific reconfigurable architecture, thereby improving the performance of the reconfigurable architecture.

To solve the above problem, the present invention provides a parallelization and loop optimization method for a high-level language of a reconfigurable processor, comprising:

obtaining a decls.h file, generating input and output intermediate files from the decls.h file, and obtaining the parameter information of the task function and the pea function from the input and output intermediate files;

extracting the kernel function part from C code and optimizing the extracted kernel function part with a polyhedral model to generate GR-C code for the kernel function part;

writing the GR-C code for the kernel function part back into the kernel function part of the C code according to the parameter information, to generate the final GR-C code.

Further, in the above method, extracting the kernel function part from C code and optimizing the extracted kernel function part with a polyhedral model to generate GR-C code for the kernel function part comprises:

performing static dependence analysis on the input C code and converting the result of the static dependence analysis into a polyhedral model;

optimizing the polyhedral model to obtain a parallelized polyhedral model;

generating GR-C code for the kernel function part from the parallelized polyhedral model.

Further, in the above method, generating GR-C code for the kernel function part from the parallelized polyhedral model comprises:

using the CLooG tool to generate GR-C code for the kernel function part from the parallelized polyhedral model.

Further, in the above method, the static dependence analysis of the input C code comprises:

using the LooPo framework to perform code scanning and dependence analysis.

Further, in the above method, the step of optimizing the polyhedral model to obtain a parallelized polyhedral model comprises:

rewriting the PLUTO model to obtain an affine transformation framework;

using PipLib as the ILP solver;

optimizing the polyhedral model with the framework and the ILP solver to obtain the parallelized polyhedral model.

Further, in the above method, in the step of optimizing the polyhedral model, the extracted kernel function part is optimized in the following loop optimization order:

first, all statements are separated and grouped according to their dependences and loop bounds;

then, loop fusion is applied to each group, fusing suitable statements together;

next, the loop unrolling parameter is determined from the execution cycles required by the operations of each group;

finally, the loops are unrolled with the computed parameter to obtain a suitable loop length and representation.
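
The fusion and unrolling steps of this order can be illustrated with a small Python sketch. It is not the patented optimizer: the two statements, the array names and the unroll factor of 4 are assumptions chosen for illustration, and the text does not give a formula for the unrolling parameter. The sketch only checks that fusing two loops with identical bounds and no cross-iteration dependence, and then unrolling the fused loop, leaves the results unchanged.

```python
def original(a, b):
    """Two separate loops with identical bounds, one statement each."""
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):      # statement S1
        c[i] = a[i] + b[i]
    for i in range(n):      # statement S2 reads only c[i] of the same iteration
        d[i] = c[i] * 2
    return c, d


def fused(a, b):
    """S1 and S2 share bounds and S2 depends on S1 only at the same i, so they fuse."""
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
        d[i] = c[i] * 2
    return c, d


def fused_unrolled_by_4(a, b):
    """Fused loop unrolled by an assumed factor of 4, with a remainder loop."""
    n = len(a)
    c = [0] * n
    d = [0] * n
    main_end = n - n % 4
    for i in range(0, main_end, 4):
        c[i] = a[i] + b[i]
        d[i] = c[i] * 2
        c[i + 1] = a[i + 1] + b[i + 1]
        d[i + 1] = c[i + 1] * 2
        c[i + 2] = a[i + 2] + b[i + 2]
        d[i + 2] = c[i + 2] * 2
        c[i + 3] = a[i + 3] + b[i + 3]
        d[i + 3] = c[i + 3] * 2
    for i in range(main_end, n):  # leftover iterations when n is not a multiple of 4
        c[i] = a[i] + b[i]
        d[i] = c[i] * 2
    return c, d
```

On a processing element array, the four copies of the loop body are the kind of independent work that could be mapped to different processing elements, which is why the unroll factor would be tied to the cycle cost of each group.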

According to another aspect of the present invention, a parallelization and loop optimization system for a high-level language of a reconfigurable processor is provided, comprising:

an acquisition module, configured to obtain a decls.h file, generate input and output intermediate files from the decls.h file, and obtain the parameter information of the task function and the pea function from the input and output intermediate files;

an optimization module, configured to extract the kernel function part from C code and optimize the extracted kernel function part with a polyhedral model to generate GR-C code for the kernel function part;

a generation module, configured to write the GR-C code for the kernel function part back into the kernel function part of the C code according to the parameter information, to generate the final GR-C code.

Further, in the above system, the optimization module comprises:

a conversion unit, configured to perform static dependence analysis on the input C code and convert the result of the static dependence analysis into a polyhedral model;

an optimization unit, configured to optimize the polyhedral model to obtain a parallelized polyhedral model;

a generation unit, configured to generate GR-C code for the kernel function part from the parallelized polyhedral model.

Further, in the above system, the generation unit is configured to use the CLooG tool to generate GR-C code for the kernel function part from the parallelized polyhedral model.

Further, in the above system, the conversion unit is configured to use the LooPo framework to perform code scanning and dependence analysis.

Further, in the above system, the optimization unit is configured to rewrite the PLUTO model to obtain an affine transformation framework; use PipLib as the ILP solver; and optimize the polyhedral model with the framework and the ILP solver to obtain the parallelized polyhedral model.

Further, in the above system, the optimization unit optimizes the extracted kernel function part in the following loop optimization order:

Compared with the prior art, the present invention proposes an end-to-end language conversion system for a general-purpose reconfigurable processor. On a reconfigurable processor, the core loops of a compute-intensive application must be computed in parallel on the reconfigurable part, and plain C cannot express this parallelism. With the system of the invention, the serial and parallel parts of the application are encapsulated separately and optimized according to the characteristics of the system, finally producing code in a new language with wide applicability. In addition, the data types and lengths of the inputs and outputs of a kernel function are determined by having the programmer write a decls.h file, which greatly reduces the complexity of the system and greatly improves its applicability. Furthermore, loop optimization is handled with a polyhedral model, which also broadens the applicability of the system: whatever the reconfigurable architecture, only the order of the loop transformations in the loop optimization part needs to change to obtain a general solution, so porting the system to a different architecture is simpler.

Brief description of the drawings

Fig. 1 is a flow chart of the parallelization and loop optimization method for a high-level language of a reconfigurable processor according to an embodiment of the invention;

Fig. 2 is a flow chart of the parallelization and loop optimization method for a high-level language of a reconfigurable processor according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the original C code kernel according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the function declarations and definitions of the GR-C code according to an embodiment of the invention;

Fig. 5 is a schematic diagram of the kernel part of the main function of the GR-C code according to an embodiment of the invention.

Detailed description of the invention

To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

Embodiment one

As shown in Fig. 1, the present invention provides a parallelization and loop optimization method for a high-level language of a reconfigurable processor, comprising:

Step S1: obtain a decls.h file, generate input and output intermediate files from the decls.h file, and obtain the parameter information of the task function and the pea function from the input and output intermediate files. Specifically, in a GR-C application, a task function (__GR-C_task void kernel_name()) represents the kernel function, and the core loop code inside it is represented by a pea function (__GR-C_pea void kernel_name_PEA()). The task function produces the executable code for the ARM7 and is responsible for moving data and scheduling the PEA; the pea function produces the reconfigurable configuration information that the PEA needs. The task function contains the declarations and definitions of the variables it uses, the data to be moved between main memory and PEA memory, and the addresses of that data in main memory and PEA memory. Within the task function, __gr_MemcpyStoG() and __gr_MemcpyGtoS() implement the data movement. The programmer marks the kernel position with #pragma scop and #pragma endscop and also supplies a decls.h file, which contains the names and lengths of the variables whose data must be copied between main memory and PEA memory. The format of the decls.h file is as follows:

#pragma array_decls_in/out kernel_number

type variable_name[length] memory_length

type variable_name[length][length] memory_length

…

#pragma end_array_decls_in/out kernel_number

Once the data lengths are known, the system allocates storage addresses in PEA memory, generally starting from the first address of the memory and reserving addresses for the computation result data.
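
As a rough illustration of how such a file could be read and turned into a memory layout, the Python sketch below parses the format and assigns consecutive offsets. Several things are assumptions rather than facts from the text: tokens are taken to be separated by whitespace, a type is a single word, memory_length is a decimal integer, inputs are placed before outputs, and no alignment is applied. The names `parse_decls`, `allocate_pea_addresses` and `SAMPLE` are hypothetical and do not belong to the GR-C toolchain.

```python
import re

# One declaration line: "<type> <name>[len]...[len] <memory_length>" (assumed spacing).
DECL_RE = re.compile(r"^(\w+)\s+(\w+)((?:\[\d+\])+)\s+(\d+)$")

# Hypothetical decls.h content for kernel 1.
SAMPLE = """\
#pragma array_decls_in 1
int a[16] 64
int m[4][4] 64
#pragma end_array_decls_in 1
#pragma array_decls_out 1
int r[16] 64
#pragma end_array_decls_out 1
"""


def parse_decls(text):
    """Map (direction, kernel_number) to the list of arrays declared in that section."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#pragma"):
            parts = line.split()
            if len(parts) != 3:
                raise ValueError(f"malformed pragma: {line!r}")
            tag, kernel = parts[1], int(parts[2])
            if tag in ("array_decls_in", "array_decls_out"):
                current = (tag.rsplit("_", 1)[1], kernel)  # "in" or "out"
                sections[current] = []
            elif tag in ("end_array_decls_in", "end_array_decls_out"):
                current = None
            else:
                raise ValueError(f"unknown pragma: {line!r}")
            continue
        match = DECL_RE.match(line)
        if match is None or current is None:
            raise ValueError(f"unexpected line: {line!r}")
        ctype, name, dims, mem = match.groups()
        sections[current].append({
            "type": ctype,
            "name": name,
            "dims": [int(d) for d in re.findall(r"\[(\d+)\]", dims)],
            "memory_length": int(mem),
        })
    return sections


def allocate_pea_addresses(sections, kernel, base=0):
    """Assign consecutive offsets from `base`: inputs first, then reserved output space."""
    layout = {}
    addr = base
    for direction in ("in", "out"):
        for decl in sections.get((direction, kernel), []):
            layout[decl["name"]] = addr
            addr += decl["memory_length"]
    return layout, addr
```

Calling `allocate_pea_addresses(parse_decls(SAMPLE), 1)` places `a` and `m` at the start of the memory and reserves the following region for the result array `r`.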

Step S2: extract the kernel function part from the C code and optimize the extracted kernel function part with a polyhedral model to generate GR-C code for the kernel function part. In detail, a compute-intensive application always has some hotspots that take up most of the program's running time, especially nested loops; we define these hotspots as kernel functions. Running a compute-intensive application on a reconfigurable architecture means computing the kernel function parts in parallel on the reconfigurable array and computing the non-kernel parts on the main processor. The core of the present invention is therefore to wrap the kernel function parts into a form that the compiler can recognize.

Step S3: write the GR-C code for the kernel function part back into the kernel function part of the C code according to the parameter information, to generate the final GR-C code. This embodiment converts a C program into a GR-C program in a well-defined way, and the same effect can be achieved for other parallel languages according to their characteristics. For complex programs with multiple kernels, the -N parameter can be added to process them repeatedly. The invention proposes an end-to-end language conversion system for a general-purpose reconfigurable processor. On a reconfigurable processor, the core loops of a compute-intensive application must be computed in parallel on the reconfigurable part, and plain C cannot express this parallelism. With the system of the invention, the serial and parallel parts of the application are encapsulated separately and optimized according to the characteristics of the system, finally producing code in a new language with wide applicability. In addition, the data types and lengths of the inputs and outputs of a kernel function are determined by having the programmer write a decls.h file, which greatly reduces the complexity of the system and greatly improves its applicability. Furthermore, loop optimization is handled with a polyhedral model, which also broadens the applicability of the system: whatever the reconfigurable architecture, only the order of the loop transformations in the loop optimization part needs to change to obtain a general solution, so porting the system to a different architecture is simpler.

Preferably, as shown in Fig. 2, step S2, extracting the kernel function part from the C code and optimizing the extracted kernel function part with a polyhedral model to generate GR-C code for the kernel function part, comprises:

Step S21: perform static dependence analysis on the input C code and convert the result of the static dependence analysis into a polyhedral model;

Step S22: optimize the polyhedral model to obtain a parallelized polyhedral model;

Step S23: generate GR-C code for the kernel function part from the parallelized polyhedral model.

Preferably, step S23, generating GR-C code for the kernel function part from the parallelized polyhedral model, comprises:

using the CLooG tool to generate GR-C code for the kernel function part from the parallelized polyhedral model. Here, CLooG serves as the basis of the code generator.

Preferably, the static dependence analysis of the input C code comprises:

using the LooPo framework to perform code scanning and dependence analysis. LooPo is a source-to-source compiler based on the polyhedral model with several built-in polyhedral analyses; since the first step of the invention is to convert C code into a polyhedral model, this tool is clearly a very good fit.
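
To show what a dependence analysis has to establish, the Python sketch below finds loop-carried flow dependences in a single loop by brute force. It is only an illustration under stated assumptions: it enumerates iterations at run time rather than reasoning statically as a polyhedral tool does, it handles one array and one statement, and the function name `loop_carried_dependences` is hypothetical and unrelated to LooPo's interface.

```python
def loop_carried_dependences(n, write_index, read_indices):
    """Return the sorted flow-dependence distances of the loop

        for i in range(n): A[write_index(i)] = f(A[r(i)] for r in read_indices)

    A distance d means iteration i reads a cell written by iteration i - d.
    """
    last_writer = {}   # array cell -> iteration that last wrote it
    distances = set()
    for i in range(n):
        # Reads happen before the write of the same iteration.
        for read in read_indices:
            cell = read(i)
            if cell in last_writer:
                distances.add(i - last_writer[cell])
        last_writer[write_index(i)] = i
    return sorted(distances)


# A[i] = A[i - 1] + 1 carries a dependence of distance 1 between iterations.
recurrence = loop_carried_dependences(8, lambda i: i, [lambda i: i - 1])

# A[i] = A[i] * 2 touches only its own cell, so iterations are independent.
independent = loop_carried_dependences(8, lambda i: i, [lambda i: i])
```

An empty result, as for the second loop, is the situation in which iterations can be reordered or run in parallel; a non-empty result constrains the order in which the optimizer may schedule them.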

Preferably, in step S22, optimizing the polyhedral model to obtain a parallelized polyhedral model,

PipLib is used as the ILP solver;

the polyhedral model is optimized with the framework and the ILP solver to obtain the parallelized polyhedral model. Specifically, PLUTO is an automatic parallelization and locality optimization tool for multi-core systems whose core transformation is an affine transformation through loop tiling and fusion; the present invention rewrites its optimization process to meet the computing requirements of our reconfigurable system.

Preferably, in the step of optimizing the polyhedral model, the extracted kernel function part is optimized in the following loop optimization order:

Finally, the loops are unrolled with the computed parameter to obtain a suitable loop length and representation. Specifically, a pea function contains one or more SCoPs (static control parts), that is, maximal loop sequences that contain no while loops and whose loop bounds and control conditions are all affine functions of the loop variables. In the polyhedral model, this code is represented as a series of statements, each with its own loop bounds and operations; the dependences of each statement must be analyzed to find its optimal execution order.
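
The idea that a statement's loop bounds are affine can be made concrete with a small Python sketch. The triangular loop nest below is an example chosen for illustration and does not come from the patent; the constraint encoding and the function names are likewise assumptions. The sketch writes the iteration domain as a set of affine inequalities and checks that the integer points satisfying them are exactly the iterations the loop nest visits.

```python
from itertools import product

# Each row (ci, cj, cn, c0) encodes the affine inequality ci*i + cj*j + cn*N + c0 >= 0.
# Together the rows describe the domain { (i, j) : 0 <= i <= N - 1, 0 <= j <= i }.
DOMAIN = [
    (1, 0, 0, 0),    # i >= 0
    (-1, 0, 1, -1),  # i <= N - 1
    (0, 1, 0, 0),    # j >= 0
    (1, -1, 0, 0),   # j <= i
]


def in_domain(i, j, n):
    """True if the iteration (i, j) satisfies every affine constraint for N = n."""
    return all(ci * i + cj * j + cn * n + c0 >= 0 for ci, cj, cn, c0 in DOMAIN)


def domain_points(n):
    """Integer points of the polyhedron, scanned in lexicographic order."""
    return [(i, j) for i, j in product(range(n), repeat=2) if in_domain(i, j, n)]


def loop_nest_points(n):
    """Iterations visited by: for i in 0..N-1: for j in 0..i: S(i, j)."""
    points = []
    for i in range(n):
        for j in range(i + 1):
            points.append((i, j))
    return points
```

Because the domain is just a list of inequalities, a transformation such as reordering or fusing loops becomes an operation on these constraints, and a code generator can then scan the transformed polyhedron to emit new loops.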

The invention is illustrated below with a parallelizable program.

Fig. 3 shows the parallelizable loop part of a parallelizable program, marked by #pragma scop and #pragma endscop. The decls.h file shown in Fig. 4 is also given.

First, GR-C_pre processes decls.h and splits it into two files, .array_decls_in and .array_decls_out; GR-C_pre_help then preprocesses these files, extracting the names and lengths of the data to be operated on and writing them to two intermediate files, Arrays_in.cfg and Arrays_out.cfg;

Next, the C code is parsed for the kernel number given in the input parameters: following the kernel order, the kernel code section marked by #pragma scop and #pragma endscop is extracted, and the extracted content is converted into the data structures of the polyhedral model for analysis and optimization. In our example, the system finds that every statement in the loop depends on its context, so the statements cannot be reordered; no deep optimization is therefore performed, and the loop is rewritten in a form that the compiler handles easily and code is generated from it;

Finally, the final GR-C code is generated: the parameter information of the GR-C_task and GR-C_kernel functions is extracted from the two previously produced intermediate files Arrays_in.cfg and Arrays_out.cfg, and the inscop_GR-C tool writes the code generated for the kernel part into the original C code at the kernel code section marked by #pragma scop and #pragma endscop, producing the final GR-C code in Fig. 4. Fig. 5 shows the position in the main function originally marked by #pragma scop and #pragma endscop.
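
The extraction and write-back around the #pragma scop and #pragma endscop markers can be sketched in a few lines of Python. This is a simplified stand-in, not the inscop_GR-C tool: the function names, the sample source and the replacement call `kernel_1();` are hypothetical, and the sketch assumes each pragma sits on its own line and that regions are not nested.

```python
# Hypothetical input program with one marked kernel region.
SAMPLE_C = """\
int main(void) {
#pragma scop
    for (i = 0; i < 16; i++)
        c[i] = a[i] + b[i];
#pragma endscop
    return 0;
}
"""


def extract_scops(c_source):
    """Return the code between each '#pragma scop' / '#pragma endscop' pair, in order."""
    kernels = []
    current = None
    for line in c_source.splitlines():
        tokens = line.split()
        if tokens[:2] == ["#pragma", "scop"]:
            if current is not None:
                raise ValueError("nested #pragma scop")
            current = []
        elif tokens[:2] == ["#pragma", "endscop"]:
            if current is None:
                raise ValueError("#pragma endscop without #pragma scop")
            kernels.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        raise ValueError("unterminated #pragma scop")
    return kernels


def write_back(c_source, replacements):
    """Replace the k-th marked region, pragmas included, with replacements[k]."""
    out = []
    index = 0
    inside = False
    for line in c_source.splitlines():
        tokens = line.split()
        if tokens[:2] == ["#pragma", "scop"]:
            inside = True
            out.append(replacements[index])
            index += 1
        elif tokens[:2] == ["#pragma", "endscop"]:
            inside = False
        elif not inside:
            out.append(line)
    return "\n".join(out) + "\n"
```

With this sketch, `extract_scops(SAMPLE_C)[0]` is the loop handed to the optimizer, and `write_back(SAMPLE_C, ["    kernel_1();"])` yields a main function in which the marked region has been replaced by a call, mirroring the role that Fig. 5 describes.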

To verify the performance of a reconfigurable architecture using this automatic conversion system, EEMBC test cases were used; Table 1 shows the performance of the C code on the ATOM230 platform and of our automatically generated GR-C on the GReP platform. For the given parallelizable test cases, the invention improves performance by roughly 10 times. The automatically generated GR-C was also compared experimentally with hand-written GR-C, using the encoding procedure of a convolutional code. For the hand-written GR-C code, the reconfigurable architecture spends 286 cycles reading configuration information, 673 cycles reading data, 2041 cycles writing data and 4016 cycles on computation, of which computation on the ALUs takes 512 cycles. For the automatically generated GR-C code, the system divides the 16 PEs into two groups, and each group processes two data items every four cycles, so computation on the ALUs takes 1024 cycles, twice that of the manual configuration. We also ran other algorithms and found an average gap of 2 to 3 times.
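
The cycle figures quoted above can be checked with a few lines of arithmetic. The Python sketch below uses only the numbers given in the text; the variable names are illustrative, and the throughput line assumes that the two groups work concurrently, which the text implies but does not state outright.

```python
# Cycle counts reported for the hand-written GR-C version of the convolutional encoder.
manual_cycles = {
    "read_config": 286,
    "read_data": 673,
    "write_data": 2041,
    "compute": 4016,
}
manual_alu_cycles = 512   # part of "compute" spent on the ALUs
auto_alu_cycles = 1024    # ALU cycles of the automatically generated version

# Ratio between automatic and manual ALU time.
alu_ratio = auto_alu_cycles / manual_alu_cycles

# Automatic mapping: 2 groups, each handling 2 data items every 4 cycles
# (assumed to run concurrently).
groups = 2
items_per_group = 2
period_cycles = 4
items_per_cycle = groups * items_per_group / period_cycles

# Share of the manual compute phase that is actual ALU work.
manual_alu_share = manual_alu_cycles / manual_cycles["compute"]
```

The ratio of 2 matches the statement that the automatic version takes twice the ALU time of the manual configuration, and the small ALU share of the compute phase suggests that data movement and control, rather than arithmetic, dominate the hand-written version.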

Table 1: Performance comparison of GReP and Atom230

The present invention proposes GR-C, a parallel extension language based on C for general-purpose reconfigurable processors, together with a method and system for automatic parallelization and loop optimization that converts C into GR-C automatically; it is a source-to-source automatic conversion tool based on the polyhedral model. In existing work there are many feasible methods for static dependence analysis, but automatic parallelization and automatic language generation still face many problems. The invention proposes an automatic parallelization system for a general-purpose reconfigurable processor to solve them: with our system, C can be converted into GR-C automatically and loop optimization targeted at a specific reconfigurable architecture is performed, improving the performance of the reconfigurable architecture.

Embodiment two

The present invention also provides another parallelization and loop optimization system for a high-level language of a reconfigurable processor, comprising:

Preferably, the optimization module comprises:

Preferably, the generation unit is configured to use the CLooG tool to generate GR-C code for the kernel function part from the parallelized polyhedral model.

Preferably, the conversion unit is configured to use the LooPo framework to perform code scanning and dependence analysis.

Preferably, the optimization unit is configured to rewrite the PLUTO model to obtain an affine transformation framework; use PipLib as the ILP solver; and optimize the polyhedral model with the framework and the ILP solver to obtain the parallelized polyhedral model.

Preferably, the optimization unit optimizes the extracted kernel function part in the following loop optimization order:

For other details of embodiment two, refer to the corresponding parts of embodiment one; they are not repeated here.

In summary, the present invention proposes an end-to-end language conversion system for a general-purpose reconfigurable processor. On a reconfigurable processor, the core loops of a compute-intensive application must be computed in parallel on the reconfigurable part, and plain C cannot express this parallelism. With the system of the invention, the serial and parallel parts of the application are encapsulated separately and optimized according to the characteristics of the system, finally producing code in a new language with wide applicability. In addition, the data types and lengths of the inputs and outputs of a kernel function are determined by having the programmer write a decls.h file, which greatly reduces the complexity of the system and greatly improves its applicability. Furthermore, loop optimization is handled with a polyhedral model, which also broadens the applicability of the system: whatever the reconfigurable architecture, only the order of the loop transformations in the loop optimization part needs to change to obtain a general solution, so porting the system to a different architecture is simpler.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be understood by reference to one another.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the invention is intended to include them as well.

Claims

1. A parallelization and loop optimization method for a high-level language of a reconfigurable processor, characterized by comprising:

2. The parallelization and loop optimization method for a high-level language of a reconfigurable processor as claimed in claim 1, characterized in that extracting the kernel function part from C code and optimizing the extracted kernel function part with a polyhedral model to generate GR-C code for the kernel function part comprises:

3. The parallelization and loop optimization method for a high-level language of a reconfigurable processor as claimed in claim 2, characterized in that generating GR-C code for the kernel function part from the parallelized polyhedral model comprises:

4. The parallelization and loop optimization method for a high-level language of a reconfigurable processor as claimed in claim 2, characterized in that the static dependence analysis of the input C code comprises:

using the LooPo framework to perform code scanning and dependence analysis.

5. The parallelization and loop optimization method for a high-level language of a reconfigurable processor as claimed in claim 2, characterized in that, in the step of optimizing the polyhedral model to obtain a parallelized polyhedral model,

PipLib is used as the ILP solver;

6. The parallelization and loop optimization method for a high-level language of a reconfigurable processor as claimed in any one of claims 2 to 5, characterized in that, in the step of optimizing the polyhedral model, the extracted kernel function part is optimized in the following loop optimization order:

7. A parallelization and loop optimization system for a high-level language of a reconfigurable processor, characterized by comprising:

8. The parallelization and loop optimization system for a high-level language of a reconfigurable processor as claimed in claim 7, characterized in that the optimization module comprises:

9. The parallelization and loop optimization system for a high-level language of a reconfigurable processor as claimed in claim 7, characterized in that the generation unit is configured to use the CLooG tool to generate GR-C code for the kernel function part from the parallelized polyhedral model.

10. The parallelization and loop optimization system for a high-level language of a reconfigurable processor as claimed in claim 7, characterized in that the conversion unit is configured to use the LooPo framework to perform code scanning and dependence analysis.

11. The parallelization and loop optimization system for a high-level language of a reconfigurable processor as claimed in claim 7, characterized in that the optimization unit is configured to rewrite the PLUTO model to obtain an affine transformation framework; use PipLib as the ILP solver; and optimize the polyhedral model with the framework and the ILP solver to obtain the parallelized polyhedral model.

12. The parallelization and loop optimization system for a high-level language of a reconfigurable processor as claimed in any one of claims 7 to 11, characterized in that the optimization unit optimizes the extracted kernel function part in the following loop optimization order: