CN112527304B

CN112527304B - Self-adaptive node fusion compiling optimization method based on heterogeneous platform

Info

Publication number: CN112527304B
Application number: CN201910885756.1A
Authority: CN
Inventors: 王飞; 沈莉; 吴伟; 胡浩; 钱宏
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-19
Filing date: 2019-09-19
Publication date: 2022-10-04
Anticipated expiration: 2039-09-19
Also published as: CN112527304A

Abstract

The invention discloses a self-adaptive node fusion compiling optimization method based on a heterogeneous platform, comprising the following steps: S1, generating an intermediate representation; S2, identifying DAG fusion subgraphs; S3, applying node fusion strategies; S4, evaluating cost; S5, adaptively selecting a node fusion strategy, namely selecting the optimal node fusion strategy according to the costs of the k fusion strategies calculated in S4, combined with the usage of the registers, cache and memory of the target back end; S6, performing target-dependent node fusion, namely transferring, according to the node fusion strategy selected in S5, the control flow and data flow relations of the DAG subgraph matched in S23 to the fused DAG subgraph generated by that strategy, replacing the pre-fusion DAG subgraph with the fused DAG subgraph, and returning to S22; and S7, generating object code, namely the compiler compiles the DAG after lowering (degradation) is complete to generate heterogeneous platform code. The method provides accurate guidance for node fusion optimization on the heterogeneous platform, further exploits the potential of the platform's compound instructions, and improves the performance of the heterogeneous platform.

Description

Self-adaptive node fusion compiling optimization method based on heterogeneous platform

Technical Field

The invention relates to a self-adaptive node fusion compiling optimization method based on a heterogeneous platform, and belongs to the technical field of compiler optimization.

Background

Reduced instruction set computers (RISC) and complex instruction set computers (CISC) are the two architectures of current CPUs; they differ in design philosophy and method. Early CPUs all used complex instruction set architectures, which were designed to accomplish the required computing tasks with as few machine instructions as possible. For a long time, computer performance was improved mainly by increasing hardware complexity; a typical complex instruction set computer contains at least 300 instructions, and some contain more than 500. Although a complex instruction set computer can achieve a large performance improvement, in a typical program about 80% of the instructions used during computation come from only about 20% of the processor's instruction set, so there is a large imbalance between the instructions provided and their cost. Furthermore, although Very Large Scale Integration (VLSI) technology has reached a high level, it remains difficult to implement all the hardware of a complex instruction set computer on one chip, which also hinders the development of single-chip computers. A reduced instruction set contains only the frequently used instructions, plus the instructions necessary to support the operating system and high-level languages. Computers that use a reduced instruction set are simpler to manufacture and less expensive.

A compound instruction is a special instruction added on top of the basic reduced instruction set to improve program performance and increase instruction parallelism. The emergence of compound instructions can be said to mark the gradual convergence of reduced instruction set computers and complex instruction set computers. A common example is the multiply-add instruction, which uses a dedicated multiply-add unit to complete a multiply-add operation; in fields such as machine learning and scientific computing, the multiply-add instruction is used very frequently. The most common expression in neural networks, y = x × w + b, can be computed with multiply-add instructions, where x = [x₁, x₂, …, xₙ], w = [w₁, w₂, …, wₙ]ᵀ, and b is a constant. Other compound instructions accelerate other specific problems, so using compound instructions can further release the potential of the CPU and improve its performance. A compound instruction completes a complex function through dedicated hardware logic, which is more efficient than a software implementation. Such instructions are widely used to improve the execution efficiency of applications and achieve good acceleration.
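As a minimal illustration of why a multiply-add instruction fits this expression, the sketch below evaluates y = x × w + b as a chain of multiply-add steps, one per vector element. The helper name `multiply_add` is hypothetical and only stands in for a hardware compound instruction; the patent does not prescribe this code.

```python
def multiply_add(a, b, c):
    """Stand-in for a compound multiply-add instruction: returns a * b + c."""
    return a * b + c


def dense_output(x, w, b):
    """Compute y = x . w + b using one multiply-add step per element."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    acc = b  # start from the constant b so no separate final add is needed
    for xi, wi in zip(x, w):
        acc = multiply_add(xi, wi, acc)
    return acc


y = dense_output([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 0.5)
```

For n elements, the loop issues n multiply-add steps where separate instructions would need n multiplies and n adds.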

The node fusion optimization used by traditional compilers generates compound instructions mainly by calling built-in function interfaces in the source code, or by template matching on the intermediate representation. The built-in function approach is strongly tied to back-end instruction information, which to some extent limits target-independent node optimization, hinders the development of compiler optimization technology, and makes program development more complex for programmers. The template matching approach generates a compound instruction by matching a subgraph and replacing it with the corresponding compound instruction. This approach is simple and easy to implement, but it does not fully consider the influence of the instruction set, the data flow and control flow information, or the back-end characteristics on the compound instruction. As a result, the generated instruction sequence may fail to achieve the expected speedup and may even run slower, which greatly limits the benefit of the processor's compound instructions.

Disclosure of Invention

The invention aims to provide a self-adaptive node fusion compiling optimization method based on a heterogeneous platform that provides accurate guidance for node fusion optimization on the heterogeneous platform, further exploits the potential of the platform's compound instructions, and improves the performance of the heterogeneous platform.

To achieve this purpose, the invention adopts the following technical solution: a self-adaptive node fusion compiling optimization method based on a heterogeneous platform, comprising the following steps:

S1, the compiler compiles the source program to generate the compiler's intermediate representation, a DAG, and lowers the DAG; the following operations are performed on the DAG during the DAG lowering (degradation) stage:

S2, DAG fusion subgraph identification, which further comprises the following steps:

S21, topologically sorting the DAG to obtain a topological sequence, and adding the nodes of the DAG to a node fusion optimization work list in the order of the topological sequence;

S22, the compiler takes the first node from the work list generated in S21, deletes it from the work list, and checks the node's operation code, operand value types and result value type; if they are all legal, the node can undergo node fusion and S23 is performed; otherwise S22 is repeated until the work list is empty, and then the method goes to S7;

S23, taking the node extracted in S22 as the root node, using a graph matching algorithm with the DAG subgraph matching templates of the compiler back end to find all n DAG subgraphs that are rooted at that node and can undergo node fusion, and going to S24;

S24, if no DAG subgraph that can undergo node fusion was found in S23, going to S22; otherwise going to S3;

S3, node fusion strategies: the n DAG subgraphs found in S23 that can undergo node fusion correspond one to one to n node fusion strategies, and the k-th such DAG subgraph is fused according to the k-th node fusion strategy, where k = 1, 2, 3, 4, ...

S4, fusion strategy cost evaluation: according to the data references of all nodes of the fused DAG subgraph from S3 and the instruction set information of the heterogeneous platform, calculating the cost of running the instruction sequence into which the k-th fused DAG subgraph generated in S3 is converted, where the cost comprises the number of clock cycles spent, the number of registers used and the amount of memory occupied, and going to S5;

S5, adaptive selection of a node fusion strategy: according to the costs of the k fusion strategies calculated in S4, combined with the usage of the registers, cache and memory of the target back end, adaptively selecting the optimal node fusion strategy, namely the one that gives the greatest performance improvement on the target back end, and going to S6;

S6, target-dependent node fusion: according to the node fusion strategy selected in S5, transferring the control flow and data flow relations of the DAG subgraph matched in S23 to the fused DAG subgraph generated by that strategy, replacing the pre-fusion DAG subgraph with the fused DAG subgraph, and going to S22;

S7, object code generation: after DAG lowering is complete, the compiler compiles the DAG to generate heterogeneous platform code.
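The control flow of steps S21 to S6 can be sketched as a work-list loop. The code below is only an illustration under stated assumptions: the DAG is a dictionary mapping a node name to an (opcode, operands) pair, and the callbacks `is_fusible`, `find_matches`, `fuse`, `cost`, `select` and `replace` are hypothetical placeholders for the legality check, graph matching, fusion, cost evaluation, selection and replacement that the method describes. The toy callbacks fuse add(mul(a, b), c) into a single fma node.

```python
from collections import deque


def run_node_fusion(dag, topo_order, is_fusible, find_matches, fuse, cost, select, replace):
    """Drive S22 to S6 over a work list filled in topological order (S21)."""
    worklist = deque(topo_order)                 # S21: linear work list
    while worklist:                              # S22: loop until the work list is empty
        node = worklist.popleft()                # take the first node and delete it
        if not is_fusible(dag, node):            # S22: opcode / value type check
            continue
        matches = find_matches(dag, node)        # S23: all fusible subgraphs rooted at node
        if not matches:                          # S24: nothing matched, back to S22
            continue
        fused = [fuse(dag, m) for m in matches]  # S3: strategy k fuses subgraph k
        costs = [cost(f) for f in fused]         # S4: cost of each fused subgraph
        k = select(costs)                        # S5: choose one strategy
        replace(dag, matches[k], fused[k])       # S6: splice the fused subgraph into the DAG
    return dag                                   # S7 (code generation) would follow


# Toy callbacks (all hypothetical): fuse add(mul(a, b), c) into fma(a, b, c).
def is_fusible(dag, node):
    return dag[node][0] == "add"


def find_matches(dag, node):
    # One match per operand that is produced by a multiply.
    return [(node, operand) for operand in dag[node][1] if dag[operand][0] == "mul"]


def fuse(dag, match):
    root, mul = match
    rest = list(dag[root][1])
    rest.remove(mul)                             # the remaining operand is the addend
    return ("fma", dag[mul][1] + rest)


def cost(fused):
    return {"fma": 4}[fused[0]]                  # assumed cycle count, not a real figure


def select(costs):
    return costs.index(min(costs))


def replace(dag, match, fused):
    dag[match[0]] = fused                        # the mul node is left for later cleanup


dag = {
    "a": ("load", []), "b": ("load", []), "c": ("load", []),
    "m": ("mul", ["a", "b"]),
    "s": ("add", ["m", "c"]),
}
run_node_fusion(dag, ["a", "b", "c", "m", "s"],
                is_fusible, find_matches, fuse, cost, select, replace)
```

Passing the steps in as callbacks keeps the loop itself target-independent; only the callbacks would need to know the instruction set of a particular back end.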

Further improvements to the above technical solution are as follows:

1. In the above solution, the work list is a linear data structure that contains all nodes to be processed.

2. In the above solution, different root nodes correspond to different DAG subgraph matching templates, and each DAG subgraph matching template is itself a DAG subgraph.

3. In the above solution, one node in the DAG corresponds to one instruction in the instruction set of the heterogeneous platform.

4. In the above solution, the DAG subgraph matched in S23 is the pre-fusion DAG subgraph, that is, the subgraph before node fusion optimization that corresponds to the fused DAG subgraph.

Owing to the above technical solution, the invention has the following advantages over the prior art:

The self-adaptive node fusion compiling optimization method based on a heterogeneous platform provides a self-adaptive node fusion compiling optimization interface and algorithm on the heterogeneous platform. In the DAG lowering stage, it uses the data flow and control flow information of the DAG, combined with the instruction set information of the target back end, to evaluate the cost of subgraphs before and after fusion, and adaptively selects the optimal node fusion strategy according to the evaluation result. This generates more efficient program code, simplifies the DAG, reduces the complexity of other optimizations and opens up more opportunities for them. It also provides accurate guidance for node fusion optimization on the heterogeneous platform, further exploits the potential of the platform's compound instructions, and improves the performance of the heterogeneous platform.

Drawings

FIG. 1 is a flow chart of a self-adaptive node fusion compiling and optimizing method based on a heterogeneous platform.

Detailed Description

Embodiment: a self-adaptive node fusion compiling optimization method based on a heterogeneous platform, applied to a large-scale heterogeneous system, comprises the following steps:

S1, the compiler compiles the source program to generate the compiler's intermediate representation, a DAG, and lowers the DAG; the following operations are performed on the DAG during the DAG lowering (degradation) stage:

S23, taking the node extracted in S22 as the root node and matching against the DAG subgraph matching templates of the compiler back end. A matching template here is a Pattern template: Pattern is a compiler data structure used for template matching, whose input is a DAG subgraph and whose output is also a DAG subgraph, so the work of a Pattern is to convert the input DAG subgraph into the output DAG subgraph. A graph matching algorithm is used to find all n DAG subgraphs that are rooted at the node extracted in S22 and can undergo node fusion, and the method then goes to S24;

S3, node fusion strategies: the n DAG subgraphs found in S23 that can undergo node fusion correspond one to one to n node fusion strategies, and the k-th such DAG subgraph is fused according to the k-th node fusion strategy, where k = 1, 2, 3, 4.

S4, fusion strategy cost evaluation: according to the data references of all nodes of the fused DAG subgraph from S3 and the instruction set information of the heterogeneous platform, calculating the cost of running the instruction sequence into which the k-th fused DAG subgraph generated in S3 is converted, where the cost comprises the number of clock cycles spent, the number of registers used and the amount of memory occupied, and going to S5;

S5, adaptive selection of a node fusion strategy: according to the costs of the k fusion strategies calculated in S4, combined with the usage of the registers, cache and memory of the target back end, adaptively selecting the optimal node fusion strategy, namely the one that gives the greatest performance improvement on the target back end; for example, if the cache has few remaining resources, the fusion strategy with the lower memory access cost is selected; then going to S6;
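The adaptive selection in S5 can be pictured with a small sketch. The patent does not give a selection formula, so everything here is an assumption made for illustration: the `StrategyCost` and `BackendState` records, the `low_cache_bytes` threshold, and the rule of ranking by memory cost first when the cache is nearly full and by clock cycles first otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyCost:
    cycles: int      # clock cycles spent
    registers: int   # registers used
    memory: int      # bytes of memory occupied


@dataclass(frozen=True)
class BackendState:
    free_registers: int
    free_cache_bytes: int


def select_strategy(costs, backend, low_cache_bytes=1024):
    """Return the index of the preferred strategy under an assumed policy."""
    if not costs:
        raise ValueError("at least one strategy cost is required")
    # Prefer strategies that fit in the free registers, if any exist.
    feasible = [i for i, c in enumerate(costs) if c.registers <= backend.free_registers]
    pool = feasible or list(range(len(costs)))
    if backend.free_cache_bytes < low_cache_bytes:
        # Little cache left: rank by memory cost first, then by cycles.
        return min(pool, key=lambda i: (costs[i].memory, costs[i].cycles))
    # Otherwise rank by cycles first, then by memory cost.
    return min(pool, key=lambda i: (costs[i].cycles, costs[i].memory))


costs = [StrategyCost(cycles=4, registers=3, memory=64),
         StrategyCost(cycles=6, registers=2, memory=16)]
roomy_choice = select_strategy(costs, BackendState(free_registers=8, free_cache_bytes=32768))
tight_choice = select_strategy(costs, BackendState(free_registers=8, free_cache_bytes=512))
```

With these made-up numbers the faster strategy wins when the cache has room, and the strategy with the smaller memory footprint wins when it does not, which mirrors the example given in S5.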

The work list is a linear data structure that contains all nodes to be processed.

Different root nodes correspond to different DAG subgraph matching templates, each of which is itself a DAG subgraph.

One node in the DAG corresponds to one instruction in the instruction set of the heterogeneous platform.

The DAG subgraph matched in S23 is the pre-fusion DAG subgraph, that is, the subgraph before node fusion optimization that corresponds to the fused DAG subgraph.
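The Pattern template described in S23, whose input and output are both DAG subgraphs, can be sketched as follows. This is a simplified illustration and not the compiler's actual data structure: subgraphs are written as nested tuples (an opcode followed by its operands), strings beginning with `?` are wildcards, and the class and function names are hypothetical.

```python
def match(pattern, node, bindings):
    """Match `pattern` against `node`, recording wildcard bindings."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings[pattern] == node   # a repeated wildcard must agree
        bindings[pattern] = node
        return True
    if isinstance(pattern, tuple):
        if not isinstance(node, tuple) or len(node) != len(pattern):
            return False
        if node[0] != pattern[0]:              # opcodes must be identical
            return False
        return all(match(p, n, bindings) for p, n in zip(pattern[1:], node[1:]))
    return pattern == node


def instantiate(template, bindings):
    """Build the output subgraph by substituting the bound wildcards."""
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, tuple):
        return (template[0],) + tuple(instantiate(t, bindings) for t in template[1:])
    return template


class Pattern:
    """Maps an input subgraph shape to an output subgraph shape."""

    def __init__(self, source, target):
        self.source = source
        self.target = target

    def apply(self, root):
        bindings = {}
        if match(self.source, root, bindings):
            return instantiate(self.target, bindings)
        return None


# Two templates rooted at an add node; both produce a multiply-add node.
FMA_LEFT = Pattern(("add", ("mul", "?a", "?b"), "?c"), ("fma", "?a", "?b", "?c"))
FMA_RIGHT = Pattern(("add", "?c", ("mul", "?a", "?b")), ("fma", "?a", "?b", "?c"))

root = ("add", ("mul", "x", "w"), ("mul", "y", "z"))
candidates = [out for out in (p.apply(root) for p in (FMA_LEFT, FMA_RIGHT)) if out is not None]
```

Here the same root node yields two fused candidates, which corresponds to n = 2 fusible subgraphs rooted at one node; choosing between them is the job of the cost evaluation and selection steps.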

The embodiment is further explained below:

The specific flow of the invention is shown in FIG. 1. While the compiler optimizes and lowers (degrades) the DAG, it traverses the DAG from the root node in topological order, identifies DAG fusion subgraphs with each node taken as a root node, evaluates the cost of the various node fusion strategies according to the DAG's control flow and data flow information and the instruction set information of the back-end platform, and adaptively selects the optimal node fusion strategy according to the cost.

The specific process is as follows:

1) Generating an intermediate representation

a) The compiler compiles the source program to generate the compiler's intermediate representation, a DAG; go to 2a);

2) DAG fusion subgraph recognition

a) In the DAG lowering (degradation) stage, topologically sort the DAG to obtain a topological sequence, add the nodes of the DAG to a work list (a linear data structure containing all nodes to be processed) in the order of the topological sequence, and go to 2b);

b) Take the first node out of the work list and delete it from the work list, then check the node's operation code, operand value types and result value type; if the node can undergo node fusion, go to 2c); otherwise repeat 2b) until the work list is empty, then go to 7a);

c) Taking the node found in 2b) as the root node, use a graph matching algorithm with the back end's DAG subgraph matching templates (different root nodes correspond to different templates, and a template is itself a DAG subgraph) to find all n DAG subgraphs that are rooted at that node and can undergo node fusion, and go to 2d);

d) If 2c) found no DAG subgraph that can undergo node fusion, go to 2b); otherwise go to 3a);

3) Node fusion strategy n

a) According to node fusion strategy n, perform node fusion (fusing multiple nodes into one node) on the n-th DAG subgraph found in 2c) to generate a fused DAG subgraph (that is, one subgraph is matched through template matching and then replaced with another subgraph), record all nodes of the fused DAG subgraph, and go to 4a);

4) Cost assessment

a) According to the data references of the nodes and the instruction set information of the heterogeneous platform (one node in the DAG corresponds to one instruction in the instruction set), evaluate the cost of running the instruction sequence into which the fused DAG subgraph produced by node fusion strategy n in 3a) is converted, where the cost includes the number of clock cycles spent, the number of registers, the amount of memory occupied, and so on; go to 5a);

5) Adaptive selection of the node fusion strategy

a) According to the n fusion strategy costs calculated in 4a), combined with the usage of the registers, cache and memory of the target back end, adaptively select the optimal node fusion strategy (the one that gives the greatest performance improvement on the target back end; for example, if the cache has few remaining resources, a fusion strategy with a lower memory access cost can be selected), and go to 6a);

6) Target-dependent node fusion

a) According to the node fusion strategy selected in 5a), transfer the control flow and data flow relations of the DAG subgraph matched in 2c) (the pre-fusion subgraph that corresponds to the fused DAG subgraph) to the fused DAG subgraph generated by the strategy selected in 5a), replace the pre-fusion DAG subgraph with the fused DAG subgraph, and go to 2b);

7) Generating object code

a) After DAG lowering is complete, the compiler compiles the DAG to generate heterogeneous platform code.
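The cost evaluation of step 4a) can be illustrated with a small sketch that sums per-instruction figures over an instruction sequence. The table `INSTRUCTION_INFO` and its numbers are invented for illustration and do not describe any real instruction set; summing clock cycles also ignores pipelining. The `fused_sequence` helper shows, as an assumed example, how a data reference matters: if another node still reads the product, the multiply has to stay next to the fused instruction.

```python
# Hypothetical per-instruction figures: (clock cycles, result registers, code bytes).
INSTRUCTION_INFO = {
    "mul": (4, 1, 4),
    "add": (2, 1, 4),
    "fma": (4, 1, 4),
}


def sequence_cost(opcodes, info=INSTRUCTION_INFO):
    """Sum cycles, registers and memory over a straight-line instruction sequence."""
    return {
        "cycles": sum(info[op][0] for op in opcodes),
        "registers": sum(info[op][1] for op in opcodes),
        "memory": sum(info[op][2] for op in opcodes),
    }


def fused_sequence(mul_has_other_users):
    """Instruction sequence after fusing add(mul(a, b), c) into a multiply-add."""
    # If another node still reads the product, the multiply cannot be removed.
    return ["mul", "fma"] if mul_has_other_users else ["fma"]


before = sequence_cost(["mul", "add"])
after_single_use = sequence_cost(fused_sequence(mul_has_other_users=False))
after_shared_use = sequence_cost(fused_sequence(mul_has_other_users=True))
```

Under these assumed figures, fusion saves cycles when the product has a single use but costs more than the unfused sequence when the product is shared, which is the kind of case where cost-guided selection would decline to fuse.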

When the above self-adaptive node fusion compiling optimization method based on a heterogeneous platform is adopted, a self-adaptive node fusion compiling optimization interface and algorithm are provided on the heterogeneous platform. In the DAG lowering stage, the cost of subgraphs before and after fusion is evaluated using the data flow and control flow information of the DAG combined with the instruction set information of the target back end, and the optimal node fusion strategy is selected adaptively according to the evaluation result. This generates more efficient program code, simplifies the DAG, reduces the complexity of other optimizations and opens up more opportunities for them. It also provides accurate guidance for node fusion optimization on the heterogeneous platform, further exploits the potential of the platform's compound instructions, and improves the performance of the heterogeneous platform.

To facilitate a better understanding of the invention, the terms used herein will be briefly explained as follows:

DAG (directed acyclic graph): an intermediate representation used in compilation optimization, on which lowering (degradation) and optimization are performed.

Topological sorting: topologically sorting a directed acyclic graph G means arranging all vertices of G into a linear sequence such that, for any pair of vertices u and v in the graph, if the edge <u, v> ∈ E(G), then u appears before v in the sequence.

Topological sequence: the linear sequence obtained by topological sorting of the directed acyclic graph is called a topological sequence.
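One standard way to obtain a topological sequence is Kahn's algorithm, sketched below. The patent does not name a particular sorting algorithm, so this choice and the function name `topological_order` are assumptions; edges are written as (u, v) pairs, matching the definition above.

```python
from collections import deque


def topological_order(edges, vertices):
    """Return a topological sequence of `vertices` for directed edges (u, v)."""
    successors = {v: [] for v in vertices}
    indegree = {v: 0 for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    # Vertices with no incoming edges can be emitted immediately.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:   # all predecessors of v have been emitted
                ready.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


vertices = ["a", "b", "c", "m", "s"]
edges = [("a", "m"), ("b", "m"), ("m", "s"), ("c", "s")]
order = topological_order(edges, vertices)
```

For every edge (u, v) in this example, u appears before v in `order`, so each node enters the work list only after the nodes it depends on.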

The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A self-adaptive node fusion compiling optimization method based on a heterogeneous platform, characterized by comprising the following steps:

S22, the compiler takes the first node from the work list generated in S21, deletes it from the work list, and checks the node's operation code, operand value types and result value type; if they are all legal, the node undergoes node fusion and S23 is executed; otherwise S22 is repeated until the work list is empty, and then the method goes to S7;

S23, taking the node extracted in S22 as the root node, using a graph matching algorithm with the DAG subgraph matching templates of the compiler back end to find all n DAG subgraphs that are rooted at that node and can undergo node fusion, and going to S24;

2. The self-adaptive node fusion compiling optimization method based on a heterogeneous platform according to claim 1, characterized in that: the work list is a linear data structure that contains all nodes to be processed.

3. The self-adaptive node fusion compiling optimization method based on a heterogeneous platform according to claim 1, characterized in that: different root nodes correspond to different DAG subgraph matching templates, each of which is itself a DAG subgraph.

4. The self-adaptive node fusion compiling optimization method based on a heterogeneous platform according to claim 1, characterized in that: one node in the DAG corresponds to one instruction in the instruction set of the heterogeneous platform.

5. The self-adaptive node fusion compiling optimization method based on a heterogeneous platform according to claim 1, characterized in that: the DAG subgraph matched in S23 is the pre-fusion DAG subgraph, that is, the subgraph before node fusion optimization that corresponds to the fused DAG subgraph.