CN112559031B

CN112559031B - Many-core program reconstruction method based on data structure

Info

Publication number: CN112559031B
Application number: CN201910910099.1A
Authority: CN
Inventors: 徐金秀; 何香; 陈鑫; 徐占; 刘鑫; 李芳�; 孙唯哲; 郭恒; 赵朋朋
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2022-10-04
Anticipated expiration: 2039-09-25
Also published as: CN112559031A

Abstract

The invention discloses a many-core program reconstruction method based on a data structure, which comprises a reconstruction method based on extracting a basic type data structure, a reconstruction method based on space compression of array dimension reduction and a reconstruction method based on space compression of increasing transmission word length. The invention mainly aims at diversified data structures in the multi-stage heterogeneous many-core parallel computing problem, provides a high-efficiency data structure reconstruction method and improves the computing efficiency of heterogeneous parallel programs.

Description

Many-core program reconstruction method based on data structure

Technical Field

The invention relates to a many-core program reconstruction method based on a data structure, and belongs to the technical field of computers.

Background

In recent years, high-performance computing technology has developed rapidly. Numerical simulation software not only pursues ever higher computing performance but also places greater demands on the storage capacity of the computer system. The data structure is a key factor affecting the performance of computing software, and a data structure designed for a multi-core computing system often becomes a weak point that limits computing capability when the software runs on a heterogeneous many-core system.

Any computing software that is to obtain correct results and good performance must have a well-designed data structure for its data objects. When studying a data structure, the relationships between data elements and how they are implemented are generally considered, together with the algorithm implementation and execution efficiency. A common problem in the many-core parallelization of computing software is that a discrete data structure, or a data structure with intricate relationships, causes the slave cores to frequently access irregular storage addresses of the master core, which greatly reduces the many-core parallel speedup.

Most application software involves complex data structures. Many physical quantities are stored as multidimensional arrays of basic data types, while data elements with complex logical relationships are stored as complex data types. To achieve many-core parallelization, a slave core must access a master core storage address and copy data of a certain length into its own storage space so that it can access and compute on the data efficiently. At present, heterogeneous many-core coprocessors have a simple structure, strong computing capability, and limited storage capacity, so memory access exceptions on the coprocessor commonly occur when a traditional multi-core program is directly parallelized on heterogeneous many-core hardware.

Disclosure of Invention

The invention aims to provide a data structure-based many-core program reconstruction method, which mainly aims at diversified data structures in the multi-level heterogeneous many-core parallel computing problem, provides an efficient data structure reconstruction method and improves the computing efficiency of heterogeneous parallel programs.

In order to achieve the purpose, the invention adopts the technical scheme that: a many-core program reconstruction method based on a data structure comprises a reconstruction method based on extracting a basic type data structure, a reconstruction method based on array dimension reduction space compression, and a reconstruction method based on transmission word length increase space compression;

the reconstruction method based on the extracted basic type data structure comprises the following steps:

S11, using a performance analysis tool or printed output information to identify the time hotspot functions of the running program, and finding the most time-consuming program segment in each time hotspot function;

S12, analyzing the most time-consuming loop segments in each time hotspot function one by one, first analyzing the data structure of the loop segment, and executing S13 if the loop segment contains a data variable declared with a complex data type; if the loop segment contains only data variables declared with basic data types, the loop segment is to be completed by the slave core, and S16 is executed;

S13, extracting, from the data variables declared with complex data types in the loop segment, the basic data type member variables related to the task of the loop segment, which are called original variables, and declaring corresponding basic data type alias variables, which are called new variables;

S14, adding the declarations of the new variables extracted in S13 to the variable declaration part of the time hotspot function, and, at the start of the execution part of the time hotspot function, matching the upper and lower address bounds of each new variable's memory address to those of the corresponding original variable;

S15, replacing the original variable names in the loop segment with the new variable names, the task of the loop segment to be completed by the slave core, and executing S16;

and S16, directly using compiler directives to perform many-core accelerated parallelization of the loop segment task to be completed by the slave core.
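
The extraction in steps S13 to S15 can be sketched as follows. This is a minimal Python illustration rather than the patent's own code: the class `OceanState`, its members, and the function `scale_loop` are hypothetical names, and a `memoryview` stands in for the alias whose address bounds are matched to the original member.

```python
from array import array
from dataclasses import dataclass, field


@dataclass
class OceanState:
    """Hypothetical complex data type holding several members."""
    name: str = "demo"
    # Basic-type member used by the hot loop (the "original variable").
    h: array = field(default_factory=lambda: array("d", [1.0, 2.0, 3.0, 4.0]))
    # Member that the hot loop never touches.
    mask: array = field(default_factory=lambda: array("i", [1, 1, 0, 1]))


def scale_loop(values, factor):
    """Hot loop rewritten to use only a basic-type buffer (S15)."""
    for i in range(len(values)):
        values[i] = values[i] * factor


state = OceanState()

# S13/S14: declare the new variable as an alias of the original member.
# The memoryview covers the same memory, from first to last element.
t_h = memoryview(state.h)

# S15/S16: the loop now sees only the basic-type alias.
scale_loop(t_h, 2.0)

# Writes made through the alias are visible in the original structure.
result = list(state.h)
print(result)
```

Because the alias shares storage with the original member, no copy-back step is needed after the loop; only the name used inside the loop segment changes.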

The reconstruction method based on the array dimension reduction space compression comprises the following steps:

S21, using a performance analysis tool or printed output information to identify the time hotspot functions of the running program, and finding the most time-consuming program segment in each time hotspot function;

S22, analyzing the most time-consuming loop segments in each time hotspot function one by one, and executing S23 if the data structures in the loop segment include a multidimensional array that has no data dependency; if the loop segment contains no multidimensional array, or data dependencies exist among the multidimensional arrays, S25 is executed;

S23, declaring, in the declaration part of the time hotspot function, a corresponding reduced-dimension array for each multidimensional array without dependencies;

S24, finding the execution statements that involve the multidimensional arrays without dependencies and modifying them into execution statements that use the reduced-dimension arrays, forming a new loop segment with the reconstructed data structure, to be completed by the slave core;

and S25, directly using compiler directives to perform many-core accelerated parallelization of the loop segment task to be completed by the slave core.
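
Steps S22 to S24 can be illustrated with a small sketch. It assumes a hypothetical loop in which each level of a two-dimensional temporary is computed independently; the function names and the squared-sum computation are invented for illustration and are not taken from the patent figures.

```python
def original(field):
    """Original form: a full 2-D temporary, one row per level."""
    n_lev, n = len(field), len(field[0])
    tmp = [[0.0] * n for _ in range(n_lev)]
    for k in range(n_lev):
        for i in range(n):
            tmp[k][i] = field[k][i] ** 2
    sums = [sum(tmp[k]) for k in range(n_lev)]
    return sums, n_lev * n  # result, number of temporary elements held


def reduced(field):
    """Reconstructed form: levels are independent, so one 1-D row is reused."""
    n_lev, n = len(field), len(field[0])
    tmp = [0.0] * n  # S23: reduced-dimension declaration
    sums = []
    for k in range(n_lev):
        for i in range(n):
            tmp[i] = field[k][i] ** 2  # S24: statements use the reduced array
        sums.append(sum(tmp))
    return sums, n


field = [[float(k + i) for i in range(4)] for k in range(3)]
sums_a, size_a = original(field)
sums_b, size_b = reduced(field)
print(sums_a == sums_b, size_a, size_b)
```

The reduction is only valid because no level reads values produced for another level; if such a dependency existed, S22 would route the loop segment straight to S25.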

The reconstruction method based on the space compression of the increased transmission word length comprises the following steps:

S31, using a performance analysis tool or printed output information to identify the time hotspot functions of the running program, and finding the most time-consuming program segment in each time hotspot function;

S32, analyzing the most time-consuming loop segments in each time hotspot function one by one, observing the characteristics of the calculation task in the loop segment, and executing S33 if multiple execution statements read one or more of the same array variables multiple times and write different array variables; otherwise, the loop segment is to be completed by the slave core, and S35 is executed;

S33, optimizing the data structure of the array variables that are read multiple times by declaring corresponding alias arrays in the declaration part of the time hotspot function, merging the several different array variables that are written into one expanded-dimension array, and declaring the corresponding expanded-dimension array in the declaration part of the time hotspot function;

S34, in the execution part of the time hotspot function, for the reconstructed data structure, changing the discrete memory accesses to the alias arrays into contiguous memory accesses and modifying the write operations to target the expanded-dimension array, forming a loop segment with the reconstructed data structure, to be completed by the slave core;

and S35, directly using compiler directives to perform many-core accelerated parallelization of the loop segment task to be completed by the slave core.
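
The merge described in S33 and S34 can be sketched as below. `MainMemory` is a hypothetical stand-in that merely counts write-back operations, and the five arithmetic operations are invented; the sketch only shows that five separate writes become a single, longer write once the outputs share one expanded array.

```python
class MainMemory:
    """Hypothetical stand-in for main memory that records each write-back."""

    def __init__(self):
        self.store = {}
        self.transfers = []  # length of each write, in elements

    def put(self, name, values):
        self.store[name] = list(values)
        self.transfers.append(len(values))


# Five invented operations that all read the same two input arrays.
OPS = [
    lambda a, b: a + b,
    lambda a, b: a - b,
    lambda a, b: a * b,
    lambda a, b: 2.0 * a + b,
    lambda a, b: a + 2.0 * b,
]


def separate_outputs(a, b, mem):
    """Original form: five result arrays, written back one by one."""
    for j, op in enumerate(OPS):
        mem.put(f"out{j}", [op(x, y) for x, y in zip(a, b)])


def merged_output(a, b, mem):
    """Reconstructed form: one expanded array, written back once."""
    merged = []
    for op in OPS:
        merged.extend(op(x, y) for x, y in zip(a, b))
    mem.put("out_all", merged)


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
mem_old, mem_new = MainMemory(), MainMemory()
separate_outputs(a, b, mem_old)
merged_output(a, b, mem_new)
print(mem_old.transfers, mem_new.transfers)
```

The total number of elements written is unchanged; only the number of transfers and the length of each transfer differ.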

Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:

The invention provides a many-core program reconstruction method based on a data structure. It mainly targets the diversified data structures found in multi-level heterogeneous many-core parallel computing problems, provides an efficient data structure reconstruction method, and improves the computing efficiency of heterogeneous parallel programs. In the heterogeneous many-core parallelization process, the complex data structures of the many-core parallel blocks are analyzed in advance, the data structures are optimized, and redundant data structures are reduced; many-core parallelization of the application software is then achieved by exploiting the performance advantages of the heterogeneous many-core system. The optimized data structures can greatly reduce the overhead of discrete memory access between the master core and the slave cores and increase the running speed of the program. The method is suitable for most high-performance scientific computing software, including ocean numerical simulation and aerodynamic numerical simulation, and can optimize the data structures of application software, improve the utilization of the storage space of the computing system, and improve the computing performance of heterogeneous many-core parallel programs.

Drawings

FIG. 1 is a code example of the reconstruction method based on extracting a basic type data structure;

FIG. 2 is a pseudo code of a data structure reconstruction method based on spatial compression;

FIG. 3 is a code example of a data structure reconstruction method based on spatial compression;

FIG. 4 is a flow chart of a reconstruction method based on extracting a basic type data structure according to the present invention;

FIG. 5 is a flow chart of the reconstruction method of the space compression based on the array dimension reduction according to the present invention;

FIG. 6 is a flow chart of the reconstruction method based on space compression by increasing the transmission word length according to the present invention.

Detailed Description

The embodiment is as follows: a many-core program reconstruction method based on a data structure comprises a reconstruction method based on extracting a basic type data structure, a reconstruction method based on space compression of array dimension reduction, and a reconstruction method based on space compression of increasing transmission word length;

S13, extracting, from the data variables declared with complex data types in the loop segment, the basic data type member variables related to the task of the loop segment, which are called original variables, and declaring corresponding basic data type alias variables, which are called new variables;

Because redundant storage in the LDM space of the slave core is eliminated and the memory access capability of the slave core is improved, many-core optimization performance is greatly improved.

S22, analyzing the most time-consuming loop segments in each time hotspot function one by one, and executing S23 if the data structures in the loop segment include a multidimensional array that has no data dependency; if the loop segment contains no multidimensional array, or data dependencies exist among the multidimensional arrays, S25 is executed;

taking the actual program shown in FIG. 2 as an example: grad_p is a four-dimensional array, which can be regarded as 2 np nlev three-dimensional arrays, and the third dimension of the three-dimensional array has no dependency relationship in use; analysis of the gradient_sphere function shows that its calculation has no dependency along the third dimension;

as in the example of FIG. 2, the above dependency analysis shows that the gradient_sphere calculation can be split, so the gradient_sphere function is re-implemented to calculate only half of grad_p each time, and this reconstruction saves half of the DMA space;

the continuity of data access is increased, so that the word length transferred by a single DMA access is increased;

The values of the 5 groups of variables in the upper box of FIG. 3 are each computed from the same two variables; because of how the data are stored, the 5 arrays on the left require 5 DMA write operations;

the data structure is optimized as follows: the 5 arrays on the left are merged into one expanded-dimension array, and the array structure on the right is adjusted so that it is read contiguously along the first dimension. The DMA write operations of the slave core are thereby reduced to 1, and the length of a single DMA transfer is increased by 4 times (to five times its original length); the number of DMA operations falls, higher bandwidth utilization is obtained, and the compute-to-memory-access ratio is greatly improved.

The embodiment is further explained below:

The invention provides several ways of optimizing data structures.

The most commonly used is the reconstruction method based on extracting a basic type data structure, which simplifies the complex data types in computing software. Its purpose is to extract the basic type data structures involved in data operations within a many-core parallel block; a concrete implementation is shown in FIG. 1.

In the figure, the left box contains 3 complex data types: type_h contains a 3-dimensional array of the real data type, type_dish contains a 4-dimensional array of the real data type, and type_dh contains several basic data type elements. The variables type_h, type_dish, and type_dh are used in a many-core parallel block, and direct many-core parallelization can fail because the data variables are not recognized, so the corresponding basic data type variables t_h_h, t_dish_h, and area need to be declared. After the variables are declared, the addresses of the new variables must be matched to those of the original variables, so that slave core access to the master core variable region becomes contiguous memory access. This reduces redundant storage in the LDM space on the one hand and improves slave core access performance to main memory on the other.
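
One way in which extracting a basic-type member can reduce local storage is sketched below. The record layout `Cell` and its members are hypothetical and are not the types shown in FIG. 1; the sketch only compares the bytes needed to hold whole records with the bytes needed for the single member a loop actually reads.

```python
import ctypes


class Cell(ctypes.Structure):
    """Hypothetical complex type: one record per grid cell."""
    _fields_ = [
        ("area", ctypes.c_double),  # the only member the hot loop reads
        ("lat", ctypes.c_double),
        ("lon", ctypes.c_double),
        ("flag", ctypes.c_int32),
    ]


N = 8
cells = (Cell * N)()
for i in range(N):
    cells[i].area = 1.0 + i

# Copying whole records brings every member into the local buffer,
# and consecutive "area" values are separated by the other members.
bytes_whole_records = ctypes.sizeof(cells)

# Extracting the basic-type member gives one contiguous array of doubles.
area = (ctypes.c_double * N)(*[c.area for c in cells])
bytes_extracted = ctypes.sizeof(area)

total_area = sum(area)
print(bytes_whole_records, bytes_extracted, total_area)
```

The exact size of the whole-record buffer depends on the platform's padding rules, but it is always larger than the extracted array because each record carries members the loop never uses.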

The second method is a data structure reconstruction method based on space compression by array dimension reduction. In the core computation of some application software, multidimensional arrays used as local variables have no dependency in some dimensions, so dimension reduction allows their space to be reused; the design idea is shown in FIG. 2.

In FIG. 2, the upper box is the original code. grad_p is a four-dimensional array, which can be regarded as 2 np nlev three-dimensional arrays, and the third dimension of the three-dimensional array has no dependency relationship in use. Therefore, the gradient_sphere function is re-implemented so that only half of grad_p is calculated each time, and this reconstruction saves half of the DMA space.

The third method is a data structure reconstruction method based on space compression by increasing the transmission word length. The core idea is to make each single transfer as long as possible, thereby obtaining higher DMA bandwidth, reducing the number of DMA operations, and avoiding DMA congestion.

In the example in the upper box of FIG. 3, the values of 5 array variables are each computed from the same two variables. With direct many-core parallelization, the 5 arrays on the left require 5 DMA write operations. After the data structure is optimized, the 5 arrays are merged into one expanded-dimension array, the DMA write operations of the slave core are reduced to 1, and the length of a single DMA transfer is increased by 4 times (to five times its original length). The number of DMA operations is thus reduced while higher bandwidth utilization is obtained, and the compute-to-memory-access ratio is greatly improved.
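
The benefit of fewer, longer transfers can be made concrete with a simple cost model. The model below, a fixed startup cost per transfer plus a cost per element, and the numbers in it are assumptions chosen for illustration; the patent does not give a timing model or measured figures.

```python
def transfer_time(lengths, startup=1.0, per_element=0.01):
    """Total time for a list of transfers under an assumed
    startup-plus-size cost model (arbitrary units)."""
    return sum(startup + per_element * n for n in lengths)


n = 100
separate = [n] * 5   # five writes of n elements each
merged = [5 * n]     # one write of 5n elements

t_separate = transfer_time(separate)
t_merged = transfer_time(merged)
saved = t_separate - t_merged
print(t_separate, t_merged, saved)
```

Under this model the same number of elements is moved in both cases, and the saving equals the four startup costs that are no longer paid.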

When the data structure-based many-core program reconstruction method is adopted, an efficient data structure reconstruction method is provided for the diversified data structures found in multi-level heterogeneous many-core parallel computing problems, and the computing efficiency of heterogeneous parallel programs is improved. In the heterogeneous many-core parallelization process, the complex data structures of the many-core parallel blocks are analyzed in advance, the data structures are optimized, and redundant data structures are reduced; many-core parallelization of the application software is then achieved by exploiting the performance advantages of the heterogeneous many-core system. The optimized data structures can greatly reduce the overhead of discrete memory access between the master core and the slave cores and increase the running speed of the program. The method is suitable for most high-performance scientific computing software, including ocean numerical simulation and aerodynamic numerical simulation, and can optimize the data structures of application software, improve the utilization of the storage space of the computing system, and improve the computing performance of heterogeneous many-core parallel programs.

To facilitate a better understanding of the invention, the terms used herein will be briefly explained as follows:

Data structure: a set of data elements that have specific relationships with one another, organized according to their logical relationships. It is the way a computer stores data, and a good data structure can improve the data processing capability of the computer.

The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A many-core program reconstruction method based on a data structure is characterized in that: the method comprises a reconstruction method based on extracting a basic type data structure, a reconstruction method based on array dimension reduction space compression, and a reconstruction method based on transmission word length increase space compression;

S12, analyzing the most time-consuming loop segments in each time hotspot function one by one, first analyzing the data structure of the loop segment, and executing S13 if the loop segment contains a data variable declared with a complex data type; if the loop segment contains only data variables declared with basic data types, the loop segment is to be completed by the slave core, and S16 is executed;

S14, adding the declarations of the new variables extracted in S13 to the variable declaration part of the time hotspot function, and, at the start of the execution part of the time hotspot function, matching the upper and lower address bounds of each new variable's memory address to those of the corresponding original variable;

S16, directly using compiler directives to perform many-core accelerated parallelization of the loop segment task to be completed by the slave core;

S25, directly using compiler directives to perform many-core accelerated parallelization of the loop segment task to be completed by the slave core;

S33, optimizing the data structure of the array variables that are read multiple times by declaring corresponding alias arrays in the declaration part of the time hotspot function, merging the several different array variables that are written into one expanded-dimension array, and declaring the corresponding expanded-dimension array in the declaration part of the time hotspot function;