CN110516194B

CN110516194B - Lattice quantum chromodynamics parallel acceleration method based on a heterogeneous many-core processor

Info

Publication number: CN110516194B
Application number: CN201910750655.3A
Authority: CN
Inventors: 栾钟治; 张增校; 杨海龙; 王锐
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2018-08-15
Filing date: 2019-08-14
Publication date: 2021-03-09
Anticipated expiration: 2039-08-14
Also published as: CN110516194A

Abstract

The invention discloses a lattice quantum chromodynamics (lattice QCD) parallel acceleration method based on a heterogeneous many-core processor, comprising the following steps: first, the slave cores are assigned positions according to their slave core identification numbers; second, the data information that is read is stored according to its position in four-dimensional space; third, each slave core reads from memory, according to its own position identifier, the lattice point values it is responsible for computing; and fourth, the lattice point values held by each slave core are updated iteratively to obtain the updated lattice point values belonging to that slave core. This parallelization method for the Shenwei 26010 heterogeneous many-core processor makes full use of the register communication feature unique to the slave cores of the Shenwei many-core processor, increases data reuse, and eliminates a large amount of redundant data. Compared with running on the master core only, the method after parallel acceleration improves performance by a factor of 63.

Description

Lattice quantum chromodynamics parallel acceleration method based on a heterogeneous many-core processor

Technical Field

The invention relates to a parallel acceleration method for lattice quantum chromodynamics, and in particular to a parallel acceleration method for lattice quantum chromodynamics using the Shenwei 26010 heterogeneous many-core processor.

Background

The Sunway TaihuLight (Shenwei Taihu Light) supercomputer was developed by the National Research Center of Parallel Computer Engineering and Technology and is installed at the National Supercomputing Center in Wuxi; it is the first supercomputer built in China entirely with domestically developed technology to rank first in the world. The system contains 40960 domestically developed Shenwei 26010 many-core processors, which use the 64-bit domestic Shenwei instruction set; its peak floating-point performance is 125 PFLOPS (1.25 × 10^17 operations per second) and its sustained performance is 93 PFLOPS (9.3 × 10^16 operations per second). The architecture of the Shenwei 26010 heterogeneous many-core processor is shown in FIG. 1. Each processor chip comprises four core groups (CGs), which are connected through a network on chip. Each core group mainly comprises a management processing element (MPE, referred to as the master core for short), a computing processing element cluster (CPE cluster, whose cores are referred to as slave cores for short), and a memory controller (MC). The computing cores of the cluster are connected by a communication network with an 8 × 8 mesh topology. The system interface (SI) connects the chip to the off-chip system and is implemented as a standard PCIe 3.0 interface.

At present, heterogeneous computer architectures are characterized by strong parallel capability and strong computing capability. Heterogeneous architectures greatly improve the parallelism and scalability of computing platforms, and a growing number of them offer new computing and programming approaches for scientific computing with very large workloads. How to exploit the computing power of the Shenwei 26010 heterogeneous many-core processor, and how to parallelize the related scientific computing algorithms on it, is one of the current research hotspots.

Disclosure of Invention

To solve the problem that, when data is partitioned on the Shenwei 26010 heterogeneous many-core processor, redundant data in the transfers makes the bandwidth utilization between the slave cores (CPEs) and the master core (MPE) extremely low, the invention provides a lattice quantum chromodynamics parallel acceleration method based on the heterogeneous many-core processor. The method matches positions by combining the position ordering of the slave cores with data partitioning, performs lattice point calculations in four-dimensional space using the fermion field quantity and the gauge field quantity, and, after multiple iterations, saves the updated fermion field quantity matrix to memory in the form of a file. The method is optimized and implemented around the data transfer and computation characteristics of the Shenwei 26010 heterogeneous many-core processor and the characteristics of the lattice quantum chromodynamics algorithm. It makes full use of the register communication feature unique to the slave cores (CPEs), increases data reuse, and eliminates a large amount of redundant data. By exploiting the support of the Shenwei 26010 heterogeneous many-core processor for single-instruction, multiple-data (SIMD) instructions, computing performance is greatly improved.

The lattice quantum chromodynamics parallel acceleration method based on a heterogeneous many-core processor according to the invention is characterized by comprising the following steps:

step one, initializing the slave core matrix positions of the heterogeneous many-core processor;

step two, the master core reads the fermion field quantity and the gauge field quantity;

step three, each slave core reads data information according to its row number and column number, thereby realizing data partitioning;

step four, calculating the data information of any lattice point in any slave core;

step five, each slave core performs the processing of step four in parallel on the data information of every lattice point in its local storage space, thereby obtaining the updated fermion field quantities of all lattice points; step six is then executed;

step six, after the update is finished, adding 1 to the iteration count and calculating the residual of the lattice point fermion field quantity;

the iteration count is recorded as U, the maximum number of iterations is recorded as U_max, with U_max = 1000, and the current iteration count is recorded as U_current; if U_current < U_max, step four is executed; if U_current ≥ U_max, step seven is executed;

the residual of the lattice point fermion field quantity is recorded as R, and the residual threshold of the lattice point fermion field quantity is recorded as R_min, with R_min = 1.0 × 10^-12; if R > R_min, step four is executed; if R ≤ R_min, step seven is executed;

step seven, outputting the updated lattice point matrix to memory and saving it as a file;

the updated fermion field quantities are passed to memory to update DAA^MPE, and the updated DAA^MPE is then written to a file and saved.
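As a non-limiting illustration of the loop control in steps four to seven, the following Python sketch repeats an update until either the iteration cap U_max = 1000 or the residual threshold R_min = 1.0 × 10^-12 is reached. The function name `iterate` and the callables `update` and `residual` are hypothetical stand-ins for steps four and five and for the residual calculation of step six; they are assumptions for illustration and are not taken from the invention.

```python
U_MAX = 1000      # maximum number of iterations, as stated in the text
R_MIN = 1.0e-12   # residual threshold, as stated in the text


def iterate(state, update, residual, u_max=U_MAX, r_min=R_MIN):
    """Repeat `update` until the iteration cap or residual threshold is hit.

    `update` and `residual` are hypothetical callables standing in for
    steps four/five and for the residual calculation of step six.
    Returns (final state, iterations performed, last residual).
    """
    u_current = 0
    while True:
        state = update(state)    # steps four and five: update all lattice points
        u_current += 1           # step six: add 1 to the iteration count
        r = residual(state)      # step six: residual of the updated field
        # Continue only while U_current < U_max and R > R_min.
        if u_current >= u_max or r <= r_min:
            return state, u_current, r   # proceed to step seven
```

With a toy update that halves a number starting from 1.0 and a residual equal to its absolute value, the loop stops after 40 iterations, because 2^-40 is the first power of one half that does not exceed 1.0 × 10^-12.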

The parallel acceleration method can be applied to parallel acceleration processing of the Shenwei 26010 heterogeneous many-core processor.

The lattice quantum chromodynamics parallel acceleration method based on a heterogeneous many-core processor has the following advantages:

First, the method uses the slave core position matrix and data partitioning to determine the data information processed by each slave core, which increases data reuse and eliminates a large amount of redundant data.

Second, the method performs lattice point calculations in four-dimensional space using the position-matched fermion field quantity and gauge field quantity, which reduces bandwidth usage and improves performance by a factor of 145.

Third, experiments show that, compared with the original serial computing method, the parallel-optimized computing method takes less time and can achieve a 63-fold performance improvement.

Fourth, the result obtained by the parallel acceleration method is stored in memory as a single file, so that it can conveniently be reused by the Shenwei 26010 heterogeneous many-core processor.

Drawings

FIG. 1 is an architecture diagram of the Shenwei 26010 heterogeneous many-core processor.

FIG. 2 is a two-dimensional schematic diagram of the four-dimensional space lattice points.

FIG. 2A is a schematic diagram of the lattice points in the XY plane.

FIG. 2B is a schematic diagram of the lattice points in the XZ plane.

FIG. 2C is a schematic diagram of the lattice points in the XT plane.

FIG. 2D is a schematic diagram of the lattice points in the YZ plane.

FIG. 2E is a schematic diagram of the lattice points in the YT plane.

FIG. 2F is a schematic diagram of the lattice points in the ZT plane.

FIG. 3 is a flow chart of the lattice quantum chromodynamics parallel acceleration method based on a heterogeneous many-core processor.

FIG. 4 is a chart of the proportion of running time taken by each part of the program when the method of the invention is run for 10 iterations.

FIG. 5 is a chart of the proportion of running time taken by each part of the program when the method of the invention is run for 100 iterations.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples.

Quantum chromodynamics (QCD) is the fundamental theory describing the strong interaction. Lattice quantum chromodynamics (lattice QCD) can in principle be applied to the theoretical study of non-perturbative QCD. Lattice QCD starts from the basic degrees of freedom of QCD, that is, it is described by the quark field, the antiquark field and the gluon field. These fields are defined on a set of discrete lattice points in four-dimensional Euclidean space.

Each chip of the Shenwei heterogeneous many-core processor has four core groups (CGs), and each core group has one operation control core (MPE) and one computing core array (CPE array). For convenience of explanation, the operation control core is denoted MPE and called the master core for short; the computing core array is denoted CPEs, and its cores are called slave cores for short. Since the core array contains multiple cores, the slave core set is written CPEs = {cpe₁, cpe₂, …, cpe_A}, where cpe₁ denotes the first slave core, cpe₂ denotes the second slave core, and cpe_A denotes the last slave core; for convenience of explanation, cpe_A also denotes any one slave core, and A denotes the slave core identification number. For the Shenwei 26010 heterogeneous many-core processor, A is 64.

In the invention, any piece of data information processed by the Shenwei 26010 heterogeneous many-core processor is denoted S_g; multiple pieces of data information form a data information set denoted SPM = {S₁, S₂, …, S_g, …, S_G}, where S₁ denotes the first piece of data information, S₂ denotes the second piece of data information, S_g denotes the g-th piece of data information (for convenience of explanation, S_g also denotes any piece of data information, with 1 ≤ g ≤ G), S_G denotes the last piece of data information, and G denotes the total number of pieces of data information.

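In the present invention, all the slave cores in the set CPEs = {cpe₁, cpe₂, …, cpe_A} are arranged by position in an 8 × 8 matrix (ordered from the smallest to the largest slave core identification number), which gives the slave core set position matrix add^CPEs:

$$
add^{CPEs} =
\begin{bmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,8} \\
d_{2,1} & d_{2,2} & \cdots & d_{2,8} \\
\vdots & \vdots & \ddots & \vdots \\
d_{8,1} & d_{8,2} & \cdots & d_{8,8}
\end{bmatrix}
\tag{1}
$$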

d_1,1 denotes the position of the first slave core cpe₁ (row 1, column 1 of the slave core set position matrix);
d_1,2 denotes the position of the second slave core cpe₂ (row 1, column 2);
d_1,3 denotes the position of the third slave core cpe₃ (row 1, column 3);
d_1,4 denotes the position of the fourth slave core cpe₄ (row 1, column 4);
d_1,5 denotes the position of the fifth slave core cpe₅ (row 1, column 5);
d_1,6 denotes the position of the sixth slave core cpe₆ (row 1, column 6);
d_1,7 denotes the position of the seventh slave core cpe₇ (row 1, column 7);
d_1,8 denotes the position of the eighth slave core cpe₈ (row 1, column 8);
d_2,1 denotes the position of the ninth slave core cpe₉ (row 2, column 1);
d_2,2 denotes the position of the tenth slave core cpe₁₀ (row 2, column 2);
d_2,3 denotes the position of the eleventh slave core cpe₁₁ (row 2, column 3);
d_2,4 denotes the position of the twelfth slave core cpe₁₂ (row 2, column 4);
d_2,5 denotes the position of the thirteenth slave core cpe₁₃ (row 2, column 5);
d_2,6 denotes the position of the fourteenth slave core cpe₁₄ (row 2, column 6);
d_2,7 denotes the position of the fifteenth slave core cpe₁₅ (row 2, column 7);
d_2,8 denotes the position of the sixteenth slave core cpe₁₆ (row 2, column 8);
d_3,1 denotes the position of the seventeenth slave core cpe₁₇ (row 3, column 1);
d_3,2 denotes the position of the eighteenth slave core cpe₁₈ (row 3, column 2);
d_3,3 denotes the position of the nineteenth slave core cpe₁₉ (row 3, column 3);
d_3,4 denotes the position of the twentieth slave core cpe₂₀ (row 3, column 4);
d_3,5 denotes the position of the twenty-first slave core cpe₂₁ (row 3, column 5);
d_3,6 denotes the position of the twenty-second slave core cpe₂₂ (row 3, column 6);
d_3,7 denotes the position of the twenty-third slave core cpe₂₃ (row 3, column 7);
d_3,8 denotes the position of the twenty-fourth slave core cpe₂₄ (row 3, column 8);
d_4,1 denotes the position of the twenty-fifth slave core cpe₂₅ (row 4, column 1);
d_4,2 denotes the position of the twenty-sixth slave core cpe₂₆ (row 4, column 2);
d_4,3 denotes the position of the twenty-seventh slave core cpe₂₇ (row 4, column 3);
d_4,4 denotes the position of the twenty-eighth slave core cpe₂₈ (row 4, column 4);
d_4,5 denotes the position of the twenty-ninth slave core cpe₂₉ (row 4, column 5);
d_4,6 denotes the position of the thirtieth slave core cpe₃₀ (row 4, column 6);
d_4,7 denotes the position of the thirty-first slave core cpe₃₁ (row 4, column 7);
d_4,8 denotes the position of the thirty-second slave core cpe₃₂ (row 4, column 8);
d_5,1 denotes the position of the thirty-third slave core cpe₃₃ (row 5, column 1);
d_5,2 denotes the position of the thirty-fourth slave core cpe₃₄ (row 5, column 2);
d_5,3 denotes the position of the thirty-fifth slave core cpe₃₅ (row 5, column 3);
d_5,4 denotes the position of the thirty-sixth slave core cpe₃₆ (row 5, column 4);
d_5,5 denotes the position of the thirty-seventh slave core cpe₃₇ (row 5, column 5);
d_5,6 denotes the position of the thirty-eighth slave core cpe₃₈ (row 5, column 6);
d_5,7 denotes the position of the thirty-ninth slave core cpe₃₉ (row 5, column 7);
d_5,8 denotes the position of the fortieth slave core cpe₄₀ (row 5, column 8);
d_6,1 denotes the position of the forty-first slave core cpe₄₁ (row 6, column 1);
d_6,2 denotes the position of the forty-second slave core cpe₄₂ (row 6, column 2);
d_6,3 denotes the position of the forty-third slave core cpe₄₃ (row 6, column 3);
d_6,4 denotes the position of the forty-fourth slave core cpe₄₄ (row 6, column 4);
d_6,5 denotes the position of the forty-fifth slave core cpe₄₅ (row 6, column 5);
d_6,6 denotes the position of the forty-sixth slave core cpe₄₆ (row 6, column 6);
d_6,7 denotes the position of the forty-seventh slave core cpe₄₇ (row 6, column 7);
d_6,8 denotes the position of the forty-eighth slave core cpe₄₈ (row 6, column 8);
d_7,1 denotes the position of the forty-ninth slave core cpe₄₉ (row 7, column 1);
d_7,2 denotes the position of the fiftieth slave core cpe₅₀ (row 7, column 2);
d_7,3 denotes the position of the fifty-first slave core cpe₅₁ (row 7, column 3);
d_7,4 denotes the position of the fifty-second slave core cpe₅₂ (row 7, column 4);
d_7,5 denotes the position of the fifty-third slave core cpe₅₃ (row 7, column 5);
d_7,6 denotes the position of the fifty-fourth slave core cpe₅₄ (row 7, column 6);
d_7,7 denotes the position of the fifty-fifth slave core cpe₅₅ (row 7, column 7);
d_7,8 denotes the position of the fifty-sixth slave core cpe₅₆ (row 7, column 8);
d_8,1 denotes the position of the fifty-seventh slave core cpe₅₇ (row 8, column 1);
d_8,2 denotes the position of the fifty-eighth slave core cpe₅₈ (row 8, column 2);
d_8,3 denotes the position of the fifty-ninth slave core cpe₅₉ (row 8, column 3);
d_8,4 denotes the position of the sixtieth slave core cpe₆₀ (row 8, column 4);
d_8,5 denotes the position of the sixty-first slave core cpe₆₁ (row 8, column 5);
d_8,6 denotes the position of the sixty-second slave core cpe₆₂ (row 8, column 6);
d_8,7 denotes the position of the sixty-third slave core cpe₆₃ (row 8, column 7);
d_8,8 denotes the position of the sixty-fourth slave core cpe₆₄ (row 8, column 8).

In the present invention, for convenience of explanation, d_p,q denotes the position of any slave core cpe_A in the slave core set position matrix add^CPEs, where p is the row number and q is the column number; d_p,q is simply called the slave core position.
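As a non-limiting illustration, the ordering of the 64 slave cores in the 8 × 8 position matrix (smallest identification number first, filled row by row) can be sketched in Python as follows. The function names `core_position` and `core_id_at` are hypothetical and chosen only for this example; identification numbers, rows and columns are all counted from 1, as in the text.

```python
MESH_ROWS = 8   # rows of the slave core position matrix
MESH_COLS = 8   # columns of the slave core position matrix


def core_position(core_id):
    """Return (p, q), the 1-based row and column of slave core `core_id`."""
    if not 1 <= core_id <= MESH_ROWS * MESH_COLS:
        raise ValueError("slave core identification number must be 1..64")
    row, col = divmod(core_id - 1, MESH_COLS)   # row-major, zero-based
    return row + 1, col + 1


def core_id_at(p, q):
    """Return the identification number of the slave core at row p, column q."""
    if not (1 <= p <= MESH_ROWS and 1 <= q <= MESH_COLS):
        raise ValueError("row and column must be 1..8")
    return (p - 1) * MESH_COLS + q
```

For example, `core_position(31)` gives row 4, column 7, which agrees with d_4,7 being the position of the thirty-first slave core.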

In the present invention, the gauge field quantity GF has the form:

$$
GF =
\begin{bmatrix}
a_1 + i b_1 & a_2 + i b_2 & a_3 + i b_3 \\
a_4 + i b_4 & a_5 + i b_5 & a_6 + i b_6 \\
a_7 + i b_7 & a_8 + i b_8 & a_9 + i b_9
\end{bmatrix}
$$

where i is the imaginary unit, i² = -1;

a₁, a₂ and a₃ are the real parts, and b₁, b₂ and b₃ the imaginary parts, of the first, second and third complex numbers of the first row vector of the gauge field quantity;

a₄, a₅ and a₆ are the real parts, and b₄, b₅ and b₆ the imaginary parts, of the first, second and third complex numbers of the second row vector of the gauge field quantity;

a₇, a₈ and a₉ are the real parts, and b₇, b₈ and b₉ the imaginary parts, of the first, second and third complex numbers of the third row vector of the gauge field quantity.
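As a non-limiting illustration, the gauge field quantity GF, a 3 × 3 complex matrix whose row vectors are filled in order from the real parts a₁ to a₉ and the imaginary parts b₁ to b₉, can be assembled in Python as follows. The function name `gauge_matrix` and the use of nested lists are assumptions made for this sketch only.

```python
def gauge_matrix(a, b):
    """Build the 3 x 3 complex matrix GF from nine real and nine imaginary parts.

    a[0..8] hold a_1..a_9 and b[0..8] hold b_1..b_9; entry (r, c) of the
    result is a_{3r+c+1} + i * b_{3r+c+1}, so each row vector is filled in order.
    """
    if len(a) != 9 or len(b) != 9:
        raise ValueError("expected nine real parts and nine imaginary parts")
    return [[complex(a[3 * r + c], b[3 * r + c]) for c in range(3)]
            for r in range(3)]
```

For example, with a = 1, …, 9 and b = 10, …, 90, the third complex number of the second row vector is 6 + 60i.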

In the present invention, the fermion field quantity WIL has the form:

$$
WIL =
\begin{bmatrix}
\xi_1 + i \beta_1 & \gamma_1 + i \delta_1 & \mu_1 + i \nu_1 \\
\xi_2 + i \beta_2 & \gamma_2 + i \delta_2 & \mu_2 + i \nu_2 \\
\xi_3 + i \beta_3 & \gamma_3 + i \delta_3 & \mu_3 + i \nu_3 \\
\xi_4 + i \beta_4 & \gamma_4 + i \delta_4 & \mu_4 + i \nu_4
\end{bmatrix}
$$

where i is the imaginary unit, i² = -1;

ξ₁, ξ₂, ξ₃ and ξ₄ are the real parts, and β₁, β₂, β₃ and β₄ the imaginary parts, of the first, second, third and fourth complex numbers of the first column vector of the fermion field quantity;

γ₁, γ₂, γ₃ and γ₄ are the real parts, and δ₁, δ₂, δ₃ and δ₄ the imaginary parts, of the first, second, third and fourth complex numbers of the second column vector of the fermion field quantity;

μ₁, μ₂, μ₃ and μ₄ are the real parts, and ν₁, ν₂, ν₃ and ν₄ the imaginary parts, of the first, second, third and fourth complex numbers of the third column vector of the fermion field quantity.
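As a non-limiting illustration, the fermion field quantity WIL, which has three column vectors of four complex numbers each, can be assembled in Python as follows. The function name `fermion_matrix` and the nested-list representation (four rows of three entries) are assumptions made for this sketch only.

```python
def fermion_matrix(xi, beta, gamma, delta, mu, nu):
    """Build the 4 x 3 complex matrix WIL from its real and imaginary parts.

    Column 1 is xi + i*beta, column 2 is gamma + i*delta and column 3 is
    mu + i*nu; each argument holds the four values with subscripts 1..4.
    """
    columns = [(xi, beta), (gamma, delta), (mu, nu)]
    for real, imag in columns:
        if len(real) != 4 or len(imag) != 4:
            raise ValueError("each column needs four real and four imaginary parts")
    # Row k collects the k-th complex number of every column vector.
    return [[complex(real[k], imag[k]) for real, imag in columns]
            for k in range(4)]
```

Row k of the result holds the k-th complex number of each of the three column vectors, so `wil[3][1]` is γ₄ + iδ₄.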

In the present invention, every piece of data information has its own fermion field quantity and its own gauge field quantity, each labelled with the identification number of that piece of data information. This holds for any piece of data information S_g and, in the same way, for the first piece S₁, the second piece S₂ and the last piece S_G.

In the present invention, the fermion field quantities of different pieces of data information are not the same, and the gauge field quantities of different pieces of data information are not the same.

In the present invention, the slave core set position matrix add^CPEs is used to mark the specific position of each core in the slave core set CPEs = {cpe₁, cpe₂, …, cpe_A}, in order to improve the efficiency of matching the cores to the lattice point positions represented in four-dimensional space when the slave cores process data information in parallel.

Referring to the two-dimensional schematic diagram of the four-dimensional lattice points shown in FIG. 2, the position of any piece of data information S_g is expressed by its coordinates on the X, Y, Z and T axes. Because a four-dimensional coordinate system is difficult to draw, the invention represents it with the two-dimensional plane coordinates shown in FIGS. 2A to 2F. The arrows in FIG. 2 represent the direction selected for the gauge field quantity, and the dots represent coordinate points of the four-dimensional space. The two ends of each dimension are connected, and the six two-dimensional coordinate planes together provide the theoretical basis for parallelizing the calculation by data parallelism. In the present invention, the coordinate position of every piece of data information S_g is marked. This addresses two problems: the data information needed by each slave core is not stored contiguously in main memory, so the data for a single calculation requires multiple direct memory access (DMA) transfers, which makes the bandwidth utilization between each slave core and main memory very low; and calculating the data information of each lattice point requires the data information of its neighboring lattice points, so that this way of calculating transfers a large amount of redundant data information.
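As a non-limiting illustration of a lattice in which the two ends of each dimension are connected, the following Python sketch computes the neighbors of a lattice point with wrap-around in each of the X, Y, Z and T directions. The function names and the zero-based coordinate convention are assumptions made for this example; the lattice extents are passed in by the caller and are not taken from the invention.

```python
AXES = ("X", "Y", "Z", "T")   # axis k of a site tuple corresponds to AXES[k]


def neighbor(site, axis, step, extents):
    """Return the site reached from `site` by moving `step` along `axis`.

    Coordinates are zero-based; the modulo makes each dimension periodic,
    so stepping past the last point returns to the first and vice versa.
    """
    coords = list(site)
    coords[axis] = (coords[axis] + step) % extents[axis]
    return tuple(coords)


def all_neighbors(site, extents):
    """Return the eight nearest neighbors (forward and backward on each axis)."""
    return [neighbor(site, axis, step, extents)
            for axis in range(len(AXES))
            for step in (+1, -1)]
```

For a lattice with extents (4, 4, 8, 8), stepping backward along X from (0, 0, 0, 0) gives (3, 0, 0, 0), which is the kind of end-to-end connection that makes boundary points depend on data at the opposite end of the dimension.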

In the present invention, the position of each piece of data information in the four-dimensional space is recorded as a four-component coordinate made up of its values on the X axis, the Y axis, the Z axis and the T axis. This applies alike to the first piece of data information S₁, the second piece S₂, any piece S_g and the last piece S_G. The subscript g in S_g is the identification number of the data information, with 1 ≤ g ≤ G, and the subscript G in S_G is the total number of pieces of data information. As shown in FIGS. 2A to 2F, the X axis is perpendicular to the Y axis, and the Z axis is perpendicular to the Y axis. The time dimension of the data information S_g processed by the heterogeneous many-core processor in the four-dimensional space is taken as the time axis and denoted T.
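As a non-limiting illustration of how an identification number and a four-dimensional position can correspond to each other, the following Python sketch converts between a zero-based index g and coordinates (x, y, z, t). The ordering used here, with X varying fastest and T slowest, is an assumption made for this example and is not stated in the text; the function names are likewise hypothetical, and the index is zero-based whereas the text counts from 1.

```python
def site_index(x, y, z, t, extents):
    """Return the zero-based index of site (x, y, z, t); X varies fastest."""
    nx, ny, nz, nt = extents
    if not (0 <= x < nx and 0 <= y < ny and 0 <= z < nz and 0 <= t < nt):
        raise ValueError("coordinates outside the lattice")
    return ((t * nz + z) * ny + y) * nx + x


def site_coords(g, extents):
    """Inverse of site_index: return (x, y, z, t) for zero-based index g."""
    nx, ny, nz, nt = extents
    if not 0 <= g < nx * ny * nz * nt:
        raise ValueError("index outside the lattice")
    g, x = divmod(g, nx)
    g, y = divmod(g, ny)
    t, z = divmod(g, nz)
    return x, y, z, t
```

The two functions are inverses of each other, so every index from 0 to G - 1 corresponds to exactly one coordinate point.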

Referring to FIG. 3, the lattice quantum chromodynamics parallel acceleration method based on a heterogeneous many-core processor according to the invention comprises the following steps:

Step one, initializing the slave core matrix positions of the heterogeneous many-core processor;

Since the heterogeneous many-core processor contains multiple slave cores, the slave cores need to be assigned positions according to their slave core identification numbers, and the matrix position of each slave core is recorded.

The slave core set CPEs = {cpe₁, cpe₂, …, cpe_A} is arranged by position in an 8 × 8 matrix, which gives the slave core set position matrix add^CPEs of formula (1), in which any slave core position is denoted d_p,q.

The slave core set position matrix add^CPEs is ordered from the smallest to the largest core identification number. In the subsequent calculation, each slave core determines the data region it is responsible for from its own row number and column number.

Step two, the master core reads the fermion field quantity and the gauge field quantity;

Step 201, the master core MPE reads the data information and expresses all the data information read as the set SPM = {S₁, S₂, …, S_g, …, S_G};

Step 202, the master core stores the fermion field quantities in the data information of SPM = {S₁, S₂, …, S_g, …, S_G}, in reading order, into an 8 × 8 lattice point matrix DAA^MPE, and saves DAA^MPE to memory;

wherein the elements of DAA^MPE represent, respectively, the fermion field quantity of the first piece of data information S₁, of the second piece S₂, of any piece S_g and of the last piece S_G, each at the four-dimensional coordinate point of that piece of data information.

Step 203: the main core stores the gauge field quantities contained in the data of SPM = {S₁, S₂, …, S_g, …, S_G}, in reading order, into the 4 × 8 lattice point link matrix DBB^MPE, and saves DBB^MPE to memory;

In the present invention, the c-direction means any one axis selected from the X-axis, the Y-axis, the Z-axis, and the T-axis as a direction. The entries of DBB^MPE are, in reading order, the gauge field quantity of the first data item S₁, that of the second data item S₂, that of any data item S_g, and that of the last data item S_G, each taken at the four-dimensional coordinate point of that data item in the selected direction.

In the present invention, the four-dimensional data information is represented by fermion field quantities and gauge field quantities, which makes the tasks easier to partition and to compute in parallel.

Step 301: any slave core cpe_A, according to its slave core matrix position d_p,q from step one, reads the part of the DAA^MPE matrix that it is responsible for, along the Z-axis and T-axis directions, into its local memory space, which then holds all of the fermion field data read in by that slave core.

Step 302: any slave core cpe_A, according to its slave core matrix position d_p,q from step one, reads the part of the DBB^MPE matrix that it is responsible for, along the Z-axis and T-axis directions, into its local memory space, which then holds all of the gauge field data read in by that slave core.

In the present invention, any slave core cpe_A initiates a direct memory access (DMA) transfer to memory according to its slave core identification number, reads the lattice fermion field and gauge field values that it is responsible for computing, and stores them in its local storage. The four-dimensional sub-lattice is divided into 64 two-dimensional planes along the Z-axis and T-axis directions, so that each slave core is responsible for one plane. The computation inside each plane is relatively independent; only the boundary data require data from the adjacent planes, which wrap around end to end.
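The plane-per-core partition can be illustrated with a short sketch. It assumes, purely for illustration, that the sub-lattice has extent 8 in both Z and T, that slave core ids are 0-based, and that the core at matrix position (p, q) owns the plane with Z index p and T index q; the names `plane_of` and `boundary_neighbours` are hypothetical.

```python
GRID = 8  # assumed extent in Z and in T: 8 x 8 = 64 planes, one per slave core

def plane_of(cpe_id):
    """Return the assumed (z, t) index of the plane owned by a slave core."""
    return divmod(cpe_id, GRID)

def boundary_neighbours(cpe_id):
    """Ids of the slave cores owning the adjacent planes in -Z, +Z, -T, +T.

    The modulo makes the first and last planes adjacent (end-to-end wrap).
    """
    z, t = plane_of(cpe_id)
    return {
        "z-": ((z - 1) % GRID) * GRID + t,
        "z+": ((z + 1) % GRID) * GRID + t,
        "t-": z * GRID + (t - 1) % GRID,
        "t+": z * GRID + (t + 1) % GRID,
    }
```

Only these four neighbouring planes are needed when a core processes its boundary data; everything inside the plane is local.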

Step 401: any slave core cpe_A obtains, from the fermion field data it read into local memory, the lattice fermion field quantity corresponding to S_g; then step 403 is executed;

Step 402: any slave core cpe_A obtains, from the gauge field data it read into local memory, the gauge field quantity corresponding to S_g; then step 403 is executed;

Step 403: starting from the data S_g of any lattice point, acquire the data of its 8 adjacent lattice points in the x, y, z, and t dimensions, and then acquire the lattice fermion field quantities and gauge field quantities of those 8 adjacent lattice points; then step 404 is executed;

The data of the 8 adjacent lattice points are denoted S₁, S₂, S₃, S₄, S₅, S₆, S₇, and S₈, and their central lattice point is S_g. Each of the 8 adjacent lattice points has its own lattice fermion field quantity and its own gauge field quantity, and S_g has its own lattice fermion field quantity.

Step 404: perform matrix multiplication of the fermion field quantities and the gauge field quantities of the 8 adjacent lattice points; then step 405 is executed;

Step 405: use the matrix products of the 8 adjacent lattice points to update the lattice fermion field quantity of the central lattice point S_g; that is, the entry for S_g in the slave core's local data is replaced by the updated lattice fermion field quantity belonging to S_g. Then step five is executed;
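Steps 403 to 405 can be sketched as a nearest-neighbour update. The exact update formula is not recoverable from the text, so the code below assumes a common form: each of the 8 neighbours contributes a 3 × 3 link matrix times its field vector, with the conjugate transpose used for backward neighbours. It is deliberately simplified (3-component vectors only, no spin structure, no mass term), and all names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3 x 3 matrix by a 3-component vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def dagger(m):
    """Conjugate transpose of a 3 x 3 matrix."""
    return [[m[j][i].conjugate() for j in range(3)] for i in range(3)]

def shift(site, axis, step, dims):
    """Neighbouring site along one axis, with periodic wrap-around."""
    s = list(site)
    s[axis] = (s[axis] + step) % dims[axis]
    return tuple(s)

def update_site(site, psi, links, dims):
    """Sum the link-times-field products of the 8 neighbours of `site`.

    psi maps a site to its field vector; links maps a site to four
    3 x 3 matrices, one per axis (x, y, z, t).
    """
    new = [0j, 0j, 0j]
    for axis in range(4):
        fwd = shift(site, axis, +1, dims)
        bwd = shift(site, axis, -1, dims)
        terms = (
            mat_vec(links[site][axis], psi[fwd]),         # forward neighbour
            mat_vec(dagger(links[bwd][axis]), psi[bwd]),  # backward neighbour
        )
        for term in terms:
            new = [a + b for a, b in zip(new, term)]
    return new
```

With identity links and a constant field, every neighbour contributes the same vector, so the result is 8 times that vector, which is a quick sanity check for the neighbour gathering.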

In the present invention, step four processes a single lattice point, and every lattice point in the slave core must be processed by the same step four. For the purpose of describing the iteration, one pass of this operation over all lattice points is described specifically as step five.

Then step six is executed;

In the present invention, the number of iterations is denoted U, the maximum number of iterations is denoted U_max, with U_max = 1000, and the current iteration count is denoted U_current. If U_current < U_max, step four is executed; if U_current ≥ U_max, step seven is executed;

In the present invention, the residual of the lattice fermion field quantity is denoted R, and the residual threshold is denoted R_min, with R_min = 1.0 × 10^-12. If R > R_min, step four is executed; if R ≤ R_min, step seven is executed;
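The two stopping conditions can be combined into one loop, sketched below. The sketch assumes that iteration stops as soon as either condition is met, which is one reading of the text; `iterate`, `step`, and `residual` are hypothetical names, and the toy fixed-point problem in the usage lines stands in for the real lattice update.

```python
U_MAX = 1000     # maximum number of iterations stated in the text
R_MIN = 1.0e-12  # residual threshold stated in the text

def iterate(step, residual, state):
    """Repeat `step` until the residual is at most R_MIN or U_MAX is reached."""
    u_current = 0
    r = residual(state)
    while u_current < U_MAX and r > R_MIN:
        state = step(state)    # one pass over all lattice points (step five)
        u_current += 1
        r = residual(state)
    return state, u_current, r

# Toy usage: the fixed point of x -> x / 2 + 1 is x = 2.
solution, n_iter, final_r = iterate(
    lambda x: 0.5 * x + 1.0,
    lambda x: abs(x - 2.0),
    0.0,
)
```

In the toy run the residual threshold ends the loop long before the iteration cap; a process that never converges would instead stop after exactly U_MAX iterations.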

In the present invention, the updated lattice fermion field data held by each slave core are passed back to memory to update DAA^MPE, and the updated matrix is saved and written to a file.

Example 1

Software and hardware environment parameters of the Shenwei 26010 heterogeneous many-core processor are as follows:

TABLE 1 software and hardware Environment

CPU	Memory	Compiler
SW26010 1.45 GHz	32 GB for 4 CGs	sw5cc

The data used in Example 1 are lattice data, and a point source is used to solve for the quark propagator. The share of run time taken by each part of the program and the speedup obtained from parallelization are analyzed.

The data partitioning scheme of the invention improves the utilization of the bandwidth between the slave cores and main memory (the DMA path in the present invention) and reduces the transfer of redundant data. Compared with the serial computing method, data redundancy is reduced by a factor of 8 and the number of direct memory access transfers is reduced; as the experimental results in Table 2 show, DMA transfer performance improves by about 145 times.

TABLE 2 Direct Memory Access (DMA) transfer time comparison of different data partitioning methods

Method	Total DMA transfer time (MPE clock cycles)
Serial computing method	22328320
Improved computing method	153468
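The 145-fold figure can be checked directly from the two cycle counts in Table 2. The short calculation below is only a cross-check of the reported numbers; the variable names are illustrative.

```python
serial_cycles = 22_328_320   # Table 2, serial computing method
improved_cycles = 153_468    # Table 2, improved computing method

# Ratio of total DMA transfer time before and after the new partitioning.
dma_speedup = serial_cycles / improved_cycles
print(f"DMA transfer time ratio: {dma_speedup:.2f}")
```

The ratio falls between 145 and 146, consistent with the improvement stated in the text.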

TABLE 3 run time analysis

Table 3 shows that the MPE + CPEs parallel method is 16.4 times faster than computing on the MPE alone. Since the theoretical peak of one MPE is 23.2 GFLOPS and that of one CPE is 11.6 GFLOPS, the 64 CPEs give a theoretical maximum speedup of 64 × 11.6 / 23.2 = 32 times. By exploiting register communication among the slave cores, the method of the invention achieves a good parallel effect, reaching 51.3% of that theoretical maximum (16.4 / 32).

Table 3 also shows that the vectorized method is 3.9 times more efficient overall than the method without vectorization, so vectorizing the floating point operations greatly improves operating efficiency.

Table 3 further shows that, by combining the data partitioning and transfer, slave core cooperative computing, and vectorized computing of the invention, the speedup over the original serial program reaches 63.96 times.
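The speedup figures quoted above are mutually consistent, as the following arithmetic check suggests. It assumes 64 slave cores per core group (the 8 × 8 matrix of step one) and treats the overall speedup as the product of the parallel and vectorization speedups; the variable names are illustrative.

```python
N_CPES = 64       # assumed: 8 x 8 slave cores working with one MPE
MPE_PEAK = 23.2   # GFLOPS, theoretical peak of one MPE
CPE_PEAK = 11.6   # GFLOPS, theoretical peak of one CPE

theoretical_ratio = N_CPES * CPE_PEAK / MPE_PEAK   # upper bound on the speedup
parallel_speedup = 16.4                            # MPE + CPEs vs. MPE only
simd_speedup = 3.9                                 # vectorized vs. non-vectorized

fraction_of_peak = parallel_speedup / theoretical_ratio
combined_speedup = parallel_speedup * simd_speedup

print(f"theoretical ratio: {theoretical_ratio:.1f}")
print(f"fraction of theoretical maximum: {fraction_of_peak:.4f}")
print(f"combined speedup: {combined_speedup:.2f}")
```

The theoretical ratio comes out at 32, the measured parallel speedup is about 51% of it, and the product of the two measured speedups matches the reported 63.96.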

Referring to FIG. 4 and FIG. 5, whether the single main core serial algorithm, the MPE + CPEs parallel algorithm, or the MPE + CPEs + SIMD vectorized parallel algorithm is used, the run time of the method divides into two parts: the time for reading the file and transferring data, and the time for iterative computation. The figures show that the method is computation-intensive: iterative computation takes most of the program's run time, and its share grows further as the number of iterations increases. The data partitioning scheme designed specifically for the Shenwei 26010 heterogeneous many-core processor greatly reduces the share of the whole program's run time spent on data transfer between the slave cores and main memory.

Claims

1. A lattice point quantum color dynamics parallel acceleration method based on a heterogeneous many-core processor is characterized by comprising the following steps:

since a plurality of slave cores exist in the heterogeneous many-core processor, the slave cores are partitioned by position according to their slave core identification numbers, and the matrix position of each slave core is recorded;

the slave core set CPEs = {cpe₁, cpe₂, …, cpe_A} is arranged by position into an 8 × 8 matrix to obtain the slave core set position matrix add^CPEs, in which any slave core position is denoted d_p,q;

wherein each data item, from the first data item S₁ and the second data item S₂ through any data item S_g to the last data item S_G, has a fermion field quantity at its four-dimensional coordinate point;

step 203, the main core stores the gauge field quantities contained in the data of SPM = {S₁, S₂, …, S_g, …, S_G}, in reading order, into the 4 × 8 lattice point link matrix DBB^MPE, and saves DBB^MPE to memory;

wherein each data item, from the first data item S₁ and the second data item S₂ through any data item S_g to the last data item S_G, has a gauge field quantity at its four-dimensional coordinate point in the selected direction;

all of the fermion field data read in by the slave core, and all of the gauge field data read in by the slave core, are each written in set form;

step 401, the lattice fermion field quantity of any data item S_g is recorded; then step 403 is executed;

step 402, the gauge field quantity of any data item S_g is recorded; then step 403 is executed;

wherein the lattice fermion field quantity of S_g is recorded, and each of the eight adjacent data items, from the first data item S₁ through the second, third, fourth, fifth, sixth, and seventh to the eighth data item S₈, has its own recorded fermion field quantity and its own recorded gauge field quantity;

that is, the entry for the data item S_g in the slave core's local data is updated to the new lattice fermion field quantity belonging to S_g; then step five is executed;

Executing the step six;

the updated lattice fermion field data held by each slave core are passed back to memory to update DAA^MPE, and the updated matrix is saved and written to a file.

2. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the slave core set position matrix add^CPEs is ordered from small to large by core identification number.

3. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the transfer of redundant data over the memory bandwidth is reduced, and DMA transfer performance is improved by 145 times.

4. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the parallel acceleration reduces run time and achieves a 63-fold performance improvement.

5. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the parallel acceleration processing method is suitable for parallel acceleration processing of the Shenwei 26010 heterogeneous many-core processor.