CN110516194B - Heterogeneous many-core processor-based grid point quantum color dynamics parallel acceleration method - Google Patents
- Publication number
- CN110516194B (application number CN201910750655.3A)
- Authority
- CN
- China
- Prior art keywords
- data information
- core
- field
- recorded
- slave
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8038—Associative processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Mathematical Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Computer Hardware Design (AREA)
- Mathematical Optimization (AREA)
- Computing Systems (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a lattice quantum chromodynamics (lattice QCD) parallel acceleration method based on a heterogeneous many-core processor, comprising the following steps: first, the slave cores are assigned positions according to their slave core identification numbers; second, the data information that has been read is stored according to its position in four-dimensional space; third, each slave core reads from memory the lattice point values it is responsible for calculating, according to its own position identifier; fourth, the lattice point values of each slave core are updated iteratively to obtain the updated lattice point values belonging to that slave core. This parallelization method for the Sunway (Shenwei) 26010 heterogeneous many-core processor makes full use of the register communication that is unique to the slave cores of the Sunway many-core processor, increasing data reuse and eliminating a large amount of redundant data. Compared with running on the master core alone, the method after parallel acceleration improves performance by a factor of 63.
Description
Technical Field
The invention relates to a parallel acceleration method for lattice quantum chromodynamics (lattice QCD), and in particular to a parallel acceleration method for lattice QCD on the Sunway (Shenwei) 26010 heterogeneous many-core processor.
Background
The Sunway TaihuLight supercomputer was developed by the National Research Center of Parallel Computer Engineering and Technology and is installed at the National Supercomputing Center in Wuxi. It is also China's first world-leading supercomputer built with domestically developed technology. Sunway TaihuLight contains 40,960 domestically developed Sunway (Shenwei) 26010 many-core processors, which use a 64-bit Sunway instruction set; the system has a peak floating-point performance of 125 PFLOPS (1.25 × 10^17 operations per second) and a sustained performance of 93 PFLOPS. The architecture of the Sunway 26010 heterogeneous many-core processor is shown in FIG. 1. Each processor chip comprises four core groups (CG), which are connected through a network on chip. Each core group mainly includes one management processing element (MPE, called the master core for short), one computing processing element cluster (CPE cluster, whose cores are called slave cores for short), and one memory controller (MC). The cores of the CPE cluster are connected by a communication network with an 8 × 8 mesh topology. The system interface (SI) connects the chip to the off-chip system and is implemented as a standard PCIe 3.0 interface.
At present, heterogeneous computer architectures offer strong parallelism and strong computing power. They greatly improve the parallel capability and scalability of a computing platform, and the growing number of heterogeneous architectures provides new computing and programming approaches for computationally intensive scientific computing. How to exploit the computing power of the Sunway 26010 heterogeneous many-core processor to parallelize scientific computing algorithms is one of the current research hotspots.
Disclosure of Invention
In the data segmentation of the Sunway 26010 heterogeneous many-core processor, data redundancy during transmission makes the bandwidth utilization between the slave cores (CPEs) and the master core (MPE) extremely low. To solve this problem, the invention provides a lattice QCD parallel acceleration method based on the heterogeneous many-core processor. The method matches positions by combining the position ordering of the slave cores with data segmentation, performs lattice point calculations in four-dimensional space using the fermion field quantities and the gauge field quantities, and, after multiple iterations, saves the updated fermion field quantity matrix in memory in the form of a file. The method is optimized for the data transmission and computation characteristics of the Sunway 26010 heterogeneous many-core processor and for the characteristics of the lattice QCD algorithm. It makes full use of the register communication that is unique to the slave cores (CPEs), increasing data reuse and eliminating a large amount of redundant data. By exploiting the single instruction, multiple data (SIMD) instructions supported by the Sunway 26010 heterogeneous many-core processor, computing performance is greatly improved.
The lattice QCD parallel acceleration method based on a heterogeneous many-core processor of the invention is characterized by comprising the following steps:
Step one, initializing the slave core matrix positions of the heterogeneous many-core processor;
Step two, the master core reads the fermion field quantities and the gauge field quantities;
Step three, each slave core reads data information based on its row number and column number, realizing data segmentation;
Step four, calculating the data information of any lattice point in any slave core;
Step five, each slave core performs the processing of step four in parallel on the data information of every lattice point in its local storage space, thereby obtaining the updated fermion field quantities of all lattice points; step six is then executed;
Step six, after the update is finished, the iteration count is increased by 1 and the residual of the lattice point fermion field quantities is calculated;
the iteration count is recorded as $U$, the maximum number of iterations as $U_{max}$ with $U_{max} = 1000$, and the current iteration count as $U_{current}$; if $U_{current} < U_{max}$, step four is executed; if $U_{current} \ge U_{max}$, step seven is executed;
the residual of the lattice point fermion field quantities is recorded as $R$, and the residual threshold as $R_{min}$ with $R_{min} = 1.0 \times 10^{-12}$; if $R > R_{min}$, step four is executed; if $R \le R_{min}$, step seven is executed;
Step seven, outputting the updated lattice point matrix to memory and saving it as a file;
the updated fermion field quantities are passed to memory to update $DAA_{MPE}$, and the updated matrix is written to a file and saved.
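The control flow of steps four to seven can be sketched as follows. This is a minimal Python sketch, not the patented implementation: it assumes the two tests are combined so that iteration continues only while both $U_{current} < U_{max}$ and $R > R_{min}$ hold, and `update_step` is a hypothetical stand-in for the per-lattice-point update of steps four and five.

```python
U_MAX = 1000      # maximum number of iterations (U_max in the text)
R_MIN = 1.0e-12   # residual threshold (R_min in the text)


def iterate(field, update_step, u_max=U_MAX, r_min=R_MIN):
    """Repeat update_step until the residual is small enough or u_max is reached.

    update_step is a hypothetical callable (an assumption of this sketch):
    it takes the current field and returns (new_field, residual).
    """
    u_current = 0
    residual = float("inf")
    while u_current < u_max and residual > r_min:
        field, residual = update_step(field)  # steps four and five
        u_current += 1                        # step six: count the iteration
    # Step seven would write `field` to a file at this point.
    return field, u_current, residual


# Toy stand-in for the real update: halve the value and use its magnitude as residual.
def halve(x):
    new_x = x / 2.0
    return new_x, abs(new_x)


result, iterations, final_residual = iterate(1.0, halve)
```

With the toy update the loop stops on the residual test; an update whose residual never falls below the threshold stops on the iteration cap instead.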
The parallel acceleration method can be applied to parallel acceleration on the Sunway 26010 heterogeneous many-core processor.
The lattice QCD parallel acceleration method based on a heterogeneous many-core processor has the following advantages:
First, the method determines the data information processed by each slave core using the slave core position matrix and data segmentation, which increases data reuse and eliminates a large amount of redundant data.
Second, after position matching the method performs lattice point calculations in four-dimensional space using the fermion field quantities and the gauge field quantities, which reduces bandwidth consumption and improves performance by a factor of 145.
Third, experiments show that, compared with the original serial computing method, the parallel-optimized method takes less time and achieves a 63-fold performance improvement.
Fourth, the result obtained by the parallel acceleration method is stored in memory as a single file, so that the Sunway 26010 heterogeneous many-core processor can conveniently reuse it.
Drawings
FIG. 1 is an architecture diagram of the Sunway 26010 heterogeneous many-core processor.
FIG. 2 is a two-dimensional schematic of the four-dimensional lattice.
FIG. 2A is a schematic diagram of the lattice points in the XY plane.
FIG. 2B is a schematic diagram of the lattice points in the XZ plane.
FIG. 2C is a schematic diagram of the lattice points in the XT plane.
FIG. 2D is a schematic diagram of the lattice points in the YZ plane.
FIG. 2E is a schematic diagram of the lattice points in the YT plane.
FIG. 2F is a schematic diagram of the lattice points in the ZT plane.
FIG. 3 is a flow chart of the lattice QCD parallel acceleration method based on a heterogeneous many-core processor.
FIG. 4 shows the share of running time taken by each part of the program over 10 iterations of the method of the invention.
FIG. 5 shows the share of running time taken by each part of the program over 100 iterations of the method of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Quantum chromodynamics (QCD) is the fundamental theory describing the strong interaction. Lattice quantum chromodynamics (lattice QCD) is a method for studying QCD non-perturbatively from first principles. Lattice QCD is formulated in terms of the basic degrees of freedom of QCD, namely the quark field, the antiquark field, and the gluon field. These fields are defined on a set of discrete lattice points in four-dimensional Euclidean space.
Each chip of the Sunway heterogeneous many-core processor contains four core groups (CG), and each core group contains one operation control core (MPE) and one operation core array (CPE). For convenience of explanation, the operation control core is denoted MPE and called the master core for short; the operation core array is denoted CPEs, and its cores are called slave cores for short. Since the core array contains multiple cores, the slave core set is denoted $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$, where $cpe_1$ denotes the first slave core, $cpe_2$ the second slave core, and $cpe_A$ the last slave core; for ease of explanation, $cpe_A$ also denotes any one slave core, and $A$ denotes the slave core identification number. For the Sunway 26010 heterogeneous many-core processor, $A = 64$.
In the invention, any piece of data information processed by the Sunway 26010 heterogeneous many-core processor is denoted $S_g$. The pieces of data information form the data information set $SPM = \{S_1, S_2, \ldots, S_g, \ldots, S_G\}$, where $S_1$ denotes the first piece of data information, $S_2$ the second, $S_g$ the $g$-th (for convenience of explanation, $S_g$ also denotes any piece of data information, $1 \le g \le G$), $S_G$ the last, and $G$ the total number of pieces of data information.
In the present invention, the slave cores of the set $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$ are arranged by position in an 8 × 8 matrix, in ascending order of slave core identification number, which gives the slave core set position matrix $add_{CPEs}$:

$$
add_{CPEs} =
\begin{pmatrix}
d_{1,1} & d_{1,2} & d_{1,3} & d_{1,4} & d_{1,5} & d_{1,6} & d_{1,7} & d_{1,8} \\
d_{2,1} & d_{2,2} & d_{2,3} & d_{2,4} & d_{2,5} & d_{2,6} & d_{2,7} & d_{2,8} \\
d_{3,1} & d_{3,2} & d_{3,3} & d_{3,4} & d_{3,5} & d_{3,6} & d_{3,7} & d_{3,8} \\
d_{4,1} & d_{4,2} & d_{4,3} & d_{4,4} & d_{4,5} & d_{4,6} & d_{4,7} & d_{4,8} \\
d_{5,1} & d_{5,2} & d_{5,3} & d_{5,4} & d_{5,5} & d_{5,6} & d_{5,7} & d_{5,8} \\
d_{6,1} & d_{6,2} & d_{6,3} & d_{6,4} & d_{6,5} & d_{6,6} & d_{6,7} & d_{6,8} \\
d_{7,1} & d_{7,2} & d_{7,3} & d_{7,4} & d_{7,5} & d_{7,6} & d_{7,7} & d_{7,8} \\
d_{8,1} & d_{8,2} & d_{8,3} & d_{8,4} & d_{8,5} & d_{8,6} & d_{8,7} & d_{8,8}
\end{pmatrix}
$$

The matrix is filled row by row: the entry in row $p$, column $q$ is the position of slave core $cpe_{8(p-1)+q}$ in the slave core set position matrix. Accordingly:
$d_{1,1}$ to $d_{1,8}$ indicate the positions of the first to eighth slave cores $cpe_1$ to $cpe_8$ in the first row, columns one to eight;
$d_{2,1}$ to $d_{2,8}$ indicate the positions of the ninth to sixteenth slave cores $cpe_9$ to $cpe_{16}$ in the second row, columns one to eight;
$d_{3,1}$ to $d_{3,8}$ indicate the positions of the seventeenth to twenty-fourth slave cores $cpe_{17}$ to $cpe_{24}$ in the third row, columns one to eight;
$d_{4,1}$ to $d_{4,8}$ indicate the positions of the twenty-fifth to thirty-second slave cores $cpe_{25}$ to $cpe_{32}$ in the fourth row, columns one to eight;
$d_{5,1}$ to $d_{5,8}$ indicate the positions of the thirty-third to fortieth slave cores $cpe_{33}$ to $cpe_{40}$ in the fifth row, columns one to eight;
$d_{6,1}$ to $d_{6,8}$ indicate the positions of the forty-first to forty-eighth slave cores $cpe_{41}$ to $cpe_{48}$ in the sixth row, columns one to eight;
$d_{7,1}$ to $d_{7,8}$ indicate the positions of the forty-ninth to fifty-sixth slave cores $cpe_{49}$ to $cpe_{56}$ in the seventh row, columns one to eight;
$d_{8,1}$ to $d_{8,8}$ indicate the positions of the fifty-seventh to sixty-fourth slave cores $cpe_{57}$ to $cpe_{64}$ in the eighth row, columns one to eight.
In the present invention, for convenience of explanation, $d_{p,q}$ denotes the position of any slave core $cpe_A$ in the slave core set position matrix $add_{CPEs}$, where $p$ is the row number and $q$ is the column number; $d_{p,q}$ is called the slave core position for short.
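The correspondence between a slave core identification number and its position $d_{p,q}$ can be illustrated with a short Python sketch. It assumes the 1-based numbering used in the text ($cpe_1$ to $cpe_{64}$, rows and columns 1 to 8); the function names are hypothetical and chosen only for this example.

```python
MESH = 8  # the slave cores are arranged as an 8 x 8 matrix


def core_position(core_id):
    """Map a 1-based slave core identification number to (p, q), both 1-based."""
    if not 1 <= core_id <= MESH * MESH:
        raise ValueError("core_id must be between 1 and 64")
    p = (core_id - 1) // MESH + 1  # row number
    q = (core_id - 1) % MESH + 1   # column number
    return p, q


def core_id_at(p, q):
    """Inverse mapping: the identification number of the core at row p, column q."""
    if not (1 <= p <= MESH and 1 <= q <= MESH):
        raise ValueError("p and q must be between 1 and 8")
    return (p - 1) * MESH + q
```

For example, `core_position(20)` returns `(3, 4)`, matching the text, where $d_{3,4}$ is the position of the twentieth slave core.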
In the present invention, the gauge field quantity GF has the form:

$$
GF =
\begin{pmatrix}
a_1 + i b_1 & a_2 + i b_2 & a_3 + i b_3 \\
a_4 + i b_4 & a_5 + i b_5 & a_6 + i b_6 \\
a_7 + i b_7 & a_8 + i b_8 & a_9 + i b_9
\end{pmatrix}
$$

where $i$ is the imaginary unit, $i^2 = -1$;
$a_1$, $a_2$, $a_3$ are the real parts of the first, second, and third complex numbers of the first row vector of the gauge field quantity;
$a_4$, $a_5$, $a_6$ are the real parts of the first, second, and third complex numbers of the second row vector of the gauge field quantity;
$a_7$, $a_8$, $a_9$ are the real parts of the first, second, and third complex numbers of the third row vector of the gauge field quantity;
$b_1$, $b_2$, $b_3$ are the imaginary parts of the first, second, and third complex numbers of the first row vector of the gauge field quantity;
$b_4$, $b_5$, $b_6$ are the imaginary parts of the first, second, and third complex numbers of the second row vector of the gauge field quantity;
$b_7$, $b_8$, $b_9$ are the imaginary parts of the first, second, and third complex numbers of the third row vector of the gauge field quantity.
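As an illustration of this layout, the following Python sketch assembles a 3 × 3 complex matrix from the eighteen real numbers $a_1$ to $a_9$ and $b_1$ to $b_9$, numbered row by row as described above. The function name and the use of nested lists are assumptions made for the example; the original does not specify a storage format.

```python
def build_gauge_matrix(a, b):
    """Assemble the 3 x 3 complex matrix GF from real parts a and imaginary parts b.

    a[0..8] hold a_1..a_9 and b[0..8] hold b_1..b_9, numbered row by row.
    """
    if len(a) != 9 or len(b) != 9:
        raise ValueError("expected nine real parts and nine imaginary parts")
    return [
        [complex(a[3 * row + col], b[3 * row + col]) for col in range(3)]
        for row in range(3)
    ]


# Example values chosen only for illustration.
real_parts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
imag_parts = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
gf = build_gauge_matrix(real_parts, imag_parts)
```

Here `gf[1][2]` is the third complex number of the second row vector, that is $a_6 + i b_6$.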
In the present invention, the fermion field quantity WIL has the form:

$$
WIL =
\begin{pmatrix}
\xi_1 + i \beta_1 & \gamma_1 + i \delta_1 & \mu_1 + i \nu_1 \\
\xi_2 + i \beta_2 & \gamma_2 + i \delta_2 & \mu_2 + i \nu_2 \\
\xi_3 + i \beta_3 & \gamma_3 + i \delta_3 & \mu_3 + i \nu_3 \\
\xi_4 + i \beta_4 & \gamma_4 + i \delta_4 & \mu_4 + i \nu_4
\end{pmatrix}
$$

where $i$ is the imaginary unit, $i^2 = -1$;
$\xi_1$, $\xi_2$, $\xi_3$, $\xi_4$ are the real parts of the first, second, third, and fourth complex numbers of the first column vector of the fermion field quantity;
$\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ are the imaginary parts of the first, second, third, and fourth complex numbers of the first column vector of the fermion field quantity;
$\gamma_1$, $\gamma_2$, $\gamma_3$, $\gamma_4$ are the real parts of the first, second, third, and fourth complex numbers of the second column vector of the fermion field quantity;
$\delta_1$, $\delta_2$, $\delta_3$, $\delta_4$ are the imaginary parts of the first, second, third, and fourth complex numbers of the second column vector of the fermion field quantity;
$\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$ are the real parts of the first, second, third, and fourth complex numbers of the third column vector of the fermion field quantity;
$\nu_1$, $\nu_2$, $\nu_3$, $\nu_4$ are the imaginary parts of the first, second, third, and fourth complex numbers of the third column vector of the fermion field quantity.
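The column-vector description above can be sketched in Python as follows. The sketch assumes that the three column vectors, each holding four complex numbers, are placed side by side to form a matrix with four rows and three columns; the function name and the nested-list storage are hypothetical.

```python
def build_fermion_matrix(xi, beta, gamma, delta, mu, nu):
    """Assemble WIL as 4 rows x 3 columns of complex numbers.

    (xi, beta), (gamma, delta) and (mu, nu) are the real and imaginary parts
    of the first, second and third column vectors; each list has four entries.
    """
    parts = (xi, beta, gamma, delta, mu, nu)
    if any(len(p) != 4 for p in parts):
        raise ValueError("each component list must have four entries")
    columns = [
        [complex(re, im) for re, im in zip(xi, beta)],
        [complex(re, im) for re, im in zip(gamma, delta)],
        [complex(re, im) for re, im in zip(mu, nu)],
    ]
    # Transpose the three columns into four rows.
    return [[columns[col][row] for col in range(3)] for row in range(4)]


# Example values chosen only for illustration.
wil = build_fermion_matrix(
    [1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4],
    [5.0, 6.0, 7.0, 8.0], [0.5, 0.6, 0.7, 0.8],
    [9.0, 10.0, 11.0, 12.0], [0.9, 1.0, 1.1, 1.2],
)
```

In this layout `wil[3][1]` is the fourth complex number of the second column vector, $\gamma_4 + i \delta_4$.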
In the same way, the fermion field quantity of the last data information $S_G$ can be obtained.
In the same way, the last data information S of the present invention can be obtainedGIs recorded as
In the present invention, in the case of the present invention,andare not the same. Andare not the same.
In the present invention, the slave core set position matrix $add_{CPEs}$ marks the specific position of each core in the slave core set $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$; this makes it more efficient to match the slave cores to the lattice point positions in four-dimensional space when the slave cores process data information in parallel.
Refer to the two-dimensional schematic of the four-dimensional lattice shown in FIG. 2. Because the four dimensions are the X, Y, Z, and T axes and a four-dimensional coordinate is difficult to draw, the invention represents the position of any data information $S_g$ with the six two-dimensional coordinate planes shown in FIGS. 2A to 2F. In FIG. 2, the arrows represent the directions in which the gauge field quantity is selected, and the dots represent coordinate points of the four-dimensional space. The two ends of each dimension are connected, which provides the theoretical basis for parallelizing by data parallelism. In the present invention, the coordinate position of every data information $S_g$ is marked. This addresses two problems. First, the data information handled by each slave core is not stored contiguously in main memory, so the data needed for a single calculation requires multiple direct memory access (DMA) transfers, which makes the bandwidth utilization between each slave core and main memory very low. Second, calculating the data information of each lattice point uses the data information of its neighboring lattice points, so this calculation pattern transmits a large amount of redundant data information.
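The statement that the two ends of each dimension are connected corresponds to periodic wrap-around when looking up neighboring lattice points. A minimal Python sketch, assuming 0-based coordinates in (X, Y, Z, T) order and lattice extents that are hypothetical values picked for the example:

```python
def neighbours(site, extents):
    """Return the 8 nearest neighbours of a 4D lattice point, wrapping at the edges.

    site    -- (x, y, z, t), 0-based coordinates (an assumption of this sketch)
    extents -- (Lx, Ly, Lz, Lt), the number of lattice points along each axis
    """
    result = []
    for axis in range(4):
        for step in (+1, -1):
            coords = list(site)
            # The modulo connects the two ends of the axis.
            coords[axis] = (coords[axis] + step) % extents[axis]
            result.append(tuple(coords))
    return result


corner_neighbours = neighbours((0, 0, 0, 0), (4, 4, 4, 8))
```

A point on the boundary, such as the origin here, has neighbors on the opposite face of the lattice, for instance `(3, 0, 0, 0)` along X and `(0, 0, 0, 7)` along T. Neighboring points owned by different slave cores are what make redundant transfers likely when each core fetches its own copy.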
In the present invention, the position of the first data information $S_1$ in four-dimensional space is recorded by the values of $S_1$ on the X axis, the Y axis, the Z axis, and the T axis.
In the present invention, the position of the second data information $S_2$ in four-dimensional space is recorded by the values of $S_2$ on the X axis, the Y axis, the Z axis, and the T axis.
In the present invention, the position of any data information $S_g$ in four-dimensional space is recorded by the values of $S_g$ on the X axis, the Y axis, the Z axis, and the T axis. The subscript $g$ of $S_g$ is the identification number of the data information, $1 \le g \le G$. As shown in FIGS. 2A to 2F, the X axis is perpendicular to the Y axis, and the Z axis is perpendicular to the Y axis. The time coordinate of the data information $S_g$ processed by the heterogeneous many-core processor in the four-dimensional space is represented on the time axis, denoted T.
For convenience of explanation, any one of the data information SgThe position in four-dimensional space is noted as:
In the present invention, the position of the last data information $S_G$ in four-dimensional space is recorded by the values of $S_G$ on the X axis, the Y axis, the Z axis, and the T axis. The subscript $G$ of $S_G$ is the total number of pieces of data information.
Referring to FIG. 3, the lattice QCD parallel acceleration method based on a heterogeneous many-core processor of the invention comprises the following steps:
Step one, initializing the slave core matrix positions of the heterogeneous many-core processor;
Since the heterogeneous many-core processor contains a plurality of slave cores, the slave cores need to be assigned positions according to their slave core identification numbers, and the matrix position of each slave core is recorded.
The slave core set $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$ is arranged by position in an 8 × 8 matrix to obtain the slave core set position matrix $add_{CPEs}$ of formula (1), in which any slave core position is denoted $d_{p,q}$:

$$
add_{CPEs} =
\begin{pmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,8} \\
d_{2,1} & d_{2,2} & \cdots & d_{2,8} \\
\vdots & \vdots & \ddots & \vdots \\
d_{8,1} & d_{8,2} & \cdots & d_{8,8}
\end{pmatrix}
\tag{1}
$$

The slave core set position matrix $add_{CPEs}$ is ordered in ascending order of slave core identification number. In the subsequent calculation, each slave core determines the data area it is responsible for according to its own row number and column number.
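How a slave core might derive its data area from its row and column numbers can be sketched as below. The original does not spell out the partitioning rule at this point, so the sketch makes a loud assumption: the data is viewed as a two-dimensional array whose row and column counts are both divisible by 8, and the core at $d_{p,q}$ owns the tile in tile-row $p$ and tile-column $q$. The function name and array extents are hypothetical.

```python
MESH = 8  # 8 x 8 slave core position matrix


def data_region(p, q, n_rows, n_cols):
    """Return (row_range, col_range) of the tile owned by the slave core at d_{p,q}.

    Assumption of this sketch: an n_rows x n_cols array is cut into 8 x 8 equal
    tiles, and p, q are the 1-based row and column numbers of the slave core.
    """
    if not (1 <= p <= MESH and 1 <= q <= MESH):
        raise ValueError("p and q must be between 1 and 8")
    if n_rows % MESH or n_cols % MESH:
        raise ValueError("this sketch assumes extents divisible by 8")
    tile_h, tile_w = n_rows // MESH, n_cols // MESH
    rows = range((p - 1) * tile_h, p * tile_h)
    cols = range((q - 1) * tile_w, q * tile_w)
    return rows, cols


first_tile = data_region(1, 1, 16, 32)
last_tile = data_region(8, 8, 16, 32)
```

Under this assumption the tiles of the 64 cores are disjoint and together cover the whole array, so each element is fetched by exactly one slave core.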
Step two, the master core reads the fermion field quantities and the gauge field quantities;
Step 201, the master core MPE reads the data information and expresses all of the data information read as the set $SPM = \{S_1, S_2, \dots, S_g, \dots, S_G\}$;
Step 202, the master core stores the fermion field quantities of the data information in $SPM = \{S_1, S_2, \dots, S_g, \dots, S_G\}$, in reading order, into an 8 × 8 lattice point matrix $DAA_{MPE}$, and saves $DAA_{MPE}$ to memory;
wherein the entries of $DAA_{MPE}$ are:
the fermion field quantity of the first data item $S_1$ at its four-dimensional coordinate point;
the fermion field quantity of the second data item $S_2$ at its four-dimensional coordinate point;
the fermion field quantity of any data item $S_g$ at its four-dimensional coordinate point;
the fermion field quantity of the last data item $S_G$ at its four-dimensional coordinate point.
Step 203, the master core stores the gauge field quantities of the data information in $SPM = \{S_1, S_2, \dots, S_g, \dots, S_G\}$, in reading order, into a 4 × 8 lattice point link matrix $DBB_{MPE}$, and saves $DBB_{MPE}$ to memory;
In the present invention, the direction $c$ means any one axis selected from the X axis, the Y axis, the Z axis, and the T axis. For $S_1$, the direction symbol denotes the direction selected at its four-dimensional coordinate point; in the same way, the direction symbols for $S_2$, $S_g$, and $S_G$ all denote the selected direction.
The entries of $DBB_{MPE}$ are:
the gauge field quantity of the first data item $S_1$ at its four-dimensional coordinate point in the selected direction;
the gauge field quantity of the second data item $S_2$ at its four-dimensional coordinate point in the selected direction;
the gauge field quantity of any data item $S_g$ at its four-dimensional coordinate point in the selected direction;
the gauge field quantity of the last data item $S_G$ at its four-dimensional coordinate point in the selected direction.
In the present invention, representing the four-dimensional data information by fermion field quantities and gauge field quantities facilitates task division and parallel computation.
Step three, each slave core reads data information based on its row number and column number, thereby partitioning the data;
Step 301, any slave core $cpe_A$, according to its slave core matrix position $d_{p,q}$ from step one, reads the part of the $DAA_{MPE}$ matrix for which $cpe_A$ is responsible, taken along the Z-axis and T-axis directions, into its local memory space. The fermion field data read in by all slave cores is the collection of these per-core parts.
Step 302, any slave core $cpe_A$, according to its slave core matrix position $d_{p,q}$ from step one, reads the part of the $DBB_{MPE}$ matrix for which $cpe_A$ is responsible, taken along the Z-axis and T-axis directions, into its local memory space. The gauge field data read in by all slave cores is the collection of these per-core parts.
In the present invention, any slave core $cpe_A$ initiates a direct memory access (DMA) transfer to memory according to its slave core identification number, reads the lattice point fermion field quantities and gauge field quantities that it is to compute, and stores them in its local store. The four-dimensional sub-lattice is divided into 64 two-dimensional planes along the Z-axis and T-axis directions, so that each slave core is responsible for one plane. Computation within each plane is relatively independent; data from the adjacent planes, which are connected end to end, is needed only when boundary data is computed.
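The partitioning of step three can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes the slave core in row `p`, column `q` owns the X-Y plane at `z = p`, `t = q`, that the X and Y extents are 8 (the `nx` and `ny` defaults are hypothetical), and that "connected end to end" means periodic wrap-around. The function names are invented for this sketch.

```python
ROWS, COLS = 8, 8  # slave core matrix: 64 cores, one plane each


def plane_sites(p, q, nx=8, ny=8):
    """List the (x, y, z, t) sites of the plane owned by the core at d_{p,q}.

    Assumption: the core in row p, column q owns the X-Y plane at z = p, t = q.
    """
    return [(x, y, p, q) for y in range(ny) for x in range(nx)]


def neighbour_planes(p, q):
    """Return the (row, column) owners of the four adjacent planes.

    Planes are treated as connected end to end, so indices wrap around.
    """
    return [
        ((p + 1) % ROWS, q),
        ((p - 1) % ROWS, q),
        (p, (q + 1) % COLS),
        (p, (q - 1) % COLS),
    ]
```

With this layout, neighbours in the X and Y directions stay inside a core's own plane, while neighbours in the Z and T directions belong to the four planes returned by `neighbour_planes`.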
Step four, computing the data information of any lattice point in any slave core;
Step 401, any slave core $cpe_A$ obtains, from its local fermion field data, the lattice point fermion field quantity corresponding to $S_g$; step 403 is executed;
Step 402, any slave core $cpe_A$ obtains, from its local gauge field data, the gauge field quantity corresponding to $S_g$; step 403 is executed;
Step 403, for the data information $S_g$ of any lattice point, the data information of the 8 adjacent lattice points in the x, y, z, and t dimensions is acquired, and then the lattice point fermion field quantities and gauge field quantities of those 8 adjacent lattice points are acquired; step 404 is executed;
The data information of the 8 adjacent lattice points is denoted $S_1$, $S_2$, $S_3$, $S_4$, $S_5$, $S_6$, $S_7$, and $S_8$, and their central lattice point is $S_g$. Each of the 8 adjacent lattice points has its own lattice point fermion field quantity, as does $S_g$.
Step 404, matrix multiplication is performed on the fermion field quantities and gauge field quantities of the 8 adjacent lattice points; step 405 is executed;
Step 405, the matrix products of the 8 adjacent lattice points are used to update the central lattice point $S_g$. The updated value becomes the lattice point fermion field quantity of $S_g$; that is, the entry for $S_g$ in the slave core's local fermion field data is replaced by the updated value; step five is executed;
In the present invention, step four processes a single lattice point, and every lattice point in a slave core undergoes the same step-four processing. For the purpose of describing the iteration, one pass over all lattice points is described as step five.
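The single-site update of steps 403 to 405 can be sketched in plain Python. The sketch is deliberately simplified and rests on assumptions the text does not confirm: each fermion field quantity is a 3-component complex vector (spin structure and any coefficients are ignored), each gauge field quantity is a 3 × 3 complex matrix, backward neighbours use the conjugate transpose of the link stored at the neighbour, and the lattice is periodic. All function and variable names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3 x 3 matrix by a 3-component vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def dagger(m):
    """Conjugate transpose of a 3 x 3 matrix."""
    return [[m[j][i].conjugate() for j in range(3)] for i in range(3)]


def shift(site, axis, step, dims):
    """Move one step along an axis with periodic wrap-around."""
    s = list(site)
    s[axis] = (s[axis] + step) % dims[axis]
    return tuple(s)


def update_site(site, psi, links, dims):
    """Sum link-times-fermion products over the 8 neighbours of one site.

    psi:   dict mapping site -> 3-component vector
    links: dict mapping (site, axis) -> 3 x 3 matrix
    """
    acc = [0j, 0j, 0j]
    for axis in range(4):  # x, y, z, t
        fwd = shift(site, axis, +1, dims)
        bwd = shift(site, axis, -1, dims)
        term_f = mat_vec(links[(site, axis)], psi[fwd])
        term_b = mat_vec(dagger(links[(bwd, axis)]), psi[bwd])
        acc = [a + f + b for a, f, b in zip(acc, term_f, term_b)]
    return acc
```

Step five then amounts to calling `update_site` for every site a slave core owns and writing the results into a fresh copy of the field, so that all sites are updated from the same previous iterate.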
Step five, each slave core applies the processing of step four in parallel to the data information of every lattice point in its local storage space, thereby obtaining the updated fermion field quantities of all lattice points; step six is executed;
Step six, after the update is finished, the number of iterations is increased by 1, and the residual of the lattice point fermion field quantities is calculated;
In the present invention, the number of iterations is denoted $U$, the maximum number of iterations is denoted $U_{max}$, with $U_{max} = 1000$, and the current iteration number is denoted $U_{current}$. If $U_{current} < U_{max}$, step four is executed; if $U_{current} \ge U_{max}$, step seven is executed;
In the present invention, the residual of the lattice point fermion field quantities is denoted $R$, and the residual threshold is denoted $R_{min}$, with $R_{min} = 1.0 \times 10^{-12}$. If $R > R_{min}$, step four is executed; if $R \le R_{min}$, step seven is executed;
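The iteration control of step six can be sketched as a loop that stops on whichever condition is met first. The constants come from the text; the callbacks `step` and `residual` are hypothetical placeholders for the parallel update of step five and the residual calculation, whose exact form the text does not give.

```python
U_MAX = 1000      # maximum number of iterations
R_MIN = 1.0e-12   # residual threshold


def iterate(step, residual, u_max=U_MAX, r_min=R_MIN):
    """Repeat step() until the residual reaches r_min or u_max iterations.

    Returns the number of iterations performed and the final residual.
    """
    u = 0
    while True:
        step()            # one full update of all lattice points
        u += 1            # the iteration count is increased by 1
        r = residual()    # residual of the fermion field quantities
        if r <= r_min or u >= u_max:
            return u, r   # proceed to step seven
```

Reading the two conditions together, the loop continues only while both $U_{current} < U_{max}$ and $R > R_{min}$ hold, which is how this sketch interprets the two tests in the text.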
Step seven, outputting the updated lattice point matrix to memory and saving it as a file;
In the present invention, the updated fermion field data held by the slave cores is transferred to memory to update $DAA_{MPE}$, and the updated matrix is written to a file.
Example 1
The software and hardware environment parameters of the Shenwei 26010 (SW26010) heterogeneous many-core processor are as follows:
TABLE 1 Software and hardware environment
CPU | Memory | Compiler |
---|---|---|
SW26010 1.45 GHz | 32 GB for 4 CGs | sw5cc |
The data used in Example 1 is lattice data, and a point source is used to solve for the quark propagator. The share of run time taken by each part of the program and the speedup obtained from parallel computation are analyzed.
The data partitioning scheme of the invention improves the bandwidth utilization between the slave cores and main memory (measured here through DMA) and reduces the transfer of redundant data. Compared with the serial calculation method, data redundancy is reduced by a factor of 8 and the number of direct memory access transfers is reduced; as the experimental results in Table 2 show, performance improves by a factor of 145.
TABLE 2 Comparison of direct memory access (DMA) transfer time for different data partitioning methods
Method | Total DMA transfer time (MPE clock cycles) |
---|---|
Serial calculation method | 22328320 |
Improved calculation method | 153468 |
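The factor of 145 quoted for the DMA improvement can be checked directly from the two entries in Table 2. The short calculation below uses only those two figures; the variable names are chosen for illustration.

```python
serial_cycles = 22328320    # serial calculation method (Table 2)
improved_cycles = 153468    # improved calculation method (Table 2)

# Ratio of total DMA transfer time before and after the improvement.
ratio = serial_cycles / improved_cycles
print(f"DMA transfer time reduced by a factor of about {ratio:.1f}")
```

The ratio comes out slightly above 145, consistent with the improvement stated in the text.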
TABLE 3 Run time analysis
From Table 3 it can be seen that the method run with MPE + CPEs parallelization achieves a speedup of 16.4 times over computation on the MPE alone. Since the theoretical peak of one MPE is 23.2 GFLOPS and that of one CPE is 11.6 GFLOPS, the theoretical maximum speedup with 64 CPEs is 32 times. By exploiting register communication among the many slave cores, the method obtains a good parallel effect, with parallel efficiency reaching 51.3% of this theoretical maximum.
It can also be seen from Table 3 that the overall efficiency of the method after vectorization is 3.9 times that of the method without vectorization, which shows that vectorizing the floating point operations greatly improves efficiency.
With the parallelization method of the invention (data partitioning and transfer, slave core cooperative computing, and vectorized computing), Table 3 shows a speedup of up to 63.96 times over the original serial program.
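The speedup figures in this passage are mutually consistent, as the following arithmetic sketch shows. It assumes 64 slave cores (the 8 × 8 slave core matrix of step one) and reads the 51.3% figure as measured speedup divided by the theoretical maximum; both readings are interpretations rather than statements from Table 3.

```python
MPE_PEAK_GFLOPS = 23.2   # theoretical peak of one MPE
CPE_PEAK_GFLOPS = 11.6   # theoretical peak of one CPE
NUM_CPES = 64            # assumption: 8 x 8 slave core matrix

measured_parallel_speedup = 16.4   # MPE + CPEs versus MPE only
vectorization_speedup = 3.9        # with SIMD versus without

# Theoretical maximum speedup of 64 CPEs relative to one MPE.
theoretical_speedup = NUM_CPES * CPE_PEAK_GFLOPS / MPE_PEAK_GFLOPS

# Fraction of the theoretical maximum actually achieved.
parallel_efficiency = measured_parallel_speedup / theoretical_speedup

# Combined speedup of parallelization and vectorization.
combined_speedup = measured_parallel_speedup * vectorization_speedup

print(f"theoretical maximum: {theoretical_speedup:.1f}x")
print(f"parallel efficiency: {parallel_efficiency:.2%}")
print(f"combined speedup:    {combined_speedup:.2f}x")
```

The combined value is the product of the two measured factors, which matches the 63.96 times reported for the fully optimized program.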
Referring to Figs. 4 and 5, whether the method is run as a single-master-core serial algorithm, an MPE + CPEs parallel algorithm, or an MPE + CPEs + SIMD vectorized parallel algorithm, its run time is divided into two parts: file reading and data transfer, and iterative computation. The figures show that the method is compute-intensive: the iterative computation occupies most of the program run time, and its share grows further as the number of iterations increases. The data partitioning scheme designed specifically for the Shenwei 26010 heterogeneous many-core processor greatly reduces the share of the whole program spent on data transfer between the slave cores and main memory.
Claims (5)
1. A grid point quantum color dynamics parallel acceleration method based on a heterogeneous many-core processor, characterized by comprising the following steps:
step one, initializing the slave core matrix positions of the heterogeneous many-core processor;
since the heterogeneous many-core processor contains a plurality of slave cores, the slave cores are divided according to their identification numbers, and the matrix position of each slave core is recorded;
the slave core set $CPEs = \{cpe_1, cpe_2, \dots, cpe_A\}$ is arranged by position into an 8 × 8 matrix to obtain the slave core set position matrix $add_{CPEs}$, in which the position of any slave core is denoted $d_{p,q}$;
step two, the master core reads the fermion field quantities and the gauge field quantities;
step 201, the master core MPE reads the data information and expresses all of the data information read as the set $SPM = \{S_1, S_2, \dots, S_g, \dots, S_G\}$;
step 202, the master core stores the fermion field quantities of the data information in $SPM = \{S_1, S_2, \dots, S_g, \dots, S_G\}$, in reading order, into an 8 × 8 lattice point matrix $DAA_{MPE}$, and saves $DAA_{MPE}$ to memory, wherein the entries of $DAA_{MPE}$ are:
the fermion field quantity of the first data item $S_1$ at its four-dimensional coordinate point;
the fermion field quantity of the second data item $S_2$ at its four-dimensional coordinate point;
the fermion field quantity of any data item $S_g$ at its four-dimensional coordinate point;
the fermion field quantity of the last data item $S_G$ at its four-dimensional coordinate point;
step 203, the master core stores the gauge field quantities of the data information in $SPM = \{S_1, S_2, \dots, S_g, \dots, S_G\}$, in reading order, into a 4 × 8 lattice point link matrix $DBB_{MPE}$, and saves $DBB_{MPE}$ to memory, wherein the entries of $DBB_{MPE}$ are:
the gauge field quantity of the first data item $S_1$ at its four-dimensional coordinate point in the direction selected for $S_1$;
the gauge field quantity of the second data item $S_2$ at its four-dimensional coordinate point in the direction selected for $S_2$;
the gauge field quantity of any data item $S_g$ at its four-dimensional coordinate point in the direction selected for $S_g$;
the gauge field quantity of the last data item $S_G$ at its four-dimensional coordinate point in the direction selected for $S_G$;
step three, each slave core reads data information based on its row number and column number, thereby partitioning the data;
step 301, any slave core $cpe_A$, according to its slave core matrix position $d_{p,q}$ from step one, reads the part of the $DAA_{MPE}$ matrix for which $cpe_A$ is responsible, taken along the Z-axis and T-axis directions, into its local memory space; the fermion field data read in by all slave cores is the collection of these per-core parts;
step 302, any slave core $cpe_A$, according to its slave core matrix position $d_{p,q}$ from step one, reads the part of the $DBB_{MPE}$ matrix for which $cpe_A$ is responsible, taken along the Z-axis and T-axis directions, into its local memory space; the gauge field data read in by all slave cores is the collection of these per-core parts;
step four, computing the data information of any lattice point in any slave core;
step 401, the lattice point fermion field quantity of any data item $S_g$ is recorded; step 403 is executed;
step 403, for the data information $S_g$ of any lattice point, the data information of the 8 adjacent lattice points in the x, y, z, and t dimensions is acquired, and then the lattice point fermion field quantities and gauge field quantities of those 8 adjacent lattice points are acquired; step 404 is executed;
the data information of the 8 adjacent lattice points is denoted $S_1$, $S_2$, $S_3$, $S_4$, $S_5$, $S_6$, $S_7$, and $S_8$, and their central lattice point is $S_g$; each of the 8 adjacent lattice points has its own lattice point fermion field quantity, as does $S_g$;
step 404, matrix multiplication is performed on the fermion field quantities and gauge field quantities of the 8 adjacent lattice points; step 405 is executed;
step 405, the matrix products of the 8 adjacent lattice points are used to update the central lattice point $S_g$; the updated value becomes the lattice point fermion field quantity of $S_g$, that is, the entry for $S_g$ in the slave core's local fermion field data is replaced by the updated value; step five is executed;
step five, each slave core applies the processing of step four in parallel to the data information of every lattice point in its local storage space, thereby obtaining the updated fermion field quantities of all lattice points; step six is executed;
step six, after the update is finished, the number of iterations is increased by 1, and the residual of the lattice point fermion field quantities is calculated;
the number of iterations is denoted $U$, the maximum number of iterations is denoted $U_{max}$, with $U_{max} = 1000$, and the current iteration number is denoted $U_{current}$; if $U_{current} < U_{max}$, step four is executed; if $U_{current} \ge U_{max}$, step seven is executed;
the residual of the lattice point fermion field quantities is denoted $R$, and the residual threshold is denoted $R_{min}$, with $R_{min} = 1.0 \times 10^{-12}$; if $R > R_{min}$, step four is executed; if $R \le R_{min}$, step seven is executed;
step seven, outputting the updated lattice point matrix to memory and saving it as a file.
2. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the entries of the slave core set position matrix $add_{CPEs}$ are ordered by ascending core identification number.
3. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the bandwidth utilization rate is reduced, and the performance is improved by 145 times.
4. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the parallel acceleration reduces the run time and achieves a 63-fold performance improvement.
5. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the method is suitable for parallel acceleration on the Shenwei 26010 heterogeneous many-core processor.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810927168 | 2018-08-15 | ||
CN2018109271685 | 2018-08-15 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110516194A (en) | 2019-11-29 |
CN110516194B (en) | 2021-03-09 |
Family
ID=68625172
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910750655.3A (Active, CN110516194B) | 2018-08-15 | 2019-08-14 | Heterogeneous many-core processor-based grid point quantum color dynamics parallel acceleration method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516194B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113935491B (en) * | 2021-10-20 | 2022-08-23 | 腾讯科技(深圳)有限公司 | Method, device, equipment, medium and product for obtaining eigenstates of quantum system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101246446A (en) * | 2008-03-12 | 2008-08-20 | 浪潮电子信息产业股份有限公司 | Method for testing PC server performance |
CN101727512A (en) * | 2008-10-17 | 2010-06-09 | 中国科学院过程工程研究所 | General algorithm based on variational multiscale method and parallel calculation system |
CN106775594A (en) * | 2017-01-13 | 2017-05-31 | 中国科学院软件研究所 | Heterogeneous many-core implementation method of sparse matrix-vector multiplication based on the domestic Shenwei 26010 processor |
CN106933777A (en) * | 2017-03-14 | 2017-07-07 | 中国科学院软件研究所 | High-performance implementation method of radix-2 one-dimensional FFT based on the domestic Shenwei 26010 processor |
US9727471B2 (en) * | 2010-11-29 | 2017-08-08 | Intel Corporation | Method and apparatus for stream buffer management instructions |
CN107085743A (en) * | 2017-05-18 | 2017-08-22 | 郑州云海信息技术有限公司 | Deep learning algorithm implementation method and platform based on a domestic many-core processor |
CN107168683A (en) * | 2017-05-05 | 2017-09-15 | 中国科学院软件研究所 | High-performance implementation method of GEMM dense matrix multiplication on the domestic Shenwei 26010 many-core CPU |
CN107451097A (en) * | 2017-08-04 | 2017-12-08 | 中国科学院软件研究所 | High-performance implementation method of multidimensional FFT on the domestic Shenwei 26010 many-core processor |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103268297A (en) * | 2013-05-20 | 2013-08-28 | 浙江大学 | Accelerating core virtual scratch pad memory method based on heterogeneous multi-core platform |
KR102390162B1 (en) * | 2015-10-16 | 2022-04-22 | 삼성전자주식회사 | Apparatus and method for encoding data |
US10311174B2 (en) * | 2015-10-22 | 2019-06-04 | International Business Machines Corporation | Innermost data sharing method of lattice quantum chromodynamics calculation |
CN105808926B (en) * | 2016-03-02 | 2017-10-03 | 中国地质大学(武汉) | GPU-based parallel accelerated preconditioned conjugate gradient block adjustment method |
CN107273094B (en) * | 2017-05-18 | 2020-06-16 | 中国科学院软件研究所 | Data structure suitable for HPCG optimization on the Shenwei TaihuLight and efficient implementation method thereof |
Non-Patent Citations (5)
Title |
---|
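Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization; Clark Michael A. et al.; SC16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 20161118; 795-806 * |
Lwptool: A lightweight profiler to guide data layout optimization; Yu Chao et al.; IEEE Transactions on Parallel and Distributed Systems; 20180528; Vol. 29 (No. 11); 2489-2502 * |
Hybrid parallel genetic algorithm based on the Shenwei many-core processor; Zhao Ruixiang et al.; Journal of Computer Applications; 20170910; Vol. 37 (No. 9); 2518-2523 * |
Exploring "chiral electronics": chiral transport in type-II Weyl semimetals; Wang Rui et al.; Physics (Wuli); 20170212; Vol. 46 (No. 2); 100-102 * |
Parallel NSGA-II algorithm on the Shenwei many-core processor; Shen Huanxue et al.; Computer Engineering and Applications; 20180301; Vol. 54 (No. 17); 35-40 * |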
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||