CN110516194B - Parallel acceleration method for lattice quantum chromodynamics based on a heterogeneous many-core processor

Publication number
CN110516194B
CN110516194B CN201910750655.3A CN201910750655A CN110516194B CN 110516194 B CN110516194 B CN 110516194B CN 201910750655 A CN201910750655 A CN 201910750655A CN 110516194 B CN110516194 B CN 110516194B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910750655.3A
Other languages
Chinese (zh)
Other versions
CN110516194A (en)
Inventor
栾钟治
张增校
杨海龙
王锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8038: Associative processors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a parallel acceleration method for lattice quantum chromodynamics based on a heterogeneous many-core processor, comprising the following steps: first, the slave cores are assigned positions according to their slave-core identification numbers; second, the data information that is read is stored according to its position in four-dimensional space; third, each slave core reads from storage the lattice point values it is responsible for computing, according to its own position identifier; and fourth, the lattice point values of any slave core are updated iteratively to obtain the updated lattice point values belonging to that slave core. The parallelization method, designed for the Shenwei 26010 heterogeneous many-core processor, makes full use of the register communication feature unique to the slave cores of the Shenwei many-core processor, increases data reuse, and eliminates a large amount of redundant data. Compared with running only on the master core, the method after parallel acceleration improves performance by a factor of 63.

Description

Parallel acceleration method for lattice quantum chromodynamics based on a heterogeneous many-core processor
Technical Field
The invention relates to a parallel acceleration method for lattice quantum chromodynamics, and in particular to a method for accelerating lattice quantum chromodynamics in parallel on a Shenwei 26010 heterogeneous many-core processor.
Background
The Sunway TaihuLight (Shenwei Taihu Light) supercomputer was developed by the National Research Center of Parallel Computer Engineering and Technology and is installed at the National Supercomputing Center in Wuxi. It is also the first world-leading supercomputer built in China entirely with domestically developed technology. It contains 40960 domestically developed Shenwei 26010 many-core processors, which use a 64-bit domestic Shenwei instruction set; its peak floating-point performance is about 125 PFLOPS (1.25 × 10^17 operations per second) and its sustained performance is about 93 PFLOPS (9.3 × 10^16 operations per second). The architecture of the Shenwei 26010 heterogeneous many-core processor is shown in FIG. 1. Each processor chip comprises four core groups (CG), which are connected through a network on chip. Each core group mainly consists of one management processing element (MPE, called the master core for short), one computing processing element cluster (CPE, called the slave cores for short), and one memory controller (MC). The computing cores of the cluster are connected by a communication network with an 8 × 8 mesh topology. The system interface (SI) connects the chip to the off-chip system and is implemented as a standard PCIe 3.0 interface.
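The core counts implied by the figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers given in the text (40960 processors, four core groups per chip, one master core and an 8 × 8 slave-core array per group); the variable names are illustrative.

```python
# Core-count arithmetic using only the figures quoted in the text above.
PROCESSORS = 40960          # Shenwei 26010 chips in the machine
CORE_GROUPS_PER_CHIP = 4    # core groups (CG) per chip
MPE_PER_GROUP = 1           # one master core per core group
CPE_PER_GROUP = 8 * 8       # slave cores arranged as an 8 x 8 mesh

cores_per_chip = CORE_GROUPS_PER_CHIP * (MPE_PER_GROUP + CPE_PER_GROUP)
total_cores = PROCESSORS * cores_per_chip

print(cores_per_chip)  # 260
print(total_cores)     # 10649600
```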
At present, heterogeneous computer architectures are characterized by strong parallelism and high computing power. Heterogeneous architectures greatly improve the parallelism and scalability of computing platforms, and the growing number of heterogeneous architectures offers new computing and programming approaches for scientific computing with very large workloads. How to exploit the computing power of the Shenwei 26010 heterogeneous many-core processor, and how to parallelize the related scientific computing algorithms, is one of the current research hotspots.
Disclosure of Invention
When data is partitioned on the Shenwei 26010 heterogeneous many-core processor, redundant data in transmission makes the bandwidth utilization between the slave cores (CPE) and the master core (MPE) extremely low. To solve this problem, the invention provides a parallel acceleration method for lattice quantum chromodynamics based on the heterogeneous many-core processor. The method matches positions by combining the position ordering of the slave cores with data partitioning, performs lattice point calculations in four-dimensional space using the fermion field quantity and the gauge field quantity, and, after multiple iterations, saves the updated fermion field quantity matrix in memory as a file. The method is optimized and implemented around the data transmission and computation characteristics of the Shenwei 26010 heterogeneous many-core processor and the characteristics of the lattice quantum chromodynamics algorithm. It makes full use of the register communication feature unique to the slave cores (CPE), increases data reuse, and eliminates a large amount of redundant data. By exploiting the processor's support for single instruction, multiple data (SIMD) instructions, computing performance is greatly improved.
The parallel acceleration method for lattice quantum chromodynamics based on a heterogeneous many-core processor according to the invention is characterized by comprising the following steps:
step one, initializing the slave-core position matrix of the heterogeneous many-core processor;
step two, the master core reads the fermion field quantity and the gauge field quantity;
step three, each slave core reads data information based on its row number and column number, thereby realizing data partitioning;
step four, calculating the data information of any lattice point in any slave core;
step five, each slave core performs the processing of step four in parallel on the data information of every lattice point in its local storage space, thereby obtaining the updated fermion field quantities of all lattice points, namely [formula image BDA0002167067360000021]; then executing step six;
step six, after the update is completed, the iteration count is increased by 1, and the residual of the lattice point fermion field quantity is calculated;
the iteration count is denoted $U$, the maximum number of iterations is denoted $U_{\max}$ with $U_{\max} = 1000$, and the current iteration count is denoted $U_{\text{current}}$; if $U_{\text{current}} < U_{\max}$, step four is executed; if $U_{\text{current}} \ge U_{\max}$, step seven is executed;
the residual of the lattice point fermion field quantity is denoted $R$, and the residual threshold of the lattice point fermion field quantity is denoted $R_{\min}$ with $R_{\min} = 1.0 \times 10^{-12}$; if $R > R_{\min}$, step four is executed; if $R \le R_{\min}$, step seven is executed;
step seven, outputting the updated lattice point matrix to memory and saving it as a file;
[formula image BDA0002167067360000022] is transferred to memory to update $DAA_{MPE}$, yielding [formula image BDA0002167067360000023]; [formula image BDA0002167067360000024] is then written to a file and saved.
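The iteration control of steps four to six can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the two tests of step six are combined so that either one ends the loop, and `update` and `toy_update` are hypothetical stand-ins for the per-lattice-point calculation of steps four and five.

```python
from typing import Callable, List, Tuple

U_MAX = 1000      # maximum number of iterations, as stated in step six
R_MIN = 1.0e-12   # residual threshold, as stated in step six


def iterate(update: Callable[[List[float]], Tuple[List[float], float]],
            field: List[float]) -> Tuple[List[float], int, float]:
    """Repeat the step-four/five update until U >= U_MAX or R <= R_MIN."""
    u = 0
    residual = float("inf")
    while u < U_MAX and residual > R_MIN:
        field, residual = update(field)  # steps four and five (placeholder)
        u += 1                           # step six: increment the iteration count
    return field, u, residual


def toy_update(field: List[float]) -> Tuple[List[float], float]:
    # Hypothetical stand-in: halve every value; residual = largest magnitude.
    new_field = [0.5 * v for v in field]
    return new_field, max(abs(v) for v in new_field)


final_field, iterations, final_residual = iterate(toy_update, [1.0])
print(iterations, final_residual)
```

With the toy update the loop ends through the residual test; an update whose residual never falls below the threshold ends after 1000 iterations instead.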
The parallel acceleration method can be applied to parallel acceleration processing of the Shenwei 26010 heterogeneous many-core processor.
The parallel acceleration method for lattice quantum chromodynamics based on a heterogeneous many-core processor has the following advantages:
First, the method determines the data information processed by each slave core by using the slave-core position matrix together with data partitioning, which increases data reuse and eliminates a large amount of redundant data.
Second, after position matching, the method performs lattice point calculations in four-dimensional space using the fermion field quantity and the gauge field quantity, which reduces bandwidth usage and improves performance by a factor of 145.
Third, experiments show that, compared with the original serial computing method, the parallel-optimized method takes less time and achieves a 63-fold performance improvement.
Fourth, the result obtained by the parallel acceleration method is stored in memory as a single file, which makes it convenient for the Shenwei 26010 heterogeneous many-core processor to reuse it.
Drawings
FIG. 1 is a diagram of a Shenwei 26010 heterogeneous many-core processor architecture.
FIG. 2 is a two-dimensional schematic of the lattice points in four-dimensional space.
FIG. 2A is a schematic diagram of the lattice points in the XY plane.
FIG. 2B is a schematic diagram of the lattice points in the XZ plane.
FIG. 2C is a schematic diagram of the lattice points in the XT plane.
FIG. 2D is a schematic diagram of the lattice points in the YZ plane.
FIG. 2E is a schematic diagram of the lattice points in the YT plane.
FIG. 2F is a schematic diagram of the lattice points in the ZT plane.
FIG. 3 is a flow chart of the parallel acceleration method for lattice quantum chromodynamics based on a heterogeneous many-core processor.
FIG. 4 shows the proportion of running time taken by each part of the program when the method of the invention performs 10 iterations.
FIG. 5 shows the proportion of running time taken by each part of the program when the method of the invention performs 100 iterations.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Quantum chromodynamics (QCD) is the fundamental theory describing the strong interaction. Lattice quantum chromodynamics (lattice QCD) can in principle also be applied to the study of theories other than QCD. Lattice QCD is formulated in terms of the basic degrees of freedom of QCD, namely the quark field, the antiquark field and the gluon field. These fields are defined on a set of discrete lattice points in four-dimensional Euclidean space.
Each chip of the Shenwei heterogeneous many-core processor contains four core groups (CG), and each core group has one operation control core (MPE) and one computing core array (CPE). For convenience of explanation, the operation control core is denoted MPE and called the master core for short; the computing core array is denoted CPEs and called the slave cores for short. Since the core array contains multiple cores, the slave-core set is denoted $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$, where $cpe_1$ denotes the first slave core, $cpe_2$ denotes the second slave core, and $cpe_A$ denotes the last slave core; for convenience of explanation, $cpe_A$ also denotes any one slave core, and $A$ denotes the slave-core identification number. For the Shenwei 26010 heterogeneous many-core processor, $A = 64$.
In the present invention, any piece of data information processed by the Shenwei 26010 heterogeneous many-core processor is denoted $S_g$. The pieces of data information form the data information set $SPM = \{S_1, S_2, \ldots, S_g, \ldots, S_G\}$, where $S_1$ denotes the first piece of data information, $S_2$ denotes the second piece, $S_g$ denotes the $g$-th piece (for convenience of explanation, $S_g$ also denotes any piece of data information, with $g \in \{1, \ldots, G\}$), $S_G$ denotes the last piece, and $G$ denotes the total number of pieces of data information.
In the present invention, all slave cores in the set $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$ are arranged in an 8 × 8 matrix (ordered by slave-core identification number from smallest to largest), which gives the slave-core-set position matrix $add_{CPEs}$:
$$add_{CPEs} = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,8} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,8} \\ \vdots & \vdots & \ddots & \vdots \\ d_{8,1} & d_{8,2} & \cdots & d_{8,8} \end{pmatrix}$$
The entries are assigned row by row, so that $d_{p,q}$ is the position of slave core $cpe_{8(p-1)+q}$:
$d_{1,1}$ to $d_{1,8}$ denote the positions of the first to eighth slave cores $cpe_1$ to $cpe_8$ in columns 1 to 8 of the first row of the slave-core-set position matrix;
$d_{2,1}$ to $d_{2,8}$ denote the positions of the ninth to sixteenth slave cores $cpe_9$ to $cpe_{16}$ in columns 1 to 8 of the second row;
$d_{3,1}$ to $d_{3,8}$ denote the positions of the seventeenth to twenty-fourth slave cores $cpe_{17}$ to $cpe_{24}$ in columns 1 to 8 of the third row;
$d_{4,1}$ to $d_{4,8}$ denote the positions of the twenty-fifth to thirty-second slave cores $cpe_{25}$ to $cpe_{32}$ in columns 1 to 8 of the fourth row;
$d_{5,1}$ to $d_{5,8}$ denote the positions of the thirty-third to fortieth slave cores $cpe_{33}$ to $cpe_{40}$ in columns 1 to 8 of the fifth row;
$d_{6,1}$ to $d_{6,8}$ denote the positions of the forty-first to forty-eighth slave cores $cpe_{41}$ to $cpe_{48}$ in columns 1 to 8 of the sixth row;
$d_{7,1}$ to $d_{7,8}$ denote the positions of the forty-ninth to fifty-sixth slave cores $cpe_{49}$ to $cpe_{56}$ in columns 1 to 8 of the seventh row;
$d_{8,1}$ to $d_{8,8}$ denote the positions of the fifty-seventh to sixty-fourth slave cores $cpe_{57}$ to $cpe_{64}$ in columns 1 to 8 of the eighth row.
In the present invention, for convenience of explanation, $d_{p,q}$ denotes the position of any slave core $cpe_A$ in the slave-core-set position matrix $add_{CPEs}$, where $p$ is the row number and $q$ is the column number; $d_{p,q}$ is called the slave-core position for short.
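The row-by-row ordering described above amounts to a simple index conversion between a slave-core identification number and its row and column. The following is a minimal sketch using the 1-based numbering of the text; the function names `core_position` and `core_id` are illustrative and do not come from the patent or from any Shenwei programming interface.

```python
N = 8  # the slave cores form an 8 x 8 matrix (A = 64)


def core_position(a: int) -> tuple:
    """Return (p, q), the 1-based row and column of slave core cpe_a."""
    if not 1 <= a <= N * N:
        raise ValueError("slave core identification number must be in 1..64")
    return (a - 1) // N + 1, (a - 1) % N + 1


def core_id(p: int, q: int) -> int:
    """Inverse mapping: identification number of the core at d_{p,q}."""
    return (p - 1) * N + q


print(core_position(1))   # (1, 1)
print(core_position(9))   # (2, 1)
print(core_position(64))  # (8, 8)
```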
In the present invention, the gauge field quantity GF has the form:
$$GF = \begin{pmatrix} a_1 + b_1 i & a_2 + b_2 i & a_3 + b_3 i \\ a_4 + b_4 i & a_5 + b_5 i & a_6 + b_6 i \\ a_7 + b_7 i & a_8 + b_8 i & a_9 + b_9 i \end{pmatrix}$$
where $i$ is the imaginary unit, $i^2 = -1$;
$a_1$, $a_2$ and $a_3$ are the real parts of the first, second and third complex numbers of the first row vector of the gauge field quantity;
$a_4$, $a_5$ and $a_6$ are the real parts of the first, second and third complex numbers of the second row vector of the gauge field quantity;
$a_7$, $a_8$ and $a_9$ are the real parts of the first, second and third complex numbers of the third row vector of the gauge field quantity;
$b_1$, $b_2$ and $b_3$ are the imaginary parts of the first, second and third complex numbers of the first row vector of the gauge field quantity;
$b_4$, $b_5$ and $b_6$ are the imaginary parts of the first, second and third complex numbers of the second row vector of the gauge field quantity;
$b_7$, $b_8$ and $b_9$ are the imaginary parts of the first, second and third complex numbers of the third row vector of the gauge field quantity.
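As an illustration of this layout, the nine real parts and nine imaginary parts can be assembled into a 3 × 3 complex matrix. This is a minimal sketch; the function name `build_gauge_matrix` and the sample values are assumptions made for the example and are not taken from the patent.

```python
def build_gauge_matrix(a, b):
    """Assemble the 3 x 3 complex matrix GF from real parts a and imaginary parts b.

    a and b each hold nine numbers, ordered a_1..a_9 and b_1..b_9 (row by row).
    """
    if len(a) != 9 or len(b) != 9:
        raise ValueError("expected nine real parts and nine imaginary parts")
    return [[complex(a[3 * r + c], b[3 * r + c]) for c in range(3)]
            for r in range(3)]


# Illustrative values only: the identity matrix (all imaginary parts zero).
a = [1, 0, 0, 0, 1, 0, 0, 0, 1]
b = [0] * 9
GF = build_gauge_matrix(a, b)
print(GF[1][1])  # (1+0j)
```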
In the present invention, the fermion field quantity WIL has the form:
$$WIL = \begin{pmatrix} \xi_1 + \beta_1 i & \gamma_1 + \delta_1 i & \mu_1 + \nu_1 i \\ \xi_2 + \beta_2 i & \gamma_2 + \delta_2 i & \mu_2 + \nu_2 i \\ \xi_3 + \beta_3 i & \gamma_3 + \delta_3 i & \mu_3 + \nu_3 i \\ \xi_4 + \beta_4 i & \gamma_4 + \delta_4 i & \mu_4 + \nu_4 i \end{pmatrix}$$
where $i$ is the imaginary unit, $i^2 = -1$;
$\xi_1$, $\xi_2$, $\xi_3$ and $\xi_4$ are the real parts of the first, second, third and fourth complex numbers of the first column vector of the fermion field quantity;
$\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are the imaginary parts of the first, second, third and fourth complex numbers of the first column vector of the fermion field quantity;
$\gamma_1$, $\gamma_2$, $\gamma_3$ and $\gamma_4$ are the real parts of the first, second, third and fourth complex numbers of the second column vector of the fermion field quantity;
$\delta_1$, $\delta_2$, $\delta_3$ and $\delta_4$ are the imaginary parts of the first, second, third and fourth complex numbers of the second column vector of the fermion field quantity;
$\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ are the real parts of the first, second, third and fourth complex numbers of the third column vector of the fermion field quantity;
$\nu_1$, $\nu_2$, $\nu_3$ and $\nu_4$ are the imaginary parts of the first, second, third and fourth complex numbers of the third column vector of the fermion field quantity.
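The same layout can be illustrated in code: three column vectors of four complex numbers each give a 4 × 3 complex matrix holding 24 real numbers. This is a minimal sketch; the function names and the interleaved real/imaginary ordering used by `to_reals` are assumptions for the example, not a storage format stated in the patent.

```python
def build_fermion_matrix(xi, beta, gamma, delta, mu, nu):
    """Assemble the 4 x 3 complex matrix WIL.

    Column 1 is xi + i*beta, column 2 is gamma + i*delta, column 3 is mu + i*nu;
    each argument holds four real numbers (indices 1..4 in the text).
    """
    parts = (xi, beta, gamma, delta, mu, nu)
    if any(len(p) != 4 for p in parts):
        raise ValueError("each argument must hold four real numbers")
    return [[complex(xi[k], beta[k]),
             complex(gamma[k], delta[k]),
             complex(mu[k], nu[k])] for k in range(4)]


def to_reals(wil):
    """Flatten WIL into 24 reals (re, im interleaved; an assumed ordering)."""
    return [part for row in wil for z in row for part in (z.real, z.imag)]


# Illustrative values only.
WIL = build_fermion_matrix([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12],
                           [13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24])
print(len(WIL), len(WIL[0]), len(to_reals(WIL)))
```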
In the present invention, the fermion field quantity of any piece of data information $S_g$ is denoted as [formula image BDA0002167067360000081] [formula image BDA0002167067360000082].
In the present invention, the gauge field quantity of any piece of data information $S_g$ is denoted as [formula image BDA0002167067360000083] [formula image BDA0002167067360000084].
In the same way, the fermion field quantity of the first piece of data information $S_1$ is denoted as [formula image BDA0002167067360000091] [formula image BDA0002167067360000092].
In the same way, the gauge field quantity of the first piece of data information $S_1$ is denoted as [formula image BDA0002167067360000093] [formula image BDA0002167067360000094].
Similarly, the fermion field quantity of the second piece of data information $S_2$ is denoted as [formula image BDA0002167067360000095] [formula image BDA0002167067360000096].
In the same way, the gauge field quantity of the second piece of data information $S_2$ is denoted as [formula image BDA0002167067360000097] [formula image BDA0002167067360000098].
In the same way, the fermion field quantity of the last piece of data information $S_G$ is denoted as [formula image BDA0002167067360000099] [formula image BDA00021670673600000910].
In the same way, the gauge field quantity of the last piece of data information $S_G$ is denoted as [formula image BDA00021670673600000911] [formula image BDA00021670673600000912].
In the present invention, [formula image BDA00021670673600000913] and [formula image BDA00021670673600000914] are not the same. [formula image BDA00021670673600000915] [formula image BDA00021670673600000916] and [formula image BDA00021670673600000917] are not the same.
In the present invention, the slave-core-set position matrix $add_{CPEs}$ is used to mark the specific position of each core in the slave-core set $CPEs = \{cpe_1, cpe_2, \ldots, cpe_A\}$, which improves the efficiency of matching cores to the lattice point positions in four-dimensional space when the slave cores process data information in parallel.
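One way such a match between slave-core positions and lattice point positions could look is sketched below. The partition rule is purely hypothetical and is not stated in the text: it assumes the Z extent of the lattice is split evenly across the 8 rows of the position matrix and the T extent across its 8 columns. The function name `owner_core` and the extents `lz` and `lt` are illustrative.

```python
N = 8  # rows and columns of the slave-core position matrix


def owner_core(z: int, t: int, lz: int, lt: int) -> tuple:
    """Return the 1-based (p, q) of the slave core owning lattice point (.., z, t).

    Hypothetical block partition: rows split the Z extent, columns split T.
    Coordinates z and t are 0-based.
    """
    if lz % N or lt % N:
        raise ValueError("this sketch assumes lz and lt are multiples of 8")
    if not (0 <= z < lz and 0 <= t < lt):
        raise ValueError("coordinate outside the lattice")
    return z // (lz // N) + 1, t // (lt // N) + 1


print(owner_core(0, 0, 16, 32))
print(owner_core(3, 9, 16, 32))
```

Under this assumed rule every lattice point has exactly one owning slave core, so the row and column numbers alone determine which block of data a core reads.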
Referring to the two-dimensional schematic of the four-dimensional lattice points shown in FIG. 2, the position of any piece of data information $S_g$ is labeled by its coordinates on the four axes X, Y, Z and T. Because four-dimensional coordinates are difficult to draw, the invention represents them with the two-dimensional plane coordinates shown in FIGS. 2A to 2F. The arrows in FIG. 2 represent the direction of the selected gauge field quantity (i.e., [formula image BDA0002167067360000101]), and the dots represent coordinate points in the four-dimensional space. The two ends of each dimension are connected, that is, each dimension is periodic; the six two-dimensional plane views together provide the theoretical basis for parallelization by data parallelism. In the present invention, the coordinate position of any piece of data information $S_g$ is marked. This addresses two problems. First, the data information needed by each slave core is not stored contiguously in main memory, so fetching the data for a single calculation requires initiating multiple direct memory access (DMA) transfers, which makes the bandwidth utilization between each slave core and main memory very low. Second, calculating the data information of each lattice point requires the data information of its neighboring lattice points, so this style of calculation transfers a large amount of redundant data information.
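The two effects described above, wrap-around neighbors and non-contiguous storage, can be illustrated with a short sketch. The lattice extents and the memory layout (X varying fastest) are assumptions chosen for the example; the patent text does not specify them.

```python
def neighbours(site, dims):
    """Return the eight nearest neighbours of a 4-D lattice site.

    Coordinates are 0-based; each axis wraps around (its two ends are connected).
    """
    result = []
    for axis in range(4):
        for step in (+1, -1):
            nb = list(site)
            nb[axis] = (nb[axis] + step) % dims[axis]
            result.append(tuple(nb))
    return result


def linear_index(site, dims):
    """Offset of a site in a flat array with X varying fastest (assumed layout)."""
    x, y, z, t = site
    lx, ly, lz, _ = dims
    return x + lx * (y + ly * (z + lz * t))


dims = (4, 4, 4, 8)  # hypothetical lattice extents (X, Y, Z, T)
site = (0, 0, 0, 0)
for nb in neighbours(site, dims):
    print(nb, linear_index(nb, dims))
```

For the origin of this example lattice the neighbor offsets range from 1 to 448, which shows why gathering the neighbors of a single lattice point from a flat array touches widely separated memory locations.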
In the present invention, the position of the first data information S_1 in four-dimensional space is recorded as
Figure BDA0002167067360000102
where
Figure BDA0002167067360000103
is the value of S_1 on the X-axis,
Figure BDA0002167067360000104
is the value of S_1 on the Y-axis,
Figure BDA0002167067360000105
is the value of S_1 on the Z-axis, and
Figure BDA0002167067360000106
is the value of S_1 on the T-axis.
In the present invention, the position of the second data information S_2 in four-dimensional space is recorded as
Figure BDA0002167067360000107
where
Figure BDA0002167067360000108
is the value of S_2 on the X-axis,
Figure BDA0002167067360000109
is the value of S_2 on the Y-axis,
Figure BDA00021670673600001010
is the value of S_2 on the Z-axis, and
Figure BDA00021670673600001011
is the value of S_2 on the T-axis.
In the present invention, the position of any data information S_g in four-dimensional space is recorded as
Figure BDA00021670673600001012
where
Figure BDA00021670673600001013
is the value of S_g on the X-axis,
Figure BDA00021670673600001014
is the value of S_g on the Y-axis,
Figure BDA00021670673600001015
is the value of S_g on the Z-axis, and
Figure BDA00021670673600001016
is the value of S_g on the T-axis. The subscript g in S_g is the identification number of the data information, with g = 1, …, G. As shown in FIGS. 2A to 2F, the X-axis is perpendicular to the Y-axis, and the Z-axis is perpendicular to the Y-axis. The time at which the heterogeneous many-core processor processes the data information S_g is represented in the four-dimensional space as the time axis, denoted T.
For convenience of explanation, the position of any data information S_g in four-dimensional space is written as:
Figure BDA00021670673600001017
In the present invention, the position of the last data information S_G in four-dimensional space is recorded as
Figure BDA00021670673600001018
where
Figure BDA00021670673600001019
is the value of S_G on the X-axis,
Figure BDA00021670673600001020
is the value of S_G on the Y-axis,
Figure BDA00021670673600001021
is the value of S_G on the Z-axis, and
Figure BDA00021670673600001022
is the value of S_G on the T-axis. The subscript G in S_G is the total number of data information.
Referring to FIG. 3, the heterogeneous many-core processor-based grid point quantum color dynamics (lattice quantum chromodynamics, lattice QCD) parallel acceleration method of the present invention comprises the following steps:
Step one, initializing the slave core matrix positions of the heterogeneous many-core processor;
Because the heterogeneous many-core processor contains a plurality of slave cores, the slave cores are divided according to their identification numbers, and the matrix position of each slave core is recorded.
The slave core set CPEs = {cpe_1, cpe_2, …, cpe_A} is arranged by position into an 8 × 8 matrix, giving the slave core set position matrix add_CPEs of formula (1); the position of any slave core is denoted d_{p,q}:
Figure BDA0002167067360000111
The entries of the slave core set position matrix add_CPEs are ordered by core identification number from smallest to largest. In the subsequent calculation, each slave core determines the data region it is responsible for from its own row number and column number.
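Step one can be sketched as follows. The sketch assumes zero-based core identification numbers 0 to 63 laid out row by row; the patent numbers the cores cpe_1 to cpe_A and does not state whether the ordering is row-major, so both choices, as well as the names `core_position` and `position_matrix`, are illustrative assumptions.

```python
ROWS, COLS = 8, 8  # the 8 x 8 slave core array described in the text


def core_position(core_id):
    """Return (row p, column q) of a slave core, assuming row-major numbering."""
    if not 0 <= core_id < ROWS * COLS:
        raise ValueError("core_id must be in 0..63")
    return divmod(core_id, COLS)


# Position matrix whose entries are core IDs in ascending order.
position_matrix = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]
```

Each slave core can then look up its own row and column once and reuse them to select its data region in the later steps.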
Step two, the master core reads the Fermi sub-field quantity (the fermion field) and the standard field quantity (the gauge field);
Step 201, the master core MPE reads the data information and expresses all of the read data information as the set SPM = {S_1, S_2, …, S_g, …, S_G};
Step 202, the master core stores the Fermi sub-field quantity of the data information in SPM = {S_1, S_2, …, S_g, …, S_G} into an 8 × 8 lattice point matrix DAA_MPE in reading order, and saves DAA_MPE to memory;
Figure BDA0002167067360000112
where
Figure BDA0002167067360000113
denotes the value at the four-dimensional coordinate point
Figure BDA0002167067360000114
of the Fermi sub-field quantity of the first data information S_1;
Figure BDA0002167067360000121
denotes the value at the four-dimensional coordinate point
Figure BDA0002167067360000122
of the Fermi sub-field quantity of the second data information S_2;
Figure BDA0002167067360000123
denotes the value at the four-dimensional coordinate point
Figure BDA0002167067360000124
of the Fermi sub-field quantity of any data information S_g;
Figure BDA0002167067360000125
denotes the value at the four-dimensional coordinate point
Figure BDA0002167067360000126
of the Fermi sub-field quantity of the last data information S_G.
Step 203, the master core stores the standard field quantity of the data information in SPM = {S_1, S_2, …, S_g, …, S_G} into a 4 × 8 lattice point link matrix DBB_MPE in reading order, and saves DBB_MPE to memory;
Figure BDA0002167067360000127
In the present invention, the direction c is any one axis selected from the X-axis, the Y-axis, the Z-axis and the T-axis.
Figure BDA0002167067360000128
denotes the direction selected at the four-dimensional coordinate point
Figure BDA0002167067360000129
of S_1. In the same way,
Figure BDA00021670673600001210
and
Figure BDA00021670673600001211
also denote selected directions.
Figure BDA00021670673600001212
denotes the value at the four-dimensional coordinate point
Figure BDA00021670673600001213
in the direction
Figure BDA00021670673600001214
of the standard field quantity of the first data information S_1.
Figure BDA00021670673600001215
denotes the value at the four-dimensional coordinate point
Figure BDA00021670673600001216
in the direction
Figure BDA00021670673600001217
of the standard field quantity of the second data information S_2.
Figure BDA00021670673600001218
denotes the value at the four-dimensional coordinate point
Figure BDA00021670673600001219
in the direction
Figure BDA00021670673600001220
of the standard field quantity of any data information S_g.
Figure BDA00021670673600001221
denotes the value at the four-dimensional coordinate point
Figure BDA0002167067360000131
in the direction
Figure BDA0002167067360000132
of the standard field quantity of the last data information S_G.
In the present invention, representing the four-dimensional data information by the Fermi sub-field quantity and the standard field quantity facilitates task division and parallel calculation.
Step three, each slave core reads data information according to its own row number and column number, which realizes the data partitioning;
Step 301, according to its slave core matrix position d_{p,q} from step one, any slave core cpe_A reads the part of the DAA_MPE matrix for which cpe_A is responsible, divided along the Z-axis and T-axis directions, into its local storage space, where it is recorded as
Figure BDA0002167067360000133
The data information read in by all slave cores can be written as:
Figure BDA0002167067360000134
Step 302, according to its slave core matrix position d_{p,q} from step one, any slave core cpe_A reads the part of the DBB_MPE matrix for which cpe_A is responsible, divided along the Z-axis and T-axis directions, into its local storage space, where it is recorded as
Figure BDA0002167067360000135
The data information read in by all slave cores can be written as:
Figure BDA0002167067360000136
In the present invention, any slave core cpe_A initiates a direct memory access (DMA) transfer to main memory according to its own identification number, reads the lattice point Fermi sub-field quantity values and standard field quantity values that it is responsible for calculating, and stores them in its local storage. The four-dimensional sub-lattice is divided along the Z-axis and T-axis directions into 64 two-dimensional planes, so that each slave core is responsible for one plane. The calculation inside each plane is relatively independent; data information from the adjacent planes, which are connected end to end, is needed only when boundary data is calculated.
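The plane assignment described above can be sketched as follows. The sketch assumes that the slave core in row p and column q owns the plane at z = p, t = q, and that core identification numbers run from 0 to 63 row by row; the patent states only that the split is along the Z and T axes, so this mapping and the function names are hypothetical. The wrap-around in `neighbour_cores` reflects the end-to-end connection of adjacent planes.

```python
ROWS, COLS = 8, 8  # 64 planes, one per slave core


def owned_plane(core_id):
    """(z, t) of the plane owned by a slave core: z = row, t = column (assumed)."""
    return divmod(core_id, COLS)


def neighbour_cores(core_id):
    """Core IDs owning the planes at z+1, z-1, t+1 and t-1, with periodic wrap."""
    z, t = owned_plane(core_id)
    return {
        "z+": ((z + 1) % ROWS) * COLS + t,
        "z-": ((z - 1) % ROWS) * COLS + t,
        "t+": z * COLS + (t + 1) % COLS,
        "t-": z * COLS + (t - 1) % COLS,
    }
```

Under this mapping, only the four cores returned by `neighbour_cores` hold boundary data that a given core needs, which is why the interior of each plane can be computed independently.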
Step four, calculating the data information of any lattice point in any slave core;
Step 401, any slave core cpe_A takes from its corresponding
Figure BDA0002167067360000141
the lattice point Fermi sub-field quantity corresponding to S_g; then step 403 is executed;
Step 402, any slave core cpe_A takes from its corresponding
Figure BDA0002167067360000142
the standard field quantity corresponding to S_g; then step 403 is executed;
Step 403, starting from the data information S_g of any lattice point, the data information of its 8 adjacent lattice points in the x, y, z and t dimensions is obtained, and then the lattice point Fermi sub-field quantities and standard field quantities of those 8 adjacent lattice points are obtained; then step 404 is executed;
The data information of the 8 adjacent lattice points is denoted S_1, S_2, S_3, S_4, S_5, S_6, S_7 and S_8 respectively, and the central lattice point of the 8 adjacent lattice points is S_g; their lattice point Fermi sub-field quantities are recorded respectively as
Figure BDA0002167067360000143
Figure BDA0002167067360000144
and
Figure BDA0002167067360000145
while the lattice point Fermi sub-field quantity of S_g is recorded as
Figure BDA0002167067360000146
and the standard field quantities are recorded respectively as
Figure BDA0002167067360000147
and
Figure BDA0002167067360000148
Step 404, matrix multiplication is performed on the Fermi sub-field quantities and standard field quantities of the 8 adjacent lattice points; then step 405 is executed;
Figure BDA0002167067360000149
Figure BDA00021670673600001410
Figure BDA00021670673600001411
Figure BDA00021670673600001412
Figure BDA00021670673600001413
Figure BDA00021670673600001414
Figure BDA00021670673600001415
Figure BDA00021670673600001416
Step 405, the matrix products of the 8 adjacent lattice points are used to update the central lattice point S_g; the updated lattice point Fermi sub-field quantity of S_g is recorded as
Figure BDA00021670673600001417
and
Figure BDA00021670673600001418
that is, in
Figure BDA00021670673600001419
the data information S_g is updated to
Figure BDA0002167067360000151
Then step five is executed;
In the present invention, step four describes the processing of a single lattice point, and every lattice point in a slave core is processed by the same step four. To make the iteration explicit, one pass of this operation over all lattice points is described separately as step five.
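Steps 403 to 405 can be illustrated with a deliberately simplified sketch. The patent's actual update formulas are contained in figures that are not reproduced in this text, so the code below substitutes a generic nearest-neighbour sum: each site holds a 3-component complex vector, each link is a 3 × 3 complex matrix, and the update adds the link-times-field product from each of the 8 neighbours under periodic boundaries. Spin structure, coefficients and all names (`neighbours`, `update_site`, the dictionaries `fermion` and `gauge`) are assumptions made for illustration only.

```python
from itertools import product


def neighbours(site, dims):
    """The 8 nearest neighbours of `site` (one step up or down each axis), periodic."""
    result = []
    for axis in range(4):
        for step in (+1, -1):
            coords = list(site)
            coords[axis] = (coords[axis] + step) % dims[axis]
            result.append((axis, step, tuple(coords)))
    return result


def mat_vec(m, v):
    """3 x 3 complex matrix times 3-component complex vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def dagger(m):
    """Conjugate transpose of a 3 x 3 complex matrix."""
    return [[m[j][i].conjugate() for j in range(3)] for i in range(3)]


def update_site(site, fermion, gauge, dims):
    """Sum of link-times-field products over the 8 neighbours of `site`."""
    total = [0j, 0j, 0j]
    for axis, step, nb in neighbours(site, dims):
        # Forward hop: link leaving `site`. Backward hop: conjugate transpose
        # of the link leaving the neighbour along the same axis.
        link = gauge[site][axis] if step == +1 else dagger(gauge[nb][axis])
        hop = mat_vec(link, fermion[nb])
        total = [a + b for a, b in zip(total, hop)]
    return total


dims = (4, 4, 4, 4)  # hypothetical tiny lattice
identity = [[1 + 0j if i == j else 0j for j in range(3)] for i in range(3)]
sites = list(product(*(range(n) for n in dims)))
fermion = {s: [1 + 0j, 0j, 0j] for s in sites}
gauge = {s: [identity] * 4 for s in sites}
updated = update_site((0, 0, 0, 0), fermion, gauge, dims)
```

With identity links and a constant field, each of the 8 neighbours contributes the same vector, which makes the result easy to check by hand.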
Step five, each slave core applies the processing of step four in parallel to the data information of every lattice point in its local storage space, thereby obtaining the updated Fermi sub-field quantities of all lattice points, namely
Figure BDA0002167067360000152
Then step six is executed;
Step six, after the update is finished, the iteration count is increased by 1, and the residual of the lattice point Fermi sub-field quantity is calculated;
In the present invention, the iteration count is denoted U, the maximum iteration count is denoted U_max with U_max = 1000, and the current iteration count is denoted U_current. If U_current < U_max, step four is executed; if U_current ≥ U_max, step seven is executed;
In the present invention, the residual of the lattice point Fermi sub-field quantity is denoted R, and the residual threshold is denoted R_min with R_min = 1.0 × 10^-12. If R > R_min, step four is executed; if R ≤ R_min, step seven is executed;
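The two stopping conditions of step six can be combined in a small driver loop. The text states each condition separately and does not say how they interact; the sketch below assumes that the iteration stops as soon as either the residual falls to R_min or the iteration count reaches U_max. The function `iterate` and its callback parameters are hypothetical, and the toy update (halving a number) merely stands in for one pass of steps four and five.

```python
def iterate(step, residual, state, u_max=1000, r_min=1.0e-12):
    """Apply `step` until residual(state) <= r_min or u_max iterations have run."""
    u = 0
    while True:
        state = step(state)   # one pass over all lattice points (steps four and five)
        u += 1                # step six: increase the iteration count by 1
        r = residual(state)   # step six: residual of the updated field
        if r <= r_min or u >= u_max:
            return state, u, r


# Toy example: the "field" is one number that halves on every pass.
final_state, iterations, final_residual = iterate(lambda s: s / 2, abs, 1.0)

# A pass that never converges stops at the iteration cap instead.
_, capped_iterations, _ = iterate(lambda s: s, lambda s: 1.0, 0.0)
```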
Step seven, outputting the updated lattice point matrix to memory and saving it as a file;
In the present invention,
Figure BDA0002167067360000153
is transferred to memory to update DAA_MPE, giving
Figure BDA0002167067360000154
and
Figure BDA0002167067360000155
is then saved by writing it to a file.
Example 1
Software and hardware environment parameters of the Shenwei 26010 heterogeneous many-core processor are as follows:
TABLE 1 Software and hardware environment
CPU | Memory | Compiler
SW26010, 1.45 GHz | 32 GB for 4 CGs | sw5cc
The data used in example 1 is grid-sized lattice data, using point sources to solve for quark propagators. The proportion of the operation time of each part of the program and the acceleration effect of the parallelization calculation are analyzed.
The data partitioning scheme of the present invention improves the bandwidth utilization between the slave cores and main memory (measured here through DMA transfers) and reduces the transfer of redundant data. Compared with the serial calculation method, data redundancy is reduced by a factor of 8 and the number of direct memory access transfers is reduced; as the experimental results in Table 2 show, DMA transfer performance improves by a factor of about 145.
TABLE 2 Comparison of direct memory access (DMA) transfer time for different data partitioning methods
Method | Total DMA transfer time (MPE clock cycles)
Serial calculation method | 22328320
Improved calculation method | 153468
TABLE 3 run time analysis
Figure BDA0002167067360000161
Table 3 shows that the method run in parallel on MPE + CPEs is 16.4 times faster than calculation on the MPE alone. Because the theoretical peak of one MPE is 23.2 GFLOPS and that of one CPE is 11.6 GFLOPS, the 64 CPEs of the 8 × 8 slave core array give a theoretical maximum speed-up of 64 × 11.6 / 23.2 = 32 times. By making extensive use of register communication among the slave cores, the method of the invention obtains a good parallel effect, reaching 51.3% of this theoretical maximum (16.4 / 32).
Table 3 also shows that vectorization improves the overall operating efficiency of the method by a factor of 3.9 compared with the non-vectorized version, which indicates that vectorizing the floating-point operations greatly improves operating efficiency.
Table 3 further shows that, by combining the data partitioning and transfer scheme, cooperative slave core computing and vectorized computing of the present invention, the method achieves a speed-up of 63.96 times over the original serial program.
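The figures quoted for Table 2 and Table 3 can be cross-checked with a few lines of arithmetic. All inputs below are taken from the text; the only assumption is that the theoretical maximum is computed over the 64 slave cores of one 8 × 8 array, which is what makes the stated factor of 32 come out. The variable names are illustrative.

```python
mpe_peak_gflops = 23.2   # theoretical peak of one MPE
cpe_peak_gflops = 11.6   # theoretical peak of one CPE
n_cpes = 64              # assumed: one 8 x 8 slave core array

# Theoretical maximum speed-up of the CPE array over a single MPE.
peak_ratio = n_cpes * cpe_peak_gflops / mpe_peak_gflops

parallel_speedup = 16.4  # measured, MPE + CPEs versus MPE only
parallel_efficiency = parallel_speedup / peak_ratio

vector_speedup = 3.9     # measured, vectorized versus non-vectorized
combined_speedup = parallel_speedup * vector_speedup

# DMA transfer time in MPE clock cycles, from Table 2.
dma_ratio = 22328320 / 153468
```

The results are a peak ratio of 32, an efficiency of about 51%, a combined speed-up of 63.96 and a DMA improvement of roughly 145, which agree with the values stated in the text.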
Referring to FIGS. 4 and 5, for each of the single master core serial algorithm, the MPE + CPEs parallel algorithm and the MPE + CPEs + SIMD vectorized parallel algorithm, the run time of the method is divided into two parts: reading the file and transferring data, and iterative calculation. The figures show that the method is computationally intensive: the iterative calculation occupies most of the program's run time, and its share grows further as the number of iterations increases. The data partitioning scheme designed specifically for the Shenwei 26010 heterogeneous many-core processor greatly reduces the share of the whole program's run time spent on data transfer between the slave cores and main memory.

Claims (5)

1. A lattice point quantum color dynamics parallel acceleration method based on a heterogeneous many-core processor is characterized by comprising the following steps:
Step one, initializing the slave core matrix positions of the heterogeneous many-core processor;
because the heterogeneous many-core processor contains a plurality of slave cores, the slave cores are divided according to their identification numbers, and the matrix position of each slave core is recorded;
the slave core set CPEs = {cpe_1, cpe_2, …, cpe_A} is arranged by position into an 8 × 8 matrix to obtain the slave core set position matrix add_CPEs; the position of any slave core is denoted d_{p,q}:
Figure FDA0002810856450000011
Step two, the master core reads the Fermi sub-field quantity and the standard field quantity;
Step 201, the master core MPE reads the data information and expresses all of the read data information as the set SPM = {S_1, S_2, …, S_g, …, S_G};
Step 202, the master core stores the Fermi sub-field quantity of the data information in SPM = {S_1, S_2, …, S_g, …, S_G} into an 8 × 8 lattice point matrix DAA_MPE in reading order, and saves DAA_MPE to memory;
Figure FDA0002810856450000012
Figure FDA0002810856450000013
denotes the value at the four-dimensional coordinate point
Figure FDA0002810856450000021
of the Fermi sub-field quantity of the first data information S_1;
Figure FDA0002810856450000022
denotes the Fermi sub-field quantity of S_1;
Figure FDA0002810856450000023
denotes the value at the four-dimensional coordinate point
Figure FDA0002810856450000024
of the Fermi sub-field quantity of the second data information S_2;
Figure FDA0002810856450000025
denotes the Fermi sub-field quantity of S_2;
Figure FDA0002810856450000026
denotes the value at the four-dimensional coordinate point
Figure FDA0002810856450000027
of the Fermi sub-field quantity of any data information S_g;
Figure FDA0002810856450000028
denotes the Fermi sub-field quantity of S_g;
Figure FDA0002810856450000029
denotes the value at the four-dimensional coordinate point
Figure FDA00028108564500000210
of the Fermi sub-field quantity of the last data information S_G;
Figure FDA00028108564500000211
denotes the Fermi sub-field quantity of S_G;
Step 203, the master core stores the standard field quantity of the data information in SPM = {S_1, S_2, …, S_g, …, S_G} into a 4 × 8 lattice point link matrix DBB_MPE in reading order, and saves DBB_MPE to memory;
Figure FDA00028108564500000212
Figure FDA00028108564500000213
denotes the value at the four-dimensional coordinate point
Figure FDA00028108564500000214
in the direction
Figure FDA00028108564500000215
of the standard field quantity of the first data information S_1;
Figure FDA00028108564500000216
denotes the direction selected at the four-dimensional coordinate point
Figure FDA00028108564500000217
of S_1;
Figure FDA00028108564500000218
denotes the standard field quantity of S_1;
Figure FDA00028108564500000219
denotes the value at the four-dimensional coordinate point
Figure FDA00028108564500000220
in the direction
Figure FDA00028108564500000221
of the standard field quantity of the second data information S_2;
Figure FDA00028108564500000222
denotes the direction selected at the four-dimensional coordinate point
Figure FDA00028108564500000223
of S_2;
Figure FDA00028108564500000224
denotes the standard field quantity of S_2;
Figure FDA00028108564500000225
denotes the value at the four-dimensional coordinate point
Figure FDA0002810856450000031
in the direction
Figure FDA0002810856450000032
of the standard field quantity of any data information S_g;
Figure FDA0002810856450000033
denotes the direction selected at the four-dimensional coordinate point
Figure FDA0002810856450000034
of S_g;
Figure FDA0002810856450000035
denotes the standard field quantity of S_g;
Figure FDA0002810856450000036
denotes the value at the four-dimensional coordinate point
Figure FDA0002810856450000037
in the direction
Figure FDA0002810856450000038
of the standard field quantity of the last data information S_G;
Figure FDA0002810856450000039
denotes the direction selected at the four-dimensional coordinate point
Figure FDA00028108564500000310
of S_G;
Figure FDA00028108564500000311
denotes the standard field quantity of S_G;
Step three, each slave core reads data information according to its own row number and column number, which realizes the data partitioning;
Step 301, according to its slave core matrix position d_{p,q} from step one, any slave core cpe_A reads the part of the DAA_MPE matrix for which cpe_A is responsible, divided along the Z-axis and T-axis directions, into its local storage space, where it is recorded as
Figure FDA00028108564500000312
The data information read in by all slave cores can be written as:
Figure FDA00028108564500000313
Step 302, according to its slave core matrix position d_{p,q} from step one, any slave core cpe_A reads the part of the DBB_MPE matrix for which cpe_A is responsible, divided along the Z-axis and T-axis directions, into its local storage space, where it is recorded as
Figure FDA00028108564500000314
The data information read in by all slave cores can be written as:
Figure FDA0002810856450000041
Step four, calculating the data information of any lattice point in any slave core;
Step 401, the lattice point Fermi sub-field quantity of any data information S_g is recorded as
Figure FDA0002810856450000042
then step 403 is executed;
Figure FDA0002810856450000043
Step 402, the standard field quantity of any data information S_g is recorded as
Figure FDA0002810856450000044
then step 403 is executed;
Figure FDA0002810856450000045
Step 403, starting from the data information S_g of any lattice point, the data information of its 8 adjacent lattice points in the x, y, z and t dimensions is obtained, and then the lattice point Fermi sub-field quantities and standard field quantities of those 8 adjacent lattice points are obtained; then step 404 is executed;
the data information of the 8 adjacent lattice points is denoted S_1, S_2, S_3, S_4, S_5, S_6, S_7 and S_8 respectively, and the central lattice point of the 8 adjacent lattice points is S_g; their lattice point Fermi sub-field quantities are recorded respectively as
Figure FDA0002810856450000046
Figure FDA0002810856450000047
and
Figure FDA0002810856450000048
while the lattice point Fermi sub-field quantity of S_g is recorded as
Figure FDA0002810856450000049
and the standard field quantities are recorded respectively as
Figure FDA00028108564500000410
and
Figure FDA00028108564500000411
the Fermi sub-field quantity of the first data information S_1 is recorded as
Figure FDA00028108564500000412
Figure FDA00028108564500000413
the standard field quantity of the first data information S_1 is recorded as
Figure FDA00028108564500000414
Figure FDA0002810856450000051
the Fermi sub-field quantity of the second data information S_2 is recorded as
Figure FDA0002810856450000052
Figure FDA0002810856450000053
the standard field quantity of the second data information S_2 is recorded as
Figure FDA0002810856450000054
Figure FDA0002810856450000055
the Fermi sub-field quantity of the third data information S_3 is recorded as
Figure FDA0002810856450000056
Figure FDA0002810856450000057
the standard field quantity of the third data information S_3 is recorded as
Figure FDA0002810856450000058
Figure FDA0002810856450000059
the Fermi sub-field quantity of the fourth data information S_4 is recorded as
Figure FDA00028108564500000510
Figure FDA00028108564500000511
the standard field quantity of the fourth data information S_4 is recorded as
Figure FDA00028108564500000512
Figure FDA00028108564500000513
the Fermi sub-field quantity of the fifth data information S_5 is recorded as
Figure FDA00028108564500000514
Figure FDA00028108564500000515
the standard field quantity of the fifth data information S_5 is recorded as
Figure FDA0002810856450000061
Figure FDA0002810856450000062
the Fermi sub-field quantity of the sixth data information S_6 is recorded as
Figure FDA0002810856450000063
Figure FDA0002810856450000064
the standard field quantity of the sixth data information S_6 is recorded as
Figure FDA0002810856450000065
Figure FDA0002810856450000066
the Fermi sub-field quantity of the seventh data information S_7 is recorded as
Figure FDA0002810856450000067
Figure FDA0002810856450000068
the standard field quantity of the seventh data information S_7 is recorded as
Figure FDA0002810856450000069
Figure FDA00028108564500000610
the Fermi sub-field quantity of the eighth data information S_8 is recorded as
Figure FDA00028108564500000611
Figure FDA00028108564500000612
the standard field quantity of the eighth data information S_8 is recorded as
Figure FDA00028108564500000613
Figure FDA00028108564500000614
Step 404, matrix multiplication is performed on the Fermi sub-field quantities and standard field quantities of the 8 adjacent lattice points; then step 405 is executed;
Figure FDA00028108564500000615
Figure FDA00028108564500000616
Figure FDA0002810856450000071
Figure FDA0002810856450000072
Figure FDA0002810856450000073
Figure FDA0002810856450000074
Figure FDA0002810856450000075
Figure FDA0002810856450000076
Step 405, the matrix products of the 8 adjacent lattice points are used to update the central lattice point S_g; the updated lattice point Fermi sub-field quantity of S_g is recorded as
Figure FDA0002810856450000077
and
Figure FDA0002810856450000078
that is, in
Figure FDA0002810856450000079
the data information S_g is updated to
Figure FDA00028108564500000710
then step five is executed;
Step five, each slave core applies the processing of step four in parallel to the data information of every lattice point in its local storage space, thereby obtaining the updated Fermi sub-field quantities of all lattice points, namely
Figure FDA00028108564500000711
then step six is executed;
Step six, after the update is finished, the iteration count is increased by 1, and the residual of the lattice point Fermi sub-field quantity is calculated;
the iteration count is denoted U, the maximum iteration count is denoted U_max with U_max = 1000, and the current iteration count is denoted U_current; if U_current < U_max, step four is executed; if U_current ≥ U_max, step seven is executed;
the residual of the lattice point Fermi sub-field quantity is denoted R, and the residual threshold is denoted R_min with R_min = 1.0 × 10^-12; if R > R_min, step four is executed; if R ≤ R_min, step seven is executed;
Step seven, outputting the updated lattice point matrix to memory and saving it as a file;
Figure FDA00028108564500000712
is transferred to memory to update DAA_MPE, giving
Figure FDA00028108564500000713
and
Figure FDA00028108564500000714
is then saved by writing it to a file.
2. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the entries of the slave core set position matrix add_CPEs are ordered by core identification number from smallest to largest.
3. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the bandwidth utilization between the slave cores and main memory is improved, and direct memory access transfer performance is improved by 145 times.
4. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the parallel acceleration reduces the run time and achieves a performance improvement of 63 times.
5. The heterogeneous many-core processor-based grid-point quantum color dynamics parallel acceleration method according to claim 1, characterized in that: the method is suitable for parallel acceleration on the Shenwei 26010 heterogeneous many-core processor.
CN201910750655.3A 2018-08-15 2019-08-14 Heterogeneous many-core processor-based grid point quantum color dynamics parallel acceleration method Active CN110516194B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810927168 2018-08-15
CN2018109271685 2018-08-15

Publications (2)

Publication Number Publication Date
CN110516194A CN110516194A (en) 2019-11-29
CN110516194B true CN110516194B (en) 2021-03-09

Family

ID=68625172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910750655.3A Active CN110516194B (en) 2018-08-15 2019-08-14 Heterogeneous many-core processor-based grid point quantum color dynamics parallel acceleration method

Country Status (1)

Country Link
CN (1) CN110516194B (en)

Also Published As

Publication number Publication date
CN110516194A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
Zaruba et al. Manticore: A 4096-core RISC-V chiplet architecture for ultraefficient floating-point computing
US8676874B2 (en) Data structure for tiling and packetizing a sparse matrix
Griebel et al. A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations
US8762655B2 (en) Optimizing output vector data generation using a formatted matrix data structure
TW201635143A (en) Work-efficient, load-balanced, merge-based parallelized consumption of sequences of sequences
Shimokawabe et al. 145 TFlops performance on 3990 GPUs of TSUBAME 2.0 supercomputer for an operational weather prediction
US20200302284A1 (en) Data compression for a neural network
CN110516194B (en) Heterogeneous many-core processor-based grid point quantum color dynamics parallel acceleration method
Solano-Quinde et al. Unstructured grid applications on GPU: performance analysis and improvement
He et al. A multiple-GPU based parallel independent coefficient reanalysis method and applications for vehicle design
Zhu et al. GPU acceleration of an iterative scheme for gas-kinetic model equations with memory reduction techniques
CN109753682B (en) GPU-based finite element stiffness matrix simulation method
Wang et al. Accelerating ap3m-based computational astrophysics simulations with reconfigurable clusters
WO2021250392A1 (en) Mixed-element-size instruction
US8564601B2 (en) Parallel and vectored Gilbert-Johnson-Keerthi graphics processing
Wang et al. Implementation of Jacobi iterative method on graphics processor unit
Tan et al. A pipelining loop optimization method for dataflow architecture
CN114116208A (en) Short wave radiation transmission mode three-dimensional acceleration method based on GPU
CN115756605A (en) Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs
Xu et al. Parallelizing a high-order CFD software for 3D, multi-block, structural grids on the TianHe-1A supercomputer
Yu et al. GPU-based JFNG method for power system transient dynamic simulation
JP4052181B2 (en) Communication hiding parallel fast Fourier transform method
Xu et al. Generalized GPU acceleration for applications employing finite-volume methods
Zhang et al. Accelerating lattice QCD on sunway many-core processor
Chen et al. Edge FPGA-based onsite neural network training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant