CN109871512B - Matrix multiplication acceleration method for heterogeneous fusion system structure
- Publication number: CN109871512B (application CN201910076766.0A)
- Authority: CN (China)
- Prior art keywords: matrix multiplication, matrix, cpu, fma, block
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Complex Calculations (AREA)
Abstract
The invention discloses a matrix multiplication acceleration method for a heterogeneous fusion system structure, and aims to design, for different many-core accelerator target architectures, a general matrix multiplication acceleration method for heterogeneous fusion system structures and to improve the use efficiency of heterogeneous systems. The technical scheme is to first design blocked matrix multiplication versions oriented to heterogeneous fusion system structures, comprising v_cpu, v_gpu, v_mic, v_scif, v_coi and v_target; then integrate and package the heterogeneous fusion multi-version matrix multiplications to generate a library file HU-xgemm of the heterogeneous fusion version; and finally adapt the accelerator in a heterogeneous fusion system structure by using HU-xgemm. The invention can adapt to different target accelerators and processors, can perform matrix multiplication adaptively according to different heterogeneous fusion system structures, and performs matrix multiplication according to the topological structure of the CPUs or accelerators in different heterogeneous fusion system structures, with each FMA computing in parallel, thereby accelerating matrix multiplication and improving the use efficiency of a heterogeneous system.
Description
Technical Field
The invention relates to a matrix multiplication acceleration method, in particular to a matrix multiplication acceleration method for heterogeneous fusion system structures in heterogeneous systems.
Background
With the continuous rise of general-purpose accelerator computing performance and the wide application of accelerators, many-core accelerators have become an important development direction of high-performance computing, and accelerators such as the GPU, the MIC (Xeon Phi) and the Matrix2000 that meet the requirements of various fields have been developed. With the wide application and popularization of heterogeneous systems, many different types of heterogeneous system structures such as CPU + GPU, CPU + MIC and CPU + Matrix2000 have emerged.
The design targets and design principles of an accelerator determine its specificity and limitations, and different accelerator manufacturers develop programming models adapted to their accelerators, such as CUDA supported by the GPU, Offload supported by the MIC, and COI (Coprocessor Offload Infrastructure), SCIF (Symmetric Communication Interface), OpenMP target and other programming models supported by the Matrix2000. Program design oriented to a target accelerator needs to redesign and implement the algorithm with a programming model supported by that accelerator before acceleration is possible; if the program is not redesigned and implemented according to the programming model supported by the accelerator, it basically cannot run and has no acceleration effect. Therefore, programs of different versions need to be designed for different heterogeneous systems: for CPU + GPU-oriented heterogeneous systems, algorithms and programs enabling efficient cooperation between the CPU and the GPU must be implemented; for CPU + MIC-oriented systems, algorithms and programs enabling efficient cooperation between the CPU and the MIC; and for CPU + Matrix2000-oriented systems, algorithms and programs enabling efficient cooperation between the CPU and the Matrix2000. With the updating, replacement and upgrading of heterogeneous system accelerators, programs for different accelerator versions need to be redesigned at different periods, and when multiple accelerators are mixed in one heterogeneous system, algorithms and programs for different target accelerators even need to be designed simultaneously.
For different heterogeneous systems, software designers need to understand the target system structure anew and learn a new programming model to implement existing algorithms, spending a great deal of time learning new knowledge to repeat existing work, possibly with poor results, which is not conducive to the design and development of algorithms in the field. Therefore, designing one set of general programs that can run on different heterogeneous systems can greatly liberate program designers and improve development efficiency.
Matrix multiplication is the most common operation in numerical calculation, and many applications contain a matrix multiplication calculation process, so improving the operation speed of matrix multiplication can to a great extent improve the speed of high-performance computing.
Matrix multiplication multiplies one row of the multiplicand matrix A by one column of the multiplier matrix B to obtain one element of the result matrix C. Matrix multiplication oriented to a heterogeneous system generally needs to distribute the matrix multiplication calculation reasonably between the main processor (CPU) and the many-core accelerator so that the calculation is completed cooperatively and in parallel, improving the operation speed of matrix multiplication and maximizing the calculation efficiency and use efficiency of the heterogeneous system.
Because the design target of the many-core accelerator is different from the instruction set structure, the conventional matrix multiplication implementation technology facing a general main processor hardly meets the performance requirement of the many-core accelerator designed facing a specific application, so the matrix multiplication needs to be accelerated facing a target architecture of the many-core accelerator to improve the operation speed of the matrix multiplication, and the design target of a heterogeneous system is met to the maximum extent.
If a heterogeneous fusion matrix multiplication acceleration method can be provided for heterogeneous systems such as CPU + GPU, CPU + MIC and CPU + Matrix2000 that shields the structural details of the target system, simplifies program development for heterogeneous systems and improves their efficiency, programmers can concentrate to the maximum extent on domain algorithm design and development without knowing the specific structure and instructions of the heterogeneous system, and the development restriction of many-core accelerators in the high-performance computing field can be effectively resolved; this is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: design a general matrix multiplication acceleration method facing different many-core accelerator target architectures, so as to shield the details of the target architecture to the maximum extent, reduce the matrix multiplication design difficulty and workload on heterogeneous systems, and improve the use efficiency of heterogeneous systems.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
firstly, designing blocked matrix multiplication versions oriented to heterogeneous fusion system structures, obtaining a CPU matrix multiplication acceleration version v_cpu, a GPU matrix multiplication acceleration version v_gpu, a MIC matrix multiplication acceleration version v_mic, and, when the heterogeneous fusion system structure consists of a CPU and a Matrix2000, a matrix multiplication acceleration version v_scif implemented in the SCIF programming mode, a matrix multiplication acceleration version v_coi implemented in the COI programming mode, and a matrix multiplication acceleration version v_target implemented in the OpenMP target programming mode. The specific steps are as follows:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
defining the dimensionality of matrix A as M × K and that of matrix B as K × N, the dimensionality of the result matrix C obtained by multiplying A and B is M × N, where M, K and N are positive integers; the element in the p-th row and q-th column of A is a_pq, 0 ≤ p ≤ M-1, 0 ≤ q ≤ K-1, and the element in the q-th row and t-th column of B is b_qt, 0 ≤ t ≤ N-1;
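As a concrete rendering of the notation just defined, the product C = A × B with C(p, t) = Σ_q a_pq × b_qt can be sketched in a few lines of Python (illustrative only; the patent prescribes no source code):

```python
def matmul(A, B):
    """Reference product C = A x B: C[p][t] = sum over q of A[p][q] * B[q][t].

    A is M x K and B is K x N, so C is M x N, matching the dimension
    definitions of step 1.1.
    """
    M, K = len(A), len(A[0])
    assert K == len(B), "inner dimensions must match"
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for p in range(M):          # 0 <= p <= M-1
        for t in range(N):      # 0 <= t <= N-1
            for q in range(K):  # 0 <= q <= K-1
                C[p][t] += A[p][q] * B[q][t]
    return C
```

All block-level versions below must reproduce the result of this reference loop.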
1.2 if the heterogeneous fusion system structure only consists of a CPU, initializing the CPU, and the specific method comprises the following steps:
1.2.1. inquiring an architecture manual to obtain the topology of the computing cores in one CPU; if the topology in the multi-core CPU is mg × ng, i.e. the mg × ng computing cores in one CPU are physically distributed in mg rows × ng columns, number the computing cores sequentially as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1), with 0 ≤ i ≤ mg-1 and 0 ≤ j ≤ ng-1;
1.2.2. querying the architecture manual to obtain the number m_e of floating-point vector multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each CPU computing core;
1.2.3. The initialization of the multi-core CPU is completed by using an initialization function (such as init) provided by an operating system;
1.3 matrix division, namely, partitioning A and B according to the topological structure of a CPU to obtain mg × ng A block matrixes and ng × mg B block matrixes, wherein the specific method comprises the following steps:
1.3.1 divide matrix A into mg × ng A block matrices; the arrangement of the mg × ng A block matrices is the same as that of the mg × ng computing cores, the block matrix in the i-th row and j-th column is denoted A_ij (0 ≤ i ≤ mg-1, 0 ≤ j ≤ ng-1), and the dimension of each block matrix is m × k, where m = ⌈M/mg⌉, k = ⌈K/ng⌉ and ⌈·⌉ denotes rounding up; the dimensions of the edge block matrices need special treatment. The specific matrix partitioning method is as follows:
1.3.1.1 let the block matrix row variable i = 0;
1.3.1.2 let the block matrix column variable j = 0;
1.3.1.3 let the block matrix element row variable m' = 0;
1.3.1.4 let the block matrix element column variable n' = 0;
1.3.1.5 let the row coordinate variable s = m × i + m';
1.3.1.6 let the column coordinate variable e = k × j + n';
1.3.1.7 if s ≤ M-1, go to 1.3.1.8; otherwise the rows of the current edge block are exhausted, go to 1.3.1.14;
1.3.1.8 if e ≤ K-1, go to 1.3.1.9; otherwise the current row of the edge block is exhausted, go to 1.3.1.12;
1.3.1.9 select element a_se of matrix A as the element in row m', column n' of block matrix A_ij;
1.3.1.10 n' = n' + 1;
1.3.1.11 if n' ≤ k-1, go to 1.3.1.6; otherwise go to 1.3.1.12;
1.3.1.12 m' = m' + 1;
1.3.1.13 if m' ≤ m-1, go to 1.3.1.4; otherwise go to 1.3.1.14;
1.3.1.14 j = j + 1;
1.3.1.15 if j ≤ ng-1, go to 1.3.1.3; otherwise go to 1.3.1.16;
1.3.1.16 i = i + 1;
1.3.1.17 if i ≤ mg-1, go to 1.3.1.2; otherwise matrix A is divided, go to 1.3.2;
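The element-by-element loop of step 1.3.1 amounts to cutting A into an mg × ng grid of blocks of nominal size m × k with smaller edge blocks. A compact Python sketch using slicing (illustrative; slice bounds past the matrix edge are clipped automatically, which is exactly the "special treatment" of the edge blocks):

```python
import math

def partition(A, mg, ng):
    """Cut an M x K matrix into an mg x ng grid of blocks.

    Nominal block size is m x k with m = ceil(M/mg), k = ceil(K/ng);
    blocks on the right and bottom edges come out smaller.
    """
    M, K = len(A), len(A[0])
    m, k = math.ceil(M / mg), math.ceil(K / ng)
    return [[[row[j * k:(j + 1) * k] for row in A[i * m:(i + 1) * m]]
             for j in range(ng)]
            for i in range(mg)]
```

The same routine applied to the K × N matrix B with an ng × mg grid yields the B block matrices of step 1.3.2.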
1.3.2 divide the K × N matrix B into ng × mg B block matrices; the arrangement of the ng × mg B block matrices matches that of the computing cores. The block matrix in the p-th row and q-th column is denoted B_pq (0 ≤ p ≤ ng-1, 0 ≤ q ≤ mg-1), and the dimension of each block matrix is k × n, where k = ⌈K/ng⌉, n = ⌈N/mg⌉ and ⌈·⌉ denotes rounding up; the dimensions of the edge block matrices need special treatment. The specific matrix division method is as follows:
1.3.2.1 let the B block matrix row variable p = 0;
1.3.2.2 let the B block matrix column variable q = 0;
1.3.2.3 let the block matrix element row variable k' = 0;
1.3.2.4 let the block matrix element column variable n'' = 0;
1.3.2.5 let the row coordinate variable s' = k × p + k';
1.3.2.6 let the column coordinate variable e' = n × q + n'';
1.3.2.7 if s' ≤ K-1, go to 1.3.2.8; otherwise the rows of the current edge block are exhausted, go to 1.3.2.14;
1.3.2.8 if e' ≤ N-1, go to 1.3.2.9; otherwise the current row of the edge block is exhausted, go to 1.3.2.12;
1.3.2.9 select element b_s'e' of matrix B as the element in row k', column n'' of block matrix B_pq;
1.3.2.10 n'' = n'' + 1;
1.3.2.11 if n'' ≤ n-1, go to 1.3.2.6; otherwise go to 1.3.2.12;
1.3.2.12 k' = k' + 1;
1.3.2.13 if k' ≤ k-1, go to 1.3.2.4; otherwise go to 1.3.2.14;
1.3.2.14 q = q + 1;
1.3.2.15 if q ≤ mg-1, go to 1.3.2.3; otherwise go to 1.3.2.16;
1.3.2.16 p = p + 1;
1.3.2.17 if p ≤ ng-1, go to 1.3.2.2; otherwise matrix B is divided, go to 1.4;
1.4 initialize the result matrix C to 0, i.e. assign each element of C the value 0: let C(mo, no) = 0, where C(mo, no) denotes the element in the mo-th row and no-th column of the result matrix, 0 ≤ mo ≤ M-1, 0 ≤ no ≤ N-1;
1.5 matrix multiplication acceleration, obtaining the C block matrices: the mg × ng × m_e FMAs take the shared memory space of the CPU computing cores as the shared storage space and a scalar data buffer as the data buffer, and execute the block matrix multiplications in parallel, so that the matrix multiplication is accelerated; each FMA independently finishes the block matrix multiplication assigned to it, and the specific calculation process by which the FMA numbered (i', j') finishes block matrix multiplication C_i'j' is as follows:
1.5.1 let a block matrix row variable i' be 0;
1.5.2 let a block matrix column variable j' be 0;
1.5.3 the computing core (i', j') reads A block matrix A_i'j' from memory into the shared storage space (the shared memory space of the CPU computing cores serves as a portable Cache);
1.5.4 let the B block matrix column variable k'' = 0;
1.5.5 the computing core reads B block matrix B_j'k'' from the shared storage space into the FMA data buffer;
1.5.6 the mg × ng × m_e FMAs perform the following vector multiplication operations in parallel:
1.5.7 obtain the dimensions m0 × k0 of A block matrix A_i'j', 1 ≤ m0 ≤ m, 1 ≤ k0 ≤ k, and the column count n0 of B block matrix B_j'k''; let the column variable j'' = 0;
1.5.8 compute the vector loop count Vz = ⌊m0/m_e⌋, i.e. the number of full m_e-element segments in a column;
1.5.9 initialize the vector loop variable v = 0;
1.5.10 initialize the broadcast variable r = 0;
1.5.11 from the r-th column of A block matrix A_i'j', starting at element v × m_e, read m_e consecutive elements to form the m_e-dimensional vector V_a;
1.5.12 take the j''-th element of the r-th row of B block matrix B_j'k'' and broadcast it into an m_e-dimensional vector V_b;
1.5.13 multiply vectors V_a and V_b element by element to obtain the partial products of m_e consecutive elements of column j'' of matrix C; the specific multiplication rule is V_c(x) = V_a(x) × V_b(x), 0 ≤ x ≤ m_e - 1;
1.5.14 return the partial products of the m_e consecutive elements of column j'' of matrix C obtained in 1.5.13 to the shared storage space;
1.5.15 if j "> 1, go to 1.5.16, otherwise, go to 1.5.18;
1.5.16 add the stored partial products of the m_e consecutive elements of column j'' to the corresponding stored partial products of the m_e consecutive elements of column j''-1, the specific addition method being as follows:
1.5.16.1 let i3 = v × m_e;
1.5.16.2 C_i'k''(i3, j'') = C_i'k''(i3, j'') + C_i'k''(i3, j''-1), where C_i'k''(i3, j'') denotes the element in row i3, column j'' of the result block matrix C_i'k'';
1.5.16.3 i3 = i3 + 1;
1.5.16.4 if i3 ≤ v × m_e + (m_e - 1), go to 1.5.16.2; otherwise go to 1.5.16.5;
1.5.16.5 return the results of the additions in 1.5.16.2 to the shared storage space;
1.5.17 let broadcast variable r be r + 1;
1.5.18 if r < k0, go to 1.5.11, otherwise go to 1.5.21;
1.5.19 let vector loop variable v be v + 1;
1.5.20 if v < Vz, go to 1.5.11, otherwise go to 1.5.21;
1.5.21 define the vector loop remainder vr = m0 - v × m_e;
1.5.22 from the r-th column of A block matrix A_i'j', starting at element v × m_e, read vr consecutive elements as the first vr components of an m_e-dimensional vector; the remaining m_e - vr components are filled with 0, forming the vector V_a;
1.5.23 if v is Vz, go to 1.5.13, otherwise go to 1.5.24;
1.5.24 j”=j”+1;
1.5.25 if j'' < n0, go to 1.5.9; otherwise the calculation in the column direction of the block is finished, go to 1.5.26;
1.5.26 k”=k”+1;
1.5.27 if k'' ≤ mg-1, go to 1.5.5; otherwise the vector calculation in the row direction of the A block matrix is finished, go to 1.5.28;
1.5.28 j'=j'+1;
1.5.29 if j' is less than or equal to ng-1, turning to 1.5.3, otherwise, turning to 1.5.30 after the calculation of the vector in the row direction is finished;
1.5.30 i'=i'+1;
1.5.31 if i' ≤ mg-1, go to 1.5.2; otherwise the calculation in the row length direction is finished and C_i'j' is obtained; go to 1.6;
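Stripped of the hardware detail (shared storage space, data buffers), the computation of step 1.5 processes each column of a C block in vertical segments of m_e elements, broadcasting one B element per step. A plain-Python model of this behaviour (an illustrative sketch; the per-FMA parallelism and the remainder-vector padding of 1.5.21-1.5.22 are folded into the min() bound):

```python
def block_matmul_fma(Ab, Bb, me):
    """Multiply an m0 x k0 A block by a k0 x n0 B block, m_e rows at a time.

    For each column j'' of the C block and each m_e-row segment, the
    broadcast scalar Bb[r][j''] multiplies an m_e-slice of column r of
    the A block and is accumulated -- one fused multiply-add per lane.
    """
    m0, k0 = len(Ab), len(Ab[0])
    n0 = len(Bb[0])
    C = [[0.0] * n0 for _ in range(m0)]
    segments = -(-m0 // me)                    # full chunks plus remainder
    for jpp in range(n0):                      # column variable j''
        for v in range(segments):              # vector loop variable v
            lo, hi = v * me, min((v + 1) * me, m0)
            for r in range(k0):                # broadcast variable r
                b = Bb[r][jpp]                 # broadcast into m_e lanes
                for s in range(lo, hi):
                    C[s][jpp] += Ab[s][r] * b  # the FMA
    return C
```

Each FMA runs one such loop nest on its own (i', j') block pair; the hardware executes the innermost loop as a single m_e-wide vector operation.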
1.6 merge the results: according to the data distribution principle, take the shared storage space of the CPU computing cores as the shared storage space and merge the results in the shared storage space into the result matrix C; the specific method is as follows:
1.6.1 let the matrix C row variable u be 0;
1.6.2 let the matrix C column variable v be 0;
1.6.3 transmit the calculation result of C block matrix C_uv back to the shared storage space;
1.6.4 obtain the dimensions mc × nc of C_uv, 1 ≤ mc ≤ m, 1 ≤ nc ≤ n;
1.6.5 let the C block matrix row coordinate variable ic = 0;
1.6.6 let the C block matrix column coordinate variable jc = 0;
1.6.7 C(ic, jc) = C(ic, jc) + C_uv(ic, jc), where C(ic, jc) denotes the element in row ic, column jc of matrix C, and C_uv(ic, jc) denotes the element in row ic, column jc of block matrix C_uv;
1.6.8 jc = jc + 1;
1.6.9 if jc < nc, go to 1.6.7; otherwise go to 1.6.10;
1.6.10 ic = ic + 1;
1.6.11 if ic < mc, go to 1.6.6; otherwise go to 1.6.12;
1.6.12 v = v + 1;
1.6.13 if v ≤ mg-1, go to 1.6.3; otherwise go to 1.6.14;
1.6.14 u = u + 1;
1.6.15 if u ≤ mg-1, go to 1.6.2; otherwise the merge is finished and C is obtained; the matrix multiplication implemented in the MPI programming mode following steps 1.2-1.6 is the CPU matrix multiplication acceleration version v_cpu.
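The merge of step 1.6 writes each computed C block back into the full M × N result. The sketch below places block (u, v) at the row and column offsets accumulated from the preceding blocks; these offsets are implicit in the data distribution of steps 1.3-1.5 (the step text copies without writing them out), so treat them as an assumption of this illustration:

```python
def merge_blocks(C_blocks, M, N):
    """Assemble a grid of C blocks into the full M x N result matrix.

    C_blocks is a list of block rows; block (u, v) lands at the row
    offset accumulated from the block rows above it and the column
    offset accumulated from the blocks to its left.
    """
    C = [[0.0] * N for _ in range(M)]
    row = 0
    for block_row in C_blocks:
        col = 0
        for blk in block_row:
            for ic in range(len(blk)):
                for jc in range(len(blk[0])):
                    C[row + ic][col + jc] += blk[ic][jc]  # step 1.6.7
            col += len(blk[0])
        row += len(block_row[0])
    return C
```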
1.7 if the heterogeneous fusion system structure consists of a CPU and a GPU, a GPU-oriented matrix multiplication acceleration version v_gpu is implemented in the CUDA programming mode; the specific method is as follows:
1.7.1 query the architecture manual to obtain the topology of the computing cores in each GPU, i.e. the topology mg × ng of the thread processing units TPU (Thread Processing Unit) in the GPU: there are mg × ng TPUs in a GPU, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.7.2 query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each TPU;
1.7.3, utilizing a function (such as init) provided by an operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 the mg × ng × m_e FMAs take the Shared Memory of the GPU as the shared storage space and the Constant Memory of the GPU as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it, and the FMA numbered (i', j') completes block matrix multiplication C_i'j';
1.7.6 take the Shared Memory of the GPU as the shared storage space and merge the C block matrix results calculated by each FMA with the result merging method of step 1.6 to obtain C. The matrix multiplication implemented following 1.7.1-1.7.6 is v_gpu.
1.8 if the heterogeneous fusion system structure consists of a CPU and a MIC: since task scheduling between the CPU and the MIC is not considered in this patent, only the matrix multiplication running on the target accelerator is of concern, and the MIC matrix multiplication acceleration version v_mic is implemented in the Offload programming mode; the specific method is as follows:
1.8.1 query the architecture manual to obtain the computing cores in each MIC, i.e. the topology mg × ng of the vector processing units VPU (Vector Processing Unit) in the MIC: there are mg × ng VPUs in a MIC, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.8.2 query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPU;
1.8.3, the initialization is completed by using the function (such as init) provided by the operation system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 the mg × ng × m_e FMAs take the memory of the MIC as the shared storage space and the portable Cache of the shared storage space as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it, and the FMA numbered (i', j') completes block matrix multiplication C_i'j';
1.8.6 take the memory of the MIC as the shared storage space and merge the C block matrix results calculated by each FMA with the result merging method of step 1.6 to obtain C. The matrix multiplication implemented following 1.8.1-1.8.6 is v_mic.
1.9 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the SCIF programming mode, the matrix multiplication acceleration version v_scif is implemented in the SCIF programming mode; the specific method is as follows:
1.9.1 query the architecture manual to obtain the computing cores in each Matrix2000, i.e. the topology mg × ng of the vector processing elements VPE (Vector Processing Element) in the Matrix2000: there are mg × ng VPEs in a Matrix2000, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.9.2 query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.9.3, using the function (such as init) provided by the operating system to complete initialization;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 the mg × ng × m_e FMAs take the global space Global Cache of Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it, and the FMA numbered (i', j') completes block matrix multiplication C_i'j';
1.9.6 take the shared memory space Array Memory in Matrix2000 as the shared storage space and merge the C block matrix results calculated by each FMA with the result merging method of step 1.6 to obtain C. The matrix multiplication implemented following 1.9.1-1.9.6 is v_scif.
1.10 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the COI (Coprocessor Offload Infrastructure) programming mode, the matrix multiplication acceleration version v_coi is implemented in the COI programming mode; the specific method is as follows:
1.10.1 query the architecture manual to obtain the computing cores in the system, i.e. the topology mg × ng of the vector processing elements VPE in Matrix2000: there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.10.2 query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.10.3, utilizing a function (such as init) provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 the mg × ng × m_e FMAs take the global space Global Cache of Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it, and the FMA numbered (i', j') completes block matrix multiplication C_i'j';
1.10.6 take the shared memory space Array Memory in Matrix2000 as the shared storage space and merge the C block matrix results calculated by each FMA with the result merging method of step 1.6 to obtain C. The matrix multiplication implemented following 1.10.1-1.10.6 is v_coi.
1.11 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the OpenMP target programming mode, the matrix multiplication acceleration version v_target is implemented in the OpenMP target programming mode; the specific method is as follows:
1.11.1 query the architecture manual to obtain the computing cores in the system, i.e. the topology mg × ng of the vector processing elements VPE in Matrix2000: there are mg × ng VPEs in the system, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.11.2 query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.11.3, the initialization is completed by a function (such as init) provided by system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 based on the Global Cache and the shared memory space Array Memory of Matrix2000, the mg × ng × m_e FMAs take the Global Cache of Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it, and the FMA numbered (i', j') completes block matrix multiplication C_i'j';
1.11.6 take the shared memory space Array Memory in Matrix2000 as the shared storage space and merge the C block matrix results completed by each FMA with the result merging method of step 1.6 to obtain C. The matrix multiplication implemented following 1.11.1-1.11.6 is v_target.
Step two, integrate the heterogeneous fusion multi-version matrix multiplications; the specific method is as follows:
2.1 separately compile the source code corresponding to v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif;
2.2 use the tar command to pack the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the library file HU-xgemm of the heterogeneous fusion version.
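Step 2.2 is an ordinary archiving operation. A Python equivalent of the tar command (entry names and contents here are hypothetical stand-ins for the compiled executables) might look like:

```python
import io
import tarfile

def package_hu_xgemm(executables, out_path="HU-xgemm.tar"):
    """Pack the per-version executables into one HU-xgemm library archive.

    executables maps an entry name (e.g. the hypothetical "v_cpu") to
    the bytes of the corresponding executable file.
    """
    with tarfile.open(out_path, "w") as tar:
        for name, data in executables.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out_path
```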
Step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 inquiring the accelerator type in the heterogeneous fusion system structure by using an operating system command lspci | grep processor, and enabling an accelerator type variable arc to be the accelerator type;
3.2 if arc = Matrix2000, go to 3.2.1; otherwise go to 3.3;
3.2.1. inquiring a Matrix2000 programming technical manual, confirming a supported programming model, and assigning a value to a programming model variable prolan;
3.2.2. if prolan = OpenMP target, call v_cpu and v_target in HU-xgemm to complete the matrix multiplication on the main-processor CPU side and on the accelerator Matrix2000 side respectively (dividing the matrix multiplication into a part assigned to the CPU and a part assigned to the accelerator (such as Matrix2000, GPU or MIC) belongs to the task scheduling category; a known scheduling method such as "a method for dynamically balancing heterogeneous system computing load" (ZL 201410544782.5) may be adopted); otherwise go to 3.2.3;
3.2.3. if prolan = COI, call v_cpu and v_coi in HU-xgemm to complete the matrix multiplication on the main-processor CPU side and on the accelerator Matrix2000 side respectively; otherwise go to 3.2.4;
3.2.4. if prolan = SCIF, call v_cpu and v_scif in HU-xgemm to complete the matrix multiplication on the main-processor CPU side and on the accelerator Matrix2000 side respectively; otherwise go to 3.3;
3.3 if arc = MIC, call v_cpu and v_mic in HU-xgemm to complete the matrix multiplication on the main-processor CPU side and on the accelerator MIC side respectively; otherwise go to 3.4;
3.4 if arc = GPU, call v_cpu and v_gpu in HU-xgemm to complete the matrix multiplication on the main-processor CPU side and on the accelerator GPU side respectively; otherwise go to 3.5;
3.5 if there is no special accelerator in the system, i.e. only a CPU, call v_cpu in HU-xgemm to complete the matrix multiplication calculation.
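The adaptation logic of step three reduces to a dispatch from the detected accelerator type arc (and, for Matrix2000, the programming model prolan) to the pair of versions to call from HU-xgemm. An illustrative Python sketch (version names as plain strings; the CPU/accelerator task split of 3.2.2 is omitted):

```python
def select_versions(arc, prolan=None):
    """Map (arc, prolan) to the HU-xgemm versions to invoke.

    Mirrors steps 3.2-3.5: Matrix2000 dispatches on the supported
    programming model; MIC and GPU pair v_cpu with their own version;
    anything else falls back to the CPU-only version.
    """
    if arc == "Matrix2000":
        by_model = {"OpenMP target": "v_target", "COI": "v_coi", "SCIF": "v_scif"}
        if prolan in by_model:
            return ("v_cpu", by_model[prolan])
    elif arc == "MIC":
        return ("v_cpu", "v_mic")
    elif arc == "GPU":
        return ("v_cpu", "v_gpu")
    return ("v_cpu",)  # no special accelerator: CPU only (step 3.5)
```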
The invention can achieve the following technical effects:
1. By implementing and integrating general matrix multiplication versions suitable for various accelerators, the invention adapts to different target accelerators and processors and performs matrix multiplication adaptively for different heterogeneous fusion architectures, which reduces the difficulty of developing matrix multiplication for accelerators and lightens the workload.
2. The invention performs matrix multiplication according to the topology of the CPUs or accelerators in the different heterogeneous fusion architectures, with every FMA computing in parallel, thereby accelerating the matrix multiplication and improving the use efficiency of the heterogeneous system.
Drawings
FIG. 1 is a general flowchart of the heterogeneous fusion matrix multiplication acceleration method of the present invention.
Detailed Description
FIG. 1 is a general flowchart of the matrix multiplication acceleration method oriented to a heterogeneous fusion architecture according to the present invention.
The method comprises the following steps:
Firstly, a block matrix multiplication version oriented to the heterogeneous fusion architecture is designed, yielding the six versions v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif; the concrete steps are as follows:
1.1 Configuration and initialization of the heterogeneous fusion architecture; the specific method is:
Define the dimension of matrix A as M×K and the dimension of matrix B as K×N; the dimension of the result matrix C obtained by multiplying A and B is M×N, where M, K and N are positive integers. The element in row p, column q of A is a_pq, 0≤p≤M-1, 0≤q≤K-1, and the element in row q, column t of B is b_qt, 0≤t≤N-1;
1.2 If the heterogeneous fusion architecture consists only of a CPU, the CPU is initialized (the CPU matrix multiplication acceleration version v_cpu is implemented in the MPI programming mode); the method comprises:
1.2.1. Query the architecture manual to obtain the topology of the computing cores in one CPU; if the topology in the multi-core CPU is mg×ng, i.e. the mg×ng computing cores in one CPU are physically distributed in mg rows × ng columns, number the computing cores sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1), where 0≤i≤mg-1 and 0≤j≤ng-1;
1.2.2. Query the architecture manual to obtain the number m_e of floating-point vector multiply-accumulate functional units (FMA) owned by each CPU computing core;
1.2.3. Complete the initialization of the multi-core CPU with the initialization function provided by the operating system;
1.3 Matrix division is carried out on A and B according to the topology of the CPU: the matrix A is divided into mg×ng A block matrices arranged in the same way as the mg×ng computing cores, the block matrix in row i, column j being denoted A_ij, 0≤i≤mg-1, 0≤j≤ng-1, and each block matrix having dimension m×k, where m=⌈M/mg⌉, k=⌈K/ng⌉ and ⌈·⌉ denotes rounding up; the K×N matrix B is divided into ng×mg B block matrices arranged in the same way as the computing cores, the block matrix in row r, column s being denoted B_rs, 0≤r≤ng-1, 0≤s≤mg-1, and each block matrix having dimension k×n, where n=⌈N/mg⌉;
1.4 Initialize the result matrix C to 0, i.e. assign each element in C the value 0: let C(mo,no)=0, where C(mo,no) denotes the element in row mo, column no of the result matrix, 0≤mo≤M-1, 0≤no≤N-1;
1.5 Matrix multiplication acceleration, obtaining mg×ng×m_e C block matrices: the mg×ng×m_e FMAs take the shared memory space of the CPU computing cores as the shared storage space and a scalar data buffer as the data buffer, and execute the block matrix multiplication operations in parallel, thereby accelerating the matrix multiplication; each FMA independently completes the block matrix multiplication allocated to it, the FMA numbered (i',j') completing the block matrix multiplication assigned to it;
1.6 Result merging: according to the data distribution principle, taking the shared storage space of the CPU computing cores as the shared storage space, the results in the shared storage space are merged to form the result matrix C. Carrying out the matrix multiplication of 1.2-1.6 in the MPI (Message Passing Interface) programming mode yields the CPU matrix multiplication acceleration version v_cpu;
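A minimal serial sketch of steps 1.3-1.6 (block partition, per-block multiplication, result merging); the sizes, the evenly divisible blocking, and the sequential loop over the core grid are illustrative assumptions — in the patented method the block multiplications run in parallel across MPI ranks or FMAs:

```c
#include <assert.h>
#include <string.h>

#define M 4
#define K 4
#define N 4
#define MG 2   /* rows of the core grid */
#define NG 2   /* columns of the core grid */

double C[M][N];

/* The "core" at grid position (bi, bj) computes its own block of C:
   every element C[p][q] with p in row-block bi and q in column-block bj. */
static void block_multiply(double A[M][K], double B[K][N], int bi, int bj) {
    int m = M / MG, n = N / NG;              /* block sizes (evenly divisible case) */
    for (int p = bi * m; p < (bi + 1) * m; p++)
        for (int q = bj * n; q < (bj + 1) * n; q++) {
            double acc = 0.0;
            for (int r = 0; r < K; r++)      /* walk the full inner dimension */
                acc += A[p][r] * B[r][q];
            C[p][q] = acc;                   /* merging is trivial: blocks are disjoint */
        }
}

void run(double A[M][K], double B[K][N]) {
    memset(C, 0, sizeof C);                  /* step 1.4: initialize C to 0 */
    for (int bi = 0; bi < MG; bi++)          /* steps 1.5-1.6: the patent executes   */
        for (int bj = 0; bj < NG; bj++)      /*   these block multiplies in parallel */
            block_multiply(A, B, bi, bj);
}
```

Because each core writes a disjoint region of C, the merge of step 1.6 amounts to gathering the per-block results into the shared storage space.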
1.7 If the heterogeneous fusion architecture consists of a CPU and a GPU, the GPU-oriented matrix multiplication acceleration version v_gpu is implemented in the CUDA programming mode; the specific method is:
1.7.1 Query the architecture manual to obtain the topology of the computing cores in each GPU, i.e. the topology mg×ng of the thread processing units (TPU) in the GPU: the mg×ng TPUs in the GPU are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.7.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units (FMA) owned by each TPU;
1.7.3, utilizing the function provided by the operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 The mg×ng×m_e FMAs take the Shared Memory of the GPU as the shared storage space and the Constant Memory of the GPU as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.7.6, taking the Shared Memory of the GPU as a Shared storage space, and merging the C block matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.8 If the heterogeneous fusion architecture consists of a CPU and an MIC, the MIC matrix multiplication acceleration version v_mic is implemented in the Offload programming mode; the specific method is:
1.8.1 Query the architecture manual to obtain the computing cores in each MIC, i.e. the topology mg×ng of the vector processing units (VPU) in the MIC: the mg×ng VPUs in one MIC are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.8.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units (FMA) owned by each VPU;
1.8.3 initialization is accomplished by using the functions provided by the operating system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 The mg×ng×m_e FMAs take the memory space of the MIC as the shared storage space and its portable Cache as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.8.6 using the memory space of MIC as the shared storage space, adopting the result merging method in the step 1.6 to finish the merging of the C block matrix results calculated by each FMA, and obtaining C;
1.9 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the SCIF programming mode, the matrix multiplication acceleration version v_scif is implemented in the SCIF programming mode; the specific method is:
1.9.1 Query the architecture manual to obtain the topology mg×ng of the computing cores, i.e. the vector processing units (VPE), in each Matrix2000: the mg×ng VPEs in the Matrix2000 are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.9.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.9.3 initialization is done using functions provided by the operating system;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 The mg×ng×m_e FMAs take the Global Cache of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.9.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.10 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the COI programming mode, the matrix multiplication acceleration version v_coi is implemented in the COI programming mode; the specific method is:
1.10.1 Query the architecture manual to obtain the computing cores in the system, i.e. the topology mg×ng of the vector processing units (VPE) in the Matrix2000: the mg×ng VPEs in the Matrix2000 are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.10.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units (FMA) owned by each VPE;
1.10.3, utilizing a function provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 The mg×ng×m_e FMAs take the Global Cache of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.10.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.11 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the OpenMP target programming mode, the OpenMP target matrix multiplication acceleration version v_target is implemented in the OpenMP target programming mode; the specific method is:
1.11.1 Query the architecture manual to obtain the computing cores in the system, i.e. the topology mg×ng of the vector processing units (VPE) in the Matrix2000: the mg×ng VPEs in the system are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.11.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.11.3 Initialization is completed with the functions provided by the system software;
1.11.4 Matrix division: based on the topology of the VPEs, A and B are partitioned with the matrix division method of step 1.3, obtaining mg×ng A block matrices and ng×mg B block matrices;
1.11.5 Based on the Global Cache and the Array Memory of the Matrix2000, the mg×ng×m_e FMAs take the Global Cache of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.11.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results completed by each FMA by the result merging method in step 1.6 to obtain C;
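As an illustrative sketch (not the patent's actual v_target implementation), an OpenMP target offload of the multiplication could look as follows; the sizes are hypothetical, and if no accelerator device is present OpenMP executes the region on the host:

```c
#include <assert.h>

#define M 4
#define K 4
#define N 4

/* C = A * B in row-major layout, offloaded to the target device
   (e.g. Matrix2000) when one is available; otherwise the OpenMP
   runtime falls back to running the region on the host CPU. */
void target_gemm(const double *A, const double *B, double *C) {
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: A[0:M*K], B[0:K*N]) map(from: C[0:M*N])
    for (int p = 0; p < M; p++)
        for (int q = 0; q < N; q++) {
            double acc = 0.0;
            for (int r = 0; r < K; r++)
                acc += A[p * K + r] * B[r * N + q];
            C[p * N + q] = acc;
        }
}
```

The map clauses copy the inputs to device memory and the result back, which mirrors the host/accelerator division of labor described in steps 1.11.5-1.11.6.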
The second step: integrate the heterogeneous fusion multi-version matrix multiplications; the method is:
2.1 Separately compile the source code corresponding to v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif;
2.2 Use the tar command to pack the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the heterogeneous fusion version library file HU-xgemm;
step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 Query the accelerator type in the heterogeneous fusion architecture with the operating system command lspci | grep processor, and set the accelerator type variable arc to the accelerator type;
3.2 if arc = Matrix2000, run 3.2.1; otherwise, turn to 3.3;
3.2.1. Query the Matrix2000 programming technical manual, confirm the supported programming model, and assign a value to the programming model variable prolan;
3.2.2. if prolan = OpenMP target, call v_cpu and v_target in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator Matrix2000, respectively; otherwise, turn to 3.2.3;
3.2.3. if prolan = COI, call v_cpu and v_coi in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator Matrix2000, respectively; otherwise, turn to 3.2.4;
3.2.4. if prolan = SCIF, call v_cpu and v_scif in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator Matrix2000, respectively; otherwise, turn to 3.3;
3.3 if arc = MIC, call v_cpu and v_mic in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator MIC, respectively; otherwise, turn to 3.4;
3.4 if arc = GPU, call v_cpu and v_gpu in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator GPU, respectively; otherwise, turn to 3.5;
3.5 if there is no special accelerator in the system, i.e. only a CPU, call v_cpu in HU-xgemm to complete the matrix multiplication.
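The selection logic of step three can be sketched as a simple string dispatch; the function name and the returned version names below are hypothetical stand-ins for the versions packed into HU-xgemm:

```c
#include <assert.h>
#include <string.h>

/* Pick the accelerator-side version of HU-xgemm to invoke, per steps
   3.2-3.5; returns "vcpu" when only a CPU exists (step 3.5). */
const char *select_version(const char *arc, const char *prolan) {
    if (strcmp(arc, "Matrix2000") == 0) {                           /* 3.2   */
        if (strcmp(prolan, "OpenMP target") == 0) return "vtarget"; /* 3.2.2 */
        if (strcmp(prolan, "COI") == 0)           return "vcoi";    /* 3.2.3 */
        if (strcmp(prolan, "SCIF") == 0)          return "vscif";   /* 3.2.4 */
    }
    if (strcmp(arc, "MIC") == 0) return "vmic";                     /* 3.3   */
    if (strcmp(arc, "GPU") == 0) return "vgpu";                     /* 3.4   */
    return "vcpu";                                                  /* 3.5   */
}
```

In each branch the selected accelerator version runs alongside v_cpu, so the host and the accelerator each complete their share of the matrix multiplication.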
Claims (2)
1. A matrix multiplication accelerating method oriented to a heterogeneous fusion architecture is characterized by comprising the following steps:
the method comprises the following steps of firstly, designing a block matrix multiplication version oriented to a heterogeneous fusion system structure, and specifically:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
Define the dimension of matrix A as M×K, the dimension of matrix B as K×N, and the dimension of the result matrix C obtained by multiplying A and B as M×N, where M, K and N are positive integers; the element in row p, column q of A is a_pq, 0≤p≤M-1, 0≤q≤K-1, and the element in row q, column t of B is b_qt, 0≤t≤N-1;
1.2 if the heterogeneous fusion system structure only consists of a CPU, initializing the CPU, wherein the method comprises the following steps:
1.2.1. Query the architecture manual to obtain the topology of the computing cores in one CPU; if the topology in the multi-core CPU is mg×ng, i.e. the mg×ng computing cores in one CPU are physically distributed in mg rows × ng columns, number the computing cores sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1), where 0≤i≤mg-1 and 0≤j≤ng-1;
1.2.2. Query the architecture manual to obtain the number m_e of floating-point vector multiply-accumulate functional units (FMA) owned by each CPU computing core;
1.2.3. Completing the initialization of the multi-core CPU by using an initialization function provided by an operating system;
1.3 Matrix division is carried out on A and B according to the topology of the CPU: the matrix A is divided into mg×ng A block matrices arranged in the same way as the mg×ng computing cores, the block matrix in row i, column j being denoted A_ij, 0≤i≤mg-1, 0≤j≤ng-1, and each block matrix having dimension m×k, where m=⌈M/mg⌉, k=⌈K/ng⌉ and ⌈·⌉ denotes rounding up; the K×N matrix B is divided into ng×mg B block matrices arranged in the same way as the computing cores, the block matrix in row r, column s being denoted B_rs, 0≤r≤ng-1, 0≤s≤mg-1, and each block matrix having dimension k×n, where n=⌈N/mg⌉;
1.4 Initialize the result matrix C to 0, i.e. assign each element in C the value 0: let C(mo,no)=0, where C(mo,no) denotes the element in row mo, column no of the result matrix, 0≤mo≤M-1, 0≤no≤N-1;
1.5 Matrix multiplication acceleration, obtaining mg×ng×m_e C block matrices: the mg×ng×m_e FMAs take the shared memory space of the CPU computing cores as the shared storage space and a scalar data buffer as the data buffer, and execute the block matrix multiplication operations in parallel, thereby accelerating the matrix multiplication; each FMA independently completes the block matrix multiplication allocated to it, the FMA numbered (i',j') completing the block matrix multiplication assigned to it;
1.6 Result merging: according to the data distribution principle, taking the shared storage space of the CPU computing cores as the shared storage space, the results in the shared storage space are merged to form the result matrix C. Carrying out the matrix multiplication of 1.2-1.6 in the MPI (Message Passing Interface) programming mode yields the CPU matrix multiplication acceleration version v_cpu;
1.7 If the heterogeneous fusion architecture consists of a CPU and a GPU, the GPU-oriented matrix multiplication acceleration version v_gpu is implemented in the CUDA programming mode; the specific method is:
1.7.1 Query the architecture manual to obtain the topology of the computing cores in each GPU, i.e. the topology mg×ng of the thread processing units (TPU) in the GPU: the mg×ng TPUs in the GPU are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.7.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units (FMA) owned by each TPU;
1.7.3, utilizing the function provided by the operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 The mg×ng×m_e FMAs take the Shared Memory of the GPU as the shared storage space and the Constant Memory of the GPU as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.7.6, taking the Shared Memory of the GPU as a Shared storage space, and merging the C block matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.8 If the heterogeneous fusion architecture consists of a CPU and an MIC, the MIC matrix multiplication acceleration version v_mic is implemented in the Offload programming mode; the specific method is:
1.8.1 Query the architecture manual to obtain the computing cores in each MIC, i.e. the topology mg×ng of the vector processing units (VPU) in the MIC: the mg×ng VPUs in one MIC are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.8.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units (FMA) owned by each VPU;
1.8.3 initialization is accomplished by using the functions provided by the operating system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 The mg×ng×m_e FMAs take the memory space of the MIC as the shared storage space and its portable Cache as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.8.6 using the memory space of MIC as the shared storage space, adopting the result merging method in the step 1.6 to finish the merging of the C block matrix results calculated by each FMA, and obtaining C;
1.9 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the SCIF programming mode, the matrix multiplication acceleration version v_scif is implemented in the SCIF programming mode; the specific method is:
1.9.1 Query the architecture manual to obtain the topology mg×ng of the computing cores, i.e. the vector processing units (VPE), in each Matrix2000: the mg×ng VPEs in the Matrix2000 are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.9.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.9.3 initialization is done using functions provided by the operating system;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 The mg×ng×m_e FMAs take the Global Cache of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.9.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.10 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the COI programming mode, the matrix multiplication acceleration version v_coi is implemented in the COI programming mode; the specific method is:
1.10.1 Query the architecture manual to obtain the computing cores in the system, i.e. the topology mg×ng of the vector processing units (VPE) in the Matrix2000: the mg×ng VPEs in the Matrix2000 are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.10.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units (FMA) owned by each VPE;
1.10.3, utilizing a function provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 The mg×ng×m_e FMAs take the Global Cache of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.10.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.11 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the OpenMP target programming mode, the OpenMP target matrix multiplication acceleration version v_target is implemented in the OpenMP target programming mode; the specific method is:
1.11.1 Query the architecture manual to obtain the computing cores in the system, i.e. the topology mg×ng of the vector processing units (VPE) in the Matrix2000: the mg×ng VPEs in the system are physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.11.2 Query the architecture manual to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.11.3 initialization is done using functions provided by the system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 Based on the Global Cache and the Array Memory of the Matrix2000, the mg×ng×m_e FMAs take the Global Cache of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute matrix multiplication in parallel with the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication assigned to it;
1.11.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results completed by each FMA by the result merging method in step 1.6 to obtain C;
the second step is that: integrating heterogeneous fusion multi-version matrix multiplication, the method is as follows:
2.1 Separately compile the source code corresponding to v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif;
2.2 Use the tar command to pack the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the heterogeneous fusion version library file HU-xgemm;
step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 Query the accelerator type in the heterogeneous fusion architecture with the operating system command lspci | grep processor, and set the accelerator type variable arc to the accelerator type;
3.2 if arc = Matrix2000, run 3.2.1; otherwise, turn to 3.3;
3.2.1. Query the Matrix2000 programming technical manual, confirm the supported programming model, and assign a value to the programming model variable prolan;
3.2.2. if prolan = OpenMP target, call v_cpu and v_target in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator Matrix2000, respectively; otherwise, turn to 3.2.3;
3.2.3. if prolan = COI, call v_cpu and v_coi in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator Matrix2000, respectively; otherwise, turn to 3.2.4;
3.2.4. if prolan = SCIF, call v_cpu and v_scif in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator Matrix2000, respectively; otherwise, turn to 3.3;
3.3 if arc = MIC, call v_cpu and v_mic in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator MIC, respectively; otherwise, turn to 3.4;
3.4 if arc = GPU, call v_cpu and v_gpu in HU-xgemm to complete the matrix multiplication on the host CPU and on the accelerator GPU, respectively; otherwise, turn to 3.5;
3.5 if there is no special accelerator in the system, i.e. only a CPU, call v_cpu in HU-xgemm to complete the matrix multiplication.
2. The matrix multiplication acceleration method oriented to a heterogeneous fusion architecture according to claim 1, wherein the initialization function provided by the operating system in step 1.2.3 is the index function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910076766.0A CN109871512B (en) | 2019-01-27 | 2019-01-27 | Matrix multiplication acceleration method for heterogeneous fusion system structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109871512A CN109871512A (en) | 2019-06-11 |
CN109871512B true CN109871512B (en) | 2020-05-22 |
Family
ID=66918078
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110415162B (en) * | 2019-07-22 | 2020-03-31 | 中国人民大学 | Adaptive graph partitioning method for heterogeneous fusion processors in big data |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106775594A (en) * | 2017-01-13 | 2017-05-31 | 中国科学院软件研究所 | Heterogeneous many-core implementation method of sparse matrix-vector multiplication based on the domestic Sunway 26010 processor |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103744682B (en) * | 2014-01-24 | 2017-02-08 | 中国科学院自动化研究所 | System and method for separate compilation of heterogeneous mixed programs |
CN104317768B (en) * | 2014-10-15 | 2017-02-15 | 中国人民解放军国防科学技术大学 | Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system |
CN104346318B (en) * | 2014-10-15 | 2017-03-15 | 中国人民解放军国防科学技术大学 | Matrix multiplication acceleration method for general-purpose multi-core DSPs |
CN104820613B (en) * | 2015-05-27 | 2018-03-27 | 北京思朗科技有限责任公司 | Compilation method for heterogeneous multi-core programs |
CN105242962B (en) * | 2015-11-24 | 2018-07-03 | 无锡江南计算技术研究所 | Fast triggering method for lightweight threads based on heterogeneous many-core |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106940815B (en) | Programmable convolutional neural network coprocessor IP core | |
US8400458B2 (en) | Method and system for blocking data on a GPU | |
Skinderowicz | The GPU-based parallel ant colony system | |
JP2021515300A (en) | Neural network accelerator | |
CN101826142B (en) | Reconfigurable elliptic curve cipher processor | |
US9753726B2 (en) | Computer for amdahl-compliant algorithms like matrix inversion | |
US7370156B1 (en) | Unity parallel processing system and method | |
CN114201287B (en) | Method for cooperatively processing data based on CPU + GPU heterogeneous platform | |
US20200226201A1 (en) | Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations | |
Maroosi et al. | Parallel and distributed computing models on a graphics processing unit to accelerate simulation of membrane systems | |
CN116710912A (en) | Matrix multiplier and control method thereof | |
CN109871512B (en) | Matrix multiplication acceleration method for heterogeneous fusion system structure | |
Buttari et al. | Limitations of the playstation 3 for high performance cluster computing | |
CN117032807A (en) | AI acceleration processor architecture based on RISC-V instruction set | |
CN114402337A (en) | Hardware circuit for accelerating neural network calculation | |
CN113553054A (en) | Heterogeneous system based compiling method, device, equipment and storage medium | |
CN109614367B (en) | Improved DND algorithm and implementation method based on FPGA | |
Mayannavar et al. | Hardware Accelerators for Neural Processing | |
Al Maruf et al. | Optimizing DNNs Model Partitioning for Enhanced Performance on Edge Devices. | |
Chen et al. | Exploring efficient data parallelism for genome read mapping on multicore and manycore architectures | |
CN108062249A (en) | Cloud data allocation and scheduling method based on big data | |
CN109993284B (en) | Integrated circuit chip device and related product | |
Mahajan et al. | Review of Artificial Intelligence Applications and Architectures | |
CN112052042B (en) | Data pipeline processor system | |
Tan et al. | Heterogeneous Parallel and Distributed Optimization of K-Means Algorithm on Sunway Supercomputer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||