CN109871512B - Matrix multiplication acceleration method for heterogeneous fusion system structure - Google Patents

Matrix multiplication acceleration method for heterogeneous fusion system structure Download PDF

Info

Publication number
CN109871512B
CN109871512B CN201910076766.0A CN201910076766A CN109871512B CN 109871512 B CN109871512 B CN 109871512B CN 201910076766 A CN201910076766 A CN 201910076766A CN 109871512 B CN109871512 B CN 109871512B
Authority
CN
China
Prior art keywords
matrix multiplication
matrix
cpu
fma
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910076766.0A
Other languages
Chinese (zh)
Other versions
CN109871512A (en
Inventor
甘新标
曾瑞庚
杨志辉
孙泽文
吴涛
刘杰
龚春叶
李胜国
杨博
徐涵
晏益慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910076766.0A priority Critical patent/CN109871512B/en
Publication of CN109871512A publication Critical patent/CN109871512A/en
Application granted granted Critical
Publication of CN109871512B publication Critical patent/CN109871512B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a matrix multiplication acceleration method for a heterogeneous fusion system structure, and aims to aim at aiming at different many-core accelerator targetsThe system structure designs a general multiplication acceleration method for a heterogeneous fusion system structure matrix, and the use efficiency of a heterogeneous system is improved. The technical scheme is that a block matrix multiplication version oriented to a heterogeneous fusion system structure is designed firstly, and comprises vcpu、vgpu、vmic,vscif,vcoi,vtargetThen, integrating and packaging the heterogeneous fusion multi-version matrix multiplication versions to generate a library file HU-xgemm of the heterogeneous fusion version; and finally, adapting an accelerator in a heterogeneous fusion system structure by using HU-xgemm. The invention can be adaptive to different target accelerators and processors, can perform matrix multiplication adaptively according to different heterogeneous fusion system structures, and performs matrix multiplication according to topological structures of CPUs or accelerators in different heterogeneous fusion system structures, and each FMA performs parallel calculation, thereby accelerating the matrix multiplication speed and improving the use efficiency of a heterogeneous system.

Description

Matrix multiplication acceleration method for heterogeneous fusion system structure
Technical Field
The invention relates to a matrix multiplication accelerating method, in particular to a heterogeneous system-oriented heterogeneous fusion system structure matrix multiplication accelerating method.
Background
With the continuous rising of the computing performance of the general accelerator and the wide application of the accelerator, the many-core accelerator must become an important development direction for high-performance computing, and the accelerators such as GPU, MIC (Xeon Phi), Matrix2000 and the like which meet the requirements of various fields are developed. With the wide application and popularization of heterogeneous systems, a plurality of heterogeneous system structures of different types such as CPU + GPU, CPU + MIC, CPU + Matrix2000 and the like emerge.
The design target and design principle of the accelerator determine the specificity and limitation of the accelerator, and different accelerator manufacturers develop programming models adapted to the accelerator, such as CUDA supported by GPU, Offload supported by MIC, COI (coprocessor over infrastructure) supported by Matrix2000, SCIF (systematic communication interface), OpenMP target, and other programming models. The program design of the target-oriented accelerator needs to adopt a programming model supported by the target-oriented accelerator to redesign and realize an algorithm, so that the acceleration is possible; if the program is not redesigned and implemented according to the programming model supported by the accelerator, the program is basically impossible to run and has no acceleration effect. Therefore, programs for realizing different versions need to be designed for different heterogeneous systems, for example, algorithms and programs capable of efficiently cooperating between a CPU and a GPU must be realized for CPU + GPU-oriented heterogeneous systems; the CPU + MIC is oriented to an algorithm and a program which can realize efficient cooperation between the CPU and the MIC; CPU + Matrix2000 oriented algorithms and programs must be implemented that enable efficient coordination between the CPU and the Matrix 2000. With the updating, replacement and upgrading of the heterogeneous system accelerator, programs for implementing different accelerator versions need to be redesigned at different periods, and even algorithms and programs for different target accelerators need to be simultaneously designed when a mixed use situation of multiple accelerators occurs in a set of heterogeneous system.
For different heterogeneous systems, software designers need to re-understand a target system structure and learn a new programming model to realize the existing algorithm, spend a lot of time learning new knowledge to repeat the existing work, the effect may not be good, and the design and development of the algorithm in the field are not facilitated. Therefore, a set of universal programs can be designed to run on different heterogeneous systems, so that program designers can be greatly liberated, and the development efficiency is improved.
The matrix multiplication is the most common operation in numerical calculation, and many applications include the calculation process of the matrix multiplication, so that the operation speed of the matrix multiplication is improved, and the speed of high-performance calculation can be improved to a great extent.
The matrix multiplication is to multiply one row of the multiplied matrix a and one column of the multiplier matrix B to obtain one element in the resultant matrix C. Matrix multiplication oriented to a heterogeneous system generally needs to reasonably distribute a matrix multiplication calculation process between a main processor (CPU) and a many-core accelerator to perform heterogeneous cooperation and parallel completion of the calculation process so as to improve the operation speed of matrix multiplication and maximize the calculation efficiency and the use efficiency of the heterogeneous system.
Because the design target of the many-core accelerator is different from the instruction set structure, the conventional matrix multiplication implementation technology facing a general main processor hardly meets the performance requirement of the many-core accelerator designed facing a specific application, so the matrix multiplication needs to be accelerated facing a target architecture of the many-core accelerator to improve the operation speed of the matrix multiplication, and the design target of a heterogeneous system is met to the maximum extent.
If a heterogeneous fusion Matrix multiplication acceleration method can be provided for various heterogeneous systems such as a CPU + GPU, a CPU + MIC, a CPU + Matrix2000 and the like to shield structural details of a target system, simplify the program development of the heterogeneous system and improve the efficiency of the heterogeneous system, a programmer can concentrate on the field algorithm design and development to the maximum extent without knowing the specific structure and instruction of the heterogeneous system, the development restriction of a many-core accelerator in the high-performance computing field can be effectively solved, and the technical problem to be solved by technical personnel in the field is urgently needed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a general Matrix multiplication accelerating method facing different many-core accelerator target architectures is designed, so that the details of the target architecture are shielded to the maximum extent, the Matrix multiplication design difficulty and the workload of a heterogeneous system are simplified, and the use efficiency of the heterogeneous system is improved.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
firstly, designing a partitioned matrix multiplication version oriented to a heterogeneous fusion system structure to obtain a CPU matrix multiplication acceleration version vcpuGPU matrix multiplication acceleration version vgpuMIC matrix multiplication acceleration version vmicWhen the heterogeneous integration system structure consists of a CPU and a Matrix2000, a Matrix multiplication acceleration version v realized by adopting a SCIF programming modescifMatrix multiplication acceleration version v realized by adopting COI programming modecoiMatrix multiplication acceleration version v realized by OpenMP tag programming modetarget, the specific steps are as follows:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
defining the dimensionality of a matrix A to be M multiplied by K and the dimensionality of a matrix B to be K multiplied by N, and then defining the dimensionality of a result matrix C obtained by multiplying A and B to be M multiplied by N, wherein M, K and N are positive integers; the element of the p-th row and the q-th column of A is apqP is more than or equal to 0 and less than or equal to M-1, q is more than or equal to 0 and less than or equal to K-1, and the element in the qth row and the tth column of B is Bqt,0≤t≤N-1;
1.2 if the heterogeneous fusion system structure only consists of a CPU, initializing the CPU, and the specific method comprises the following steps:
1.2.1. inquiring an architecture manual to obtain the topological structure of the computing cores in one CPU, and if the topological structure in the multi-core CPU is mg × ng, namely mg × ng computing cores in one CPU are physically distributed in mg rows × ng columns, numbering the computing cores as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1), i is more than or equal to 0 and less than or equal to mg-1, and j is more than or equal to 0 and less than or equal to ng-1 in sequence;
1.2.2. the query architecture manual obtains the number m of floating point vector Multiply accumulate functional units FMA (fused multiplex Add) owned by each CPU computing coree
1.2.3. The initialization of the multi-core CPU is completed by using an initialization function (such as init) provided by an operating system;
1.3 matrix division, namely, partitioning A and B according to the topological structure of a CPU to obtain mg × ng A block matrixes and ng × mg B block matrixes, wherein the specific method comprises the following steps:
1.3.1 divide matrix A into mg × ng A block matrices, the arrangement of mg × ng A block matrices is the same as that of mg × ng computational cores, and the ith row and jth column of A block matrices are used as Aij (i is more than or equal to 0 and less than or equal to mg-1, j is more than or equal to 0 and less than or equal to ng-1), and the dimension of each block matrix is m multiplied by k, wherein
Figure GDA0002443977970000031
Figure GDA0002443977970000032
The representation is rounded up, and the dimension of the edge block matrix needs special treatment. The specific matrix partitioning method is as follows:
1.3.1.1 let the block matrix row variable i equal to 0;
1.3.1.2 let the block matrix column variable j be 0;
1.3.1.3 let the block matrix element row variable m' be 0;
1.3.1.4 let block matrix element column variable n' be 0, n be k;
1.3.1.5 order row coordinate variable s ═ m × i + m';
1.3.1.6 let column coordinate variable e be n × j + n';
1.3.1.7 if s is less than or equal to M-1, switching to 1.3.1.10, otherwise, switching to 1.3.2 after the matrix A is divided;
1.3.1.8 if e is less than or equal to K-1, turning to 1.3.1.11, otherwise, turning to 1.3.2 after the matrix A is divided;
1.3.1.9 selecting elements of matrix A ase to form block matrix AijElement a 'of'm'n'
1.3.1.10 n'=n'+1;
1.3.1.11 if n' is less than or equal to k-1, turning to 1.3.1.7, otherwise, turning to 1.3.1.14;
1.3.1.12 m'=m'+1;
1.3.1.13 if m' is less than or equal to m-1, switching to 1.3.1.6, otherwise, switching to 1.3.1.16;
1.3.1.14 j=j+1;
1.3.1.15 if
Figure GDA0002443977970000041
Rotating 1.3.1.3, otherwise, rotating 1.3.1.18;
1.3.1.16 i=i+1;
1.3.1.17 if
Figure GDA0002443977970000042
1.3.1.2, otherwise, 1.3.2 is carried out after the matrix A is divided;
1.3.2 divide the matrix B of K × N into ng × mg B block matrices, the arrangement of ng × mg B block matrices is the same as that of mg × ng computing cores, and B is used for the r-th row and s-th column B block matricesrsExpressed as (0. ltoreq. r. ltoreq. mg-1, 0. ltoreq. s. ltoreq. ng-1), the dimension of each block matrix is k x n,
Figure GDA0002443977970000043
Figure GDA0002443977970000044
representing upper rounding; the dimension of the edge block matrix needs special treatment, and the specific matrix division method is as follows:
1.3.2.1, making a B block matrix row variable p equal to 0;
1.3.2.2, a variable q of a B block matrix column is equal to 0;
1.3.2.3, making a coordinate variable s' of the B row equal to 0;
1.3.2.4 let B-column coordinate variable e' be 0;
1.3.2.5 let the block matrix element row variable k' be 0;
1.3.2.6 order the block matrix element column variable n ″, 0;
1.3.2.7 s'=k*p+k';
1.3.2.8 e'=n*q+n”;
1.3.2.9 if s' is less than or equal to K-1, turning to 1.3.2.10, otherwise, turning to 1.4 after the matrix B is divided;
1.3.2.10 if e' is less than or equal to N-1, turning to 1.3.2.11, otherwise, turning to 1.4 after the matrix B is divided;
1.3.2.11 selecting element B of matrix Bs'e'Block matrix B is formedpqElement b 'of'k'n”
1.3.2.12 n”=n”+1;
1.3.2.13 if the fruit n' is less than or equal to n-1, turning to 1.3.2.7, otherwise, turning to 1.3.2.14;
1.3.2.14 k'=k'+1;
1.3.2.15 if k' is less than or equal to k-1, turning to 21.3.2.6, otherwise, turning to 1.3.2.16;
1.3.2.16 q=q+1;
1.3.2.17 if j2 is not more than ng-1, turning to 1.3.2.7, otherwise, turning to 1.3.2.18;
1.3.2.18 p=p+1;
1.3.2.19 if i2 is not more than mg-1, turning to 1.3.2.6, otherwise, turning to 1.4 after the matrix B is divided;
1.4 initializing the result matrix C to be 0, that is, assigning each element in C to be 0, making C (mo, no) to be 0, where C (mo, no) represents an element in the no-th row and the no-th column of the result matrix, mo is greater than or equal to 0 and less than or equal to M-1, and no is greater than or equal to 0 and less than or equal to N-1;
1.5 matrix multiplication acceleration to obtain mg × ng × meC block matrices: mg × ng × meThe FMA takes the CPU computing core shared memory space as the shared memory space and a scalar data buffer area as the data buffer area to execute the block matrix multiplication operation in parallel, so that the matrix multiplication is accelerated, each FMA independently finishes the block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') finishes the block matrix multiplication
Figure GDA0002443977970000051
Ci'j'The specific calculation process of (2) is as follows:
1.5.1 let a block matrix row variable i' be 0;
1.5.2 let a block matrix column variable j' be 0;
1.5.3 the computational core (i ', j') reads the A-Block matrix A from memoryi'j'In a shared storage space (the CPU computing core shares the storage space as a portable Cache), the storage space is recorded as
Figure GDA0002443977970000052
1.5.4 let the B-block matrix column variable k ″) be 0;
1.5.5 computing cores (j ', k') read B Block matrix B from shared memory spacej'k”To the FMA data buffer;
1.5.6 mg×ng×methe individual FMAs perform the following vector multiplication operations in parallel:
1.5.7 obtaining A Block matrix Ai'j'The dimension m0 xk 0 is more than or equal to 1, m0 is more than or equal to 1, and k0 is more than or equal to 1; let the column length variable j ″) be 0;
1.5.8 initializing a vector loop variable v equal to 0;
1.5.9 define vector cycle number
Figure GDA0002443977970000061
Represents lower rounding;
1.5.10 initializes a broadcast variable r to 0;
1.5.11 from A Block matrix Ai'j'Column j of (2)eStarting to read m continuouslyeElement (a) of composition meVector of dimensions
Figure GDA0002443977970000062
1.5.12 taking B block matrix Bj'k”The r element in the j' th row is broadcast as an meVector of dimensions
Figure GDA0002443977970000063
1.5.13 vector VaAnd VbMultiplying corresponding elements to obtain the j' th column continuous m of the matrix CeThe partial product of each element, the specific multiplication rule is as follows:
Figure GDA0002443977970000064
1.5.14 matrix C obtained by 1.5.13 has j th column continuous meThe partial product of the individual elements is returned to the shared memory space, which is marked as
Figure GDA0002443977970000065
1.5.15 if j "> 1, go to 1.5.16, otherwise, go to 1.5.18;
1.5.16 will be
Figure GDA0002443977970000066
Stored content (i.e. column j's consecutive m)ePartial product of elements) and
Figure GDA0002443977970000067
stored content (i.e. j' -1 th column of consecutive mePartial product of individual elements) are correspondingly added, and the specific adding method is as follows:
1.5.16.1 let i3 be v × me + 0;
1.5.16.2 Ci'k”(i3,j”)=Ci'k”(i3,j”)+Ci'k”(i3,j”-1),Ci'k”(i3, j') represents the block matrix C of the result matrixi'k”Row i3, column j;
1.5.16.3 i3=i3+1;
1.5.16.4 if i3 is not more than v × me + (me-1), switching to 1.5.16.2, otherwise, switching to 1.5.16.5;
1.5.16.5 the result of 1.5.16.2 corresponding additions is returned to the shared memory space, and its new memory space is marked as
Figure GDA0002443977970000068
1.5.17 let broadcast variable r be r + 1;
1.5.18 if r < k0, go to 1.5.12, otherwise go to 1.5.21;
1.5.19 let vector loop variable v be v + 1;
1.5.20 if v < Vz, go to 1.5.11, otherwise, 1.5.23;
1.5.21 defines the vector cyclic remainder vr-m 0-v me
1.5.22 from A Block matrix Ai'j'V x m of column j of (1)eThe beginning of the continuous reading of vr elements constitutes meFront vr components of the vector of dimensions, rear meThe vr components are supplemented with 0, constituting a vector
Figure GDA0002443977970000071
1.5.23 if v is Vz, go to 1.5.13, otherwise go to 1.5.24;
1.5.24 j”=j”+1;
1.5.25 if j "< n 0; 1.5.9, otherwise, indicating that the calculation of the length direction of the matrix array A is finished, and 1.5.26 is turned;
1.5.26 k”=k”+1;
1.5.27 if
Figure GDA0002443977970000072
1.5.5, otherwise, indicating that the vector calculation in the array direction of the A block matrix is finished, and 1.5.28 is turned;
1.5.28 j'=j'+1;
1.5.29 if j' is less than or equal to ng-1, turning to 1.5.3, otherwise, turning to 1.5.30 after the calculation of the vector in the row direction is finished;
1.5.30 i'=i'+1;
1.5.31 if i' is less than or equal to mg-1, rotating to 1.5.2, otherwise, indicating that the calculation of the row length direction is finished, obtaining Ci'j'Rotating by 1.6;
1.6 the results are merged, namely, according to the data distribution principle, the shared storage space of the CPU computing core is taken as the shared storage space,
merging the results of the shared memory space into a result matrix C, wherein the specific method comprises the following steps:
1.6.1 let the matrix C row variable u be 0;
1.6.2 let the matrix C column variable v be 0;
1.6.3 matrix C of C blocksuvThe calculation result is transmitted back to the shared storage space;
1.6.4 acquisition of CuvThe dimension mc x nc is that mc is more than or equal to 1 and is less than or equal to m, and nc is more than or equal to 1 and is less than or equal to n;
1.6.5, making the C block matrix row coordinate variable ic equal to 0;
1.6.6 order C block matrix column coordinate variable jc=0;
1.6.7 C(ic,jc)=C(ic,jc)+Cuv(ic,jc),C(ic,jc) Row ith and jth of the representation matrix CcElement of column, Cuv(ic,jc) Represents a block matrix CuvI th of (1)cLine j (th)cA column element;
1.6.8 jc=jc+1;
1.6.9 if jc< nc; turning 1.6.7, otherwise, turning 1.6.10;
1.6.10 ic=ic+1;
1.6.11 if icIf mc is less than mc, go to 1.6.6, otherwise go to 1.6.12;
1.6.12 v=v+1;
1.6.13 if
Figure GDA0002443977970000081
Turning to 1.6.3, otherwise, turning to 1.6.14;
1.6.14 u=u+1;
1.6.15 if
Figure GDA0002443977970000082
1.6.2, otherwise, merging to obtain C; the matrix multiplication realized by adopting the MPI programming mode according to the steps of 1.2-1.6 is the CPU matrix multiplication accelerating version vcpu
1.7 if the heterogeneous integration system structure consists of a CPU and a GPU, a matrix multiplication acceleration version v facing the GPU is realized by adopting a CUDA programming modegpuThe specific method comprises the following steps:
1.7.1 query architecture handbook obtains topology of computing core in each GPU, namely, topology mg × ng of thread processor TPU (threading Processing Unit) in GPU, namely, mg × ng of TPU in GPU, physical distribution is mg rows × ng columns numbered (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), (… … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.7.2 query architecture handbook obtains the number m of vector floating point Multiply accumulate functional parts FMA (fused multiplex Add) owned by each TPUe
1.7.3, utilizing a function (such as init) provided by an operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 mg×ng×methe FMA takes the Shared Memory of the GPU as a Shared Memory space, takes the Constant Memory of the GPU as a data buffer, adopts the matrix multiplication acceleration method in the step 1.5 to execute matrix multiplication in parallel, each FMA independently completes block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') completes block matrix multiplication
Figure GDA0002443977970000091
1.7.6, taking the Shared Memory of the GPU as a Shared Memory space, and merging the C block matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C. The matrix multiplication realized according to 1.7.1-1.7.6 is vgpu
1.8 if the heterogeneous fusion system structure is composed of a CPU and an MIC, because the task scheduling between the CPU and the MIC is not considered in the patent, only the matrix multiplication calculation running at a target accelerator end is concerned, and the MIC matrix multiplication acceleration version v realized by adopting the Offload programming modemicThe specific method comprises the following steps:
1.8.1 query architecture handbook gets the computational cores in each MIC, i.e. as vector processor VPU (virtual Processing Unit) in MIC topology mg × ng, i.e. mg × ng VPUs in a MIC, the physical distribution is mg rows × ng columns numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), (… … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.8.2 query architecture Manual get the number m of vector floating point Multiply accumulate functional parts FMA (fused multiplex Add) owned by each VPUe
1.8.3, the initialization is completed by using the function (such as init) provided by the operation system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 mg×ng×methe FMA takes the memory space of MIC, namely memory, as the shared storage space, and is portable by the shared storage spaceThe formula Cache is a data buffer area, matrix multiplication is executed in parallel by adopting the matrix multiplication acceleration method in the step 1.5, each FMA independently completes block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') completes block matrix multiplication
Figure GDA0002443977970000092
And 1.8.6, taking the memory space of the MIC as a shared storage space, and adopting the result merging method in the step 1.6 to finish the merging of the C block matrix results calculated by each FMA to obtain C. The matrix multiplication realized according to 1.8.1-1.8.6 is vmic
1.9 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts SCIF programming mode, then adopting SCIF programming mode to realize Matrix multiplication acceleration version vscifThe specific method comprises the following steps:
1.9.1 query the architecture handbook for the computation cores in each Matrix2000, namely, the vector Processing unit VPE (virtual Processing element) topology mg × ng in Matrix2000, namely, mg × ng VPEs in Matrix2000, the physical distribution is mg rows × ng columns numbered (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.9.2 query the architecture manual to obtain the number m of vector floating point Multiply accumulate functional units FMA (fused multiplex Add) owned by each VPEe
1.9.3, using the function (such as init) provided by the operating system to complete initialization;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 mg×ng×methe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure GDA0002443977970000101
1.9.6, taking the shared Memory space Array Memory in Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method described in step 1.6, and obtaining C. The matrix multiplication realized according to 1.9.1-1.9.6 is vscif
1.10 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts COI (processor off flow Infrastructure) programming mode, then adopting COI programming mode to realize Matrix multiplication acceleration version vcoiThe specific method comprises the following steps:
1.10.1 inquiring about the computing cores in the architecture handbook acquisition system if the VPE topology of the vector processing units in Matrix2000 is mg × ng, i.e. there are mg × ng VPEs in Matrix2000, the physical distribution is mg rows × ng columns, numbered sequentially as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.10.2 query the architecture manual to obtain the number m of vector floating point Multiply accumulate functional units FMA (fused multiplex Add) owned by each VPEe
1.10.3, utilizing a function (such as init) provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 mg×ng×methe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure GDA0002443977970000111
1.10.6 takes the shared Memory space Array Memory in Matrix2000 asAnd sharing a storage space, and completing the merging of the matrix results of the C blocks calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C. The matrix multiplication realized according to 1.10.1-1.10.6 is vcoi
1.11 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts OpenMP tag programming mode, then the Matrix multiplication acceleration version v realized by OpenMP tag programming mode is adoptedtarget, the specific method is as follows:
1.11.1 query architecture handbook gets the computational cores in the system, i.e. vector processing unit VPE topology mg × ng as in Matrix2000, i.e. mg × ng VPEs in the system, physical distribution mg rows × ng columns, numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), "… … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.11.2 query the architecture manual to obtain the number m of vector floating point Multiply accumulate functional units FMA (fused multiplex Add) owned by each VPEe
1.11.3, the initialization is completed by a function (such as init) provided by system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 Global space Global Cache and shared memory space Arraymemory based on Matrix2000, mg × ng × meThe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure GDA0002443977970000121
1.11.6, taking the shared Memory space Array Memory in Matrix2000 as the shared Memory space, merging the C block Matrix results completed by each FMA by the result merging method described in step 1.6, and obtaining C. The ratio is 1.11.1-1.The matrix multiplication realized by 11.6 is vtarget
And step two, integrating heterogeneous fusion multi-version matrix multiplication. The specific method comprises the following steps:
2.1 separately compiling vcpu、vgpu、vmic、vtarget、vcoiAnd vscifCorresponding source code, generating vcpu、vgpu、vmic、vtarget、vcoiAnd vscifThe executable file of (2);
2.2 Using tar Command to convert vcpu、vgpu、vmic、vtarget、vcoiAnd vscifThe executable file is packed to generate a library file HU-xgemm of a heterogeneous fusion version.
Step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 inquiring the accelerator type in the heterogeneous fusion system structure by using an operating system command lspci | grep processor, and enabling an accelerator type variable arc to be the accelerator type;
3.2 if arc ═ Matrix2000, run 3.2.1: otherwise, turning to 3.3;
3.2.1. inquiring a Matrix2000 programming technical manual, confirming a supported programming model, and assigning a value to a programming model variable prolan;
3.2.2. if prolan ═ OpenMP tag, call v in HU-xgemmcpuAnd vtargetRespectively finishing the Matrix multiplication calculation of a CPU end of a main processor and a Matrix2000 end of an accelerator (dividing the Matrix multiplication into a part to be distributed to the CPU and a part to be distributed to the accelerator (such as Matrix2000, GPU and MIC), belonging to the task scheduling category, and adopting a known scheduling method such as 'a method for dynamically balancing heterogeneous system calculation load (ZL 201410544782.5)', otherwise, turning to 3.2.3;
3.2.3. if prolan ═ COI, recall v in HU-xgemmcpuAnd vcoiRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.2.4;
3.2.4. if prolan ═ SCIF, recall v in HU-xgemmcpuAnd vscifRespectively completing the CPU end and the accelerator of the main processorMatrix multiplication at the Matrix2000 end; otherwise, turning to 3.3;
3.3 if arc ═ MIC, recall v in HU-xgemmcpuAnd vmicRespectively finishing the matrix multiplication of a CPU end of a main processor and an MIC end of an accelerator; otherwise, turning to 3.4;
3.4 if arc ═ GPU, call v in HU-xgemmcpuAnd vgpuRespectively finishing the matrix multiplication of a CPU end of a main processor and a GPU end of an accelerator; otherwise, turning to 3.5;
3.5 if there is no special accelerator in the system, i.e. only CPU, call v in HU-xgemmcpuThe matrix multiplication calculation is completed.
The invention can achieve the following technical effects:
1. the invention can self-adapt to different target accelerators and processors by realizing and integrating the universal matrix multiplication versions suitable for various accelerators, can self-adaptively perform matrix multiplication according to different heterogeneous fusion system structures, reduces the difficulty of matrix multiplication development for the accelerators and lightens the workload.
2. The invention carries out matrix multiplication according to the topological structures of CPUs or accelerators in different heterogeneous fusion system structures, and each FMA carries out parallel calculation, thereby accelerating the matrix multiplication speed and improving the use efficiency of a heterogeneous system.
Drawings
FIG. 1 is a general flowchart of the method for optimizing the multiplication and acceleration of the heterogeneous fusion matrix according to the present invention.
Detailed Description
FIG. 1 is a general flowchart of a general multi-core DSP-oriented matrix multiplication acceleration method according to the present invention.
The method comprises the following steps:
firstly, designing a block matrix multiplication version oriented to a heterogeneous fusion system structure to obtain vcpu、vgpu、vmic、vtarget、vcoiAnd vscif6 versions, the concrete steps are as follows:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
defining the dimension of matrix A as M multiplied by K and the dimension of matrix B as KThe dimension of a result matrix C obtained by multiplying the XN, A and B is MXN, and M, K and N are positive integers; the element of the p-th row and the q-th column of A is apqP is more than or equal to 0 and less than or equal to M-1, q is more than or equal to 0 and less than or equal to K-1, and the element in the qth row and the tth column of B is Bqt,0≤t≤N-1;
1.2 if the heterogeneous integration system structure only consists of CPU, the CPU is initialized (MPI programming mode is adopted to realize CPU matrix multiplication acceleration version vcpu) The method comprises the following steps:
1.2.1. inquiring an architecture manual to obtain the topological structure of the computing cores in one CPU, and if the topological structure in the multi-core CPU is mg × ng, namely mg × ng computing cores in one CPU are physically distributed in mg rows × ng columns, numbering the computing cores as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1), i is more than or equal to 0 and less than or equal to mg-1, and j is more than or equal to 0 and less than or equal to ng-1 in sequence;
1.2.2. query architecture manual obtains the number m of floating point vector multiply accumulate functional units FMA owned by each CPU compute coree
1.2.3. Completing the initialization of the multi-core CPU by using an initialization function provided by an operating system;
1.3 matrix division is carried out on A and B according to the topological structure of the CPU, the matrix A is divided into mg multiplied by ng A block matrixes, the arrangement mode of the mg multiplied by ng A block matrixes is the same as that of the mg multiplied by ng computing cores, the jth column A block matrix of the ith row is represented by Aij, i is more than or equal to 0 and less than or equal to mg-1, j is more than or equal to 0 and less than or equal to ng-1, and the dimension of each block matrix is m multiplied by k, wherein the matrix A is divided into mg multiplied by ng A block matrixes, the arrangement
Figure GDA0002443977970000141
Figure GDA0002443977970000142
Representing upper rounding; dividing the matrix B of K x N into ng x mg B block matrixes, wherein the arrangement mode of the ng x mg B block matrixes is the same as that of mg x ng computing cores, and B is used for the No. r row and No. s column B block matrixesrsExpressed that r is more than or equal to 0 and less than or equal to mg-1, s is more than or equal to 0 and less than or equal to ng-1, the dimension of each block matrix is kXn,
Figure GDA0002443977970000143
1.4 initializing the result matrix C to be 0, that is, assigning each element in C to be 0, making C (mo, no) to be 0, where C (mo, no) represents an element in the no-th row and the no-th column of the result matrix, mo is greater than or equal to 0 and less than or equal to M-1, and no is greater than or equal to 0 and less than or equal to N-1;
1.5 matrix multiplication acceleration to obtain mg × ng × meC block matrices: mg × ng × meThe FMA takes the CPU computing core shared memory space as the shared memory space and a scalar data buffer area as the data buffer area to execute the block matrix multiplication operation in parallel, so that the matrix multiplication is accelerated, each FMA independently finishes the block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') finishes the block matrix multiplication
Figure GDA0002443977970000151
1.6 result merging, namely according to the data distribution principle, taking the shared storage space of the CPU computing core as the shared storage space, merging the results of the shared storage space to form a result matrix C, and performing matrix multiplication which is realized by adopting an MPI (Multi-processor interface) programming mode according to 1.2-1.6 to obtain a CPU matrix multiplication accelerated version vcpu
1.7 if the heterogeneous integration system structure consists of a CPU and a GPU, a matrix multiplication acceleration version v facing the GPU is realized by adopting a CUDA programming modegpuThe specific method comprises the following steps:
1.7.1 query architecture handbook obtains topology of computing core in each GPU, namely topology mg × ng of thread processor TPU in GPU, namely mg × ng TPUs in GPU, physical distribution is mg row × ng column, and serial numbers are (0,0), (0,1), … (0, ng-1),
(1,0),(1,1),…(1,ng-1),……,(i,0),(i,1),…,(i,j),…,(i,ng-1),,……(mg-1,0),(mg-1,1),…(mg-1,ng-1);
1.7.2 query architecture handbook number m of vector floating point multiply accumulate functional unit FMA owned by each TPUe
1.7.3, utilizing the function provided by the operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 mg×ng×methe FMA takes the Shared Memory of the GPU as a Shared Memory space, takes the Constant Memory of the GPU as a data buffer, adopts the matrix multiplication acceleration method in the step 1.5 to execute matrix multiplication in parallel, each FMA independently completes block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') completes block matrix multiplication
Figure GDA0002443977970000152
1.7.6, taking the Shared Memory of the GPU as a Shared storage space, and merging the C block matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.8 if the heterogeneous fusion system structure consists of a CPU and a MIC, adopting an Offload programming mode to realize MIC matrix multiplication acceleration version vmicThe specific method comprises the following steps:
1.8.1 query architecture handbook obtains the computational cores in each MIC, i.e. as vector processor VPU topology mg × ng in MIC, i.e. mg × ng VPUs in one MIC, physical distribution mg row × ng column, numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.8.2 query architecture Manual obtains the number m of vector floating point multiply accumulate functional units FMA each VPU possessese
1.8.3 initialization is accomplished by using the functions provided by the operating system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 mg×ng×meeach FMA takes the memory space of MIC, namely memory, as a shared memory space, takes a portable Cache of the shared memory space as a data buffer area, adopts the matrix multiplication acceleration method described in the step 1.5 to execute matrix multiplication in parallel, each FMA independently finishes block matrix multiplication assigned to itself, and the FMA with the serial number of (i ', j') finishes block matrix multiplication
Figure GDA0002443977970000161
1.8.6 using the memory space of MIC as the shared storage space, adopting the result merging method in the step 1.6 to finish the merging of the C block matrix results calculated by each FMA, and obtaining C;
1.9 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts SCIF programming mode, then adopting SCIF programming mode to realize Matrix multiplication acceleration version vscifThe specific method comprises the following steps:
1.9.1 query the architecture handbook for the topology mg × ng of the computing cores in each Matrix2000, i.e. the vector processing units VPE in Matrix2000, i.e. mg × ng VPEs in Matrix2000, the physical distribution is mg rows × ng columns, numbered sequentially as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), (… … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.9.2 query the architecture manual to obtain the number m of vector floating point Multiply accumulate functional units FMA (fused multiplex Add) owned by each VPEe
1.9.3 initialization is done using functions provided by the operating system;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 mg×ng×methe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure GDA0002443977970000171
1.9.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.10 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts COI programming mode, then adopting COI programming mode to realize Matrix multiplication acceleration version vcoiThe specific method comprises the following steps:
1.10.1 inquiring about the computing cores in the architecture handbook acquisition system if the VPE topology of the vector processing units in Matrix2000 is mg × ng, i.e. there are mg × ng VPEs in Matrix2000, the physical distribution is mg rows × ng columns, numbered sequentially as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.10.2 query architecture Manual to obtain the number m of vector floating point multiply accumulate functional units FMA owned by each VPEe
1.10.3, utilizing a function provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 mg×ng×methe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure GDA0002443977970000172
1.10.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.11 if the heterogeneous fusion system structure is composed of a CPU and a Matrix2000, if the Matrix2000 adopts an OpenMP taget programming mode, the OpenMP taget programming mode is adopted to realize the OpenMP taget Matrix multiplication acceleration version vtargetThe specific method comprises the following steps:
1.11.1 query architecture handbook gets the computational cores in the system, i.e. vector processing unit VPE topology mg × ng as in Matrix2000, i.e. mg × ng VPEs in the system, physical distribution mg rows × ng columns, numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), "… … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.11.2 query the architecture manual to obtain the number m of vector floating point Multiply accumulate functional units FMA (fused multiplex Add) owned by each VPEe
1.11.3 initialization is done using functions provided by the system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 Global space Global Cache and shared memory space Arraymemory based on Matrix2000, mg × ng × meThe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure GDA0002443977970000181
1.11.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results completed by each FMA by the result merging method in step 1.6 to obtain C;
the second step is that: integrating heterogeneous fusion multi-version matrix multiplication, the method is as follows:
2.1 separately compiling vcpu、vgpu、vmic、vtarget、vcoiAnd vscifCorresponding source code, generating vcpu、vgpu、vmic、vtarget、vcoiAnd vscifThe executable file of (2);
2.2 Using tar Command to convert vcpu、vgpu、vmic、vtarget、vcoiAnd vscifThe executable file is packed to generate a library file HU-xgemm of a heterogeneous fusion version;
step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 inquiring the accelerator type in the heterogeneous fusion system structure by using an operating system command lspci | grep processor, and enabling an accelerator type variable arc to be the accelerator type;
3.2 if arc ═ Matrix2000, run 3.2.1: otherwise, turning to 3.3;
3.2.1. inquiring a Matrix2000 programming technical manual, confirming a supported programming model, and assigning a value to a programming model variable prolan;
3.2.2. if prolan ═ OpenMP tag, call v in HU-xgemmcpuAnd vtargetRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.2.3;
3.2.3. if prolan ═ COI, recall v in HU-xgemmcpuAnd vcoiRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.2.4;
3.2.4. if prolan ═ SCIF, recall v in HU-xgemmcpuAnd vscifRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.3;
3.3 if arc ═ MIC, recall v in HU-xgemmcpuAnd vmicRespectively finishing the matrix multiplication of a CPU end of a main processor and an MIC end of an accelerator; otherwise, turning to 3.4;
3.4 if arc ═ GPU, call v in HU-xgemmcpuAnd vgpuRespectively finishing the matrix multiplication of a CPU end of a main processor and a GPU end of an accelerator; otherwise, turning to 3.5;
3.5 if there is no special accelerator in the system, i.e. only CPU, call v in HU-xgemmcpuThe matrix multiplication calculation is completed.

Claims (2)

1. A matrix multiplication accelerating method oriented to a heterogeneous fusion architecture is characterized by comprising the following steps:
the method comprises the following steps of firstly, designing a block matrix multiplication version oriented to a heterogeneous fusion system structure, and specifically:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
defining the dimensionality of a matrix A to be M multiplied by K, the dimensionality of a matrix B to be K multiplied by N, and the dimensionality of a result matrix C obtained by multiplying A and B to be M multiplied by N, wherein M, K and N are positive integers; the element of the p-th row and the q-th column of A is apqP is more than or equal to 0 and less than or equal to M-1, q is more than or equal to 0 and less than or equal to K-1, and the element in the qth row and the tth column of B is Bqt,0≤t≤N-1;
1.2 if the heterogeneous fusion system structure only consists of a CPU, initializing the CPU, wherein the method comprises the following steps:
1.2.1. inquiring an architecture manual to obtain the topological structure of the computing cores in one CPU, and if the topological structure in the multi-core CPU is mg × ng, namely mg × ng computing cores in one CPU are physically distributed in mg rows × ng columns, numbering the computing cores as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1), i is more than or equal to 0 and less than or equal to mg-1, and j is more than or equal to 0 and less than or equal to ng-1 in sequence;
1.2.2. query architecture manual obtains the number m of floating point vector multiply accumulate functional units FMA owned by each CPU compute coree
1.2.3. Completing the initialization of the multi-core CPU by using an initialization function provided by an operating system;
1.3 according to the topological structure of CPU, matrix division is carried out on A and B, the matrix A is divided into mg multiplied by ng A block matrixes, the arrangement mode of the mg multiplied by ng A block matrixes is the same as that of the mg multiplied by ng computing cores, and the ith row and the jth column of A block matrixes use AijI is more than or equal to 0 and less than or equal to mg-1, j is more than or equal to 0 and less than or equal to ng-1, and the dimension of each block matrix is m multiplied by k, wherein
Figure FDA0002443977960000011
Figure FDA0002443977960000012
Representing upper rounding; dividing the matrix B of K x N into ng x mg B block matrixes, wherein the arrangement mode of the ng x mg B block matrixes is the same as that of mg x ng computing cores,row r, column s, block B matrix BrsExpressed that r is more than or equal to 0 and less than or equal to mg-1, s is more than or equal to 0 and less than or equal to ng-1, the dimension of each block matrix is kXn,
Figure FDA0002443977960000013
1.4 initializing the result matrix C to be 0, that is, assigning each element in C to be 0, making C (mo, no) to be 0, where C (mo, no) represents an element in the no-th row and the no-th column of the result matrix, mo is greater than or equal to 0 and less than or equal to M-1, and no is greater than or equal to 0 and less than or equal to N-1;
1.5 matrix multiplication acceleration to obtain mg × ng × meC block matrices: mg × ng × meThe FMA takes the CPU computing core shared memory space as the shared memory space and a scalar data buffer area as the data buffer area to execute the block matrix multiplication operation in parallel, so that the matrix multiplication is accelerated, each FMA independently finishes the block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') finishes the block matrix multiplication
Figure FDA0002443977960000021
1.6 result merging, namely according to the data distribution principle, taking the shared storage space of the CPU computing core as the shared storage space, merging the results of the shared storage space to form a result matrix C, and performing matrix multiplication which is realized by adopting an MPI (Multi-processor interface) programming mode according to 1.2-1.6 to obtain a CPU matrix multiplication accelerated version vcpu
1.7 if the heterogeneous integration system structure consists of a CPU and a GPU, a matrix multiplication acceleration version v facing the GPU is realized by adopting a CUDA programming modegpuThe specific method comprises the following steps:
1.7.1 query architecture handbook obtains topology of computing core in each GPU, namely topology mg × ng of thread processor TPU in GPU, namely mg × ng TPUs in GPU, physical distribution is mg row × ng column, and serial numbers are (0,0), (0,1), … (0, ng-1),
(1,0),(1,1),…(1,ng-1),……,(i,0),(i,1),…,(i,j),…,(i,ng-1),……(mg-1,0),(mg-1,1),…(mg-1,ng-1);
1.7.2 query architecture handbook number m of vector floating point multiply accumulate functional unit FMA owned by each TPUe
1.7.3, utilizing the function provided by the operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 mg×ng×methe FMA takes Shared Memory of the GPU as a Shared Memory space, takes constant Memory of the GPU as a data buffer, adopts the matrix multiplication acceleration method in the step 1.5 to execute matrix multiplication in parallel, each FMA independently completes block matrix multiplication allocated to itself, and the FMA with the serial number of (i ', j') completes block matrix multiplication
Figure FDA0002443977960000031
1.7.6, taking the Shared Memory of the GPU as a Shared storage space, and merging the C block matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.8 if the heterogeneous fusion system structure consists of a CPU and a MIC, adopting an Offload programming mode to realize MIC matrix multiplication acceleration version vmicThe specific method comprises the following steps:
1.8.1 query architecture handbook obtains the computational cores in each MIC, i.e. as vector processor VPU topology mg × ng in MIC, i.e. mg × ng VPUs in one MIC, physical distribution mg row × ng column, numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.8.2 query architecture Manual obtains the number m of vector floating point multiply accumulate functional units FMA each VPU possessese
1.8.3 initialization is accomplished by using the functions provided by the operating system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 mg×ng×methe FMA takes the memory space of MIC, namely memory, as a shared storage space, takes the portable Cache of the shared storage space as a data buffer area,the matrix multiplication accelerating method described in step 1.5 is adopted to execute matrix multiplication in parallel, each FMA independently completes block matrix multiplication allocated to itself, and the FMA with the serial number (i ', j') completes block matrix multiplication
Figure FDA0002443977960000032
1.8.6 using the memory space of MIC as the shared storage space, adopting the result merging method in the step 1.6 to finish the merging of the C block matrix results calculated by each FMA, and obtaining C;
1.9 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts SCIF programming mode, then adopting SCIF programming mode to realize Matrix multiplication acceleration version vscifThe specific method comprises the following steps:
1.9.1 query the architecture handbook for the topological mg × ng of the computing cores in each Matrix2000, i.e. the vector processing units VPE in Matrix2000, i.e. mg × ng VPEs in Matrix2000, the physical distribution being mg rows × ng columns, numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.9.2 query the architecture manual to obtain the number m of vector floating point multiply accumulate functional units FMA (fused Multi Add) owned by each VPEe
1.9.3 initialization is done using functions provided by the operating system;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 mg×ng×methe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure FDA0002443977960000041
1.9.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.10 if the heterogeneous fusion system structure is composed of CPU and Matrix2000, if the Matrix2000 adopts COI programming mode, then adopting COI programming mode to realize Matrix multiplication acceleration version vcoiThe specific method comprises the following steps:
1.10.1 inquiring about the computing cores in the architecture handbook acquisition system if the VPE topology of the vector processing units in Matrix2000 is mg × ng, i.e. there are mg × ng VPEs in Matrix2000, the physical distribution is mg rows × ng columns, numbered sequentially as (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.10.2 query architecture Manual to obtain the number m of vector floating point multiply accumulate functional units FMA owned by each VPEe
1.10.3, utilizing a function provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 mg×ng×methe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure FDA0002443977960000051
1.10.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.11 if the heterogeneous fusion architecture is composed of CPU and Matrix2000, if the Matrix2000 adopts OpenMP tag programming mode, thenOpenMP tag matrix multiplication acceleration version v realized by adopting OpenMP tag programming modetargetThe specific method comprises the following steps:
1.11.1 query architecture handbook gets the computational cores in the system, i.e. vector processing unit VPE topology mg × ng as in Matrix2000, i.e. mg × ng VPEs in the system, physical distribution mg rows × ng columns, numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i, j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.11.2 query the architecture manual to obtain the number m of vector floating point multiply accumulate functional units FMA (fused Multi Add) owned by each VPEe
1.11.3 initialization is done using functions provided by the system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 Global space Global Cache and shared Memory space Array Memory based on Matrix2000, mg × ng × meThe FMA takes Global space Global Cache of Matrix2000 as a shared storage space and Array Memory as a data buffer area, and adopts the Matrix multiplication acceleration method in the step 1.5 to execute Matrix multiplication in parallel; each FMA independently performs block matrix multiplication assigned to itself, and the FMA numbered (i ', j') performs block matrix multiplication
Figure FDA0002443977960000052
1.11.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results completed by each FMA by the result merging method in step 1.6 to obtain C;
the second step is that: integrating heterogeneous fusion multi-version matrix multiplication, the method is as follows:
2.1 separately compiling vcpu、vgpu、vmic、vtarget、vcoiAnd vscifCorresponding source code, generating vcpu、vgpu、vmic、vtarget、vcoiAnd vscifThe executable file of (2);
2.2 Using tar Command to convert vcpu、vgpu、vmic、vtarget、vcoiAnd vscifThe executable file is packed to generate a library file HU-xgemm of a heterogeneous fusion version;
step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 inquiring the accelerator type in the heterogeneous fusion system structure by using an operating system command lspci | grep processor, and enabling an accelerator type variable arc to be the accelerator type;
3.2 if arc ═ Matrix2000, run 3.2.1: otherwise, turning to 3.3;
3.2.1. inquiring a Matrix2000 programming technical manual, confirming a supported programming model, and assigning a value to a programming model variable prolan;
3.2.2. if prolan ═ OpenMP tag, call v in HU-xgemmcpuAnd vtargetRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.2.3;
3.2.3. if prolan ═ COI, recall v in HU-xgemmcpuAnd vcoiRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.2.4;
3.2.4. if prolan ═ SCIF, recall v in HU-xgemmcpuAnd vscifRespectively finishing the Matrix multiplication of the CPU end of the main processor and the Matrix multiplication of the Matrix2000 end of the accelerator; otherwise, turning to 3.3;
3.3 if arc ═ MIC, recall v in HU-xgemmcpuAnd vmicRespectively finishing the matrix multiplication of a CPU end of a main processor and an MIC end of an accelerator; otherwise, turning to 3.4;
3.4 if arc ═ GPU, call v in HU-xgemmcpuAnd vgpuRespectively finishing the matrix multiplication of a CPU end of a main processor and a GPU end of an accelerator; otherwise, turning to 3.5;
3.5 if there is no special accelerator in the system, i.e. only CPU, call v in HU-xgemmcpuThe matrix multiplication calculation is completed.
2. The heterogeneous convergence architecture oriented matrix multiplication acceleration method of claim 1, wherein step 1.2.3 is an initialization function index function provided by the operating system.
CN201910076766.0A 2019-01-27 2019-01-27 Matrix multiplication acceleration method for heterogeneous fusion system structure Active CN109871512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910076766.0A CN109871512B (en) 2019-01-27 2019-01-27 Matrix multiplication acceleration method for heterogeneous fusion system structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910076766.0A CN109871512B (en) 2019-01-27 2019-01-27 Matrix multiplication acceleration method for heterogeneous fusion system structure

Publications (2)

Publication Number Publication Date
CN109871512A CN109871512A (en) 2019-06-11
CN109871512B true CN109871512B (en) 2020-05-22

Family

ID=66918078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910076766.0A Active CN109871512B (en) 2019-01-27 2019-01-27 Matrix multiplication acceleration method for heterogeneous fusion system structure

Country Status (1)

Country Link
CN (1) CN109871512B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415162B (en) * 2019-07-22 2020-03-31 中国人民大学 Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775594A (en) * 2017-01-13 2017-05-31 中国科学院软件研究所 A kind of Sparse Matrix-Vector based on the domestic processor of Shen prestige 26010 multiplies isomery many-core implementation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744682B (en) * 2014-01-24 2017-02-08 中国科学院自动化研究所 System and method for separate compilation of heterogeneous mixed programs
CN104317768B (en) * 2014-10-15 2017-02-15 中国人民解放军国防科学技术大学 Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system
CN104346318B (en) * 2014-10-15 2017-03-15 中国人民解放军国防科学技术大学 Matrix Multiplication accelerated method towards general multi-core DSP
CN104820613B (en) * 2015-05-27 2018-03-27 北京思朗科技有限责任公司 A kind of Compilation Method of heterogeneous polynuclear program
CN105242962B (en) * 2015-11-24 2018-07-03 无锡江南计算技术研究所 The quick triggering method of lightweight thread based on isomery many-core

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775594A (en) * 2017-01-13 2017-05-31 中国科学院软件研究所 A kind of Sparse Matrix-Vector based on the domestic processor of Shen prestige 26010 multiplies isomery many-core implementation method

Also Published As

Publication number Publication date
CN109871512A (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
US8400458B2 (en) Method and system for blocking data on a GPU
Skinderowicz The GPU-based parallel ant colony system
JP2021515300A (en) Neural network accelerator
CN101826142B (en) Reconfigurable elliptic curve cipher processor
US9753726B2 (en) Computer for amdahl-compliant algorithms like matrix inversion
US7370156B1 (en) Unity parallel processing system and method
CN114201287B (en) Method for cooperatively processing data based on CPU + GPU heterogeneous platform
US20200226201A1 (en) Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
Maroosi et al. Parallel and distributed computing models on a graphics processing unit to accelerate simulation of membrane systems
CN116710912A (en) Matrix multiplier and control method thereof
CN109871512B (en) Matrix multiplication acceleration method for heterogeneous fusion system structure
Buttari et al. Limitations of the playstation 3 for high performance cluster computing
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
CN114402337A (en) Hardware circuit for accelerating neural network calculation
CN113553054A (en) Heterogeneous system based compiling method, device, equipment and storage medium
CN109614367B (en) Improved DND algorithm and implementation method based on FPGA
Mayannavar et al. Hardware Accelerators for Neural Processing
Al Maruf et al. Optimizing DNNs Model Partitioning for Enhanced Performance on Edge Devices.
Chen et al. Exploring efficient data parallelism for genome read mapping on multicore and manycore architectures
CN108062249A (en) High in the clouds data allocation schedule method based on big data
CN109993284B (en) Integrated circuit chip device and related product
Mahajan et al. Review of Artificial Intelligence Applications and Architectures
CN112052042B (en) Data pipeline processor system
Tan et al. Heterogeneous Parallel and Distributed Optimization of K-Means Algorithm on Sunway Supercomputer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant