CN109871512B - Matrix multiplication acceleration method for heterogeneous fusion system structure

Matrix multiplication acceleration method for heterogeneous fusion system structure

Publication number: CN109871512B
Authority: CN (China)
Prior art keywords: matrix multiplication, matrix, cpu, fma, block
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN201910076766.0A
Other languages: Chinese (zh)
Other versions: CN109871512A
Inventors: 甘新标, 曾瑞庚, 杨志辉, 孙泽文, 吴涛, 刘杰, 龚春叶, 李胜国, 杨博, 徐涵, 晏益慧
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201910076766.0A
Publication of CN109871512A
Application granted
Publication of CN109871512B


Abstract

The invention discloses a matrix multiplication acceleration method for a heterogeneous fusion system structure, and aims to design a general matrix multiplication acceleration method that works across different many-core accelerator target architectures and improves the use efficiency of a heterogeneous system. The technical scheme is to first design block matrix multiplication versions oriented to the heterogeneous fusion system structure, comprising vcpu, vgpu, vmic, vscif, vcoi and vtarget; then integrate and package the heterogeneous fusion multi-version matrix multiplication versions to generate a library file HU-xgemm of the heterogeneous fusion version; and finally use HU-xgemm to adapt to the accelerator in the heterogeneous fusion system structure. The invention can adapt to different target accelerators and processors, performs matrix multiplication adaptively according to different heterogeneous fusion system structures and according to the topological structures of the CPUs or accelerators in those structures, with each FMA calculating in parallel, thereby accelerating the matrix multiplication and improving the use efficiency of a heterogeneous system.

Description

Matrix multiplication acceleration method for heterogeneous fusion system structure
Technical Field
The invention relates to matrix multiplication acceleration methods, and in particular to a matrix multiplication acceleration method for heterogeneous fusion system structures.
Background
With the continuous rise of general accelerator computing performance and the wide application of accelerators, many-core accelerators have become an important development direction for high-performance computing, and accelerators meeting the requirements of various fields, such as the GPU, the MIC (Xeon Phi) and the Matrix2000, have been developed. With the wide application and popularization of heterogeneous systems, many heterogeneous system structures of different types, such as CPU+GPU, CPU+MIC and CPU+Matrix2000, have emerged.
The design target and design principle of an accelerator determine its specificity and limitations, and different accelerator manufacturers develop programming models adapted to their accelerators, such as CUDA supported by the GPU, Offload supported by the MIC, and the COI (Coprocessor Offload Infrastructure), SCIF (Symmetric Communication Interface) and OpenMP target programming models supported by the Matrix2000. Programming for a target accelerator requires redesigning and implementing the algorithm with the programming model the accelerator supports; only then is acceleration possible. If a program is not redesigned and implemented according to the programming model supported by the accelerator, it basically cannot run and has no acceleration effect. Therefore, programs of different versions need to be designed for different heterogeneous systems; for example, algorithms and programs enabling efficient cooperation between the CPU and the GPU must be realized for CPU+GPU heterogeneous systems; algorithms and programs enabling efficient cooperation between the CPU and the MIC must be realized for CPU+MIC systems; and algorithms and programs enabling efficient cooperation between the CPU and the Matrix2000 must be realized for CPU+Matrix2000 systems. With the updating, replacement and upgrading of heterogeneous system accelerators, programs for different accelerator versions need to be redesigned at different periods, and when multiple accelerators are mixed in one heterogeneous system, algorithms and programs for the different target accelerators even need to be designed simultaneously.
For different heterogeneous systems, software designers need to re-understand the target system structure and learn a new programming model to realize existing algorithms, spending a lot of time learning new knowledge to repeat existing work, possibly with poor results, which does not benefit the design and development of domain algorithms. Therefore, designing one set of universal programs that can run on different heterogeneous systems would greatly liberate program designers and improve development efficiency.
Matrix multiplication is the most common operation in numerical calculation, and many applications include a matrix multiplication calculation process; improving the operation speed of matrix multiplication can therefore improve the speed of high-performance computing to a great extent.
Matrix multiplication multiplies one row of the multiplicand matrix A with one column of the multiplier matrix B to obtain one element of the resultant matrix C. Matrix multiplication oriented to a heterogeneous system generally needs to reasonably distribute the matrix multiplication calculation process between the main processor (CPU) and the many-core accelerator, completing the calculation through heterogeneous cooperation in parallel, so as to improve the operation speed of matrix multiplication and maximize the calculation efficiency and use efficiency of the heterogeneous system.
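For reference, a minimal host-side implementation of this definition is sketched below (an illustrative baseline only, assuming row-major dense storage; it is not the patented blocked method):

```cpp
#include <cstddef>

// Reference GEMM: C[M x N] = A[M x K] * B[K x N], all row-major.
// c_pt = sum over q of a_pq * b_qt, as defined in the text above.
void gemm_reference(const double* A, const double* B, double* C,
                    std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t p = 0; p < M; ++p)          // row of A and C
        for (std::size_t t = 0; t < N; ++t) {    // column of B and C
            double acc = 0.0;
            for (std::size_t q = 0; q < K; ++q)  // inner dimension
                acc += A[p * K + q] * B[q * N + t];
            C[p * N + t] = acc;
        }
}
```

The patented method replaces this serial triple loop with topology-aware blocking and per-FMA parallel kernels, as described below.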
Because the design targets and instruction set structures of many-core accelerators differ, conventional matrix multiplication implementations facing general main processors can hardly meet the performance requirements of a many-core accelerator designed for specific applications. Matrix multiplication therefore needs to be accelerated for the target architecture of the many-core accelerator, so as to improve the operation speed of matrix multiplication and meet the design target of the heterogeneous system to the maximum extent.
If a heterogeneous fusion matrix multiplication acceleration method could be provided for the various heterogeneous systems such as CPU+GPU, CPU+MIC and CPU+Matrix2000, shielding the structural details of the target system, simplifying heterogeneous system program development and improving heterogeneous system efficiency, then programmers could concentrate on domain algorithm design and development to the maximum extent without knowing the specific structure and instructions of the heterogeneous system. This would effectively remove a development restriction on many-core accelerators in the high-performance computing field, and is a technical problem that technical personnel in the field urgently need to solve.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: design a general matrix multiplication acceleration method facing different many-core accelerator target architectures, so as to shield the details of the target architecture to the maximum extent, simplify the matrix multiplication design difficulty and workload on a heterogeneous system, and improve the use efficiency of the heterogeneous system.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
Firstly, block matrix multiplication versions oriented to the heterogeneous fusion system structure are designed, obtaining a CPU matrix multiplication acceleration version vcpu, a GPU matrix multiplication acceleration version vgpu, an MIC matrix multiplication acceleration version vmic and, when the heterogeneous fusion system structure consists of a CPU and a Matrix2000, a matrix multiplication acceleration version vscif realized in the SCIF programming mode, a matrix multiplication acceleration version vcoi realized in the COI programming mode and a matrix multiplication acceleration version vtarget realized in the OpenMP target programming mode. The specific steps are as follows:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
Define the dimension of matrix A as M × K and the dimension of matrix B as K × N; then the dimension of the result matrix C obtained by multiplying A and B is M × N, where M, K and N are positive integers. The element in the p-th row and q-th column of A is a_pq, 0 ≤ p ≤ M-1, 0 ≤ q ≤ K-1; the element in the q-th row and t-th column of B is b_qt, 0 ≤ t ≤ N-1;
1.2 if the heterogeneous fusion system structure only consists of a CPU, initializing the CPU, and the specific method comprises the following steps:
1.2.1. query the architecture manual to obtain the topological structure of the computing cores in one CPU; if the topological structure in the multi-core CPU is mg × ng, that is, there are mg × ng computing cores in one CPU, physically distributed in mg rows × ng columns, number the computing cores sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1), where 0 ≤ i ≤ mg-1 and 0 ≤ j ≤ ng-1;
1.2.2. query the architecture manual to obtain the number me of floating point vector multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each CPU computing core;
1.2.3. The initialization of the multi-core CPU is completed by using an initialization function (such as init) provided by an operating system;
1.3 matrix division: partition A and B according to the topological structure of the CPU to obtain mg × ng A block matrices and ng × mg B block matrices; the specific method is as follows:
1.3.1 divide matrix A into mg × ng A block matrices, arranged in the same way as the mg × ng computing cores; the i-th row, j-th column A block matrix is denoted A_ij (0 ≤ i ≤ mg-1, 0 ≤ j ≤ ng-1), and the dimension of each block matrix is m × k, where m = ⌈M/mg⌉ and k = ⌈K/ng⌉, with ⌈ ⌉ denoting rounding up; the dimension of the edge block matrices needs special treatment. The specific matrix partitioning method is as follows:
1.3.1.1 let the block matrix row variable i equal to 0;
1.3.1.2 let the block matrix column variable j be 0;
1.3.1.3 let the block matrix element row variable m' be 0;
1.3.1.4 let block matrix element column variable n' be 0, n be k;
1.3.1.5 let the row coordinate variable s = m × i + m';
1.3.1.6 let column coordinate variable e be n × j + n';
1.3.1.7 if s is less than or equal to M-1, switching to 1.3.1.10, otherwise, switching to 1.3.2 after the matrix A is divided;
1.3.1.8 if e is less than or equal to K-1, turning to 1.3.1.11, otherwise, turning to 1.3.2 after the matrix A is divided;
1.3.1.9 select element a_se of matrix A as element a'_{m'n'} of block matrix A_ij;
1.3.1.10 n'=n'+1;
1.3.1.11 if n' is less than or equal to k-1, turning to 1.3.1.7, otherwise, turning to 1.3.1.14;
1.3.1.12 m'=m'+1;
1.3.1.13 if m' is less than or equal to m-1, switching to 1.3.1.6, otherwise, switching to 1.3.1.16;
1.3.1.14 j=j+1;
1.3.1.15 if j ≤ ng-1, go to 1.3.1.3; otherwise, go to 1.3.1.16;
1.3.1.16 i=i+1;
1.3.1.17 if i ≤ mg-1, go to 1.3.1.2; otherwise, the division of matrix A is finished, go to 1.3.2;
1.3.2 divide the K × N matrix B into ng × mg B block matrices, arranged in the same way as the mg × ng computing cores; the r-th row, s-th column B block matrix is denoted B_rs (0 ≤ r ≤ mg-1, 0 ≤ s ≤ ng-1), and the dimension of each block matrix is k × n, where n = ⌈N/mg⌉, with ⌈ ⌉ denoting rounding up; the dimension of the edge block matrices needs special treatment. The specific matrix division method is as follows:
1.3.2.1 let the B block matrix row variable p = 0;
1.3.2.2 let the B block matrix column variable q = 0;
1.3.2.3 let the B row coordinate variable s' = 0;
1.3.2.4 let the B column coordinate variable e' = 0;
1.3.2.5 let the block matrix element row variable k' = 0;
1.3.2.6 let the block matrix element column variable n'' = 0;
1.3.2.7 s'=k*p+k';
1.3.2.8 e'=n*q+n”;
1.3.2.9 if s' is less than or equal to K-1, turning to 1.3.2.10, otherwise, turning to 1.4 after the matrix B is divided;
1.3.2.10 if e' is less than or equal to N-1, turning to 1.3.2.11, otherwise, turning to 1.4 after the matrix B is divided;
1.3.2.11 select element b_{s'e'} of matrix B as element b'_{k'n''} of block matrix B_pq;
1.3.2.12 n”=n”+1;
1.3.2.13 if n'' ≤ n-1, go to 1.3.2.7; otherwise, go to 1.3.2.14;
1.3.2.14 k'=k'+1;
1.3.2.15 if k' ≤ k-1, go to 1.3.2.6; otherwise, go to 1.3.2.16;
1.3.2.16 q=q+1;
1.3.2.17 if q ≤ ng-1, go to 1.3.2.7; otherwise, go to 1.3.2.18;
1.3.2.18 p=p+1;
1.3.2.19 if p ≤ mg-1, go to 1.3.2.6; otherwise, the division of matrix B is finished, go to 1.4;
1.4 initialize the result matrix C to 0, that is, assign each element of C the value 0: let C(mo, no) = 0, where C(mo, no) represents the element in the mo-th row and no-th column of the result matrix, 0 ≤ mo ≤ M-1, 0 ≤ no ≤ N-1;
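As an illustration of the partitioning in steps 1.3 and 1.4, the sketch below computes the round-up block dimensions and extracts one A block; the zero padding of edge blocks is one possible form of the "special treatment" mentioned above, and the names ceil_div and extract_A_block are illustrative, not the patent's:

```cpp
#include <cstddef>
#include <vector>

// Round-up block sizes from step 1.3: m = ceil(M/mg), k = ceil(K/ng).
static std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Copy A block (i, j) (dimension up to m x k) out of row-major A[M x K].
// Edge blocks are zero-padded here, one possible "special treatment".
std::vector<double> extract_A_block(const double* A, std::size_t M, std::size_t K,
                                    std::size_t mg, std::size_t ng,
                                    std::size_t i, std::size_t j) {
    const std::size_t m = ceil_div(M, mg), k = ceil_div(K, ng);
    std::vector<double> blk(m * k, 0.0);
    for (std::size_t mp = 0; mp < m && i * m + mp < M; ++mp)      // m' in the patent
        for (std::size_t np = 0; np < k && j * k + np < K; ++np)  // n' in the patent
            blk[mp * k + np] = A[(i * m + mp) * K + (j * k + np)];
    return blk;
}
```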
1.5 matrix multiplication acceleration, obtaining mg × ng × me C block matrices: the mg × ng × me FMAs take the CPU computing core shared memory space as the shared memory space and a scalar data buffer as the data buffer, and execute the block matrix multiplication operations in parallel, thereby accelerating the matrix multiplication; each FMA independently finishes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') finishes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''}, 0 ≤ k'' ≤ mg-1. The specific calculation process is as follows:
1.5.1 let the A block matrix row variable i' = 0;
1.5.2 let the A block matrix column variable j' = 0;
1.5.3 the computing core (i', j') reads the A block matrix A_{i'j'} from memory into the shared memory space (the CPU computing core shared memory space serves as a portable Cache);
1.5.4 let the B block matrix column variable k'' = 0;
1.5.5 read the B block matrix B_{j'k''} from the shared memory space into the FMA data buffer;
1.5.6 the mg × ng × me FMAs perform the following vector multiplication operations in parallel:
1.5.7 obtain the dimension m0 × k0 of A block matrix A_{i'j'}, where m0 ≥ 1 and k0 ≥ 1; let the column length variable j'' = 0;
1.5.8 initializing a vector loop variable v equal to 0;
1.5.9 define the vector cycle number Vz = ⌊m0/me⌋, where ⌊ ⌋ denotes rounding down;
1.5.10 initializes a broadcast variable r to 0;
1.5.11 from the j''-th column of A block matrix A_{i'j'}, starting at element v × me, read me consecutive elements to form the me-dimensional vector Va;
1.5.12 take the r-th element of the j''-th row of B block matrix B_{j'k''} and broadcast it to form the me-dimensional vector Vb;
1.5.13 multiply the corresponding elements of vectors Va and Vb to obtain the partial products of the me consecutive elements of the j''-th column of matrix C; the specific multiplication rule is Vc[i] = Va[i] × Vb[i], 0 ≤ i ≤ me-1;
1.5.14 the partial products of the me consecutive elements of the j''-th column of matrix C obtained in 1.5.13 are written back to the shared memory space;
1.5.15 if j'' > 1, go to 1.5.16; otherwise, go to 1.5.18;
1.5.16 add the stored partial products of the me consecutive elements of the j''-th column to the stored partial products of the me consecutive elements of the (j''-1)-th column element by element; the specific adding method is as follows:
1.5.16.1 let i3 be v × me + 0;
1.5.16.2 C_{i'k''}(i3, j'') = C_{i'k''}(i3, j'') + C_{i'k''}(i3, j''-1), where C_{i'k''}(i3, j'') represents the element in row i3, column j'' of the result block matrix C_{i'k''};
1.5.16.3 i3=i3+1;
1.5.16.4 if i3 is not more than v × me + (me-1), switching to 1.5.16.2, otherwise, switching to 1.5.16.5;
1.5.16.5 the results of the corresponding additions in 1.5.16.2 are written back to the shared memory space;
1.5.17 let broadcast variable r be r + 1;
1.5.18 if r < k0, go to 1.5.12, otherwise go to 1.5.21;
1.5.19 let vector loop variable v be v + 1;
1.5.20 if v < Vz, go to 1.5.11, otherwise, 1.5.23;
1.5.21 define the vector cycle remainder vr = m0 - v × me;
1.5.22 from the j''-th column of A block matrix A_{i'j'}, starting at element v × me, read vr consecutive elements to form the first vr components of an me-dimensional vector; pad the remaining me - vr components with 0 to form the vector Va;
1.5.23 if v is Vz, go to 1.5.13, otherwise go to 1.5.24;
1.5.24 j”=j”+1;
1.5.25 if j'' < n0, go to 1.5.9; otherwise, the calculation in the length direction of the A matrix array is finished, go to 1.5.26;
1.5.26 k”=k”+1;
1.5.27 if k'' ≤ mg-1, go to 1.5.5; otherwise, the vector calculation in the array direction of the A block matrix is finished, go to 1.5.28;
1.5.28 j'=j'+1;
1.5.29 if j' is less than or equal to ng-1, turning to 1.5.3, otherwise, turning to 1.5.30 after the calculation of the vector in the row direction is finished;
1.5.30 i'=i'+1;
1.5.31 if i' ≤ mg-1, go to 1.5.2; otherwise, the calculation in the row-length direction is finished and the C block matrices are obtained; go to 1.6;
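The inner kernel of step 1.5 follows a broadcast/multiply-accumulate pattern; a minimal serial sketch is given below, written as a standard blocked GEMM kernel (the patent's exact index bookkeeping and the per-FMA vector registers are abstracted into plain loops; all names are assumptions):

```cpp
#include <cstddef>

// Sketch of the step-1.5 inner kernel on one A block (m0 x k0) and one B block
// (k0 x n0), accumulating into the C block (m0 x n0), all row-major.
// Each FMA would handle chunks of me rows; the "vector" here is a plain loop.
void block_multiply_accumulate(const double* Ab, const double* Bb, double* Cb,
                               std::size_t m0, std::size_t k0, std::size_t n0,
                               std::size_t me) {
    for (std::size_t jpp = 0; jpp < n0; ++jpp) {          // j'' : C/B block column
        for (std::size_t r = 0; r < k0; ++r) {            // r   : broadcast index
            const double vb = Bb[r * n0 + jpp];           // broadcast B element (Vb)
            for (std::size_t v = 0; v * me < m0; ++v) {   // Vz full cycles plus remainder vr
                const std::size_t len = (v * me + me <= m0) ? me : m0 - v * me;
                for (std::size_t i = 0; i < len; ++i) {   // Va * Vb, fused multiply-add
                    const std::size_t row = v * me + i;
                    Cb[row * n0 + jpp] += Ab[row * k0 + r] * vb;
                }
            }
        }
    }
}
```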
1.6 result merging: according to the data distribution principle, take the CPU computing core shared memory space as the shared memory space and merge the results of the shared memory space into the result matrix C; the specific method is as follows:
1.6.1 let the matrix C row variable u be 0;
1.6.2 let the matrix C column variable v be 0;
1.6.3 the calculation result of C block matrix C_uv is transmitted back to the shared memory space;
1.6.4 obtain the dimension mc × nc of C_uv, where 1 ≤ mc ≤ m and 1 ≤ nc ≤ n;
1.6.5 let the C block matrix row coordinate variable ic = 0;
1.6.6 let the C block matrix column coordinate variable jc = 0;
1.6.7 C(ic, jc) = C(ic, jc) + C_uv(ic, jc), where C(ic, jc) represents the element in the ic-th row and jc-th column of matrix C, and C_uv(ic, jc) represents the element in the ic-th row and jc-th column of block matrix C_uv;
1.6.8 jc = jc + 1;
1.6.9 if jc < nc, go to 1.6.7; otherwise, go to 1.6.10;
1.6.10 ic = ic + 1;
1.6.11 if ic < mc, go to 1.6.6; otherwise, go to 1.6.12;
1.6.12 v=v+1;
1.6.13 if v ≤ ⌈N/n⌉ - 1, go to 1.6.3; otherwise, go to 1.6.14;
1.6.14 u=u+1;
1.6.15 if u ≤ ⌈M/m⌉ - 1, go to 1.6.2; otherwise, the merging is finished and C is obtained. The matrix multiplication realized in the MPI programming mode according to steps 1.2-1.6 is the CPU matrix multiplication acceleration version vcpu;
1.7 if the heterogeneous fusion system structure consists of a CPU and a GPU, a GPU-oriented matrix multiplication acceleration version vgpu is realized in the CUDA programming mode; the specific method is as follows:
1.7.1 query the architecture manual to obtain the topology of the computing cores in each GPU, namely the topology mg × ng of the thread processing units TPU (Thread Processing Unit) in the GPU; that is, there are mg × ng TPUs in the GPU, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.7.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each TPU;
1.7.3, utilizing a function (such as init) provided by an operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 the mg × ng × me FMAs take the Shared Memory of the GPU as the shared memory space and the Constant Memory of the GPU as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') completes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.7.6 take the Shared Memory of the GPU as the shared memory space and merge the C block matrix results calculated by each FMA using the result merging method of step 1.6 to obtain C. The matrix multiplication realized according to 1.7.1-1.7.6 is vgpu;
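A minimal CUDA sketch of the arrangement of step 1.7 follows: one thread block stages A and B tiles in Shared Memory (the shared memory space of 1.7.5) and accumulates one C tile; the tile size is an assumption, and the Constant-Memory data buffer is omitted for brevity:

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile edge, standing in for the m x k block size

// Illustrative kernel: each thread block multiplies one A tile by one B tile,
// staging the tiles in Shared Memory as the patent's shared storage space.
__global__ void block_gemm(const double* A, const double* B, double* C,
                           int M, int K, int N) {
    __shared__ double As[TILE][TILE];
    __shared__ double Bs[TILE][TILE];
    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    double acc = 0.0;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        const int a_col = t * TILE + threadIdx.x;
        const int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0;
        __syncthreads();
        for (int q = 0; q < TILE; ++q)   // broadcast-multiply-accumulate, cf. step 1.5
            acc += As[threadIdx.y][q] * Bs[q][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] += acc;
}
```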
1.8 if the heterogeneous fusion system structure consists of a CPU and an MIC: because task scheduling between the CPU and the MIC is not considered in this patent, only the matrix multiplication calculation running at the target accelerator end is concerned; the MIC matrix multiplication acceleration version vmic is realized in the Offload programming mode, and the specific method is as follows:
1.8.1 query the architecture manual to obtain the computing cores in each MIC, namely the topology mg × ng of the vector processing units VPU (Vector Processing Unit) in the MIC; that is, there are mg × ng VPUs in one MIC, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.8.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPU;
1.8.3 complete the initialization by using the function (such as init) provided by the operating system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 the mg × ng × me FMAs take the memory space of the MIC, namely its memory, as the shared memory space and the portable Cache of the shared memory space as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') completes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.8.6 take the memory space of the MIC as the shared memory space and complete the merging of the C block matrix results calculated by each FMA using the result merging method of step 1.6 to obtain C. The matrix multiplication realized according to 1.8.1-1.8.6 is vmic;
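A hedged sketch of an Offload-mode entry point for vmic is given below, using the legacy Intel compiler "#pragma offload" syntax; the device number, data clauses and the simplified body are assumptions for illustration:

```cpp
// Illustrative Offload-mode vmic entry point (legacy Intel "#pragma offload").
void mic_gemm(const double* A, const double* B, double* C,
              int M, int K, int N) {
#ifdef __INTEL_OFFLOAD
    #pragma offload target(mic:0) in(A : length(M * K)) \
                                  in(B : length(K * N)) \
                                  inout(C : length(M * N))
#endif
    {
        // Runs on the MIC when offload is available, on the host otherwise;
        // a full version would apply the step-1.5 blocked kernel per VPU/FMA.
        for (int p = 0; p < M; ++p)
            for (int t = 0; t < N; ++t) {
                double acc = 0.0;
                for (int q = 0; q < K; ++q) acc += A[p * K + q] * B[q * N + t];
                C[p * N + t] += acc;
            }
    }
}
```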
1.9 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the SCIF programming mode, the matrix multiplication acceleration version vscif is realized in the SCIF programming mode; the specific method is as follows:
1.9.1 query the architecture manual to obtain the computing cores in each Matrix2000, namely the topology mg × ng of the vector processing elements VPE (Vector Processing Element) in the Matrix2000; that is, there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.9.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.9.3, using the function (such as init) provided by the operating system to complete initialization;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 the mg × ng × me FMAs take the Global Cache of the Matrix2000 as the shared memory space and the Array Memory as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently performs the block matrix multiplication assigned to itself, and the FMA numbered (i', j') performs the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.9.6 take the shared memory space Array Memory of the Matrix2000 as the shared memory space and merge the C block matrix results calculated by each FMA using the result merging method of step 1.6 to obtain C. The matrix multiplication realized according to 1.9.1-1.9.6 is vscif;
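For illustration, the host side of a vscif data exchange might look like the following sketch; the node id, port number and framing are assumptions, and only the basic SCIF endpoint calls (scif_open, scif_connect, scif_send, scif_close) are used:

```cpp
#include <scif.h>   // Intel SCIF user-mode API (host side)

// Hedged sketch of shipping one matrix block to a listener on the accelerator
// node; the accelerator side would scif_recv() it into its data buffer and
// run the step-1.5 kernel. This is not the patent's code.
bool send_block_over_scif(const double* block, int nbytes) {
    scif_epd_t epd = scif_open();
    if (epd == SCIF_OPEN_FAILED) return false;
    struct scif_portID dst;
    dst.node = 1;     // assumed accelerator node id
    dst.port = 2050;  // assumed listening port on the accelerator
    if (scif_connect(epd, &dst) < 0) { scif_close(epd); return false; }
    const int sent = scif_send(epd, const_cast<double*>(block), nbytes,
                               SCIF_SEND_BLOCK);   // blocking raw send
    scif_close(epd);
    return sent == nbytes;
}
```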
1.10 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the COI (Coprocessor Offload Infrastructure) programming mode, the matrix multiplication acceleration version vcoi is realized in the COI programming mode; the specific method is as follows:
1.10.1 query the architecture manual to obtain the computing cores in the system: the topology of the vector processing elements VPE in the Matrix2000 is mg × ng, that is, there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.10.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.10.3, utilizing a function (such as init) provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 the mg × ng × me FMAs take the Global Cache of the Matrix2000 as the shared memory space and the Array Memory as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently performs the block matrix multiplication assigned to itself, and the FMA numbered (i', j') performs the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.10.6 take the shared memory space Array Memory of the Matrix2000 as the shared memory space and complete the merging of the C block matrix results calculated by each FMA using the result merging method of step 1.6 to obtain C. The matrix multiplication realized according to 1.10.1-1.10.6 is vcoi;
1.11 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the OpenMP target programming mode, the matrix multiplication acceleration version vtarget is realized in the OpenMP target programming mode; the specific method is as follows:
1.11.1 query the architecture manual to obtain the computing cores in the system, namely the topology mg × ng of the vector processing elements VPE in the Matrix2000; that is, there are mg × ng VPEs in the system, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.11.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.11.3, the initialization is completed by a function (such as init) provided by system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 based on the Global Cache and the shared memory space Array Memory of the Matrix2000, the mg × ng × me FMAs take the Global Cache of the Matrix2000 as the shared memory space and the Array Memory as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently performs the block matrix multiplication assigned to itself, and the FMA numbered (i', j') performs the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.11.6 take the shared memory space Array Memory of the Matrix2000 as the shared memory space and merge the C block matrix results completed by each FMA using the result merging method of step 1.6 to obtain C. The matrix multiplication realized according to 1.11.1-1.11.6 is vtarget.
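A minimal sketch of a vtarget kernel using the OpenMP target construct is given below; the map clauses stand in for staging blocks into the Global Cache and Array Memory of step 1.11.5, and the flat (unblocked) loop nest is a simplification:

```cpp
#include <omp.h>

// Hedged sketch of a vtarget kernel: "target" offloads the multiply to the
// accelerator device; sizes and layout (row-major) are assumptions.
void target_gemm(const double* A, const double* B, double* C,
                 int M, int K, int N) {
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: A[0:M*K], B[0:K*N]) map(tofrom: C[0:M*N])
    for (int p = 0; p < M; ++p)
        for (int t = 0; t < N; ++t) {
            double acc = 0.0;
            for (int q = 0; q < K; ++q)
                acc += A[p * K + q] * B[q * N + t];
            C[p * N + t] += acc;
        }
}
```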
Step two, integrate the heterogeneous fusion multi-version matrix multiplications. The specific method is as follows:
2.1 separately compile the source code corresponding to vcpu, vgpu, vmic, vtarget, vcoi and vscif, generating the executable files of vcpu, vgpu, vmic, vtarget, vcoi and vscif;
2.2 use the tar command to pack the executable files of vcpu, vgpu, vmic, vtarget, vcoi and vscif, generating the library file HU-xgemm of the heterogeneous fusion version.
Step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 query the accelerator type in the heterogeneous fusion system structure with the operating system command lspci | grep processor, and let the accelerator type variable arc be the queried accelerator type;
3.2 if arc = Matrix2000, run 3.2.1; otherwise, go to 3.3;
3.2.1. inquiring a Matrix2000 programming technical manual, confirming a supported programming model, and assigning a value to a programming model variable prolan;
3.2.2. if prolan = OpenMP target, call vcpu and vtarget in HU-xgemm to complete the matrix multiplication calculations at the main processor CPU end and the accelerator Matrix2000 end respectively (dividing the matrix multiplication into a part distributed to the CPU and a part distributed to the accelerator, such as the Matrix2000, GPU or MIC, belongs to the task scheduling category; a known scheduling method can be adopted, such as "a method for dynamically balancing heterogeneous system computing load" (ZL 201410544782.5)); otherwise, go to 3.2.3;
3.2.3. if prolan = COI, call vcpu and vcoi in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator Matrix2000 end respectively; otherwise, go to 3.2.4;
3.2.4. if prolan = SCIF, call vcpu and vscif in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator Matrix2000 end respectively; otherwise, go to 3.3;
3.3 if arc = MIC, call vcpu and vmic in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator MIC end respectively; otherwise, go to 3.4;
3.4 if arc = GPU, call vcpu and vgpu in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator GPU end respectively; otherwise, go to 3.5;
3.5 if there is no special accelerator in the system, that is, only a CPU, call vcpu in HU-xgemm to complete the matrix multiplication calculation.
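The adaptation logic of step three can be sketched as follows; the lspci parsing, the version names and the run_version launcher are illustrative assumptions rather than the packaged HU-xgemm interface:

```cpp
#include <cstdio>
#include <string>

// Stub standing in for launching one packaged HU-xgemm executable.
static void run_version(const char* name) { std::printf("launch %s\n", name); }

// Probe the accelerator type with the lspci query named in step 3.1.
static std::string query_accelerator() {
    std::string out;
    if (FILE* p = popen("lspci | grep -i processor", "r")) {
        char buf[256];
        while (fgets(buf, sizeof buf, p)) out += buf;
        pclose(p);
    }
    return out;
}

int main() {
    const std::string arc = query_accelerator();
    if (arc.find("Matrix2000") != std::string::npos) {
        run_version("vcpu"); run_version("vtarget");  // 3.2, assuming prolan = OpenMP target
    } else if (arc.find("Xeon Phi") != std::string::npos) {
        run_version("vcpu"); run_version("vmic");     // 3.3, MIC
    } else if (arc.find("NVIDIA") != std::string::npos) {
        run_version("vcpu"); run_version("vgpu");     // 3.4, GPU
    } else {
        run_version("vcpu");                          // 3.5, CPU only
    }
    return 0;
}
```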
The invention can achieve the following technical effects:
1. By implementing and integrating general matrix multiplication versions suitable for various accelerators, the invention can adapt to different target accelerators and processors and can perform matrix multiplication adaptively according to different heterogeneous fusion system structures, reducing the difficulty of developing matrix multiplication for accelerators and lightening the workload.
2. The invention carries out matrix multiplication according to the topological structures of the CPUs or accelerators in different heterogeneous fusion system structures, with each FMA calculating in parallel, thereby accelerating the matrix multiplication and improving the use efficiency of a heterogeneous system.
Drawings
FIG. 1 is a general flowchart of the method for optimizing the multiplication and acceleration of the heterogeneous fusion matrix according to the present invention.
Detailed Description
FIG. 1 is the general flowchart of the matrix multiplication acceleration method for a heterogeneous fusion system structure according to the present invention.
The method comprises the following steps:
Firstly, block matrix multiplication versions oriented to the heterogeneous fusion system structure are designed, obtaining the six versions vcpu, vgpu, vmic, vtarget, vcoi and vscif; the specific steps are as follows:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
Define the dimension of matrix A as M × K and the dimension of matrix B as K × N; the dimension of the result matrix C obtained by multiplying A and B is M × N, where M, K and N are positive integers. The element in the p-th row and q-th column of A is a_pq, 0 ≤ p ≤ M-1, 0 ≤ q ≤ K-1; the element in the q-th row and t-th column of B is b_qt, 0 ≤ t ≤ N-1;
1.2 if the heterogeneous fusion system structure consists only of a CPU, initialize the CPU (the MPI programming mode is adopted to realize the CPU matrix multiplication acceleration version vcpu); the method is as follows:
1.2.1. query the architecture manual to obtain the topological structure of the computing cores in one CPU; if the topological structure in the multi-core CPU is mg × ng, that is, there are mg × ng computing cores in one CPU, physically distributed in mg rows × ng columns, number the computing cores sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1), where 0 ≤ i ≤ mg-1 and 0 ≤ j ≤ ng-1;
1.2.2. query the architecture manual to obtain the number me of floating point vector multiply-accumulate functional units FMA owned by each CPU computing core;
1.2.3. Completing the initialization of the multi-core CPU by using an initialization function provided by an operating system;
1.3 matrix division is carried out on A and B according to the topological structure of the CPU: divide matrix A into mg × ng A block matrices, arranged in the same way as the mg × ng computing cores; the i-th row, j-th column A block matrix is denoted A_ij, 0 ≤ i ≤ mg-1, 0 ≤ j ≤ ng-1, and the dimension of each block matrix is m × k, where m = ⌈M/mg⌉ and k = ⌈K/ng⌉, with ⌈ ⌉ denoting rounding up. Divide the K × N matrix B into ng × mg B block matrices, arranged in the same way as the mg × ng computing cores; the r-th row, s-th column B block matrix is denoted B_rs, 0 ≤ r ≤ mg-1, 0 ≤ s ≤ ng-1, and the dimension of each block matrix is k × n, where n = ⌈N/mg⌉;
1.4 initialize the result matrix C to 0, that is, assign each element of C the value 0: let C(mo, no) = 0, where C(mo, no) represents the element in the mo-th row and no-th column of the result matrix, 0 ≤ mo ≤ M-1, 0 ≤ no ≤ N-1;
1.5 matrix multiplication acceleration, obtaining mg × ng × me C block matrices: the mg × ng × me FMAs take the CPU computing core shared memory space as the shared memory space and a scalar data buffer as the data buffer, and execute the block matrix multiplication operations in parallel, thereby accelerating the matrix multiplication; each FMA independently finishes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') finishes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.6 result merging: according to the data distribution principle, take the CPU computing core shared memory space as the shared memory space and merge the results of the shared memory space into the result matrix C; the matrix multiplication realized in the MPI programming mode according to 1.2-1.6 is the CPU matrix multiplication acceleration version vcpu;
1.7 if the heterogeneous fusion system structure consists of a CPU and a GPU, the GPU-oriented matrix multiplication acceleration version vgpu is realized in the CUDA programming mode; the specific method is as follows:
1.7.1 query the architecture manual to obtain the topology of the computing cores in each GPU, namely the topology mg × ng of the thread processing units TPU in the GPU; that is, there are mg × ng TPUs in the GPU, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.7.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA owned by each TPU;
1.7.3, utilizing the function provided by the operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 the mg × ng × me FMAs take the Shared Memory of the GPU as the shared memory space and the Constant Memory of the GPU as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') completes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.7.6 take the Shared Memory of the GPU as the shared memory space and merge the C block matrix results calculated by each FMA using the result merging method of step 1.6 to obtain C;
1.8 if the heterogeneous fusion system structure consists of a CPU and an MIC, the MIC matrix multiplication acceleration version vmic is realized in the Offload programming mode; the specific method is as follows:
1.8.1 query the architecture manual to obtain the computing cores in each MIC, namely the topology mg × ng of the vector processing units VPU in the MIC; that is, there are mg × ng VPUs in one MIC, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.8.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA each VPU possesses;
1.8.3 initialization is accomplished by using the functions provided by the operating system;
1.8.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.8.5 the mg × ng × me FMAs take the memory space of the MIC, namely its memory, as the shared memory space and the portable Cache of the shared memory space as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently finishes the block matrix multiplication assigned to itself, and the FMA numbered (i', j') finishes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.8.6 using the memory space of MIC as the shared storage space, adopting the result merging method in the step 1.6 to finish the merging of the C block matrix results calculated by each FMA, and obtaining C;
1.9 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the SCIF programming mode, the matrix multiplication acceleration version vscif is realized in the SCIF programming mode; the specific method is as follows:
1.9.1 query the architecture manual to obtain the computing cores in each Matrix2000, namely the topology mg × ng of the vector processing elements VPE in the Matrix2000; that is, there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.9.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.9.3 initialization is done using functions provided by the operating system;
1.9.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.9.5 the mg × ng × me FMAs take the Global Cache of the Matrix2000 as the shared memory space and the Array Memory as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently performs the block matrix multiplication assigned to itself, and the FMA numbered (i', j') performs the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.9.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.10 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the COI programming mode, the matrix multiplication acceleration version vcoi is realized in the COI programming mode; the specific method is as follows:
1.10.1 query the architecture manual to obtain the computing cores in the system: the topology of the vector processing elements VPE in the Matrix2000 is mg × ng, that is, there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.10.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA owned by each VPE;
1.10.3, utilizing a function provided by system software to complete initialization;
1.10.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.10.5 the mg × ng × me FMAs take the Global Cache of the Matrix2000 as the shared memory space and the Array Memory as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently performs the block matrix multiplication assigned to itself, and the FMA numbered (i', j') performs the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.10.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results calculated by each FMA by adopting the result merging method in the step 1.6 to obtain C;
1.11 if the heterogeneous fusion system structure consists of a CPU and a Matrix2000 and the Matrix2000 adopts the OpenMP target programming mode, the OpenMP target matrix multiplication acceleration version vtarget is realized in the OpenMP target programming mode; the specific method is as follows:
1.11.1 query the architecture manual to obtain the computing cores in the system, namely the topology mg × ng of the vector processing elements VPE in the Matrix2000; that is, there are mg × ng VPEs in the system, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.11.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.11.3 initialization is done using functions provided by the system software;
1.11.4, dividing matrixes, namely dividing A and B into blocks by adopting the matrix dividing method in the step 1.3 based on the topological structure of the VPE to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.11.5 based on the Global Cache and the shared memory space Array Memory of the Matrix2000, the mg × ng × me FMAs take the Global Cache of the Matrix2000 as the shared memory space and the Array Memory as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently performs the block matrix multiplication assigned to itself, and the FMA numbered (i', j') performs the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.11.6, taking the Array Memory in the Matrix2000 as the shared Memory space, merging the C block Matrix results completed by each FMA by the result merging method in step 1.6 to obtain C;
Step two: integrate the heterogeneous fusion multi-version matrix multiplications; the method is as follows:
2.1 separately compile the source code corresponding to vcpu, vgpu, vmic, vtarget, vcoi and vscif, generating the executable files of vcpu, vgpu, vmic, vtarget, vcoi and vscif;
2.2 use the tar command to pack the executable files of vcpu, vgpu, vmic, vtarget, vcoi and vscif, generating the library file HU-xgemm of the heterogeneous fusion version;
step three, adapting an accelerator in a heterogeneous fusion system structure, wherein the method comprises the following steps:
3.1 query the accelerator type in the heterogeneous fusion system structure with the operating system command lspci | grep processor, and let the accelerator type variable arc be the queried accelerator type;
3.2 if arc = Matrix2000, run 3.2.1; otherwise, go to 3.3;
3.2.1. inquiring a Matrix2000 programming technical manual, confirming a supported programming model, and assigning a value to a programming model variable prolan;
3.2.2. if prolan = OpenMP target, call vcpu and vtarget in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator Matrix2000 end respectively; otherwise, go to 3.2.3;
3.2.3. if prolan = COI, call vcpu and vcoi in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator Matrix2000 end respectively; otherwise, go to 3.2.4;
3.2.4. if prolan = SCIF, call vcpu and vscif in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator Matrix2000 end respectively; otherwise, go to 3.3;
3.3 if arc = MIC, call vcpu and vmic in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator MIC end respectively; otherwise, go to 3.4;
3.4 if arc = GPU, call vcpu and vgpu in HU-xgemm to complete the matrix multiplications at the main processor CPU end and the accelerator GPU end respectively; otherwise, go to 3.5;
3.5 if there is no special accelerator in the system, that is, only a CPU, call vcpu in HU-xgemm to complete the matrix multiplication calculation.

Claims (2)

1. A matrix multiplication accelerating method oriented to a heterogeneous fusion architecture is characterized by comprising the following steps:
the method comprises the following steps of firstly, designing a block matrix multiplication version oriented to a heterogeneous fusion system structure, and specifically:
1.1 configuration and initialization of a heterogeneous fusion system structure, the specific method is as follows:
define the dimension of matrix A as M × K, the dimension of matrix B as K × N, and the dimension of the result matrix C obtained by multiplying A and B as M × N, where M, K and N are positive integers; the element in the p-th row and q-th column of A is a_pq, 0 ≤ p ≤ M-1, 0 ≤ q ≤ K-1; the element in the q-th row and t-th column of B is b_qt, 0 ≤ t ≤ N-1;
1.2 if the heterogeneous fusion system structure only consists of a CPU, initializing the CPU, wherein the method comprises the following steps:
1.2.1. query the architecture manual to obtain the topological structure of the computing cores in one CPU; if the topological structure in the multi-core CPU is mg × ng, that is, there are mg × ng computing cores in one CPU, physically distributed in mg rows × ng columns, number the computing cores sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1), where 0 ≤ i ≤ mg-1 and 0 ≤ j ≤ ng-1;
1.2.2. query the architecture manual to obtain the number me of floating point vector multiply-accumulate functional units FMA owned by each CPU computing core;
1.2.3. Completing the initialization of the multi-core CPU by using an initialization function provided by an operating system;
1.3 according to the topological structure of the CPU, perform matrix division on A and B: divide matrix A into mg × ng A block matrices, arranged in the same way as the mg × ng computing cores; the i-th row, j-th column A block matrix is denoted A_ij, 0 ≤ i ≤ mg-1, 0 ≤ j ≤ ng-1, and the dimension of each block matrix is m × k, where m = ⌈M/mg⌉ and k = ⌈K/ng⌉, with ⌈ ⌉ denoting rounding up; divide the K × N matrix B into ng × mg B block matrices, arranged in the same way as the mg × ng computing cores; the r-th row, s-th column B block matrix is denoted B_rs, 0 ≤ r ≤ mg-1, 0 ≤ s ≤ ng-1, and the dimension of each block matrix is k × n, where n = ⌈N/mg⌉;
1.4 initialize the result matrix C to 0, that is, assign each element of C the value 0: let C(mo, no) = 0, where C(mo, no) represents the element in the mo-th row and no-th column of the result matrix, 0 ≤ mo ≤ M-1, 0 ≤ no ≤ N-1;
1.5 matrix multiplication acceleration, obtaining mg × ng × me C block matrices: the mg × ng × me FMAs take the CPU computing core shared memory space as the shared memory space and a scalar data buffer as the data buffer, and execute the block matrix multiplication operations in parallel, thereby accelerating the matrix multiplication; each FMA independently finishes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') finishes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.6 result merging: according to the data distribution principle, take the CPU computing core shared memory space as the shared memory space and merge the results of the shared memory space into the result matrix C; the matrix multiplication realized in the MPI programming mode according to 1.2-1.6 is the CPU matrix multiplication acceleration version vcpu;
1.7 if the heterogeneous fusion system structure consists of a CPU and a GPU, a GPU-oriented matrix multiplication acceleration version vgpu is realized in the CUDA programming mode; the specific method is as follows:
1.7.1 query the architecture manual to obtain the topology of the computing cores in each GPU, namely the topology mg × ng of the thread processing units TPU in the GPU; that is, there are mg × ng TPUs in the GPU, physically distributed in mg rows × ng columns, numbered sequentially as (0,0), (0,1), …, (0,ng-1), (1,0), (1,1), …, (1,ng-1), ……, (i,0), (i,1), …, (i,j), …, (i,ng-1), ……, (mg-1,0), (mg-1,1), …, (mg-1,ng-1);
1.7.2 query the architecture manual to obtain the number me of vector floating point multiply-accumulate functional units FMA owned by each TPU;
1.7.3, utilizing the function provided by the operating system to complete initialization;
1.7.4 matrix division, namely, partitioning the A and the B by adopting the matrix division method in the step 1.3 based on the topological structure of the GPU to obtain mg multiplied by ng A block matrixes and ng multiplied by mg B block matrixes;
1.7.5 the mg × ng × me FMAs take the Shared Memory of the GPU as the shared memory space and the Constant Memory of the GPU as the data buffer, and execute matrix multiplication in parallel using the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA numbered (i', j') completes the block matrix multiplications C_{i'k''} = C_{i'k''} + A_{i'j'} × B_{j'k''};
1.7.6 Taking the Shared Memory of the GPU as the shared storage space, merge the C block matrix results calculated by each FMA by the result merging method of step 1.6 to obtain C;
1.8 If the heterogeneous fusion architecture consists of a CPU and an MIC, the MIC matrix multiplication acceleration version v_mic is implemented in the Offload programming mode. The specific method is as follows:
1.8.1 Query the architecture handbook to obtain the topology of the computing cores in each MIC, namely the topology mg × ng of the vector processors VPU in the MIC: there are mg × ng VPUs in one MIC, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i,j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.8.2 Query the architecture handbook to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA owned by each VPU;
1.8.3 initialization is accomplished by using the functions provided by the operating system;
1.8.4 Matrix division: partition A and B by the matrix division method of step 1.3 based on the topological structure of the VPU, obtaining mg × ng A block matrixes and ng × mg B block matrixes;
1.8.5 The mg × ng × m_e FMAs take the memory space of the MIC, namely the memory, as the shared storage space and its portable Cache as the data buffer, and execute the matrix multiplication in parallel by the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA with serial number (i', j') completes the block matrix multiplication
C_(i',j') = A_(i',j') × B_(j',i');
1.8.6 Taking the memory space of the MIC as the shared storage space, merge the C block matrix results calculated by each FMA by the result merging method of step 1.6 to obtain C;
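The Offload programming mode of step 1.8 refers to compiler-directive offload to the MIC card. A hedged C fragment of what a v_mic-style offload region could look like (Intel's #pragma offload syntax; all names and sizes are illustrative, not the patented code):

/* Offload one block product to the MIC; inside the region the card's
 * cores/VPUs share the iterations via OpenMP. Requires a compiler
 * supporting Intel's Language Extensions for Offload. */
void mic_block_gemm(const double *Ablk, const double *Bblk,
                    double *Cblk, int m, int k, int n) {
    #pragma offload target(mic) in(Ablk : length(m * k)) \
                                in(Bblk : length(k * n)) \
                                inout(Cblk : length(m * n))
    {
        #pragma omp parallel for
        for (int r = 0; r < m; r++)       /* each thread owns rows r */
            for (int t = 0; t < k; t++)
                for (int c = 0; c < n; c++)
                    Cblk[r * n + c] += Ablk[r * k + t] * Bblk[t * n + c];
    }
}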
1.9 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the SCIF programming mode, the matrix multiplication acceleration version v_scif is implemented in the SCIF programming mode. The specific method is as follows:
1.9.1 Query the architecture handbook to obtain the topology mg × ng of the computing cores in each Matrix2000, namely the vector processing units VPE in the Matrix2000: there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i,j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.9.2 Query the architecture handbook to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.9.3 initialization is done using functions provided by the operating system;
1.9.4 Matrix division: partition A and B by the matrix division method of step 1.3 based on the topological structure of the VPE, obtaining mg × ng A block matrixes and ng × mg B block matrixes;
1.9.5 The mg × ng × m_e FMAs take the Global Cache (global space) of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute the matrix multiplication in parallel by the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA with serial number (i', j') completes the block matrix multiplication
C_(i',j') = A_(i',j') × B_(j',i');
1.9.6 Taking the Array Memory in the Matrix2000 as the shared storage space, merge the C block matrix results calculated by each FMA by the result merging method of step 1.6 to obtain C;
1.10 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the COI programming mode, the matrix multiplication acceleration version v_coi is implemented in the COI programming mode. The specific method is as follows:
1.10.1 Query the architecture handbook to obtain the computing cores in the system, namely the topology mg × ng of the vector processing units VPE in the Matrix2000: there are mg × ng VPEs in the Matrix2000, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i,j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.10.2 Query the architecture handbook to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA owned by each VPE;
1.10.3 Complete initialization by using the functions provided by the system software;
1.10.4 Matrix division: partition A and B by the matrix division method of step 1.3 based on the topological structure of the VPE, obtaining mg × ng A block matrixes and ng × mg B block matrixes;
1.10.5 The mg × ng × m_e FMAs take the Global Cache (global space) of the Matrix2000 as the shared storage space and the Array Memory as the data buffer, and execute the matrix multiplication in parallel by the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA with serial number (i', j') completes the block matrix multiplication
C_(i',j') = A_(i',j') × B_(j',i');
1.10.6 Taking the Array Memory in the Matrix2000 as the shared storage space, merge the C block matrix results calculated by each FMA by the result merging method of step 1.6 to obtain C;
1.11 If the heterogeneous fusion architecture consists of a CPU and a Matrix2000 and the Matrix2000 adopts the OpenMP target programming mode, the OpenMP target matrix multiplication acceleration version v_target is implemented in the OpenMP target programming mode. The specific method is as follows:
1.11.1 Query the architecture handbook to obtain the computing cores in the system, namely the topology mg × ng of the vector processing units VPE in the Matrix2000: there are mg × ng VPEs in the system, physically distributed in mg rows × ng columns and numbered sequentially (0,0), (0,1), … (0, ng-1), (1,0), (1,1), … (1, ng-1), … …, (i,0), (i,1), …, (i,j), …, (i, ng-1), … … (mg-1,0), (mg-1,1), … (mg-1, ng-1);
1.11.2 Query the architecture handbook to obtain the number m_e of vector floating-point multiply-accumulate functional units FMA (Fused Multiply-Add) owned by each VPE;
1.11.3 initialization is done using functions provided by the system software;
1.11.4 Matrix division: partition A and B by the matrix division method of step 1.3 based on the topological structure of the VPE, obtaining mg × ng A block matrixes and ng × mg B block matrixes;
1.11.5 Based on the Global Cache (global space) and the Array Memory (shared memory space) of the Matrix2000, the mg × ng × m_e FMAs take the Global Cache as the shared storage space and the Array Memory as the data buffer, and execute the matrix multiplication in parallel by the matrix multiplication acceleration method of step 1.5; each FMA independently completes the block matrix multiplication allocated to itself, and the FMA with serial number (i', j') completes the block matrix multiplication
C_(i',j') = A_(i',j') × B_(j',i');
1.11.6 Taking the Array Memory in the Matrix2000 as the shared storage space, merge the C block matrix results completed by each FMA by the result merging method of step 1.6 to obtain C;
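The OpenMP target programming mode of step 1.11 uses the standard device-offload directives of OpenMP 4.5 and later. A minimal, hedged C sketch of a v_target-style offloaded block product (target_block_gemm is a hypothetical name; the map clauses mirror the shared-storage/data-buffer split only loosely):

/* Offload one block product with standard OpenMP target directives;
 * the collapsed (r, c) iteration space is spread over the device's
 * processing elements. */
void target_block_gemm(const double *Ablk, const double *Bblk,
                       double *Cblk, int m, int k, int n) {
    #pragma omp target teams distribute parallel for collapse(2) \
            map(to: Ablk[0:m*k], Bblk[0:k*n]) map(tofrom: Cblk[0:m*n])
    for (int r = 0; r < m; r++)
        for (int c = 0; c < n; c++) {
            double acc = Cblk[r * n + c];
            for (int t = 0; t < k; t++)
                acc += Ablk[r * k + t] * Bblk[t * n + c];
            Cblk[r * n + c] = acc;
        }
}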
The second step: integrate the heterogeneous fusion multi-version matrix multiplications. The method is as follows:
2.1 Separately compile the source code corresponding to v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif, generating the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif;
2.2 Pack the executable files of v_cpu, v_gpu, v_mic, v_target, v_coi and v_scif with the tar command to generate the library file HU-xgemm of the heterogeneous fusion version;
The third step: adapt the accelerator in the heterogeneous fusion architecture. The method is as follows:
3.1 Query the accelerator type in the heterogeneous fusion architecture with the operating system command lspci | grep processor, and set the accelerator type variable arc to the accelerator type;
3.2 If arc = Matrix2000, execute 3.2.1; otherwise, go to 3.3;
3.2.1 Query the Matrix2000 programming technical manual, confirm the supported programming model, and assign the programming model variable prolan;
3.2.2 If prolan = OpenMP target, call v_cpu and v_target in HU-xgemm to finish the matrix multiplication at the host CPU end and at the accelerator Matrix2000 end respectively; otherwise, go to 3.2.3;
3.2.3 If prolan = COI, call v_cpu and v_coi in HU-xgemm to finish the matrix multiplication at the host CPU end and at the accelerator Matrix2000 end respectively; otherwise, go to 3.2.4;
3.2.4 If prolan = SCIF, call v_cpu and v_scif in HU-xgemm to finish the matrix multiplication at the host CPU end and at the accelerator Matrix2000 end respectively; otherwise, go to 3.3;
3.3 If arc = MIC, call v_cpu and v_mic in HU-xgemm to finish the matrix multiplication at the host CPU end and at the accelerator MIC end respectively; otherwise, go to 3.4;
3.4 If arc = GPU, call v_cpu and v_gpu in HU-xgemm to finish the matrix multiplication at the host CPU end and at the accelerator GPU end respectively; otherwise, go to 3.5;
3.5 If there is no special accelerator in the system, i.e. only a CPU, call v_cpu in HU-xgemm to complete the matrix multiplication calculation.
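The adaptation of step three is a straightforward dispatch on the queried accelerator type. A hedged C sketch (the entry points hu_xgemm_cpu, hu_xgemm_gpu, hu_xgemm_mic and hu_xgemm_target are hypothetical stand-ins for invoking the packed versions, stubbed out here so the sketch runs):

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the packed HU-xgemm versions. */
static void hu_xgemm_cpu(void)    { puts("run v_cpu"); }
static void hu_xgemm_gpu(void)    { puts("run v_gpu"); }
static void hu_xgemm_mic(void)    { puts("run v_mic"); }
static void hu_xgemm_target(void) { puts("run v_target"); }

/* Step three: query the accelerator with lspci and pick the matching
 * matrix multiplication version(s). */
static void hu_xgemm_dispatch(void) {
    char line[512], arc[64] = "none";
    FILE *p = popen("lspci | grep processor", "r");
    if (p) {
        while (fgets(line, sizeof line, p)) {
            if (strstr(line, "Matrix2000")) strcpy(arc, "Matrix2000");
            else if (strstr(line, "MIC"))   strcpy(arc, "MIC");
            else if (strstr(line, "GPU"))   strcpy(arc, "GPU");
        }
        pclose(p);
    }
    if (strcmp(arc, "Matrix2000") == 0) {
        hu_xgemm_cpu(); hu_xgemm_target();   /* 3.2: target/COI/SCIF per prolan */
    } else if (strcmp(arc, "MIC") == 0) {
        hu_xgemm_cpu(); hu_xgemm_mic();      /* 3.3 */
    } else if (strcmp(arc, "GPU") == 0) {
        hu_xgemm_cpu(); hu_xgemm_gpu();      /* 3.4 */
    } else {
        hu_xgemm_cpu();                      /* 3.5: CPU only */
    }
}

int main(void) { hu_xgemm_dispatch(); return 0; }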
2. The matrix multiplication acceleration method for a heterogeneous fusion architecture according to claim 1, wherein the function provided by the operating system in step 1.2.3 refers to an initialization function.
CN201910076766.0A 2019-01-27 2019-01-27 Matrix multiplication acceleration method for heterogeneous fusion system structure Active CN109871512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910076766.0A CN109871512B (en) 2019-01-27 2019-01-27 Matrix multiplication acceleration method for heterogeneous fusion system structure

Publications (2)

Publication Number Publication Date
CN109871512A (en) 2019-06-11
CN109871512B (en) 2020-05-22

Family

ID=66918078

Country Status (1)

Country Link
CN (1) CN109871512B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415162B (en) * 2019-07-22 2020-03-31 中国人民大学 Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Citations (1)

Publication number Priority date Publication date Assignee Title
CN106775594A (en) * 2017-01-13 2017-05-31 中国科学院软件研究所 A kind of Sparse Matrix-Vector based on the domestic processor of Shen prestige 26010 multiplies isomery many-core implementation method

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN103744682B (en) * 2014-01-24 2017-02-08 中国科学院自动化研究所 System and method for separate compilation of heterogeneous mixed programs
CN104317768B (en) * 2014-10-15 2017-02-15 中国人民解放军国防科学技术大学 Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system
CN104346318B (en) * 2014-10-15 2017-03-15 中国人民解放军国防科学技术大学 Matrix Multiplication accelerated method towards general multi-core DSP
CN104820613B (en) * 2015-05-27 2018-03-27 北京思朗科技有限责任公司 A kind of Compilation Method of heterogeneous polynuclear program
CN105242962B (en) * 2015-11-24 2018-07-03 无锡江南计算技术研究所 The quick triggering method of lightweight thread based on isomery many-core


Also Published As

Publication number Publication date
CN109871512A (en) 2019-06-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant