CN106293638B - Heterogeneous format storage method based on a CPU and GPU heterogeneous platform - Google Patents

Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Info

Publication number
CN106293638B
CN106293638B (application CN201510320501.2A)
Authority
CN
China
Prior art keywords
gpu
data
cpu
executes
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510320501.2A
Other languages
Chinese (zh)
Other versions
CN106293638A (en)
Inventor
陶袁
任可欣
付军
杜奕秋
赵志文
姜艳成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Normal University
Original Assignee
Jilin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin Normal University
Priority to CN201510320501.2A
Publication of CN106293638A
Application granted
Publication of CN106293638B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)

Abstract

The present invention relates to a heterogeneous format storage method for a heterogeneous platform composed of a CPU and a GPU. The method comprises the following steps. Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU. Step 2: use the measurements of Step 1 to compute the number of segments for segmented transmission. Step 3: perform data format conversion on the first data segment. Step 4: upload the converted data to the GPU; while the GPU computes, perform format conversion on the next data segment. Step 5: determine whether the segment currently being computed is the last data segment; if so, the algorithm is complete, otherwise return to Step 4. Because the invention stores the same data in a different format on each type of processor, it focuses on hiding the overhead of the format conversion that the CPU performs for the data the GPU needs, so that the algorithm achieves higher computing throughput on the node as a whole. The method has broad practical value and application prospects in compute-intensive fields such as high-performance computing.

Description

Heterogeneous format storage method based on a CPU and GPU heterogeneous platform
Technical field
The present invention relates to heterogeneous platforms composed of a CPU and a GPU. Because the architectures of the CPU and the GPU differ greatly, the same data is stored in a different format on the CPU side and on the GPU side in order to fully exploit the data-processing capability of both processors. More specifically, the invention concerns a way of hiding the conversion overhead incurred when the CPU converts GPU-side data into the required format, on a platform where the same data is stored in heterogeneous formats on the CPU side and the GPU side. The invention belongs to the field of computer architecture.
Technical background
The compute nodes of many current high-performance computers consist of two processors with different architectures, a multi-core CPU and a GPU. Because the two architectures differ considerably, a parallel algorithm operating on data stored in a single data structure achieves very different computing throughput on the two processors. To ensure that both processors achieve high throughput, the data can be stored in a different data structure for each processor, but this forces the CPU to perform format conversion, and the resulting overhead lowers the total throughput of the algorithm on the heterogeneous node. The way to solve this problem is to hide the overhead of the CPU-side format conversion, so as to improve the total computing throughput of the algorithm on the heterogeneous platform.
Sparse matrix operations are core kernels of many high-performance applications; for example, the products of a sparse matrix, and of its transpose, with a dense vector are widely used in science and engineering, including singular value decomposition and graph computation.
As the scale of sparse matrix computations grows, most sparse matrix operations are carried out in parallel on high-performance computers. In this setting, explicitly transposing a sparse matrix introduces considerable overhead, and for an algorithm that multiplies both a sparse matrix and its transpose by a dense vector, storing both the matrix and its transpose would waste a large amount of memory, since both are involved at the same time. Aydin Buluc et al. showed that the compressed sparse blocks storage format allows both the sparse matrix-dense vector product and the transpose-free sparse matrix-transpose-dense vector product to be implemented in parallel on multi-core CPU platforms with high, consistent computing throughput. Tao Yuan et al. used an extended compressed sparse block storage format to obtain high, consistent throughput for the same two products, again without explicit transposition, on GPU platforms. However, because the two storage formats of the same data differ, using both formats at once on a heterogeneous platform composed of a CPU and a GPU introduces the extra overhead of format conversion. Studying how to hide this conversion overhead is therefore of great significance for ensuring that the products of a sparse matrix and of its transpose with a dense vector achieve high computing throughput on a heterogeneous CPU and GPU platform.
Summary of the invention
The object of the present invention is to provide a heterogeneous format storage method for a heterogeneous platform composed of a CPU and a GPU. By using heterogeneous storage formats on the CPU and GPU platform and hiding the overhead of the data format conversion performed by the CPU, the method improves the total computing throughput of the algorithm on the heterogeneous platform composed of the CPU and the GPU.
The technical scheme of the present invention is as follows:
On a heterogeneous platform composed of a CPU and a GPU, the two processors with different architectures store the data in heterogeneous formats. Instead of transmitting all the data from the CPU to the GPU in a single transfer, the data is split into segments and transmitted in several transfers. While one segment is transmitted and computed on the GPU, the CPU performs data format conversion on the next segment, so that uploading data to the GPU and computing on the GPU run in parallel with the CPU-side format conversion, thereby hiding the overhead of the format conversion performed by the CPU. The specific implementation comprises the following steps:
Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU;
Step 2: compute the number of segments for segmented data transmission;
Step 3: perform data format conversion on the first data segment;
Step 4: upload the converted data to the GPU; the GPU performs the computation while the CPU performs format conversion on the next data segment;
Step 5: determine whether the data being computed on the GPU is the last data segment; if so, the algorithm is complete; otherwise return to Step 4.
In Step 2, the data to be computed on the GPU is divided into segments according to the actual data scale measured in Step 1 and the actual communication bandwidth between the CPU and the GPU, which gives the technique platform portability and generality with respect to the data.
Converting the format of the first data segment in Step 3, before the loop formed by Steps 4 and 5, ensures that uploading converted data to the GPU and computing on the GPU can run in parallel with the CPU-side format conversion of the next data segment.
In Step 4, the three operations of uploading the converted data to the GPU, computing on the GPU, and converting the format of the next segment on the CPU act on a data segment whose size is determined dynamically during program execution; because the three operations are executed by different devices, their concurrency is achieved automatically. "The GPU performs the computation" means that the GPU executes the operation prescribed by the algorithm. A minimal code sketch of this overlapped structure is given below.
In Step 5, whether the current segment is the last data segment, and hence whether the loop continues, is determined from a variable maintained during program execution.
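To make the overlapped structure of Steps 3 to 5 concrete, the following is a minimal sketch in CUDA C++ of how the segmented upload, the GPU computation, and the CPU-side format conversion of the next segment can be interleaved. The double-buffered pinned host memory, the single CUDA stream, and the names convert_segment, process_segment_kernel, and run_pipeline are illustrative assumptions and are not prescribed by the patent.

#include <cuda_runtime.h>
#include <cstddef>

// Placeholder for the real CPU-side format conversion of one segment
// (assumption: a plain copy standing in for the actual conversion).
void convert_segment(const float* src, float* dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}

// Placeholder GPU kernel standing in for the computation prescribed by the algorithm.
__global__ void process_segment_kernel(const float* seg, float* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * seg[i];
}

// Segmented pipeline: convert segment 0, then in each loop iteration upload and
// compute the current segment on the GPU while the CPU converts the next one.
void run_pipeline(const float* cpu_data, float* d_out, size_t total, size_t num_segments) {
    size_t seg_size = (total + num_segments - 1) / num_segments;

    float* h_buf[2];  // pinned host buffers for double buffering
    cudaHostAlloc((void**)&h_buf[0], seg_size * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_buf[1], seg_size * sizeof(float), cudaHostAllocDefault);

    float* d_buf;
    cudaMalloc((void**)&d_buf, seg_size * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Step 3: convert the first segment before entering the loop.
    size_t n0 = (total < seg_size) ? total : seg_size;
    convert_segment(cpu_data, h_buf[0], n0);

    for (size_t i = 0; i < num_segments; ++i) {  // Steps 4 and 5
        size_t offset = i * seg_size;
        size_t n = (offset + seg_size <= total) ? seg_size : total - offset;
        int cur = (int)(i & 1);

        // Upload the converted segment and launch the kernel asynchronously.
        cudaMemcpyAsync(d_buf, h_buf[cur], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process_segment_kernel<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(
            d_buf, d_out + offset, n);

        // While the GPU works, the CPU converts the next segment.
        if (i + 1 < num_segments) {
            size_t next_off = (i + 1) * seg_size;
            size_t next_n = (next_off + seg_size <= total) ? seg_size : total - next_off;
            convert_segment(cpu_data + next_off, h_buf[1 - cur], next_n);
        }

        cudaStreamSynchronize(stream);  // finish the current segment before reusing buffers
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf[0]);
    cudaFreeHost(h_buf[1]);
}

In this sketch the host-to-device copy and the kernel for the current segment are issued asynchronously on the stream, so the CPU thread is free to convert the next segment before waiting on the stream; this is exactly the overlap described in Step 4.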
The present invention has the following advantages:
On a heterogeneous platform composed of a CPU and a GPU, where the same data is stored in heterogeneous formats on the CPU side and the GPU side, the invention hides the overhead of the format conversion performed by the CPU. Compared with the prior art, its main advantages are:
(1) Storing the data in a different format on the CPU side and on the GPU side allows the algorithm to achieve high computing throughput on both types of processor;
(2) While the format-converted data of one segment is uploaded to the GPU and the GPU computes, the CPU performs the data format conversion of the next segment in parallel, thereby hiding the format conversion overhead.
Brief description of the drawings
Fig. 1 is a flow chart of using heterogeneous formats on the heterogeneous platform composed of a CPU and a GPU according to the present invention, so that the overhead of the format conversion performed by the CPU is hidden.
Fig. 2 is a flow diagram of the algorithm that multiplies a sparse matrix and its transpose by a dense vector using heterogeneous formats on the heterogeneous platform composed of a CPU and a GPU according to the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific examples.
As shown in Fig. 1, the present invention provides an implementation method for hiding the overhead of the CPU-side format conversion when the different types of processor on a heterogeneous platform composed of a CPU and a GPU store data in heterogeneous formats. The CPU-side format conversion is implemented with multi-threaded parallelism. The specific method comprises the following steps:
Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU;
The main function of Step 1 is to determine the scale of the data to be computed on the GPU and to measure the communication bandwidth between the CPU and the GPU. To hide the overhead of the CPU-side data format conversion, the data to be computed on the GPU is divided into segments by size, and uploading each converted segment to the GPU and computing on the GPU run in parallel with the CPU-side format conversion of the next segment. This step determines the scale of the data the GPU will compute and the bandwidth at which the CPU uploads data to the GPU.
Step 2: compute the number of segments for the GPU computation;
The main function of Step 2 is to compute the number of data segments from the measurements of Step 1. If the scale of the data to be accelerated on the GPU is below a certain threshold, the data is transmitted and processed in a single transfer; that is, the scheme of converting multiple segments on the CPU in parallel with multi-segment uploads to the GPU and GPU computation is not used. A sketch of one possible segment-count heuristic is given after this walkthrough.
Step 3: perform data format conversion on the first segment;
The main function of Step 3 is to perform format conversion on the first segment of the data to be computed on the GPU. No matter how many segments the data to be accelerated on the GPU is converted in, the latency of converting the first segment cannot be hidden behind the upload of the other segments to the GPU;
Step 4: upload the converted data to the GPU; the GPU performs the computation while the CPU performs format conversion on the next data segment;
The main function of Step 4 is to upload the data whose format conversion has completed to the GPU, have the GPU perform the computation, and have the CPU perform format conversion on the next data segment. Among these three operations, uploading the data to the GPU and the GPU computation are serial with respect to each other, since both act on the current data segment, but both run in parallel with the CPU-side format conversion of the next segment;
Step 5: determine whether the data currently computed on the GPU is the last data segment; if so, the algorithm is complete; otherwise return to Step 4;
The main function of Step 5 is to determine whether the computation finishes with the current segment;
The main idea of the present invention is that, on a heterogeneous platform composed of a CPU and a GPU whose processors of different architectures store data in heterogeneous formats, the entire data set is divided into segments, and uploading data to the GPU, computing on the GPU, and the CPU-side format conversion are carried out in parallel, thereby hiding the CPU-side data format conversion. This provides technical support for achieving higher computing throughput on a heterogeneous platform composed of a CPU and a GPU.
First, the scale of the data to be computed on the GPU and the communication bandwidth between the CPU and the GPU are determined, and from them the number of segments transmitted between the CPU and the GPU is derived. Then, according to the determined size of each segment, format conversion is performed on the first data segment. Next, the converted data is uploaded to the GPU and the GPU computes while the CPU performs format conversion on the next segment. Finally, it is checked whether the computation is complete.
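Step 2 of this walkthrough derives the number of segments from the measured data scale and the CPU and GPU communication bandwidth, and falls back to a single transfer when the data is below a threshold. The patent does not specify a formula, so the heuristic below, including the 4 MiB threshold, the 2 ms target transfer time per segment, and the name choose_num_segments, is purely an assumed illustration.

#include <algorithm>
#include <cstddef>

// Assumed heuristic: below a small-data threshold, transfer everything at once;
// otherwise size each segment so that its transfer takes roughly a fixed target time,
// giving the CPU-side conversion of the next segment time to hide behind it.
size_t choose_num_segments(size_t bytes, double bandwidth_bytes_per_s) {
    const size_t small_threshold = 4u << 20;          // 4 MiB: below this, one transfer
    const double target_seconds_per_segment = 2e-3;   // aim for about 2 ms per transfer

    if (bytes <= small_threshold) return 1;

    size_t seg_bytes = (size_t)(bandwidth_bytes_per_s * target_seconds_per_segment);
    seg_bytes = std::max<size_t>(seg_bytes, 1);
    return (bytes + seg_bytes - 1) / seg_bytes;       // ceiling division
}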
The following uses the algorithm that multiplies a sparse matrix and its transpose by a dense vector as an example, as shown in Fig. 2, comprising the following steps:
Step 1: determine the number of rows, the number of columns, and the number of nonzero elements of the sparse matrix, and measure the communication bandwidth between the CPU and the GPU;
Step 2: compute the number of segments from the measurements of Step 1;
Step 3: according to the preceding results, perform format conversion on the first segment of the sparse matrix;
Step 4: upload the converted nonzero elements of the sparse matrix to the GPU; the GPU multiplies the sparse matrix, or its transpose, by the vector, while the CPU performs format conversion on the nonzero elements of the next segment of the sparse matrix;
Step 5: determine whether the current segment is the last one; if so, the algorithm is complete; otherwise return to Step 4.
This example mainly studies the implementation of heterogeneous formats on a heterogeneous platform composed of a CPU and a GPU. The method depends only on whether the platform is a heterogeneous platform composed of a CPU and a GPU and is independent of the specific CPU and GPU models. A sketch of one possible per-segment format conversion for this example is given below.
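For the sparse matrix example of Fig. 2, the CPU-side conversion of each segment rewrites a block of rows of the matrix from the CPU storage format into the GPU storage format. The patent does not fix the two concrete formats, so the sketch below, which repacks a row range of a CSR matrix into a padded ELL-style block as one plausible GPU-friendly layout, is an assumption; the names CsrMatrix, EllSegment, and convert_rows_to_ell are likewise illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed CPU-side format for this example: standard compressed sparse row (CSR).
struct CsrMatrix {
    std::vector<size_t> row_ptr;  // size rows + 1
    std::vector<int>    col_idx;  // size nnz
    std::vector<float>  values;   // size nnz
};

// Assumed GPU-side format for one segment: an ELL-style block of rows,
// padded to the longest row so the GPU can index it with a regular stride.
struct EllSegment {
    size_t rows = 0;
    size_t width = 0;             // maximum nonzeros per row in this segment
    std::vector<int>   col_idx;   // rows * width entries, padded with -1
    std::vector<float> values;    // rows * width entries, padded with 0
};

// Convert rows [row_begin, row_end) of a CSR matrix into one ELL segment.
EllSegment convert_rows_to_ell(const CsrMatrix& a, size_t row_begin, size_t row_end) {
    EllSegment seg;
    seg.rows = row_end - row_begin;
    for (size_t r = row_begin; r < row_end; ++r)
        seg.width = std::max(seg.width, a.row_ptr[r + 1] - a.row_ptr[r]);

    seg.col_idx.assign(seg.rows * seg.width, -1);
    seg.values.assign(seg.rows * seg.width, 0.0f);

    for (size_t r = row_begin; r < row_end; ++r) {
        size_t local = r - row_begin;
        size_t k = 0;
        for (size_t j = a.row_ptr[r]; j < a.row_ptr[r + 1]; ++j, ++k) {
            seg.col_idx[local * seg.width + k] = a.col_idx[j];
            seg.values[local * seg.width + k] = a.values[j];
        }
    }
    return seg;
}

Each such segment is then what Step 4 of the example uploads to the GPU while the CPU converts the next block of rows.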
Finally, it should be noted that the above examples are intended only to illustrate, and not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the invention may still be modified or equivalently substituted without departing from its spirit or scope, and any such modification or equivalent substitution shall be included within the scope of the claims of the present invention.

Claims (1)

1. A heterogeneous format storage method based on a heterogeneous platform composed of a CPU and a GPU, the method being, in the case where the format in which the same data is computed on the CPU differs from the format in which it is computed on the GPU, a method of hiding the overhead incurred when the CPU converts the data into the format needed on the GPU, characterized in that the method comprises the following steps:
Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU;
Step 2: compute the number of segments for segmented transmission from the measurements of Step 1;
Step 3: perform format conversion on the first data segment;
Step 4: upload the converted data to the GPU; the GPU performs the computation while the CPU performs format conversion on the next data segment;
Step 5: determine whether the current segment is the last data segment; if so, the algorithm terminates; otherwise return to Step 4;
wherein, in Step 2, the data to be computed on the GPU is divided into segments according to the actual data scale measured in Step 1 and the actual communication bandwidth between the CPU and the GPU, which gives the technique platform portability and generality with respect to the data;
converting the format of the first data segment in Step 3, before the loop formed by Steps 4 and 5, ensures that uploading the converted data to the GPU and computing on the GPU can run in parallel with the CPU-side format conversion of the next data segment;
in Step 4, the three operations of uploading the converted data to the GPU, computing on the GPU, and converting the format of the next segment on the CPU act on a data segment whose size is determined dynamically during program execution; because the three operations are executed by different devices, their concurrency is achieved automatically, and the GPU computation refers to the GPU executing the operation prescribed by the accelerated algorithm;
in Step 5, whether the current segment is the last data segment, and hence whether the loop continues, is determined from a variable maintained during program execution.
CN201510320501.2A 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform Expired - Fee Related CN106293638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510320501.2A CN106293638B (en) 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510320501.2A CN106293638B (en) 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Publications (2)

Publication Number Publication Date
CN106293638A CN106293638A (en) 2017-01-04
CN106293638B 2019-08-06

Family

ID=57659392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510320501.2A Expired - Fee Related CN106293638B (en) 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Country Status (1)

Country Link
CN (1) CN106293638B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103033783A (en) * 2012-12-10 2013-04-10 深圳先进技术研究院 Magnetic resonance imaging fast reconstruction system and method thereof
CN103841389A (en) * 2014-04-02 2014-06-04 北京奇艺世纪科技有限公司 Video playing method and player

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393610B (en) * 2008-11-11 2010-08-25 北大方正集团有限公司 Transmission method and apparatus for dot matrix data
US9317482B2 (en) * 2012-10-14 2016-04-19 Microsoft Technology Licensing, Llc Universal FPGA/ASIC matrix-vector multiplication architecture
CN103108186A (en) * 2013-02-21 2013-05-15 中国对外翻译出版有限公司 Method of achieving high-definition transmission of videos

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103033783A (en) * 2012-12-10 2013-04-10 深圳先进技术研究院 Magnetic resonance imaging fast reconstruction system and method thereof
CN103841389A (en) * 2014-04-02 2014-06-04 北京奇艺世纪科技有限公司 Video playing method and player

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Peng Liu et al., "Parallel Processing of Massive Remote Sensing Images in a GPU Architecture," Computing and Informatics, vol. 33, no. 1, 2014, pp. 197-217.
Yuan Tao et al., "Atomic reduction based sparse matrix-transpose vector multiplication on GPUs," 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2015, pp. 987-992.

Also Published As

Publication number Publication date
CN106293638A (en) 2017-01-04

Similar Documents

Publication Publication Date Title
CN109086678B (en) Pedestrian detection method for extracting image multilevel features based on deep supervised learning
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
WO2017124644A1 (en) Artificial neural network compression encoding device and method
EP4016398A1 (en) Apparatus and method for distributed training model, and computer program product
CN104731729B (en) A kind of table connection optimization method, CPU and accelerator based on heterogeneous system
WO2022001550A1 (en) Address generation method, related device and storage medium
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20220350607A1 (en) Method of executing operation, electronic device, and computer-readable storage medium
CN102902657A (en) Method for accelerating FFT (Fast Fourier Transform) by using GPU (Graphic Processing Unit)
CN105022631A (en) Scientific calculation-orientated floating-point data parallel lossless compression method
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
Ji et al. Parallelizing word2vec in multi-core and many-core architectures
CN110086602A (en) The Fast implementation of SM3 cryptographic Hash algorithms based on GPU
CN104850516B (en) A kind of DDR Frequency Conversion Designs method and apparatus
CN103577161A (en) Big data frequency parallel-processing method
CN105183562A (en) Method for conducting degree drawing on grid data on basis of CUDA technology
CN106293638B (en) Heterogeneous formats storage method based on CPU Yu GPU heterogeneous platform
WO2024016946A1 (en) Cost estimation method, electronic device, storage medium and computer program product
CN103593304A (en) Quantization method for efficiently using caches on basis of parallel device model
CN104636315A (en) GPDSP-oriented matrix LU decomposition vectorization calculation method
CN115983343A (en) YOLOv4 convolutional neural network lightweight method based on FPGA
CN103678255A (en) FFT efficient parallel achieving optimizing method based on Loongson number three processor
CN103746771A (en) Data format conversion method of channel coding and decoding based on GPP and SIMD technologies
US20230244974A1 (en) Quantum state processing method, computing device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190806

Termination date: 20210611