CN106293638B - Heterogeneous format storage method based on a CPU and GPU heterogeneous platform - Google Patents

Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Info

Publication number
CN106293638B
CN106293638B (application CN201510320501.2A)
Authority
CN
China
Prior art keywords
gpu
data
cpu
executes
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510320501.2A
Other languages
Chinese (zh)
Other versions
CN106293638A (en)
Inventor
陶袁
任可欣
付军
杜奕秋
赵志文
姜艳成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Normal University
Original Assignee
Jilin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin Normal University
Priority to CN201510320501.2A
Publication of CN106293638A
Application granted
Publication of CN106293638B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)

Abstract

The present invention relates to a heterogeneous format storage method for a heterogeneous platform composed of a CPU and a GPU. The method comprises the following steps. Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU. Step 2: use the measurements of Step 1 to compute the number of segments for segmented transmission. Step 3: perform data format conversion on the first data segment. Step 4: upload the converted data to the GPU; while the GPU computes, perform format conversion on the next data segment. Step 5: determine whether the segment currently being computed is the last data segment; if so, the algorithm is complete, otherwise return to Step 4. Because the invention stores the same data in a different format on each type of processor, it focuses on hiding the overhead of the format conversion that the CPU performs for the data the GPU needs, so that the algorithm achieves higher computing throughput on the node as a whole. The method has broad practical value and application prospects in compute-intensive fields such as high-performance computing.

Description

Heterogeneous format storage method based on a CPU and GPU heterogeneous platform
Technical field
The present invention relates to heterogeneous platforms composed of a CPU and a GPU. Because the architectures of the CPU and the GPU differ greatly, the same data is stored in a different format on the CPU side and on the GPU side in order to fully exploit the data-processing capability of both processors. More specifically, the invention concerns a way of hiding the conversion overhead incurred when the CPU converts GPU-side data into the required format, on a platform where the same data is stored in heterogeneous formats on the CPU side and the GPU side. The invention belongs to the field of computer architecture.
Technical background
The compute nodes of many current high-performance computers consist of two processors with different architectures, a multi-core CPU and a GPU. Because the two architectures differ considerably, a parallel algorithm operating on data stored in a single data structure achieves very different computing throughput on the two processors. To ensure that both processors achieve high throughput, the data can be stored in a different data structure for each processor, but this forces the CPU to perform format conversion, and the resulting overhead lowers the total throughput of the algorithm on the heterogeneous node. The way to solve this problem is to hide the overhead of the CPU-side format conversion, so as to improve the total computing throughput of the algorithm on the heterogeneous platform.
Sparse matrix operations are core kernels of many high-performance applications; for example, the products of a sparse matrix, and of its transpose, with a dense vector are widely used in science and engineering, including singular value decomposition and graph computation.
As the scale of sparse matrix computations grows, most sparse matrix operations are carried out in parallel on high-performance computers. In this setting, explicitly transposing a sparse matrix introduces considerable overhead, and for an algorithm that multiplies both a sparse matrix and its transpose by a dense vector, storing both the matrix and its transpose would waste a large amount of memory, since both are involved at the same time. Aydin Buluc et al. showed that the compressed sparse blocks storage format allows both the sparse matrix-dense vector product and the transpose-free sparse matrix-transpose-dense vector product to be implemented in parallel on multi-core CPU platforms with high, consistent computing throughput. Tao Yuan et al. used an extended compressed sparse block storage format to obtain high, consistent throughput for the same two products, again without explicit transposition, on GPU platforms. However, because the two storage formats of the same data differ, using both formats at once on a heterogeneous platform composed of a CPU and a GPU introduces the extra overhead of format conversion. Studying how to hide this conversion overhead is therefore of great significance for ensuring that the products of a sparse matrix and of its transpose with a dense vector achieve high computing throughput on a heterogeneous CPU and GPU platform.
Summary of the invention
The object of the present invention is to provide a heterogeneous format storage method for a heterogeneous platform composed of a CPU and a GPU. By using heterogeneous storage formats on the CPU and GPU platform and hiding the overhead of the data format conversion performed by the CPU, the method improves the total computing throughput of the algorithm on the heterogeneous platform composed of the CPU and the GPU.
The technical scheme of the present invention is as follows:
On a heterogeneous platform composed of a CPU and a GPU, the two processors with different architectures store the data in heterogeneous formats. Instead of transmitting all the data from the CPU to the GPU in a single transfer, the data is split into segments and transmitted in several transfers. While one segment is transmitted and computed on the GPU, the CPU performs data format conversion on the next segment, so that uploading data to the GPU and computing on the GPU run in parallel with the CPU-side format conversion, thereby hiding the overhead of the format conversion performed by the CPU. The specific implementation comprises the following steps:
Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU;
Step 2: compute the number of segments for segmented data transmission;
Step 3: perform data format conversion on the first data segment;
Step 4: upload the converted data to the GPU; the GPU performs the computation while the CPU performs format conversion on the next data segment;
Step 5: determine whether the data being computed on the GPU is the last data segment; if so, the algorithm is complete; otherwise return to Step 4.
In Step 2, the data to be computed on the GPU is divided into segments according to the actual data scale measured in Step 1 and the actual communication bandwidth between the CPU and the GPU, which gives the technique platform portability and generality with respect to the data.
Converting the format of the first data segment in Step 3, before the loop formed by Steps 4 and 5, ensures that uploading converted data to the GPU and computing on the GPU can run in parallel with the CPU-side format conversion of the next data segment.
In Step 4, the three operations of uploading the converted data to the GPU, computing on the GPU, and converting the format of the next segment on the CPU act on a data segment whose size is determined dynamically during program execution; because the three operations are executed by different devices, their concurrency is achieved automatically. "The GPU performs the computation" means that the GPU executes the operation prescribed by the algorithm. A minimal code sketch of this overlapped structure is given below.
In Step 5, whether the current segment is the last data segment, and hence whether the loop continues, is determined from a variable maintained during program execution.
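To make the overlapped structure of Steps 3 to 5 concrete, the following is a minimal sketch in CUDA C++ of how the segmented upload, the GPU computation, and the CPU-side format conversion of the next segment can be interleaved. The double-buffered pinned host memory, the single CUDA stream, and the names convert_segment, process_segment_kernel, and run_pipeline are illustrative assumptions and are not prescribed by the patent.

#include <cuda_runtime.h>
#include <cstddef>

// Placeholder for the real CPU-side format conversion of one segment
// (assumption: a plain copy standing in for the actual conversion).
void convert_segment(const float* src, float* dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}

// Placeholder GPU kernel standing in for the computation prescribed by the algorithm.
__global__ void process_segment_kernel(const float* seg, float* out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * seg[i];
}

// Segmented pipeline: convert segment 0, then in each loop iteration upload and
// compute the current segment on the GPU while the CPU converts the next one.
void run_pipeline(const float* cpu_data, float* d_out, size_t total, size_t num_segments) {
    size_t seg_size = (total + num_segments - 1) / num_segments;

    float* h_buf[2];  // pinned host buffers for double buffering
    cudaHostAlloc((void**)&h_buf[0], seg_size * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_buf[1], seg_size * sizeof(float), cudaHostAllocDefault);

    float* d_buf;
    cudaMalloc((void**)&d_buf, seg_size * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Step 3: convert the first segment before entering the loop.
    size_t n0 = (total < seg_size) ? total : seg_size;
    convert_segment(cpu_data, h_buf[0], n0);

    for (size_t i = 0; i < num_segments; ++i) {  // Steps 4 and 5
        size_t offset = i * seg_size;
        size_t n = (offset + seg_size <= total) ? seg_size : total - offset;
        int cur = (int)(i & 1);

        // Upload the converted segment and launch the kernel asynchronously.
        cudaMemcpyAsync(d_buf, h_buf[cur], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process_segment_kernel<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(
            d_buf, d_out + offset, n);

        // While the GPU works, the CPU converts the next segment.
        if (i + 1 < num_segments) {
            size_t next_off = (i + 1) * seg_size;
            size_t next_n = (next_off + seg_size <= total) ? seg_size : total - next_off;
            convert_segment(cpu_data + next_off, h_buf[1 - cur], next_n);
        }

        cudaStreamSynchronize(stream);  // finish the current segment before reusing buffers
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf[0]);
    cudaFreeHost(h_buf[1]);
}

In this sketch the host-to-device copy and the kernel for the current segment are issued asynchronously on the stream, so the CPU thread is free to convert the next segment before waiting on the stream; this is exactly the overlap described in Step 4.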
The present invention has the following advantages:
On a heterogeneous platform composed of a CPU and a GPU, where the same data is stored in heterogeneous formats on the CPU side and the GPU side, the invention hides the overhead of the format conversion performed by the CPU. Compared with the prior art, its main advantages are:
(1) Storing the data in a different format on the CPU side and on the GPU side allows the algorithm to achieve high computing throughput on both types of processor;
(2) While the format-converted data of one segment is uploaded to the GPU and the GPU computes, the CPU performs the data format conversion of the next segment in parallel, thereby hiding the format conversion overhead.
Brief description of the drawings
Fig. 1 is a flow chart of using heterogeneous formats on the heterogeneous platform composed of a CPU and a GPU according to the present invention, so that the overhead of the format conversion performed by the CPU is hidden.
Fig. 2 is a flow diagram of the algorithm that multiplies a sparse matrix and its transpose by a dense vector using heterogeneous formats on the heterogeneous platform composed of a CPU and a GPU according to the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific examples.
As shown in Fig. 1, the present invention provides an implementation method for hiding the overhead of the CPU-side format conversion when the different types of processor on a heterogeneous platform composed of a CPU and a GPU store data in heterogeneous formats. The CPU-side format conversion is implemented with multi-threaded parallelism. The specific method comprises the following steps:
Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU;
The main function of Step 1 is to determine the scale of the data to be computed on the GPU and to measure the communication bandwidth between the CPU and the GPU. To hide the overhead of the CPU-side data format conversion, the data to be computed on the GPU is divided into segments by size, and uploading each converted segment to the GPU and computing on the GPU run in parallel with the CPU-side format conversion of the next segment. This step determines the scale of the data the GPU will compute and the bandwidth at which the CPU uploads data to the GPU.
Step 2: compute the number of segments for the GPU computation;
The main function of Step 2 is to compute the number of data segments from the measurements of Step 1. If the scale of the data to be accelerated on the GPU is below a certain threshold, the data is transmitted and processed in a single transfer; that is, the scheme of converting multiple segments on the CPU in parallel with multi-segment uploads to the GPU and GPU computation is not used. A sketch of one possible segment-count heuristic is given after this walkthrough.
Step 3: perform data format conversion on the first segment;
The main function of Step 3 is to perform format conversion on the first segment of the data to be computed on the GPU. No matter how many segments the data to be accelerated on the GPU is converted in, the latency of converting the first segment cannot be hidden behind the upload of the other segments to the GPU;
Step 4: upload the converted data to the GPU; the GPU performs the computation while the CPU performs format conversion on the next data segment;
The main function of Step 4 is to upload the data whose format conversion has completed to the GPU, have the GPU perform the computation, and have the CPU perform format conversion on the next data segment. Among these three operations, uploading the data to the GPU and the GPU computation are serial with respect to each other, since both act on the current data segment, but both run in parallel with the CPU-side format conversion of the next segment;
Step 5: determine whether the data currently computed on the GPU is the last data segment; if so, the algorithm is complete; otherwise return to Step 4;
The main function of Step 5 is to determine whether the computation finishes with the current segment;
The main idea of the present invention is that, on a heterogeneous platform composed of a CPU and a GPU whose processors of different architectures store data in heterogeneous formats, the entire data set is divided into segments, and uploading data to the GPU, computing on the GPU, and the CPU-side format conversion are carried out in parallel, thereby hiding the CPU-side data format conversion. This provides technical support for achieving higher computing throughput on a heterogeneous platform composed of a CPU and a GPU.
First, the scale of the data to be computed on the GPU and the communication bandwidth between the CPU and the GPU are determined, and from them the number of segments transmitted between the CPU and the GPU is derived. Then, according to the determined size of each segment, format conversion is performed on the first data segment. Next, the converted data is uploaded to the GPU and the GPU computes while the CPU performs format conversion on the next segment. Finally, it is checked whether the computation is complete.
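Step 2 of this walkthrough derives the number of segments from the measured data scale and the CPU and GPU communication bandwidth, and falls back to a single transfer when the data is below a threshold. The patent does not specify a formula, so the heuristic below, including the 4 MiB threshold, the 2 ms target transfer time per segment, and the name choose_num_segments, is purely an assumed illustration.

#include <algorithm>
#include <cstddef>

// Assumed heuristic: below a small-data threshold, transfer everything at once;
// otherwise size each segment so that its transfer takes roughly a fixed target time,
// giving the CPU-side conversion of the next segment time to hide behind it.
size_t choose_num_segments(size_t bytes, double bandwidth_bytes_per_s) {
    const size_t small_threshold = 4u << 20;          // 4 MiB: below this, one transfer
    const double target_seconds_per_segment = 2e-3;   // aim for about 2 ms per transfer

    if (bytes <= small_threshold) return 1;

    size_t seg_bytes = (size_t)(bandwidth_bytes_per_s * target_seconds_per_segment);
    seg_bytes = std::max<size_t>(seg_bytes, 1);
    return (bytes + seg_bytes - 1) / seg_bytes;       // ceiling division
}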
The following uses the algorithm that multiplies a sparse matrix and its transpose by a dense vector as an example, as shown in Fig. 2, comprising the following steps:
Step 1: determine the number of rows, the number of columns, and the number of nonzero elements of the sparse matrix, and measure the communication bandwidth between the CPU and the GPU;
Step 2: compute the number of segments from the measurements of Step 1;
Step 3: according to the preceding results, perform format conversion on the first segment of the sparse matrix;
Step 4: upload the converted nonzero elements of the sparse matrix to the GPU; the GPU multiplies the sparse matrix, or its transpose, by the vector, while the CPU performs format conversion on the nonzero elements of the next segment of the sparse matrix;
Step 5: determine whether the current segment is the last one; if so, the algorithm is complete; otherwise return to Step 4.
This example mainly studies the implementation of heterogeneous formats on a heterogeneous platform composed of a CPU and a GPU. The method depends only on whether the platform is a heterogeneous platform composed of a CPU and a GPU and is independent of the specific CPU and GPU models. A sketch of one possible per-segment format conversion for this example is given below.
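For the sparse matrix example of Fig. 2, the CPU-side conversion of each segment rewrites a block of rows of the matrix from the CPU storage format into the GPU storage format. The patent does not fix the two concrete formats, so the sketch below, which repacks a row range of a CSR matrix into a padded ELL-style block as one plausible GPU-friendly layout, is an assumption; the names CsrMatrix, EllSegment, and convert_rows_to_ell are likewise illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed CPU-side format for this example: standard compressed sparse row (CSR).
struct CsrMatrix {
    std::vector<size_t> row_ptr;  // size rows + 1
    std::vector<int>    col_idx;  // size nnz
    std::vector<float>  values;   // size nnz
};

// Assumed GPU-side format for one segment: an ELL-style block of rows,
// padded to the longest row so the GPU can index it with a regular stride.
struct EllSegment {
    size_t rows = 0;
    size_t width = 0;             // maximum nonzeros per row in this segment
    std::vector<int>   col_idx;   // rows * width entries, padded with -1
    std::vector<float> values;    // rows * width entries, padded with 0
};

// Convert rows [row_begin, row_end) of a CSR matrix into one ELL segment.
EllSegment convert_rows_to_ell(const CsrMatrix& a, size_t row_begin, size_t row_end) {
    EllSegment seg;
    seg.rows = row_end - row_begin;
    for (size_t r = row_begin; r < row_end; ++r)
        seg.width = std::max(seg.width, a.row_ptr[r + 1] - a.row_ptr[r]);

    seg.col_idx.assign(seg.rows * seg.width, -1);
    seg.values.assign(seg.rows * seg.width, 0.0f);

    for (size_t r = row_begin; r < row_end; ++r) {
        size_t local = r - row_begin;
        size_t k = 0;
        for (size_t j = a.row_ptr[r]; j < a.row_ptr[r + 1]; ++j, ++k) {
            seg.col_idx[local * seg.width + k] = a.col_idx[j];
            seg.values[local * seg.width + k] = a.values[j];
        }
    }
    return seg;
}

Each such segment is then what Step 4 of the example uploads to the GPU while the CPU converts the next block of rows.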
Finally, it should be noted that the above examples are intended only to illustrate, and not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the invention may still be modified or equivalently substituted without departing from its spirit or scope, and any such modification or equivalent substitution shall be included within the scope of the claims of the present invention.

Claims (1)

1. A heterogeneous format storage method based on a heterogeneous platform composed of a CPU and a GPU, the method being, in the case where the format in which the same data is computed on the CPU differs from the format in which it is computed on the GPU, a method of hiding the overhead incurred when the CPU converts the data into the format needed on the GPU, characterized in that the method comprises the following steps:
Step 1: determine the scale of the data to be computed on the GPU and measure the communication bandwidth between the CPU and the GPU;
Step 2: compute the number of segments for segmented transmission from the measurements of Step 1;
Step 3: perform format conversion on the first data segment;
Step 4: upload the converted data to the GPU; the GPU performs the computation while the CPU performs format conversion on the next data segment;
Step 5: determine whether the current segment is the last data segment; if so, the algorithm terminates; otherwise return to Step 4;
wherein, in Step 2, the data to be computed on the GPU is divided into segments according to the actual data scale measured in Step 1 and the actual communication bandwidth between the CPU and the GPU, which gives the technique platform portability and generality with respect to the data;
converting the format of the first data segment in Step 3, before the loop formed by Steps 4 and 5, ensures that uploading the converted data to the GPU and computing on the GPU can run in parallel with the CPU-side format conversion of the next data segment;
in Step 4, the three operations of uploading the converted data to the GPU, computing on the GPU, and converting the format of the next segment on the CPU act on a data segment whose size is determined dynamically during program execution; because the three operations are executed by different devices, their concurrency is achieved automatically, and the GPU computation refers to the GPU executing the operation prescribed by the accelerated algorithm;
in Step 5, whether the current segment is the last data segment, and hence whether the loop continues, is determined from a variable maintained during program execution.
CN201510320501.2A 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform Expired - Fee Related CN106293638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510320501.2A CN106293638B (en) 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510320501.2A CN106293638B (en) 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Publications (2)

Publication Number Publication Date
CN106293638A CN106293638A (en) 2017-01-04
CN106293638B 2019-08-06

Family

ID=57659392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510320501.2A Expired - Fee Related CN106293638B (en) 2015-06-11 2015-06-11 Heterogeneous format storage method based on a CPU and GPU heterogeneous platform

Country Status (1)

Country Link
CN (1) CN106293638B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103033783A (en) * 2012-12-10 2013-04-10 深圳先进技术研究院 Magnetic resonance imaging fast reconstruction system and method thereof
CN103841389A (en) * 2014-04-02 2014-06-04 北京奇艺世纪科技有限公司 Video playing method and player

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101393610B (en) * 2008-11-11 2010-08-25 北大方正集团有限公司 Transmission method and apparatus for dot matrix data
US9317482B2 (en) * 2012-10-14 2016-04-19 Microsoft Technology Licensing, Llc Universal FPGA/ASIC matrix-vector multiplication architecture
CN103108186A (en) * 2013-02-21 2013-05-15 中国对外翻译出版有限公司 Method of achieving high-definition transmission of videos

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103033783A (en) * 2012-12-10 2013-04-10 深圳先进技术研究院 Magnetic resonance imaging fast reconstruction system and method thereof
CN103841389A (en) * 2014-04-02 2014-06-04 北京奇艺世纪科技有限公司 Video playing method and player

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Peng Liu et al., "Parallel Processing of Massive Remote Sensing Images in a GPU Architecture," Computing and Informatics, vol. 33, no. 1, 2014, pp. 197-217.
Yuan Tao et al., "Atomic reduction based sparse matrix-transpose vector multiplication on GPUs," 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2015, pp. 987-992.

Also Published As

Publication number Publication date
CN106293638A (en) 2017-01-04

Similar Documents

Publication Publication Date Title
CN109086678B (en) Pedestrian detection method for extracting image multilevel features based on deep supervised learning
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
WO2017124644A1 (en) Artificial neural network compression encoding device and method
EP4016398A1 (en) Apparatus and method for distributed training model, and computer program product
CN104731729B (en) A kind of table connection optimization method, CPU and accelerator based on heterogeneous system
WO2022001550A1 (en) Address generation method, related device and storage medium
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20220350607A1 (en) Method of executing operation, electronic device, and computer-readable storage medium
CN102902657A (en) Method for accelerating FFT (Fast Fourier Transform) by using GPU (Graphic Processing Unit)
CN105022631A (en) Scientific calculation-orientated floating-point data parallel lossless compression method
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
Ji et al. Parallelizing word2vec in multi-core and many-core architectures
CN110086602A (en) The Fast implementation of SM3 cryptographic Hash algorithms based on GPU
CN104850516B (en) A kind of DDR Frequency Conversion Designs method and apparatus
CN103577161A (en) Big data frequency parallel-processing method
CN105183562A (en) Method for conducting degree drawing on grid data on basis of CUDA technology
CN106293638B (en) Heterogeneous formats storage method based on CPU Yu GPU heterogeneous platform
WO2024016946A1 (en) Cost estimation method, electronic device, storage medium and computer program product
CN103593304A (en) Quantization method for efficiently using caches on basis of parallel device model
CN104636315A (en) GPDSP-oriented matrix LU decomposition vectorization calculation method
CN115983343A (en) YOLOv4 convolutional neural network lightweight method based on FPGA
CN103678255A (en) FFT efficient parallel achieving optimizing method based on Loongson number three processor
CN103746771A (en) Data format conversion method of channel coding and decoding based on GPP and SIMD technologies
US20230244974A1 (en) Quantum state processing method, computing device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190806

Termination date: 20210611