CN110135569A - Heterogeneous platform neuron positioning three-level flow parallel method, system and medium - Google Patents


Info

Publication number
CN110135569A
CN110135569A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910289495.7A
Other languages
Chinese (zh)
Other versions
CN110135569B (en)
Inventor
邹丹
朱小谦
朱敏
王文珂
李金才
汪祥
陆丽娜
甘新标
孟祥飞
夏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910289495.7A priority Critical patent/CN110135569B/en
Publication of CN110135569A publication Critical patent/CN110135569A/en
Application granted granted Critical
Publication of CN110135569B publication Critical patent/CN110135569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 — Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3877 — Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/061 — Physical realisation using biological neurons, e.g. biological neurons connected to an integrated circuit


Abstract

The invention discloses a heterogeneous-platform neuron positioning three-stage pipeline parallel method, system and medium. Slice image data are partitioned into blocks according to the image size and the computation granularity; storage space is allocated at the CPU end and the GPU end based on the block parameters; variables and storage space are initialized; the CPU performs task scheduling, and the CPU and the GPU simultaneously execute computing tasks in a three-stage pipeline mode. Each computing task comprises three steps (data read-in, positioning computation, and data write-back), and each intermediate computing task, while executing its positioning computation, also executes the data write-back of the previous computing task and the data read-in of the next computing task. The method improves the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation with parameter configuration support, and easy porting and popularization.

Description

A heterogeneous-platform neuron positioning three-stage pipeline parallel method, system and medium
Technical field
The present invention relates to analytic methods for the fine structure of neural circuits, and in particular to a heterogeneous-platform neuron positioning three-stage pipeline parallel method, system and medium, for realizing parallel neuron positioning computation on CPU-GPU heterogeneous computing platforms.
Background art
Neural circuit information is the key to understanding brain function and the mechanisms of brain disease, and automatic tracking of neural circuit big data is one of the key scientific problems faced by brain science and related areas of neuroscience. Neuron positioning is the key to parsing neural circuit data: accurate soma positions are obtained by analyzing neural circuit image data, which forms the basis of subsequent quantitative analysis.
Typical large-scale neuron positioning methods are based on the biological fact that "each cell has one and only one soma": a biophysical model is established by integrating mathematical methods (such as the L1-norm minimization idea), and large-scale neuron positioning is carried out by solving this model. Such methods are robust to the variety of cell types, shapes, sizes and distribution densities encountered at large scale, and are therefore the mainstream approach to large-scale neuron positioning in current high-precision neural circuit image data sets. However, the image size such a method can handle is limited by the memory capacity of a single compute node, and its processing speed is limited by the computing performance of a single compute node.
With the constant progress of observation technology, the data scale of high-precision neural circuit image data sets has grown rapidly; in particular, huge advances in optical labeling molecules and micro-imaging technology have made acquiring whole-brain data at high resolution a reality. Because primate brains are relatively large, by current MOST imaging technology, imaging a 10-cubic-centimetre region at 1-micrometre resolution generates hundreds of TB of data. With existing neuron positioning methods, processing 1 GB of data takes about 1 hour; taking 1 TB of data as an example, 1000 hours, that is, more than 40 days, would be needed. How to efficiently position neurons in TB-scale mass data from dense neural populations remains a huge challenge for image processing, and has become the bottleneck that seriously restricts converting the acquired data into knowledge.
The graphics processing unit (Graphic Processing Unit, GPU) adopts a brand-new design architecture completely different from that of traditional general-purpose multi-core processors. The GPU is designed specifically for large-scale data-parallel computation; typical applications of this computing mode include graphics and video processing, large-scale matrix computation, and scenarios such as numerical simulation. Unlike a general-purpose multi-core processor, the GPU makes extensive use of the SIMD (Single Instruction Multiple Data) structure to realize parallel data access and instruction execution on the same processor. As GPU programmability has continued to improve, and especially with the appearance of programming environments such as CUDA and a series of enhanced debugging tools, the complexity of general-purpose GPU programming has been greatly reduced, comprehensively opening a new era of GPU general-purpose computing. The general-purpose computing graphics processor (GPGPU) has developed into a highly parallel, multithreaded many-core processor with powerful computing capability and high memory bandwidth.
Compared with homogeneous parallel architectures, the heterogeneous parallel architecture composed of a general-purpose processor (CPU) and a coprocessor (GPU) is a structure better suited to large-scale compute-intensive tasks. A heterogeneous parallel architecture can effectively adapt to the complexity of program behavior across many application fields, is highly efficient in practical applications, conforms to the trend of rapid growth in VLSI chip capacity, and can satisfy the development requirement of increasingly diverse application features. A heterogeneous architecture contains processors of different structures, namely the transaction-oriented general-purpose CPU and the computation-oriented special-purpose GPU, and handling different tasks with different types of processors is precisely the advantage of a heterogeneous architecture.
However, because the CPU-GPU heterogeneous computing model differs from the traditional homogeneous CPU computing model, existing CPU-based programs cannot run directly on the GPU. Moreover, because the GPU cannot directly access the memory space of the CPU, in order to exploit the computing capability of the GPU, the input data must be transferred from CPU-side memory to GPU-side video memory before computation starts, and the results must be transferred from GPU-side video memory back to CPU-side memory after computation finishes, and so on, until all computing tasks are completed. Frequent data transfers between the CPU and GPU occupy a large amount of program running time and greatly affect program efficiency. How to improve the computational efficiency of the CPU and GPU while reducing the data transfer overhead between them is the difficulty in developing a neuron positioning algorithm for the CPU-GPU heterogeneous architecture. To date there has been no public report of a technical scheme that performs neuron positioning using CPU-GPU.
Summary of the invention
The technical problem to be solved by the present invention: in view of the above problems in the prior art, a heterogeneous-platform neuron positioning three-stage pipeline parallel method, system and medium are provided. The present invention can improve the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation with parameter configuration support, and easy porting and popularization.
In order to solve the above technical problem, the technical solution adopted by the present invention is as follows:
A heterogeneous-platform neuron positioning three-stage pipeline parallel method, whose implementation steps include:
1) partitioning the slice image data into blocks according to the image size and the computation granularity;
2) allocating storage space at the CPU end and the GPU end respectively based on the block parameters;
3) initializing variables and storage space;
4) performing task scheduling by the CPU, and executing computing tasks with the CPU and the GPU simultaneously in a three-stage pipeline mode, where each computing task comprises three steps (data read-in, positioning computation, and data write-back), and each intermediate round of computing, while executing its positioning computation, also executes the data write-back of the previous round and the data read-in of the next round, so that data read-in, positioning computation and data write-back proceed in parallel.
Preferably, the detailed steps of step 1) include:
1.1) computing the maximum data block size gSizeMax that can be supported for computation on the GPU, gSizeMax being a positive integer;
1.2) determining the block size, block count and total block count in the x, y and z directions respectively: if xDim < gSizeMax, the x-direction block size xScale is set to xDim, otherwise xScale is set to gSizeMax; the x-direction block count xNum is set to ⌈xDim/xScale⌉. If yDim < gSizeMax, the y-direction block size yScale is set to yDim, otherwise yScale is set to gSizeMax; the y-direction block count yNum is set to ⌈yDim/yScale⌉. If zDim < gSizeMax, the z-direction block size zScale is set to zDim, otherwise zScale is set to gSizeMax; the z-direction block count zNum is set to ⌈zDim/zScale⌉. Here xDim, yDim and zDim are preset parameters. The total block count bNum is set to xNum*yNum*zNum, and the blocks are numbered consecutively from 1.
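A minimal sketch of the block-partitioning computation of step 1.2), in Python. The variable names follow the patent; the ceiling division ⌈dim/scale⌉ used for the block counts is an assumption, reconstructed to be consistent with the embodiment's numbers.

```python
import math

def partition(xDim, yDim, zDim, gSizeMax):
    """Compute per-axis block sizes, per-axis block counts, and the
    total block count bNum, as described in step 1.2)."""
    xScale = xDim if xDim < gSizeMax else gSizeMax
    yScale = yDim if yDim < gSizeMax else gSizeMax
    zScale = zDim if zDim < gSizeMax else gSizeMax
    xNum = math.ceil(xDim / xScale)
    yNum = math.ceil(yDim / yScale)
    zNum = math.ceil(zDim / zScale)
    bNum = xNum * yNum * zNum  # blocks are numbered consecutively from 1
    return xScale, yScale, zScale, xNum, yNum, zNum, bNum

# Example with the embodiment's image dimensions and gSizeMax = 154
print(partition(40000, 40000, 10000, 154))
```

With the embodiment's parameters this yields 260 × 260 × 65 = 4394000 blocks, matching the figures given later in the specific embodiments.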
Preferably, when storage space is allocated at the CPU end and the GPU end respectively based on the block parameters in step 2), for the GPU-end allocation three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end, the video memory capacity allocated to each pointer being gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the previously processed image block; for the CPU-end allocation two pointer variables cReadBuf and cWriteBuf are declared on the CPU, the memory capacity allocated to each pointer being gSizeMax³, where cReadBuf is used for buffering data between gReadPtr and the disk, and cWriteBuf is used for buffering data between gWritePtr and the disk; furthermore the block size computed on the CPU is set to gSizeMax, and three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being gSizeMax³, where gSizeMax is the maximum data block size that can be supported for computation on the GPU.
Preferably, the detailed steps of variable and storage space initialization in step 3) include: applying for a mutex loop variable idx for the CPU and initializing idx to 2; reading block number 1 from the disk into the memory space pointed to by pointer variable cProcPtr; reading block number 2 from the disk into the memory space pointed to by pointer variable cReadBuf; and then transferring block number 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
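The initialization of step 3) can be simulated in a few lines of Python. Dictionaries stand in for the disk, CPU memory and GPU video memory; all names besides the patent's pointer names are illustrative assumptions.

```python
# Simulated disk of numbered blocks; block i holds payload "block-i"
disk = {i: f"block-{i}" for i in range(1, 6)}

# CPU-side buffers and the GPU-side processing buffer, as single-slot dicts
cProcPtr, cReadBuf = {}, {}
gProcPtr = {}

# Step 3): idx starts at 2; block 1 goes to cProcPtr, block 2 is staged
# in cReadBuf and then transferred to the GPU-side gProcPtr buffer
idx = 2
cProcPtr["data"] = disk[1]
cReadBuf["data"] = disk[2]
gProcPtr["data"] = cReadBuf["data"]  # simulated host-to-device transfer

print(idx, cProcPtr["data"], gProcPtr["data"])
```

After this priming step both the CPU and the GPU hold a block ready to compute, so the first pipeline round can overlap computation with the read-in of block number idx + 1.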
Preferably, the detailed steps in step 4) include:
4.1) starting processes Nos. 0-2, responsible for organizing the computing tasks and data transfers on the GPU, and processes Nos. 3-5, responsible for organizing the computing tasks and data transfers on the CPU;
4.2) executing computing tasks by processes Nos. 0-2 calling the GPU in a three-stage pipeline mode, while simultaneously executing computing tasks by processes Nos. 3-5 calling the CPU in a three-stage pipeline mode, reading the data block required by the next group of neuron positioning while performing neuron positioning on each image block, and at the same time writing the data block of the previous group of neuron positioning back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel;
4.3) synchronizing processes Nos. 0, 1, 2, 3, 4 and 5; the computation ends.
Preferably, the detailed steps of executing computing tasks by processes Nos. 0-2 calling the GPU in a three-stage pipeline mode in step 4.2) include:
4.2.1A) by process No. 0, starting ncGPU threads on the GPU according to the number of compute cores ncGPU available on the GPU, all GPU threads performing neuron positioning computation in parallel on the data block pointed to by pointer variable gProcPtr; by process No. 1, incrementing the mutex loop variable idx by 1 and comparing idx with the total block count bNum; if idx is less than or equal to bNum, reading block number idx from the disk into the memory space pointed to by pointer variable cReadBuf, then transferring it from the memory space pointed to by the CPU-end pointer variable cReadBuf to the video memory space pointed to by the GPU-end pointer variable gReadPtr; by process No. 2, checking the video memory space pointed to by pointer variable gWritePtr; if it holds a data block, transferring that data block from the video memory space pointed to by gWritePtr to the memory space pointed to by pointer variable cWriteBuf, then writing it from the memory space pointed to by cWriteBuf to the disk, and clearing the video memory space pointed to by gWritePtr;
4.2.2A) synchronizing processes Nos. 0, 1 and 2; after synchronization, the computation of the current GPU data block is complete; by process No. 0, performing the GPU video memory pointer exchange, the concrete operation being: declaring a temporary pointer variable gtPtr, assigning gProcPtr to gtPtr, assigning gReadPtr to gProcPtr, assigning gWritePtr to gReadPtr, and assigning gtPtr to gWritePtr; process No. 0 then checks the video memory space pointed to by gProcPtr; if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed;
4.2.3A) by process No. 0, transferring the data block in the video memory space pointed to by pointer variable gWritePtr to the memory space pointed to by pointer variable cWriteBuf, then writing the data block from the memory space pointed to by cWriteBuf back to the disk; and reclaiming the video memory spaces pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr, and the memory spaces pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf.
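The video memory pointer exchange of step 4.2.2A) is a three-way rotation of buffer roles: the just-read block becomes the block to process, the just-processed block becomes the block to write back, and the freed write buffer is reused for the next read. A minimal Python analogue follows (lists stand in for device buffers; this is an illustrative sketch, not the patented CUDA implementation):

```python
def rotate(gReadPtr, gProcPtr, gWritePtr):
    """One pipeline step of the pointer exchange in step 4.2.2A):
    no buffer contents are copied, only the three roles rotate."""
    gtPtr = gProcPtr        # gtPtr     <- gProcPtr
    gProcPtr = gReadPtr     # gProcPtr  <- gReadPtr
    gReadPtr = gWritePtr    # gReadPtr  <- gWritePtr
    gWritePtr = gtPtr       # gWritePtr <- gtPtr
    return gReadPtr, gProcPtr, gWritePtr

a, b, c = ["read"], ["proc"], ["write"]
r, p, w = rotate(a, b, c)
# The same three buffers are reused; only the roles rotate
print(r is c, p is a, w is b)
```

Because only pointers are exchanged, no gSizeMax³-sized copy ever occurs, which is exactly the saving the patent attributes to this step.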
Preferably, the detailed steps of executing computing tasks by processes Nos. 3-5 calling the CPU in a three-stage pipeline mode in step 4.2) include:
4.2.1B) by process No. 3, starting ncCPU threads on the CPU according to the number of compute cores ncCPU available on the CPU, all CPU threads performing neuron positioning computation in parallel on the data block pointed to by pointer variable cProcPtr; by process No. 4, incrementing the mutex loop variable idx by 1 and comparing idx with the total block count bNum; if idx is less than or equal to bNum, reading block number idx from the disk into the memory space pointed to by pointer variable cReadPtr; by process No. 5, checking the memory space pointed to by pointer variable cWritePtr; if it holds a data block, writing that data block to the disk and emptying the memory space pointed to by cWritePtr;
4.2.2B) synchronizing processes Nos. 3, 4 and 5; after synchronization, the computation of the current CPU data block is complete; by process No. 3, performing the CPU memory pointer exchange, the concrete operation being: declaring a temporary pointer variable ctPtr, assigning cProcPtr to ctPtr, assigning cReadPtr to cProcPtr, assigning cWritePtr to cReadPtr, and assigning ctPtr to cWritePtr; process No. 3 then checks the memory space pointed to by cProcPtr; if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed;
4.2.3B) by process No. 3, writing the data block in the memory space pointed to by pointer variable cWritePtr back to the disk; and reclaiming the memory spaces pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr.
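The schedule that the A and B pipelines above implement can be illustrated abstractly: in round t, block t+1 is being read in, block t is being computed, and block t-1 is being written back, so the middle rounds keep all three stages busy. A Python sketch of that schedule (the function name and tuple representation are illustrative assumptions):

```python
def pipeline_schedule(bNum):
    """Return, per round, the (read-in, compute, write-back) block numbers
    for a bNum-block job, with None where a stage is idle (pipeline
    fill at the start, drain at the end)."""
    rounds = []
    for t in range(bNum + 2):
        reading = t + 1 if t + 1 <= bNum else None
        computing = t if 1 <= t <= bNum else None
        writing = t - 1 if 1 <= t - 1 <= bNum else None
        rounds.append((reading, computing, writing))
    return rounds

for r in pipeline_schedule(4):
    print(r)
```

For bNum = 4 this prints six rounds, of which the middle two, (3, 2, 1) and (4, 3, 2), overlap all three stages; only the first and last rounds pay the unoverlapped fill and drain cost, which is why total time approaches the computation time alone as bNum grows.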
The present invention also provides a heterogeneous-platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, the computer device being programmed to perform the steps of the aforementioned heterogeneous-platform neuron positioning three-stage pipeline parallel method of the present invention.
The present invention also provides a heterogeneous-platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, a storage medium of the computer device storing a computer program programmed to perform the aforementioned heterogeneous-platform neuron positioning three-stage pipeline parallel method of the present invention.
The present invention also provides a computer-readable storage medium storing a computer program programmed to perform the aforementioned heterogeneous-platform neuron positioning three-stage pipeline parallel method of the present invention.
Compared with the prior art, the present invention has the following advantages: the present invention partitions the slice image data into blocks according to the image size and the computation granularity; allocates storage space at the CPU end and the GPU end respectively based on the block parameters; initializes variables and storage space; and performs task scheduling by the CPU, executing computing tasks with the CPU and the GPU simultaneously in a three-stage pipeline mode, reading the data block required by the next group of neuron positioning while performing neuron positioning on each image block and at the same time writing the data block of the previous group of neuron positioning back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel. The present invention can improve the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation with parameter configuration support, and easy porting and popularization.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic flow of the method of the embodiment of the present invention.
Fig. 2 is a schematic diagram of the three-stage pipeline parallel principle in the method of the embodiment of the present invention.
Specific embodiments
Hereinafter, a server equipped with two-way 12-core 2.4 GHz CPUs and one NVIDIA GTX 1080Ti GPU is taken as the example of the heterogeneous platform, and the heterogeneous-platform neuron positioning three-stage pipeline parallel method, system and medium of the present invention are described in further detail. The hard disk capacity of the server is 24 TB, the memory capacity is 256 GB, and the GPU video memory space is 11 GB. The input data consists of an image sequence of 10000 single images, each with a resolution of 40000 × 40000.
As shown in Fig. 1, the steps of the heterogeneous-platform neuron positioning three-stage pipeline parallel method of this embodiment include:
1) partitioning the slice image data into blocks according to the image size and the computation granularity;
2) allocating storage space at the CPU end and the GPU end respectively based on the block parameters;
3) initializing variables and storage space;
4) performing task scheduling by the CPU, and executing computing tasks with the CPU and the GPU simultaneously in a three-stage pipeline mode, where each computing task comprises three steps (data read-in, positioning computation, and data write-back), and each intermediate round of computing, while executing its positioning computation, also executes the data write-back of the previous round and the data read-in of the next round, so that data read-in, positioning computation and data write-back proceed in parallel.
In this embodiment, the detailed steps of step 1) include:
1.1) computing the maximum data block size gSizeMax that can be supported for computation on the GPU, gSizeMax being a positive integer;
1.2) determining the block size, block count and total block count in the x, y and z directions respectively: if xDim < gSizeMax, the x-direction block size xScale is set to xDim, otherwise xScale is set to gSizeMax; the x-direction block count xNum is set to ⌈xDim/xScale⌉. If yDim < gSizeMax, the y-direction block size yScale is set to yDim, otherwise yScale is set to gSizeMax; the y-direction block count yNum is set to ⌈yDim/yScale⌉. If zDim < gSizeMax, the z-direction block size zScale is set to zDim, otherwise zScale is set to gSizeMax; the z-direction block count zNum is set to ⌈zDim/zScale⌉. Here xDim, yDim and zDim are preset parameters. The total block count bNum is set to xNum*yNum*zNum, and the blocks are numbered consecutively from 1. The main variables are defined as follows: cMem: CPU-end memory capacity; gMem: GPU-end video memory capacity; gNum: number of GPUs; xDim: number of pixels in the x direction of each layer; yDim: number of pixels in the y direction of each layer; zDim: number of layers.
In this embodiment, the maximum data block size gSizeMax that can be supported for computation on the GPU is computed from the GPU video memory capacity so that three data blocks of gSizeMax³ each fit in video memory at once, i.e. gSizeMax = ⌊(gMem/3)^(1/3)⌋; in this formula, gMem is the video memory capacity on the GPU.
In this embodiment, the x-direction block count xNum = ⌈40000/154⌉ = 260, the y-direction block count yNum = ⌈40000/154⌉ = 260, and the z-direction block count zNum = ⌈10000/154⌉ = 65; the total block count bNum = 260 × 260 × 65 = 4394000, the blocks being numbered consecutively from 1, and each data block has size 154³ B = 3.65 GB.
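The embodiment's block counts follow from ceiling division of the image dimensions by gSizeMax = 154, which can be checked directly in Python:

```python
import math

xDim = yDim = 40000  # pixels per layer in the x and y directions
zDim = 10000         # number of layers
gSizeMax = 154       # block edge length from the embodiment

xNum = math.ceil(xDim / gSizeMax)  # x-direction block count
yNum = math.ceil(yDim / gSizeMax)  # y-direction block count
zNum = math.ceil(zDim / gSizeMax)  # z-direction block count
bNum = xNum * yNum * zNum          # total block count

print(xNum, yNum, zNum, bNum)
```

This reproduces 260 × 260 × 65 = 4394000 blocks, confirming the figures stated above.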
In this embodiment, when storage space is allocated at the CPU end and the GPU end respectively based on the block parameters in step 2), for the GPU-end allocation three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end, the video memory capacity allocated to each pointer being gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the previously processed image block; for the CPU-end allocation two pointer variables cReadBuf and cWriteBuf are declared on the CPU, the memory capacity allocated to each pointer being gSizeMax³, where cReadBuf is used for buffering data between gReadPtr and the disk, and cWriteBuf is used for buffering data between gWritePtr and the disk; furthermore the block size computed on the CPU is set to gSizeMax, and three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being gSizeMax³, where gSizeMax is the maximum data block size that can be supported for computation on the GPU.
Specifically, in this embodiment three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end, the video memory capacity allocated to each pointer being 154³ B = 3.65 GB, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the previously processed image block. Two pointer variables cReadBuf and cWriteBuf are declared on the CPU, the memory capacity allocated to each pointer being 154³ B = 3.65 GB, where cReadBuf is used for buffering data between gReadPtr and the disk, and cWriteBuf is used for buffering data between gWritePtr and the disk. The block size computed on the CPU is set to 3.65 GB. Correspondingly, three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being 3.65 GB.
In this embodiment, the detailed steps of variable and storage space initialization in step 3) include: applying for a mutex loop variable idx for the CPU and initializing idx to 2; reading block number 1 from the disk into the memory space pointed to by pointer variable cProcPtr; reading block number 2 from the disk into the memory space pointed to by pointer variable cReadBuf; and then transferring block number 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
In this embodiment, the detailed steps in step 4) include:
4.1) starting processes Nos. 0-2, responsible for organizing the computing tasks and data transfers on the GPU, and processes Nos. 3-5, responsible for organizing the computing tasks and data transfers on the CPU;
4.2) executing computing tasks by processes Nos. 0-2 calling the GPU in a three-stage pipeline mode, while simultaneously executing computing tasks by processes Nos. 3-5 calling the CPU in a three-stage pipeline mode, reading the data block required by the next group of neuron positioning while performing neuron positioning on each image block, and at the same time writing the data block of the previous group of neuron positioning back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel. In this embodiment, step 4.2) has processes Nos. 0-2 call the GPU and processes Nos. 3-5 call the CPU, each in a three-stage pipeline mode, i.e. the CPU and the GPU perform neuron positioning simultaneously, which improves computational efficiency and reduces computation time;
4.3) synchronizing processes Nos. 0, 1, 2, 3, 4 and 5; the computation ends.
In the present embodiment, meter is executed in such a way that 0~No. 2 process calls GPU to use three class pipeline in step 4.2) The detailed step of calculation task includes:
4.2.1A ncGPU line) can be started with core number ncGPU is calculated according on GPU on GPU by No. 0 process Journey, all GPU thread parallels carry out Neurons location calculating to the data block that pointer variable gProcPtr is directed toward;The present embodiment In, No. 0 process can start 3584 threads with core number 3584 is calculated according on GPU on GPU;By No. 1 process by mutual exclusion Cyclic variable idx adds 1, compares mutual exclusion cyclic variable idx and piecemeal total quantity bNum, if mutual exclusion cyclic variable idx be less than etc. In piecemeal total quantity bNum, the mutual exclusion cyclic variable idx number block in disk is read in what pointer variable cReadBuf was directed toward Then memory headroom is transferred to the end GPU pointer variable gReadPtr from the memory headroom that the end CPU pointer variable cReadBuf is directed toward The video memory space of direction;The video memory space being directed toward by No. 2 process check pointer variable gWritePtr, if pointer variable The video memory space that gWritePtr is directed toward has been stored in data block, and the video memory which is directed toward from pointer variable gWritePtr is empty Between be transferred to pointer variable cWriteBuf direction memory headroom, then from pointer variable cWriteBuf be directed toward memory headroom The data block is stored in disk, and removes the video memory space of pointer variable gWritePtr direction;Step 4.2.1A) in 0~No. 2 Process simultaneously carry out the end GPU data block read, data block calculate and data block back, realize data transmission and calculate when Between be overlapped, reduce the data transfer overhead at the end GPU;
4.2.2A) Synchronize processes 0, 1 and 2; after synchronization, the computation of the GPU's current data block is complete. Process 0 then exchanges the GPU video memory pointers, specifically: declare a temporary pointer variable gtPtr; assign gProcPtr to gtPtr; assign gReadPtr to gProcPtr; assign gWritePtr to gReadPtr; and assign gtPtr to gWritePtr. Process 0 then checks the video memory space pointed to by gProcPtr: if its content is empty, execute step 4.2.3A), otherwise execute step 4.2.1A). In step 4.2.2A) of this embodiment, data exchange is achieved by swapping pointers, which avoids copying large memory regions and improves the space-time efficiency of storage-space management;
4.2.3A) Process 0 transfers the data block in the video memory space pointed to by gWritePtr to the memory space pointed to by cWriteBuf, then writes the data block from that memory space back to disk; finally, the GPU-side video memory spaces pointed to by gReadPtr, gProcPtr and gWritePtr are freed, as are the CPU-side memory spaces pointed to by cReadBuf and cWriteBuf.
In this embodiment, the detailed steps by which processes 3 to 5 simultaneously call the CPU to execute computing tasks in a three-stage pipeline in step 4.2) include:
4.2.1B) Process 3 starts ncCPU threads on the CPU according to the number ncCPU of available compute cores on the CPU; all CPU threads in parallel perform neuron-positioning computation on the data block pointed to by pointer variable cProcPtr. Process 4 increments the mutually exclusive loop variable idx by 1 and compares idx with the total block count bNum (4394000 in this embodiment); if idx is less than or equal to bNum, process 4 reads block number idx from disk into the memory space pointed to by cReadPtr. Process 5 checks the memory space pointed to by cWritePtr; if it already holds a data block, process 5 writes the block to disk and clears the memory space pointed to by cWritePtr. In step 4.2.1B), processes 3 to 5 simultaneously perform CPU-side data-block reading, data-block computation and data-block write-back, overlapping data transfer with computation in time and reducing CPU-side data-transfer overhead;
4.2.2B) Synchronize processes 3, 4 and 5; after synchronization, the computation of the CPU's current data block is complete. Process 3 then exchanges the CPU memory pointers, specifically: declare a temporary pointer variable ctPtr; assign cProcPtr to ctPtr; assign cReadPtr to cProcPtr; assign cWritePtr to cReadPtr; and assign ctPtr to cWritePtr. Process 3 then checks the memory space pointed to by cProcPtr: if its content is empty, execute step 4.2.3B), otherwise execute step 4.2.1B). In step 4.2.2B) of this embodiment, data exchange is achieved by swapping pointers, which avoids copying large memory regions and improves the space-time efficiency of storage-space management;
4.2.3B) Process 3 writes the data block in the memory space pointed to by cWritePtr back to disk, then frees the CPU-side memory spaces pointed to by cReadPtr, cProcPtr and cWritePtr.
As shown in Fig. 2, the positioning-computation tasks executed by the CPU and GPU each comprise three steps, namely data read-in, positioning computation and data write-back, and data dependences exist between these steps. The first round (Round 1) therefore executes only the read-in step, which loads slice data into memory and generates a 3-D image volume; in the second round (Round 2), the read-in step (generating the next 3-D image volume) and the neuron-positioning computation proceed simultaneously; and from the third round (Round 3) up to the antepenultimate round (Round n-2) before the positioning computation ends, the read-in, positioning-computation and write-back steps of each round all proceed simultaneously. In each round, the read-in step loads the next group of slice-image data; the positioning computation processes the volume data whose slice images were read in during the previous round; and the write-back step writes the neuron-positioning results of the round before that back to the disk array. Through this approach, the data read-in and data write-back times are effectively hidden inside the neuron-positioning computation step.
In conclusion the present embodiment is based on method of partition organizational computing and data transmission using CPU, CPU and GPU use more Thread carries out Neurons location, by the data transmission step between CPU memory, GPU video memory and disk using multistage pipeline mode, i.e., While handling each image block data, next image block data to be processed is read, while processed by upper one Image block data write back disk so that data transfer operation and data processing operation carry out parallel.The present embodiment heterogeneous platform Neurons location three-level flowing water is flat in CPU-GPU heterogeneous Computing in parallel through multi-process and multithreading hybrid parallel technology On platform, while Neurons location calculating is carried out using CPU multi-core processor and GPU many-core coprocessor, and pass through multistage flowing water Line technology carries out the time-interleaving calculated and data are transmitted, and Neurons location speed can be improved.It is found after statistics operation data, Compared with the Neurons location algorithm run on 12 core CPU of two-way, the present embodiment heterogeneous platform Neurons location three-level flowing water Neurons location speed can be increased to 3 times or more by parallel method.
In addition, this embodiment provides a heterogeneous platform neuron positioning three-level flow parallel system comprising a computer device with a GPU, the computer device being programmed to execute the steps of the aforementioned heterogeneous platform neuron positioning three-level flow parallel method of this embodiment. This embodiment also provides such a system in which a computer program programmed to execute the aforementioned method of this embodiment is stored on the storage medium of the computer device. This embodiment further provides a computer-readable storage medium on which is stored a computer program programmed to execute the aforementioned heterogeneous platform neuron positioning three-level flow parallel method of this embodiment.
The above is only a preferred embodiment of the present invention; the protection scope of the present invention is not limited to the above embodiment, and all technical solutions under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A heterogeneous platform neuron positioning three-level flow parallel method, characterized in that its implementation steps include:
1) computing block-partition parameters for the slice-image data according to the image size and the computation granularity;
2) allocating storage space on the CPU side and the GPU side respectively based on the block-partition parameters;
3) initializing variables and storage space;
4) performing task scheduling by the CPU, and executing computing tasks on the CPU and the GPU simultaneously in a three-stage pipeline, each computing task comprising three steps: data read-in, positioning computation and data write-back; each intermediate round of computation, while executing its positioning computation, also executes the data write-back of the previous round and the data read-in of the next round, so that the data read-in, positioning-computation and data write-back steps proceed in parallel.
2. The heterogeneous platform neuron positioning three-level flow parallel method according to claim 1, characterized in that the detailed steps of step 1) include:
1.1) computing the maximum data block size gSizeMax that the GPU can support for computation, gSizeMax being a positive integer;
1.2) determining the block size, per-direction block count and total block count for the x, y and z directions respectively: if xDim < gSizeMax, set the x-direction block size xScale to xDim, otherwise set xScale to gSizeMax, and set the x-direction block count xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, set the y-direction block size yScale to yDim, otherwise set yScale to gSizeMax, and set the y-direction block count yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, set the z-direction block size zScale to zDim, otherwise set zScale to gSizeMax, and set the z-direction block count zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; set the total block count bNum to xNum*yNum*zNum, the blocks being numbered consecutively from 1.
3. The heterogeneous platform neuron positioning three-level flow parallel method according to claim 2, characterized in that, when storage space is allocated on the CPU side and the GPU side respectively based on the block-partition parameters in step 2), for the GPU-side allocation three pointer variables gReadPtr, gProcPtr and gWritePtr are declared on the GPU side, the video memory capacity allocated for each pointer being gSizeMax³, wherein gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the previously processed image block; for the CPU-side allocation two pointer variables cReadBuf and cWriteBuf are declared on the CPU, the memory capacity allocated for each pointer being gSizeMax³, wherein cReadBuf buffers data between gReadPtr and the disk and cWriteBuf buffers data between gWritePtr and the disk; and the block size computed on the CPU is set to gSizeMax, the CPU side declaring three pointer variables cReadPtr, cProcPtr and cWritePtr, the memory capacity allocated for each pointer being gSizeMax³, wherein gSizeMax is the maximum data block size that the GPU can support for computation.
4. The heterogeneous platform neuron positioning three-level flow parallel method according to claim 3, characterized in that the detailed steps of initializing variables and storage space in step 3) include: applying for a mutually exclusive loop variable idx for the CPU and initializing idx to 2; reading block number 1 from disk into the memory space pointed to by pointer variable cProcPtr; reading block number 2 from disk into the memory space pointed to by pointer variable cReadBuf; and then transferring block number 2 from the memory space pointed to by cReadBuf to the GPU-side video memory space pointed to by gProcPtr.
5. The heterogeneous platform neuron positioning three-level flow parallel method according to claim 3, characterized in that the detailed steps of step 4) include:
4.1) starting on the CPU processes 0 to 2, which are responsible for organizing the computing tasks and data transfer on the GPU, and processes 3 to 5, which are responsible for organizing the computing tasks and data transfer on the CPU;
4.2) having processes 0 to 2 call the GPU to execute computing tasks in a three-stage pipeline while, at the same time, processes 3 to 5 call the CPU to execute computing tasks in a three-stage pipeline, so that while neuron positioning is performed on each image block, the data block required for the next group of neuron positioning is read in and the data block of the previous group of neuron positioning is written back to disk, whereby disk read/write operations and neuron positioning operations proceed in parallel;
4.3) synchronizing processes 0, 1, 2, 3, 4 and 5; the computation ends.
6. The heterogeneous platform neuron positioning three-level flow parallel method according to claim 5, characterized in that the detailed steps by which processes 0 to 2 call the GPU to execute computing tasks in a three-stage pipeline in step 4.2) include:
4.2.1A) process 0 starts ncGPU threads on the GPU according to the number ncGPU of available compute cores on the GPU, and all GPU threads in parallel perform neuron-positioning computation on the data block pointed to by pointer variable gProcPtr; process 1 increments the mutually exclusive loop variable idx by 1 and compares idx with the total block count bNum; if idx is less than or equal to bNum, process 1 reads block number idx from disk into the memory space pointed to by cReadBuf, then transfers it from the CPU-side memory space pointed to by cReadBuf to the GPU-side video memory space pointed to by gReadPtr; process 2 checks the video memory space pointed to by gWritePtr, and if it already holds a data block, transfers the block from the video memory space pointed to by gWritePtr to the memory space pointed to by cWriteBuf, writes the block from that memory space to disk, and clears the video memory space pointed to by gWritePtr;
4.2.2A) synchronize processes 0, 1 and 2; after synchronization, the computation of the GPU's current data block is complete; process 0 then exchanges the GPU video memory pointers, specifically: declare a temporary pointer variable gtPtr, assign gProcPtr to gtPtr, assign gReadPtr to gProcPtr, assign gWritePtr to gReadPtr, and assign gtPtr to gWritePtr; process 0 then checks the video memory space pointed to by gProcPtr: if its content is empty, execute step 4.2.3A), otherwise execute step 4.2.1A);
4.2.3A) process 0 transfers the data block in the video memory space pointed to by gWritePtr to the memory space pointed to by cWriteBuf, then writes the data block from that memory space back to disk; the GPU-side video memory spaces pointed to by gReadPtr, gProcPtr and gWritePtr are freed, as are the CPU-side memory spaces pointed to by cReadBuf and cWriteBuf.
7. The heterogeneous platform neuron positioning three-level flow parallel method according to claim 5, characterized in that the detailed steps by which processes 3 to 5 simultaneously call the CPU to execute computing tasks in a three-stage pipeline in step 4.2) include:
4.2.1B) process 3 starts ncCPU threads on the CPU according to the number ncCPU of available compute cores on the CPU, and all CPU threads in parallel perform neuron-positioning computation on the data block pointed to by pointer variable cProcPtr; process 4 increments the mutually exclusive loop variable idx by 1 and compares idx with the total block count bNum; if idx is less than or equal to bNum, process 4 reads block number idx from disk into the memory space pointed to by cReadPtr; process 5 checks the memory space pointed to by cWritePtr, and if it already holds a data block, writes the block to disk and clears the memory space pointed to by cWritePtr;
4.2.2B) synchronize processes 3, 4 and 5; after synchronization, the computation of the CPU's current data block is complete; process 3 then exchanges the CPU memory pointers, specifically: declare a temporary pointer variable ctPtr, assign cProcPtr to ctPtr, assign cReadPtr to cProcPtr, assign cWritePtr to cReadPtr, and assign ctPtr to cWritePtr; process 3 then checks the memory space pointed to by cProcPtr: if its content is empty, execute step 4.2.3B), otherwise execute step 4.2.1B);
4.2.3B) process 3 writes the data block in the memory space pointed to by cWritePtr back to disk; the CPU-side memory spaces pointed to by cReadPtr, cProcPtr and cWritePtr are freed.
8. A heterogeneous platform neuron positioning three-level flow parallel system comprising a computer device with a GPU, characterized in that the computer device is programmed to execute the steps of the heterogeneous platform neuron positioning three-level flow parallel method according to any one of claims 1 to 7.
9. A heterogeneous platform neuron positioning three-level flow parallel system comprising a computer device with a GPU, characterized in that a computer program programmed to execute the heterogeneous platform neuron positioning three-level flow parallel method according to any one of claims 1 to 7 is stored on the storage medium of the computer device.
10. A computer-readable storage medium, characterized in that a computer program programmed to execute the heterogeneous platform neuron positioning three-level flow parallel method according to any one of claims 1 to 7 is stored on the computer-readable storage medium.
CN201910289495.7A 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-level flow parallel method, system and medium Active CN110135569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910289495.7A CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-level flow parallel method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910289495.7A CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-level flow parallel method, system and medium

Publications (2)

Publication Number Publication Date
CN110135569A true CN110135569A (en) 2019-08-16
CN110135569B CN110135569B (en) 2021-09-21

Family

ID=67569648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910289495.7A Active CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-level flow parallel method, system and medium

Country Status (1)

Country Link
CN (1) CN110135569B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516795A (en) * 2019-08-28 2019-11-29 北京达佳互联信息技术有限公司 A kind of method, apparatus and electronic equipment for model variable allocation processing device
CN110543940A (en) * 2019-08-29 2019-12-06 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN110992241A (en) * 2019-11-21 2020-04-10 支付宝(杭州)信息技术有限公司 Heterogeneous embedded system and method for accelerating neural network target detection
CN112529763A (en) * 2020-12-16 2021-03-19 航天科工微电子***研究院有限公司 Image processing system and tracking and aiming system based on soft and hard coupling
CN113806067A (en) * 2021-07-28 2021-12-17 卡斯柯信号有限公司 Safety data verification method, device, equipment and medium based on vehicle-to-vehicle communication
CN113918356A (en) * 2021-12-13 2022-01-11 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium
CN117689025A (en) * 2023-12-07 2024-03-12 上海交通大学 Quick large model reasoning service method and system suitable for consumer display card

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130169658A1 (en) * 2011-12-28 2013-07-04 Think Silicon Ltd Multi-threaded multi-format blending device for computer graphics operations
CN103617626A (en) * 2013-12-16 2014-03-05 武汉狮图空间信息技术有限公司 Central processing unit (CPU) and ground power unit (GPU)-based remote-sensing image multi-scale heterogeneous parallel segmentation method
CN104267940A (en) * 2014-09-17 2015-01-07 武汉狮图空间信息技术有限公司 Quick map tile generation method based on CPU+GPU
CN104375807A (en) * 2014-12-09 2015-02-25 中国人民解放军国防科学技术大学 Three-level flow sequence comparison method based on many-core co-processor
CN106815807A (en) * 2017-01-11 2017-06-09 重庆市地理信息中心 A kind of unmanned plane image Fast Mosaic method based on GPU CPU collaborations
CN109451322A (en) * 2018-09-14 2019-03-08 北京航天控制仪器研究所 DCT algorithm and DWT algorithm for compression of images based on CUDA framework speed up to realize method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TAO LI: "Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing", SPRINGER *
XIAO NAN: "Research on a naive Bayes image classification algorithm based on heterogeneous *** architecture", CHINA MASTERS' THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *
MA YONGJUN ET AL.: "Parallel template-matching target recognition algorithm for CPU+GPU heterogeneous platforms", JOURNAL OF TIANJIN UNIVERSITY OF SCIENCE AND TECHNOLOGY *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516795A (en) * 2019-08-28 2019-11-29 北京达佳互联信息技术有限公司 A kind of method, apparatus and electronic equipment for model variable allocation processing device
CN110516795B (en) * 2019-08-28 2022-05-10 北京达佳互联信息技术有限公司 Method and device for allocating processors to model variables and electronic equipment
CN110543940A (en) * 2019-08-29 2019-12-06 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN110992241A (en) * 2019-11-21 2020-04-10 支付宝(杭州)信息技术有限公司 Heterogeneous embedded system and method for accelerating neural network target detection
CN112529763A (en) * 2020-12-16 2021-03-19 航天科工微电子***研究院有限公司 Image processing system and tracking and aiming system based on soft and hard coupling
CN113806067A (en) * 2021-07-28 2021-12-17 卡斯柯信号有限公司 Safety data verification method, device, equipment and medium based on vehicle-to-vehicle communication
CN113806067B (en) * 2021-07-28 2024-03-29 卡斯柯信号有限公司 Safety data verification method, device, equipment and medium based on vehicle-to-vehicle communication
CN113918356A (en) * 2021-12-13 2022-01-11 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium
CN113918356B (en) * 2021-12-13 2022-02-18 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium
CN117689025A (en) * 2023-12-07 2024-03-12 上海交通大学 Quick large model reasoning service method and system suitable for consumer display card

Also Published As

Publication number Publication date
CN110135569B (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN110135569A (en) Heterogeneous platform neuron positioning three-level flow parallel method, system and medium
CN110363294B (en) Representing a neural network with paths in the network to improve performance of the neural network
Baskaran et al. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories
Herrero-Lopez et al. Parallel multiclass classification using SVMs on GPUs
CN109993683A (en) Machine learning sparse calculation mechanism, the algorithm calculations micro-architecture and sparsity for training mechanism of any neural network
CN110135575A (en) Communication optimization for distributed machines study
CN103761215B (en) Matrix transpose optimization method based on graphic process unit
Scherer et al. Accelerating large-scale convolutional neural networks with parallel graphics multiprocessors
CN105808309B (en) A kind of high-performance implementation method of the basic linear algebra library BLAS three-level function GEMM based on Shen prestige platform
US10725837B1 (en) Persistent scratchpad memory for data exchange between programs
EP3742350A1 (en) Parallelization strategies for training a neural network
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Liu Parallel and scalable sparse basic linear algebra subprograms
DE102023105565A1 Methods and apparatus for efficient access to multi-dimensional data structures and/or other large blocks of data
CN103413273A (en) Method for rapidly achieving image restoration processing based on GPU
CN115390922A (en) Shenwei architecture-based seismic wave simulation algorithm parallel optimization method and system
Bakunas-Milanowski et al. Efficient algorithms for stream compaction on GPUs
DE102020130081A1 Extended processor functions for computations
CN110383206A (en) System and method for generating Gauss number using hardware-accelerated
CN115756605A (en) Shallow cloud convection parameterization scheme heterogeneous computing method based on multiple GPUs
US20230289398A1 (en) Efficient Matrix Multiply and Add with a Group of Warps
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
Zhou et al. A Parallel Scheme for Large‐scale Polygon Rasterization on CUDA‐enabled GPUs
Hou et al. A GPU-based tabu search for very large hardware/software partitioning with limited resource usage
CN111445503B (en) Pyramid mutual information image registration method based on parallel programming model on GPU cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant