CN102214086A - General-purpose parallel acceleration algorithm based on multi-core processor - Google Patents


Info

Publication number
CN102214086A
Authority
CN
China
Prior art keywords
node
calculation
computation
timing graph
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101657407A
Other languages
Chinese (zh)
Inventor
曹伟
王伶俐
王颖
周学功
叶晓敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Shanghai Redneurons Co Ltd
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN2011101657407A priority Critical patent/CN102214086A/en
Publication of CN102214086A publication Critical patent/CN102214086A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of parallel processors, and in particular relates to a general-purpose parallel acceleration algorithm based on a multi-core processor. The algorithm comprises the following steps: first, for large-scale, high-density data computations, recognizing the data dependences in the computation process and decomposing the parts with low or no data dependence into independent computation sequences; distributing the computation sequences to the compute cores of the multi-core processor for execution, scheduling threads during execution to achieve load balance, and dynamically managing memory to achieve memory alignment; and, after the compute cores have finished running the computation sequences, retrieving the computed result fragments and combining them into the complete result, thereby achieving a higher computational speedup ratio. Based on multi-core processors such as GPGPUs (general-purpose graphics processing units) and CELL processors, the algorithm realizes parallelized large-scale data computation, optimal scheduling of parallel threads, and general-purpose accelerated computation with little dependence on any particular multi-core processor architecture.

Description

General-purpose parallel acceleration algorithm based on a multi-core processor
Technical field
The invention belongs to the technical field of parallel processors, and specifically relates to a general-purpose parallel acceleration algorithm based on a multi-core processor.
Background art
The peak performance of GPUs rose from 50 GFLOPS (giga floating-point operations per second) on the NV40 in 2004 to 500 GFLOPS on the G80 in 2007. Memory bandwidth likewise increased from 42 GB/s on the ATI Radeon X1800XT to 86.4 GB/s on the NVIDIA GeForce 8800 GTX. The fully pipelined, highly parallel architecture of the GPU, together with its high memory bandwidth, supports this peak computational performance. By comparison, a contemporary high-end processor such as a 3 GHz Pentium 4 CPU delivers about 12 GFLOPS, with roughly 6 GB/s of bandwidth to main memory.
The CELL processor was jointly developed for high-speed computation by Sony, Sony Computer Entertainment, Toshiba, and IBM, starting from an investment in March 2001. It is designed around the PowerPC architecture with a RISC (reduced instruction set computer) instruction system, and features a high clock frequency and high execution efficiency. It is mainly used in the PlayStation 3 and in blade servers. The CELL processor contains one PPE (PowerPC Processing Element), simplified from the PowerPC 970, and eight coprocessors called SPEs (Synergistic Processing Elements), with an operating frequency above 4 GHz. It is a 64-bit Power processor whose eight built-in cooperating processing units can handle decomposed computations, and a single processor can run multiple operating systems.
Faced with such high-performance parallel processors, adopting a general-purpose parallel acceleration algorithm for computation has become an inevitable trend.
List of references
[1] K. Gulati and S. P. Khatri, "Towards acceleration of fault simulation using graphics processing units," DAC, 2008.
[2] K. Gulati and S. P. Khatri, "Accelerating Statistical Static Timing Analysis Using Graphics Processing Units," DATE, 2009.
[3] Zou Yongning, Tan Hui, Huang Liang, "Several methods for accelerating CT image reconstruction," Computer Systems & Applications, 2008.
Summary of the invention
The object of the present invention is to provide a general-purpose parallel acceleration algorithm based on a multi-core processor, so that computations can be accelerated in parallel on different multi-core processors.
The parallel acceleration algorithm based on a multi-core processor provided by the invention comprises, in order: a process of identifying and decomposing a large-scale data computation, a process of scheduling and distributing the computation sequences to the compute cores of the processor for execution, and a process of collecting and recombining the computed results. Wherein:
Identifying and decomposing the large-scale data computation means that, for large-scale, high-density data computations, the parts of the computation with low or no data dependence, i.e., those that can be processed in parallel, are identified and decomposed into independent computation sequences;
Scheduling and distributing the computation sequences to the compute cores means that the decomposed, independent computation sequences are assigned to the individual compute cores of the multi-core processor for execution; during execution, threads are scheduled to achieve load balance, and memory is managed dynamically to achieve memory alignment;
Collecting and recombining the computed results means that, after every compute core has finished executing its computation sequence, the result fragments are retrieved and recombined in order into the complete result, achieving a high computational speedup ratio.
The present invention is based on multi-core processors (for example, GPGPUs (general-purpose graphics processing units) and CELL processors) and realizes parallelized large-scale data computation, optimal scheduling of parallel threads, and general-purpose accelerated computation with little dependence on any particular multi-core processor architecture.
Description of drawings
Fig. 1 is the flow chart of the algorithm.
Embodiment
The technical scheme in the embodiments of the invention is described clearly and completely below, in conjunction with the accompanying drawing. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained from the embodiments of the present invention by those of ordinary skill in the art without creative work fall within the scope of protection of the invention.
The algorithm flow, as shown in Fig. 1, is:
a) For large-scale, high-density data computations, decompose the parts of the computation that can be processed in parallel into independent computation sequences;
b) Assign the decomposed computation sequences to the individual compute cores of the multi-core processor for execution; during execution, schedule threads to achieve load balance and manage memory dynamically;
c) After the computation sequences have finished executing, retrieve the result fragments and combine them into the complete result.
The following embodiment describes the parallel acceleration of an SSTA (statistical static timing analysis) algorithm on a GPGPU.
a) Decomposition according to the input and the computation process
(1) Describing the data input and computation process: constructing the circuit timing graph
The present invention recognizes data dependence with a graph representation. The graph consists of a node set V and an edge set E: the nodes in V represent computation units and input/output ports, and the edges in E represent the dependences between computation units. Computation units with low or no data dependence correspond to nodes with few or no connecting edges.
In block-based SSTA, the circuit netlist must first be converted into a timing graph reflecting the circuit topology. The timing graph consists of a node set V and an edge set E: the nodes in V represent the gates and the inputs and outputs in the circuit netlist, and the edges in E represent the connections between them. After the timing graph has been generated, the SSTA method traverses it and performs SUM (summation) and MAX (maximum) operations on the different configurations of each node to obtain the analysis result. In order to use CUDA (Compute Unified Device Architecture), the development environment of the GPGPU, to parallelize the SUM and MAX operations over the configurations of each node, the circuit's timing graph must be placed in the GPU's video memory, where it is accessed by lookup to supply the data needed by the SUM and MAX operations. After the main program has read in the circuit netlist, the CPU side generates the timing graph; the next operation is to transplant the CPU-side timing graph to the GPU side.
A timing graph is normally a directed acyclic graph. Because the GPU's video memory was optimized for rendering images, it still cannot support custom data structures effectively; under the CUDA model, however, the video memory can be used as a general array, which supports more effective data structures than earlier GPGPU approaches. The present invention describes the timing graph with an adjacency-array data structure of space complexity O(V+E). The adjacency-array representation of a timing graph generally consists of two arrays: a node array containing the node information and an edge array containing the edge information. Converting the CPU-side timing graph into the adjacency-array form used by the CUDA model on the GPU side takes two steps:
1. Perform a breadth-first traversal of the nodes of the CPU-side timing graph, numbering the nodes and assigning them to levels during the traversal;
2. Visit the nodes of the CPU-side timing graph level by level, adding the node information and edge information to the node array Va and the edge array Ea.
When visiting a node v of the timing graph, the following operations add information to Va and Ea:
i. Obtain the number of v in the timing graph and use it as the index of v in Va;
ii. Traverse the predecessor nodes of v and append their numbers to the edge array Ea;
iii. Set the value of v in Va to the index in Ea of v's first predecessor; if v has no predecessors, set the value to -1.
During the conversion, either the predecessor node numbers or the driven (successor) node numbers can be recorded in the edge array. Block-based SSTA needs the information of the predecessor nodes to perform the SUM and MAX operations, so the edge array records the numbers of the predecessor nodes. Once the CPU-side timing graph has been processed, the resulting adjacency-array representation is imported into the GPU's video memory, and the block-based SSTA computation can then be carried out on the GPU side.
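The two conversion steps above can be sketched on the host side as follows. This is a minimal illustration, not the patented implementation; the helper build_adjacency and its input format (per-node predecessor lists, already numbered in breadth-first level order) are assumptions made for the example.

```cpp
#include <vector>

// Build the adjacency arrays Va/Ea described above. Va[v] is the index in
// Ea of v's first predecessor, or -1 if v has none; one extra value is
// appended to Va so that the predecessors of v occupy Ea[Va[v] .. Va[v+1]-1].
struct AdjacencyArrays {
    std::vector<int> Va;  // node array, size = node count + 1 (sentinel)
    std::vector<int> Ea;  // edge array: concatenated predecessor numbers
};

AdjacencyArrays build_adjacency(const std::vector<std::vector<int>>& preds) {
    AdjacencyArrays g;
    for (const auto& p : preds) {
        // Record where this node's incoming edges start, or -1 if none.
        g.Va.push_back(p.empty() ? -1 : static_cast<int>(g.Ea.size()));
        for (int u : p) g.Ea.push_back(u);  // predecessor numbers into Ea
    }
    g.Va.push_back(static_cast<int>(g.Ea.size()));  // sentinel entry
    return g;
}
```

For a four-node graph in which nodes 0 and 1 are primary inputs, node 2 is driven by 0 and 1, and node 3 is driven by 2, this yields Va = {-1, -1, 0, 2, 3} and Ea = {0, 1, 2}, using O(V+E) space as stated above.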
(2) Decomposing the arrival-time computation
In SSTA, the arrival time (abbreviated AT) of the output of a gate is computed as follows: first, for each input pin i, the arrival time at i is summed (SUM) with the delay from input pin i to the output; then the maximum (MAX) over all these sums is taken, giving the arrival time at the output of the current gate:
AT = MAX{ (AT_in_i + Delay_in_i→out) },  i = 1, …, n
where n is the number of inputs. To obtain the statistical distribution of the arrival time, it must be computed for multiple configurations. Since a large number of SUM and MAX operations must be performed on the node configurations, and different configurations are data-independent, the SUM and MAX operations of the individual configurations can be separated and computed in parallel, using the parallel nature of the GPU to accelerate the process. If the gate currently being computed has n inputs, each configuration requires a total of n summations and n-1 maximum operations.
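The formula above can be made concrete with a host-side sketch (an illustration under the assumption of plain float delays, not the CUDA kernel of the embodiment):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Arrival time of a gate output for one configuration: MAX over the n
// inputs of (input arrival time + input-to-output delay). This performs
// n summations and n-1 maximum operations; configurations share no data,
// so each one can be handled by an independent thread.
float arrival_time(const std::vector<float>& at_in,
                   const std::vector<float>& delay) {
    float best = at_in[0] + delay[0];                // SUM for the first input
    for (std::size_t i = 1; i < at_in.size(); ++i)
        best = std::max(best, at_in[i] + delay[i]);  // SUM, then running MAX
    return best;
}
```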
b) Assigning each computation sequence to the compute cores of the multi-core processor
Each computation unit obtained in the previous step is assigned to a compute core of the multi-core processor for parallel execution; during execution, load balance is achieved by scheduling threads, and memory is managed dynamically to achieve memory alignment. Specifically:
(1) The computation steps of each compute core
Below is the pseudocode for performing the SUM and MAX computations on the configurations of each node in CUDA; one thread corresponds to the computation of one configuration of one node. The whole procedure takes five input parameters: a pointer float* AT to the global-memory region, allocated in video memory, that stores the arrival time of each node; pointers int* Va and int* Ea to the node set and edge set of the timing graph in video memory; a pointer int* type to the global-memory region that stores the gate type of each node; and a texture memory textureMem DEL that stores the pin-to-pin delays of all gate types. These four data structures are copied from CPU memory into the GPU's video memory. Once these parameters have been obtained, the computation proceeds in the following steps.
ssta_kernel(float* AT, int* Va, int* Ea, int* type, textureMem DEL) {
    thread_id   = get_thread_id();
    virtex_id   = get_vid(Va, thread_id);
    sample_id   = get_sid(thread_id);
    pre_vids[]  = get_pre_ids(Ea, virtex_id);
    virtex_type = get_type(type, virtex_id);
    del[]       = get_del(DEL, virtex_type, sample_id);
    pre_at[]    = get_at(AT, pre_vids[], sample_id);
    sum[]       = perform_SUM(del[], pre_at[]);
    cur_at      = perform_MAX(sum[]);
    store_at_to_AT(cur_at, AT, sample_id, virtex_id);
}
1. get_thread_id() obtains the ID of the thread. CUDA provides the built-in variables blockIdx and threadIdx, giving the index of the block (compute core) and the index of the thread within the block, from which the thread ID is easily computed.
2. get_vid(Va, thread_id) uses the thread ID to find, in the node array Va of the timing graph, the index of the node corresponding to the current thread and the starting position in Ea of that node's incoming edges.
3. get_sid(thread_id) uses the thread ID to find the index of the configuration to be computed by the current thread. Each node requires multiple configurations to be computed and each thread computes a different one, so the current thread must obtain the index of its configuration to decide which configuration's data to compute.
4. get_pre_ids(Ea, virtex_id) uses the node index to find, in the edge array Ea of the timing graph, the indices of the predecessor nodes of the current thread's node; in STA, the arrival time of the current node is computed from the arrival times of its predecessors, and since the number of predecessors may exceed 1, the result may be an array. Note that what Va stores is the starting position in Ea of the current node's incoming edges, so the number of predecessors of a node can be obtained by subtracting its starting position in Ea from the starting position of the next node. By appending to the end of Va one extra value pointing just past the position in Ea of the last node's last incoming edge, the predecessor count of the last node can be obtained in the same way.
5. get_type(type, virtex_id) uses the node index to find the gate type of the node in the type array type.
6. get_del(DEL, virtex_type, sample_id) uses the node type and the configuration index to obtain, from the texture memory DEL that stores the pin-to-pin gate delays, the pin-to-pin delay data for the current thread's gate and configuration. Texture memory is used for the gate delays because these delay data are read-only during the computation, and the read-only texture memory reduces thread access overhead. The delay covers the path from each input to the output, so it is an array.
7. get_at(AT, pre_vids[], sample_id) uses the predecessor node indices and the configuration index to obtain, from the arrival-time array AT, the arrival times of the current node's predecessors in the current configuration. Since the number of predecessors may exceed 1, the predecessors' arrival times likewise form an array.
8. perform_SUM(del[], pre_at[]) computes the sums, adding the pin-to-pin gate delays to the arrival times of the predecessor nodes.
9. perform_MAX(sum[]) takes the maximum: the sums from the previous step are compared and the largest, which is the arrival time of the current configuration, is selected.
10. store_at_to_AT(cur_at, AT, sample_id, virtex_id) writes the computed result back into the arrival-time array AT.
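The sentinel trick of step 4 can be illustrated on the host side. This sketch carries an assumption: the nodes are stored in breadth-first level order, so every node without predecessors (value -1 in Va) precedes the nodes that have them.

```cpp
#include <vector>

// Predecessors of node v occupy Ea[Va[v] .. Va[v+1]-1]; thanks to the
// extra value appended to Va, this also works for the last node, and the
// predecessor count is simply Va[v+1] - Va[v].
std::vector<int> get_pre_ids(const std::vector<int>& Va,
                             const std::vector<int>& Ea, int v) {
    if (Va[v] < 0) return {};  // no predecessors (e.g. a primary input)
    return std::vector<int>(Ea.begin() + Va[v], Ea.begin() + Va[v + 1]);
}
```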
(2) Load balance during the computation
In CUDA each block can have at most 512 threads, and computing 64K configurations would keep all running blocks in a fully loaded state, with no need to consider thread scheduling among the multiprocessors. The present invention, however, uses a sparse-grid method to reduce the number of configurations, so that the number of computations is far smaller than with the Monte Carlo method: the number of configurations per node can be as low as 11, far below the per-block thread limit. Since a block works best when at least 256 threads (some multiple thereof) are running, keeping the blocks fully loaded becomes the key factor for efficiency. Computing the configurations of several nodes simultaneously within one block greatly improves the utilization of the blocks and multiprocessors and raises the efficiency of the computation.
When the kernel is launched in CUDA, the dimension and size of a block are specified by blockDim and those of the grid by gridDim. The present invention sets the block to two dimensions of size (m, n) with m × n ≤ 512, and the grid to one dimension of size (a, 1) with a × m ≥ the number of nodes; n is the number of configurations of each node, m is the number of nodes handled by each block, and a is the number of blocks in the grid. Inside the kernel, the node index of the current computation is obtained as blockIdx.x × blockDim.y + threadIdx.y, and threadIdx.x is the configuration index of the current computation. The benefit of this arrangement is that when the number of configurations changes but remains at most 512, only m, n, and a need to be adjusted to adapt; and when the number of configurations exceeds 512, gridDim can be used to set the thread grid to (a, b) with n × b ≥ the number of configurations, the configuration index then being blockIdx.y × blockDim.x + threadIdx.x. This makes adjusting the number of configurations convenient and improves computational efficiency.
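The index arithmetic above can be checked with a small host-side emulation (hypothetical helper names; the CUDA built-in variables are modeled as plain integers):

```cpp
// Thread-to-work mapping for an (m, n) block in a one-dimensional grid:
// each block covers m nodes x n configurations, so
//   node index   = blockIdx.x * blockDim.y + threadIdx.y
//   config index = threadIdx.x
struct Work { int node; int config; };

Work map_thread(int blockIdx_x, int blockDim_y,
                int threadIdx_x, int threadIdx_y) {
    Work w;
    w.node = blockIdx_x * blockDim_y + threadIdx_y;
    w.config = threadIdx_x;
    return w;
}
```

For example, with m = blockDim.y = 4, the thread at threadIdx = (3, 1) of block 2 computes configuration 3 of node 2 × 4 + 1 = 9.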
(3) Dynamic memory management during the computation
The method of the invention uses global memory to store the arrival time of each node. The management of global memory has a large influence on efficiency, because reading or writing global memory usually takes 400-600 clock cycles. Using an aligned access pattern makes accesses to global memory maximally efficient. Since CUDA threads read and write global memory in groups of a half-warp, i.e. 16 threads, reading or writing an aligned global-memory location requires only one global-memory access, whereas an unaligned one requires 16 and causes an enormous waste of overhead.
To access global memory with alignment, the present invention uses the CUDA built-in function cudaMallocPitch(devPtr, pitch, width, height) to allocate global memory. The space allocated by cudaMallocPitch can be regarded as two-dimensional: width is the number of configurations n multiplied by the width of the data type, height is the number of nodes m, and pitch is the size of the space actually allocated along the width dimension, so the allocated space is aligned with pitch as its effective width; the function returns a pointer devPtr to the allocated space. In the implementation of the present invention the call is cudaMallocPitch(AT, pitch, n*sizeof(float), m). In the kernel, pitch can then be used to find the aligned global-memory addresses very conveniently, so each contiguous stretch of global memory accessed by the threads is aligned, improving computational efficiency.
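The effect of the pitched allocation can be sketched as follows, mirroring the cudaMallocPitch semantics described above (the 64-byte alignment boundary is an assumption for illustration; the real device chooses its own):

```cpp
#include <cstddef>

// Round the row width (n configurations x sizeof(float)) up to the
// alignment boundary; each node's row then starts on an aligned address,
// so a half-warp's 16 accesses coalesce instead of splitting.
std::size_t round_up_pitch(std::size_t width_bytes, std::size_t align) {
    return (width_bytes + align - 1) / align * align;
}

// Byte offset of configuration `config` of node `node` in a pitched buffer.
std::size_t at_offset(int node, int config, std::size_t pitch) {
    return node * pitch + config * sizeof(float);
}
```

With n = 11 configurations of float, the 44-byte row is padded to a 64-byte pitch, and element (node 2, configuration 3) sits at byte offset 2 × 64 + 3 × 4 = 140.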
The present invention uses texture memory to store the gate delays: the gate delays are read-only and never written while the kernel runs, and texture memory is read-only and cached, so the time overhead of accessing it is minimal, as little as one clock cycle, making it a very efficient access pattern. Constant memory could realize the same function with the same effect.
c) Retrieving the results and recombining them into the complete result
Because of the GPU's SIMD (single instruction, multiple data) programming characteristic, all threads execute the same instructions but operate on different data. In the block-based SSTA computation, there are no data or control dependences between the repeated invocations of the SUM and MAX computations. Therefore, the results obtained in the previous step are recombined into the complete result according to the correspondences in the graph representation: the complete SUM result is obtained by adding together the SUM results of the individual compute cores, and the complete MAX result is obtained by a simple comparison taking the maximum of the MAX results of the individual compute cores. This yields the longest-path delay of the circuit and completes the SSTA computation.
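Step c) amounts to two order-preserving reductions over the per-core result fragments; a minimal host-side sketch (the fragment layout as flat float vectors is an assumption for illustration):

```cpp
#include <algorithm>
#include <vector>

// Recombine partial results from the compute cores: partial SUM fragments
// are added together, and partial MAX fragments are compared with the
// largest kept, yielding the complete SUM and MAX results.
float merge_sums(const std::vector<float>& partial) {
    float total = 0.0f;
    for (float x : partial) total += x;  // complete SUM = sum of fragments
    return total;
}

float merge_maxes(const std::vector<float>& partial) {
    // complete MAX = maximum over the per-core MAX results
    return *std::max_element(partial.begin(), partial.end());
}
```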

Claims (4)

1. A general-purpose parallel acceleration algorithm based on a multi-core processor, characterized by comprising, in order: a process of identifying and decomposing a large-scale data computation, a process of scheduling and distributing the computation sequences to the compute cores of the processor for execution, and a process of collecting and recombining the computed results; wherein:
identifying and decomposing the large-scale data computation means that, for large-scale, high-density data computations, the parts of the computation with low or no data dependence are identified and decomposed into independent computation sequences;
scheduling and distributing the computation sequences to the compute cores means that the decomposed, independent computation sequences are assigned to the individual compute cores of the multi-core processor for execution;
collecting and recombining the computed results means that, after every compute core has finished executing its computation sequence, the result fragments are retrieved and recombined in order into the complete result, achieving accelerated computational performance.
2. The general-purpose parallel acceleration algorithm based on a multi-core processor according to claim 1, characterized in that said identifying and decomposing the large-scale data computation comprises:
(1) describing the data input and computation process: constructing the circuit timing graph;
in the block-based SSTA computation, the circuit netlist is first converted into a timing graph reflecting the circuit topology; the timing graph consists of a node set V and an edge set E, the nodes in V representing the gates and the inputs and outputs in the circuit netlist and the edges in E representing the connections between them; the timing graph is then traversed, SUM and MAX operations are performed on the different configurations of each node, and the analysis result is obtained;
the timing graph is a directed acyclic graph, described with an adjacency-array data structure of space complexity O(V+E); the adjacency-array representation of the timing graph consists of two arrays: a node array containing the node information and an edge array containing the edge information; converting the CPU-side timing graph into the adjacency-array form used by the CUDA model on the GPU side takes two steps:
(1) performing a breadth-first traversal of the nodes of the CPU-side timing graph, numbering the nodes and assigning them to levels during the traversal;
(2) visiting the nodes of the CPU-side timing graph level by level, adding the node information and edge information to the node array Va and the edge array Ea;
when visiting a node v of the timing graph, the following operations add information to Va and Ea:
(1) obtaining the number of v in the timing graph and using it as the index of v in Va;
(2) traversing the predecessor nodes of v and appending their numbers to the edge array Ea;
(3) setting the value of v in Va to the index in Ea of v's first predecessor; if v has no predecessors, setting the value to -1;
(2) decomposing the arrival-time computation;
in SSTA, the arrival time (abbreviated AT) of the output of a gate is computed as follows: first, for each input pin i, the arrival time at i is summed (SUM) with the delay from input pin i to the output; then the maximum (MAX) over all the sums is taken, giving the arrival time at the output of the current gate:
AT = MAX{ (AT_in_i + Delay_in_i→out) },  i = 1, …, n
where n is the number of inputs.
3. The general-purpose parallel acceleration algorithm based on a multi-core processor according to claim 2, characterized in that said scheduling and distributing the computation sequences to the compute cores is: assigning each computation unit obtained in the previous step to a compute core of the multi-core processor for parallel execution; achieving load balance during execution by scheduling threads; and managing memory dynamically to achieve memory alignment.
4. The general-purpose parallel acceleration algorithm based on a multi-core processor according to claim 3, characterized in that said process of collecting and recombining the computed results is: recombining the results obtained in the previous step, according to the correspondences in the graph representation, into the complete result: adding together the SUM results of the individual compute cores to obtain the complete SUM result; and comparing the MAX results of the individual compute cores and taking the maximum to obtain the complete MAX result.
CN2011101657407A 2011-06-20 2011-06-20 General-purpose parallel acceleration algorithm based on multi-core processor Pending CN102214086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101657407A CN102214086A (en) 2011-06-20 2011-06-20 General-purpose parallel acceleration algorithm based on multi-core processor


Publications (1)

Publication Number Publication Date
CN102214086A true CN102214086A (en) 2011-10-12

Family

ID=44745409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101657407A Pending CN102214086A (en) 2011-06-20 2011-06-20 General-purpose parallel acceleration algorithm based on multi-core processor

Country Status (1)

Country Link
CN (1) CN102214086A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411658A (en) * 2011-11-25 2012-04-11 中国人民解放军国防科学技术大学 Molecular dynamics accelerating method based on CPU (Central Processing Unit) and GPU (Graphics Processing Unit) cooperation
CN102508820A (en) * 2011-11-25 2012-06-20 中国人民解放军国防科学技术大学 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)
CN102611680A (en) * 2011-10-26 2012-07-25 苏州闻道网络科技有限公司 Multi-core synchronous audio converting method based on multi-core nodes in local area network (LAN)
CN102710589A (en) * 2011-10-26 2012-10-03 苏州闻道网络科技有限公司 Local area network multi-core node-based audio conversion method
CN102750328A (en) * 2012-05-29 2012-10-24 北京城市网邻信息技术有限公司 Construction and storage method for data structure
CN102831627A (en) * 2012-06-27 2012-12-19 浙江大学 PET (positron emission tomography) image reconstruction method based on GPU (graphics processing unit) multi-core parallel processing
CN103514042A (en) * 2012-06-18 2014-01-15 中国科学院计算机网络信息中心 Dual-adjustment merge-sorting tuning method and device
CN104794002A (en) * 2014-12-29 2015-07-22 南京大学 Multi-channel parallel dividing method based on specific resources and hardware architecture of multi-channel parallel dividing method based on specific resources
CN105677436A (en) * 2015-12-31 2016-06-15 华为技术有限公司 Program transforming method, processor and computer system
CN108345503A (en) * 2018-01-18 2018-07-31 杭州电子科技大学 CPU-GPU based parallel toolpath planning method for B-spline surfaces
CN108510429A (en) * 2018-03-20 2018-09-07 华南师范大学 GPU-based parallel acceleration method for multivariate cryptographic algorithms
CN110134508A (en) * 2018-02-08 2019-08-16 北京连心医疗科技有限公司 Cloud-based Monte Carlo state machine system and architecture method
CN111858465A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale matrix QR decomposition parallel computing structure
CN112000844A (en) * 2020-08-18 2020-11-27 中山大学 Vectorization method, system and device for bottom-up breadth-first search
CN115687233A (en) * 2021-07-29 2023-02-03 腾讯科技(深圳)有限公司 Communication method, device, equipment and computer readable storage medium
CN116755782A (en) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment, storage medium and program product for instruction scheduling

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6728961B1 (en) * 1999-03-31 2004-04-27 International Business Machines Corporation Method and system for dynamically load balancing a process over a plurality of peer machines
CN101441557A (en) * 2008-11-08 2009-05-27 腾讯科技(深圳)有限公司 Distributed parallel calculating system and method based on dynamic data division
CN102054090A (en) * 2009-10-28 2011-05-11 复旦大学 Statistical timing analysis method and device based on improved adaptive random configuration method
CN102135949A (en) * 2011-03-01 2011-07-27 浪潮(北京)电子信息产业有限公司 Computing network system, method and device based on graphic processing unit

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6728961B1 (en) * 1999-03-31 2004-04-27 International Business Machines Corporation Method and system for dynamically load balancing a process over a plurality of peer machines
CN101441557A (en) * 2008-11-08 2009-05-27 腾讯科技(深圳)有限公司 Distributed parallel calculating system and method based on dynamic data division
CN102054090A (en) * 2009-10-28 2011-05-11 复旦大学 Statistical timing analysis method and device based on improved adaptive random configuration method
CN102135949A (en) * 2011-03-01 2011-07-27 浪潮(北京)电子信息产业有限公司 Computing network system, method and device based on graphic processing unit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
K. Gulati et al.: "Accelerating statistical static timing analysis using graphics processing units", Proceedings of the 2009 Asia and South Pacific Design Automation Conference *
CHENG Hao: "Implementation and performance analysis of CPU-GPU parallel matrix multiplication", Computer Engineering *
XU Hui: "Research on parallel solution methods for cutting and layout problems based on GPU high-performance computing", China Masters' Theses Full-text Database, Basic Sciences *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102611680A (en) * 2011-10-26 2012-07-25 苏州闻道网络科技有限公司 Multi-core synchronous audio conversion method based on multi-core nodes in local area network (LAN)
CN102710589A (en) * 2011-10-26 2012-10-03 苏州闻道网络科技有限公司 Local area network multi-core node-based audio conversion method
CN102508820B (en) * 2011-11-25 2014-05-21 中国人民解放军国防科学技术大学 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graphics Processing Unit)
CN102508820A (en) * 2011-11-25 2012-06-20 中国人民解放军国防科学技术大学 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graphics Processing Unit)
CN102411658A (en) * 2011-11-25 2012-04-11 中国人民解放军国防科学技术大学 Molecular dynamics accelerating method based on CPU (Central Processing Unit) and GPU (Graphics Processing Unit) cooperation
CN102411658B (en) * 2011-11-25 2013-05-15 中国人民解放军国防科学技术大学 Molecular dynamics accelerating method based on CPU (Central Processing Unit) and GPU (Graphics Processing Unit) cooperation
CN102750328A (en) * 2012-05-29 2012-10-24 北京城市网邻信息技术有限公司 Construction and storage method for data structure
CN103514042B (en) * 2012-06-18 2018-01-09 中国科学院计算机网络信息中心 Dual-adjustment merge-sorting tuning method and device
CN103514042A (en) * 2012-06-18 2014-01-15 中国科学院计算机网络信息中心 Dual-adjustment merge-sorting tuning method and device
WO2014000557A1 (en) * 2012-06-27 2014-01-03 浙江大学 Pet image reconstruction method based on gpu multicore parallel processing
CN102831627A (en) * 2012-06-27 2012-12-19 浙江大学 PET (positron emission tomography) image reconstruction method based on GPU (graphics processing unit) multi-core parallel processing
CN104794002B (en) * 2014-12-29 2019-03-22 南京大学 Multi-channel parallel division method and system
CN104794002A (en) * 2014-12-29 2015-07-22 南京大学 Multi-channel parallel division method based on specific resources and hardware architecture thereof
CN105677436B (en) * 2015-12-31 2019-04-05 华为技术有限公司 Program transformation method, processor and computer system
CN105677436A (en) * 2015-12-31 2016-06-15 华为技术有限公司 Program transforming method, processor and computer system
CN108345503A (en) * 2018-01-18 2018-07-31 杭州电子科技大学 CPU-GPU based parallel toolpath planning method for B-spline surfaces
CN110134508A (en) * 2018-02-08 2019-08-16 北京连心医疗科技有限公司 Cloud-based Monte Carlo state machine system and architecture method
CN108510429A (en) * 2018-03-20 2018-09-07 华南师范大学 GPU-based parallel acceleration method for multivariate cryptographic algorithms
CN111858465A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale matrix QR decomposition parallel computing structure
CN111858465B (en) * 2020-06-29 2023-06-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale matrix QR decomposition parallel computing system
CN112000844A (en) * 2020-08-18 2020-11-27 中山大学 Vectorization method, system and device for bottom-up breadth-first search
CN115687233A (en) * 2021-07-29 2023-02-03 腾讯科技(深圳)有限公司 Communication method, device, equipment and computer readable storage medium
CN116755782A (en) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment, storage medium and program product for instruction scheduling
CN116755782B (en) * 2023-08-18 2023-10-20 腾讯科技(深圳)有限公司 Method, device, equipment, storage medium and program product for instruction scheduling

Similar Documents

Publication Publication Date Title
CN102214086A (en) General-purpose parallel acceleration algorithm based on multi-core processor
Kan et al. Improving water quantity simulation & forecasting to solve the energy-water-food nexus issue by using heterogeneous computing accelerated global optimization method
CN102902512B (en) Multi-threaded parallel processing method based on multi-thread programming and message queues
He et al. Comet: batched stream processing for data intensive distributed computing
CN102222092B (en) Massive high-dimension data clustering method for MapReduce platform
CN104598565B (en) K-means large-scale data clustering method based on stochastic gradient descent algorithm
CN107480694B (en) Weighting selection integration three-branch clustering method adopting two-time evaluation based on Spark platform
He et al. Probability density forecasting of wind power based on multi-core parallel quantile regression neural network
Acharya et al. A parallel and memory efficient algorithm for constructing the contour tree
CN105373517A (en) Spark-based distributed matrix inversion parallel operation method
Huo et al. An improved multi-cores parallel artificial Bee colony optimization algorithm for parameters calibration of hydrological model
CN112948123B (en) Spark-based grid hydrological model distributed computing method
CN106779219A (en) Electricity demand forecasting method and system
CN105808358A (en) Data dependency thread group mapping method for many-core system
CN104850866A (en) SoC-FPGA-based self-reconfiguring K-means clustering implementation method
Emeliyanov et al. GPU-based tracking algorithms for the ATLAS high-level trigger
CN103559148A (en) On-chip scratch-pad memory (SPM) management method for multitasking embedded systems
Liu et al. Regional-scale calculation of the LS factor using parallel processing
CN104573082A (en) Space small file data distribution storage method and system based on access log information
CN112784435B (en) GPU real-time power modeling method based on performance event counting and temperature
CN112766609A (en) Power consumption prediction method based on cloud computing
CN103870342B (en) Task core value calculating method based on node attribute function in cloud computing environment
Huang et al. A grid and density based fast spatial clustering algorithm
CN115270921A (en) Power load prediction method, system and storage medium based on combined prediction model
Geimer et al. Recent developments in the scalasca toolset

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
ASS Succession or assignment of patent right

Owner name: SHANGHAI REDNEURONS INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20111205

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20111205

Address after: 200433 No. 220 Handan Road, Shanghai

Applicant after: Fudan University

Co-applicant after: Shanghai RedNeurons Information Technology Co., Ltd.

Address before: 200433 No. 220 Handan Road, Shanghai

Applicant before: Fudan University

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20111012

RJ01 Rejection of invention patent application after publication