CN109409511A - Convolution operation dataflow scheduling method for dynamic reconfigurable array - Google Patents

Convolution operation dataflow scheduling method for dynamic reconfigurable array

Info

Publication number
CN109409511A
CN109409511A (application CN201811115052.8A; granted as CN109409511B)
Authority
CN
China
Prior art keywords
data
convolution
unit
image
mapped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811115052.8A
Other languages
Chinese (zh)
Other versions
CN109409511B (en)
Inventor
杨晨 (Yang Chen)
张海波 (Zhang Haibo)
王小力 (Wang Xiaoli)
耿莉 (Geng Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201811115052.8A priority Critical patent/CN109409511B/en
Publication of CN109409511A publication Critical patent/CN109409511A/en
Application granted granted Critical
Publication of CN109409511B publication Critical patent/CN109409511B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

A convolution operation dataflow scheduling method for a dynamic reconfigurable array. IRB schedules the weight data and image data, splits the matrix inner product into rows, and maps the rows onto different PE units for computation; the computed results are accumulated, and the accumulated sum is activated in the last-stage SPE. Once the activated data are output, scheduling is complete. Different rows of the weight data are fixed in different PE units; the image data are then mapped row by row to each PE unit and convolved with the weights. Intermediate data are buffered in the PE units and passed stage by stage to the next PE unit for accumulation, forming a pipeline that yields the convolution results. When computing a CNN, the IRB dataflow improves the reuse of input image data and weight data and reduces on-chip/off-chip data movement, which lowers the power consumption and latency of data transfer and improves both performance and energy efficiency.

Description

Convolution operation dataflow scheduling method for a dynamic reconfigurable array
Technical field
The present invention relates to a convolution operation dataflow scheduling method for dynamic reconfigurable arrays.
Background technique
Artificial intelligence is currently one of the most popular areas of computer science, and deep learning, as the principal way of realizing artificial intelligence, has developed rapidly. The convolutional neural network (Convolution Neural Network, CNN) is one of the most studied and most widely applied artificial neural network structures and has become a research hotspot in many scientific fields. In pattern classification in particular, CNNs avoid the complex image pre-processing formerly required and can take raw images directly as input, so they have found especially wide application. In recent years CNNs have achieved excellent results in computer vision, which in turn has driven their further development. The core of a neural network is computation: when a CNN is applied to computer vision, convolution kernels extract features from the image data, and the dominant operation is the convolution. In a typical CNN, convolution accounts for roughly 90% of all arithmetic operations. How to perform the convolution operations of a CNN efficiently is therefore a key problem in CNN accelerator design.
As the number of layers and neurons in CNNs grows, model computational complexity increases exponentially, and the training and inference speed of deep learning algorithms depends increasingly on the hardware computing platform. Hardware acceleration of deep learning currently takes three main forms (multi-core CPUs, GPUs, and FPGAs), all of which provide highly parallel computation. However, these implementations consume considerable power and suffer from low energy efficiency (performance per watt), so they cannot be deployed on intelligent mobile platforms such as smartphones, wearable devices, or autonomous vehicles. Against this background, the reconfigurable processor has proven to be a parallel computing architecture that combines high flexibility with high energy efficiency: it can select a suitable resource allocation strategy for models of different sizes, improving processing performance while broadening the application range of a dedicated processor. It is one of the paths forward where further development of multi-core CPU and FPGA technology is constrained, and a candidate scheme for future high-efficiency deep learning SoCs. Unlike a general-purpose processor, it can change not only the control flow but also, dynamically, the structure of the datapath, offering high performance, low hardware and power overhead, good flexibility, and good scalability; in processing speed, a reconfigurable processor approaches a dedicated chip. A reconfigurable computing array uses an array of processing elements (Processing Elements, PEs) to satisfy the differing demands of different applications. Future computing systems will generally need to be both versatile and high-performance, and the current trend is to integrate multiple reconfigurable computing arrays into a system to adaptively support different standards while meeting ever-increasing performance requirements.
When a CNN executes, the convolution kernel slides over the image to perform the convolution, a computation pattern that repeatedly reuses large amounts of data. Unlike computation on a GPU, hardware acceleration of a CNN cannot buffer all of the computation data on chip, so the dataflow of the convolution operation must be scheduled.
A CNN involves a large amount of computation, and a reconfigurable computing array can execute the algorithm in parallel. The weight data and image data of the CNN are partitioned and then mapped onto the corresponding computing units. Because hardware resources are limited, the CNN algorithm cannot be mapped onto the hardware architecture in its entirety, so the image data and weight data must be scheduled. During computation, large volumes of input data are used repeatedly, and many existing methods suffer from the following problems in their data scheduling:
1. Repeated data input. In a CNN, the convolution kernel slides over the input image to perform the convolution; when the sliding stride is smaller than the kernel size, each slide reuses part of the data from the previous convolution. These data can be re-read from outside the computing unit, but doing so causes the same data to be input repeatedly.
2. When CNN data are mapped onto hardware units, constraints of the hardware resource architecture itself can make the resulting pipeline inefficient.
Summary of the invention
The object of the present invention is to provide a convolution operation dataflow scheduling method for dynamic reconfigurable arrays.
To achieve the above object, the present invention adopts the following technical scheme that:
A convolution operation dataflow scheduling method for a dynamic reconfigurable array, characterized in that IRB schedules the weight data and image data, splits the matrix inner product into rows, and maps the rows onto different PE units for computation; the computed results are accumulated, the accumulated sum is activated in the last-stage SPE, and the activated data are output, completing the scheduling.
A further improvement of the present invention is that the method comprises the following steps:
Step 1: in the IRB dataflow, the convolution kernel data are mapped row by row onto the PE array, with one row of kernel data mapped onto each PE unit;
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units;
Step 3: the intermediate data produced by the convolution are passed to the next-stage PE unit until the last-stage PE unit is reached; the last-stage PE unit is an SPE, which applies the activation f(·) of formula (1) to the final accumulated result, the activation being performed by the ReLU module, and the activated data are the output;
O[z][u][y][x] = f( Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} I[z][k][U×y+i][U×x+j] × W[u][k][i][j] ),
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F   (1)
Wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network; z is the index of the input image (N input images in total), u is the index of the convolution kernel (M kernels in total), y is the row index of the output image and E the total number of output rows, x is the column index of the output image and F the total number of output columns, i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
A further improvement of the present invention lies in that detailed process is as follows for the first step: the size of convolution kernel is R row, is being mapped The convolution Nuclear Data of this R row is respectively mapped in R PE unit in the process, the weight data of mapping is stored in weight deposit In device.
A further improvement of the present invention lies in that detailed process is as follows for second step: image data has H row, is mapped to line by line Weight data on PE array, and in the PE unit that has been mapped into does multiply-accumulate operation, map and multiply accumulating be simultaneously into Capable;Image data is mapped in PE unit, is cached in image register, shift register is in caching image data It can be realized sliding sash function in convolution operation simultaneously, what each PE unit was calculated is row convolution results to get to R row Convolved data.
A further improvement of the present invention lies in that image register is shift register.
A further improvement of the present invention lies in that the result of obtained convolutional calculation is temporarily stored in the FIFO of PE unit, During next stage PE carries out convolutional calculation, the intermediate data of upper level PE convolutional calculation is transferred to next stage and carries out mediant According to cumulative;For the convolution kernel having a size of i, each convolution kernel needs i PE unit to be calculated;The size of convolution kernel size i It is 3,5,11, corresponding on PE array, the PE unit number needed is also i.
A further improvement of the present invention lies in that realizing IRB data flow on the PE array of 22*22.
A further improvement of the present invention lies in that using the convolution nuclear volume calculated every time as degree of parallelism measurement standard, convolution kernel When size is 3, array can simultaneously be calculated 22*7=154 convolution kernel;When convolution kernel size is 5, array can calculate simultaneously 22*4=88 convolution kernel calculates, and when convolution kernel size is 11, array calculates the convolution kernel of 22*2=44 simultaneously.
Compared with the prior art, the present invention has the following beneficial effects:
1. Based on dynamic reconfigurable technology, the proposed dataflow scheduling mechanism for CNN acceleration, designed together with the hardware, splits and maps the data to implement the CNN algorithm and schedules its convolution operations; the image is mapped row by row onto all PE units for convolution. Scheduling the image data by row-wise broadcast avoids the complex timing control otherwise required when mapping image data onto the PE array.
2. Different rows of the weight data are fixed in different PE units; the image data are then mapped row by row to each PE unit and convolved with the weights; intermediate data are buffered in the PE units and passed stage by stage to the next PE unit for accumulation, forming a pipeline that yields the convolution results. When computing a CNN, the IRB dataflow improves the reuse of input image data and weight data and reduces on-chip/off-chip data movement, which lowers the power consumption and latency of data transfer and improves both performance and efficiency.
Detailed description of the invention
Fig. 1 shows the computing architecture of the CNN accelerator.
Fig. 2 shows the PE unit structure.
Fig. 3 shows the convolution computation process.
Fig. 4 shows the convolution kernel mapped row by row onto the PE array.
Fig. 5 shows the image data broadcast row by row onto the PE array.
Fig. 6 shows the intermediate data accumulated stage by stage between PE units.
Fig. 7 shows the RS dataflow.
Fig. 8 shows the IRB dataflow.
Specific embodiment
The present invention will now be described in detail with reference to the accompanying drawings.
The present invention proposes a new dataflow scheduling mechanism for dynamic reconfigurable computing arrays, called Image Row Broadcast (IRB). IRB is a dataflow scheduling method, built on a reconfigurable computing hardware architecture, for accelerating the convolution operations of CNNs; it can accelerate network structures such as LeNet, AlexNet, and VGG.
The invention applies the IRB dataflow scheduling for CNN computation to the hardware architecture shown in Fig. 1. The dynamic reconfigurable computing array adapts to the different computation patterns of CNNs: the configuration module configures the PE array through configuration information; the FSM is the control module of the system; the reconfigurable PE array is the computing architecture of the whole system and the hardware on which IRB is realized; and two memory modules serve as intermediate buffers to ensure that the dataflow of the array computation is not interrupted by waiting for operand data.
The PE units designed for the computation characteristics of CNNs come in two structures, Normal PE (PE for short) and Special PE (SPE for short). As shown in Fig. 2, a PE comprises the following modules: an image register group (Picture Reg), a weight register group (Filter Reg), a multiplier, an accumulator (Acc), an adder, and a FIFO. An SPE adds to the PE the following modules: a multiplexer, a data branch switch, an adder, and a ReLU function module (ReLU). The specific parameters are as follows: the input data width of the weight and image register groups is 16 bits with a depth of 16; the multiplier input width is 16 bits; the adder input width is 32 bits; the FIFO data width is 32 bits with a depth of 64. The full PE array is 22×22 and covers the computation patterns of the kernel sizes 3, 5, and 11 used in the AlexNet network. By changing the interconnections between units and the internal register configuration, the PE array can satisfy these computation patterns. In addition, storage modules inside the PE units satisfy the data-buffering needs of IRB dataflow computation.
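For reference, the PE parameters listed above can be collected in one place; this is a descriptive sketch only, and the field names are illustrative rather than identifiers used by the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PEConfig:
    """PE-unit parameters from the description (field names are illustrative)."""
    reg_width_bits: int = 16   # input width of the weight/image register groups
    reg_depth: int = 16        # depth of the weight/image register groups
    mul_width_bits: int = 16   # multiplier input width
    add_width_bits: int = 32   # adder input width
    fifo_width_bits: int = 32  # FIFO data width
    fifo_depth: int = 64       # FIFO depth

ARRAY_SHAPE = (22, 22)         # full PE array; covers the 3/5/11 kernel modes
```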
The basic operation of a convolutional neural network is the convolution. As shown in Fig. 3, multiple kernels are convolved with multiple images: the kernel slides over the image, and the convolution produces new output image data. The computation is given by:
O[z][u][y][x] = f( Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} I[z][k][U×y+i][U×x+j] × W[u][k][i][j] ),
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F   (1)
Wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network. z is the index of the input image, of which there are N; u is the index of the convolution kernel, of which there are M. y is the row index of the output image and E the total number of output rows; x is the column index of the output image and F the total number of output columns. i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
As formula (1) shows, the convolution is a matrix inner product of the input image data and the weight data: corresponding elements are multiplied and the products are summed.
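To make formula (1) concrete, the following NumPy sketch evaluates it directly, assuming square images and kernels and taking f(·) to be ReLU; it is a software reference model of the arithmetic, not the hardware mapping:

```python
import numpy as np

def conv_formula_1(I: np.ndarray, W: np.ndarray, U: int = 1) -> np.ndarray:
    """Direct evaluation of formula (1).
    I: input images, shape (N, C, H, H); W: kernels, shape (M, C, R, R);
    U: stride. Returns O of shape (N, M, E, E) with E = (H - R) // U + 1."""
    N, C, H, _ = I.shape
    M, _, R, _ = W.shape
    E = (H - R) // U + 1
    O = np.zeros((N, M, E, E), dtype=np.float32)
    for z in range(N):                  # input image index
        for u in range(M):              # kernel index
            for y in range(E):          # output row
                for x in range(E):      # output column
                    window = I[z, :, U*y:U*y+R, U*x:U*x+R]
                    O[z, u, y, x] = np.sum(window * W[u])  # matrix inner product
    return np.maximum(O, 0.0)           # f() taken to be the ReLU activation
```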
In the convolution operation dataflow scheduling method of the present invention, IRB schedules the weight data and image data of the computation, splits the large matrix inner product into rows, and maps the rows onto different PE units for computation; accumulating the computed results realizes the summation inside the parentheses of formula (1). The accumulated sum is activated in the last-stage SPE unit, producing the output data. The method specifically comprises the following steps:
Step 1: the convolution kernel is mapped row by row onto the PE array, one row of kernel data per PE unit, as shown in Fig. 4. The detailed process is as follows:
In the IRB dataflow, the kernel data are first mapped row by row into the PE array, each PE unit holding one row of the kernel.
For example, the kernel in Fig. 3 has R rows, so in the mapping process these R rows of kernel data must be mapped to R separate PE units. Note that the first R−1 rows of the kernel are mapped into PEs and the last row into an SPE (an SPE can realize the function of a PE through its configuration information). The kernel is mapped into the PE units because, during the convolution, the kernel slides over the image, so the weight data are reused repeatedly and must be convolved with the entire image. The mapped weight data are therefore stored in the weight registers, from which they can be read continuously inside the PE during the convolution; this avoids repeated reads of the weight data from outside and improves computational efficiency.
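A schematic sketch of this mapping (illustrative data structures, not the patent's configuration format): the R kernel rows are latched into the weight registers of R chained units, with the last stage marked as the SPE:

```python
def map_kernel_rows(kernel_rows):
    """Step 1 sketch: one kernel row per PE unit; the last stage is the SPE.
    kernel_rows: list of R weight rows, each held in a Filter Reg."""
    chain = [{"filter_reg": row, "is_spe": False} for row in kernel_rows]
    chain[-1]["is_spe"] = True    # last kernel row maps to the SPE stage
    return chain
```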
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units, as shown in Fig. 5. The detailed process is as follows:
After the kernel has been mapped onto the PE array, the image data are broadcast row by row into the PE units. The image in Fig. 3 has H rows, which are mapped row by row onto the PE array and multiply-accumulated with the weight data already held in the PE units; mapping and multiply-accumulation proceed simultaneously. The image data mapped into a PE unit are buffered in the image register, which is designed as a shift register: while buffering the image data, the image shift register realizes the sliding-window function of the convolution, shifting by the stride U after each convolution so that correct results are obtained. Each PE unit computes one row-convolution result, yielding the convolution data of R rows.
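The row convolution performed by one PE can be modeled as below (a behavioral sketch, not the RTL; it assumes the shift-register window produces a result only at positions aligned to the stride U):

```python
from collections import deque

def pe_row_convolution(image_row, filter_reg, U=1):
    """Step 2 sketch: the weights stay fixed in the weight register while
    image pixels stream in; a depth-R shift register forms the sliding window."""
    R = len(filter_reg)
    window = deque(maxlen=R)              # models the image shift register
    row_result = []
    for idx, pixel in enumerate(image_row):
        window.append(pixel)              # one shift per incoming pixel
        if len(window) == R and (idx - R + 1) % U == 0:
            row_result.append(sum(w * p for w, p in zip(filter_reg, window)))
    return row_result
```

For a 7-element image row, a 3-element kernel row, and U = 1, this yields the expected 5 partial sums of one row convolution.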
Step 3: the intermediate data produced by the convolution are passed to the next-stage PE unit until the last-stage PE unit is reached; the last-stage PE unit is an SPE, which applies the activation f(·) of formula (1) to the final accumulated result, the activation being performed by the ReLU module, and the activated data are the output.
Note that an SPE can be configured as a PE; an SPE configured as a PE is regarded as a PE and is not used as the last-stage PE unit. In other words, intermediate stages are always PEs, and only the last stage can be an SPE. As shown in Fig. 6, the detailed process of this step is as follows:
The convolution results of Fig. 5 are buffered in the FIFOs of the PE units. While the next-stage PE performs its own convolution, the intermediate data of the previous-stage convolution are passed down and accumulated. The data each PE stage passes to the next stage are the element-wise accumulation of the row-convolution results computed by all stages up to and including this one: for a kernel of size i, each kernel requires i PE units, and the accumulated result is Σ Row_i. For the CNN structures accelerated by the present invention, the kernel size i may be 3, 5, or 11, and correspondingly the number of PE units required on the PE array is also i (3, 5, or 11). The image data are broadcast to all PE units. Because kernel sizes differ and hardware is limited, the achievable parallelism varies; the present invention realizes the IRB dataflow on a 22×22 PE array. Taking the number of kernels computed at once as the measure of parallelism: with kernel size 3 the array can compute 22 (rows) × 7 = 154 kernels simultaneously; with kernel size 5, 22 (rows) × 4 = 88 kernels; and with kernel size 11, 22×2 = 44 kernels. The last stage of the array computation is the SPE unit, which applies the activation f(·) of formula (1) to all the accumulated final results; the activation is performed by the ReLU module, and the activated data are the output.
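Combining the two previous sketches gives a behavioral model of this step (again illustrative: the real inter-stage transfer proceeds cycle by cycle through the FIFOs, whereas this model accumulates whole rows at once). It computes one output row from R consecutive image rows, reusing pe_row_convolution from the step 2 sketch:

```python
def pe_chain_output(image_rows, kernel_rows, U=1):
    """Step 3 sketch: stage r convolves image row r with kernel row r, adds
    the partial sums arriving from the previous stage, and the final SPE
    stage applies ReLU to the accumulated total."""
    partial = None
    for img_row, w_row in zip(image_rows, kernel_rows):    # R chained stages
        row_conv = pe_row_convolution(img_row, w_row, U)
        partial = row_conv if partial is None else \
            [a + b for a, b in zip(partial, row_conv)]     # inter-stage add
    return [max(v, 0) for v in partial]                    # SPE: ReLU output
```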
The following table compares the performance of the dataflow proposed by the invention with several other CNN accelerators.
Table 1. Performance comparison of the proposed dataflow with other CNN accelerators
As Table 1 shows, the method of the invention significantly improves both system performance and efficiency. When processing convolutional layers, the present invention achieves 97.4 GOPS on AlexNet, 90.75 GOPS on VGG, and 100.8 GOPS on LeNet-5. Compared with Virtex-7 VX485T, it achieves 1.59× the performance and 2.96× the efficiency on AlexNet. Compared with Zynq-7000, it improves LeNet performance by 47× and efficiency by 14.5×. Compared with Stratix-V GXA7, it delivers at least 2.9× the performance and 7× the efficiency. Against an Intel Xeon E5-2620 CPU, it is 6.6× faster and 52× more efficient.
Comparison between the IRB dataflow and the RS (Row Stationary) dataflow proposed by Eyeriss:
Take M convolution kernels of size 3×3×C applied to an image of size 7×7×C as an example, where C is the number of channels, and let the PE array sub-block size be 3×3. Fig. 7 shows the pipeline timing of the RS dataflow, in which a PE array sub-block completes the mapping of one channel at a time. Fig. 8 shows the convolution computation with the IRB dataflow, which can complete the images of three channels in parallel on the PE array.
T1 denotes the cycles needed to map one image row from memory onto the PE array, and T2 the row-convolution cycles for one image row in each PE. With an image size of 7×7 and a kernel size of 3×3, T1 = 7 and T2 = 3 × (7 − 2) = 15. The average time needed to compute one channel of the image with the RS dataflow is:
T_RS = (T1 × 5 + (T1 + 1) × 2 + 15) × C × M = 66 × C × M   (2)
The average time needed to compute one channel with the proposed IRB dataflow is:
T_IRB = (T1 + T2 × 7) × C × M / 3 ≈ 37 × C × M   (3)
Note that the division by 3 in equation (3) is due to the parallelism of 3. That is, although the computation process of IRB is longer than that of RS, IRB generates the images of three channels in parallel, whereas RS can only compute and generate a single channel image at a time; IRB therefore provides higher parallelism than RS. In this example, the results show that the IRB dataflow improves performance by about 44% compared with RS.
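The two timing expressions can be checked numerically; the sketch below simply evaluates equations (2) and (3) for the worked example (T1 = 7, T2 = 15):

```python
T1 = 7                 # cycles to map one image row from memory
T2 = 3 * (7 - 2)       # = 15, row-convolution cycles per image row

def t_rs(C, M):        # equation (2): RS maps one channel at a time
    return (T1 * 5 + (T1 + 1) * 2 + 15) * C * M      # = 66 * C * M

def t_irb(C, M):       # equation (3): IRB computes 3 channels in parallel
    return (T1 + T2 * 7) * C * M / 3                 # ~= 37.3 * C * M

print(t_rs(1, 1), t_irb(1, 1))   # 66 vs ~37.3 cycles per kernel-channel
# with the rounded 37*C*M of equation (3): (66 - 37) / 66 ~= 44% improvement
```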

Claims (8)

1. A convolution operation dataflow scheduling method for a dynamic reconfigurable array, characterized in that IRB schedules the weight data and image data, splits the matrix inner product into rows, and maps the rows onto different PE units for computation; the computed results are accumulated, the accumulated sum is activated in the last-stage SPE, and the activated data are output, completing the scheduling.
2. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized by comprising the following steps:
Step 1: in the IRB dataflow, the convolution kernel data are mapped row by row onto the PE array, with one row of kernel data mapped onto each PE unit;
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units;
Step 3: the intermediate data produced by the convolution are passed to the next-stage PE unit until the last-stage PE unit is reached; the last-stage PE unit is an SPE, which applies the activation f(·) of formula (1) to the final accumulated result, the activation being performed by the ReLU module, and the activated data are the output;
O[z][u][y][x] = f( Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} I[z][k][U×y+i][U×x+j] × W[u][k][i][j] ),
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F   (1)
wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network; z is the index of the input image (N input images in total), u is the index of the convolution kernel (M kernels in total), y is the row index of the output image and E the total number of output rows, x is the column index of the output image and F the total number of output columns, i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
3. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that step 1 proceeds as follows: the convolution kernel has R rows; in the mapping process these R rows of kernel data are mapped to R PE units, and the mapped weight data are stored in the weight registers.
4. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that step 2 proceeds as follows: the image data have H rows and are mapped row by row onto the PE array, performing multiply-accumulate operations with the weight data already mapped into the PE units; mapping and multiply-accumulation proceed simultaneously; the image data mapped into a PE unit are buffered in the image register, and the image shift register realizes the sliding-window function of the convolution while buffering the image data; each PE unit computes one row-convolution result, yielding the convolution data of R rows.
5. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 4, characterized in that the image register is a shift register.
6. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the result of the convolution is buffered in the FIFO of the PE unit; while the next-stage PE performs its convolution, the intermediate data of the previous-stage convolution are passed to the next stage for accumulation; for a kernel of size i, each kernel requires i PE units; the kernel size i may be 3, 5, or 11, and correspondingly the number of PE units required on the PE array is also i.
7. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the IRB dataflow is realized on a 22×22 PE array.
8. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 7, characterized in that, taking the number of kernels computed at once as the measure of parallelism: with kernel size 3 the array can compute 22×7 = 154 kernels simultaneously; with kernel size 5 the array can compute 22×4 = 88 kernels simultaneously; and with kernel size 11 the array computes 22×2 = 44 kernels simultaneously.
CN201811115052.8A 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array Active CN109409511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811115052.8A CN109409511B (en) 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811115052.8A CN109409511B (en) 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array

Publications (2)

Publication Number Publication Date
CN109409511A true CN109409511A (en) 2019-03-01
CN109409511B CN109409511B (en) 2020-07-28

Family

ID=65465836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811115052.8A Active CN109409511B (en) 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array

Country Status (1)

Country Link
CN (1) CN109409511B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110135554A (en) * 2019-03-25 2019-08-16 电子科技大学 A kind of hardware-accelerated framework of convolutional neural networks based on FPGA
CN110163409A (en) * 2019-04-08 2019-08-23 华中科技大学 A kind of convolutional neural networks dispatching method applied to displacement Flow Shop
CN110222818A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110288078A (en) * 2019-05-19 2019-09-27 南京惟心光电***有限公司 A kind of accelerator and its method for GoogLeNet model
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN111931911A (en) * 2020-07-30 2020-11-13 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN112132275A (en) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and device
CN112540946A (en) * 2020-12-18 2021-03-23 清华大学 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
CN113313251A (en) * 2021-05-13 2021-08-27 中国科学院计算技术研究所 Deep separable convolution fusion method and system based on data stream architecture
CN113469326A (en) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 Integrated circuit device and board card for executing pruning optimization in neural network model
US11200092B2 (en) * 2018-03-27 2021-12-14 Tencent Technology (Shenzhen) Company Limited Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王晶波 (Wang Jingbo): "Research on Scheduling Techniques for Image Processing Operators for Reconfigurable Processors", China Masters' Theses Full-text Database (Electronic Journal) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11200092B2 (en) * 2018-03-27 2021-12-14 Tencent Technology (Shenzhen) Company Limited Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium
CN110135554A (en) * 2019-03-25 2019-08-16 电子科技大学 A kind of hardware-accelerated framework of convolutional neural networks based on FPGA
CN110163409B (en) * 2019-04-08 2021-05-18 华中科技大学 Convolutional neural network scheduling method applied to replacement flow shop
CN110163409A (en) * 2019-04-08 2019-08-23 华中科技大学 A kind of convolutional neural networks dispatching method applied to displacement Flow Shop
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110222818A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110288078A (en) * 2019-05-19 2019-09-27 南京惟心光电***有限公司 A kind of accelerator and its method for GoogLeNet model
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN110796245B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN111931911B (en) * 2020-07-30 2022-07-08 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN111931911A (en) * 2020-07-30 2020-11-13 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN112132275A (en) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and device
CN112540946A (en) * 2020-12-18 2021-03-23 清华大学 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
CN113313251A (en) * 2021-05-13 2021-08-27 中国科学院计算技术研究所 Deep separable convolution fusion method and system based on data stream architecture
CN113469326A (en) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 Integrated circuit device and board card for executing pruning optimization in neural network model
CN113469326B (en) * 2021-06-24 2024-04-02 上海寒武纪信息科技有限公司 Integrated circuit device and board for executing pruning optimization in neural network model

Also Published As

Publication number Publication date
CN109409511B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN109409511A (en) A kind of convolution algorithm data stream scheduling method for dynamic reconfigurable array
US11775802B2 (en) Neural processor
US20230334006A1 (en) Compute near memory convolution accelerator
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
Ma et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN110516801A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN108805266A (en) A kind of restructural CNN high concurrents convolution accelerator
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
CN110222818A (en) A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110580519B (en) Convolution operation device and method thereof
Stevens et al. Manna: An accelerator for memory-augmented neural networks
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
Li et al. Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN112200310A (en) Intelligent processor, data processing method and storage medium
Jiang et al. Hardware implementation of depthwise separable convolution neural network
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant