CN109409511A - A kind of convolution algorithm data stream scheduling method for dynamic reconfigurable array - Google Patents
- Publication number: CN109409511A (application CN201811115052.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- convolution
- unit
- image
- mapped
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
- Image Analysis (AREA)
Abstract
A convolution operation data flow scheduling method for a dynamic reconfigurable array. The IRB (Image Row Broadcast) scheme schedules the weight data and the image data by splitting the matrix inner product into rows, which are mapped onto different PE units for computation; the computed results are accumulated, and the accumulated sum is activated in the last-stage SPE. After the activated data are output, scheduling is complete. Different rows of the weight data are held fixed in different PE units; the image data are then mapped row by row to each PE unit and convolved with the weights. Intermediate data are buffered inside the PE units and passed stage by stage to the next PE unit for accumulation, forming a pipeline that yields the convolution result. When computing a CNN network, the IRB data flow improves the reuse of input image data and weight data and reduces on-chip/off-chip data traffic, which lowers the power and time cost of data movement and improves performance and efficiency.
Description
Technical field
The present invention relates to a convolution operation data flow scheduling method for a dynamic reconfigurable array.
Background technique
Artificial intelligence is one of the most active areas of computer science today, and deep learning, as the main route to realizing artificial intelligence, has developed rapidly. The convolutional neural network (Convolutional Neural Network, CNN) is one of the most widely used structures to emerge from artificial neural network research and has become a research hotspot in many scientific fields. In pattern classification in particular, CNN avoids the complex image pre-processing stage and can take the original image directly as input, so it has found broad application. In recent years convolutional neural networks have achieved excellent results in computer vision, which has in turn driven their further development. The core of a neural network is computation; when a CNN is applied to computer vision, convolution kernels extract features from the image data, so the dominant operation is the convolution. In general, convolution accounts for roughly 90% of the total arithmetic operations of a CNN network. How to complete the convolution operations of a CNN network efficiently is therefore a key problem in CNN accelerator design.
As the number of CNN layers and neurons grows, the computational complexity of the model increases exponentially, and the training and inference speed of deep learning algorithms depends increasingly on the hardware computing platform. Hardware acceleration of deep learning currently has three common implementations, multi-core CPU, GPU and FPGA, whose common feature is highly parallel computation. However, these hardware implementations consume considerable power and have low energy efficiency (performance per watt), so they cannot be deployed on intelligent mobile terminals such as smart phones, wearable devices or autonomous vehicles. In this context, the reconfigurable processor has proven to be a form of parallel computing architecture that combines high flexibility with high energy efficiency. Its advantage is that a suitable resource allocation strategy can be chosen for different model sizes, widening the applicable range of the special-purpose processor while improving processing performance; it is one of the solution routes around the scaling limits of multi-core CPU and FPGA technology, and a candidate scheme for realizing high-efficiency deep learning SoCs in the future. The difference from a general-purpose processor is that it can change not only the control flow but also, dynamically, the structure of the data path, giving the advantages of high performance, low hardware overhead and power consumption, good flexibility and good scalability; at the same time, in processing speed, the performance of a reconfigurable processor approaches that of a dedicated fixed-function chip. A reconfigurable computing array uses an array of multiple processing elements (Processing Elements, PEs) to meet the different demands of different applications. Future computing systems will generally need to be both versatile and high-performance; the current trend is to integrate multiple reconfigurable computing arrays into a system to adaptively support different standards while meeting ever-increasing performance requirements.
When a CNN is computed, the convolution kernel slides over the image to perform the convolution, a computation pattern that involves a large amount of repeated data. Unlike computation on a GPU, hardware acceleration of a CNN cannot buffer all of the computed data on chip, so the data flow of the convolution operation must be scheduled.
A CNN involves a large amount of computation, and a reconfigurable computing array can execute the algorithm in parallel: the weight data and image data of the CNN network are partitioned and then mapped onto the corresponding computing units. Because hardware resources are limited, the CNN cannot be mapped onto the hardware architecture in its entirety, so the image data and weight data must be scheduled. During computation, a large amount of input data is used repeatedly, and many existing methods suffer from the following problems in their data scheduling:
1. Repeated data input. In a CNN, the convolution kernel slides over the input image to perform the convolution. When the sliding stride is smaller than the kernel size, every sliding step reuses part of the data of the previous convolution. These data can be re-read from outside the computing unit, but doing so causes the same data to be input repeatedly.
2. When the CNN data are mapped onto the hardware units, the constraints of the hardware architecture itself can make the designed pipeline inefficient.
Summary of the invention
The object of the present invention is to provide a convolution operation data flow scheduling method for a dynamic reconfigurable array.
To achieve the above object, the present invention adopts the following technical scheme that:
A convolution operation data flow scheduling method for a dynamic reconfigurable array, characterized in that IRB schedules the weight data and the image data, splits the matrix inner product into rows, maps the rows onto different PE units for computation, accumulates the computed results, accumulates and activates the sum in the last-stage SPE, and outputs the activated data, completing the scheduling.
A further improvement of the present invention is that the method comprises the following steps:
Step 1: in the IRB data flow, the data of the convolution kernel are mapped row by row onto the PE array, each PE unit being mapped with one row of kernel data;
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units;
Step 3: the intermediate data obtained by the convolution are transferred to the next-stage PE unit until they reach the last-stage PE unit, which is an SPE; the SPE applies the f(·) function of formula (1) to the accumulated final result, the activation operation being completed by the ReLU module, and the activated data are the output data;
O[z][u][y][x] = f( ∑_k ∑_i ∑_j I[z][k][U·y+i][U·x+j] × W[u][k][i][j] )    (1)
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F
wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network; z is the index of the input image (N input images are given), u is the index of the convolution kernel (there are M kernels), y is the row index of the output image and E the total number of output rows, x is the column index of the output image and F the total number of output columns, i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
A further improvement of the present invention is that the specific process of step 1 is as follows: the convolution kernel has R rows; during mapping, these R rows of kernel data are mapped onto R PE units respectively, and the mapped weight data are stored in the weight registers.
A further improvement of the present invention is that the specific process of step 2 is as follows: the image data have H rows and are mapped row by row onto the PE array, where they are multiply-accumulated with the weight data in the PE units already mapped; mapping and multiply-accumulation proceed simultaneously. The image data mapped into a PE unit are buffered in the image register; as a shift register, it both buffers the image data and realizes the sliding-window function of the convolution. Each PE unit computes one row-convolution result, yielding R rows of convolution data.
A further improvement of the present invention is that the image register is a shift register.
A further improvement of the present invention is that the result of the convolution computation is buffered in the FIFO of the PE unit; while the next-stage PE performs its convolution, the intermediate data of the previous-stage PE are transferred to the next stage and accumulated. For a kernel of size i, each kernel requires i PE units; the kernel size i is 3, 5 or 11, and correspondingly the number of PE units required on the array is also i.
A further improvement of the present invention is that the IRB data flow is realized on a 22*22 PE array.
A further improvement of the present invention is that, taking the number of kernels computed at once as the measure of parallelism: when the kernel size is 3, the array computes 22*7 = 154 kernels simultaneously; when the kernel size is 5, the array computes 22*4 = 88 kernels simultaneously; and when the kernel size is 11, the array computes 22*2 = 44 kernels simultaneously.
Compared with the prior art, the invention has the following benefits:
1. Based on dynamic reconfiguration and combined with the proposed hardware data flow scheduling mechanism for CNN acceleration, the convolution operations of the CNN are split, mapped and scheduled; the image is mapped row by row onto all PE units for convolution, and scheduling the image data in a row-by-row broadcast avoids the complex timing control otherwise required when mapping image data onto the PE array.
2. Different rows of the weight data are held fixed in different PE units, and the image data are then mapped row by row to each PE unit and convolved with the weights; intermediate data are buffered in the PE units and passed stage by stage to the next PE unit for accumulation, forming a pipeline that yields the convolution result. When computing a CNN network, the IRB data flow improves the reuse of input image data and weight data and reduces on-chip/off-chip data traffic, lowering the power and time cost of data movement and improving performance and efficiency.
Brief description of the drawings
Fig. 1 is the computing architecture of the CNN accelerator.
Fig. 2 is the structure of a PE unit.
Fig. 3 is the convolution computation process.
Fig. 4 shows the convolution kernel mapped row by row onto the PE array.
Fig. 5 shows the image data broadcast row by row onto the PE array.
Fig. 6 shows the row-by-row accumulation of intermediate data between PE units.
Fig. 7 is the RS data flow.
Fig. 8 is the IRB data flow.
Specific embodiment
The present invention will now be described in detail with reference to the accompanying drawings.
The present invention proposes a new data flow scheduling mechanism for dynamic reconfigurable computing arrays, called the Image Row Broadcast (IRB) data flow scheduling mechanism. IRB is a data flow scheduling method, built on a reconfigurable computing hardware architecture, for accelerating the convolution operations of CNN networks; it can accelerate multiple network structures such as LeNet, AlexNet and VGG.
The IRB data flow scheduling proposed by the invention for CNN computation is applied to the hardware architecture shown in Fig. 1. The dynamically reconfigurable computing array adapts to the different computation patterns of a CNN: the configuration module configures the PE array through configuration information; the FSM is the control module of the system; the reconfigurable PE array is the computing architecture of the whole system and the hardware on which IRB is realized; and two memory modules serve as intermediate buffers that keep the data flow of the array computation from being stalled by waits for operand data.
The PE units designed by the invention for the computation characteristics of CNN networks come in two structures: Normal PE (PE for short) and Special PE (SPE for short). As shown in Fig. 2, a PE contains the following modules: an image register group (Picture Reg), a weight register group (Filter Reg), a multiplier, an accumulator (Acc), an adder and a FIFO. On top of the PE, the SPE adds the following modules: a multiplexer, a data branch switch, an adder and a ReLU function module (ReLU). The specific parameters are as follows: the input data width of the weight and image register groups is 16 bits, with a depth of 16; the multiplier input width is 16 bits; the adder input width is 32 bits; the FIFO data width is 32 bits, with a depth of 64. The whole PE array is 22*22, covering the computation patterns of the kernel sizes 3, 5 and 11 used in the AlexNet network. The PE array can satisfy these patterns by changing the interconnection between units and the internal register configuration. In addition, a storage unit module is added inside the PE unit to satisfy the data storage needs of the IRB data flow.
The basic operation of a convolutional neural network is the convolution. As shown in Fig. 3, multiple convolution kernels are convolved with multiple images: the kernel slides over the image, computes the convolution, and outputs new image data. The calculation formula is as follows:
O[z][u][y][x] = f( ∑_k ∑_i ∑_j I[z][k][U·y+i][U·x+j] × W[u][k][i][j] )    (1)
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F
wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network. z is the index of the input image; N input images are given. u is the index of the convolution kernel; there are M kernels. y is the row index of the output image and E the total number of output rows. x is the column index of the output image and F the total number of output columns. i and j are the row and column indices of the convolution kernel, and k is the channel index. U is the stride by which the kernel slides after each convolution.
From formula (1) it can be seen that the convolution is a matrix inner product of the input image data and the weight data: the products of corresponding points are summed.
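As an illustrative aid (not part of the patent), formula (1) can be sketched directly in plain Python. The use of ReLU for f(·) follows the SPE description later in the document; the square-kernel assumption and all function and variable names are the sketch's own:

```python
def conv_layer(I, W, U=1):
    """Direct convolution per formula (1):
    O[z][u][y][x] = f(sum over k, i, j of I[z][k][U*y+i][U*x+j] * W[u][k][i][j])."""
    N, C = len(I), len(I[0])            # images, channels
    H, Wd = len(I[0][0]), len(I[0][0][0])
    M = len(W)                          # number of kernels
    R, S = len(W[0][0]), len(W[0][0][0])
    E = (H - R) // U + 1                # output rows
    F = (Wd - S) // U + 1               # output columns
    relu = lambda v: v if v > 0 else 0  # f(): assumed ReLU, as in the SPE
    O = [[[[0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for z in range(N):
        for u in range(M):
            for y in range(E):
                for x in range(F):
                    acc = 0
                    for k in range(C):
                        for i in range(R):
                            for j in range(S):
                                acc += I[z][k][U * y + i][U * x + j] * W[u][k][i][j]
                    O[z][u][y][x] = relu(acc)
    return O
```

The IRB scheduling described below splits exactly this reduction: the i-loop (kernel rows) is distributed across PE units, and the inner products over j are the per-PE row convolutions.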
In the convolution operation data flow scheduling method of the invention, IRB schedules the weight data and the image data of the computation, splits the large matrix inner product into rows, and maps the rows onto different PE units for computation; accumulating the computed results corresponds to the bracketed sum in formula (1). The accumulated sum is activated in the last-stage SPE unit and becomes the output data. The method specifically comprises the following steps:
Step 1: the convolution kernel is mapped row by row onto the PE array, one row of kernel data per PE unit, as shown in Fig. 4. The specific process is as follows:
In the IRB data flow, the kernel data are first mapped row by row into the PE array, each PE unit mapping one row of the kernel. If, as in Fig. 3, the kernel has R rows, these R rows of kernel data must be mapped onto R PE units respectively in the mapping process. Note that the first R−1 rows of the kernel are mapped into PEs (an SPE can realize the function of a PE through its configuration information), and the last row of the kernel is mapped into the SPE. Because the kernel computes by sliding over the image, the weight data are reused continuously throughout the process and must be convolved with the whole image; the mapped weight data are therefore stored in the weight registers, from which they can be read repeatedly inside the PE during the convolution. This avoids re-reading the weight data and improves computational efficiency.
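The row-wise weight mapping of step 1 can be sketched as follows; this is a hypothetical helper, not the patent's implementation, and the unit names `PE0`, `PE1`, … and `SPE` are illustrative:

```python
def map_kernel_rows(kernel):
    """Step 1 sketch: an R-row kernel is split row-wise; row r is held in the
    weight register of PE unit r, and the last row goes to the SPE stage."""
    R = len(kernel)
    pe_weight_regs = {}
    for r in range(R):
        unit = "SPE" if r == R - 1 else f"PE{r}"
        pe_weight_regs[unit] = list(kernel[r])  # stays resident: weight reuse
    return pe_weight_regs
```

Once loaded, these registers are read repeatedly during the whole convolution, which is the weight-reuse property the text describes.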
Step 2: the image data are broadcast row by row onto the entire PE array and the convolution is computed in the PE units, as shown in Fig. 5. The specific process is as follows:
After the kernel is mapped onto the PE array, the image data begin to be broadcast row by row into the PE units. The image in Fig. 3 has H rows; they are mapped row by row onto the PE array and multiply-accumulated with the weight data in the PE units already mapped, the mapping and the multiply-accumulation proceeding simultaneously. The image data mapped into a PE unit are buffered in the image register, which is designed as a shift register: while buffering the image data it realizes the sliding-window function of the convolution, producing a shift effect during the computation. After each convolution, the window is shifted by the stride U to obtain the correct result. Each PE unit computes one row-convolution result, yielding R rows of convolution data.
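A minimal sketch of the per-PE row convolution of step 2, modelling the shift-register window; the function and parameter names are assumptions of the sketch:

```python
def pe_row_convolution(image_row, weight_row, U=1):
    """Step 2 sketch: one PE multiply-accumulates its resident weight row
    against a sliding window of the broadcast image row; advancing the
    window by stride U models the shift-register behaviour."""
    R = len(weight_row)
    out = []
    for x in range(0, len(image_row) - R + 1, U):
        window = image_row[x:x + R]  # current shift-register contents
        out.append(sum(a * b for a, b in zip(window, weight_row)))
    return out
```

Each PE produces one such row result (Row_i); step 3 then accumulates these across the pipeline stages.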
Step 3: the intermediate data obtained by the convolution are transferred to the next-stage PE unit until they reach the last-stage PE unit, which is an SPE; the SPE applies the f(·) function of formula (1) to the accumulated final result, the activation operation being completed by the ReLU module, and the activated data are the output data.
Note that an SPE can be configured as a PE; an SPE so configured is regarded as a PE and is not used as a last-stage PE unit. In other words, intermediate stages are always PEs, and only the last stage can be an SPE. As shown in Fig. 6, the specific process of this step is as follows:
The row-convolution results of Fig. 5 are buffered in the FIFOs of the PE units. While the next-stage PE performs its convolution, the intermediate data of the previous-stage PE are transferred to the next stage and accumulated. The data each PE stage passes to the next stage are the accumulation of its own row-convolution result with those of all preceding stages; for a kernel of size i, each kernel requires i PE units, i.e. the accumulated result is ∑ Row_i. For the CNN structures accelerated by the invention, the kernel size i can be 3, 5 or 11, and correspondingly the number of PE units required on the array is also i, i.e. 3, 5 or 11. The image data are broadcast to all PE units; because the kernel sizes differ and the hardware is limited, the degree of parallelism of the computation varies. The invention realizes the IRB data flow on a 22*22 PE array. Taking the number of kernels computed at once as the measure of parallelism: when the kernel size is 3, the array computes 22 (rows) * 7 = 154 kernels simultaneously; when the kernel size is 5, the array computes 22 (rows) * 4 = 88 kernels simultaneously; and when the kernel size is 11, the array computes 22 * 2 = 44 kernels simultaneously. The last stage of the array computation is the SPE unit, which applies the f(·) function of formula (1) to all accumulated final results; the activation operation is completed by the ReLU module, and the activated data are the output data.
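The stage-by-stage accumulation and the parallelism figures above can be sketched as follows (illustrative helpers, assuming each stage's row result is already aligned with the partial sums it receives):

```python
def pipeline_accumulate(row_results):
    """Step 3 sketch: each PE stage adds its row-convolution result to the
    partial sums handed on by the previous stage; the final stage holds
    the accumulated sum over Row_i before activation in the SPE."""
    acc = [0] * len(row_results[0])
    for row in row_results:  # one FIFO hand-off per pipeline stage
        acc = [a + r for a, r in zip(acc, row)]
    return acc

def parallel_kernels(kernel_size, rows=22, cols=22):
    """Kernels computed at once on the 22*22 array: each kernel occupies
    kernel_size consecutive units along one dimension, replicated across
    all 22 units of the other dimension."""
    return rows * (cols // kernel_size)
```

With this model, `parallel_kernels` reproduces the 154 / 88 / 44 figures for kernel sizes 3, 5 and 11.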
The table below compares the data flow proposed by the invention with several other CNN accelerators.
Table 1. Performance comparison of the proposed data flow with other CNN accelerators
As can be seen from Table 1, with the method of the invention both the performance and the efficiency of the system improve significantly. When processing convolutional layers, the invention achieves 97.4 GOPS on AlexNet, 90.75 GOPS on VGG and 100.8 GOPS on LeNet-5. Compared with the Virtex7 VX485T, a 1.59x performance gain and a 2.96x efficiency gain can be achieved on AlexNet. Compared with the Zynq-7000, the invention improves LeNet performance by 47x and efficiency by 14.5x. Compared with the Stratix-V GXA7, the invention shows at least a 2.9x performance and 7x efficiency improvement. Against an Intel Xeon E5-2620 CPU, the invention is 6.6x faster and achieves a 52x efficiency gain.
Comparison of the IRB data flow with the RS (Row Stationary) data flow proposed by Eyeriss:
Take M kernels of size 3 × 3 × C convolved with an image of size 7 × 7 × C, where C is the number of channels and the PE array sub-block is 3 × 3. Fig. 7 shows the pipeline timing of the RS data flow, which completes the mapping of one channel per PE array sub-block at a time. Fig. 8 shows the convolution using the IRB data flow, which can complete three channels of the image in parallel on the PE array.
T1 denotes the cycles needed to map one image row from memory to the PE array, and T2 the cycles for the row convolution of one image row in each PE. With an image size of 7 × 7 and a kernel size of 3 × 3, T1 = 7 and T2 = 3 × (7 − 2) = 15. The average time needed to compute one channel of the image with the RS data flow is:
T_RS = (T1 × 5 + (T1 + 1) × 2 + 15) × C × M = 66 × C × M    (2)
The average time needed to compute one channel with the proposed IRB data flow is:
T_IRB = (T1 + T2 × 7) × C × M / 3 ≈ 37 × C × M    (3)
Note that the division by 3 in equation (3) comes from the degree of parallelism. That is, although one IRB pass takes longer than one RS pass, IRB produces three channels of the image in parallel, whereas RS can only compute and produce a single channel at a time; IRB therefore offers higher parallelism than RS. In this example, the results show that, compared with RS, the IRB data flow improves performance by about 44%.
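The cycle counts of equations (2) and (3) can be reproduced numerically; this is a sketch in which the common factors C × M are dropped, since they cancel in the comparison:

```python
T1, T2 = 7, 15                      # row-load cycles, row-convolution cycles
t_rs  = T1 * 5 + (T1 + 1) * 2 + T2  # equation (2): RS cycles per channel, = 66
t_irb = (T1 + T2 * 7) / 3           # equation (3): IRB, 3 channels in parallel
print(t_rs)                         # 66
print(round(t_irb))                 # 37
print(round(1 - t_irb / t_rs, 2))   # fraction of time saved by IRB over RS
```

The last ratio is the source of the roughly 44% improvement quoted in the text.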
Claims (8)
1. A convolution operation data flow scheduling method for a dynamic reconfigurable array, characterized in that IRB schedules the weight data and the image data, splits the matrix inner product into rows, maps the rows onto different PE units for computation, accumulates the computed results, accumulates and activates the sum in the last-stage SPE, and outputs the activated data, completing the scheduling.
2. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 1, characterized by comprising the following steps:
Step 1: in the IRB data flow, the data of the convolution kernel are mapped row by row onto the PE array, each PE unit being mapped with one row of kernel data;
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units;
Step 3: the intermediate data obtained by the convolution are transferred to the next-stage PE unit until they reach the last-stage PE unit, which is an SPE; the SPE applies the f(·) function of formula (1) to the accumulated final result, the activation operation being completed by the ReLU module, and the activated data are the output data;
O[z][u][y][x] = f( ∑_k ∑_i ∑_j I[z][k][U·y+i][U·x+j] × W[u][k][i][j] )    (1)
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F
wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network; z is the index of the input image (N input images are given), u is the index of the convolution kernel (there are M kernels), y is the row index of the output image and E the total number of output rows, x is the column index of the output image and F the total number of output columns, i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
3. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the specific process of step 1 is as follows: the convolution kernel has R rows; during mapping, these R rows of kernel data are mapped onto R PE units respectively, and the mapped weight data are stored in the weight registers.
4. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the specific process of step 2 is as follows: the image data have H rows and are mapped row by row onto the PE array, where they are multiply-accumulated with the weight data in the PE units already mapped, the mapping and the multiply-accumulation proceeding simultaneously; the image data mapped into a PE unit are buffered in the image register, which, while buffering the image data, realizes the sliding-window function of the convolution; each PE unit computes one row-convolution result, yielding R rows of convolution data.
5. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 4, characterized in that the image register is a shift register.
6. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the result of the convolution computation is buffered in the FIFO of the PE unit; while the next-stage PE performs its convolution, the intermediate data of the previous-stage PE are transferred to the next stage and accumulated; for a kernel of size i, each kernel requires i PE units; the kernel size i is 3, 5 or 11, and correspondingly the number of PE units required on the array is also i.
7. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the IRB data flow is realized on a 22*22 PE array.
8. The convolution operation data flow scheduling method for a dynamic reconfigurable array according to claim 7, characterized in that, taking the number of kernels computed at once as the measure of parallelism: when the kernel size is 3, the array computes 22*7 = 154 kernels simultaneously; when the kernel size is 5, the array computes 22*4 = 88 kernels simultaneously; and when the kernel size is 11, the array computes 22*2 = 44 kernels simultaneously.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811115052.8A CN109409511B (en) | 2018-09-25 | 2018-09-25 | Convolution operation data flow scheduling method for dynamic reconfigurable array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409511A true CN109409511A (en) | 2019-03-01 |
CN109409511B CN109409511B (en) | 2020-07-28 |
Family
ID=65465836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811115052.8A Active CN109409511B (en) | 2018-09-25 | 2018-09-25 | Convolution operation data flow scheduling method for dynamic reconfigurable array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409511B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Method, system and device for implementing a convolutional neural network based on FPGA with row-output priority |
CN110135554A (en) * | 2019-03-25 | 2019-08-16 | 电子科技大学 | An FPGA-based hardware acceleration architecture for convolutional neural networks |
CN110163409A (en) * | 2019-04-08 | 2019-08-23 | 华中科技大学 | A convolutional neural network scheduling method applied to the permutation flow shop |
CN110222818A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A multi-bank row-column interleaved read/write method for convolutional neural network data storage |
CN110288078A (en) * | 2019-05-19 | 2019-09-27 | 南京惟心光电***有限公司 | An accelerator for the GoogLeNet model and method thereof |
CN110516801A (en) * | 2019-08-05 | 2019-11-29 | 西安交通大学 | A high-throughput dynamically reconfigurable convolutional neural network accelerator architecture |
CN110796245A (en) * | 2019-10-25 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN111931911A (en) * | 2020-07-30 | 2020-11-13 | 山东云海国创云计算装备产业创新中心有限公司 | CNN accelerator configuration method, system and device |
CN112132275A (en) * | 2020-09-30 | 2020-12-25 | 南京风兴科技有限公司 | Parallel computing method and device |
CN112540946A (en) * | 2020-12-18 | 2021-03-23 | 清华大学 | Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor |
CN113313251A (en) * | 2021-05-13 | 2021-08-27 | 中国科学院计算技术研究所 | Depthwise separable convolution fusion method and system based on a dataflow architecture |
CN113469326A (en) * | 2021-06-24 | 2021-10-01 | 上海寒武纪信息科技有限公司 | Integrated circuit device and board for executing pruning optimization in neural network model |
US11200092B2 (en) * | 2018-03-27 | 2021-12-14 | Tencent Technology (Shenzhen) Company Limited | Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolutional neural network hardware and AXI bus IP core thereof |
US20180032859A1 (en) * | 2016-07-27 | 2018-02-01 | Samsung Electronics Co., Ltd. | Accelerator in convolutional neural network and method for operating the same |
- 2018-09-25: Application CN201811115052.8A filed in China; granted as patent CN109409511B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915322A (en) * | 2015-06-09 | 2015-09-16 | 中国人民解放军国防科学技术大学 | Method for accelerating convolutional neural network hardware and AXI bus IP core thereof |
US20180032859A1 (en) * | 2016-07-27 | 2018-02-01 | Samsung Electronics Co., Ltd. | Accelerator in convolutional neural network and method for operating the same |
Non-Patent Citations (1)
Title |
---|
王晶波: "Research on Image Processing Operator Scheduling Techniques for Reconfigurable Processors", China Master's Theses Full-text Database (Electronic Journal) *
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11200092B2 (en) * | 2018-03-27 | 2021-12-14 | Tencent Technology (Shenzhen) Company Limited | Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium |
CN110135554A (en) * | 2019-03-25 | 2019-08-16 | 电子科技大学 | An FPGA-based hardware acceleration architecture for convolutional neural networks |
CN110163409B (en) * | 2019-04-08 | 2021-05-18 | 华中科技大学 | Convolutional neural network scheduling method applied to the permutation flow shop |
CN110163409A (en) * | 2019-04-08 | 2019-08-23 | 华中科技大学 | A convolutional neural network scheduling method applied to the permutation flow shop |
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Method, system and device for implementing a convolutional neural network based on FPGA with row-output priority |
CN110222818A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A multi-bank row-column interleaved read/write method for convolutional neural network data storage |
CN110288078A (en) * | 2019-05-19 | 2019-09-27 | 南京惟心光电***有限公司 | An accelerator for the GoogLeNet model and method thereof |
CN110516801A (en) * | 2019-08-05 | 2019-11-29 | 西安交通大学 | A high-throughput dynamically reconfigurable convolutional neural network accelerator architecture |
CN110516801B (en) * | 2019-08-05 | 2022-04-22 | 西安交通大学 | High-throughput-rate dynamic reconfigurable convolutional neural network accelerator |
CN110796245A (en) * | 2019-10-25 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN110796245B (en) * | 2019-10-25 | 2022-03-22 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN111931911B (en) * | 2020-07-30 | 2022-07-08 | 山东云海国创云计算装备产业创新中心有限公司 | CNN accelerator configuration method, system and device |
CN111931911A (en) * | 2020-07-30 | 2020-11-13 | 山东云海国创云计算装备产业创新中心有限公司 | CNN accelerator configuration method, system and device |
CN112132275A (en) * | 2020-09-30 | 2020-12-25 | 南京风兴科技有限公司 | Parallel computing method and device |
CN112540946A (en) * | 2020-12-18 | 2021-03-23 | 清华大学 | Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor |
CN113313251A (en) * | 2021-05-13 | 2021-08-27 | 中国科学院计算技术研究所 | Depthwise separable convolution fusion method and system based on a dataflow architecture |
CN113469326A (en) * | 2021-06-24 | 2021-10-01 | 上海寒武纪信息科技有限公司 | Integrated circuit device and board for executing pruning optimization in neural network model |
CN113469326B (en) * | 2021-06-24 | 2024-04-02 | 上海寒武纪信息科技有限公司 | Integrated circuit device and board for executing pruning optimization in neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN109409511B (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109409511A (en) | A convolution operation data flow scheduling method for a dynamic reconfigurable array | |
US11775802B2 (en) | Neural processor | |
US20230334006A1 (en) | Compute near memory convolution accelerator | |
CN111667051B (en) | Neural network accelerator applicable to edge equipment and neural network acceleration calculation method | |
Ma et al. | Optimizing the convolution operation to accelerate deep neural networks on FPGA | |
CN110097174B (en) | Method, system and device for realizing convolutional neural network based on FPGA and row output priority | |
CN110210610B (en) | Convolution calculation accelerator, convolution calculation method and convolution calculation device | |
CN108564168B (en) | Design method for neural network processor supporting multi-precision convolution | |
CN110516801A (en) | A high-throughput dynamically reconfigurable convolutional neural network accelerator architecture |
CN108805266A (en) | A reconfigurable high-concurrency CNN convolution accelerator |
CN111967468A (en) | FPGA-based lightweight target detection neural network implementation method |
CN110222818A (en) | A multi-bank row-column interleaved read/write method for convolutional neural network data storage |
CN110580519B (en) | Convolution operation device and method thereof | |
Stevens et al. | Manna: An accelerator for memory-augmented neural networks | |
CN112950656A (en) | Block convolution method for pre-reading data according to channel based on FPGA platform | |
Li et al. | Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration | |
CN115238863A (en) | Hardware acceleration method, system and application of convolutional neural network convolutional layer | |
CN113033794A (en) | Lightweight neural network hardware accelerator based on deep separable convolution | |
CN109993293A (en) | A deep learning accelerator for stacked hourglass networks |
Que et al. | Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs | |
Duan et al. | Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights | |
Yin et al. | FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode | |
CN112200310A (en) | Intelligent processor, data processing method and storage medium | |
Jiang et al. | Hardware implementation of depthwise separable convolution neural network | |
US20230025068A1 (en) | Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |