CN109409511A - Convolution operation dataflow scheduling method for dynamic reconfigurable array - Google Patents

Convolution operation dataflow scheduling method for dynamic reconfigurable array

Info

Publication number
CN109409511A
CN109409511A (application CN201811115052.8A; granted as CN109409511B)
Authority
CN
China
Prior art keywords
data
convolution
unit
image
mapped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811115052.8A
Other languages
Chinese (zh)
Other versions
CN109409511B (en)
Inventor
杨晨 (Yang Chen)
张海波 (Zhang Haibo)
王小力 (Wang Xiaoli)
耿莉 (Geng Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201811115052.8A priority Critical patent/CN109409511B/en
Publication of CN109409511A publication Critical patent/CN109409511A/en
Application granted granted Critical
Publication of CN109409511B publication Critical patent/CN109409511B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

A convolution operation dataflow scheduling method for a dynamic reconfigurable array. IRB schedules the weight data and image data, splits the matrix inner product into rows, and maps the rows onto different PE units for computation; the computed results are accumulated, and the accumulated sum is activated in the last-stage SPE. Once the activated data are output, scheduling is complete. Different rows of the weight data are fixed in different PE units; the image data are then mapped row by row to each PE unit and convolved with the weights. Intermediate data are buffered in the PE units and passed stage by stage to the next PE unit for accumulation, forming a pipeline that yields the convolution results. When computing a CNN, the IRB dataflow improves the reuse of input image data and weight data and reduces on-chip/off-chip data movement, which lowers the power consumption and latency of data transfer and improves both performance and energy efficiency.

Description

Convolution operation dataflow scheduling method for a dynamic reconfigurable array
Technical field
The present invention relates to a convolution operation dataflow scheduling method for dynamic reconfigurable arrays.
Background technique
Artificial intelligence is currently one of the most popular areas of computer science, and deep learning, as the principal way of realizing artificial intelligence, has developed rapidly. The convolutional neural network (Convolution Neural Network, CNN) is one of the most studied and most widely applied artificial neural network structures and has become a research hotspot in many scientific fields. In pattern classification in particular, CNNs avoid the complex image pre-processing formerly required and can take raw images directly as input, so they have found especially wide application. In recent years CNNs have achieved excellent results in computer vision, which in turn has driven their further development. The core of a neural network is computation: when a CNN is applied to computer vision, convolution kernels extract features from the image data, and the dominant operation is the convolution. In a typical CNN, convolution accounts for roughly 90% of all arithmetic operations. How to perform the convolution operations of a CNN efficiently is therefore a key problem in CNN accelerator design.
As the number of layers and neurons in CNNs grows, model computational complexity increases exponentially, and the training and inference speed of deep learning algorithms depends increasingly on the hardware computing platform. Hardware acceleration of deep learning currently takes three main forms (multi-core CPUs, GPUs, and FPGAs), all of which provide highly parallel computation. However, these implementations consume considerable power and suffer from low energy efficiency (performance per watt), so they cannot be deployed on intelligent mobile platforms such as smartphones, wearable devices, or autonomous vehicles. Against this background, the reconfigurable processor has proven to be a parallel computing architecture that combines high flexibility with high energy efficiency: it can select a suitable resource allocation strategy for models of different sizes, improving processing performance while broadening the application range of a dedicated processor. It is one of the paths forward where further development of multi-core CPU and FPGA technology is constrained, and a candidate scheme for future high-efficiency deep learning SoCs. Unlike a general-purpose processor, it can change not only the control flow but also, dynamically, the structure of the datapath, offering high performance, low hardware and power overhead, good flexibility, and good scalability; in processing speed, a reconfigurable processor approaches a dedicated chip. A reconfigurable computing array uses an array of processing elements (Processing Elements, PEs) to satisfy the differing demands of different applications. Future computing systems will generally need to be both versatile and high-performance, and the current trend is to integrate multiple reconfigurable computing arrays into a system to adaptively support different standards while meeting ever-increasing performance requirements.
When a CNN executes, the convolution kernel slides over the image to perform the convolution, a computation pattern that repeatedly reuses large amounts of data. Unlike computation on a GPU, hardware acceleration of a CNN cannot buffer all of the computation data on chip, so the dataflow of the convolution operation must be scheduled.
A CNN involves a large amount of computation, and a reconfigurable computing array can execute the algorithm in parallel. The weight data and image data of the CNN are partitioned and then mapped onto the corresponding computing units. Because hardware resources are limited, the CNN algorithm cannot be mapped onto the hardware architecture in its entirety, so the image data and weight data must be scheduled. During computation, large volumes of input data are used repeatedly, and many existing methods suffer from the following problems in their data scheduling:
1. Repeated data input. In a CNN, the convolution kernel slides over the input image to perform the convolution; when the sliding stride is smaller than the kernel size, each slide reuses part of the data from the previous convolution. These data can be re-read from outside the computing unit, but doing so causes the same data to be input repeatedly.
2. When CNN data are mapped onto hardware units, constraints of the hardware resource architecture itself can make the resulting pipeline inefficient.
Summary of the invention
The object of the present invention is to provide a convolution operation dataflow scheduling method for dynamic reconfigurable arrays.
To achieve the above object, the present invention adopts the following technical scheme that:
A convolution operation dataflow scheduling method for a dynamic reconfigurable array, characterized in that IRB schedules the weight data and image data, splits the matrix inner product into rows, and maps the rows onto different PE units for computation; the computed results are accumulated, the accumulated sum is activated in the last-stage SPE, and the activated data are output, completing the scheduling.
A further improvement of the present invention is that the method comprises the following steps:
Step 1: in the IRB dataflow, the convolution kernel data are mapped row by row onto the PE array, with one row of kernel data mapped onto each PE unit;
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units;
Step 3: the intermediate data produced by the convolution are passed to the next-stage PE unit until the last-stage PE unit is reached; the last-stage PE unit is an SPE, which applies the activation f(·) of formula (1) to the final accumulated result, the activation being performed by the ReLU module, and the activated data are the output;
O[z][u][y][x] = f( Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} I[z][k][U×y+i][U×x+j] × W[u][k][i][j] ),
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F   (1)
Wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network; z is the index of the input image (N input images in total), u is the index of the convolution kernel (M kernels in total), y is the row index of the output image and E the total number of output rows, x is the column index of the output image and F the total number of output columns, i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
A further improvement of the present invention lies in that detailed process is as follows for the first step: the size of convolution kernel is R row, is being mapped The convolution Nuclear Data of this R row is respectively mapped in R PE unit in the process, the weight data of mapping is stored in weight deposit In device.
A further improvement of the present invention lies in that detailed process is as follows for second step: image data has H row, is mapped to line by line Weight data on PE array, and in the PE unit that has been mapped into does multiply-accumulate operation, map and multiply accumulating be simultaneously into Capable;Image data is mapped in PE unit, is cached in image register, shift register is in caching image data It can be realized sliding sash function in convolution operation simultaneously, what each PE unit was calculated is row convolution results to get to R row Convolved data.
A further improvement of the present invention lies in that image register is shift register.
A further improvement of the present invention lies in that the result of obtained convolutional calculation is temporarily stored in the FIFO of PE unit, During next stage PE carries out convolutional calculation, the intermediate data of upper level PE convolutional calculation is transferred to next stage and carries out mediant According to cumulative;For the convolution kernel having a size of i, each convolution kernel needs i PE unit to be calculated;The size of convolution kernel size i It is 3,5,11, corresponding on PE array, the PE unit number needed is also i.
A further improvement of the present invention lies in that realizing IRB data flow on the PE array of 22*22.
A further improvement of the present invention lies in that using the convolution nuclear volume calculated every time as degree of parallelism measurement standard, convolution kernel When size is 3, array can simultaneously be calculated 22*7=154 convolution kernel;When convolution kernel size is 5, array can calculate simultaneously 22*4=88 convolution kernel calculates, and when convolution kernel size is 11, array calculates the convolution kernel of 22*2=44 simultaneously.
Compared with the prior art, the present invention has the following beneficial effects:
1. Based on dynamic reconfigurable technology, the proposed dataflow scheduling mechanism for CNN acceleration, designed together with the hardware, splits and maps the data to implement the CNN algorithm and schedules its convolution operations; the image is mapped row by row onto all PE units for convolution. Scheduling the image data by row-wise broadcast avoids the complex timing control otherwise required when mapping image data onto the PE array.
2. Different rows of the weight data are fixed in different PE units; the image data are then mapped row by row to each PE unit and convolved with the weights; intermediate data are buffered in the PE units and passed stage by stage to the next PE unit for accumulation, forming a pipeline that yields the convolution results. When computing a CNN, the IRB dataflow improves the reuse of input image data and weight data and reduces on-chip/off-chip data movement, which lowers the power consumption and latency of data transfer and improves both performance and efficiency.
Detailed description of the invention
Fig. 1 shows the computing architecture of the CNN accelerator.
Fig. 2 shows the PE unit structure.
Fig. 3 shows the convolution computation process.
Fig. 4 shows the convolution kernel mapped row by row onto the PE array.
Fig. 5 shows the image data broadcast row by row onto the PE array.
Fig. 6 shows the intermediate data accumulated stage by stage between PE units.
Fig. 7 shows the RS dataflow.
Fig. 8 shows the IRB dataflow.
Specific embodiment
The present invention will now be described in detail with reference to the accompanying drawings.
The present invention proposes a new dataflow scheduling mechanism for dynamic reconfigurable computing arrays, called Image Row Broadcast (IRB). IRB is a dataflow scheduling method, built on a reconfigurable computing hardware architecture, for accelerating the convolution operations of CNNs; it can accelerate network structures such as LeNet, AlexNet, and VGG.
The invention applies the IRB dataflow scheduling for CNN computation to the hardware architecture shown in Fig. 1. The dynamic reconfigurable computing array adapts to the different computation patterns of CNNs: the configuration module configures the PE array through configuration information; the FSM is the control module of the system; the reconfigurable PE array is the computing architecture of the whole system and the hardware on which IRB is realized; and two memory modules serve as intermediate buffers to ensure that the dataflow of the array computation is not interrupted by waiting for operand data.
The PE units designed for the computation characteristics of CNNs come in two structures, Normal PE (PE for short) and Special PE (SPE for short). As shown in Fig. 2, a PE comprises the following modules: an image register group (Picture Reg), a weight register group (Filter Reg), a multiplier, an accumulator (Acc), an adder, and a FIFO. An SPE adds to the PE the following modules: a multiplexer, a data branch switch, an adder, and a ReLU function module (ReLU). The specific parameters are as follows: the input data width of the weight and image register groups is 16 bits with a depth of 16; the multiplier input width is 16 bits; the adder input width is 32 bits; the FIFO data width is 32 bits with a depth of 64. The full PE array is 22×22 and covers the computation patterns of the kernel sizes 3, 5, and 11 used in the AlexNet network. By changing the interconnections between units and the internal register configuration, the PE array can satisfy these computation patterns. In addition, storage modules inside the PE units satisfy the data-buffering needs of IRB dataflow computation.
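For reference, the PE parameters listed above can be collected in one place; this is a descriptive sketch only, and the field names are illustrative rather than identifiers used by the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PEConfig:
    """PE-unit parameters from the description (field names are illustrative)."""
    reg_width_bits: int = 16   # input width of the weight/image register groups
    reg_depth: int = 16        # depth of the weight/image register groups
    mul_width_bits: int = 16   # multiplier input width
    add_width_bits: int = 32   # adder input width
    fifo_width_bits: int = 32  # FIFO data width
    fifo_depth: int = 64       # FIFO depth

ARRAY_SHAPE = (22, 22)         # full PE array; covers the 3/5/11 kernel modes
```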
The basic operation of a convolutional neural network is the convolution. As shown in Fig. 3, multiple kernels are convolved with multiple images: the kernel slides over the image, and the convolution produces new output image data. The computation is given by:
O[z][u][y][x] = f( Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} I[z][k][U×y+i][U×x+j] × W[u][k][i][j] ),
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F   (1)
Wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network. z is the index of the input image, of which there are N; u is the index of the convolution kernel, of which there are M. y is the row index of the output image and E the total number of output rows; x is the column index of the output image and F the total number of output columns. i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
As formula (1) shows, the convolution is a matrix inner product of the input image data and the weight data: corresponding elements are multiplied and the products are summed.
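To make formula (1) concrete, the following NumPy sketch evaluates it directly, assuming square images and kernels and taking f(·) to be ReLU; it is a software reference model of the arithmetic, not the hardware mapping:

```python
import numpy as np

def conv_formula_1(I: np.ndarray, W: np.ndarray, U: int = 1) -> np.ndarray:
    """Direct evaluation of formula (1).
    I: input images, shape (N, C, H, H); W: kernels, shape (M, C, R, R);
    U: stride. Returns O of shape (N, M, E, E) with E = (H - R) // U + 1."""
    N, C, H, _ = I.shape
    M, _, R, _ = W.shape
    E = (H - R) // U + 1
    O = np.zeros((N, M, E, E), dtype=np.float32)
    for z in range(N):                  # input image index
        for u in range(M):              # kernel index
            for y in range(E):          # output row
                for x in range(E):      # output column
                    window = I[z, :, U*y:U*y+R, U*x:U*x+R]
                    O[z, u, y, x] = np.sum(window * W[u])  # matrix inner product
    return np.maximum(O, 0.0)           # f() taken to be the ReLU activation
```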
In the convolution operation dataflow scheduling method of the present invention, IRB schedules the weight data and image data of the computation, splits the large matrix inner product into rows, and maps the rows onto different PE units for computation; accumulating the computed results realizes the summation inside the parentheses of formula (1). The accumulated sum is activated in the last-stage SPE unit, producing the output data. The method specifically comprises the following steps:
Step 1: the convolution kernel is mapped row by row onto the PE array, one row of kernel data per PE unit, as shown in Fig. 4. The detailed process is as follows:
In the IRB dataflow, the kernel data are first mapped row by row into the PE array, each PE unit holding one row of the kernel.
For example, the kernel in Fig. 3 has R rows, so in the mapping process these R rows of kernel data must be mapped to R separate PE units. Note that the first R−1 rows of the kernel are mapped into PEs and the last row into an SPE (an SPE can realize the function of a PE through its configuration information). The kernel is mapped into the PE units because, during the convolution, the kernel slides over the image, so the weight data are reused repeatedly and must be convolved with the entire image. The mapped weight data are therefore stored in the weight registers, from which they can be read continuously inside the PE during the convolution; this avoids repeated reads of the weight data from outside and improves computational efficiency.
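A schematic sketch of this mapping (illustrative data structures, not the patent's configuration format): the R kernel rows are latched into the weight registers of R chained units, with the last stage marked as the SPE:

```python
def map_kernel_rows(kernel_rows):
    """Step 1 sketch: one kernel row per PE unit; the last stage is the SPE.
    kernel_rows: list of R weight rows, each held in a Filter Reg."""
    chain = [{"filter_reg": row, "is_spe": False} for row in kernel_rows]
    chain[-1]["is_spe"] = True    # last kernel row maps to the SPE stage
    return chain
```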
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units, as shown in Fig. 5. The detailed process is as follows:
After the kernel has been mapped onto the PE array, the image data are broadcast row by row into the PE units. The image in Fig. 3 has H rows, which are mapped row by row onto the PE array and multiply-accumulated with the weight data already held in the PE units; mapping and multiply-accumulation proceed simultaneously. The image data mapped into a PE unit are buffered in the image register, which is designed as a shift register: while buffering the image data, the image shift register realizes the sliding-window function of the convolution, shifting by the stride U after each convolution so that correct results are obtained. Each PE unit computes one row-convolution result, yielding the convolution data of R rows.
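The row convolution performed by one PE can be modeled as below (a behavioral sketch, not the RTL; it assumes the shift-register window produces a result only at positions aligned to the stride U):

```python
from collections import deque

def pe_row_convolution(image_row, filter_reg, U=1):
    """Step 2 sketch: the weights stay fixed in the weight register while
    image pixels stream in; a depth-R shift register forms the sliding window."""
    R = len(filter_reg)
    window = deque(maxlen=R)              # models the image shift register
    row_result = []
    for idx, pixel in enumerate(image_row):
        window.append(pixel)              # one shift per incoming pixel
        if len(window) == R and (idx - R + 1) % U == 0:
            row_result.append(sum(w * p for w, p in zip(filter_reg, window)))
    return row_result
```

For a 7-element image row, a 3-element kernel row, and U = 1, this yields the expected 5 partial sums of one row convolution.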
Step 3: the intermediate data produced by the convolution are passed to the next-stage PE unit until the last-stage PE unit is reached; the last-stage PE unit is an SPE, which applies the activation f(·) of formula (1) to the final accumulated result, the activation being performed by the ReLU module, and the activated data are the output.
Note that an SPE can be configured as a PE; an SPE configured as a PE is regarded as a PE and is not used as the last-stage PE unit. In other words, intermediate stages are always PEs, and only the last stage can be an SPE. As shown in Fig. 6, the detailed process of this step is as follows:
The convolution results of Fig. 5 are buffered in the FIFOs of the PE units. While the next-stage PE performs its own convolution, the intermediate data of the previous-stage convolution are passed down and accumulated. The data each PE stage passes to the next stage are the element-wise accumulation of the row-convolution results computed by all stages up to and including this one: for a kernel of size i, each kernel requires i PE units, and the accumulated result is Σ Row_i. For the CNN structures accelerated by the present invention, the kernel size i may be 3, 5, or 11, and correspondingly the number of PE units required on the PE array is also i (3, 5, or 11). The image data are broadcast to all PE units. Because kernel sizes differ and hardware is limited, the achievable parallelism varies; the present invention realizes the IRB dataflow on a 22×22 PE array. Taking the number of kernels computed at once as the measure of parallelism: with kernel size 3 the array can compute 22 (rows) × 7 = 154 kernels simultaneously; with kernel size 5, 22 (rows) × 4 = 88 kernels; and with kernel size 11, 22×2 = 44 kernels. The last stage of the array computation is the SPE unit, which applies the activation f(·) of formula (1) to all the accumulated final results; the activation is performed by the ReLU module, and the activated data are the output.
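Combining the two previous sketches gives a behavioral model of this step (again illustrative: the real inter-stage transfer proceeds cycle by cycle through the FIFOs, whereas this model accumulates whole rows at once). It computes one output row from R consecutive image rows, reusing pe_row_convolution from the step 2 sketch:

```python
def pe_chain_output(image_rows, kernel_rows, U=1):
    """Step 3 sketch: stage r convolves image row r with kernel row r, adds
    the partial sums arriving from the previous stage, and the final SPE
    stage applies ReLU to the accumulated total."""
    partial = None
    for img_row, w_row in zip(image_rows, kernel_rows):    # R chained stages
        row_conv = pe_row_convolution(img_row, w_row, U)
        partial = row_conv if partial is None else \
            [a + b for a, b in zip(partial, row_conv)]     # inter-stage add
    return [max(v, 0) for v in partial]                    # SPE: ReLU output
```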
The following table compares the performance of the dataflow proposed by the invention with several other CNN accelerators.
Table 1. Performance comparison of the proposed dataflow with other CNN accelerators
As Table 1 shows, the method of the invention significantly improves both system performance and efficiency. When processing convolutional layers, the present invention achieves 97.4 GOPS on AlexNet, 90.75 GOPS on VGG, and 100.8 GOPS on LeNet-5. Compared with Virtex-7 VX485T, it achieves 1.59× the performance and 2.96× the efficiency on AlexNet. Compared with Zynq-7000, it improves LeNet performance by 47× and efficiency by 14.5×. Compared with Stratix-V GXA7, it delivers at least 2.9× the performance and 7× the efficiency. Against an Intel Xeon E5-2620 CPU, it is 6.6× faster and 52× more efficient.
Comparison between the IRB dataflow and the RS (Row Stationary) dataflow proposed by Eyeriss:
Take M convolution kernels of size 3×3×C applied to an image of size 7×7×C as an example, where C is the number of channels, and let the PE array sub-block size be 3×3. Fig. 7 shows the pipeline timing of the RS dataflow, in which a PE array sub-block completes the mapping of one channel at a time. Fig. 8 shows the convolution computation with the IRB dataflow, which can complete the images of three channels in parallel on the PE array.
T1 denotes the cycles needed to map one image row from memory onto the PE array, and T2 the row-convolution cycles for one image row in each PE. With an image size of 7×7 and a kernel size of 3×3, T1 = 7 and T2 = 3 × (7 − 2) = 15. The average time needed to compute one channel of the image with the RS dataflow is:
T_RS = (T1 × 5 + (T1 + 1) × 2 + 15) × C × M = 66 × C × M   (2)
The average time needed to compute one channel with the proposed IRB dataflow is:
T_IRB = (T1 + T2 × 7) × C × M / 3 ≈ 37 × C × M   (3)
Note that the division by 3 in equation (3) is due to the parallelism of 3. That is, although the computation process of IRB is longer than that of RS, IRB generates the images of three channels in parallel, whereas RS can only compute and generate a single channel image at a time; IRB therefore provides higher parallelism than RS. In this example, the results show that the IRB dataflow improves performance by about 44% compared with RS.
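The two timing expressions can be checked numerically; the sketch below simply evaluates equations (2) and (3) for the worked example (T1 = 7, T2 = 15):

```python
T1 = 7                 # cycles to map one image row from memory
T2 = 3 * (7 - 2)       # = 15, row-convolution cycles per image row

def t_rs(C, M):        # equation (2): RS maps one channel at a time
    return (T1 * 5 + (T1 + 1) * 2 + 15) * C * M      # = 66 * C * M

def t_irb(C, M):       # equation (3): IRB computes 3 channels in parallel
    return (T1 + T2 * 7) * C * M / 3                 # ~= 37.3 * C * M

print(t_rs(1, 1), t_irb(1, 1))   # 66 vs ~37.3 cycles per kernel-channel
# with the rounded 37*C*M of equation (3): (66 - 37) / 66 ~= 44% improvement
```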

Claims (8)

1. A convolution operation dataflow scheduling method for a dynamic reconfigurable array, characterized in that IRB schedules the weight data and image data, splits the matrix inner product into rows, and maps the rows onto different PE units for computation; the computed results are accumulated, the accumulated sum is activated in the last-stage SPE, and the activated data are output, completing the scheduling.
2. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized by comprising the following steps:
Step 1: in the IRB dataflow, the convolution kernel data are mapped row by row onto the PE array, with one row of kernel data mapped onto each PE unit;
Step 2: the image data are broadcast row by row onto the entire PE array, and the convolution is computed in the PE units;
Step 3: the intermediate data produced by the convolution are passed to the next-stage PE unit until the last-stage PE unit is reached; the last-stage PE unit is an SPE, which applies the activation f(·) of formula (1) to the final accumulated result, the activation being performed by the ReLU module, and the activated data are the output;
O[z][u][y][x] = f( Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} I[z][k][U×y+i][U×x+j] × W[u][k][i][j] ),
0 ≤ z < N, 0 ≤ u < M, 0 ≤ y < E, 0 ≤ x < F   (1)
wherein O is the output image data, I is the input image data, W is the weight data, and f(·) is the activation function of the neural network; z is the index of the input image (N input images in total), u is the index of the convolution kernel (M kernels in total), y is the row index of the output image and E the total number of output rows, x is the column index of the output image and F the total number of output columns, i and j are the row and column indices of the convolution kernel, k is the channel index, and U is the stride by which the kernel slides after each convolution.
3. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that step 1 proceeds as follows: the convolution kernel has R rows; in the mapping process these R rows of kernel data are mapped to R PE units, and the mapped weight data are stored in the weight registers.
4. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that step 2 proceeds as follows: the image data have H rows and are mapped row by row onto the PE array, performing multiply-accumulate operations with the weight data already mapped into the PE units; mapping and multiply-accumulation proceed simultaneously; the image data mapped into a PE unit are buffered in the image register, and the image shift register realizes the sliding-window function of the convolution while buffering the image data; each PE unit computes one row-convolution result, yielding the convolution data of R rows.
5. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 4, characterized in that the image register is a shift register.
6. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the result of the convolution is buffered in the FIFO of the PE unit; while the next-stage PE performs its convolution, the intermediate data of the previous-stage convolution are passed to the next stage for accumulation; for a kernel of size i, each kernel requires i PE units; the kernel size i may be 3, 5, or 11, and correspondingly the number of PE units required on the PE array is also i.
7. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 1, characterized in that the IRB dataflow is realized on a 22×22 PE array.
8. The convolution operation dataflow scheduling method for a dynamic reconfigurable array according to claim 7, characterized in that, taking the number of kernels computed at once as the measure of parallelism: with kernel size 3 the array can compute 22×7 = 154 kernels simultaneously; with kernel size 5 the array can compute 22×4 = 88 kernels simultaneously; and with kernel size 11 the array computes 22×2 = 44 kernels simultaneously.
CN201811115052.8A 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array Active CN109409511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811115052.8A CN109409511B (en) 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811115052.8A CN109409511B (en) 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array

Publications (2)

Publication Number Publication Date
CN109409511A true CN109409511A (en) 2019-03-01
CN109409511B CN109409511B (en) 2020-07-28

Family

ID=65465836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811115052.8A Active CN109409511B (en) 2018-09-25 2018-09-25 Convolution operation data flow scheduling method for dynamic reconfigurable array

Country Status (1)

Country Link
CN (1) CN109409511B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110135554A (en) * 2019-03-25 2019-08-16 电子科技大学 A kind of hardware-accelerated framework of convolutional neural networks based on FPGA
CN110163409A (en) * 2019-04-08 2019-08-23 华中科技大学 A kind of convolutional neural networks dispatching method applied to displacement Flow Shop
CN110222818A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110288078A (en) * 2019-05-19 2019-09-27 南京惟心光电***有限公司 A kind of accelerator and its method for GoogLeNet model
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN111931911A (en) * 2020-07-30 2020-11-13 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN112132275A (en) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and device
CN112540946A (en) * 2020-12-18 2021-03-23 清华大学 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
CN113313251A (en) * 2021-05-13 2021-08-27 中国科学院计算技术研究所 Deep separable convolution fusion method and system based on data stream architecture
CN113469326A (en) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 Integrated circuit device and board card for executing pruning optimization in neural network model
US11200092B2 (en) * 2018-03-27 2021-12-14 Tencent Technology (Shenzhen) Company Limited Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王晶波 (Wang Jingbo): "Research on Scheduling Techniques for Image Processing Operators for Reconfigurable Processors", China Masters' Theses Full-text Database (Electronic Journal) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11200092B2 (en) * 2018-03-27 2021-12-14 Tencent Technology (Shenzhen) Company Limited Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium
CN110135554A (en) * 2019-03-25 2019-08-16 电子科技大学 A kind of hardware-accelerated framework of convolutional neural networks based on FPGA
CN110163409B (en) * 2019-04-08 2021-05-18 华中科技大学 Convolutional neural network scheduling method applied to replacement flow shop
CN110163409A (en) * 2019-04-08 2019-08-23 华中科技大学 A kind of convolutional neural networks dispatching method applied to displacement Flow Shop
CN110097174A (en) * 2019-04-22 2019-08-06 西安交通大学 Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row
CN110222818A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110288078A (en) * 2019-05-19 2019-09-27 南京惟心光电***有限公司 A kind of accelerator and its method for GoogLeNet model
CN110516801A (en) * 2019-08-05 2019-11-29 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN110796245A (en) * 2019-10-25 2020-02-14 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN110796245B (en) * 2019-10-25 2022-03-22 浪潮电子信息产业股份有限公司 Method and device for calculating convolutional neural network model
CN111931911B (en) * 2020-07-30 2022-07-08 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN111931911A (en) * 2020-07-30 2020-11-13 山东云海国创云计算装备产业创新中心有限公司 CNN accelerator configuration method, system and device
CN112132275A (en) * 2020-09-30 2020-12-25 南京风兴科技有限公司 Parallel computing method and device
CN112540946A (en) * 2020-12-18 2021-03-23 清华大学 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
CN113313251A (en) * 2021-05-13 2021-08-27 中国科学院计算技术研究所 Deep separable convolution fusion method and system based on data stream architecture
CN113469326A (en) * 2021-06-24 2021-10-01 上海寒武纪信息科技有限公司 Integrated circuit device and board card for executing pruning optimization in neural network model
CN113469326B (en) * 2021-06-24 2024-04-02 上海寒武纪信息科技有限公司 Integrated circuit device and board for executing pruning optimization in neural network model

Also Published As

Publication number Publication date
CN109409511B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN109409511A (en) A kind of convolution algorithm data stream scheduling method for dynamic reconfigurable array
US11775802B2 (en) Neural processor
US20230334006A1 (en) Compute near memory convolution accelerator
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
Ma et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN110516801A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture of high-throughput
CN108805266A (en) A kind of restructural CNN high concurrents convolution accelerator
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
CN110222818A (en) A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN110580519B (en) Convolution operation device and method thereof
Stevens et al. Manna: An accelerator for memory-augmented neural networks
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
Li et al. Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN109993293A (en) A kind of deep learning accelerator suitable for stack hourglass network
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN112200310A (en) Intelligent processor, data processing method and storage medium
Jiang et al. Hardware implementation of depthwise separable convolution neural network
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant