CN108647777A - Data mapping system and method for realizing parallel convolution computation - Google Patents

Data mapping system and method for realizing parallel convolution computation

Info

Publication number
CN108647777A
CN108647777A
Authority
CN
China
Prior art keywords
convolution
data
feature
module
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810432269.5A
Other languages
Chinese (zh)
Inventor
聂林川
姜凯
王子彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Hi Tech Investment and Development Co Ltd
Original Assignee
Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Hi Tech Investment and Development Co Ltd filed Critical Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority to CN201810432269.5A priority Critical patent/CN108647777A/en
Publication of CN108647777A publication Critical patent/CN108647777A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data mapping system and method for realizing parallel convolution computation, belonging to the field of neural network technology. The data mapping system comprises an input feature map buffer module, a mapping logic module, an output feature map buffer module, a weight buffer module, a convolution computing array and a control logic module. The input feature map buffer module is connected to the control logic module and the mapping logic module; the weight buffer module is connected to the control logic module and the mapping logic module; the computing array is connected to the control logic module, the mapping logic module and the output feature map buffer module; and the output feature map buffer module is connected to the control logic module. The data mapping system of the invention eliminates computing resources that are idle or perform invalid work, improves the utilization of computing resources, and has good application value.

Description

Data mapping system and method for realizing parallel convolution computation
Technical field
The present invention relates to the field of neural network technology, and specifically provides a data mapping system and method for realizing parallel convolution computation.
Background technology
With the development of artificial intelligence (AI), convolutional neural networks (CNNs) have come into wide use. Mainstream CNN models are not only complex, with large volumes of computation, but also differ greatly in architecture from layer to layer, so it is difficult for a hardware circuit to achieve both high performance and high generality; resource utilization and energy efficiency must both be considered. Implementing every layer of an entire network model in hardware at once is impractical, since power consumption, area and resource utilization cannot all reach satisfactory results. The usual way to solve this problem is to trade time for area: the model is partitioned into layers and blocks, the circuit is designed as general basic units, the entire model is constructed by time-multiplexing these units under a control circuit, and resource utilization and circuit performance are improved through efficient data mapping. In the prior art, when a hardware circuit computes a CNN model whose convolution kernel sliding stride is greater than 1, invalid computations occur and resource utilization drops. On the other hand, with a fixed computing-array circuit design, if the output feature map does not match the computing array size, some resources do not participate in computation and are wasted; such waste of computing resources prevents the overall performance from reaching an ideal result.
Invention content
The technical task of the present invention is, in view of the above problems, to provide a data mapping system for realizing parallel convolution computation that can eliminate computing resources that are idle or perform invalid work and improve the utilization of computing resources.
A further technical task of the present invention is to provide a data mapping method for realizing parallel convolution computation.
To achieve the above objects, the present invention provides the following technical solutions:
A data mapping system for realizing parallel convolution computation, the system comprising an input feature map buffer module, a mapping logic module, an output feature map buffer module, a weight buffer module, a convolution computing array and a control logic module. The input feature map buffer module is connected to the control logic module and the mapping logic module; the weight buffer module is connected to the control logic module and the mapping logic module; the convolution computing array is connected to the control logic module, the mapping logic module and the output feature map buffer module; and the output feature map buffer module is connected to the control logic module.
The data mapping system for realizing parallel convolution computation increases the parallelism of convolution computation by recombining the input feature map, eliminating computing resources that are idle or perform invalid work. Specifically, the input feature map is partitioned into regular blocks and recombined by effective mapping means, so that portions that are invalid or do not participate in computation are replaced with portions that compute effectively. This increases the parallelism of the overall convolution computation, improves the utilization of computing resources, and improves system performance.
Preferably, the input feature map buffer module serves as a buffer for externally input data. The mapping logic module obtains data from the input feature map buffer module and the weight buffer module according to commands issued by the control logic module, and sends the obtained data to the convolution computing array; the convolution computing array sends the completed results to the output feature map buffer module.
Preferably, the convolution computing array uses N rows by N columns of convolution computing units, with adjacent convolution computing units interconnected.
Each convolution computing unit contains a 2x2 grid of PEs (processing elements). During convolution computation, each PE corresponds to the calculation of one pixel of one output feature map.
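As a hedged illustration of this layout (a sketch under assumptions, not the patent's circuit), output pixels can be addressed by which convolution unit and which PE inside its 2x2 grid they fall on; the function name and indexing convention below are illustrative, not taken from the patent.

```python
# Illustrative mapping of output-feature-map pixels onto an array of
# convolution units, each containing a 2x2 grid of PEs. Row-major,
# zero-based indexing is an assumption for the sketch.

def pe_for_output_pixel(row, col):
    """Return ((unit_row, unit_col), (pe_row, pe_col)) for output pixel (row, col)."""
    unit = (row // 2, col // 2)   # which convolution unit in the N x N array
    pe = (row % 2, col % 2)       # which PE inside that unit's 2x2 grid
    return unit, pe

# A 4x4 block of output pixels spans a 2x2 grid of units with 2x2 PEs each.
unit, pe = pe_for_output_pixel(3, 2)
```

Under this convention, output pixel (3, 2) lands on unit (1, 1), PE (1, 0): one PE per output pixel, as the description requires.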
A method of realizing the data mapping of parallel convolution computation: the method partitions the input feature map into regular blocks and recombines the input feature map by mapping means, increasing the parallelism of convolution computation. The mapping logic sends the data obtained from the recombined input feature map to the convolution computing array, and the convolution computing array sends the completed results to the output feature map buffer module.
Preferably, when the convolution kernel sliding stride is greater than 1, the portions of the input feature map over which the kernel slides that produce invalid computation are filled with portions that compute effectively, and the recombined input feature map is used as the convolution unit input.
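The waste this clause addresses is easy to quantify. As a minimal sketch (the helper name is an assumption), with a square array sized for stride 1, a stride greater than 1 shrinks the output feature map while the array stays fixed, so utilization is the ratio of output points to PEs:

```python
# Utilization of a square PE array when only out_side x out_side output
# points are produced. Matches the figures used later in the description:
# a 4x4 array producing a 2x2 output (stride 2, 1x1 kernel) uses 1/4 of
# the PEs; a 3x3 array producing a 2x2 output uses 4/9.

def utilization(array_side, out_side):
    return (out_side * out_side) / (array_side * array_side)

u_stride2 = utilization(4, 2)    # 0.25
u_mismatch = utilization(3, 2)   # 4/9
```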
Preferably, to fill the invalid-computation portions of the kernel sliding with effectively computing portions of the input feature map, the invalid-computation part of the array is filled with the data of the effectively computing position at the upper right of the matrix: the data in the input feature map that participate in effective computation are translated rightward and downward and copied into adjacent convolution computing units.
Preferably, the data copied into the adjacent convolution computing units undergo convolution computation with the convolution kernel weight values read from the weight buffer module until the newly combined feature map has traversed all weight values, and the results are sent to the output feature map buffer module.
Preferably, when the output feature map does not match the computing array size, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map, which is used as the convolution computing array input.
Preferably, the division proportion of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the convolution computing array size and the output feature map size.
Compared with the prior art, the data mapping method for realizing parallel convolution computation of the present invention has the following outstanding beneficial effects: the method recombines the input feature map by effective mapping means and increases the parallelism of convolution computation. Specifically, the input feature map is partitioned into regular blocks, portions that are invalid or do not participate in computation are replaced with effectively computing portions, computing resources that are idle or perform invalid work are eliminated, the parallelism of the overall convolution computation is increased, the utilization of computing resources is improved, and system performance is improved. The method therefore has good application value.
Description of the drawings
Fig. 1 is a topology diagram of the data mapping system for realizing parallel convolution computation of the present invention;
Fig. 2 is a topology diagram of a convolution computing unit performing convolution computation in the data mapping system for realizing parallel convolution computation of the present invention;
Fig. 3 is a schematic diagram of the data mapping method for realizing parallel convolution computation of the present invention when the convolution kernel sliding stride is greater than 1;
Fig. 4 is a schematic diagram of the data mapping method for realizing parallel convolution computation of the present invention when the output feature map and the computing array size do not match.
Specific implementation mode
The data mapping system and method for realizing parallel convolution computation of the present invention are described in further detail below with reference to the drawings and an embodiment.
Embodiment
As shown in Fig. 1, the data mapping system for realizing parallel convolution computation of the present invention includes an input feature map buffer module, a mapping logic module, an output feature map buffer module, a weight buffer module, a convolution computing array and a control logic module.
The input feature map buffer module serves as a buffer for externally input data and is connected to the control logic module and the mapping logic module.
The convolution computing array uses N rows by N columns of convolution computing units, with adjacent convolution computing units interconnected. As shown in Fig. 2, each convolution computing unit contains a 2x2 grid of PEs; during convolution computation, each PE corresponds to the calculation of one pixel of one output feature map.
The mapping logic module obtains data from the input feature map buffer module and the weight buffer module according to commands issued by the control logic module, and sends the obtained data to the convolution computing array; the convolution computing array sends the completed results to the output feature map buffer module.
The weight buffer module is connected to the control logic module and the mapping logic module. The convolution computing array is connected to the control logic module, the mapping logic module and the output feature map buffer module. The output feature map buffer module is connected to the control logic module.
The data mapping method for realizing parallel convolution computation of the present invention partitions the input feature map into regular blocks and recombines it by mapping means, increasing the parallelism of convolution computation. The mapping logic sends the data obtained from the recombined input feature map to the convolution computing array, and the convolution computing array sends the completed results to the output feature map buffer module.
When the convolution kernel sliding stride is greater than 1, the portions of the input feature map over which the kernel slides that produce invalid computation are filled with effectively computing portions: the invalid-computation part of the array is filled with the data of the effectively computing position at the upper right of the matrix, and the data in the input feature map that participate in effective computation are translated rightward and downward and copied into adjacent computing units. The data copied into the adjacent computing units undergo convolution computation with the convolution kernel weight values read from the weight buffer module until the newly combined feature map has traversed all weight values; the recombined input feature map serves as the convolution unit input, and the results are sent to the output feature map buffer module. The specific implementation process is shown in Fig. 3 and illustrated with an example in which the convolution computing array size is 4x4, the output feature map is 2x2, the convolution kernel weight matrix is 1x1, and the convolution kernel sliding stride is 2. With a sliding stride of 2, one invalid computation is performed for every effective output point, so the effective utilization of the whole computing array is (2x2)/(4x4) = 1/4 and computing resources are wasted. To make full use of the computing resources, the effective computation is replicated in parallel, each copy is convolved with a different convolution kernel, and the intermediate results are cached.
1. At time T0 of the first cycle, the control logic commands the mapping logic to read the value of point 11 of the input feature map from the input feature map buffer into the computing array, and to read the corresponding weight k1 from the weight buffer into the computing array.
2. At time T1, the value of point 11 and weight k1 are computed in the computing array, and the result out0 is sent to the output feature map buffer; at the same time, point 11 is copied to position 12 of the computing array.
3. At time T2, the value at position 12 and weight k2 are computed in the computing array, and the result out1 is sent to the output feature map buffer; at the same time, point 11 is copied to position 21 of the computing array.
4. At time T3, the value at position 21 and weight k3 are computed in the computing array, and the result out2 is sent to the output feature map buffer; at the same time, point 11 is copied to position 22 of the computing array.
5. At time T4, the value at position 22 is computed with weight k4, and the result out3 is sent to the output feature map buffer.
The other computing units are processed in the same way until the first feature value of the input feature map has been computed with all of the weights and the intermediate results have been preserved; the next feature value is then processed, and so on. After the entire input feature map of the first channel has been computed, the input feature map of the next channel enters computation, and the intermediate results of the different channels at corresponding positions are summed.
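The T0-T4 schedule above can be sketched in a few lines of Python (a behavioral model under assumptions, not the patent's circuit; function and variable names are illustrative). One valid input value is copied step by step into the 2x2 positions 11, 12, 21, 22 of a convolution unit and multiplied at each position by a different 1x1 kernel weight k1..k4, so PEs that would otherwise perform invalid computation produce useful partial results:

```python
# Behavioral sketch of the per-unit schedule: value v is copied across the
# four PE positions over successive cycles and multiplied by one kernel
# weight per position; each product (out0..out3) goes to the output
# feature map buffer.

def run_unit_schedule(v, weights):
    """Return [(position, output), ...] for one valid input value v.

    weights corresponds to k1..k4 read from the weight buffer.
    """
    positions = ["11", "12", "21", "22"]   # PE positions inside one 2x2 unit
    return [(pos, v * k) for pos, k in zip(positions, weights)]

trace = run_unit_schedule(3.0, [1.0, 2.0, 0.5, -1.0])
```

Each cycle yields one output, matching the one-result-per-time-step walkthrough above.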
When the output feature map does not match the computing array size, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map, which serves as the convolution computing array input. The division proportion of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the computing array size and the output feature map size. The specific implementation process is shown in Fig. 4 and illustrated with an example in which the convolution computing array size is 3x3, the output feature map size is 2x2, the convolution kernel size is 1x1, and the sliding stride is 1. In this case the computing array size is larger than the output feature map size and is not an integer multiple of it, so the computing resource utilization is (2x2)/(3x3) = 4/9 and the resources not participating in computation are wasted. The input feature map is therefore cut into blocks, and the blocks from different channels are combined so that all resources of the computing array are fully used. The specific process is as follows:
1. At time T0 of the first cycle, the control logic commands the mapping logic to input the point-11 values at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computing array respectively; the four ways compute in parallel and simultaneously obtain the point-11 values of 4 output feature maps, which are temporarily stored in the output feature map buffer.
2. At time T1, the control logic commands the mapping logic to input the point-12 values at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computing array; the four ways compute in parallel and simultaneously obtain the point-12 values of 4 output feature maps, which are temporarily stored in the output feature map buffer.
3. At time T2, the control logic commands the mapping logic to input the point-21 values at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computing array; the four ways compute in parallel and simultaneously obtain the point-21 values of 4 output feature maps, which are temporarily stored in the output feature map buffer.
4. At time T3, the control logic commands the mapping logic to input the point-22 values at the same position of channels one, two, three and four into positions 11, 12, 21 and 22 of the computing array; the four ways compute in parallel and simultaneously obtain the point-22 values of 4 output feature maps, which are temporarily stored in the output feature map buffer.
At the end of time T3, all point values of the output feature maps of the four channels have been computed.
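The four-cycle channel-tiling schedule can likewise be sketched as a behavioral model (a sketch under assumptions; the function name and the use of NumPy are illustrative, and the per-channel 1x1 weights are made-up values). Each cycle, the same-position pixel of four input channels occupies the four array positions, so four output feature maps are produced in parallel:

```python
import numpy as np

def channel_tiled_outputs(blocks, weights):
    """blocks: four 2x2 input feature map units, shape (4, 2, 2).

    weights: one 1x1 kernel weight per channel. Cycle t processes one
    output position; at that cycle all four channels compute in parallel
    on the four positions of the computing array.
    """
    blocks = np.asarray(blocks, dtype=float)
    weights = np.asarray(weights, dtype=float)
    out = np.empty_like(blocks)
    for r in range(2):
        for c in range(2):   # T0..T3: one output point per cycle
            out[:, r, c] = blocks[:, r, c] * weights   # four-way parallel
    return out

blocks = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]],
          [[9, 10], [11, 12]],
          [[13, 14], [15, 16]]]
res = channel_tiled_outputs(blocks, [1, 2, 3, 4])
```

After four cycles every output point of all four channels is filled, which is the schedule's whole point: no PE position sits idle waiting for a single channel's 2x2 output.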
The embodiment described above is only a preferred specific implementation of the present invention; the usual variations and substitutions made by those skilled in the art within the scope of the technical solution of the present invention should all be included within the scope of protection of the present invention.

Claims (9)

1. A data mapping system for realizing parallel convolution computation, characterized in that the system comprises an input feature map buffer module, a mapping logic module, an output feature map buffer module, a weight buffer module, a convolution computing array and a control logic module; the input feature map buffer module is connected to the control logic module and the mapping logic module; the weight buffer module is connected to the control logic module and the mapping logic module; the convolution computing array is connected to the control logic module, the mapping logic module and the output feature map buffer module; and the output feature map buffer module is connected to the control logic module.
2. The data mapping system for realizing parallel convolution computation according to claim 1, characterized in that the input feature map buffer module serves as a buffer for externally input data; the mapping logic module obtains data from the input feature map buffer module and the weight buffer module according to commands issued by the control logic module and sends the obtained data to the convolution computing array; and the convolution computing array sends the completed results to the output feature map buffer module.
3. The data mapping system for realizing parallel convolution computation according to claim 1 or 2, characterized in that the convolution computing array uses N rows by N columns of convolution computing units, with adjacent convolution computing units interconnected.
4. A data mapping method for realizing parallel convolution computation, characterized in that the method partitions the input feature map into regular blocks and recombines the input feature map by mapping means, increasing the parallelism of convolution computation; the mapping logic sends the data obtained from the recombined input feature map to the convolution computing array; and the convolution computing array sends the completed results to the output feature map buffer module.
5. The data mapping method for realizing parallel convolution computation according to claim 4, characterized in that, when the convolution kernel sliding stride is greater than 1, the portions of the input feature map over which the kernel slides that produce invalid computation are filled with effectively computing portions, and the recombined input feature map is used as the convolution unit input.
6. The data mapping method for realizing parallel convolution computation according to claim 4 or 5, characterized in that said filling of the invalid-computation portions of the kernel sliding with effectively computing portions of the input feature map fills the invalid-computation part of the array with the data of the effectively computing position at the upper right of the matrix, and the data in the input feature map that participate in effective computation are translated rightward and downward and copied into adjacent convolution computing units.
7. The data mapping method for realizing parallel convolution computation according to claim 6, characterized in that the data copied into the adjacent convolution computing units undergo convolution computation with the convolution kernel weight values read from the weight buffer module until the newly combined feature map has traversed all weight values, and the results are sent to the output feature map buffer module.
8. The data mapping method for realizing parallel convolution computation according to claim 4, characterized in that, when the output feature map does not match the computing array size, the multi-channel input feature map is divided into smaller feature map units, and the feature map units at the same position of adjacent channels are recombined into a new input feature map used as the convolution computing array input.
9. The data mapping method for realizing parallel convolution computation according to claim 8, characterized in that the division proportion of the multi-channel input feature map depends on the output feature map size, and the number of channels depends on the convolution computing array size and the output feature map size.
CN201810432269.5A 2018-05-08 2018-05-08 Data mapping system and method for realizing parallel convolution computation Pending CN108647777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810432269.5A CN108647777A (en) 2018-05-08 2018-05-08 Data mapping system and method for realizing parallel convolution computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810432269.5A CN108647777A (en) 2018-05-08 2018-05-08 Data mapping system and method for realizing parallel convolution computation

Publications (1)

Publication Number Publication Date
CN108647777A true CN108647777A (en) 2018-10-12

Family

ID=63749398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810432269.5A Pending CN108647777A (en) 2018-05-08 2018-05-08 Data mapping system and method for realizing parallel convolution computation

Country Status (1)

Country Link
CN (1) CN108647777A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163338A (en) * 2019-01-31 2019-08-23 腾讯科技(深圳)有限公司 Chip operation method, device, terminal and chip with operation array
CN112101284A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Image recognition method, training method, device and system of image recognition model
CN112966807A (en) * 2019-12-13 2021-06-15 上海大学 Convolutional neural network implementation method based on storage resource limited FPGA
CN114429207A (en) * 2022-01-14 2022-05-03 支付宝(杭州)信息技术有限公司 Convolution processing method, device, equipment and medium for feature map
CN114565501A (en) * 2022-02-21 2022-05-31 格兰菲智能科技有限公司 Data loading method and device for convolution operation
CN116306855A (en) * 2023-05-17 2023-06-23 之江实验室 Data processing method and device based on memory and calculation integrated system

Citations (5)

Publication number Priority date Publication date Assignee Title
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection
CN106446546A (en) * 2016-09-23 2017-02-22 西安电子科技大学 Meteorological data complement method based on automatic convolutional encoding and decoding algorithm
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
CN107153873A (en) * 2017-05-08 2017-09-12 中国科学院计算技术研究所 A kind of two-value convolutional neural networks processor and its application method
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection
CN107506828A (en) * 2016-01-20 2017-12-22 南京艾溪信息科技有限公司 Computing device and method
CN106446546A (en) * 2016-09-23 2017-02-22 西安电子科技大学 Meteorological data complement method based on automatic convolutional encoding and decoding algorithm
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
CN107153873A (en) * 2017-05-08 2017-09-12 中国科学院计算技术研究所 A kind of two-value convolutional neural networks processor and its application method
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array

Cited By (11)

Publication number Priority date Publication date Assignee Title
CN110163338A (en) * 2019-01-31 2019-08-23 腾讯科技(深圳)有限公司 Chip operation method, device, terminal and chip with operation array
WO2020156508A1 (en) * 2019-01-31 2020-08-06 腾讯科技(深圳)有限公司 Method and device for operating on basis of chip with operation array, and chip
CN110163338B (en) * 2019-01-31 2024-02-02 腾讯科技(深圳)有限公司 Chip operation method and device with operation array, terminal and chip
CN112966807A (en) * 2019-12-13 2021-06-15 上海大学 Convolutional neural network implementation method based on storage resource limited FPGA
CN112966807B (en) * 2019-12-13 2022-09-16 上海大学 Convolutional neural network implementation method based on storage resource limited FPGA
CN112101284A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Image recognition method, training method, device and system of image recognition model
CN114429207A (en) * 2022-01-14 2022-05-03 支付宝(杭州)信息技术有限公司 Convolution processing method, device, equipment and medium for feature map
CN114565501A (en) * 2022-02-21 2022-05-31 格兰菲智能科技有限公司 Data loading method and device for convolution operation
CN114565501B (en) * 2022-02-21 2024-03-22 格兰菲智能科技有限公司 Data loading method and device for convolution operation
CN116306855A (en) * 2023-05-17 2023-06-23 之江实验室 Data processing method and device based on memory and calculation integrated system
CN116306855B (en) * 2023-05-17 2023-09-01 之江实验室 Data processing method and device based on memory and calculation integrated system

Similar Documents

Publication Publication Date Title
CN108647777A (en) Data mapping system and method for realizing parallel convolution computation
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN108875958A (en) Use the primary tensor processor of outer product unit
CN107590085B (en) A kind of dynamic reconfigurable array data path and its control method with multi-level buffer
CN108564168A (en) A kind of design method to supporting more precision convolutional neural networks processors
CN107852379A (en) For the two-dimentional router of orientation of field programmable gate array and interference networks and the router and other circuits of network and application
CN104200045B (en) The parallel calculating method of a kind of basin large scale water system sediments formula hydrodynamic model
CN104145281A (en) Neural network computing apparatus and system, and method therefor
CN108875956A (en) Primary tensor processor
CN109447241A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field
CN107506329B (en) A kind of coarse-grained reconfigurable array and its configuration method of automatic support loop iteration assembly line
CN110163354A (en) A kind of computing device and method
CN105373517A (en) Spark-based distributed matrix inversion parallel operation method
CN105426918B (en) Normalize associated picture template matching efficient implementation method
US20220222513A1 (en) Neural network processor system and methods of operating and forming thereof
CN102214086A (en) General-purpose parallel acceleration algorithm based on multi-core processor
CN104239595B (en) For realizing the method and apparatus for design planning and the system level design tool of framework exploration
CN102497411A (en) Intensive operation-oriented hierarchical heterogeneous multi-core on-chip network architecture
CN105468439A (en) Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN103761072A (en) Coarse granularity reconfigurable hierarchical array register file structure
CN108805285A (en) A kind of convolutional neural networks pond unit design method
CN111079078B (en) Lower triangular equation parallel solving method for structural grid sparse matrix
CN108875957B (en) Primary tensor processor and the system for using primary tensor processor
Xu et al. CMSA: Configurable multi-directional systolic array for convolutional neural networks
CN102446342A (en) Reconfigurable binary arithmetical unit, reconfigurable binary image processing system and basic morphological algorithm implementation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181012