WO2020087742A1 - Processing element, apparatus and method for implementing convolution operations - Google Patents

Processing element, apparatus and method for implementing convolution operations (Download PDF)

Info

Publication number
WO2020087742A1
WO2020087742A1 · PCT/CN2018/124828 · CN2018124828W
Authority
WO
WIPO (PCT)
Prior art keywords
data
dma unit
buffer
unit
memory
Prior art date
Application number
PCT/CN2018/124828
Other languages
English (en)
French (fr)
Inventor
黎立煌
李炜
曹庆新
Original Assignee
深圳云天励飞技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术有限公司
Publication of WO2020087742A1 publication Critical patent/WO2020087742A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present invention relates to the field of convolutional neural network computing, and in particular to a processing element, apparatus, and method for implementing convolution operations.
  • Hardware for processing deep neural networks (DNNs) falls into two broad classes of architecture: temporal architectures (SIMD/SIMT) and spatial architectures (dataflow processing).
  • Embodiments of the present invention provide a processing element, apparatus, and method for implementing convolution operations, to solve the problems that existing hardware architectures for processing deep neural networks cannot perform a two-dimensional convolution operation and have a low data reuse rate.
  • According to one aspect, a processing element for implementing a convolution operation includes: a first buffer configured to store input data and weights corresponding to the convolution operation; a shift unit configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units configured to perform at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data and to generate output data.
  • The shift operation performed by the shift unit includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as first edge data to form data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
  • According to another aspect, a method for performing a convolution operation includes: acquiring input data and weights corresponding to the convolution operation; performing a shift operation on the input data to generate first intermediate data; and performing at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data to generate output data.
  • In the method, the shift operation includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as first edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
  • In summary, the processing element has a shift unit capable of acquiring data from adjacent processing elements.
  • The shift unit can set the acquired data on both sides of the input data as first edge data, thereby forming data to be shifted, and can shift the data to be shifted to generate first intermediate data, thereby improving the data reuse rate.
  • The processing element also performs, through a plurality of operation units, at least a part of the two-dimensional convolution operation based on the weights and the first intermediate data, so that a complete two-dimensional convolution operation can be carried out within a single processing element.
  • FIG. 1 is a schematic diagram of a temporal architecture of hardware for processing convolutional neural networks according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of matrix multiplication according to an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of a relaxed form of Toeplitz matrix multiplication according to an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of a spatial architecture of hardware for processing convolutional neural networks according to an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of a weight-fixed spatial architecture according to an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of an output-fixed spatial architecture according to an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of a spatial architecture with no local reuse according to an embodiment of the present invention;
  • FIGS. 8A to 8C are schematic diagrams of a process in which a processing element completes a one-dimensional convolution according to an embodiment of the present invention;
  • FIG. 9 is a schematic diagram of a group of processing elements performing a two-dimensional convolution operation according to an embodiment of the present invention;
  • FIG. 10 is a schematic diagram of a processing element for implementing convolution operations according to an embodiment of the present invention;
  • FIG. 11 is a schematic diagram of the processing element of FIG. 10 performing a convolution operation;
  • FIG. 12 is a schematic diagram of an apparatus for implementing convolution operations according to an embodiment of the present invention;
  • FIG. 13 is a schematic diagram of a data flow of the apparatus of FIG. 12;
  • FIG. 14 is a schematic diagram of the instruction storage and dispatch flow of the apparatus of FIG. 12;
  • FIG. 15 is a schematic diagram of another data flow of the apparatus of FIG. 12;
  • FIG. 16 is a schematic diagram of the slicing of the data memory of the apparatus of FIG. 12;
  • FIG. 17 is a schematic diagram of the way input data and output data are stored in the apparatus of FIG. 12;
  • FIG. 18 is a schematic diagram of the apparatus of FIG. 12 vertically cutting input data that is too wide;
  • FIG. 19 is a schematic diagram of the apparatus of FIG. 12 horizontally cutting input data that is too high; and
  • FIG. 20 is a schematic flowchart of a method for implementing a convolution operation according to an embodiment of the present invention.
  • The temporal architecture of hardware for processing convolutional neural networks shown in FIG. 1 typically uses techniques such as SIMD or SIMT to perform multiply-accumulate (MAC) operations in parallel. All ALUs share the same control logic and register file. On these platforms, both the fully connected layer (FC layer) and the convolution layer (CONV layer) are usually mapped to matrix multiplication.
  • The matrix multiplication shown in FIG. 2 multiplies a filter matrix with M rows and C·H·W columns by an input fmap matrix with C·H·W rows and N columns, finally producing an output fmap with M rows and N columns.
  • For the convolution layer, a relaxed form of Toeplitz matrix multiplication can be used. FIG. 3 shows a convolutional layer with 2 input fmaps and 2 output fmaps. The disadvantage of Toeplitz matrix multiplication is that the fmap matrix contains redundant data, which lowers storage efficiency and data-transfer bandwidth efficiency.
  • The fast Fourier transform (FFT) can also be used to reduce the number of multiplications: the filter and the input fmap are first converted to the "frequency domain", the multiplication is performed there, and an inverse FFT then yields the "time domain" output fmap.
  • Other methods include the Strassen and Winograd algorithms. They rearrange the computation so that the number of multiplications can be reduced from O(N^3) to O(N^2.807); for a 3×3 filter, the number of multiplications can be reduced by 2.25×. The price is reduced numerical stability, increased storage requirements, and special handling for different parameter sizes.
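  • As an illustration of the Toeplitz (im2col) lowering described above, the following minimal Python sketch (not part of the patent; array shapes and function names are assumptions) unrolls a stride-1 convolution into a matrix multiplication, which also makes the redundant data in the lowered fmap matrix visible.

```python
import numpy as np

def im2col(fmap, kh, kw):
    """Lower a C x H x W input fmap into a (C*kh*kw) x (H_out*W_out) Toeplitz-style
    matrix. Each column holds one receptive field, so overlapping fields are
    duplicated, which is the redundancy the text refers to."""
    c, h, w = fmap.shape
    h_out, w_out = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, h_out * w_out), dtype=fmap.dtype)
    for y in range(h_out):
        for x in range(w_out):
            cols[:, y * w_out + x] = fmap[:, y:y + kh, x:x + kw].reshape(-1)
    return cols

def conv_as_matmul(fmap, filters):
    """filters: M x C x kh x kw. Returns an M x H_out x W_out output fmap."""
    m, c, kh, kw = filters.shape
    cols = im2col(fmap, kh, kw)            # (C*kh*kw) x N
    weights = filters.reshape(m, -1)       # M x (C*kh*kw)
    out = weights @ cols                   # M x N matrix multiplication
    h_out, w_out = fmap.shape[1] - kh + 1, fmap.shape[2] - kw + 1
    return out.reshape(m, h_out, w_out)
```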
  • The spatial architecture of hardware for processing convolutional neural networks shown in FIG. 4 adopts a dataflow processing style.
  • The ALUs form a data-processing chain so that data can be passed directly between ALUs.
  • Each ALU has its own control logic and local storage (register file).
  • An ALU with local storage is defined as a processing element (PE).
  • For the spatial architecture, the hardware design relies on low-energy memories in a storage hierarchy and increases data reuse (in essence, convolution is spatial reuse, which exploits spatial invariance) in order to reduce energy consumption.
  • The dataflow controls how data is read, written, and processed.
  • Overall, the spatial architecture balances I/O and computation through hierarchical storage and dataflow, thereby reducing energy consumption and increasing computational throughput.
  • There are four types of data reuse in spatial architectures: weight-fixed, output-fixed, no local reuse (NLR), and row-fixed.
  • FIG. 5 shows a schematic diagram of a weight-fixed spatial architecture.
  • In the weight-fixed scheme, the weights are first read into the register file (RF) of each processing element (PE) and kept unchanged. The input fmap and the partial sums are then moved through the PE array and the global buffer, so that the weights held in the PEs are reused as much as possible.
  • The input fmap is broadcast to all PEs, and the partial sums are accumulated across the PE array.
  • FIG. 6 shows a schematic diagram of an output-fixed spatial architecture. Here the input data is streamed through the PE array while the weight data is broadcast to it, and the accumulation of each partial sum is kept in the register file (RF), thereby minimizing the energy spent reading and writing partial sums.
  • FIG. 7 shows a schematic diagram of a spatial architecture with no local reuse.
  • In this case the RF of the PE array does not hold any fixed data; instead, all data reads and writes are performed in the global buffer. As a result, the traffic of all data types between the PE array and the global buffer increases.
  • In the row-fixed architecture, all data types (weights, input fmap, and partial sums) are stored in the local RF so as to maximize data reuse and overall energy efficiency; each PE handles a one-dimensional convolution, and multiple PEs are aggregated to complete a two-dimensional convolution. FIG. 8A to FIG. 8C show this process.
  • Three PEs can be used, each running one one-dimensional convolution; the partial sums are accumulated vertically across the three PEs to produce the first output row.
  • To generate the second output row, another column of PEs is used, in which the three rows of input activations move down by one row and the same filter rows are used to perform three one-dimensional convolutions. Additional PE columns are added until all output rows are produced (i.e., the number of PE columns equals the number of output rows).
  • Each filter row is reused horizontally across multiple PEs.
  • Each row of input activations is reused diagonally across multiple PEs, and, as shown in FIG. 9, the partial sums of each row are accumulated vertically across the PEs.
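  • For intuition, the short sketch below (illustrative only; names are assumptions, not the prior-art hardware) shows the row-wise decomposition used by the row-fixed scheme: each PE computes a 1-D convolution of one input row with one filter row, and three such partial results are summed vertically to give one output row.

```python
import numpy as np

def conv1d_row(input_row, filter_row):
    """1-D convolution of one input row with one filter row, i.e., what a single
    PE computes in the row-fixed scheme ('valid' positions only)."""
    k = len(filter_row)
    return np.array([np.dot(input_row[j:j + k], filter_row)
                     for j in range(len(input_row) - k + 1)])

def output_row(input_rows, filter_rows):
    """Vertical accumulation of the three PEs' partial sums gives one output row."""
    return sum(conv1d_row(r, f) for r, f in zip(input_rows, filter_rows))

x = np.arange(1, 16).reshape(3, 5)   # three rows of five input activations
w = np.ones((3, 3))                  # placeholder 3x3 filter
print(output_row(x, w))              # the first output row (length 3)
```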
  • As described above, when performing convolution operations on images, hardware with a temporal architecture usually uses a relaxed form of Toeplitz matrix multiplication, but its drawback is that the fmap matrix contains redundant data, which leads to low storage efficiency and low data-transfer bandwidth efficiency.
  • Although the fast Fourier transform can be used to reduce the number of multiplications, it reduces numerical stability and increases storage requirements as well as the special handling needed for different weight sizes.
  • Hardware based on the spatial architecture exploits the spatial correlation of CNN computation to avoid memory read/write bottlenecks.
  • However, the spatial architecture still has a low data reuse rate.
  • For example, the row-fixed architecture can only implement one-dimensional convolution operations; that is, only one row of the filter weight matrix and one row of the input fmap matrix can take part in a convolution operation.
  • In view of this, an embodiment of the present invention provides a processing element (PE) for implementing convolution operations, which performs computation over multiple related data and weight streams to achieve maximum data reuse.
  • Moreover, the processing element can carry out a complete two-dimensional convolution operation within a single processing element.
  • The processing element PE for implementing the convolution operation includes: a first buffer 11 (i.e., an input buffer) configured to store input data and weights corresponding to the input data; a shift unit 12 configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units 13 configured to perform at least a part of the convolution operation based on the weights and the first intermediate data and to generate output data.
  • The shift operation performed by the shift unit 12 includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
  • Each PE is a SIMD processor with a data width of m (or a vector processor with a data width of m).
  • n PEs are linked together to form a one-dimensional PE array of length n, so that data can flow in both directions along this one-dimensional PE array (for example, by means of the shift unit in each PE).
  • Each PE has its own instruction buffer (IQ), instruction decoding, and control logic.
  • Each PE can perform an independent convolutional neural network (CNN) computation.
  • FIG. 11 shows a schematic diagram of the shift operation performed by a processing element (PE) on the input data and the subsequent convolution operation.
  • The input data received by the processing element (PE) is:
    7 6 5 4 3 2 1
    23 22 21 20 19 18 17
    39 38 37 36 35 34 33
    55 54 53 52 51 50 49
  • The processing element acquires data from the adjacent processing elements and sets the acquired data on both sides of the input data as edge data, thereby forming the data to be shifted:
    8 7 6 5 4 3 2 1 0
    24 23 22 21 20 19 18 17 16
    40 39 38 37 36 35 34 33 32
    56 55 54 53 52 51 50 49 48
  • The data (8, 24, 40, 56) and the data (0, 16, 32, 48) are obtained from the adjacent processing elements and are placed on both sides of the input data as edge data.
  • The shift unit then further shifts the data to be shifted. In the first cycle, the data used for computation is: 7 6 5 4 3 2 1.
  • In the second cycle, the data to be shifted is shifted to the left, so the data used for computation is: 6 5 4 3 2 1 0.
  • In the third cycle, the data to be shifted is shifted to the right, so the data used for computation is: 8 7 6 5 4 3 2. Proceeding in this way, shifts in different directions over nine cycles supply the data used for computation, as shown in FIG. 11.
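  • The following Python sketch (an illustration only; function and variable names are assumptions, not the patent's implementation) mimics the behaviour described above: a PE borrows one edge column from each neighbour, forms the data to be shifted, and then selects a left-shifted, unshifted, or right-shifted view of width m for each cycle.

```python
import numpy as np

def form_shift_data(own_block, left_edge, right_edge):
    """own_block: h x m input data of this PE.
    left_edge / right_edge: h x 1 columns borrowed from the neighbouring PEs.
    Returns the h x (m + 2) data to be shifted."""
    return np.hstack([left_edge, own_block, right_edge])

def shifted_view(shift_data, offset):
    """offset in {-1, 0, +1} selects the m-wide window (right-shifted, unshifted,
    or left-shifted) that is fed to the m MAC units in a given cycle."""
    m = shift_data.shape[1] - 2
    return shift_data[:, 1 + offset:1 + offset + m]

# The 4 x 7 block of FIG. 11 plus the edge columns borrowed from the neighbours:
own = np.array([[ 7,  6,  5,  4,  3,  2,  1],
                [23, 22, 21, 20, 19, 18, 17],
                [39, 38, 37, 36, 35, 34, 33],
                [55, 54, 53, 52, 51, 50, 49]])
left, right = np.array([[8], [24], [40], [56]]), np.array([[0], [16], [32], [48]])
sd = form_shift_data(own, left, right)
print(shifted_view(sd, 0)[0])    # cycle 1:     7 6 5 4 3 2 1
print(shifted_view(sd, +1)[0])   # left shift:  6 5 4 3 2 1 0
print(shifted_view(sd, -1)[0])   # right shift: 8 7 6 5 4 3 2
```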
  • Through the shift operation performed by the shift unit described above, a two-dimensional convolution operation can be performed inside each processing element PE.
  • Compared with spatial architectures that require multiple processing elements PE to carry out a convolution operation (for example weight-fixed, output-fixed, no local reuse (NLR), and row-fixed), the processing element PE provided in this embodiment can realize a two-dimensional convolution operation on its own and further improves the data reuse rate.
  • The shift operation performed by the shift unit further includes sending the edge data on both sides of the input data to the adjacent processing elements. Thus, while acquiring data from the adjacent processing elements PE, the processing element PE of this embodiment also sends the edge data on both sides of its own input data to the adjacent processing elements.
  • For example, for the input data shown above, the processing element also sends the data (7, 23, 39, 55) and the data (1, 17, 33, 49) to the adjacent processing elements PE.
  • During the CNN computation, each PE can obtain data from its two adjacent PEs through the shift unit (shift/mux) and can also provide data to its two adjacent PEs.
  • The shift unit mainly implements the following functions: i) receive data from the first buffer; ii) receive data from the neighbouring PEs; iii) send its edge data to the neighbouring PEs; iv) set the acquired data on both sides of the input data as edge data to form the data to be shifted, and perform a shift operation (a left or right shift) on the data to be shifted to generate the first intermediate data; and v) provide the first intermediate data to the plurality of operation units.
  • In addition to "borrowing" data from the neighbouring PEs, the PE also "lends" its own boundary data to its two neighbouring PEs.
  • The shift unit therefore achieves the purpose of shifting the input data while the neighbouring PEs exchange edge data, which guarantees the completeness of the convolution operation within the PE.
  • Specifically, if no data were obtained from the neighbouring PEs, the convolution operation performed by the PE could only use the PE's own input data (that is, the 7-column block shown above).
  • In that case, when the filter formed by the weight parameters shown in FIG. 11 is used for the shift and convolution operations, only 5 columns of data can be produced as the convolution result. As the output of the PE's convolution operation, the first and last columns would therefore be empty or would need zero padding; in other words, the two-dimensional convolution in this case is incomplete.
  • When the processor then uses multiple PEs to convolve larger data in parallel (for example, when the data is wider and several PEs must perform the convolution side by side), the empty or zero-padded columns in each PE's output would make the processor's overall convolution result incorrect.
  • To address this, the PE of this embodiment first obtains data from the neighbouring PEs and uses the obtained data as edge data, as in the 9-column block shown above; the convolution result then has a full 7 columns, the first and last columns are neither empty nor zero-padded, and the PE performs a complete convolution over its input data, so that the processor can output a correct convolution result.
  • The plurality of operation units include a plurality of multiply-accumulate units (MAC) 131, a partial-sum adder (PSUM) 132, and a second buffer 133. The plurality of multiply-accumulate units 131 are configured to perform multiply and accumulate operations on the first intermediate data according to the weights and to output second intermediate data.
  • The partial-sum adder 132 is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer 133, and the partial sum computed in each iteration is stored in the second buffer 133 as the partial sum of the output data.
  • Each PE has m MAC units.
  • Within each PE, the input data of the m MACs are linked together to form a one-dimensional data vector of length m.
  • n PEs therefore have n*m MAC units.
  • The n data vectors from the n PEs, each of length m, are linked together to form a one-dimensional data vector of length n*m.
  • This one-dimensional data vector of length n*m can be shifted right or left.
  • The shifted data is fed to the n*m MAC units.
  • The first buffer 11 may include, for example, an input data buffer and a weight data buffer.
  • With reference to FIG. 11, the convolution computation of the MAC units is illustrated by example.
  • The italic data inside the dashed frame is the overlapping (shared) data obtained from the two adjacent PEs; this shared data is placed on the left and right sides of the input data matrix.
  • The 3×3 letter matrix represents the weights of the filter, and the filter weights are stored in the weight data buffer (WBUF) of the first buffer 11.
  • The bold black numbers represent the 7 MAC units, each label corresponding to one MAC unit; the other data is the input data obtained from the PE's input data buffer (IBUF).
  • Assuming the MAC units use a 3×3 filter with 9 weights, each MAC unit in a single PE performs 9 multiply-accumulate operations, requiring a total of 9 cycles to complete.
  • The CNN computation is performed on the first intermediate data supplied by the shift unit through the plurality of multiply-accumulate units to obtain the convolution result (i.e., the second intermediate data); the partial-sum adder then accumulates this convolution result with the corresponding partial sum stored in the second buffer to obtain the output data, which is stored in the second buffer. Through these operations, the plurality of operation units complete the convolution-and-accumulation computation on the input data and output the result.
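  • As a rough sketch of how the MAC units and the partial-sum adder cooperate (an assumption-laden illustration, not the patent's hardware; the 3×3 filter and the shifted views follow the FIG. 11 example, and the pairing of weight columns with shift directions is one plausible orientation), the snippet below runs the 9-cycle multiply-accumulate for the 7 MAC lanes of one PE and then lets the partial-sum adder fold the result into the partial sums kept in the second buffer (OBUF), as would happen when accumulating over several input channels.

```python
import numpy as np

def pe_conv_3x3(shift_data, weights):
    """shift_data: 3 x (m + 2) array holding the three input rows that contribute
    to one output row, with one borrowed edge column on each side.
    weights: 3 x 3 filter. Returns the m second-intermediate-data values, one per
    MAC unit, accumulated over 9 cycles (one filter weight per cycle)."""
    m = shift_data.shape[1] - 2
    psum = np.zeros(m)
    for r in range(3):                    # filter row r sees input row r
        for offset in (-1, 0, 1):         # column offset produced by the shift unit
            window = shift_data[r, 1 + offset:1 + offset + m]
            psum += weights[r, 1 - offset] * window   # assumed weight/shift pairing
    return psum

def accumulate_into_obuf(obuf_row, second_intermediate):
    """Partial-sum adder: add the MAC result for one input channel to the partial
    sum already held in the second buffer (OBUF)."""
    obuf_row += second_intermediate
    return obuf_row

# Hypothetical accumulation over several input channels of the same output row:
# obuf = np.zeros(7)
# for ci_rows, w in zip(input_channel_blocks, filters):   # assumed iterables
#     obuf = accumulate_into_obuf(obuf, pe_conv_3x3(ci_rows, w))
```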
  • The first buffer 11 includes a weight data buffer 111 and an input data buffer 112.
  • The input data buffer 112 is configured to store the input data, and the weight data buffer 111 is configured to store the weights corresponding to the input data.
  • Each PE thus includes three local buffers: an input data buffer (IBUF) 112, a weight data buffer (WBUF) 111, and a second buffer (i.e., an output buffer, OBUF) 113.
  • These three local buffers store, respectively, the input data, the weights corresponding to the input data, and the partial sums obtained by the accumulation computation.
  • In summary, this embodiment provides a processing element (PE) for implementing convolution operations that has a shift unit capable of acquiring data from adjacent processing elements; by acquiring data from the two adjacent processing elements as edge data to form the data to be shifted, and by performing shift and convolution operations on the data to be shifted, the PE of the present disclosure can carry out a complete convolution operation with the provided weights.
  • Compared with prior-art spatial architectures that require multiple processing elements PE to implement a convolution operation (for example weight-fixed, output-fixed, no local reuse (NLR), and row-fixed), the processing element PE can realize the two-dimensional convolution operation on its own and further improves the data reuse rate.
  • An embodiment of the present invention also provides an apparatus for implementing convolution operations, which offers a flexible, programmable, high-throughput platform to accelerate convolutional neural networks and related computations.
  • The apparatus includes a plurality of processing elements PE (PE0 to PEn), where each processing element PE includes: a first buffer 11 configured to store input data and weights corresponding to the input data; a shift unit 12 configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units 13 configured to perform at least a part of a convolution operation based on the weights and the first intermediate data and to generate output data.
  • The shift operation performed by the shift unit 12 includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data. Since the specific functions of the processing element PE have been described above, they are not repeated here.
  • The shift operation performed by the shift unit 12 further includes sending the edge data on both sides of the input data to the adjacent processing elements.
  • The plurality of operation units 13 include a plurality of multiply-accumulate units (MAC) 131, a partial-sum adder (PSUM) 132, and a second buffer 133. The plurality of multiply-accumulate units (MAC) 131 are configured to perform multiply and accumulate operations on the first intermediate data according to the weights and to output second intermediate data; the partial-sum adder 132 is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer, and to store the partial sum computed in each iteration into the second buffer 133 as the partial sum of the output data.
  • The first buffer 11 includes an input data buffer 112 for storing the input data and a weight data buffer 111 for storing the weights corresponding to the input data.
  • The apparatus further includes: a first memory (i.e., a weight memory, WM) 201 configured to store weights input from outside the apparatus; a second memory (i.e., a data memory, DM) 202 configured to store the input data; a first DMA unit (i.e., weight DMA, WDMA) 203 configured to write weights from the first memory 201 into the weight data buffer 111 of the first buffer 11; and a second DMA unit (i.e., input-data DMA, IDMA) 204 configured to write input data from the second memory 202 into the input data buffer 112 of the first buffer 11.
  • The second memory 202 and the first memory 201 are SRAM memories inside the apparatus, and the first DMA unit and the second DMA unit are programmable functional units (FUs).
  • "External memory" refers to memory outside the apparatus described in this embodiment, which may be on-chip SRAM or off-chip DDR memory.
  • The apparatus may be composed of a plurality of loosely coupled, cooperating, programmable dataflow functional units (FUs); for example, the first DMA unit 203 and the second DMA unit 204 described above are programmable functional units.
  • These FUs execute multiple interdependent data streams and data computation operations in parallel. The operations can proceed in parallel as long as the dependencies between them are not violated, and each FU can execute its own instruction stream. Therefore, even under timing uncertainty in the various hardware operations, the apparatus can exploit parallelism to the greatest extent and thus achieve the best performance. It can be understood that the apparatus may be a processor or another electronic device.
  • The apparatus further includes: a third DMA unit 205 (external input-data DMA, EIDMA) configured to send the input data from the external memory to the second memory 202; a fourth DMA unit 206 (external weight DMA, EWDMA) configured to send the weights from the external memory to the first memory 201; a fifth DMA unit 208 (output DMA, ODMA) configured to send the output data in the second buffers 133 of the plurality of processing elements to the second memory 202; and a sixth DMA unit 207 (external output-data DMA, EODMA) configured to output the output data from the second memory 202 to the external memory.
  • The second memory 202 is further configured to store the output data sent by the sixth DMA unit.
  • The third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and the sixth DMA unit 207 are programmable functional units (FUs).
  • These FUs execute multiple interdependent data streams and data computation operations in parallel. The operations can proceed in parallel as long as the dependencies between them are not violated, and each FU can execute its own instruction stream. The apparatus can therefore exploit the parallelism of the hardware to the greatest extent and maximize performance when the various hardware operations in the apparatus are subject to timing uncertainty, thereby further reducing the power consumption of the hardware.
  • In this embodiment the processor has 32 PEs, and each PE has 7 MAC units, so the processor has 224 MACs in total.
  • Each PE is a SIMD processor with a data width of 7.
  • Each PE has its own instruction buffer (IQ), instruction decoding, and control logic.
  • Each PE can perform an independent CNN computation. Alternatively, multiple adjacent PEs can be combined to perform one CNN computation together. During the CNN computation, each PE can: a) obtain data from its two adjacent PEs; and b) provide data to its two adjacent PEs.
  • In each PE there are three local buffers: i) IBUF (corresponding to the input data buffer) for storing the input data ci; ii) WBUF (corresponding to the weight data buffer) for storing the weights; and iii) OBUF (corresponding to the second buffer) for storing the partial sums (Psum).
  • Each PE also includes shift/select logic (shift/mux logic).
  • This logic block performs the following functions: i) receives data from IBUF; ii) receives data from the neighbouring PEs; iii) sends its edge data to the neighbouring PEs; iv) performs right- or left-shift operations on these data; and v) provides the shifted data to the 7 MAC units.
  • Each PE further includes the partial-sum adder PSUM, which accumulates the CNN computation result from the MACs with the corresponding partial sum Psum stored in OBUF.
  • The apparatus further includes a control unit 210 and a third memory 209.
  • The third memory 209 stores the programs related to the operation of the apparatus. The control unit 210 is connected to the first DMA unit 203, the second DMA unit 204, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and the sixth DMA unit 207, and is configured to: receive instructions from the third memory 209; execute the instructions related to the operation of the control unit 210; and forward the instructions related to the operation of the first DMA unit 203, the second DMA unit 204, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and/or the sixth DMA unit 207 to the corresponding DMA unit.
  • FIG. 14 shows a schematic flowchart of the instruction storage and dispatch scheme of the control unit 210.
  • All programs of the programmable FUs are stored centrally in the third memory 209. These programs contain instructions for the control unit 210, the 32 PEs, and the 6 DMA units (i.e., the first DMA unit 203, the second DMA unit 204, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and the sixth DMA unit 207). There are eight instruction types in total.
  • The specific flow of the instruction storage and dispatch scheme of the control unit 210 is as follows:
  • the control unit 210 reads the instructions from the third memory 209, executes only the control-unit instructions, and broadcasts all other instructions to the other functional units over the bus cu_ibus;
  • each programmable FU has an instruction queue (IQ); all programmable FUs (except the CU) constantly monitor the bus cu_ibus and load only their own instructions into their respective IQs, from which they then fetch and execute them in order.
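  • A minimal sketch of this dispatch scheme is shown below (purely illustrative; the instruction encoding and FU identifiers are assumptions): the control unit executes its own instructions and broadcasts the rest on cu_ibus, and every other FU filters the broadcast stream into its own instruction queue and consumes it in order.

```python
from collections import deque

class FunctionalUnit:
    def __init__(self, fu_id):
        self.fu_id = fu_id
        self.iq = deque()                 # instruction queue (IQ)

    def snoop(self, instr):
        # Each FU monitors cu_ibus and keeps only instructions addressed to it.
        if instr["target"] == self.fu_id:
            self.iq.append(instr)

    def step(self):
        # Instructions are fetched from the IQ and executed in program order.
        if self.iq:
            print(f"{self.fu_id} executes {self.iq.popleft()['op']}")

class ControlUnit:
    def __init__(self, program, fus):
        self.program, self.fus = program, fus   # program read from the third memory

    def run(self):
        for instr in self.program:
            if instr["target"] == "CU":
                print(f"CU executes {instr['op']}")   # control-unit instructions only
            else:
                for fu in self.fus:                   # broadcast on cu_ibus
                    fu.snoop(instr)

fus = [FunctionalUnit("PE0"), FunctionalUnit("WDMA"), FunctionalUnit("IDMA")]
program = [{"target": "CU", "op": "sync"},
           {"target": "IDMA", "op": "load ci"},
           {"target": "PE0", "op": "conv 3x3"}]
ControlUnit(program, fus).run()
for fu in fus:
    fu.step()
```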
  • FIG. 15 shows a schematic diagram of six data-flow directions in the apparatus for implementing convolution operations according to an embodiment of the present disclosure.
  • As shown in FIG. 15, there are six predefined data and weight streams in the apparatus described in this embodiment. During operation, all data and weights pass through the processor via these six predefined streams.
  • First data stream 3: input data stored in the external memory flows into the second memory 202 (DM) through the third DMA unit 205 (EIDMA).
  • Second data stream 4: input data stored in the second memory 202 (DM) is transmitted to all PEs through the second DMA unit 204 (IDMA); this is a one-to-many broadcast, i.e., the second DMA unit 204 (IDMA) broadcasts to every PE that needs the data.
  • The data stored in the second memory 202 can come from two possible sources: (1) the fifth DMA unit 208 transfers output data from the PEs to the second memory 202; and (2) the third DMA unit 205 transfers data from the external memory to the second memory 202.
  • The data stored in the DM has two possible destinations: (1) the sixth DMA unit 207 can write it back to the external memory as input data for the next layer; and (2) the second DMA unit 204 can read it back into multiple PEs as input data for the next layer.
  • First weight stream 1: weights stored in the external memory flow into the first memory 201 (WM) through the fourth DMA unit (EWDMA).
  • Second weight stream 2: weights stored in the first memory 201 (WM) flow into the weight data buffer 111 (WBUF) of the first buffer 11 in each PE through the first DMA unit 203 (WDMA).
  • When one FU transmits data (or weights) to another FU, the former is called the producer FU or upstream FU, and the latter is called the consumer FU or downstream FU.
  • Two such FUs are called related FUs, or communicating FUs, and a storage buffer is placed between the two communicating FUs.
  • For example, the first memory 201 is the storage buffer between the fourth DMA unit 206 (producer FU) and the first DMA unit 203 (consumer FU);
  • the second memory 202 is likewise a storage buffer between several pairs of related FUs;
  • the input data buffer 112 is the storage buffer between the second memory 202 (producer FU) and the computing hardware of the processing element (consumer FU);
  • the second buffer 133 is the storage buffer between the PE computing hardware (producer FU) and the fifth DMA unit 208 (consumer FU); and
  • the weight data buffer 111 is the storage buffer between the first DMA unit 203 (producer FU) and the computing hardware of the processing element PE (consumer FU).
  • Table 4 provides detailed information on the six data/weight streams: their origin, their destination, the FU responsible for the stream, the possible communicating FU pairs, and the type of synchronization protocol used within those FU pairs.
  • An FU usually performs handshake synchronization with its upstream FU and its downstream FU at the same time: when an FU runs too fast relative to its downstream FU it stalls, and when it runs too slowly relative to its upstream FU it causes the upstream FU to stall.
  • For example, if the second DMA unit 204 (IDMA) runs too fast relative to any of the 32 PEs, it stalls; and if the IDMA runs too slowly relative to the third DMA unit 205 (EIDMA), it causes the EIDMA to stall (assuming the IDMA depends on the EIDMA).
  • Two communicating FUs can use one of the following two synchronization protocols to ensure the integrity of the data (or weights) transferred between them.
  • Hardware handshake: the two communicating FUs use the state of the buffer placed between them ("buffer empty", "buffer full", and so on) to perform the handshake. This prevents the producer from writing data into a buffer that is already full and prevents the consumer from reading data from a buffer that is already empty.
  • Software handshake: the two communicating FUs execute a pair of matching synchronization instructions, i.e., each FU executes one synchronization instruction.
  • When the producer FU executes its synchronization instruction, it establishes a synchronization barrier for the consumer FU.
  • When the consumer FU executes its synchronization instruction, it must ensure that its producer FU has executed the corresponding synchronization instruction; otherwise, the consumer FU stalls until the producer FU has reached the synchronization point (i.e., has executed the corresponding synchronization instruction).
  • An FU may use two different synchronization protocols (hardware and software handshake) with its upstream FU and its downstream FU.
  • In general, if an FU has only a single, unambiguous upstream FU (or downstream FU), it uses the hardware handshake protocol to synchronize with that upstream FU (or downstream FU).
  • Conversely, if an FU has several possible upstream FUs (or downstream FUs), the hardware needs the assistance of software to interact correctly with its upstream FUs (or downstream FUs).
  • For example, the second DMA unit 204 (IDMA) has two possible upstream FUs (the fifth DMA unit 208 and the third DMA unit 205) but only one unique downstream FU (the PEs).
  • The IDMA therefore uses the software protocol to synchronize with the fifth DMA unit 208 and the third DMA unit 205, but uses the hardware protocol to synchronize with all PEs.
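  • The software handshake can be pictured as a pair of matching synchronization instructions around a shared buffer, roughly as in the sketch below (an illustrative model with invented names such as sync_produce/sync_consume, not the patent's instruction set): the producer's sync marks a synchronization point, and the consumer's sync stalls until the producer has reached it.

```python
import threading

class SyncBarrier:
    """Models one matching pair of software-handshake synchronization instructions."""
    def __init__(self):
        self._produced = 0
        self._cv = threading.Condition()

    def sync_produce(self):
        # Producer FU executes its synchronization instruction,
        # establishing a synchronization point for the consumer FU.
        with self._cv:
            self._produced += 1
            self._cv.notify_all()

    def sync_consume(self, point):
        # Consumer FU executes its synchronization instruction and stalls
        # until the producer has reached synchronization point `point`.
        with self._cv:
            while self._produced < point:
                self._cv.wait()

# Example: EIDMA (producer) fills the data memory, IDMA (consumer) reads it.
barrier, dm = SyncBarrier(), []

def eidma():
    dm.extend(["ci block 0", "ci block 1"])   # write input data into the DM
    barrier.sync_produce()                    # producer sync instruction

def idma():
    barrier.sync_consume(1)                   # stalls until the EIDMA has synced
    print("IDMA reads:", dm)

threading.Thread(target=eidma).start()
t = threading.Thread(target=idma); t.start(); t.join()
```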
  • As shown in FIG. 16, the second memory 202 (i.e., the data memory DM) is organized around the processing elements PE.
  • The data memory DM is divided into 32 slices, and each DM slice is 7 data wide (i.e., the width of each DM slice is 7). The DM therefore has 224 data per row (i.e., the width of the DM is 224), matching the total number of MACs.
  • Each DM slice is uniquely mapped to one PE; this is a one-to-one mapping.
  • Each datum in the DM is uniquely mapped to one MAC; this is a many-to-one mapping.
  • The input data (ci) and the output data (co) are collectively referred to as feature maps (fmaps).
  • When an fmap (ci or co) is mapped to the second memory 202 (i.e., the DM), it may span multiple DM slices and multiple PEs, depending on its width.
  • Suppose the size of the fmap is w*h, where w is the width of the feature map and h is its height.
  • For a narrow fmap (w <= 7), the fmap is mapped to one PE (and one DM slice).
  • For a medium-width fmap, multiple PEs can be combined to process a single fmap; this set of PEs is called a PE group. The PE group has ceiling(w/m) PEs, where ceiling() is the integer ceiling function and m is the width of a DM slice.
  • The fmap is mapped to this PE group and to its corresponding DM slices.
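  • A small sketch of this mapping (illustrative only; helper names are assumptions) computes, for a given fmap width, how many PEs form the PE group and how the fmap occupies the DM slices, following the ceiling(w/m) rule above and one reading of the right-alignment rule for widths that are not multiples of 7.

```python
from math import ceil

M_SLICE_WIDTH = 7      # data width of one DM slice (= MACs per PE)
NUM_PES = 32           # number of PEs / DM slices in the example processor

def pe_group_size(w):
    """Number of PEs (= DM slices) needed for an fmap of width w."""
    if w > M_SLICE_WIDTH * NUM_PES:
        raise ValueError("an fmap wider than n*m must first be tiled vertically")
    return ceil(w / M_SLICE_WIDTH)

def map_fmap_to_slices(w, first_pe=0):
    """Return (pe index, lanes used) per PE of the group. If w is not a multiple
    of 7, the fmap is right-aligned in the group, so the upper part of the group
    leaves (m - w % m) MAC lanes unused (assumed to fall in the first PE here)."""
    group = pe_group_size(w)
    unused = (M_SLICE_WIDTH - w % M_SLICE_WIDTH) % M_SLICE_WIDTH
    return [(first_pe + i, M_SLICE_WIDTH - unused if i == 0 else M_SLICE_WIDTH)
            for i in range(group)]

print(pe_group_size(64))        # 10 PEs, as in the 64x50 ci/co example
print(map_fmap_to_slices(64))   # first PE of the group uses only 1 of its 7 lanes
```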
  • FIG. 17 shows how 10 blocks of input data ci and 17 blocks of output data co are stored in the DM.
  • In this example the number of ci blocks is 10, the number of co blocks is 17, and the size of both ci and co is 64x50; the number of PEs (i.e., DM slices) required to process one block of input data ci is ci_gp_size = ceiling(64/7) = 10, and the number of PEs required to process one block of output data co is co_gp_size = ceiling(64/7) = 10.
  • All input data ci are arranged in the DM as a matrix.
  • This matrix is called the CI matrix, or CIM. In this matrix, the number of ci per row = floor(32/ci_gp_size) = 3; for 10 blocks of ci the CIM is 4 rows by 3 columns, and the last row holds only one ci.
  • All output data co are likewise arranged in the DM as a matrix, called the CO matrix, or COM; the number of co per row = floor(32/co_gp_size) = 3, so for 17 blocks of co the COM is 6 rows by 3 columns, with only two co in the last row.
  • The height of the first buffer (IBUF) is 64 rows, so the IBUF can store at most two rows of ci.
  • The height of the second buffer (OBUF) is 64 rows, so the OBUF can store and process at most two rows of Psum (two rows of co).
  • When the height of the COM exceeds the height of the OBUF, the output data co in the COM must be processed in multiple rounds.
  • In the example of FIG. 16, the COM has 6 rows and 3 columns, the OBUF height is 64, the OBUF can hold only floor(64/30) = 2 rows of co, and the number of rounds = ceiling(COM rows / 2) = 3. The convolution (CONV) computation is therefore split into three rounds: the first round produces 6 blocks of output data co, the second round produces 6 blocks, and the last round produces 5 blocks. During each round, all ci are streamed one by one into the first-in-first-out first buffer (IBUF), and in turn the ci at the head of the IBUF is convolved with each individual partial sum (Psum) in the second buffer (OBUF).
  • If the IBUF can store p blocks of input data ci and the OBUF can store q partial sums (Psum), then between the two the hardware can perform p*q CONV computations without generating any data traffic to the DM.
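  • This round-based scheduling can be sketched as below (an illustration with assumed helper names, not the patent's scheduler): it derives the number of rounds from the COM height and the OBUF capacity, and shows that with p ci blocks resident in IBUF and q partial sums resident in OBUF, p*q convolutions run without touching the DM.

```python
from math import ceil

def schedule_rounds(com_rows, obuf_rows_capacity):
    """Number of processing rounds when the COM is taller than the OBUF."""
    return ceil(com_rows / obuf_rows_capacity)

def run_round(ci_blocks_in_ibuf, psums_in_obuf, conv):
    """One round: each ci streamed through the FIFO IBUF is convolved with every
    partial sum currently resident in OBUF, i.e. p*q CONV computations with no
    data traffic to the DM."""
    for ci in ci_blocks_in_ibuf:                       # p blocks
        for k, psum in enumerate(psums_in_obuf):       # q partial sums
            psums_in_obuf[k] = psum + conv(ci)         # accumulate in place
    return psums_in_obuf

print(schedule_rounds(com_rows=6, obuf_rows_capacity=2))   # 3 rounds, as in FIG. 16
# run_round(["ci0", "ci1"], [0, 0], conv=lambda ci: 1)     # 2*2 = 4 CONV computations
```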
  • When an fmap is too wide or too high relative to certain hardware design parameters, it is cut into smaller pieces so that it can be processed in the apparatus described in this embodiment.
  • When the width of an fmap (ci or co) is greater than n*m (i.e., 224), the fmap needs to be cut vertically; in other words, the fmap is cut vertically into multiple vertical tiles, which are then processed one by one.
  • FIG. 18 shows an example of how an input data block ci that is too wide (width 320) is cut vertically.
  • The fmap in the figure is split into two vertical tiles, X and Y, with widths of 224 and 98 respectively, which are then processed one after the other. Two columns of data at the X/Y boundary are shared by the two passes; both shared columns are used in both computations, and the width of this overlapping region is determined by the width of the CNN filter.
  • FIG. 19 shows an example in which the height of an output data block co is 80.
  • The height of the OBUF is 64, so this co is split horizontally into two horizontal tiles, X and Y, with heights of 64 and 18 respectively, which are then processed separately. Two rows of data near the X/Y boundary overlap (are shared); these two shared rows are used in both computations, and the height of the overlapping region is determined by the height of the CNN filter.
  • If necessary, an fmap is tiled both vertically and horizontally.
  • In that case a tile may share overlapping data with its four adjacent tiles.
  • The width of a shared column region is determined by the width of the CNN filter, and the height of a shared row region is determined by the height of the CNN filter.
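  • The tiling rule can be sketched as follows (an illustrative helper, not the patent's exact algorithm): tiles are cut to the maximum width the PE array can hold (n*m = 224) or to the OBUF height, and consecutive tiles overlap by an amount assumed here to be filter_size - 1, consistent with the two shared columns/rows in the FIG. 18 and FIG. 19 examples for a 3x3 filter.

```python
def tile_spans(total, max_tile, overlap):
    """Cut a dimension of length `total` into tiles of at most `max_tile`,
    with `overlap` shared elements between consecutive tiles.
    Returns (start, end) index pairs."""
    assert 0 <= overlap < max_tile
    spans, start = [], 0
    while True:
        end = min(start + max_tile, total)
        spans.append((start, end))
        if end == total:
            return spans
        start = end - overlap            # re-use `overlap` columns or rows

# Vertical tiling of a width-320 ci with n*m = 224 and an assumed 3-wide filter:
print(tile_spans(total=320, max_tile=224, overlap=2))   # [(0, 224), (222, 320)]
# Horizontal tiling of a height-80 co with OBUF height 64 and a 3-high filter:
print(tile_spans(total=80, max_tile=64, overlap=2))     # [(0, 64), (62, 80)]
```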
  • FIG. 20 shows a schematic flowchart of a method for performing a convolution operation, which is executed by the processing element PE. Referring to FIG. 20, the method includes the following steps.
  • S2006: perform at least a part of the two-dimensional convolution operation based on the weights and the first intermediate data, and generate output data.
  • The shift operation includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as first edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
  • The shift operation further includes sending second edge data on both sides of the input data to the adjacent processing elements.
  • Performing at least a part of the two-dimensional convolution operation includes: performing multiply and accumulate operations on the first intermediate data according to the weights and outputting second intermediate data; and iteratively adding the second intermediate data to the corresponding partial sums, storing the partial sum computed in each iteration as a partial sum of the output data.
  • The technical content disclosed above may be implemented in other ways.
  • The apparatus embodiments described above are merely illustrative.
  • The division into units is only a division by logical function.
  • In actual implementation there may be other ways of dividing them; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • The functional units in the embodiments of the present invention may be integrated into one processing element, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the various embodiments of the present invention.
  • The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media capable of storing program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

A processing element, apparatus, and method for implementing convolution operations. The processing element (PE) includes: a first buffer (11) configured to store input data and weights corresponding to the convolution operation; a shift unit (12) configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units (13) configured to perform at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data and to generate output data. The scheme can perform two-dimensional convolution operations, improve the data reuse rate, exploit the parallelism of the hardware to the greatest extent, and reduce the power consumption of the hardware.

Description

Processing element, apparatus and method for implementing convolution operations
This application claims priority to Chinese patent application No. 201811303442.8, entitled "Processing element, apparatus and method for implementing convolution operations", filed with the Chinese Patent Office on November 2, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of convolutional neural network computing, and in particular to a processing element, apparatus, and method for implementing convolution operations.
Background
At present, hardware for processing deep neural networks (DNNs) can be grouped into two classes of architecture: temporal architectures (SIMD/SIMT) and spatial architectures (dataflow processing). However, when hardware of these two classes performs convolution operations, it suffers from problems such as the inability to perform two-dimensional convolution operations and a low data reuse rate.
Summary of the Invention
Embodiments of the present invention provide a processing element, apparatus, and method for implementing convolution operations, to solve the problems that existing hardware architectures for processing deep neural networks cannot perform two-dimensional convolution operations and have a low data reuse rate.
According to one aspect of the embodiments of the present invention, a processing element (PE) for implementing a convolution operation is provided, including: a first buffer configured to store input data and weights corresponding to the convolution operation; a shift unit configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units configured to perform at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data and to generate output data. The shift operation performed by the shift unit includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as first edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
According to another aspect of the embodiments of the present invention, a method for performing a convolution operation is provided, including: acquiring input data and weights corresponding to the convolution operation; performing a shift operation on the input data to generate first intermediate data; and performing at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data to generate output data. The shift operation includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as first edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
In summary, in the present invention the processing element (PE) has a shift unit capable of acquiring data from adjacent processing elements. The shift unit can set the acquired data on both sides of the input data as first edge data, thereby forming data to be shifted, and can further shift the data to be shifted to generate first intermediate data, thereby improving the data reuse rate. The processing element also performs, through a plurality of operation units, at least a part of the two-dimensional convolution operation based on the weights and the first intermediate data, thereby realizing a complete two-dimensional convolution operation within one processing element.
Brief Description of the Drawings
The drawings described here are provided to give a further understanding of the present invention and form a part of it; the illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 is a schematic diagram of a temporal architecture of hardware for processing convolutional neural networks according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of matrix multiplication according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a relaxed form of Toeplitz matrix multiplication according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a spatial architecture of hardware for processing convolutional neural networks according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a weight-fixed spatial architecture according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an output-fixed spatial architecture according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a spatial architecture with no local reuse according to an embodiment of the present invention;
FIGS. 8A to 8C are schematic diagrams of a process in which a processing element completes a one-dimensional convolution according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a group of processing elements performing a two-dimensional convolution operation according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a processing element for implementing convolution operations according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of the processing element of FIG. 10 performing a convolution operation;
FIG. 12 is a schematic diagram of an apparatus for implementing convolution operations according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a data flow of the apparatus of FIG. 12;
FIG. 14 is a schematic diagram of the instruction storage and dispatch flow of the apparatus of FIG. 12;
FIG. 15 is a schematic diagram of another data flow of the apparatus of FIG. 12;
FIG. 16 is a schematic diagram of the slicing of the data memory of the apparatus of FIG. 12;
FIG. 17 is a schematic diagram of the way input data and output data are stored in the apparatus of FIG. 12;
FIG. 18 is a schematic diagram of the apparatus of FIG. 12 vertically cutting input data that is too wide;
FIG. 19 is a schematic diagram of the apparatus of FIG. 12 horizontally cutting input data that is too high; and
FIG. 20 is a schematic flowchart of a method for implementing a convolution operation according to an embodiment of the present invention.
Detailed Description
Referring to FIG. 1, the temporal architecture of hardware for processing convolutional neural networks shown in FIG. 1 typically uses techniques such as SIMD or SIMT to perform multiply-accumulate (MAC) operations in parallel. All ALUs share the same control logic and register file. On these platforms, both the fully connected layer (FC layer) and the convolution layer (CONV layer) are usually mapped to matrix multiplication.
Referring to FIG. 2, the matrix multiplication shown in FIG. 2 multiplies a filter matrix with M rows and a number of columns equal to the channel size (CHW) by an input fmap matrix with CHW rows and N columns, finally producing an output fmap with M rows and N columns.
For the convolution layer, a relaxed form of Toeplitz matrix multiplication can be used. FIG. 3 shows a convolutional layer with 2 input fmaps and 2 output fmaps. For the convolution layer (CONV layer), the disadvantage of Toeplitz matrix multiplication is that the fmap matrix contains redundant data, which lowers storage efficiency and data-transfer bandwidth efficiency.
In addition, the fast Fourier transform (FFT) can be used to reduce the number of multiplications: the filter and the input fmap are first converted to the "frequency domain", the multiplication is performed there, and an inverse FFT then yields the "time domain" output fmap. Other methods include the Strassen and Winograd algorithms, which rearrange the computation so that the number of multiplications can be reduced from O(N^3) to O(N^2.807); for a 3×3 filter, the number of multiplications can be reduced by 2.25×. The price is reduced numerical stability, increased storage requirements, and special handling for different parameter sizes.
Referring to FIG. 4, the spatial architecture of hardware for processing convolutional neural networks shown in FIG. 4 adopts a dataflow processing style. In the spatial architecture, the ALUs form a data-processing chain so that data can be passed directly between ALUs. In this architecture each ALU has its own control logic and local storage (register file); an ALU with local storage is defined as a processing element (PE). For the spatial architecture, the hardware design relies on low-energy memories in a storage hierarchy and increases data reuse (in essence, convolution is spatial reuse, which exploits spatial invariance) to reduce energy consumption. In addition, the dataflow controls how data is read, written, and processed. Overall, the spatial architecture balances I/O and computation through hierarchical storage and dataflow, thereby reducing energy consumption and increasing computational throughput. There are four types of data reuse in spatial architectures: weight-fixed, output-fixed, no local reuse (NLR), and row-fixed.
FIG. 5 shows a schematic diagram of a weight-fixed spatial architecture. In the weight-fixed scheme, the weights are first read into the register file (RF) of each processing element (PE) and kept unchanged. The input fmap and the partial sums are then moved through the PE array and the global buffer, so that the weights held in the PEs are reused as much as possible. The input fmap is broadcast to all PEs, and the partial sums are accumulated across the PE array.
FIG. 6 shows a schematic diagram of an output-fixed spatial architecture, in which the input data is streamed through the PE array while the weight data is broadcast to it, and the accumulation of each partial sum is kept in the register file (RF), thereby minimizing the energy spent reading and writing partial sums.
FIG. 7 shows a schematic diagram of a spatial architecture with no local reuse, in which the RF of the PE array does not hold any fixed data; instead, all data reads and writes are performed in the global buffer. In this case, the traffic of all data types between the PE array and the global buffer increases.
In the row-fixed architecture, all types of data (weights, input fmap, and partial sums) are stored in the local RF, so as to maximize data reuse and improve overall energy efficiency.
Each PE handles a one-dimensional convolution, and multiple PEs are aggregated to complete a two-dimensional convolution. FIGS. 8A to 8C show the process by which a PE completes a one-dimensional convolution. In addition, referring to FIG. 9, three PEs can be used, each running one one-dimensional convolution; the partial sums are accumulated vertically across the three PEs to produce the first output row. To generate the second output row, another column of PEs is used, in which the three rows of input activations move down by one row and the same filter rows are used to perform three one-dimensional convolutions. Additional PE columns are added until all output rows are produced (i.e., the number of PE columns equals the number of output rows). In this architecture each filter row is reused horizontally across multiple PEs, and each row of input activations is reused diagonally across multiple PEs. As shown in FIG. 9, the partial sums of each row are further accumulated vertically across the PEs.
As described above, when performing convolution operations on images, hardware with a temporal architecture usually uses a relaxed form of Toeplitz matrix multiplication, whose drawback is that the fmap matrix contains redundant data, which leads to low storage efficiency and low data-transfer bandwidth efficiency. Although fast-Fourier-transform methods can be used to reduce the number of multiplications, they reduce numerical stability and increase storage requirements as well as the special handling needed for different weight sizes. Hardware based on the spatial architecture exploits the spatial correlation of CNN computation to avoid memory read/write bottlenecks, but the spatial architecture still has a low data reuse rate; for example, the row-fixed architecture can only implement one-dimensional convolution operations, that is, only one row of the filter weight matrix and one row of the input fmap matrix can take part in a convolution operation. Implementing a more complex two-dimensional convolution operation then requires multiple processing elements working together. These structures therefore prevent the processor hardware from fully exploiting hardware parallelism, and prevent it from maximizing performance when the various hardware operations inside the processor are subject to timing uncertainty, which further increases the power consumption of the hardware processor.
In view of the above problems, an embodiment of the present invention provides a processing element (PE) for implementing convolution operations, which performs computation over multiple related data and weight streams to achieve maximum data reuse. Moreover, the processing element can complete a full two-dimensional convolution operation within a single processing element.
Referring to FIG. 10, the processing element PE for implementing the convolution operation includes: a first buffer 11 (i.e., an input buffer) configured to store input data and weights corresponding to the input data; a shift unit 12 configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units 13 configured to perform at least a part of the convolution operation based on the weights and the first intermediate data and to generate output data. The shift operation performed by the shift unit 12 includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
When performing a convolution operation on, for example, an image, multiple PEs are usually employed, each performing convolution operations on the image data of a different part of the image. Each PE is a SIMD processor with a data width of m (or a vector processor with a data width of m). n PEs are linked together to form a one-dimensional PE array of length n, so that data can flow in both directions along this one-dimensional PE array (for example, by means of the shift unit in each PE). In addition, each PE has its own instruction buffer IQ, instruction decoding and control logic, and so on. Each PE can perform an independent convolutional neural network (CNN) computation; alternatively, multiple adjacent PEs can be combined to perform one CNN computation together.
FIG. 11 shows a schematic diagram of the shift operation performed by a processing element (PE) on the input data and the subsequent convolution operation.
Referring to FIG. 11, the input data received by the processing element (PE) is:
7 6 5 4 3 2 1
23 22 21 20 19 18 17
39 38 37 36 35 34 33
55 54 53 52 51 50 49
The processing element (PE) then acquires data from the adjacent processing elements and sets the acquired data on both sides of the input data as edge data, thereby forming the data to be shifted:
8 7 6 5 4 3 2 1 0
24 23 22 21 20 19 18 17 16
40 39 38 37 36 35 34 33 32
56 55 54 53 52 51 50 49 48
Here, the data (8, 24, 40, 56) and the data (0, 16, 32, 48) are obtained from the adjacent processing elements and are placed on both sides of the input data as edge data.
The shift unit then further shifts the data to be shifted. Specifically, in the first cycle, the data used for computation is:
7 6 5 4 3 2 1
In the second cycle, the data to be shifted is shifted to the left, so the data used for computation is:
6 5 4 3 2 1 0
In the third cycle, the data to be shifted is shifted to the right, so the data used for computation is:
8 7 6 5 4 3 2
Proceeding in the same way, as shown in FIG. 11, shifts are performed in different directions over nine cycles to obtain the data used for computation.
Through the shift operation performed by the shift unit described above, a two-dimensional convolution operation can be performed inside each processing element PE. Compared with spatial architectures that require multiple processing elements PE to carry out a convolution operation (for example weight-fixed, output-fixed, no local reuse (NLR), and row-fixed), the processing element PE provided in this embodiment can realize a two-dimensional convolution operation on its own and further improves the data reuse rate.
Further, the shift operation performed by the shift unit also includes sending the edge data on both sides of the input data to the adjacent processing elements. Thus, while acquiring data from the adjacent processing elements PE, the processing element PE of this embodiment also sends the edge data on both sides of its own input data to the adjacent processing elements. For example, as described above, the input data received by the processing element PE is:
7 6 5 4 3 2 1
23 22 21 20 19 18 17
39 38 37 36 35 34 33
55 54 53 52 51 50 49
so the processing element also sends the data (7, 23, 39, 55) and the data (1, 17, 33, 49) to the adjacent processing elements PE.
During the CNN computation, each PE can obtain data from its two adjacent PEs through the shift unit (shift/mux) and can also provide data to its two adjacent PEs. The shift unit mainly implements the following functions: i) receive data from the first buffer; ii) receive data from the neighbouring PEs; iii) send its edge data to the neighbouring PEs; iv) set the acquired data on both sides of the input data as edge data to form the data to be shifted, and perform a shift operation (a right or left shift) on the data to be shifted to generate the first intermediate data; and v) provide the first intermediate data to the plurality of operation units.
In addition, besides "borrowing" data from the neighbouring PEs, the PE also "lends" its own boundary data to its two neighbouring PEs. The amount of shared (overlapping) data depends on the width of the filter weights and is given by: amount of shared data per side = floor(W_filter/2), where W_filter is the width of the filter weights and floor() is the integer floor function. For example, if W_filter = 3, the amount of shared data per side is 1.
Thus, the shift unit achieves the purpose of shifting the input data, while the exchange of edge data between neighbouring PEs guarantees the completeness of the convolution operation within the PE.
Specifically, taking the example above, if no data were obtained from the neighbouring PEs, the convolution operation performed by the PE could only use the PE's own input data (i.e., a 7-column input block), as shown below:
7 6 5 4 3 2 1
23 22 21 20 19 18 17
39 38 37 36 35 34 33
55 54 53 52 51 50 49
In this case, when the filter formed by the weight parameters shown in FIG. 11 is used for the shift and convolution operations, only 5 columns of data can be produced as the convolution result. This means that, as the output of the PE's convolution operation, the first and last columns are empty or require zero padding; that is, the two-dimensional convolution in this case is incomplete. Then, when the processor uses multiple PEs to convolve larger data in parallel (for example, when the data is wider and several PEs must perform the convolution side by side), the empty or zero-padded columns in each PE's output would make the processor's convolution result incorrect.
To address this, before operating, the PE of this embodiment first obtains data from the neighbouring PEs and uses the obtained data as edge data, as shown below:
8 7 6 5 4 3 2 1 0
24 23 22 21 20 19 18 17 16
40 39 38 37 36 35 34 33 32
56 55 54 53 52 51 50 49 48
In this case, when the filter formed by the weight parameters shown in FIG. 11 is used for the shift and convolution operations, the convolution result has 7 columns. This means that, as the output of the PE's convolution operation, the first and last columns are neither empty nor in need of zero padding; that is, the PE performs a complete convolution operation on the input data. Then, when the processor uses multiple PEs to convolve larger data in parallel (for example, when the data is wider and several PEs must perform the convolution side by side), the output data of every PE is complete, so the processor can output a correct convolution result.
Further, the plurality of operation units include a plurality of multiply-accumulate units (MAC) 131, a partial-sum adder (PSUM) 132, and a second buffer 133. The plurality of multiply-accumulate units 131 are configured to perform multiply and accumulate operations on the first intermediate data according to the weights and to output second intermediate data; the partial-sum adder 132 is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer 133 and to store the partial sum computed in each iteration into the second buffer 133 as the partial sum of the output data.
Referring to FIG. 11, each PE has m MAC units. Within each PE, the input data of the m MACs are linked together to form a one-dimensional data vector of length m; n PEs therefore have n*m MAC units. The n data vectors from the n PEs, each of length m, are linked together to form a one-dimensional data vector of length n*m, which can be shifted right or left. The shifted data is fed to the n*m MAC units. In this embodiment it is assumed that there are 32 PEs, each being a SIMD processor with a data width of 7 and having 7 MAC units, so there are 224 MACs in total.
In addition, the first buffer 11 may include, for example, an input data buffer and a weight data buffer.
Referring to FIG. 11, the convolution computation of the MAC units is illustrated by example. Note that the italic data inside the dashed frame is the overlapping (shared) data obtained from the two adjacent PEs, and this shared data is placed on the left and right sides of the input data matrix; the 3×3 letter matrix represents the weights of the filter, and the filter weights are stored in the weight data buffer (WBUF) of the first buffer 11; the bold black numbers represent the 7 MAC units, each label corresponding to one MAC unit; the other data is the input data obtained from the PE's input data buffer (IBUF).
Assuming the MAC units use a 3×3 filter whose weights are supplied from the weight data buffer, with 9 weights per filter, each MAC unit in a single PE performs 9 multiply-accumulate operations, requiring a total of 9 cycles to complete.
Referring to Table 1 and Table 2, the computation in each cycle is as follows: in the first three cycles (cycles 1 to 3), the MACs multiply and accumulate the first row of input data (ci) with the first row of filter weights; in the middle three cycles (cycles 4 to 6), the MACs multiply and accumulate the second row of input data (ci) with the second row of filter weights; and in the last three cycles (cycles 7 to 9), the MACs multiply and accumulate the third row of input data (ci) with the third row of filter weights. After the 9 cycles of multiply and accumulate operations, the final result is shown in Table 2.
Table 1. Computation results of each cycle
[Table 1 is provided as an image (PCTCN2018124828-appb-000001) in the original publication and is not reproduced here.]
Table 2. Accumulated results after 9 cycles

MAC unit label    Accumulated result after 9 cycles
6                 7b+6a+8c+23e+22d+34f+39h+38g+40i
5                 6b+5a+7c+22e+21d+23f+38h+37g+39i
4                 5b+4a+6c+21e+20d+22f+37h+36g+38i
3                 4b+3a+5c+20e+19d+21f+36h+35g+37i
2                 3b+2a+4c+19e+18d+20f+35h+34g+36i
1                 2b+a+3c+18e+17d+19f+34h+33g+35i
0                 b+0+2c+17e+16d+18f+33h+32g+34i
Thus, the plurality of multiply-accumulate units perform the CNN computation on the first intermediate data supplied by the shift unit to obtain the convolution result (i.e., the second intermediate data); the partial-sum adder then accumulates this convolution result with the corresponding partial sum stored in the second buffer to obtain the output data, which is stored in the second buffer. Through these operations, the plurality of operation units complete the convolution-and-accumulation computation on the input data and output the result.
Further, the first buffer 11 includes a weight data buffer 111 and an input data buffer 112, where the input data buffer 112 is configured to store the input data and the weight data buffer 111 is configured to store the weights corresponding to the input data.
Each PE thus includes three local buffers: an input data buffer (IBUF) 112, a weight data buffer (WBUF) 111, and a second buffer (i.e., an output buffer, OBUF) 113. These three local buffers store, respectively, the input data, the weights corresponding to the input data, and the partial sums obtained by the accumulation computation.
In summary, this embodiment provides a processing element (PE) for implementing convolution operations that has a shift unit capable of acquiring data from adjacent processing elements; by acquiring data from the two adjacent processing elements as edge data to form the data to be shifted, and by performing shift and other convolution-related operations on the data to be shifted, the PE of the present disclosure can carry out a complete convolution operation with the provided weights. Compared with prior-art spatial architectures that require multiple processing elements PE to implement a convolution operation (for example weight-fixed, output-fixed, no local reuse (NLR), and row-fixed), the processing element PE provided in this embodiment can realize the two-dimensional convolution operation on its own and further improves the data reuse rate.
Referring to FIG. 12, an embodiment of the present invention also provides an apparatus for implementing convolution operations, which offers a flexible, programmable, high-throughput platform to accelerate convolutional neural networks and related computations. As shown in FIG. 12, the apparatus includes a plurality of processing elements PE 200 (PE0 to PEn), where each processing element PE includes: a first buffer 11 configured to store input data and weights corresponding to the input data; a shift unit 12 configured to perform a shift operation on the input data to generate first intermediate data; and a plurality of operation units 13 configured to perform at least a part of a convolution operation based on the weights and the first intermediate data and to generate output data. The shift operation performed by the shift unit 12 includes: acquiring data from adjacent processing elements; setting the acquired data on both sides of the input data as edge data, thereby forming data to be shifted; and shifting the data to be shifted to generate the first intermediate data. Since the specific functions of the processing element PE have been described above, they are not repeated here.
Further, the shift operation performed by the shift unit 12 also includes sending the edge data on both sides of the input data to the adjacent processing elements.
Further, the plurality of operation units 13 include a plurality of multiply-accumulate units (MAC) 131, a partial-sum adder (PSUM) 132, and a second buffer 133, where the plurality of multiply-accumulate units (MAC) 131 are configured to perform multiply and accumulate operations on the first intermediate data according to the weights and to output second intermediate data; and the partial-sum adder 132 is configured to iteratively add the second intermediate data to the corresponding partial sum stored in the second buffer and to store the partial sum computed in each iteration into the second buffer 133 as the partial sum of the output data.
Further, the first buffer 11 includes an input data buffer 112 for storing the input data and a weight data buffer 111 for storing the weights corresponding to the input data. The apparatus further includes: a first memory (i.e., a weight memory, WM) 201 configured to store weights input from outside the apparatus; a second memory (i.e., a data memory, DM) 202 configured to store the input data; a first DMA unit (i.e., weight DMA, WDMA) 203 configured to write weights from the first memory 201 into the weight data buffer 111 of the first buffer 11; and a second DMA unit (i.e., input-data DMA, IDMA) 204 configured to write input data from the second memory 202 into the input data buffer 112 of the first buffer 11.
Here, the second memory 202 and the first memory 201 are SRAM memories inside the apparatus, and the first DMA unit and the second DMA unit are programmable functional units (FUs). Note that "external memory" refers to memory outside the apparatus described in this embodiment, which may be on-chip SRAM or off-chip DDR memory.
It can be understood that the apparatus may be composed of a plurality of loosely coupled, cooperating, programmable dataflow functional units (FUs); for example, the first DMA unit 203 and the second DMA unit 204 described above are both programmable functional units.
These FUs execute multiple interdependent data streams and data computation operations in parallel. The operations can proceed in parallel as long as the dependencies between them are not violated, and each FU can execute its own instruction stream. Therefore, even under timing uncertainty in the various hardware operations, the apparatus can exploit parallelism to the greatest extent and thus achieve the best performance. It can be understood that the apparatus may be a processor or another electronic device.
Further, referring to FIG. 12, the apparatus also includes: a third DMA unit 205 (external input-data DMA, EIDMA) configured to send the input data from the external memory to the second memory 202; a fourth DMA unit 206 (external weight DMA, EWDMA) configured to send the weights from the external memory to the first memory 201; a fifth DMA unit 208 (output DMA, ODMA) configured to send the output data in the second buffers 133 of the plurality of processing elements to the second memory 202; and a sixth DMA unit 207 (external output-data DMA, EODMA) configured to output the output data from the second memory 202 to the external memory. The second memory 202 is further configured to store the output data sent by the sixth DMA unit.
Here, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and the sixth DMA unit 207 are programmable functional units (FUs).
These FUs execute multiple interdependent data streams and data computation operations in parallel. The operations can proceed in parallel as long as the dependencies between them are not violated, and each FU can execute its own instruction stream. The apparatus can therefore exploit the parallelism of the hardware to the greatest extent and maximize performance when the various hardware operations in the apparatus are subject to timing uncertainty, thereby further reducing the power consumption of the hardware.
Specifically, referring to FIG. 13, in this embodiment it is assumed that the processor has 32 PEs, each with 7 MAC units, so the processor has 224 MACs.
Each PE is a SIMD processor with a data width of 7, and each PE has its own instruction buffer (IQ), instruction decoding and control logic, and so on.
Each PE can perform an independent CNN computation; alternatively, multiple adjacent PEs can be combined to perform one CNN computation together. During the CNN computation, each PE can: a) obtain data from its two adjacent PEs; and b) provide data to its two adjacent PEs.
In each PE there are three local buffers: i) IBUF (corresponding to the input data buffer) for storing the input data ci; ii) WBUF (corresponding to the weight data buffer) for storing the weights; and iii) OBUF (corresponding to the second buffer) for storing the partial sums Psum.
Each PE also includes shift/select logic (shift/mux logic). This logic block performs the following functions: i) receives data from IBUF; ii) receives data from the neighbouring PEs; iii) sends its edge data to the neighbouring PEs; iv) performs right- or left-shift operations on these data; and v) provides the shifted data to the 7 MAC units.
Each PE further includes the partial-sum adder PSUM, which accumulates the CNN computation result from the MACs with the corresponding partial sum Psum stored in OBUF.
An example sequence of the specific convolution computation is shown in FIG. 11. In addition, Table 3 below shows the function of each functional unit (FU) in the apparatus:
Table 3
[Table 3 is provided as images (PCTCN2018124828-appb-000002 and PCTCN2018124828-appb-000003) in the original publication and is not reproduced here.]
In addition, referring to FIG. 12, the apparatus optionally also includes a control unit 210 and a third memory 209. The third memory 209 stores the programs related to the operation of the apparatus. The control unit 210 is connected to the first DMA unit 203, the second DMA unit 204, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and the sixth DMA unit 207, and is configured to: receive instructions from the third memory 209; execute the instructions related to the operation of the control unit 210; and forward the instructions related to the operation of the first DMA unit 203, the second DMA unit 204, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208 and/or the sixth DMA unit 207 to the corresponding DMA unit.
FIG. 14 shows a schematic flowchart of the instruction storage and dispatch scheme of the control unit 210. Referring to FIG. 14, in this embodiment all programs of the programmable FUs are stored centrally in the third memory 209. These programs contain instructions for the control unit 210, the 32 PEs, and the 6 DMA units (i.e., the first DMA unit 203, the second DMA unit 204, the third DMA unit 205, the fourth DMA unit 206, the fifth DMA unit 208, and the sixth DMA unit 207). There are eight instruction types in total. The specific flow of the instruction storage and dispatch scheme of the control unit 210 is as follows:
first, the control unit 210 reads the instructions from the third memory 209, executes only the control-unit instructions, and broadcasts all other instructions to the other functional units over the bus cu_ibus;
second, each programmable FU has an instruction queue (IQ); all programmable FUs (except the CU) constantly monitor the bus cu_ibus and load only their own instructions into their respective IQs;
finally, these FUs fetch instructions from their IQs and execute them in order.
In this way, the programs of the multiple programmable FUs together form one complete, coordinated, consistent program, enabling CNN acceleration in the apparatus described in this embodiment.
此外,图15示出了本公开实施例所述的实现卷积运算的装置中的六个数据流走向的示意图。参考图15所示,在本实施例所述的装置中有六个预定义的数据和权值流。在本实施例所述的装置的工作中,所有数据和权值都是通过这六个预定义的流通过处理器。
在这六个数据流中,四个是数据流(用实线表示),另外两个是权值流,用虚线表示。
(a)四个数据流
i)第一数据流3,存储在外部存储器中的输入数据是通过第三DMA单元205(EIDMA)流入第二存储器202(DM);
ii)第二数据流4,存储在第二存储器202(DM)中的输入数据是通过第二DMA单元204(IDMA)传输到所有PE中,这是种一对多的广播,即:从第二DMA单元204(IDMA)广播到所有需要它的PE中;
iii)第三数据流5,存储在各个处理元件PE的第二缓冲器133(OBUF)中的数据是通过第五DMA单元208(ODMA)传输到第二存储器202中,这是种同步传输操作,也就是说,所有PE以锁步方式将自己的输出数据同步写回第二存储器202,且在每个周期中最多可以将224个数据写回第二存储器202;
iv)第四数据流6,存储在第二存储器202(DM)中的输出数据通过第六DMA单元207(EODMA)传输到外部存储器。
需要说明的是,存储在第二存储器202(DM)中的数据可以来自两个可能的来源:(1)第五DMA单元208将输出数据从PE传输到第二存储器202;(2)第三DMA单元205将外部存储器的数据传输到第二存储器202。
并且,存储在DM中的数据有两个可能的目的地:(1)第六DMA单元207可将它们写回到外部存储器,作为下一层的输入数据;(2)第二DMA单元204可以将它们读回多个PE里,作为下一层的输入数据。
(b) Two weight flows
i) First weight flow 1: the weights stored in the external memory flow into the first memory 201 (WM) through the fourth DMA unit 206 (EWDMA).
ii) Second weight flow 2: the weights stored in the first memory 201 (WM) flow into the weight data buffer 111 (WBUF) of the first buffer 11 in the PEs through the first DMA unit 203 (WDMA).
It should also be noted that when one FU transfers data (or weights) to another FU, the former is called the producer FU or upstream FU, and the latter is called the consumer FU or downstream FU. The two FUs are referred to as related FUs, or communicating FUs. A storage buffer is placed between two communicating FUs. For example,
i) the first memory 201 is the storage buffer between the fourth DMA unit 206 (producer FU) and the first DMA unit 203 (consumer FU);
ii) the second memory 202 is simultaneously the storage buffer between the following pairs of related FUs:
(1) the third DMA unit 205 (producer FU) and the second DMA unit 204 (consumer FU),
(2) the fifth DMA unit 208 (producer FU) and the sixth DMA unit 207 (consumer FU),
(3) the fifth DMA unit 208 (producer FU) and the second DMA unit 204 (consumer FU).
iii) within each PE,
(1) the input data buffer 112 is the storage buffer between the second DMA unit 204 (producer FU) and the computation hardware of the processing element (consumer FU);
(2) the second buffer 133 is the storage buffer between the PE computation hardware (producer FU) and the fifth DMA unit 208 (consumer FU);
(3) the weight data buffer 111 is the storage buffer between the first DMA unit 203 (producer FU) and the computation hardware of the processing element PE (consumer FU).
Table 4 provides the details of the six data/weight flows: their origins, their destinations, the FU responsible for each flow, the possible pairs of communicating FUs, and the type of synchronization protocol used between those FU pairs.
Table 4. Flow information for the six data/weight flows
(Table 4 is provided as image PCTCN2018124828-appb-000004 in the original publication.)
The handshake protocol referred to in Table 4 is further specified below. An FU typically performs handshake synchronization with both its upstream FU and its downstream FU:
a) When an FU runs too fast relative to its downstream FU, it stalls.
b) When an FU runs too slowly relative to its upstream FU, it causes its upstream FU to stall.
For example, if the second DMA unit 204 (IDMA) runs too fast relative to any one of the 32 PEs, it stalls. Likewise, when the second DMA unit 204 (IDMA) runs too slowly relative to the third DMA unit 205 (EIDMA), it causes the EIDMA to stall (assuming the IDMA depends on the EIDMA).
In addition, two communicating FUs can use one of the following two synchronization protocols to guarantee the integrity of the data (or weights) transferred between them:
a) Hardware handshake
In this handshake protocol, the two communicating FUs perform the handshake using the status of the buffer placed between them, such as "buffer empty" or "buffer full". This prevents the producer from writing data into a buffer that is already full, and prevents the consumer from reading data from a buffer that is already empty, and so on.
b) Software handshake
In this handshake protocol, the two communicating FUs execute a matched pair of synchronization instructions, i.e., each FU executes one synchronization instruction. When the producer FU executes its synchronization instruction, it establishes a synchronization barrier for the consumer FU. When the consumer FU executes its synchronization instruction, it must ensure that its producer FU has already executed the corresponding synchronization instruction; otherwise, the consumer FU stalls until the producer FU has reached the synchronization point (i.e., has executed the corresponding synchronization instruction).
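A minimal software model of the two protocols is sketched below in Python, using a bounded FIFO to stand in for the buffer-status hardware handshake and a semaphore to stand in for the matched pair of sync instructions; both stand-ins are illustrative assumptions, not the actual circuit.
    # Minimal models of the two handshake protocols; the FIFO and semaphore are illustrative stand-ins.
    import queue, threading

    # a) Hardware handshake: the state of the shared buffer stalls producer and consumer.
    shared_buffer = queue.Queue(maxsize=8)   # storage buffer placed between the two communicating FUs
    def producer_write(item):
        shared_buffer.put(item)              # blocks (producer FU stalls) while the buffer is full
    def consumer_read():
        return shared_buffer.get()           # blocks (consumer FU stalls) while the buffer is empty

    # b) Software handshake: a matched pair of sync instructions forms a barrier.
    sync_point = threading.Semaphore(0)
    def producer_sync():
        sync_point.release()                 # producer FU executes its sync instruction: barrier reached
    def consumer_sync():
        sync_point.acquire()                 # consumer FU stalls here until the producer reached its sync point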
An FU may use two different synchronization protocols (hardware or software handshake) for handshaking with its upstream FU and its downstream FU. In general, if an FU has only a single, unambiguous upstream FU (or downstream FU), it uses the hardware handshake protocol to synchronize with that upstream FU (or downstream FU). Conversely, if an FU has multiple possible upstream FUs (or downstream FUs), the hardware needs software assistance to interact correctly with its upstream FUs (or downstream FUs) in the data flow. For example, the second DMA unit 204 (IDMA) has two possible upstream FUs (the fifth DMA unit 208 and the third DMA unit 205) but only one unique downstream FU (the PEs). Therefore, the IDMA uses the software protocol to synchronize with the fifth DMA unit 208 and the third DMA unit 205, but uses the hardware protocol to synchronize with all PEs.
In addition, referring to FIG. 16, the second memory 202, i.e., the data memory DM, is organized according to the processing elements PE. The data memory DM is divided into 32 slices, and each DM slice is 7 data wide (i.e., the width of each DM slice is 7). Each DM row therefore holds 224 data (i.e., the width of the DM is 224), matching the total number of MACs. Each DM slice is uniquely mapped to one PE; this is a one-to-one mapping. Each data in the DM is uniquely mapped to one MAC; this is a many-to-one mapping.
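Under the example parameters of FIG. 16 (32 slices of 7 data each), the two mappings can be expressed by the following small Python sketch; the function names are illustrative only:
    # Illustrative mapping of DM slices/columns to PEs and MAC lanes (example parameters from FIG. 16).
    NUM_PES, SLICE_WIDTH = 32, 7
    DM_ROW_WIDTH = NUM_PES * SLICE_WIDTH      # 224 data per DM row, matching the 224 MACs

    def dm_slice_to_pe(slice_index):
        return slice_index                    # one-to-one: DM slice i belongs to PE i

    def dm_column_to_mac(column):
        pe = column // SLICE_WIDTH            # which PE (DM slice) the column belongs to
        lane = column % SLICE_WIDTH           # which of that PE's 7 MAC lanes; every data in this column
        return pe, lane                       # maps to the same MAC, hence a many-to-one mapping

    print(dm_column_to_mac(200))              # -> (28, 4)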
In this embodiment, the input data (ci) and the output data (co) are collectively referred to as feature maps (fmaps). When an fmap (ci or co) is mapped to the second memory 202 (i.e., the DM), it may span multiple DM slices and multiple PEs depending on its width.
Assume the size of the fmap is w*h, where w is the width of the feature map and h is its height. Depending on the size of the fmap, different ways of mapping the fmap to the DM and the PEs are chosen, as follows (a sketch of this case analysis is given after the list):
(1) Narrow fmaps (i.e., w <= 7): the fmap is mapped to one PE (and one DM slice).
(2) Medium-width fmaps (i.e., w > 7 and w <= 224):
If w is divisible by 7, multiple PEs can be combined to process a single fmap together. This group of PEs is called a PE group. The PE group has ceiling(w/m) PEs, where ceiling() is the integer ceiling function and m is the width of a DM slice. The fmap is mapped to this PE group and its corresponding DM slices.
If w is not divisible by 7, the fmap is right-aligned to the right boundary of the PE group. In this case, (m − modulo(w, m)) MAC units at the upper end of the PE group are not used during the CNN computation, where modulo() is the integer modulo function.
(3) Wide fmaps: if w > 224, the fmap is cut vertically. That is, it is cut into multiple vertical tiles, each of width less than or equal to 224.
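The case analysis above can be summarized by the following illustrative Python sketch (assuming m = 7 and 32 PEs as in the example processor; the helper name is hypothetical):
    import math

    SLICE_WIDTH, NUM_PES = 7, 32                   # m and n of the example processor
    MAX_ROW_WIDTH = SLICE_WIDTH * NUM_PES          # 224

    def map_fmap_width(w):
        """Return (pe_group_size, unused_macs, needs_vertical_tiling) for an fmap of width w."""
        if w > MAX_ROW_WIDTH:                      # (3) wide fmap: cut vertically into tiles first
            return NUM_PES, 0, True
        pe_group_size = math.ceil(w / SLICE_WIDTH) # (1) narrow fmap uses 1 PE; (2) medium uses ceiling(w/m) PEs
        remainder = w % SLICE_WIDTH
        unused_macs = 0 if remainder == 0 else SLICE_WIDTH - remainder
        # when w is not a multiple of m, the fmap is right-aligned in its PE group and
        # (m - modulo(w, m)) MAC lanes of the group are left unused during the CNN computation
        return pe_group_size, unused_macs, False

    print(map_fmap_width(5), map_fmap_width(64), map_fmap_width(320))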
FIG. 17 shows how 10 blocks of input data ci and 17 blocks of output data co are stored in the DM. In this example, the number of ci is 10, the number of co is 17, the size of each ci is 64x50, and the size of each co is 64x50. The number of PEs (i.e., the number of DM slices) needed to process one block of input data ci is ci_gp_size = ceiling(64/7) = 10, and the number of PEs (i.e., the number of DM slices) needed to process one block of output data co is co_gp_size = ceiling(64/7) = 10.
All input data ci are arranged in the DM as a matrix, which we call the CI matrix, or CIM. In this matrix, the number of ci per row = floor(32/ci_gp_size) = 3; for 10 blocks of ci, the CIM is therefore 4 rows by 3 columns, and the last row of the matrix contains only one ci.
All output data co are arranged in the DM as a matrix, which we call the CO matrix, or COM. In this matrix, the number of co per row = floor(32/co_gp_size) = 3; for 17 blocks of output data co, the COM is therefore 6 rows by 3 columns, and the last row of the matrix contains only two co.
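A small Python sketch, using only the quantities defined above, reproduces the CIM/COM layout numbers of this example (the function name is illustrative):
    import math

    def dm_matrix_layout(num_fmaps, fmap_width, slice_width=7, num_pes=32):
        """Arrange fmaps of equal width into a CIM/COM-style matrix in DM.
        Returns (rows, fmaps_per_row, fmaps_in_last_row)."""
        gp_size = math.ceil(fmap_width / slice_width)   # PEs (DM slices) needed per fmap
        per_row = num_pes // gp_size                    # fmaps that fit side by side in one matrix row
        rows = math.ceil(num_fmaps / per_row)
        last_row = num_fmaps - per_row * (rows - 1)
        return rows, per_row, last_row

    # The example of FIG. 17: 10 ci of width 64 -> 4 x 3 CIM with one ci in the last row;
    # 17 co of width 64 -> 6 x 3 COM with two co in the last row.
    assert dm_matrix_layout(10, 64) == (4, 3, 1)
    assert dm_matrix_layout(17, 64) == (6, 3, 2)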
The height of the first buffer (IBUF) is 64 rows, so the IBUF can store at most two rows of ci.
The height of the second buffer (OBUF) is 64 rows, so the OBUF can store and process at most two rows of Psum (two rows of co).
When the COM is taller than the OBUF, the output data co in the COM must be processed in multiple rounds. Referring to FIG. 16, the COM has 6 rows and 3 columns, the OBUF height is 64, and the OBUF can only hold floor(64/30) = 2 rows of co, so the number of rounds = ceiling(number of COM rows / 2) = 3. The COM matrix has six rows of co in total, but the OBUF can only hold two rows of co at a time, so the convolution (CONV) computation must be split into three rounds: the first round produces 6 blocks of output data co, the second round produces 6 blocks of output data co, and the last round produces 5 blocks of output data co. In each round, all ci are streamed one by one into the first buffer (IBUF), which operates first-in-first-out (FIFO). In turn, the input data ci at the head of the IBUF is convolved (CONV) with each individual partial sum (Psum) in the second buffer (OBUF).
If the IBUF can store p blocks of input data ci and the OBUF can store q partial sums Psum, then between the two the hardware can perform p*q CONV computations without generating any data traffic to the DM.
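The round count and the number of DM-traffic-free CONV computations can be checked with the following illustrative Python snippet (the function names are assumptions):
    import math

    def num_rounds(com_rows, co_rows_per_obuf):
        """Rounds needed when the CO matrix is taller than what the OBUF can hold at once."""
        return math.ceil(com_rows / co_rows_per_obuf)

    def conv_count_without_dm_traffic(p, q):
        """With p ci blocks resident in IBUF and q partial sums resident in OBUF,
        p*q CONV computations can be performed without any data traffic to the DM."""
        return p * q

    assert num_rounds(6, 2) == 3                  # the 6-row COM of the example is processed in 3 rounds
    print(conv_count_without_dm_traffic(3, 6))    # e.g. 3 ci in IBUF and 6 psums in OBUF -> 18 CONVs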
In addition, when an fmap is too wide or too tall relative to certain hardware design parameters, it is cut into smaller pieces so that it can be processed inside the device of this embodiment.
When the width of an fmap (ci or co) is greater than n*m (i.e., 224), the fmap needs to be cut vertically. That is, the fmap is cut into multiple vertical tiles, which are then processed one after another.
FIG. 18 shows an example of how an over-wide input data ci (of width 320) is cut vertically. Referring to FIG. 18, the fmap in the figure is split into two vertical tiles, X and Y, of widths 224 and 98 respectively, which are then processed one after another in two passes. Two columns of data at the boundary of X and Y are shared by the two passes; these two shared columns are used in both rounds of computation. The width of this overlapping region is determined by the width of the CNN filter.
When the height of an output data co is greater than the height of the OBUF, the co is cut horizontally into multiple horizontal tiles, which are processed one after another. Adjacent tiles share rows of data, and these shared rows are used twice in the CNN computation. FIG. 19 shows an example of an output data co of height 80. Referring to FIG. 19, in this example processor the OBUF height is 64, so this co is split horizontally into two horizontal tiles, X and Y, of heights 64 and 18 respectively, which are then processed one after another. Two rows of data near the boundary of X and Y overlap (are shared); these two shared rows are used in both rounds of computation. The height of this overlapping region is determined by the height of the CNN filter.
Furthermore, when the fmap is wider than 224 and taller than 64, the fmap needs to be tiled both vertically and horizontally. A tile may then have overlapping data with each of its four neighboring tiles. The width of the shared columns is determined by the width of the CNN filter, and the height of the shared rows is determined by the height of the CNN filter.
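As an illustration of the tiling rule, the following Python sketch computes the vertical tiles of an over-wide fmap, assuming the overlap equals the filter width minus one, which reproduces the 224/98 split of the width-320 example above; the same logic would apply to horizontal tiling with the OBUF height and the filter height.
    def vertical_tiles(fmap_width, max_tile_width=224, filter_width=3):
        """Cut an over-wide fmap into vertical tiles; adjacent tiles share (filter_width - 1) columns.
        The overlap rule is an assumption that reproduces the 224/98 split of the width-320 example."""
        overlap = filter_width - 1
        tiles, start = [], 0
        while start + max_tile_width < fmap_width:
            tiles.append((start, start + max_tile_width))   # [start, end) column range of this tile
            start += max_tile_width - overlap               # the next tile re-reads the shared columns
        tiles.append((start, fmap_width))
        return tiles

    assert vertical_tiles(320) == [(0, 224), (222, 320)]    # tiles X (width 224) and Y (width 98)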
FIG. 20 is a schematic flowchart of a method for performing a convolution operation, which is executed by the processing element PE described above. Referring to FIG. 20, the method includes:
S2002: obtaining input data and the weights corresponding to the convolution operation;
S2004: performing a shift operation on the input data to generate first intermediate data;
S2006: performing at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data, and generating output data.
The shift operation includes: acquiring data from adjacent processing elements; placing the acquired data on both sides of the input data as first edge data to form the data to be shifted; and shifting the data to be shifted to generate the first intermediate data.
Optionally, the shift operation further includes: sending second edge data on both sides of the input data to the adjacent processing elements.
Optionally, performing at least a part of the two-dimensional convolution operation includes: performing multiply and accumulate operations on the first intermediate data according to the weights and outputting second intermediate data; and iteratively adding the second intermediate data to the corresponding stored partial sums and storing the partial sum computed in each iteration as a partial sum of the output data.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing element, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A processing element (PE) for implementing a convolution operation, comprising:
    a first buffer (11) configured to store input data and weights corresponding to the convolution operation;
    a shift unit (12) configured to perform a shift operation on the input data to generate first intermediate data; and
    a plurality of operation units (13) configured to perform at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data, and to generate output data, wherein
    the shift operation performed by the shift unit (12) comprises:
    acquiring data from adjacent processing elements;
    placing the acquired data on both sides of the input data as first edge data to form data to be shifted; and
    shifting the data to be shifted to generate the first intermediate data.
  2. The processing element according to claim 1, wherein the shift operation performed by the shift unit (12) further comprises: sending second edge data on both sides of the input data to the adjacent processing elements.
  3. The processing element according to claim 2, wherein the plurality of operation units comprise a plurality of multiply-accumulate units (131), a partial-sum adder (132), and a second buffer (133), wherein
    the plurality of multiply-accumulate units (131) are configured to perform multiply and accumulate operations on the first intermediate data according to the weights and to output second intermediate data; and
    the partial-sum adder (132) is configured to iteratively add the second intermediate data to the corresponding partial sums stored in the second buffer (133), and to store the partial sum computed in each iteration into the second buffer as a partial sum of the output data.
  4. A device for implementing a convolution operation, comprising a plurality of processing elements (PE), wherein each processing element comprises:
    a first buffer (11) configured to store input data and weights corresponding to the convolution operation;
    a shift unit (12) configured to perform a shift operation on the input data to generate first intermediate data; and
    a plurality of operation units (13) configured to perform at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data, and to generate output data, wherein
    the shift operation performed by the shift unit (12) comprises:
    acquiring data from adjacent processing elements;
    placing the acquired data on both sides of the input data as first edge data to form data to be shifted; and
    shifting the data to be shifted to generate the first intermediate data.
  5. The device according to claim 4, wherein the shift operation performed by the shift unit (12) further comprises: sending second edge data on both sides of the input data to the adjacent processing elements.
  6. The device according to claim 5, wherein the plurality of operation units comprise a plurality of multiply-accumulate units (131), a partial-sum adder (132), and a second buffer (133), wherein
    the plurality of multiply-accumulate units (131) are configured to perform multiply and accumulate operations on the first intermediate data according to the weights and to output second intermediate data; and
    the partial-sum adder (132) is configured to iteratively add the second intermediate data to the corresponding partial sums stored in the second buffer, and to store the partial sum computed in each iteration into the second buffer (133) as a partial sum of the output data.
  7. The device according to claim 6, wherein the first buffer (11) comprises an input data buffer (112) for storing the input data and a weight data buffer (111) for storing the weights corresponding to the convolution operation, and wherein the device further comprises:
    a first memory (201) configured to store the weights input from outside the device;
    a second memory (202) configured to store the input data input from outside the device;
    a first DMA unit (203) configured to write the weights from the first memory into the weight data buffer (111); and
    a second DMA unit (204) configured to write the data in the second memory (202) into the input data buffer (112).
  8. The device according to claim 7, further comprising:
    a third DMA unit (205) configured to send the input data from an external memory to the second memory;
    a fourth DMA unit (206) configured to send the weights from the external memory to the first memory;
    a fifth DMA unit (208) configured to send the output data in the second buffers of the plurality of processing elements to the second memory (202); and
    a sixth DMA unit (207) configured to output the output data from the second memory (202) to the external memory.
  9. The device according to claim 8, further comprising a control unit (210) and a third memory (209), wherein
    the third memory (209) stores programs related to the operation of the device; and
    the control unit (210) is connected to the first DMA unit (203), the second DMA unit (204), the third DMA unit (205), the fourth DMA unit (206), the fifth DMA unit (208), and the sixth DMA unit (207), and is configured to:
    receive instructions from the third memory (209);
    execute instructions related to the operation of the control unit (210); and
    forward instructions related to the operation of the first DMA unit (203), the second DMA unit (204), the third DMA unit (205), the fourth DMA unit (206), the fifth DMA unit (208), and/or the sixth DMA unit (207) to the first DMA unit (203), the second DMA unit (204), the third DMA unit (205), the fourth DMA unit (206), the fifth DMA unit (208), and/or the sixth DMA unit (207), respectively.
  10. A method for implementing a convolution operation, comprising:
    obtaining input data and weights corresponding to the convolution operation;
    performing a shift operation on the input data to generate first intermediate data; and
    performing at least a part of a two-dimensional convolution operation based on the weights and the first intermediate data, and generating output data, wherein
    the shift operation comprises:
    acquiring data from adjacent processing elements;
    placing the acquired data on both sides of the input data as first edge data to form data to be shifted; and
    shifting the data to be shifted to generate the first intermediate data.
PCT/CN2018/124828 2018-11-02 2018-12-28 Processing element, device and method for implementing convolution operations WO2020087742A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811303442.8 2018-11-02
CN201811303442.8A CN111144545B (zh) 2018-11-02 2018-11-02 Processing element, device and method for implementing convolution operations

Publications (1)

Publication Number Publication Date
WO2020087742A1 true WO2020087742A1 (zh) 2020-05-07

Family

ID=70462542

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124828 WO2020087742A1 (zh) 2018-11-02 2018-12-28 Processing element, device and method for implementing convolution operations

Country Status (2)

Country Link
CN (1) CN111144545B (zh)
WO (1) WO2020087742A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814957A (zh) * 2020-06-28 2020-10-23 深圳云天励飞技术有限公司 神经网络运算方法及相关设备
CN111898743A (zh) * 2020-06-02 2020-11-06 深圳市九天睿芯科技有限公司 一种cnn加速方法及加速器
WO2023140778A1 (en) * 2022-01-18 2023-07-27 Agency For Science, Technology And Research Convolution engine and methods of operating and forming thereof

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200183837A1 (en) * 2018-12-07 2020-06-11 Samsung Electronics Co., Ltd. Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning
CN113312285B (zh) * 2021-06-11 2023-08-18 西安微电子技术研究所 一种卷积神经网络加速器及其工作方法
CN113486200A (zh) * 2021-07-12 2021-10-08 北京大学深圳研究生院 一种稀疏神经网络的数据处理方法、处理器和***

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106789063A (zh) * 2016-12-05 2017-05-31 济南大学 一种基于卷积和循环双重编码的双因子认证方法
WO2017214968A1 (en) * 2016-06-17 2017-12-21 Nokia Technologies Oy Method and apparatus for convolutional neural networks
WO2018125220A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Systems, methods, and apparatuses for implementing opc modeling via machine learning on simulated 2d optical images for sed and post sed processes
CN108415881A (zh) * 2017-02-10 2018-08-17 耐能股份有限公司 卷积神经网络的运算装置及方法
CN108681984A (zh) * 2018-07-26 2018-10-19 珠海市微半导体有限公司 一种3*3卷积算法的加速电路

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3035204B1 (en) * 2014-12-19 2018-08-15 Intel Corporation Storage device and method for performing convolution operations
US10346350B2 (en) * 2015-10-08 2019-07-09 Via Alliance Semiconductor Co., Ltd. Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor
CN106875011B (zh) * 2017-01-12 2020-04-17 南京风兴科技有限公司 二值权重卷积神经网络加速器的硬件架构及其计算流程

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017214968A1 (en) * 2016-06-17 2017-12-21 Nokia Technologies Oy Method and apparatus for convolutional neural networks
CN106789063A (zh) * 2016-12-05 2017-05-31 济南大学 一种基于卷积和循环双重编码的双因子认证方法
WO2018125220A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Systems, methods, and apparatuses for implementing opc modeling via machine learning on simulated 2d optical images for sed and post sed processes
CN108415881A (zh) * 2017-02-10 2018-08-17 耐能股份有限公司 卷积神经网络的运算装置及方法
CN108681984A (zh) * 2018-07-26 2018-10-19 珠海市微半导体有限公司 一种3*3卷积算法的加速电路

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898743A (zh) * 2020-06-02 2020-11-06 深圳市九天睿芯科技有限公司 一种cnn加速方法及加速器
CN111814957A (zh) * 2020-06-28 2020-10-23 深圳云天励飞技术有限公司 神经网络运算方法及相关设备
CN111814957B (zh) * 2020-06-28 2024-04-02 深圳云天励飞技术股份有限公司 神经网络运算方法及相关设备
WO2023140778A1 (en) * 2022-01-18 2023-07-27 Agency For Science, Technology And Research Convolution engine and methods of operating and forming thereof

Also Published As

Publication number Publication date
CN111144545A (zh) 2020-05-12
CN111144545B (zh) 2022-02-22

Similar Documents

Publication Publication Date Title
WO2020087742A1 (zh) 2020-05-07 Processing element, device and method for implementing convolution operations
US11960566B1 (en) Reducing computations for data including padding
CN107679621B (zh) 人工神经网络处理装置
US20220292049A1 (en) Neural processing accelerator
US9294097B1 (en) Device array topology configuration and source code partitioning for device arrays
US7953684B2 (en) Method and system for optimal parallel computing performance
US20140149715A1 (en) Scalable and programmable computer systems
US9507753B2 (en) Coarse-grained reconfigurable array based on a static router
JP2003216943A (ja) 画像処理装置、この装置に用いられるコンパイラおよび画像処理方法
US9354826B2 (en) Capacity expansion method and device
WO2022179074A1 (zh) 数据处理装置、方法、计算机设备及存储介质
US7596679B2 (en) Interconnections in SIMD processor architectures
US11138106B1 (en) Target port with distributed transactions
JP2011141823A (ja) データ処理装置および並列演算装置
TW202343310A (zh) 用於稀疏神經網路的自適應張量運算核心
JP2022137247A (ja) 複数の入力データセットのための処理
CN103218301B (zh) 用于数字信号处理的存储器访问
US10311557B2 (en) Automated tonal balancing
Hinrichs et al. A 1.3-GOPS parallel DSP for high-performance image-processing applications
CN113703955A (zh) 计算***中数据同步的方法及计算节点
JP6503902B2 (ja) 並列計算機システム、並列計算方法及びプログラム
TWI616813B (zh) 卷積運算方法
Cho et al. Diastolic arrays: throughput-driven reconfigurable computing
US10990408B1 (en) Place and route aware data pipelining
Xiao et al. A hexagonal processor and interconnect topology for many-core architecture with dense on-chip networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18938888

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18938888

Country of ref document: EP

Kind code of ref document: A1