WO2020062284A1 - Image processing method and device based on convolutional neural network, and unmanned aerial vehicle - Google Patents

Image processing method and device based on convolutional neural network, and unmanned aerial vehicle

Info

Publication number
WO2020062284A1
WO2020062284A1 (PCT/CN2018/109190)
Authority
WO
WIPO (PCT)
Prior art keywords: processing, blocks, layer, block, data
Prior art date
Application number
PCT/CN2018/109190
Other languages
English (en)
French (fr)
Inventor
杨康
高明明
谷骞
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2018/109190 priority Critical patent/WO2020062284A1/zh
Priority to CN201880038969.4A priority patent/CN110770740A/zh
Publication of WO2020062284A1 publication Critical patent/WO2020062284A1/zh
Priority to US17/190,378 priority patent/US20210192246A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64DEQUIPMENT FOR FITTING IN OR TO AIRCRAFT; FLIGHT SUITS; PARACHUTES; ARRANGEMENT OR MOUNTING OF POWER PLANTS OR PROPULSION TRANSMISSIONS IN AIRCRAFT
    • B64D47/00Equipment not otherwise provided for
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64UUNMANNED AERIAL VEHICLES [UAV]; EQUIPMENT THEREFOR
    • B64U2101/00UAVs specially adapted for particular uses or applications
    • B64U2101/30UAVs specially adapted for particular uses or applications for imaging, photography or videography

Definitions

  • the present application relates to the field of image processing, and more particularly, to an image processing method and device based on a convolutional neural network.
  • A typical convolutional neural network (CNN) includes a convolutional layer, a pooling layer, an activation layer, and a fully connected layer.
  • Each layer performs corresponding operations based on its input data and outputs the results to the next layer; the initial input data undergoes multiple layers of operations before a final result is obtained.
  • In some implementations, the results of each layer are stored in an off-chip memory, such as a double data rate (DDR) memory, and the next layer reads them from the off-chip memory. In other implementations, the output of the previous layer is stored in the on-chip memory and then operated on, which requires more on-chip storage resources and stronger processing power.
  • Embodiments of the present application provide an image processing method and device based on a convolutional neural network, and a drone, which can implement the calculation of the convolutional neural network when the processing device has limited processing power or limited on-chip storage resources, save storage space, and improve processing efficiency.
  • In a first aspect, an image processing method based on a convolutional neural network is provided, including: reading a 3D feature map from a first on-chip memory in blocks, where the 3D feature map is divided into L blocks; wherein the first on-chip memory includes S first storage spaces, each of the S first storage spaces is respectively used to store one of the L blocks included in the 3D feature map as input data of the current layer of the neural network, and after the input data of one of the L blocks stored in one of the first storage spaces has been read, the input data of another of the L blocks is stored in that first storage space; processing the current layer of the convolutional neural network for the 3D feature map by block; and storing the output result of the current layer to the first on-chip memory; wherein the first on-chip memory further includes R second storage spaces, each of the R second storage spaces is respectively used to store the output data of the current layer of one of the L blocks, and after the output data of one of the L blocks stored in one of the second storage spaces has been read, the output data of another of the L blocks is stored in that second storage space; wherein L, S, and R are integers greater than or equal to 2, and S and R are smaller than L.
  • In a second aspect, an image processing device based on a convolutional neural network is provided, including: a reading unit for reading a 3D feature map from a first on-chip memory in blocks, where the 3D feature map is divided into L blocks; wherein the first on-chip memory includes S first storage spaces, each of the S first storage spaces is respectively used to store one of the L blocks included in the 3D feature map as input data of the current layer of the neural network, and after the input data of one of the L blocks stored in one of the first storage spaces has been read, the input data of another of the L blocks is stored in that first storage space; a processing unit for processing the current layer of the convolutional neural network for the 3D feature map by block; and a storage unit for storing the output result of the current layer to the first on-chip memory; wherein the first on-chip memory further includes R second storage spaces, each of the R second storage spaces is respectively used to store the output data of the current layer of one of the L blocks, and after the output data of one of the L blocks stored in one of the second storage spaces has been read, the output data of another of the L blocks is stored in that second storage space.
  • In a third aspect, an image processing device based on a convolutional neural network is provided, which includes a first on-chip memory and an arithmetic circuit; wherein the arithmetic circuit is configured to: read a 3D feature map from the first on-chip memory in blocks, where the 3D feature map is divided into L blocks; wherein the first on-chip memory includes S first storage spaces, each of the S first storage spaces is respectively used to store one of the L blocks included in the 3D feature map as input data of the current layer of the neural network, and after the input data of one of the L blocks stored in one of the first storage spaces has been read, another one of the L blocks is stored in that first storage space; process the current layer of the convolutional neural network for the 3D feature map by block; and store the output result of the current layer to the first on-chip memory; wherein the first on-chip memory further includes R second storage spaces, each of the R second storage spaces is respectively used to store the output data of the current layer of one of the L blocks, and after the output data of one of the L blocks stored in one of the second storage spaces has been read, the output data of another of the L blocks is stored in that second storage space; wherein L, S, and R are integers greater than or equal to 2, and S and R are smaller than L.
  • In a fourth aspect, a drone is provided, including the image processing device based on the convolutional neural network according to the second aspect or the third aspect.
  • In the embodiments of the present application, the 3D feature map is read from the first on-chip memory in blocks, the current layer of the convolutional neural network is processed by block, and the output result of the current layer is stored in the first on-chip memory. Processing on a block basis requires fewer on-chip storage resources and places lower requirements on the processing capability of the arithmetic circuit, so the processing of the 3D feature map can be realized when the on-chip storage resources or the processing power are insufficient.
  • In addition, the number of blocks included in the 3D feature map is L, and the first on-chip memory includes S first storage spaces and R second storage spaces, where S and R are less than L. Each first storage space is respectively used to store the input data of the current layer of one block, and each second storage space is respectively used to store the output data of the current layer of one block; after the data of one block stored in one of the storage spaces has been read, the data of another block can be stored in that storage space.
  • FIG. 1 is a schematic diagram of an architecture of a convolutional neural network according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a 3D feature map according to the present embodiment.
  • FIG. 3 is a schematic diagram of a pooling operation according to an embodiment of the present application.
  • FIG. 4 is an architecture diagram of a system of a convolutional neural network according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an image processing method based on a convolutional neural network according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a 3D feature map segmentation manner according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a division manner of a 3D feature map according to an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of an image processing method based on a convolutional neural network according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a storage pipeline of a storage space included in a first on-chip memory according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a storage pipeline of a storage space included in a first on-chip memory according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an image processing device based on a convolutional neural network according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an image processing device based on a convolutional neural network according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a drone according to an embodiment of the present application.
  • A convolutional neural network is an artificial neural network that has been widely used in image recognition and other fields.
  • the convolutional neural network may include an input layer, a hidden layer, and an output layer.
  • the hidden layer may include a convolution layer, a pooling layer, an activation layer, a fully connected layer, and the like, as shown in FIG. 1.
  • Each layer of the convolutional neural network can process the feature map output by the previous layer (for example, convolution, pooling, activation, or full connection processing) to obtain the feature map output by the current layer.
  • the feature map mentioned in the embodiment of the present application may be a three-dimensional (3D) feature map.
  • the 3D feature map may be referred to as a 3D feature matrix.
  • A 3D feature map can be understood as multiple two-dimensional (2D) feature images stacked together. A 2D feature image can be referred to as a feature, and each 2D feature image can respectively correspond to one channel of an image frame; a 3D feature map can be obtained from one image frame or from multiple image frames.
  • The thickness of the 3D feature map (that is, the number of 2D feature images) can be equal to the number of channels of the image frame, such as the three channels R, G, and B. The channels can be referred to as features here, and the number of channels can be understood as the number of features.
  • For example, the size of the 3D feature map is W × H × M, where W can represent the width direction, H can represent the height direction, and M represents the channel direction (also known as the depth direction or thickness direction).
  • W × H can represent a 2D feature map.
  • the architecture of the convolutional neural network shown in FIG. 1 is only for illustrative purposes, and the convolutional neural network in the embodiment of the present application may have other architectures.
  • a convolutional neural network does not include an activation layer, or the activation layer may be located before the pooling layer, etc.
  • The convolution operation of the convolution layer can be performed by using a convolution kernel (which can be a 3D convolution kernel; a convolution kernel can also be called a filter) on a 3D feature map to output a 2D feature map. The operation can be an inner product between the feature values of the 3D feature map and the weights of the convolution kernel.
  • multiple convolution kernels can be used to perform operations on the 3D feature map respectively, and an output 3D feature map can be obtained.
  • The sizes of the multiple convolution kernels can be the same, but their parameters can be different. The size of a convolution kernel in the channel direction (that is, the number of features) may be the same as the size of the 3D feature map in the channel direction.
  • The convolution operation of the convolution layer can be performed by sliding the convolution kernel. Starting from the upper left corner of the 3D feature map, the convolution kernel slides to the lower right corner of the 3D feature map to generate a 2D feature map. After each slide, the computing device extracts a 3D feature matrix with the same size as the convolution kernel from the 3D feature map and performs an inner product operation with the convolution kernel to generate an output feature value. After performing the above operations using multiple convolution kernels, a 3D feature map can be output.
  • The size of the 3D feature map output by the convolution layer in the width direction can be (w0 + p0 - k0) / s0 + 1 (rounded down), where w0 represents the size of the 3D feature map input to the convolution process in the width direction, p0 represents the amount of data filled by the 3D feature map in the width direction during the convolution process, k0 represents the size of the convolution kernel in the width direction, and s0 represents the step of the convolution kernel sliding in the width direction during the convolution process.
  • The size of the 3D feature map output by the convolution layer in the height direction can be (h0 + p1 - k1) / s1 + 1 (rounded down), where h0 represents the size of the 3D feature map input to the convolution process in the height direction, p1 represents the amount of data filled by the 3D feature map in the height direction during the convolution process, k1 represents the size of the convolution kernel in the height direction, and s1 represents the step size of the convolution kernel sliding in the height direction during the convolution process.
  • the size of the 3D feature map output by the convolution layer in the channel direction may be equal to the number of convolution kernels used.
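  • As an illustrative sketch of the sliding-window convolution and output-size formula described above (a simplified NumPy model assuming zero padding and a (channels, height, width) layout; it is not the hardware implementation of the present application):

```python
import numpy as np

def conv_output_size(w0, p0, k0, s0):
    # (w0 + p0 - k0) / s0 + 1, rounded down
    return (w0 + p0 - k0) // s0 + 1

def naive_conv(feature_map, kernels, stride=1):
    """feature_map: (M, H, W); kernels: (N, M, kH, kW) -> output (N, outH, outW)."""
    M, H, W = feature_map.shape
    N, Mk, kH, kW = kernels.shape
    assert M == Mk, "kernel depth must equal the channel size of the feature map"
    out_h = conv_output_size(H, 0, kH, stride)
    out_w = conv_output_size(W, 0, kW, stride)
    out = np.zeros((N, out_h, out_w))
    for n in range(N):                      # each kernel produces one 2D feature map
        for i in range(out_h):
            for j in range(out_w):
                patch = feature_map[:, i * stride:i * stride + kH,
                                       j * stride:j * stride + kW]
                out[n, i, j] = np.sum(patch * kernels[n])  # inner product
    return out

# 224-wide input, 3-wide kernel, stride 1, no padding -> 222 columns out
print(conv_output_size(224, 0, 3, 1))  # 222
```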
  • The pooling operation of the pooling layer can also be called a down-sampling operation. Its purpose is to reduce the size of the feature maps: a classifier with too many feature inputs is difficult to train and is prone to overfitting. Since the features obtained after convolution are a static attribute, the features in two different image regions are likely to be the same; therefore, when describing large images, aggregate statistics can be used for features at different positions. Pooling can use a sliding-window method, starting from the upper left corner of each feature of the input 3D feature map and sliding the window to the lower right corner of the feature with a certain step size, to generate a 2D feature map.
  • the 3D feature map output by the pooling layer can be obtained.
  • Commonly used operations for pooling are: Max Pooling, Mean Pooling, Gauss Pooling, and Trainable Pooling.
  • For example, the pooling window is 2 × 2 and the step size is 2; each maximum pooling operation obtains one value after operating on four numbers.
  • The size of the 3D feature map output by the pooling layer in the width direction can be (w1 + p2 - k2) / s2 + 1 (rounded down), where w1 represents the size of the 3D feature map input to the pooling process in the width direction, p2 represents the amount of data filled by the 3D feature map in the width direction during the pooling process, k2 represents the size of the window of the pooling process in the width direction, and s2 represents the step of the window of the pooling process sliding in the width direction.
  • The size of the 3D feature map output by the pooling layer in the height direction can be (h1 + p3 - k3) / s3 + 1 (rounded down), where h1 represents the size of the 3D feature map input to the pooling process in the height direction, p3 represents the amount of data filled by the 3D feature map in the height direction during the pooling process, k3 represents the size of the window of the pooling process in the height direction, and s3 represents the step of the window of the pooling process sliding in the height direction.
  • the size of the 3D feature map output by the pooling layer in the channel direction may be equal to the size of the 3D feature map input by the pooling layer in the channel direction, that is, the result of the pooling operation may keep the number of features of the 3D feature map unchanged.
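  • As an illustrative sketch of the maximum pooling described above (window 2 × 2, step size 2; a simplified NumPy model, not the hardware implementation; the channel count is left unchanged):

```python
import numpy as np

def pool_output_size(w1, p2, k2, s2):
    # (w1 + p2 - k2) / s2 + 1, rounded down
    return (w1 + p2 - k2) // s2 + 1

def max_pool(feature_map, k=2, s=2):
    """feature_map: (M, H, W) -> (M, outH, outW); each k*k window yields one value."""
    M, H, W = feature_map.shape
    out_h = pool_output_size(H, 0, k, s)
    out_w = pool_output_size(W, 0, k, s)
    out = np.zeros((M, out_h, out_w))
    for m in range(M):
        for i in range(out_h):
            for j in range(out_w):
                out[m, i, j] = feature_map[m, i * s:i * s + k, j * s:j * s + k].max()
    return out

x = np.arange(16, dtype=float).reshape(1, 4, 4)
print(max_pool(x).shape)  # (1, 2, 2): each 2x2 window of four numbers gives one value
```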
  • a specific activation function can be used to perform point-to-point mapping to obtain the output 3D feature map of the activation layer.
  • After the input 3D feature map passes through the convolution layer, the pooling layer, and the activation layer, it can enter the fully connected layer, where the 3D feature map can be mapped into a long input vector and enter the output layer.
  • the processing of the convolutional neural network may be implemented by a processor, for example, a field programmable gate array (Field Programmable Gate Array, FPGA) or an application-specific integrated circuit (ASIC).
  • the system implementing the convolutional neural network may include a processor 100 and an off-chip memory 200.
  • the processor 100 may be referred to as an accelerator.
  • The processor 100 may include a control circuit 110, a first operation circuit 122, a second operation circuit 124, a direct memory access (DMA) 130, and a static random-access memory (SRAM) 140 serving as an on-chip memory.
  • The control circuit 110 may control the operations of the first operation circuit 122 and the second operation circuit 124 (for example, the size of the operation data and the operation timing), control the read time and address of the DMA 130, and enable the DMA 130 to read data from the off-chip memory 200 into the SRAM 140 or write data from the SRAM 140 to the off-chip memory 200. For example, the control circuit 110 can read instructions from the off-chip memory 200 to implement the control of the first operation circuit 122, the second operation circuit 124, and the DMA 130.
  • the first operation circuit 122 and the second operation circuit 124 can implement the processing of the corresponding layers of the convolutional neural network.
  • One operation circuit can implement the operation of one layer, and the operation of one layer can be implemented by multiple operation circuits in parallel.
  • the first operation circuit 122 and the second operation circuit 124 can read data from the SRAM 140 and perform operations of the corresponding layers, and can output the operation results to the SRAM 140 for storage.
  • the first operation circuit 122 and the second operation circuit 124 may include an on-chip memory distinguished from the SRAM, for storing data in the first operation circuit 122 and the second operation circuit 124, for example, the first operation circuit 122 and the second operation Intermediate results obtained by circuit 124.
  • the DMA 130 can read data from the off-chip memory 200 (for example, data that can be used for operations of the first arithmetic circuit 122 and the second arithmetic circuit 124) and store it in the SRAM 140, or can read data from the SRAM 140 ( For example, operation results of the outputs of the first operation circuit 122 and the second operation circuit 124), and store the data in the off-chip memory 200.
  • first operation circuit 122 and the second operation circuit 124 shown in FIG. 4 may perform processing of the same layer, and may also perform processing of different layers.
  • the processor 100 may further include other numbers of arithmetic circuits, which are not specifically limited in the embodiment of the present application.
  • FIG. 4 is only an implementation manner of the embodiment of the present application, and should not be particularly limited in the embodiment of the present application.
  • the operation circuit of the current layer needs to wait until the operation circuit of the next layer is idle before outputting the output result to the operation circuit of the next layer. In this way, the overall efficiency of the accelerator is relatively low, the circuit design requirements are high, and the flexibility is insufficient.
  • the 3D feature map of the convolutional neural network can be divided into a plurality of blocks, and the 3D feature map is processed by the convolutional neural network according to the block.
  • the specific execution process can be shown in FIG. 5, for example.
  • the method shown in FIG. 5 may be implemented by a processing device, and the processing device may optionally include the processor 100 shown in FIG. 4.
  • the processing device may include an arithmetic circuit of each layer, and the arithmetic circuit of each layer may perform processing of the corresponding layer according to the method shown in FIG. 5.
  • the processing device may include a control circuit and an operation circuit of each layer, and the control circuit may control the operation circuit of each layer to perform processing of the corresponding layer according to the method shown in FIG. 5.
  • the processing device may include a control unit without an arithmetic circuit.
  • performing at least two layers of processing based on a convolutional neural network in 320 may refer to controlling the arithmetic circuits of the respective layers for processing.
  • the processing device in the embodiment of the present application may be implemented by an FPGA or an ASIC.
  • Since an FPGA or an ASIC can achieve specific functions through custom hardware acceleration, the processing is more efficient.
  • the processing device may read the 3D feature map in blocks, wherein the 3D feature map includes a plurality of blocks.
  • Reading the 3D feature map by block can be reading the data included in each block from the off-chip memory (in this case, the data of the read block can be stored in the first on-chip memory), or it can be reading the data contained in each block from the first on-chip memory.
  • the first on-chip memory mentioned in the embodiment of the present application may be an SRAM.
  • The first on-chip memory can be two-dimensional; for example, the storage format can be 4096 × 128b. The storage of the 3D feature map (for example, read data that has not been processed by the convolutional neural network, or intermediate output results obtained after processing) can be an extension in the 2D space; specifically, an address can be introduced for each feature to achieve 3D space access.
  • In other words, the 3D feature map may be stored in a 2D manner.
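  • As an illustrative sketch of storing a 3D feature map in a 2D on-chip memory by giving each feature (channel) its own base address (the layout and the 8-bit value width used here are assumptions for illustration, not the actual mapping of the first on-chip memory):

```python
WORD_BITS = 128                           # one storage address holds 128 bits
DATA_BITS = 8                             # assume 8-bit feature values
DATA_PER_ADDR = WORD_BITS // DATA_BITS    # 16 values per storage address

def address_of(c, h, w, height, width, base=0):
    """Return (address, offset within the 128-bit word) of element (c, h, w)."""
    words_per_row = (width + DATA_PER_ADDR - 1) // DATA_PER_ADDR
    channel_base = base + c * height * words_per_row   # one base address per feature
    addr = channel_base + h * words_per_row + w // DATA_PER_ADDR
    return addr, w % DATA_PER_ADDR

print(address_of(c=1, h=0, w=0, height=6, width=224))  # (84, 0): 6 rows * 14 words per row
```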
  • the 3D feature map mentioned here may not have been processed by any layer of the hidden layer of the convolutional neural network, or may have been processed by at least one layer of the hidden layer.
  • the processing device may process the 3D feature map by a convolutional neural network in blocks.
  • The processing performed on the 3D feature map in blocks may be that the blocks are respectively processed at the same layer.
  • For example, one arithmetic circuit can process multiple blocks in sequence; that is, after the processing of one block is completed, the processing of the next block can be performed.
  • the 3D feature map may be processed at least two layers of a convolutional neural network in blocks.
  • For each layer of processing, there may be one arithmetic circuit or multiple arithmetic circuits; in the latter case, the multiple arithmetic circuits may perform the processing of the layer in parallel.
  • In this way, the 3D feature map is read in blocks and processed by the convolutional neural network, so the processing of the 3D feature map can be realized even when the on-chip storage resources or the processing power are insufficient.
  • For example, the 3D feature map can be read in blocks and the read blocks are stored in the first on-chip memory; in this case, only the input data of a single block needs to be stored on the chip. Assuming that the 3D feature map is divided into multiple blocks in the channel direction, the data of some features of the 3D feature map can be read from the off-chip memory each time, stored in the first on-chip memory, and then convolution or pooling is performed.
  • a single arithmetic circuit can perform arithmetic processing in blocks.
  • the output result of the current layer is stored in the first on-chip memory until it is read by the next layer.
  • the output results can be stored in the first on-chip memory, and the output results are no longer stored from the first on-chip memory to the off-chip memory.
  • The operation circuit of the next layer can read, from the first on-chip memory, the operation result output by the operation circuit of the previous layer to perform the corresponding operation.
  • For example, the arithmetic circuit used for convolution processing can store the output result of the convolution layer into the first on-chip memory in blocks, and the arithmetic circuit used for pooling processing can read the output result of the convolution layer stored in the first on-chip memory and perform the calculation of the pooling layer by block.
  • the embodiment of the present application proposes that the output result of the current layer can be stored in the first on-chip memory.
  • However, the available storage space of the first on-chip memory is generally small; if the amount of data to be stored is large, the storage cannot be achieved.
  • For example, for an input 3D feature map of 224 × 224 × 128, if the step size is 1 and there is no element padding when processing the convolution layer, the output result of the convolution is a 222 × 222 × 128 3D feature map.
  • If the step size is 1 and there is no element padding when processing the pooling layer, the output result of the pooling is a 220 × 220 × 128 3D feature map.
  • In this case, the data of 224 × 224 × 128 needs to be read from the memory, and the data of 220 × 220 × 128 needs to be output to the memory.
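  • A quick check of the sizes in this example (assuming one byte per feature value; the 64 KiB on-chip capacity below is only an illustrative assumption based on a 4096 × 128b memory):

```python
def mib(n_bytes):
    return n_bytes / 1024 / 1024

input_bytes = 224 * 224 * 128        # data read from memory
conv_out_bytes = 222 * 222 * 128     # intermediate result of the convolution layer
pool_out_bytes = 220 * 220 * 128     # data output to memory

print(f"input:    {mib(input_bytes):.2f} MiB")     # ~6.12 MiB
print(f"conv out: {mib(conv_out_bytes):.2f} MiB")  # ~6.02 MiB
print(f"pool out: {mib(pool_out_bytes):.2f} MiB")  # ~5.91 MiB

onchip_bytes = 4096 * 128 // 8       # 64 KiB for a 4096 x 128b memory
print(conv_out_bytes // onchip_bytes)  # intermediate result is ~96x the on-chip capacity
```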
  • In the embodiments of the present application, at least two layers of the convolutional neural network are processed for the 3D feature map by block, and when each block is processed, the output result of the current layer is stored in the first on-chip memory for the processing of the next layer. This can realize the processing of 3D feature maps when the on-chip storage resources or the processing power are insufficient, avoids repeatedly reading data from off-chip storage resources, and avoids taking up excessive system bandwidth.
  • For example, the pre-stage arithmetic circuit may be the convolution layer arithmetic circuit, and the post-stage arithmetic circuit may be the pooling layer arithmetic circuit.
  • It should be understood that reading by block and processing the convolutional neural network by block do not mean that the data of one block must be read all at once and then processed. Taking into account the processing performance of the arithmetic circuits of each layer, the data in a single block can be read and processed multiple times when performing one layer of processing; or, the data in a single block can be processed in parallel by multiple processing circuits when processing one of the layers.
  • In addition, the processing of the convolutional neural network may not be performed entirely in blocks.
  • For example, at least one layer in the convolutional neural network is processed in blocks, while the processing of other layers may be performed without blocking (that is, the 3D feature map is processed as a whole and block division is no longer performed).
  • the processing of the other layers that is not performed by blocks may be located before the processing performed by blocks, or may be located after the processing performed by blocks.
  • For example, the convolutional and pooling layers can be processed block-wise, while the activation and fully connected layers can be processed non-block-wise.
  • Alternatively, the convolutional layer, the pooling layer, and the activation layer are processed in blocks, while the fully connected layer may not be processed in blocks.
  • The 3D feature map may be divided into a plurality of blocks according to the available storage capacity of the first on-chip memory and/or the parameters used in the processing of each layer of the convolutional neural network, so that the output result obtained by processing each block can be stored in the first on-chip memory.
  • The parameters used in the processing of each layer of the convolutional neural network mentioned here can be understood as the parameters that affect the size of the output result when the calculation of each layer is performed. For the convolution layer, the parameters can be the size of the convolution kernel and the sliding step of the convolution kernel; for the pooling layer, the parameters can be the pooling mode, the pooling window size, and the sliding step of the pooling window.
  • When the 3D feature map is divided into multiple blocks, the specific implementation may be to determine the size of each block and then read the data from the 3D feature map according to the determined size.
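  • An illustrative sketch (an assumption, not the algorithm of the present application) of determining a block height so that one block's input and its current-layer output both fit in the available on-chip capacity:

```python
def rows_per_block(total_rows, width, channels, kernel, stride,
                   avail_bytes, bytes_per_val=1):
    """Largest number of input rows per block whose input + output still fit on chip."""
    best = kernel
    for rows in range(kernel, total_rows + 1):
        out_rows = (rows - kernel) // stride + 1
        in_bytes = rows * width * channels * bytes_per_val
        out_bytes = out_rows * width * channels * bytes_per_val
        if in_bytes + out_bytes <= avail_bytes:
            best = rows
    return best

# e.g. a 224x224x128 feature map, 3x3 kernel, stride 1, 512 KiB assumed available
print(rows_per_block(224, 224, 128, 3, 1, 512 * 1024))  # 10 rows per block here
```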
  • The processing device that executes the method in the embodiments of the present application may determine the size of each of the plurality of blocks based on the available storage capacity of the first on-chip memory and/or the parameters adopted by each layer of the convolutional neural network; when the processing device includes the processor 100 shown in FIG. 4, the determining operation may be implemented by the control circuit 110.
  • the processing device in the embodiment of the present application may not have a substantial block division operation, but only performs reading and calculation on a block basis when reading and calculating.
  • For example, the size and reading order of each block may be preset on the processing device, and the processing device may directly read the 3D feature map by block based on the preset size and reading order.
  • the size and read order of the blocks may be determined by the subject performing the preset operation based on the available storage capacity of the first on-chip memory and / or parameters used by each layer of the convolutional neural network.
  • In some cases, the 3D feature map may not be segmented. For example, in global pooling, a feature usually has only one output value; that is, the output of global pooling is smaller than the output of maximum pooling.
  • If the first on-chip memory can store the result output without the 3D feature map being segmented, the 3D feature map as a whole can be directly processed by the convolutional neural network instead of being segmented.
  • In this way, the feature input data can be reused as much as possible, and the result calculated from the feature input data is stored in the first on-chip memory, so it is not necessary to repeatedly store the intermediate result in and read it from the off-chip memory; the parameters of the convolution layer, however, can be stored in the off-chip memory and read repeatedly. Of course, if the storage space of the first on-chip memory is sufficient, the parameters of the convolution layer can also be stored in the first on-chip memory.
  • the off-chip memory mentioned in the embodiment of the present application may be a double-rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR).
  • the sizes of the multiple blocks into which the 3D feature map is divided may be the same or not exactly the same.
  • For example, the largest block size can be determined based on the available storage capacity of the first on-chip memory, and the reading and the convolutional neural network processing are performed block by block in accordance with the largest block size, until the last block is read and processed; the size of the last block can be smaller than the size of the largest block.
  • Alternatively, the largest block size can be determined based on the available storage capacity of the first on-chip memory, and then the 3D feature map is evenly divided based on the largest block size; the size of each block after division can be smaller than this determined largest block size.
  • the 3D feature map may be divided into a plurality of blocks in at least one of a width direction, a height direction, and a channel direction.
  • For example, as shown in FIG. 6, a 3D feature map having a size of W × H × M may be segmented in the height direction, and specifically three blocks as in (a) may be obtained; or, the 3D feature map of size W × H × M may be segmented in the channel direction M, and specifically three blocks as in (b) may be obtained; or, the 3D feature map of size W × H × M may be segmented in the width direction, and specifically three blocks as in (c) may be obtained.
  • FIG. 6 shows the division of the block in one direction, and the division of the block in at least two directions may also be performed.
  • For example, as shown in (a) of FIG. 7, the 3D feature map can be divided in the width direction and the channel direction to obtain 9 blocks; or, as shown in (b) of FIG. 7, it can be divided in the height direction and the channel direction to obtain 9 blocks; or, as shown in (c) of FIG. 7, it can be divided in the width direction and the height direction to obtain 9 blocks.
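  • An illustrative NumPy sketch of dividing a feature map (stored here as (M, H, W)) into blocks along the height and/or channel direction, as in FIG. 6 and FIG. 7:

```python
import numpy as np

def split_blocks(fmap, n_h=1, n_w=1, n_c=1):
    """fmap: (M, H, W); split n_c / n_h / n_w ways along the channel / height / width."""
    blocks = []
    for c_part in np.array_split(fmap, n_c, axis=0):
        for h_part in np.array_split(c_part, n_h, axis=1):
            for w_part in np.array_split(h_part, n_w, axis=2):
                blocks.append(w_part)
    return blocks

fmap = np.zeros((128, 224, 224))
print(len(split_blocks(fmap, n_h=3)))          # 3 blocks: height direction only, as in FIG. 6(a)
print(len(split_blocks(fmap, n_h=3, n_c=3)))   # 9 blocks: height and channel directions, as in FIG. 7(b)
```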
  • the read addresses and write addresses of multiple blocks in the same layer may have a certain relationship, for example, they may be continuous in storage space, or they may occupy the same storage space. .
  • This relationship can be preset on the processing device.
  • For example, when reading the input data of one block of a layer, its read address can be obtained from the read address of another block of the same layer; or, when writing the output data of one block of a layer, the write address can be obtained from the write address of another block of the same layer.
  • the write address of the output data of the convolution layer of another block may be determined according to the write address of the output data of the one block.
  • the read address of the input data of the pooling layer of another block may be determined according to the read address of the input data of the pooling layer of the one block.
  • the output result of the current layer may be stored in the first on-chip memory in a manner of covering data that has been read during the processing of the convolutional neural network.
  • the on-chip cache can be recycled, which can improve the utilization of the on-chip cache.
  • the processing device may determine a storage address of the data that has been read, and store the output result of the current layer in the storage address.
  • the storage address may be a physical address, and may include a start address and an end address.
  • For example, the output result of the current layer of the first block may overwrite the data of the first block that has been read by the current layer. It should be understood that the word "first" in the first block mentioned here is not intended to limit the processing order of the blocks, and is only used to distinguish the blocks.
  • For example, the arithmetic circuit used for the convolution processing may read the input data of the first block in the first on-chip memory and then perform the convolution processing; after the convolution processing is performed, the arithmetic circuit used for the convolution processing may overwrite at least part of the read data corresponding to the first block in the first on-chip memory to store the output result of the convolution processing for the pooling processing. Then, the arithmetic circuit used for the pooling processing can read the output result of the convolution processing, perform the pooling processing, and overwrite the output result of the convolution processing that has been read, and so on.
  • the on-chip storage space required for the intermediate output results corresponding to each block may become smaller and smaller.
  • The excess storage space can be used to store other data, such as the data of other blocks.
  • processing process of each block can be referred to as a processing line.
  • Parallel processing of multiple processing lines means that at least two blocks can be processed at the same time.
  • the parallel processing of multiple processing lines does not mean that the processing actions of the multiple processing lines must be the same.
  • the processing time of at least two processing lines in parallel processing may only partially overlap.
  • Alternatively, the output result of the current layer of the first block covers the read data of the second block (a block other than the first block).
  • the output result may cover the read data of other blocks in the first on-chip memory.
  • For example, the output result of the (i+1)-th layer of the first block covers the output result of the i-th layer of the second block in the first on-chip memory, where the output result of the i-th layer of the second block is data that has been read, the convolutional neural network includes n layers, and i takes values from 1 to n.
  • For example, the time for reading the input data of the (i+1)-th layer from the first on-chip memory + the calculation time of the (i+1)-th layer + the time for writing the output data of the (i+1)-th layer into the first on-chip memory ≤ the time for reading the input data of the i-th layer from the first on-chip memory + the calculation time of the i-th layer + the time for writing the output data of the i-th layer into the first on-chip memory.
  • the results of the output of two blocks are stored in the first on-chip memory.
  • the pooling process of the first block can be performed in synchronization with the convolution process of the second block.
  • The output result of the pooling processing of the first block can overwrite the output result of the convolution processing of the first block and be stored in the first on-chip memory; then the pooling result of the first block is output to the off-chip memory, and the convolution result of the second block is stored in the storage location of the first on-chip memory that is used for storing the pooling result of the first block.
  • For this purpose, the computing power of the pooling can match the calculation time of the convolution; that is, in the system design, it can be set that the time required for the pooling processing of one block does not exceed the time required for the convolution processing of one block.
  • In the following example, the blocks are obtained by partitioning the 3D feature map, characterized as W × H × M, in the height direction, for example similar to the division manner in (a) of FIG. 6.
  • For example, the input block of the convolution processing is 224 × 6 × 128, the number of convolution kernels is 128, the size of each convolution kernel can be 3 × 3 × 128, and the step size is 1. After the first block is output by the convolution processing, the subsequent second block can further input 4 lines of data, which are combined with the last two lines of data of the first block, and the output result of the convolution of the second block is 112KB. The first on-chip memory stores the output result of the convolution processing, which is 224 wide.
  • The pooling layer can then read the convolution result of the first block. For example, the sliding window size of the pooling layer is 3 × 3 and the step size is 1; the pooling result of the first block can then be written into the storage space holding the convolution result of the first block, that is, the storage space of the convolution result of the 6-line convolution processing is used to store the processing result of the 4-line pooling processing, and then the pooling result of the first block is written from the first on-chip memory to the off-chip memory.
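  • A quick check of the 112KB figure in this example (assuming one byte per value and padding in the width direction so the output stays 224 wide; both assumptions are mine, the description above only states the block sizes and the 112KB result):

```python
rows_in, width, channels, kernel, stride = 6, 224, 128, 3, 1
rows_out = (rows_in - kernel) // stride + 1    # 4 output rows from 6 input rows
conv_out_bytes = rows_out * width * channels   # per-block convolution output
print(rows_out, conv_out_bytes // 1024)        # 4 rows, 112 KiB
```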
  • In an example, the output result of the i-th layer of the first block covers the output result of the i-th layer of another block in the first on-chip memory, where the output result of the i-th layer of the other block is data that has been read by the (i+1)-th layer or data that has been output to the off-chip memory, the convolutional neural network includes n layers, and the value of i ranges from 1 to n.
  • In an example, the input data of the i-th layer of the first block covers the input data of the i-th layer of another block in the first on-chip memory, where the input data of the i-th layer of the other block is data that has been read by the i-th layer, the convolutional neural network includes n layers, and the value of i ranges from 1 to n.
  • the first on-chip memory stores input data and / or output data of the same layer of at least two blocks at the same time.
  • a specific implementation of the convolutional neural network may be a method 400 shown in FIG. 8.
  • the method 400 may be implemented by a processing device.
  • A 3D feature map is read from a first on-chip memory in blocks, where the 3D feature map includes L blocks; the first on-chip memory includes S first storage spaces, each of the S first storage spaces is respectively used to store the input data of the current layer of one of the L blocks included in the 3D feature map, and after the input data of one of the L blocks stored in one of the first storage spaces has been read, the input data of another of the L blocks is stored in that first storage space.
  • The input data of the current layer stored in the S first storage spaces may be read from the off-chip memory; in this case, the current layer may be the first layer processed by the convolutional neural network. Alternatively, the input data of the current layer stored in the S first storage spaces may be the output data processed by the previous layer.
  • the 3D feature map is processed in blocks of the current layer of the convolutional neural network.
  • The output result of the current layer is stored in the first on-chip memory; the first on-chip memory further includes R second storage spaces, each of the second storage spaces is used to store the output data of the current layer of one of the L blocks, and after the output data of one of the L blocks stored in one of the second storage spaces has been read, the output data of another one of the L blocks is stored in that second storage space;
  • the L, the S, and the R are integers greater than or equal to 2, and the S and the R are smaller than the L.
  • the number of operation circuits of the current layer may be less than S, and further may be less than R, for example, the number of operation circuits is one.
  • S may be equal to R.
  • S is not equal to R.
  • For example, the data stored in the S first storage spaces is used as the input data of the convolution layer, and the data stored in the R second storage spaces is the output data of the convolution layer, which serves as the input data of the pooling layer. If the number of operation circuits of the pooling layer is large and/or their operation capabilities are strong, the data in the R second storage spaces can be quickly read by the operation circuits of the pooling layer, and then R can be less than S.
  • the division direction of the block is a width direction and / or a height direction, and the channel direction is not included. It should be understood that, at this time, a block may also be divided in the channel direction and divided into multiple sub-blocks.
  • The processing of each layer may correspond to its own first storage spaces and second storage spaces; that is, the storage spaces corresponding to different layers for storing input data are not multiplexed, and the storage spaces corresponding to different layers for storing output data are not multiplexed.
  • the first storage space of the current layer is used as the second storage space of the previous layer, and the second storage space of the current layer is used as the first storage space of the next layer.
  • For example, the first on-chip memory includes storage spaces a1, a2, b1, and b2, where a1 and a2 store blocks 1 and 2 for the convolution processing (other processing such as pooling processing is also applicable: the input data of the convolution processing can be read from the off-chip memory, and the input data of the pooling processing can be the output data of the convolution processing). The arithmetic circuit for the convolution processing performs convolution operations on blocks 1 and 2 respectively, and stores the output results of the convolution processing of blocks 1 and 2 into the storage spaces b1 and b2, respectively, for the processing of the pooling layer.
  • After the convolution processing of block 1 is completed, the arithmetic circuit can directly read the data of block 2 from the storage space a2 and perform the convolution processing without waiting.
  • After the input data of block 1 has been read, the input data for the convolution processing of block 3 may be stored in the storage space a1, so that the arithmetic circuit can read the data of block 3 for convolution processing after the convolution processing of block 2 is completed; similarly, after the reading of the input data of block 2 is completed, the input data of block 4 may be stored in the storage space a2, and so forth.
  • The above example assumes that the convolution processing of one block does not need to use the input data of other blocks.
  • the processing of the current layer of one block may also use the input data of other blocks.
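  • An illustrative sketch of the a1/a2/b1/b2 ping-pong scheme in the example above: the spaces holding input and output data are reused in rotation, so a storage space is refilled with the next block only after its previous contents have been read (the scheduling here is a simplification, not the hardware control logic):

```python
def ping_pong(blocks, n_in=2, n_out=2):
    in_spaces = [None] * n_in     # e.g. a1, a2: input data of the current layer
    out_spaces = [None] * n_out   # e.g. b1, b2: output data of the current layer
    for step, blk in enumerate(blocks):
        # overwrite the space whose previous block has already been read
        in_spaces[step % n_in] = blk
        out_spaces[step % n_out] = f"conv({blk})"
        print(f"step {step}: in={in_spaces} out={out_spaces}")

ping_pong(["block1", "block2", "block3", "block4"])
```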
  • the first on-chip memory stores input data of the same layer of at least three blocks and / or output data of the same layer of the at least three blocks at the same time.
  • S and / or R mentioned above may be 3 or more.
  • For example, if the S first storage spaces are used to store the input data of the convolution layer, and the convolution layer needs the data of the previous block when processing one of the blocks, then S may be greater than or equal to 3.
  • If the R second storage spaces are used to store the output data of the convolution layer, which can be used for the processing of the pooling layer, and the pooling layer uses the data of the previous block when processing one of the blocks, then R can be greater than or equal to 3.
  • For example, the first on-chip memory includes storage spaces a1, a2, a3, b1, b2, and b3, where a1, a2, and a3 store blocks 1, 2, and 3 for the convolution processing (other processing such as pooling processing is also applicable, where the input data of the convolution processing can be read from the off-chip memory and the input data of the pooling processing can be the output data of the convolution processing). The arithmetic circuit for the convolution processing performs convolution operations on blocks 1, 2, and 3 respectively, and stores the output results of the convolution processing of blocks 1, 2, and 3 into the storage spaces b1, b2, and b3, respectively.
  • The input data of block 1 can be read first for convolution processing. After the convolution processing of block 1 is completed, the arithmetic circuit can directly read the data of block 2 from the storage space a2 and perform the convolution processing without waiting; after the convolution processing of block 2 is completed, the arithmetic circuit can directly read the data of block 3 from the storage space a3 without waiting. Otherwise, the arithmetic circuit would need to wait until the data of block 1 is released and the data of another block is stored before the calculation can continue. Therefore, in this case, there need to be at least 3 storage spaces for storing input data and at least 3 storage spaces for storing output data.
  • The completion of reading of the data of one block may mean that the data of that block no longer needs to be read for the current layer's processing of any block; in that case, the data of the block can be considered to have been read. If the data of the one block also needs to be used by the current layer for the processing of another block, then only after all of its data has been read for the processing of the one block and at least the required part of its data has been read for the processing of the other block can the data of the one block be considered to have been read.
  • Because the first on-chip memory stores the input data of the same layer of at least two blocks at the same time, pipelined work can be realized; that is, the arithmetic circuits and storage spaces in the system can work efficiently without having to wait.
  • For example, the time for reading the input data of the (i+1)-th layer from the first on-chip memory + the calculation time of the (i+1)-th layer + the time for writing the output data of the (i+1)-th layer into the first on-chip memory ≤ the time for reading the input data of the i-th layer from the first on-chip memory + the calculation time of the i-th layer + the time for writing the output data of the i-th layer into the first on-chip memory.
  • The sizes of the respective blocks are optionally the same, but it should be understood that the embodiments of the present application are not limited thereto; the sizes of the respective blocks may also be different, in which case the calculation speed of the larger blocks may be increased.
  • In some embodiments, the processing time of multiple blocks is completely synchronous. In this case, there may be multiple storage spaces for storing the data of each block, and the output result of the current layer of one block covers the data of that block that has been read by the current layer.
  • the processing device may include multiple computing circuits.
  • the processing device may preset the processing blocks and processing order of each computing circuit, and the storage manner of the output results of each computing circuit. .
  • certain rules can be set in advance on the processing device to store data according to the specific rules, or the processing device can detect the storage space of the first on-chip memory in real time and store the data according to the detection result.
  • the dependency relationship may be a dependency relationship of a processing order.
  • For example, the neural network needs to perform C1, C2, C3, P1, and C4 processing (C denotes convolution processing and P denotes pooling processing). P1 processing needs to wait for the C1 processing and reading to be completed, so the output result of the P1 processing can be stored in the storage space used by the C1 processing; C4 processing needs to wait for the completion of the P1 processing and reading, so the processing result of C4 can be stored in the storage space used by the P1 processing.
  • The compiler (which can, for example, be implemented by the control circuit 110 shown in FIG. 4) can record the dependency relationships between the instructions, so as to prevent data from being stomped on during storage, that is, to prevent data that has not been completely read from being overwritten by new data.
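  • An illustrative sketch of the dependency rule described above: an output result may reuse a storage space only after the data previously stored there has been fully read (the buffer bookkeeping below is an assumption used for illustration):

```python
class StorageSpace:
    def __init__(self, holder):
        self.holder = holder          # which operation's result is stored here
        self.fully_read = False       # set once every consumer has read the data

def store_result(spaces, producer):
    """Store a new result into a space whose previous data has been fully read."""
    space = next((s for s in spaces if s.fully_read), None)
    assert space is not None, f"{producer}: no released storage space, must wait"
    print(f"{producer} result overwrites the space previously holding {space.holder}")
    space.holder, space.fully_read = producer, False
    return space

spaces = [StorageSpace("C1")]
spaces[0].fully_read = True        # P1 has finished reading the C1 result
store_result(spaces, "P1")         # P1 output reuses the storage space of C1
spaces[0].fully_read = True        # C4 has finished reading the P1 result
store_result(spaces, "C4")         # C4 output reuses the storage space of P1
```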
  • In some embodiments, the output result of the processing of a layer may be stored in the first on-chip memory for the processing of the next layer. If, in addition to the processing of the next layer, there are other operations (such as the processing of a layer after the current layer of the current convolutional neural network, or of another convolutional neural network) that need to use the output result, the output result can also be stored in the off-chip memory; when the other operation is performed, the output result may be read from the off-chip memory into the first on-chip memory again for that operation.
  • For example, after the next layer has read the output result of the current layer from the first on-chip memory, the output result can be written to the off-chip memory and deleted from the first on-chip memory (specifically, overwritten by other data, for example, by the output result of the next layer); or, the output result of the current layer can be stored to the off-chip memory while it has not yet been read from the first on-chip memory, and after it has been read, it can be deleted from the first on-chip memory (specifically, overwritten by other data, for example, by the output result of the next layer).
  • the output result of the current layer can be stored in the first on-chip memory instead of the off-chip memory.
  • When the data used for the processing performed on the first block also needs to be used for the processing performed on the second block (a block other than the first block), the data may be kept in the first on-chip memory until it is used for the processing of the second block.
  • the data may include data of an integer number of rows.
  • This method can be used for the 3D feature map that is not divided into two or more blocks in the row direction (that is, in the width direction).
  • the block division method can be as shown in FIG. 6 (a ) And (b).
  • the data used by the two blocks can be understood as data belonging to the previous block, but not the next block, or the data cached by the row can also be understood as belonging to The previous block belongs to another block.
  • the data stored at one storage address is all or part of the data of a single row and never includes data from two or more rows. This way of storing data can be called row-based storage.
  • for example, data may be packed 16 values at a time into one storage address, so that reading one storage address yields 16 values; the data of one storage address never spans two rows, that is, it never exceeds one row of data.
  • if each row of the 3D feature map contains 128 values, one row corresponds to 8 storage addresses; if each row contains 127 values, it still corresponds to 8 storage addresses, except that the last storage address holds 15 valid values and 1 invalid (padding) value.
  • when data in the first on-chip memory is released (also called deleted), it can be released by storage address: for example, once all 16 values of one storage address have been read, those 16 values can be released. A packing sketch follows below.
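As a concrete illustration of this row-based packing, the sketch below (not the patent's implementation; the names and the choice of 8-bit values are assumptions) packs one row into 16-value storage words and reports how many values in the last word are valid.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Illustrative sketch: pack one row of feature-map values into 16-value
// storage words ("addresses"), padding the last word with invalid values.
constexpr int kValuesPerAddress = 16;

struct PackedRow {
    std::vector<std::array<uint8_t, kValuesPerAddress>> words;  // one entry per storage address
    int validInLastWord;                                        // e.g. 15 when the row holds 127 values
};

PackedRow packRow(const std::vector<uint8_t>& row) {
    PackedRow out;
    const int numWords =
        (static_cast<int>(row.size()) + kValuesPerAddress - 1) / kValuesPerAddress;
    out.words.resize(numWords);
    for (int w = 0; w < numWords; ++w) {
        for (int i = 0; i < kValuesPerAddress; ++i) {
            const int idx = w * kValuesPerAddress + i;
            // Positions beyond the row length are invalid padding (filled with 0 here).
            out.words[w][i] = idx < static_cast<int>(row.size()) ? row[idx] : uint8_t{0};
        }
    }
    out.validInLastWord =
        static_cast<int>(row.size()) - (numWords - 1) * kValuesPerAddress;
    return out;
}
// With a 128-value row this yields 8 words, all valid; with 127 values it
// yields 8 words with 15 valid values in the last one.
```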
  • the data mentioned here may be the input data of the input layer, or the output result of one layer of the convolutional neural network.
  • for example, when the convolution processing is the first processing of the convolutional neural network, the data in one block that is also needed for the convolution processing of another block can be cached in the first on-chip memory and is not overwritten by other data (for example, by the output result of the first block's convolution processing) until the convolution of the other block has used it.
  • for example, if the window of the convolution processing is 2×2, the sliding step of the window is 1, and the 3D feature map is segmented in the manner of FIG. 6 (a), then the last row of data of the previous block is used, together with the data of the first row of the next block, for the convolution processing of the next block. The data of the last row of the previous block can therefore be kept until it has been used for the convolution processing of the second block.
  • for example, if the window of the convolution layer is 3×3, the sliding step of the window is 2, and the 3D feature map is segmented in the manner of FIG. 6 (a), then the data of the last two rows of the previous block is used, together with the data of the first row of the next block, for the convolution processing of the next block. The data of the last two rows of the previous block can therefore be kept until it has been used for the convolution processing of the second block. A row-cache sketch is given below.
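A minimal sketch of such a row cache follows; it assumes the number of boundary rows to retain is already known from the window size, stride and block boundary (1 row and 2 rows in the two examples above), and the type and class names are illustrative only.

```cpp
#include <deque>
#include <vector>

// Illustrative sketch: keep the last `overlapRows` rows of the previous block
// in on-chip memory so the next block's convolution window can straddle the
// block boundary.
using Row = std::vector<float>;

class RowCache {
public:
    explicit RowCache(int overlapRows) : overlapRows_(overlapRows) {}

    // Call for each row of the block being processed; only the last
    // `overlapRows_` rows are retained.
    void push(const Row& row) {
        rows_.push_back(row);
        if (static_cast<int>(rows_.size()) > overlapRows_) {
            rows_.pop_front();
        }
    }

    // Rows to prepend to the next block before running its convolution.
    // They are released (cleared) once handed over.
    std::vector<Row> take() {
        std::vector<Row> out(rows_.begin(), rows_.end());
        rows_.clear();
        return out;
    }

private:
    int overlapRows_;
    std::deque<Row> rows_;
};
```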
  • when the directions in which the 3D feature map is segmented include at least two directions, and the at least two directions include the height direction, all blocks that have the same width position (also called coordinate) and/or channel position (also called coordinate) but lie at different height positions (also called coordinates) can be processed first; this is hereinafter referred to as traversing in the height direction first.
  • for example, the convolutional layer may be processed sequentially in the order block 1b, block 4b, block 7b, block 2b, block 5b, block 8b, block 3b, block 6b, block 9b.
  • in that case, when the convolutional layer is processed for block 1b, the last row or rows of the input data of block 1b need to be kept in the first on-chip memory for the convolutional-layer processing of block 2b; when the convolutional layer is processed for block 4b, the last row or rows of the input data of block 4b need to be kept for the processing of block 5b; and when the convolutional layer is processed for block 7b, the last row or rows of the input data of block 7b need to be kept for the processing of block 8b. That is, after the convolutional layer has been processed for blocks 1b, 4b and 7b, the first on-chip memory must simultaneously hold the last row or rows of the input data of block 1b, of block 4b and of block 7b.
  • when the convolutional layer is then processed for block 2b, the cached last row or rows of block 1b can be read and deleted, but the last row or rows of the input data of block 2b must in turn be stored in the first on-chip memory, and so on.
  • alternatively, the blocks may be processed in the order block 1b, block 2b, block 3b, block 4b, block 5b, block 6b, block 7b, block 8b, block 9b. When the convolutional layer is processed for block 1b, the last row or rows of the input data of block 1b need to be kept in the first on-chip memory for the convolutional-layer processing of block 2b; when the convolutional layer is then processed for block 2b, the cached rows of block 1b can be read and deleted and the last row or rows of the input data of block 2b stored instead, and so on.
  • in this implementation, therefore, the blocks in the height direction can be traversed preferentially, so that fewer rows of data need to be cached and the pressure on on-chip storage is reduced. A traversal-order sketch follows below.
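The sketch below enumerates blocks height-first, under the assumption that blocks sharing one width/channel position are indexed by a common "group" index; the function and variable names are illustrative and not taken from the patent.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative sketch: visit all height positions of one width/channel group
// before moving to the next group (height-first traversal), so at most one
// block's boundary rows need to be cached at a time.
std::vector<std::pair<int, int>> heightFirstOrder(int numGroups, int numHeightBlocks) {
    std::vector<std::pair<int, int>> order;  // (group index, height index)
    order.reserve(static_cast<std::size_t>(numGroups) * numHeightBlocks);
    for (int g = 0; g < numGroups; ++g) {
        for (int h = 0; h < numHeightBlocks; ++h) {
            order.emplace_back(g, h);
        }
    }
    return order;
}
```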
  • when the convolution processing is the first processing of the convolutional neural network and pooling processing follows, the output result of the convolution can be stored in the first on-chip memory.
  • if the whole output result of one block's convolution can be read for that block's pooling processing, but part of that output is still needed for the pooling processing of another block, then that part of the data is retained (the remaining data can be deleted) until it has been used for the pooling of the other block.
  • alternatively, the data of the blocks may be independent, with no overlap at all; that is, data used by one block is not used again by another block.
  • specifically, when the data is stored in rows (that is, multiple values of a single row are packed into one storage address), if the data of one block includes only part of the values at its last storage address, that block may skip the current-layer processing (for example, of the convolution layer or the pooling layer) of that address and leave some or all of the values at that address to be processed by the other block; or, conversely, the first block may process some or all of the values at that last address while the second block no longer processes the data of that address in the current layer.
  • in other words, when a single row of a single feature of the 3D feature map corresponds to multiple storage addresses and that row belongs to at least two blocks, the data processed in the current layer for each of the at least two blocks covers an integer number of storage addresses, and the data processed by the blocks in the current layer does not overlap at all.
  • this implementation simplifies boundary processing and therefore reduces implementation complexity.
  • the data mentioned here may be the initial input data of the convolutional neural network before any layer has processed it, or the output result of one of the layers.
  • for example, data may be packed 16 values at a time into one storage address, so that reading one storage address yields 16 values, and the data of one storage address never spans two rows. Assuming that each row of data in the 3D feature map has 128 values, each row corresponds to 8 storage addresses.
  • in that case, the data to be processed in the current layer for one block may be the data of 4 of those storage addresses, and the data to be processed in the current layer for the other block may be the data of the other 4 storage addresses.
  • note that the data processed in the current layer for a block may differ from the data included in the pre-divided block. For example, with 128 values per row and an uneven pre-division in which the first block includes 68 values per row and the second block includes 60 values per row, the current layer may actually process 64 values per row for the first block and 64 values per row for the second block.
  • alternatively, the data of each block may itself include only an integer number of storage addresses, with no overlap at all between the data of the at least two blocks. An alignment sketch follows below.
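The following is a small illustrative helper (hypothetical names, assuming 16 values per storage address) for the 68/60 → 64/64 example above: it rounds a pre-divided per-row split down to whole storage addresses so that no storage address is shared by two blocks.

```cpp
// Illustrative sketch: align a pre-divided per-row split to whole 16-value
// storage addresses so that no storage address is split between two blocks.
constexpr int kValuesPerAddress = 16;

inline void alignSplitToAddresses(int rowLength, int firstBlockValues,
                                  int* firstProcessed, int* secondProcessed) {
    // Round the first block's share down to an integer number of addresses.
    const int firstAddresses = firstBlockValues / kValuesPerAddress;
    *firstProcessed = firstAddresses * kValuesPerAddress;   // 68 -> 64
    *secondProcessed = rowLength - *firstProcessed;         // 128 - 64 = 64
}
```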
  • the embodiments of the present application are not limited to the above description.
  • column data may also be cached. For example, the data of the last column or columns of block 1c may be cached for the processing of block 2c. Because the data is stored in rows (that is, multiple values of a single row are packed into one storage address), the cached data for each row is the data of at least one storage address.
  • among the data of block 1c, if the values used for the processing of block 2c belong to one storage address, then 16 columns of data in total are cached for the processing of block 2c; if they belong to two storage addresses, then 32 columns of data in total are cached.
  • row data can also be cached, for example with the segmentation shown in FIG. 6 (a): the data of at least one row of block 1a can be buffered for the processing of block 2a.
  • in that case, for a given column, among the data of block 1a, if the values used for the processing of block 2a belong to one storage address, then 16 rows of data in total can be cached for the processing of block 2a; if they belong to two storage addresses, then 32 rows of data in total can be cached.
  • alternatively, column data can be cached in that case: the data of the last column or columns of block 1c (one column or multiple columns; the number of columns is unrelated to the amount of data in one storage address) can be cached for the processing of block 2c.
  • the way the 3D feature map is segmented may affect the order in which the data of the convolutional neural network is processed.
  • for example, suppose the processing device includes one convolution circuit and one pooling circuit, and each of them can process only one block at a time; the segmentation then determines the data processing order.
  • with the segmentation of FIG. 6 (a), blocks 1a, 2a and 3a may be processed in sequence; with FIG. 6 (b), blocks 1b, 2b and 3b may be processed in sequence; and with FIG. 6 (c), blocks 1c, 2c and 3c may be processed in sequence.
  • for other segmentations the data processing order may differ accordingly.
  • in the case of segmentation in the channel direction of the 3D feature map, consider at least two blocks having the same width position and height position. If the convolution layer has been processed for only some of these blocks, the convolutional-layer output results of those blocks can be stored in an on-chip memory included in the arithmetic circuit (hereinafter referred to as the second on-chip memory).
  • once the convolution results of all of the at least two blocks are available, they are accumulated to obtain the convolutional-layer output corresponding to one convolution kernel, that is, one 2D feature map of the output, which is then written to the first on-chip memory.
  • in one implementation, the convolutional-layer outputs of the blocks that are completed first are stored separately in the second on-chip memory; after the convolution layer has been processed for all of the blocks, the results are accumulated and the accumulated result is output to the first on-chip memory.
  • in another implementation, the convolutional-layer outputs of the first two blocks to be completed are accumulated and the sum is stored in the second on-chip memory; each time the convolution layer of a further block is completed, the previously obtained sum is accumulated with that block's output, the new sum is stored in the second on-chip memory and the previously stored sum is deleted, until the sum covers the convolutional-layer outputs of all of the blocks and is output to the first on-chip memory.
  • for example, the convolution results of block 1b and block 2b can be stored in the second on-chip memory; after the convolution result of block 3b is obtained, the results of blocks 1b and 2b can be read from the second on-chip memory (and deleted from it after reading), the convolution results of blocks 1b, 2b and 3b are accumulated, and the final convolution result is output to the first on-chip memory.
  • alternatively, the convolution result of block 1b may be stored in the second on-chip memory; after the result of block 2b is obtained, the accumulated result of the convolution processing of blocks 1b and 2b is stored in the second on-chip memory and the stored result of block 1b is deleted; after the result of block 3b is obtained, the accumulated result of blocks 1b and 2b is read from the second on-chip memory (and deleted after reading), accumulated with the convolution result of block 3b, and the final convolution result is output to the first on-chip memory. A running-accumulation sketch is given below.
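A minimal sketch of this running accumulation follows; it assumes every channel-direction block produces an output tile of the same spatial size, and the type and function names are illustrative rather than taken from the patent.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: accumulate the partial convolution results of
// channel-direction blocks one at a time, keeping only the latest sum
// (the role played by the second on-chip memory) instead of every partial result.
using Tile = std::vector<float>;  // one output tile; same size for every block

Tile accumulateChannelBlocks(const std::vector<Tile>& partialResults) {
    Tile acc(partialResults.front().size(), 0.0f);  // running sum ("second on-chip memory")
    for (const Tile& partial : partialResults) {
        for (std::size_t i = 0; i < acc.size(); ++i) {
            acc[i] += partial[i];  // the previous sum is overwritten in place
        }
    }
    return acc;  // the final sum is what is written back to the first on-chip memory
}
```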
  • for example, the convolution layer may be processed in the order block 1a, block 4a, block 7a, block 2a, block 5a, block 8a, block 3a, block 6a, block 9a.
  • after the convolution result of block 1a is obtained, it can be stored in the second on-chip memory; after the convolution result of block 4a is obtained, it can also be stored in the second on-chip memory; after the convolution result of block 7a is obtained, the convolution results of blocks 1a and 4a can be read from the second on-chip memory (and deleted from it after reading), and the convolution results of blocks 1a, 4a and 7a are accumulated and output to the first on-chip memory.
  • similarly, the convolution results of blocks 2a and 5a are stored in the second on-chip memory; after the result of block 8a is obtained, the stored results are read (and deleted after reading), and the convolution results of blocks 2a, 5a and 8a are accumulated and output to the first on-chip memory. Likewise, the convolution results of blocks 3a and 6a are stored in the second on-chip memory; after the result of block 9a is obtained, they are read (and deleted after reading), and the convolution results of blocks 3a, 6a and 9a are accumulated and output to the first on-chip memory.
  • alternatively, with the same order block 1a, block 4a, block 7a, block 2a, block 5a, block 8a, block 3a, block 6a, block 9a, a running accumulation can be used: the convolution result of block 1a is stored in the second on-chip memory; after the result of block 4a is obtained, the accumulated result of the convolution processing of blocks 1a and 4a is stored in the second on-chip memory and the stored result of block 1a is deleted; after the result of block 7a is obtained, the accumulated result of blocks 1a and 4a is read from the second on-chip memory (and deleted after reading), accumulated with the convolution result of block 7a, and output to the first on-chip memory.
  • similarly, after the result of block 2a is obtained it is stored in the second on-chip memory; after the result of block 5a is obtained, the accumulated result of blocks 2a and 5a is stored and the stored result of block 2a is deleted; after the result of block 8a is obtained, the accumulated result of blocks 2a and 5a is read (and deleted after reading), accumulated with the convolution result of block 8a, and output to the first on-chip memory.
  • likewise, after the result of block 3a is obtained it is stored in the second on-chip memory; after the result of block 6a is obtained, the accumulated result of blocks 3a and 6a is stored and the stored result of block 3a is deleted; after the result of block 9a is obtained, the accumulated result of blocks 3a and 6a is read (and deleted after reading), accumulated with the convolution result of block 9a, and output to the first on-chip memory.
  • alternatively, the convolution layer may be processed in the order block 1a, block 2a, block 3a, block 4a, block 5a, block 6a, block 7a, block 8a, block 9a.
  • the convolution layer of each block is processed in turn; after the convolutional-layer results of blocks 1a, 2a, 3a, 4a, 5a and 6a have been obtained, they can be stored separately in the second on-chip memory. After the convolutional-layer result of block 7a is obtained, the convolution results of blocks 1a and 4a can be read from the second on-chip memory (and deleted after reading) and the results of blocks 1a, 4a and 7a accumulated and output to the first on-chip memory; after the result of block 8a is obtained, the same is done for blocks 2a, 5a and 8a; and after the result of block 9a is obtained, the convolution results of blocks 3a and 6a can be read from the second on-chip memory (and deleted after reading) and the results of blocks 3a, 6a and 9a accumulated and output to the first on-chip memory.
  • alternatively, with the same order block 1a, block 2a, ..., block 9a, a running accumulation can be used: the convolution results of blocks 1a, 2a and 3a are stored separately in the second on-chip memory; after the result of block 4a is obtained, the accumulated result of blocks 1a and 4a is stored and the stored result of block 1a deleted, and similarly for blocks 2a and 5a and for blocks 3a and 6a.
  • after the convolutional-layer result of block 7a is obtained, the accumulated result of blocks 1a and 4a is read (and deleted after reading), accumulated with the result of block 7a and output to the first on-chip memory; after the result of block 8a is obtained, the accumulated result of blocks 2a and 5a is accumulated with the result of block 8a and output to the first on-chip memory, and the stored accumulated result of blocks 2a and 5a is deleted; after the result of block 9a is obtained, the accumulated result of blocks 3a and 6a is accumulated with the result of block 9a and output to the first on-chip memory, and the stored accumulated result of blocks 3a and 6a is deleted.
  • therefore, if the channel direction is traversed preferentially (specifically, all blocks that have the same height position and/or width position but lie at different channel positions are processed first, and only then are all other blocks with another same height and/or width position at different channel positions processed), the convolution results of fewer blocks need to be cached in the second on-chip memory.
  • accordingly, the amount of second on-chip memory occupied by the storage required for the accumulation of the convolution results and the amount of first on-chip memory occupied by the row cache can be weighed against each other to decide whether to traverse the channel direction or the height direction first; similarly, they can be weighed to decide whether to traverse the channel direction or the width direction first.
  • the storage capacity of the second on-chip memory included in the arithmetic circuit may also affect the division of the block. For example, if the storage capacity of the second on-chip memory is small, the division may not be performed in the channel direction.
  • the division direction of a block may be a height direction and / or a width direction, excluding a channel direction.
  • a block may further be divided into at least two sub-blocks in the channel direction. Taking the processing of the current layer to be the processing of the convolution layer, the following two implementations are then possible.
  • in the first implementation, when the convolution layer has been processed for some of the at least two sub-blocks, the convolutional-layer outputs of those sub-blocks are stored separately in the second on-chip memory; after the convolution layer has been processed for all of the at least two sub-blocks, their results are accumulated and output to the second storage space.
  • in the second implementation, the convolutional-layer outputs of the sub-blocks that are completed first are accumulated and the sum is stored in the second on-chip memory included in the arithmetic circuit; each time the convolution layer of another sub-block is completed, the previously obtained sum is accumulated with that sub-block's output, the new sum is stored in the second on-chip memory and the previously stored sum is deleted, until the sum covers the convolutional-layer outputs of all of the at least two sub-blocks, and the result is stored to the first on-chip memory.
  • in addition, the reading mode of the input data may affect when data in the first on-chip memory can be released. The following discussion assumes that the data of a block is released by rows, by columns, or by storage address.
  • if block division is performed in the width direction and not in the height direction (for example, as shown in FIG. 6 (c)), the first on-chip memory needs to store the data of at least one column of block 1c for the processing of block 2c, and the data can be read row by row.
  • if block division is performed in the height direction and not in the width direction (for example, as shown in FIG. 6 (a)), the first on-chip memory needs to store the data of at least one row of block 1a for the processing of block 2a.
  • in that case, if the data is read column by column and the sliding step is 1, the data of the next column can only be processed after one column of block 1a has been traversed, and only the cached data of the at least one row that belongs to that column is released; if instead the data of the at least one row is traversed first and the sliding step is 1, the data of the at least one row can then be released as a whole.
  • the data can therefore be read in a rows-first, then-columns manner in that case.
  • when the data is read in a rows-first manner, the 3D feature map is preferably divided into blocks in the height direction and not in the width direction; otherwise the boundary processing is more complicated (that is, the data of one storage address mentioned above may be split across two blocks).
  • when the data is read in a columns-first manner, the 3D feature map is preferably divided into blocks in the width direction and not in the height direction; otherwise the boundary processing is likewise more complicated (that is, the data of one storage address may be split across two blocks).
  • when the data of each block in the on-chip memory is released, it may be released by rows, by columns, or according to the address of the storage space; however, the embodiment of the present application is not limited to this, and the data may also be released by blocks.
  • releasing by blocks means that the on-chip storage space of a block is released only after the data of that whole block has been processed. This release method reduces the complexity of control.
  • the block division method, reading order and storage-space reuse method described above may be preset on the processing device, or may be determined by the processing device according to the specific situation, for example according to the convolutional neural network actually used.
  • for example, the size of the block to be read by the arithmetic circuit may be preset for reading data, together with the times at which data is read and results are output; for the DMA 130, the time to read data from the SRAM 140, the address to read from, the time to write data and the address to write to may be preset.
  • the presetting can be performed by the control circuit 110, which reads instructions from the DDR and configures the corresponding operations of the first arithmetic circuit 122, the second arithmetic circuit 124 and the DMA 130. Alternatively, the control circuit 110 may also control the other circuits in real time.
  • in summary, the 3D feature map is read in blocks and processed by the convolutional neural network block by block, so that the processing of the 3D feature map can be carried out even when on-chip storage resources or processing power are insufficient.
  • FIG. 11 is a schematic block diagram of a convolutional neural network-based image processing apparatus 500 according to an embodiment of the present application.
  • the device 500 includes:
  • a reading unit 510, configured to read the three-dimensional (3D) feature map from the first on-chip memory block by block, where the first on-chip memory includes S first storage spaces, and each of the S first storage spaces is used to store the current-layer input data of one of the L blocks included in the 3D feature map; after the input data of one of the L blocks stored in one of the first storage spaces has been read, the input data of another one of the L blocks is stored in that first storage space;
  • a processing unit 520, configured to perform the processing of the current layer of the convolutional neural network on the 3D feature map block by block;
  • a storage unit 530, configured to store the output result of the current layer to the first on-chip memory, where the first on-chip memory further includes R second storage spaces, and each of the R second storage spaces is used to store the current-layer output data of one of the L blocks; after the output data of one of the L blocks stored in one of the second storage spaces has been read, the output data of another one of the L blocks is stored in that second storage space;
  • where L, S and R are integers greater than or equal to 2, and S and R are smaller than L; a buffer-rotation sketch is given below.
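The following sketch illustrates one way such S input spaces and R output spaces could be rotated over the L blocks; the names and the simple round-robin assignment are illustrative assumptions, not the patent's scheme.

```cpp
#include <vector>

// Illustrative sketch: assign the L blocks round-robin to S input storage
// spaces and R output storage spaces; a space is reused for a new block only
// after the data it currently holds has been completely read.
struct BufferPlan {
    std::vector<int> inputSpace;   // inputSpace[b]  = which of the S first storage spaces block b uses
    std::vector<int> outputSpace;  // outputSpace[b] = which of the R second storage spaces block b uses
};

BufferPlan planStorageSpaces(int L, int S, int R) {
    BufferPlan plan;
    plan.inputSpace.resize(L);
    plan.outputSpace.resize(L);
    for (int b = 0; b < L; ++b) {
        plan.inputSpace[b] = b % S;   // block b+S overwrites block b's input space once it has been read
        plan.outputSpace[b] = b % R;  // likewise for the output spaces
    }
    return plan;
}
```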
  • the number of arithmetic circuits included in the processing unit 520 that perform the processing of the current layer may be smaller than S.
  • the output result of the current layer is stored in the second storage space and remains there until it is read by the next layer.
  • the storage unit 530 is further configured to:
  • the output result of the current layer is stored in an off-chip memory.
  • the time for reading the input data of the (i+1)-th layer from the first on-chip memory, plus the computation time of the (i+1)-th layer, plus the time for writing the output data of the (i+1)-th layer into the first on-chip memory, is less than or equal to the time for reading the input data of the i-th layer from the first on-chip memory, plus the computation time of the i-th layer, plus the time for writing the output data of the i-th layer into the first on-chip memory, where the processing of the convolutional neural network includes n layers and i ranges from 1 to n. This condition is written out as a formula below.
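Using notation introduced here only for readability (not the patent's symbols), with $T^{\mathrm{rd}}_i$, $T^{\mathrm{cmp}}_i$ and $T^{\mathrm{wr}}_i$ denoting the read, computation and write times of the $i$-th layer with respect to the first on-chip memory, the condition can be written as

$$T^{\mathrm{rd}}_{i+1} + T^{\mathrm{cmp}}_{i+1} + T^{\mathrm{wr}}_{i+1} \;\leq\; T^{\mathrm{rd}}_{i} + T^{\mathrm{cmp}}_{i} + T^{\mathrm{wr}}_{i},$$

so that the stage consuming a layer's output keeps pace with the stage producing it and the shared storage spaces can be reused without stalling.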
  • when the input data used for the current-layer processing of the first block of the L blocks also needs to be used for the current-layer processing of another block, the input data is kept in the first storage space until it has been used for the processing performed on the other block.
  • in that case, S is greater than or equal to 3.
  • the data that needs to be used both for the processing of the first block and for the processing of the other block includes an integer number of rows;
  • the plurality of blocks are obtained by dividing the 3D feature map in the height direction and not dividing it in the width direction;
  • the input data is read in a rows-first, then-columns manner.
  • the processing unit 520 is further configured to:
  • when the directions for segmenting the 3D feature map include at least two directions and the at least two directions include the height direction, process first all blocks that have the same width position and channel position but lie at different height positions, and then process all other blocks that share another width position and channel position and lie at different height positions.
  • a direction in which the 3D feature map is divided into the L blocks includes a width direction and / or a height direction.
  • the first block of the L blocks is divided into at least two sub-blocks in the channel direction, and the processing of the current layer is the processing of the convolution layer;
  • the processing unit 520 is further configured to:
  • when the convolution layer has been processed for some of the at least two sub-blocks, store the convolutional-layer outputs of those sub-blocks separately in a second on-chip memory included in the arithmetic circuit, and after the convolution layer has been processed for all of the at least two sub-blocks, accumulate their results and output the accumulated result to the second storage space; or,
  • accumulate the convolutional-layer outputs of the sub-blocks that are completed first and store the sum in a second on-chip memory included in the arithmetic circuit;
  • each time the convolution layer of another sub-block is completed, accumulate the previously obtained sum with that sub-block's convolutional-layer output, store the new sum in the second on-chip memory and delete the previously stored sum, until the sum covers the convolutional-layer outputs of all of the at least two sub-blocks, and store the result in the first on-chip memory.
  • the processing unit 520 is further configured to:
  • the size of each of the plurality of blocks is determined based on the storage capacity available in the first on-chip memory and / or parameters used by the processing of the convolutional neural network.
  • the first on-chip memory is a static random access memory (SRAM).
  • the processing of the convolutional neural network includes a convolutional layer processing and a pooling layer processing.
  • the device 500 is implemented by a field programmable gate array FPGA or an application specific integrated circuit ASIC.
  • the image processing device 500 may implement the corresponding operations implemented by the processing device in the method 300 or 400; for brevity, details are not described here again.
  • the image processing device may be implemented by software, hardware, or a combination of software and hardware, which is not specifically limited in the embodiment of the present application.
  • FIG. 12 is a schematic block diagram of an image processing apparatus 600 based on a convolutional neural network according to an embodiment of the present application.
  • the device 600 includes a first on-chip memory 610 and an arithmetic circuit 620.
  • the arithmetic circuit 620 is configured to read the 3D feature map from the first on-chip memory 610 block by block, perform the processing of the current layer of the convolutional neural network on the 3D feature map block by block, and store the output result of the current layer to the first on-chip memory 610;
  • the first on-chip memory 610 includes S first storage spaces, and each of the S first storage spaces is used to store the current-layer input data of one of the L blocks included in the 3D feature map; after the input data of one of the L blocks stored in one of the first storage spaces has been read, the input data of another one of the L blocks is stored in that first storage space;
  • the first on-chip memory 610 further includes R second storage spaces, and each of the R second storage spaces is used to store the current-layer output data of one of the L blocks; after the output data of one of the L blocks stored in one of the second storage spaces has been read, the output data of another one of the L blocks is stored in that second storage space;
  • where L, S and R are integers greater than or equal to 2, and S and R are smaller than L.
  • the number of arithmetic circuits 620 that perform the processing of the current layer may be smaller than S.
  • the output result of the current layer is stored in the second storage space and remains there until it is read by the next layer.
  • the device 600 further includes a direct memory access DMA 640 for:
  • the output result of the current layer is stored in an off-chip memory.
  • the time for reading the input data of the (i+1)-th layer from the first on-chip memory 610, plus the computation time of the (i+1)-th layer, plus the time for writing the output data of the (i+1)-th layer into the first on-chip memory 610, is less than or equal to the time for reading the input data of the i-th layer from the first on-chip memory 610, plus the computation time of the i-th layer, plus the time for writing the output data of the i-th layer into the first on-chip memory 610, where the processing of the convolutional neural network includes n layers and i ranges from 1 to n.
  • when the input data used for the current-layer processing of the first block of the L blocks also needs to be used for the current-layer processing of another block, the input data is kept in the first storage space until it has been used for the processing performed on the other block.
  • in that case, S is greater than or equal to 3.
  • the data that needs to be used both for the processing of the first block and for the processing of the other block includes an integer number of rows;
  • the plurality of blocks are obtained by dividing the 3D feature map in the height direction and not dividing it in the width direction;
  • the input data is read in a rows-first, then-columns manner.
  • the operation circuit 620 is further configured to:
  • when the directions for segmenting the 3D feature map include at least two directions and the at least two directions include the height direction, process first all blocks that have the same width position and channel position but lie at different height positions, and then process all other blocks that share another width position and channel position and lie at different height positions.
  • a direction in which the 3D feature map is divided into the L blocks includes a width direction and / or a height direction.
  • the first block of the L blocks is divided into at least two sub-blocks in the channel direction, and the processing of the current layer is the processing of the convolution layer;
  • the arithmetic circuit 620 is further configured to:
  • when the convolution layer has been processed for some of the sub-blocks, store the convolutional-layer outputs of those sub-blocks separately in a second on-chip memory included in the arithmetic circuit 620, and after the convolution layer has been processed for all of the at least two sub-blocks, accumulate their results and output the accumulated result to the second storage space; or,
  • accumulate the convolutional-layer outputs of the sub-blocks that are completed first and store the sum in a second on-chip memory included in the arithmetic circuit 620; each time the convolution layer of another sub-block is completed, accumulate the previously obtained sum with that sub-block's convolutional-layer output, store the new sum in the second on-chip memory and delete the previously stored sum, until the sum covers the convolutional-layer outputs of all of the at least two sub-blocks, and store the result in the first on-chip memory 610.
  • the device 600 further includes a control circuit 630 for:
  • the size of each of the plurality of blocks is determined based on the storage capacity available in the first on-chip memory 610 and / or parameters used by the processing of the convolutional neural network.
  • the first on-chip memory 610 is a static random access memory SRAM.
  • the processing of the convolutional neural network includes a convolutional layer processing and a pooling layer processing.
  • the device 600 is implemented by a field programmable gate array FPGA or an application specific integrated circuit ASIC.
  • image processing device 600 may implement corresponding operations implemented by the processing device in the method 300 or 400. For brevity, details are not described herein again.
  • the image processing device 600 may correspond to the processor 100 shown in FIG. 4; for the sake of brevity, this is not repeated here.
  • the image processing apparatus 500 or 600 of the embodiments of the present application may be used in a drone.
  • FIG. 13 is a schematic block diagram of a drone 700 according to an embodiment of the present application.
  • the drone 700 may include a power system 710, a sensing system 720, and a processor 730.
  • the power system 710 provides power to the drone 700 under the control of the processor 730;
  • the sensing system 720 includes a camera 722 for capturing image frames; the processor 730 is configured to generate a 3D feature map based on the image frames captured by the camera 722, read the 3D feature map block by block (the 3D feature map including multiple blocks), and perform the processing of the convolutional neural network on the 3D feature map block by block.
  • the result of the convolutional neural network processing can be used for image recognition, which in turn can be used to control the drone's flight.
  • the camera 722 may also be referred to as a camera component, or the camera may be a part of a camera component included in the drone for acquiring image frames.
  • the processor 730 may be used to implement the image processing method in the foregoing method embodiments. For brevity, details are not described herein again.
  • the processor 730 may be placed in a flight controller.
  • the processor 730 may be composed of multiple processors. For example, one processor may be used to control the flight of the drone, and one processor may be used to perform the processing of the convolutional neural network mentioned in the embodiment of the present application.
  • the drone may further include an off-chip memory 740, which stores data input to the processor 730 and may store data output by the processor 730.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method and device based on a convolutional neural network, and an unmanned aerial vehicle, capable of implementing the computation of a convolutional neural network when the processing capability of the processing device is limited or the on-chip storage resources are limited. The method includes: reading a three-dimensional (3D) feature map block by block, where the 3D feature map includes multiple blocks; and performing convolutional neural network processing on the 3D feature map block by block.

Description

基于卷积神经网络的图像处理方法和设备,以及无人机
版权申明
本专利文件披露的内容包含受版权保护的材料。该版权为版权所有人所有。版权所有人不反对任何人复制专利与商标局的官方记录和档案中所存在的该专利文件或者该专利披露。
技术领域
本申请涉及图像处理领域,并且更具体地,涉及一种基于卷积神经网络的图像处理方法和设备。
背景技术
卷积神经网络(Convolutional Neural Network,CNN)是一种人工神经网络,在图像识别等领域有着广泛的应用。典型的CNN包括卷积层、池化层、激活层以及全连接层等,上一层根据输入的数据进行相应的运算,将运算结果输出给下一层,输入的初始数据经过多层的运算之后得到一个最终的结果。
目前的CNN中,每一层在进行相应的运算之后,将结果存储在片外存储器中,例如存储在双倍速率(Double Data Rate,DDR)存储器中,下一层从片外存储器中读取上一层的输出结果,并存储到片上存储器中,然后进行运算。这需要较多的片上存储资源和较强的处理能力。
因此,如何在处理设备的处理能力有限或者片上存储资源有限的情况下,实现卷积神经网络的计算是一项亟待解决的问题。
发明内容
本申请实施例提供一种基于卷积神经网络的图像处理方法和设备以及无人机,可以在处理设备的处理能力有限或者片上存储资源有限的情况下,实现卷积神经网络的计算,并且可以节省存储空间,提高处理效率。
第一方面,提供了一种基于卷积神经网络的图像处理方法,包括:按块从第一片上存储器读取3D特征图,所述3D特征图分为L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个 所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块作为神经网络当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块;按块对所述3D特征图进行卷积神经网络的所述当前层的处理;将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输出数据;其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
第二方面,提供了一种基于卷积神经网络的图像处理设备,包括:读取单元,用于按块从第一片上存储器读取3D特征图,所述3D特征图分为L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块作为神经网络当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块;处理单元,用于按块对所述3D特征图进行卷积神经网络的所述当前层的处理;存储单元,用于将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输出数据;其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
第三方面,提供了一种基于卷积神经网络的图像处理设备,包括第一片上存储器和运算电路;其中,所述运算电路用于:按块从第一片上存储器读取3D特征图,所述3D特征图分为L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块作为神经网络当前层的 输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块;按块对所述3D特征图进行卷积神经网络的所述当前层的处理;将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输出数据;其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
第四方面,提供了一种无人机,包括根据第二方面或第三方面所述的基于卷积神经网络的图像处理设备。
因此,在本申请实施例中,按块从第一片上存储器读取3D特征图,按块对所述3D特征图进行卷积神经网络的当前层的处理,以及将所述当前层的输出结果存储到所述第一片上存储器,按块处理需要较少的片上存储资源和对运算电路的处理能力要求较低,可以在片上存储资源或处理能力不足的情况下,实现对3D特征图的处理,并且进一步地,3D特征图包括的块的数量为L,第一片上存储器包括S个第一存储空间,以及包括R个第二存储空间,其中,S和R小于L,每个第一存储空间分别用于存储一个块的当前层的输入数据,以及每个第二存储空间分别用于存储一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储另一块的输入数据,以及在其中一个所述第二存储空间上存储的一个块的输出数据被读取完毕之后,在所述一个所述第二存储空间上存储另一块的输出数据,可以实现存储空间的重复利用,从而可以节省存储空间,以及由于S和R大于或等于2,可以保证处理的流水线式工作,提高处理效率。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造 性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是根据本申请实施例的卷积神经网络的架构的示意性图。
图2是根据本实施例的3D特征图的示意性图。
图3是根据本申请实施例的池化运算的示意性图。
图4是根据本申请实施例的卷积神经网络的***的架构图。
图5是根据本申请实施例的基于卷积神经网络的图像处理方法的示意性图。
图6是根据本申请实施例的3D特征图的分割方式的示意性图。
图7是根据本申请实施例的3D特征图的分割方式的示意性图。
图8是根据本申请实施例的基于卷积神经网络的图像处理方法的示意性流程图。
图9是根据本申请实施例的第一片上存储器包括的存储空间的存储流水的示意性图。
图10是根据本申请实施例的第一片上存储器包括的存储空间的存储流水的示意性图。
图11是根据本申请实施例的基于卷积神经网络的图像处理设备的示意性图。
图12是根据本申请实施例的基于卷积神经网络的图像处理设备的示意性图。
图13是根据本申请实施例的无人机的示意性图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
除非另有说明,本申请实施例所使用的所有技术和科学术语与本申请的技术领域的技术人员通常理解的含义相同。本申请中所使用的术语只是为了描述具体的实施例的目的,不是旨在限制本申请的范围。
卷积神经网络是一种人工神经网络,在图像识别等领域有着广泛的应用。 卷积神经网络可以包括输入层、隐藏层和输出层,其中,隐藏层可以包括卷积层、池化(pooling)层、激活层和全连接层等,具体可以如图1所示。
卷积神经网络的各层可以对上一层输出的特征图进行处理(例如,卷积、池化、激活或全连接处理)得到当前层输出的特征图。其中,本申请实施例提到的特征图可以是三维(3D)特征图。其中,可以将3D特征图称为3D特征矩阵。
3D特征图可以理解为多个二维(2D)特征图像堆叠在一起,此处可以将一个2D特征图像称为一个特征(feature),其中,每个2D特征图像可以分别对应一个图像帧的一个通道,3D特征图可以由一个图像帧得到,也可以有多个图像帧得到,在由一个图像帧得到时,3D特征图的厚度(也即,2D特征图的数量)可以等于图像帧的通道的数量,例如R、G、B三个通道,此处可以将通道称为特征,通道的数量可以理解为特征的数量。
例如,如图2所示,3D特征图的大小为W×H×M,其中,W可以代表宽度方向,H可以代表高度方向,M代表通道方向(也可以称为深度方向或厚度方向),W×H可以代表2D特征图。
应理解,本申请实施例中的特征还可以具有其他的解释,而非图像帧的通道的表征,本申请实施例不作具体的限定。
还应理解,图1所示的卷积神经网络的架构仅是用于示例性说明,本申请实施例的卷积神经网络还可以具有其他的架构。例如,卷积神经网络不包括激活层,或者激活层可以位于池化层之前等。
为了便于理解,以下将对卷积神经网络的各层的处理进行解释说明。
卷积层的卷积操作可以为利用卷积核(可以为3D卷积核,卷积核也可以被称为滤波器)和3D特征图进行运算后输出一个2D特征图,该运算可以为3D特征图的特征值与卷积核的权值做内积操作。其中,可以采用多个卷积核分别和3D特征图进行运算,则可以得到一个输出的3D特征图,该多个卷积核的大小可以是相同的,但是参数可以是不同的,卷积核的通道方向的大小(也即特征的数量)可以与3D特征图的通道方向的大小相同。
卷积层的卷积运算可以采用滑动卷积核的方式进行,以3D特征图的左上角为起点,滑动卷积核到3D特征图的右下角,产生一个2D特征图,其中,每次滑动卷积核后,运算装置都会从3D特征图中提取一个与卷积核大 小相同的3D特征矩阵,将其与卷积核进行内积操作,产生一个输出特征值。在利用多个卷积核执行上述操作之后,可以输出一个3D特征图。
其中,卷积层输出的3D特征图在宽度方向上的大小可以为
$$\left\lfloor \frac{w_0 + p_0 - k_0}{s_0} \right\rfloor + 1$$
其中w 0代表卷积处理输入的3D特征图在宽度方向上的大小,p 0代表卷积处理时3D特征图在宽度方向上填充的数据量,k 0代表卷积处理的卷积核在宽度方向上的大小,s 0代表卷积处理的卷积核在宽度方向上滑动的步长。
卷积层输出的3D特征图在高度方向上的大小可以为
$$\left\lfloor \frac{H_0 + p_1 - k_1}{s_1} \right\rfloor + 1$$
其中H 0代表卷积处理输入的3D特征图在高度方向上的大小,p 1代表卷积处理时3D特征图在高度方向上填充的数据量,k 1代表卷积处理的卷积核在高度方向上的大小,s 1代表卷积处理的卷积核在高度方向上滑动的步长。
卷积层输出的3D特征图在通道方向上的大小可以等于采用的卷积核的数量。
池化层的池化操作也可以称下采样(down-samples)操作,其目的是为减少特征映射,当在面临计算量非常大的时候,一个拥有过多特征输入的分类器不易形成,并且容易过拟合。由于卷积后的特征是一种静态属性,所以在两个不同图像区域的特征极可能一样,因此,描述大图像的时候可以对不同位置特征使用聚合统计。池化可以采用滑动窗口的方式,以输入的3D特征图的每个特征的左上角为起点,按照一定的步长,依次滑动窗口到该特征的右下角,产生一个2D特征图。依照上述方式,依次产生所有特征对应的2D特征图后,便可得到该池化层输出的3D特征图。池化常用的运算一般有:最大池化(Max Pooling)、均值池化(Mean Pooling)、高斯池化和可训练池化。
例如,如图3所示,池化窗口为2×2,步长为2,每个最大池化操作可以为对四个数操作后分别获得一个值。
其中,池化层输出的3D特征图在宽度方向上的大小可以为
$$\left\lfloor \frac{w_1 + p_2 - k_2}{s_2} \right\rfloor + 1$$
其中w 1代表池化处理输入的3D特征图在宽度方向上的大小,p 2代表池化处理时3D特征图在宽度方向上填充的数据量,k 2代表池化处理的窗口在宽度方向上的大小,s 2代表池化处理的窗口在宽度方向上滑动的步长。
池化层输出的3D特征图在高度方向上的大小可以为
$$\left\lfloor \frac{H_1 + p_3 - k_3}{s_3} \right\rfloor + 1$$
其 中H 1代表池化处理输入的3D特征图在高度方向上的大小,p 3代表池化处理时3D特征图在高度方向上填充的数据量,k 3代表池化处理的窗口在高度方向上的大小,s 3代表池化处理的窗口在高度方向上滑动的步长。
池化层输出的3D特征图在通道方向上的大小可以等于池化层输入的3D特征图在通道方向上的大小,也即池化操作的结果可以使得3D特征图的特征数量保持不变。
在激活层的激活操作中,针对3D特征图,可以采用特定的激活函数进行点对点的映射,得到激活层的输出的3D特征图。
在CNN中,在输入的3D特征图经过卷积层、池化层和激活层之后,可以进入全连接层,可以将3D特征图映射为一个长的输入向量并进入输出层。
应理解,以上介绍的各层的操作仅是可用的一种实现方式,仅用于更好地理解本申请,各层操作还可以有其他的实现方式,为了简洁,本申请实施例对此不再赘述。
卷积神经网络的处理可以由处理器来实现,例如可以由现场可编程门阵列(Field Programmable Gate Array,FPGA)或特定应用的集成电路(Application Specific Integrated Circuit,ASIC)实现。
但应理解,本申请实施例并不限于此。
以下结合图4描述本申请实施例的实现卷积神经网络的***架构图,其中,实现卷积神经网络的***可以包括处理器100和片外存储器200。其中,可以将处理器100称为加速器。
如图4所示,处理器100可以包括控制电路110、第一运算电路122、第二运算电路124、直接内存存取(Direct Memory Access,DMA)130和作为片上存储器的静态随机存取存储器(Static Random-Access Memory,SRAM)140。
其中,控制电路110可以控制第一运算电路122和第二运算电路124的运算(例如,运算的数据的大小以及运算的时序等),控制DMA130的读取时间和读取地址,使DMA130将数据从外部存储器200读入到SRAM140中或从数据从SRAM140写出到外部存储器200,其中,控制电路110可以从片外存储器200中读取指令,用于实现对第一运算电路122和第二运算电路124和DMA130的控制。
第一运算电路122和第二运算电路124可以实现卷积神经网络的相应层的处理,一个运算电路可以实现一个层的运算,一个层的运算可以由多个运算电路并行实现。第一运算电路122和第二运算电路124可以从SRAM 140中读取数据进行相应层的运算,以及可以将运算结果输出到SRAM140中进行存储。第一运算电路122和第二运算电路124内可以包括区分于SRAM的片上存储器,用于存储第一运算电路122和第二运算电路124中的数据,例如,第一运算电路122和第二运算电路124得到的中间结果。
DMA130可以从片外存储器200中读取数据(例如,可以用于第一运算电路122和第二运算电路124的运算的数据),并存储到SRAM140中,或者,可以从SRAM140中读取数据(例如,第一运算电路122和第二运算电路124的输出的运算结果),并将数据存储到片外存储器200中。
应理解,图4示出的第一运算电路122和第二运算电路124可以进行同一层的处理,也可以进行不同层的处理。处理器100还可以包括其他数量的运算电路,本申请实施例对此不作具体限定。
应理解,图4所示的***仅仅是本申请实施例的一种实现方式,不应对本申请实施例构成特别的限定。
在卷积神经网络的运算中,每一层在进行相应的运算之后,如果将输出结果存储在片外存储器中,则需下一层从片外存储器中读取上一层的输出结果,,这将导致***需要反复从片外存储器上读取数据,占用***带宽。
或者,如果当前层的输出结果直接输出到下一层,不占用任何存储空间,则当前层的运算电路需要等到下一层的运算电路空闲之后才能将输出结果输出给下一层的运算电路,这种方式加速器整体效率偏低,对电路的设计要求较高,且灵活性不足。
因此,可以将卷积神经网络的3D特征图分割为多个块,按块对3D特征图进行基于卷积神经网络的处理。具体的执行流程例如可以如图5所示。其中,图5所示的方法可以由处理设备来实现,该处理设备可选地可以包括图4所示的处理器100。
可选地,该处理设备可以包括各层的运算电路,各层运算电路可以按照图5所示的方法进行相应层的处理。
或者,该处理设备可以包括控制电路和各层的运算电路,该控制电路可 以控制各层的运算电路按照图5所示的方法进行相应层的处理。
或者,该处理设备可以包括控制单元而不包括运算电路,此时,320中的进行基于卷积神经网络的至少两层处理可以是指控制各层的运算电路进行处理。
可选地,本申请实施例中的处理设备可以由FPGA或ASIC实现。由于FPGA或ASCI属于专用集成电路,其可以通过定制硬件加速器实现特定的功能,处理更高效。
但应理解,本申请实施例并不限于此。
在310中,处理设备可以按块读取3D特征图,其中,所述3D特征图包括多个块。
按块读取3D特征图可以是从片外存储器中读取各块包括的数据(此时,可以将读取的块的数据存储到第一片上存储器中),也可以是从第一片上存储器中读取各块包括的数据。本申请实施例提到的第一片上存储器可以是SRAM。
第一片上存储器可以是二维的,例如存储形式可以为4096×128b,3D特征图的存储(例如,读取还未进行卷积神经网络处理的数据或者经过处理得到的中间输出结果)可以是在2D空间上的扩展,具体可以为每个特征分别引入一个地址,以实现3D空间的访问。
应理解,在本申请实施例中,在特征的数量为1时,该3D特征图的存储可以按照2D的方式进行存储。
此处提到的3D特征图可以未经过卷积神经网络的隐藏层的任一层的处理,或者,也可以已经经过隐藏层的至少一层的处理。
在320中,处理设备可以按块对所述3D特征图进行卷积神经网络的处理。
可选地,按块对所述3D特征图进行的处理可以是按块分别进行同一层的处理。
此时,可以存在一个运算电路,该一个运算电路可以按顺序处理多个块,也即在进行一个块的处理之后,可以进行下一块的处理。或者,也可以存在至少两个运算电路,分别执行该多个块的处理。
可选地,在本申请实施例中,可以按块对所述3D特征图进行卷积神经 网络的至少两层的处理。
其中,针对每层处理,可以存在一个运算电路,也可以存在多个运算电路,此时,该多个运算电路可以并行进行该层的处理。
本申请实施例对3D特征图按块读取和进行卷积神经网络的处理,可以在片上存储资源或处理能力不足的情况下,实现对3D特征图的处理。
例如,如果第一片上存储器的存储资源不足,则可以按块读取3D特征图,并将读取的块存储到第一片上存储器,则此时片上只需要存储单个块的输入数据。假设3D特征图在通道方向上被划分为了多个块,则此时每次可以从片外存储器中读取3D特征图的部分特征的数据,存储在第一片上存储器上,然后进行卷积或池化等处理。
再例如,如果单个运算电路的处理能力有限,则单个运算电路可以按块进行运算处理。
可选地,在对每个块进行处理时,将当前层的输出结果存储到第一片上存储器中,一直到被下一层读取。
具体地,各层的运算电路在进行相应层的处理之后,可以将输出结果存储到第一片上存储器中,以及该输出结果不再从第一片上存储器存储到片外存储器中,下一层的运算电路可以从该第一片上存储器中读取由上一层的运算电路在第一片上存储器中输出的运算结果,以进行相应的运算。
例如,用于卷积处理的运算电路可以按块将卷积层的输出结果存储到第一片上存储器中,用于池化处理的运算电路可以在第一片上存储器中读取卷积层存储该输出结果,并按块进行池化层的计算。
本申请实施例提出可以将当前层的输出结果存储到第一片上存储器中,然而考虑到由于第一片上存储器的可用存储空间一般较小,如果待存储的数据量较大,将无法实现存储。
例如,假设CNN的输入数据为W=224,H=224和M=128的224×224×128的3D特征图,以及假设当前网络的隐藏层包括卷积层和池化层。
假设卷积核的数量为128个,卷积核的大小为3×3×128,步长为1,进行卷积层的处理时没有元素填充(no padding),则卷积的输出结果为222×222×128的3D特征图。以及,假设需要进行窗口为3×3的最大池化,步长为1,进行池化层的处理时没有元素填充,则池化的输出结果为220×220×128的3D特征图。
基于以上的卷积和池化运算,则需要从存储器中读取224×224×128的数据,以及需要将220×220×128的数据输出到存储器中。
针对以上的各步操作,可以得到以下表1中的存储容量。
表1
[表1为原文中的图片,内容未能恢复:其列出了上述输入数据、卷积层参数、卷积输出结果和池化输出结果各自所需的存储容量。]
其中,在以上表1中,“16B对齐,222向上取整为224”或者“16B对齐,220向上取整为224”意味着,在存储过程中,每16个数进行打包存储,具有一个存储地址,此时每行的存储数据需要按照16的倍数进行存储,在每行的数据不够16的情况下,可以填充一些无效数据使得行的数据为16的倍数,例如,无效数据的取值可以为0为255等。此处提到的行也即H=1的情况下,2D特征图上所包含的数据,一行数据的数据量可以等于W。
应理解,以上以每16个数据进行打包存储为例进行的说明,但应理解,本申请实施例并不限于此,也可以以其他数量的数据进行打包存储,例如,可以以每8个数据进行打包存储,其中,每次打包存储的数据的数据量可以基于存储资源而定。
从上面的计算结果可以看出,除了卷积层的参数之外,其他均不能存储到片上可用空间为512KB的存储器中。
因此,本申请实施例中,按块对3D特征图进行卷积神经网络的至少两层处理,并且在对每个块进行处理时,将当前层的输出结果存储到第一片上存储器中,用于下一层的处理,可以在片上存储资源或处理能力不足的情况下,实现对3D特征图的处理,以及可以避免反复从片外存储资源上读取数据,避免占用过多的***带宽。
并且进一步地,使用第一片上存储器存储输出结果,可以避免前级运算电路(例如,卷积层运算电路)需要等待后级运算电路(例如,池化层运算 电路)的空闲时,才能将前级运算电路输出结果输出给后级运算电路,避免电路的灵活性不足。
应理解,按块进行读取和进行卷积神经网络的处理并不意味着在读取数据时,需要把一个块的数据一次性读取完毕,然后进行处理,考虑到各层的运算电路的处理性能,针对单个块中的数据,在进行其中的一层处理时,可以分多次进行读取并进行处理,或者针对单个块中的数据,在进行其中的一层处理时,可以由多个运算电路并行处理。
还应理解,卷积神经网络的处理可以不是全部都按块处理,例如卷积神经网络中的一层是分块处理的,其他层的处理可以是非按块进行的处理(也即将3D特征图作为一个整体进行处理,不再进行块的分割)。该非按块进行的其他层的处理可以位于该按块进行的处理之前,也可以是位于该按块进行的处理之后。
例如,卷积层和池化层可以是按块进行处理的,而激活层和全连接层可以是非按块进行处理的。
再例如,卷积层、池化层和激活层是按块进行处理的,而全连接层可以是非按块进行处理的。
可选地,在本申请实施例中,可以按照第一片上存储器的可用存储容量和/或所述卷积神经网络各层处理所采用的参数,将所述3D特征图分割为多个块,使得对各个块进行处理得到的输出结果可以存储到第一片上存储器中。
其中,此处提到的卷积神经网络各层处理所采用的参数可以理解为在进行各层运算时,对输出结果的大小有影响的参数。
例如,对于卷积层而言,该参数可以为卷积核的大小和卷积核的滑动步长等;而对于池化层而言,该参数可以为池化方式,池化窗口大小以及池化窗口的滑动步长等。
应理解,在本申请实施例中,将3D特征图分割为多个块,在处理设备进行实现时,具体的实现操作可以为确定每个块的大小,按照确定的大小,从3D特征图中读取数据。
例如,可以由本申请实施例中的执行主体处理设备基于第一片上存储器的可用存储容量和/或卷积神经网络各层所采用的参数,确定多个块中每个块的大小,其中,在该处理设备包括如图4所示的处理器100时,该确定操作 可以由控制电路110实现。
本申请实施例的处理设备可以不具有实质的块的分割操作,仅是在读取和计算时,按块进行读取和计算。
可选地,在本申请实施例中,各个块的大小和读取顺序可以是预设在处理设备上的,处理设备可以直接基于该预设的大小和读取顺序,按块读取3D特征图。块的大小和读取顺序可以是执行预设操作的主体基于第一片上存储器的可用存储容量和/或卷积神经网络各层所采用的参数确定的。
可选地,如果第一片上存储器的可用存储资源足够存储3D特征图在各层运算的输出结果,则可以不对3D特征图进行块的分割。
例如,对于全局池化而言,相比于最大池化,一个特征通常只有一个数据输出,也就是说,全局池化的输出结果的存储量相比于最大池化的输出结果的存储量小很多,以及相应地,如果该卷积神经网络采用的卷积层的处理的结果输出的也很小,则第一片上存储器可以存储在3D特征图未进行分割的情况下所输出的结果,则可以不对3D特征图进行分割,直接将该3D特征图作为一个整体进行卷积神经网络的处理。
另外,如表1所示,由于卷积层的参数(例如,卷积核等)的数据量相对于特征输入的数据量较少,因此可以尽可能的重用特征的输入数据,也就是可以将特征的输入数据计算的结果存储到第一片上存储器中,无须将该中间结果反复存储到片外存储器中以及从片外存储器中读取,而可以将卷积层的参数存储到片外存储器中,并进行反复的读取。当然,如果第一片上存储期的存储空间足够,可以将卷积层的参数也存储到第一片上存储器中。
可选地,本申请实施例提到的片外存储器可以为双倍速率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory,DDR)等。
可选地,在本申请实施例中,3D特征图被分割成的多个块的大小可以是相同的,也可以是不完全相同的。
例如,可以基于第一片上存储器的可用存储容量,确定最大的块的大小,依次按照该最大的块进行读取和卷积神经网络的处理,直到读取和处理最后一块时,该最后一块的大小可以小于该最大的块的大小。
例如,可以基于第一片上存储器的可用存储容量,确定最大的块的大小, 然后基于该最大的块的大小,对3D特征图进行平均分割,分割后的每个块的大小可以小于该确定的最大的块的大小。
可选地,在本申请实施例中,可以在宽度方向、高度方向和通道方向中的至少一个方向上,将3D特征图分割为多个块。
例如,如图6所示,可以将大小为W×H×M的3D特征图在高度方向上进行分割,具体可以得到如(a)中的3个块;或者,可以将大小为W×H×M的3D特征图在通道方向M上进行分割,具体可以得到如(b)中的3个块;或者,可以将大小为W×H×M的3D特征图在宽度方向上进行分割,具体可以得到如(c)中的3个块。
以上图6示出的是在在一个方向上进行块的分割,也可以在至少两个方向上进行块的分割。
例如,如图7中的(a)所示,可以在宽度方向和通道方向进行分割,可以得到9个块;或者,如图7中的(b)所示,可以在高度方向和通道方向进行分割,可以得到9个块;或者,如图7中的(c)所示,可以在宽度方向和高度方向上进行分割,可以得到9个块。
可选地,在本申请实施例中,同一层的多个块的读取地址和写入地址可以是具有一定关系的,例如,可以在存储空间上连续的,或者可以是占用同一存储空间的。该种关系可以预设在处理设备上。此时,在读取一层的其中一个块的输入数据时,其读取地址可以通过同一层上一个块的读取地址得到,或者在写入一层的其中一个块的输出数据时,其写入地址可以通过同一层上一个块的写入地址得到。
例如,在写入一个块的卷积层处理的输出数据之后,可以根据该一个块的输出数据的写入地址,确定另一个块的卷积层的输出数据的写入地址。
再例如,在读取一个块的池化层的输入数据之后,可以根据该一个块的池化层的输入数据的读取地址,确定另一个块的池化层的输入数据的读取地址。
可选地,在本申请实施例中,可以采用覆盖所述卷积神经网络的处理过程中的已被读取的数据的方式,将所述当前层的输出结果存储到第一片上存储器中。
也就是在卷积神经网络的处理过程中,可以循环使用片上缓存,这样可 以提高片上缓存的利用率。
其中,处理设备可以确定已被读取的数据的存储地址,并在该存储地址中存储当前层的输出结果。该存储地址可以是物理地址,可以包括起始地址和结束地址。
作为示例性地,第一块的所述当前层的输出结果覆盖的可以为第一块的被所述当前层的读取的数据。应理解,此处提到的第一块中的第一不是为了限定块的处理顺序,仅用于块的区分。
例如,第一块的数据输入到第一片上存储器之后,用于卷积处理的运算电路可以读取该第一片上存储器中输入的数据,然后执行卷积处理,在执行完卷积处理之后,用于卷积处理的运算电路可以覆盖第一片上存储器中该第一块对应的已被读取的数据中的至少部分数据,以存储卷积处理的输出结果,用于池化处理的运算电路可以读取该卷积处理的输出结果,执行池化处理,并将池化处理的输出结果覆盖已被读取的卷积处理的输出结果,以此类推。
其中,随着卷积神经网络处理的进行,各个块对应的中间输出结果所需占用的片上存储空间可能越来越小,此时多余的存储空间可以用来存储其他的数据,例如,其他块的数据等。
为了提高卷积神经网络的效率,可以采用多个处理线(pipeline)并行处理的方式。
其中,可以将每个块的处理过程称为一个处理线,多个处理线并行处理意味着同一个时刻可以存在至少两个块在被处理。
但应理解,多个处理线并行处理并不意味着多个处理线的处理动作必须是一样的,并行处理的至少两个处理线的处理时间上可以是仅存在部分的重叠。
可选地,在本申请实施例中,第一块的所述当前层的输出结果覆盖的为第二块(非第一块的另一块)的已被读取的数据。
也就是说,在3D特征图的其中一个块进行处理时,其所输出的结果可以覆盖第一片上存储器中其他块的已被读取的数据。
在一种实现方式中,第一块的第i+1层的输出结果覆盖第一片上存储器中的第二块的第i层的输出结果,其中,所述第二块的第i层的输出结果为已被读取的数据,其中,所述卷积神经网络包括n层,且i取值从1到n。
其中,在所述卷积神经网络的处理中,所述第i+1层的输入数据从所述第一片上存储器读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器的时间≤所述第i层的输入数据从所述第一片上存储器读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器的时间。
例如,为了实现两个处理线并行进行,在第一片上存储器中存储2个块输出后的结果,第一个块的池化过程可以与第二块的卷积过程同步进行,在第一个块的池化过程完成之后,可以将池化处理的输出结果覆盖第一个块的卷积处理的输出结果,以存储到第一片上存储器中,随后可以从第一片上存储器中将该第一个块的池化结果输出到片外存储器中,并在用于存储第一个块的池化结果的存储位置上,将第二个块的卷积结果存储到第一片上存储器中。
其中,池化的计算能力可以匹配卷积的计算时间,也就是说,在***设计上,可以设置以下的条件:
池化层的输入数据从所述第一片上存储器读取的时间+池化计算的时间+池化层的输出数据写入所述第一片上存储器的时间≤卷积层的输入数据从所述第一片上存储器读取的时间+卷积计算的时间+卷积层的输出数据写入所述第一片上存储器的时间。
以下将以CNN的卷积层的输入数据为W=224,H=224和M=128的3D特征图为例进行说明。其中,以下提到的块是按照W×H×M的方式表征的该3D特征图的块的划分方式可以是在高度方向上进行划分,例如,类似于图6中(a)的划分方式。
首先卷积处理的输入的块为224×6×128,卷积核的数量为128个,卷积核的大小可以3×3×128,步长为1,经过卷积处理输出的第一个块大小为222×4×128,需要存储到第一片上存储器的大小为224×4×128=112KB,后续第二个块可以进一步输入4行数据,结合第一个块的后两行数据,得到第二个块的卷积的输出结果,即大小为224×4×128=112KB,那么第二个块的卷积的输出结果为112KB,则第一片上存储器存储了两个块的卷积处理的输出结果为224,池化层可以读取第一个块的卷积结果,池化层的滑窗大小为3×3,步长为1,则可以将第一个块的池化结果写入到第一个块的卷积结果的存储 空间中,也就是说将6行卷积处理的卷积结果的存储空间用来存储4行池化处理的处理结果,再将第一个块的池化结果从第一片上存储器写入到片外存储器。
在另一种实现方式中,针对输出而言,第一块的第i层的输出结果覆盖第一片上存储器中的另一块的第i层的输出结果,其中,所述另一块的第i层的输出结果为已被第i+1层读取的数据或为已经输出到片外存储器的数据,其中,所述卷积神经网络包括n层,且i取值从1到n。
针对输入而言,第一块的第i层的输入数据覆盖第一片上存储器中的另一块的第i层的输入数据,其中,所述另一块的第i层的输入数据为已被第i层读取的数据,其中,所述卷积神经网络包括n层,且i取值从1到n。
可选地,所述第一片上存储器同时存储至少两个块的同一层的输入数据和/或输出数据。在该种情况下,卷积神经网络的具体实现方式可以如图8所示的方法400。该方法400可以由处理设备实现。
在410中,按块从第一片上存储器读取3D特征图,所述3D特征图包括L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块的当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输入数据。
其中,该S个第一存储空间存储的当前层的输入数据可以是从片上存储器读取的,此时该当前层可选地可以是卷积神经网络处理的第一层。
或者,该S个第一存储空间存储的当前层的输入数据可以是前一层处理的输出数据。
在420中,按块对所述3D特征图进行卷积神经网络的所述当前层的处理。
在430中,将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输出数据;
其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
可选地,在本申请实施例中,在该种实现方式中,当前层的运算电路的数量可以小于S,以及进一步地可以小于R,例如,运算电路的数量为1。
可选地,在本申请实施例中,S可以等于R。
或者,S不等于R。
例如,S个第一存储空间存储的数据作为卷积层的输入数据,R个第二存储空间存储的数据是卷积层的输出数据,以及作为池化层的输出数据,如果池化层的运算电路的数量和/或运算电路的运算能力较强,R个第二存储空间中的数据可以快速被池化层的运算电路读取,则R可以小于S。
可选地，在图8所示的实现方式中，块的分割方向可以为宽度方向和/或高度方向，而不包括通道方向。应理解，此时也可以对一个块在通道方向进行划分，划分为多个子块。
可选地,在本申请实施例中,在卷积神经网络包括至少两层的处理时,每一层的处理均可以对应有第一存储空间和第二存储空间,也就是不同层对应的用于存储输入数据的存储空间是不复用的,以及不同层对应的用于存储输出数据的存储空间完全是不复用的。但是,当前层的第一存储空间是作为前一层的第二存储空间的,以及当前层的第二存储空间是作为下一层的第一存储空间的。
例如,如图9所示,第一片上存储器包括存储空间a1、a2、b1和b2,在存储空间a1和a2中,存储块1和块2的用于卷积处理(池化处理等其他处理同样适用,其中,卷积处理的输入数据可以是从片外存储器读取的,池化处理的输入数据可以是卷积处理的输出数据)的输入数据,用于卷积处理的运算电路分别对块1和块2进行卷积运算,分别将块1和块2卷积处理的输出结果分别存储到存储空间b1和b2,以用于池化层的处理,在进行卷积处理时,可以先读取块1的输入数据进行卷积处理,在块1的卷积处理完毕之后,运算电路可以不用进行等待,而是直接从存储空间a2中读取块2的数据,进行卷积处理,以及在针对块1的输入数据读取完毕之后,可以在存储空间a1中存储块3的用于卷积处理的输入数据,以用于运算电路对块2进行卷积处理完毕之后,读取块3的数据进行卷积处理,同样,在对块2的输入数据读取完毕之后,可以在a2中存储块4的输入数据,并以此类推。
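以下给出一个按块循环复用两个输入存储空间（a1、a2）和两个输出存储空间（b1、b2）的示意性Python草图，用于说明L个块如何复用S=R=2个存储空间；调度方式与命名均为示例性假设。

def assign_spaces(num_blocks, S=2, R=2):
    # 为每个块分配输入存储空间(a*)与输出存储空间(b*), S、R个空间循环复用
    plan = []
    for i in range(num_blocks):
        plan.append((f"块{i + 1}", f"a{i % S + 1}", f"b{i % R + 1}"))
    return plan

for name, in_space, out_space in assign_spaces(4):
    print(name, "输入存于", in_space, "，输出存于", out_space)
# 块3复用a1/b1, 块4复用a2/b2: 需等块1、块2对应空间中的数据被读取完毕之后再写入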
以上举例是假设其中一个块的卷积处理无需采用其他块中的输入数据为例进行说明,在本申请实施例中,一个块的当前层的处理也可以采用其他块的输入数据,此时,所述第一片上存储器同时存储至少三个块的同一层的输入数据和/或该至少三个块同一层的输出数据。
具体地,上述提到的S和/或R可以大于等于3。例如,S个第一存储空间用于存储卷积层的输入数据,以及卷积层对其中一个块进行处理时,需要用到上一个块的数据,则S可以大于或等于3。例如,R个第二存储空间用于存储卷积层的输出数据,该输出数据可以用于池化层的处理,以及池化层对其中一个块进行处理时,需要用到上一个块的数据,则R可以大于或等于3。
例如，如图10所示，第一片上存储器包括存储空间a1、a2、a3、b1、b2和b3，在存储空间a1、a2和a3中，存储块1、块2和块3的用于卷积处理（池化处理等其他处理同样适用，其中，卷积处理的输入数据可以是从片外存储器读取的，池化处理的输入数据可以是卷积处理的输出数据）的输入数据，用于卷积处理的运算电路分别对块1、块2和块3进行卷积运算，分别将块1、块2和块3的卷积处理的输出结果分别存储到存储空间b1、b2和b3中，以用于池化层的处理，在进行卷积处理时，可以先读取块1的输入数据进行卷积处理，在块1的卷积处理完毕之后，运算电路可以不用进行等待，而是直接从存储空间a2中读取块2的数据，进行卷积处理，在块2的卷积处理完毕之后，运算电路可以不用进行等待，而是直接从存储空间a3中读取块3的数据，进行卷积处理；由于块2的处理需要用到块1的数据，所以即使块1被卷积处理完毕，该块1中的数据仍然需要存储到存储空间a1中；当块2进行卷积处理的数据读取完毕之后，可以在存储空间a1中存储块4的数据，类似地，在块3进行卷积处理的数据读取完毕之后，可以在存储空间a2中存储块5的数据，以及在块4进行卷积处理的数据读取完毕之后，可以在存储空间a3中存储块6的数据。其中，在块2的卷积处理完毕之后，如果没有存储空间a3，此时，由于存储空间a1中的数据释放得较晚，运算电路需要等待块1的数据被释放并在存储空间a1中存储了另一块的数据之后，才可以继续进行运算，因此，在该种情况下，需要至少存在3个存储空间用于存储输入数据，以及至少存在3个存储空间用于存储输出数据。
正如上文举例所述，本申请实施例中，一个块的数据被读取完毕可以是指该一个块的数据在当前层针对任一个块的处理中均无需再被读取。
例如，如果该一个块的数据无需用于针对另一块进行的当前层的处理，则在当前层针对该一个块的处理读取该块的全部数据之后，该一个块的数据即可认为被读取完毕。
例如，如果该一个块的数据需要用于针对另一块进行的当前层的处理，则在当前层针对该一个块的处理读取该块的全部数据、且针对另一块的处理读取该块的至少部分数据之后，该一个块的数据即可认为被读取完毕。
因此,在本申请实施例中,第一片上存储器同时存储至少两个块的同一层的输入数据,可以实现流水式的工作,也就是说***中的运算电路和存储空间可以高效工作,不用进行等待。
可选地,在本申请实施例中,所述第i+1层的输入数据从所述第一片上存储器读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器的时间≤所述第i层的输入数据从所述第一片上存储器读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器的时间。此时各个块的大小可选地是相同的,但应理解,本申请实施例并不限于此,各个块的大小也可以不相同,此时,可以增加较大块的计算速度。
例如，为了保证在卷积处理输出的数据覆盖其他块的数据时，该其他块的数据已经完成了池化操作，可以设置以下的条件：
池化层的输入数据从所述第一片上存储器读取的时间+池化计算的时间+池化层的输出数据写入所述第一片上存储器的时间≤卷积层的输入数据从所述第一片上存储器读取的时间+卷积计算的时间+卷积层的输出数据写入所述第一片上存储器的时间。
应理解,针对如何进行各层的输出结果的存储,除了以上的实现方式,本申请实施例还可以具有其他的实现方式,
例如，多个块的处理时间完全是同步的，此时可以存在多个存储空间，分别用于存储各块的数据，其中一个块的当前层的输出结果覆盖的为该块的已被所述当前层读取的数据。
可选地,在本申请实施例中,处理设备可以包括多个运算电路,可以在处理设备上预设每个运算电路需要处理的块以及处理顺序,以及各个运算电路的输出结果的存储方式等。
可选地,可以在处理设备上预先设置一定的规则,按照特定的规则进行数据的存储,或者处理设备可以实时对第一片上存储器的存储空间进行检测,按照检测结果进行数据的存储。
可选地,在本申请实施例中,各层处理的指令之间可以具有依赖关系,该依赖关系可以是处理顺序的依赖关系。
例如,神经网络需要执行C1、C2、C3、P1和C4处理(C为卷积处理,P为池化处理),P1处理需要等C1处理和读取执行完毕,由此P1处理的输出结果可以存储到C1处理的存储空间中,而C4处理需要等P1处理和读取执行完毕,由此C4的处理结果可以存储到P1处理的存储空间中。
因此，在本申请实施例中，编译器（例如，可以由如图4所示的控制电路110实现）可以记录指令之间的依赖关系，以此来防止存储时发生错误，也即避免没有读取完的数据被新的数据覆盖。
可选地,在本申请实施例中,在对3D特征图的一个块进行卷积神经网络的一层处理时,可以将该层处理的输出结果存储到第一片上存储器中,用于下一层的处理。如果除了下一层的处理需要用到该输出结果,还有其他的操作(例如,当前卷积神经网络的下一层之后的层的处理或者其他卷积神经网络)需要用到该输出结果,则可以将该输出结果存储到片外存储器中。在执行到该其他的操作时,可以从片外存储器中,再次将该输出结果读取到第一片上存储器中,用于该其他的操作。
其中，可以在下一层读取第一片上存储器中的当前层的输出结果之后，将该输出结果存储到片外存储器中，并将该输出结果从第一片上存储器中删除（具体可以被其他数据所覆盖，例如，可以被下一层的输出结果所覆盖），也可以是在下一层还未从第一片上存储器中读取当前层的输出结果时，即将当前层的输出结果存储到片外存储器，在下一层读取第一片上存储器中的当前层的输出结果之后，可以将该输出结果从第一片上存储器中删除（具体可以被其他数据所覆盖，例如，可以被下一层的输出结果所覆盖）。
如果除了下一层的处理之外没有其它操作需要用到当前层的输出结果,则可以只需将该当前层的输出结果存储到第一片上存储器中,无需再存储到片外存储器中。
可选地，在本申请实施例中，在针对第一块进行的处理所采用的数据也需要用于针对第二块（非第一块的另一块）进行的处理时，可以将该数据存储到第一片上存储器中，一直到该数据被用于针对第二块进行的处理。
此时,该数据可以包括整数个行的数据。该种方式可以用于所述3D特征图在行的方向上(也即宽度方向上)未被分割成两个或两个以上的块,例如,块的分割方式可以如图6中的(a)和(b)所示。
应理解，在本申请实施例中，被两个块共同使用的数据可以理解为是属于前一个块的数据，而不属于下一个块，或者，也可以将该行缓存的数据理解为既属于前一个块，又属于另一个块。
通常,3D特征图的单个特征的数据在存储时,同一个存储地址中的数据为一行内的全部或部分数据,不包括两行及两行以上的数据,此时,本申请实施例中,可以将该种数据的存储方式称为按行存储。
例如,在进行存储时,16个数据可以打包存储到同一个存储地址,读取一个存储地址,可以得到16个数据,一个存储地址的数据不跨越两行,也就是一个存储地址的数据不超出一行的数据。
假设3D特征图的每行数据有128个，如果每个存储地址可以存储16个数据，则可以对应8个存储地址。在3D特征图经过卷积处理之后，每行的数据为127个，则仍然可以对应8个存储地址，只是在其中一个存储地址可以存储15个有效数据以及1个无效数据。
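以下为按行存储时计算每行所占存储地址数以及最后一个存储地址中有效/无效数据个数的示意性Python草图（假设每个存储地址打包16个数据）。

def row_addresses(row_len, per_addr=16):
    # 返回(地址数, 最后一个地址中的有效数据个数, 无效数据个数)
    num_addr = (row_len + per_addr - 1) // per_addr
    valid_in_last = row_len - (num_addr - 1) * per_addr
    return num_addr, valid_in_last, per_addr - valid_in_last

print(row_addresses(128))   # (8, 16, 0): 8个地址, 最后一个地址全部有效
print(row_addresses(127))   # (8, 15, 1): 8个地址, 最后一个地址有15个有效数据、1个无效数据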
应理解，单个特征的数据在存储时，除了可以按行进行存储之外，也可以按列进行存储，也即，同一个存储地址中的数据为一列内的全部或部分数据，不包括两列及两列以上的数据。
其中,在将第一片上存储器的数据进行释放(也可以称为删除)时,可以按照存储地址进行释放,例如,在一个存储地址中的16个数据被全部读取完毕之后,可以将该16个数据进行释放。
可选地,此处提到的数据可以是输入层输入的数据,也可以是经过卷积神经网络的其中一层处理的输出结果。
作为示例性地,假设卷积处理是卷积神经网络的首次处理,则从片外读取其中一块的数据时,可以将该块中的需要用到另一块的卷积处理的数据缓存在第一片上存储器中,直到该另一块被卷积处理,在这之前不会被其他的数据(例如,第一块的卷积处理的输出结果)所覆盖。
例如，进行卷积处理的窗口为2×2，窗口的滑动步长为1，3D特征图是按照图6中的(a)的方式进行块的分割的，则针对每个特征，前一个块的用于卷积处理的最后一行的数据还要用于下一个块，与下一块的第一行的数据共同结合用于进行卷积处理，则此时可以将该前一块的最后一行的数据一直存储到被用于第二块的卷积处理。
再例如,进行卷积层的窗口为3×3,窗口的滑动步长为2,3D特征图是按照图6中的(a)的方式进行块的分割的,则针对每个特征,前一个块的用于卷积处理的最后两行的数据要用到下一个块,与下一块的第一行的数据共同结合用于进行卷积处理,则此时可以将该前一块的最后两行的数据一直存储到被用于第二块的卷积处理。
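以下草图示意在高度方向分块时，根据卷积窗口大小与滑动步长估计前一块需要为下一块缓存的行数；其中"块边界与滑窗网格对齐"为示例性假设，若边界落在窗口中部，所需缓存的行数最多可达窗口高度减1行。

def rows_to_cache(kernel, stride, aligned=True):
    # aligned=True: 块边界与滑窗网格对齐, 需缓存 kernel - stride 行
    # aligned=False: 最坏情况(边界落在窗口中部), 需缓存 kernel - 1 行
    return max(kernel - stride, 0) if aligned else kernel - 1

print(rows_to_cache(2, 1))          # 1: 对应上文2×2窗口、步长1缓存最后一行的例子
print(rows_to_cache(3, 2, False))   # 2: 对应上文3×3窗口、步长2最多缓存最后两行的例子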
在对3D特征图进行块的分割的方向包括至少两个方向,且该至少两个方向包括高度方向时,针对同一层的处理,则可以先处理完具有相同宽度位置(也可以称为坐标)和/或通道位置(也可以称为坐标)且在不同高度位置(也可以称为坐标)上的所有块,然后处理另外的具有相同宽度位置和/或通道位置且在不同高度位置上的所有块(以下可以称为优先遍历高度方向上的块),由此可以实现缓存较少的行数据。
以下将结合图7中的(b)所示的块的分割方式，以卷积层的处理为例进行说明。
例如,如图7中的(b)所示的块的分割方式下,可以按照块1b、块4b、块7b、块2b、块5b、块8b、块3b、块6b和块9b的顺序依次进行卷积层的处理。在针对块1b进行卷积层的处理时,需要将块1b的输入数据的最后至少一行数据存储到第一片上存储器中,用于块2b的卷积层的处理,在针对块4b进行卷积层的处理时,需要将块4b的输入数据的最后至少一行数据存储到第一片上存储器中,用于块5b的卷积层的处理,在针对块7b进行卷积层的处理时,需要将块7b的输入数据的最后至少一行数据存储到第一片上存储器中,用于块8b的卷积层的处理,也就是说,在针对块1b、4b和7b的卷积层的处理结束之后,需要第一片上存储器中存储块1b的输入数据的最后至少一行数据、块4b的输入数据的最后至少一行数据,块7b的输入数据的最后至少一行数据。然后进行块2b的卷积层的处理,此时,可以读取并删除块1b的输入数据的最后至少一行数据,但是需要将块2b的输入数据的最后至少一行数据存储到第一片上存储器中,并以此类推。
因此,在该种实现方式中,需要同时存储3个块的输入数据的最后至少一行的数据。
或者，如图7中的(b)所示的块的分割方式下，可以按照块1b、块2b、块3b、块4b、块5b、块6b、块7b、块8b和块9b的顺序依次进行卷积层的处理。在针对块1b进行卷积层的处理时，需要将块1b的输入数据的最后至少一行数据存储到第一片上存储器中，用于块2b的卷积层的处理，然后针对块2b进行卷积层的处理，此时可以读取并删除块1b的输入数据的最后至少一行数据，以及可以将块2b的输入数据的最后至少一行数据存储到第一片上存储器中，依次类推。
因此,在该种实现方式(也即按照优先遍历高度方向的方式进行块的运算)下,每次只需要存储一个块的最后至少一行数据。
因此,在对3D特征图进行块的分割的方向包括至少两个方向,且该至少两个方向包括高度方向时,可以优先遍历高度方向的块,从而可以实现缓存较少的行数据,减轻片上存储压力。
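以下以图7中(b)的分块方式（高度方向3块、通道方向3块）为例，给出一个统计不同遍历顺序下需要同时缓存"最后至少一行数据"的块数的示意性Python草图；释放时机等细节为示例性假设。

def max_cached_rows(order, num_h=3):
    # order: 以(高度位置, 通道位置)表示的块处理顺序;
    # 块(h, c)处理后需缓存其末行, 直到高度方向的下一块(h+1, c)被处理后才释放
    cached, peak = set(), 0
    for h, c in order:
        cached.discard((h - 1, c))      # 下方相邻块已处理, 可释放其缓存的末行
        if h < num_h - 1:
            cached.add((h, c))          # 为下一个高度位置的块缓存末行
        peak = max(peak, len(cached))
    return peak

height_first = [(h, c) for c in range(3) for h in range(3)]    # 优先遍历高度方向
channel_first = [(h, c) for h in range(3) for c in range(3)]   # 优先遍历通道方向
print(max_cached_rows(height_first))    # 1: 每次只需缓存一个块的末行
print(max_cached_rows(channel_first))   # 3: 需同时缓存3个块的末行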
可选地,假设卷积处理是卷积神经网络的首次处理,后续需要进行池化处理,在其中一个块的数据经过卷积处理之后,可以将输出结果存储到第一片上存储器上,则该块的卷积处理的输出结果可以被全部读取用于第一块的池化处理,但是该块的卷积处理的输出结果的部分数据仍然需要用于另一个块的池化处理,则此时可以保留该部分数据(其他部分数据可以被删除),直到该部分数据被用于该另一块的池化处理。
当然,在本申请实施例中,各个块之间的数据也可以是独立的,没有重叠的,具体可以为其中一个块所采用的数据不再被另一个块所利用。
作为示例性地,在对3D特征图的宽度进行分割时(例如,如图6中的c的划分方式),由于数据是按行存储的(也即单行中的多个数据进行打包存储到一个存储地址),如果其中一个块的数据包括最后一个存储地址的部分数据,则可以不再对该最后一个存储地址进行当前层(例如卷积层或池化层)的处理,则另一个块可以对最后一个存储地址的其他部分数据或全部数据进行当前层的处理;或者,第一块可以对最后一个存储地址的部分数据或全部数据进行当前层的处理,而第二块不再对最后一个存储地址的数据进行当前层的处理。
也就是说,所述3D特征图的单个特征的单行数据对应于多个存储地址,单行数据属于至少两个块,则所述至少两个块中每个块的当前层的被处理数据包括整数个存储地址的数据,且所述至少两个块包括的当前层的被处理数据完全不重合。该种实现可以简化边界处理,从而简化实现的复杂度。
同样地,此处提到的数据可以是卷积神经网络的初始输入的未经任何层的处理的数据,也可以是其中一层的输出结果。
例如,在进行存储时,16个数据可以打包存储到同一个存储地址,读取一个存储地址,则可以得到16个数据,一个存储地址的数据不跨越两行,也就是一个存储地址的数据不超出一行的数据。假设3D特征图的每行数据有128个,则可以对应8个存储地址。则此时,其中一个块的当前层的待处理的数据可以为4个存储地址的数据,另一个块的当前层的待处理的数据可以为另4个存储地址的数据。
应理解，此处提到的块的当前层的被处理的数据可以与预先划分的块所包括的数据不同。例如，假设每行数据包括128个数据，进行块的划分时采用的是不均匀的划分方式，例如，预先划分的第一个块每行包括68个数据，第二块每行包括60个数据，则在实际进行当前层的处理时，针对第一块，每行可以处理64个数据，针对第二块，每行可以处理64个数据。
在本申请实施例中，在进行块的初始分割时，即可以实现每个块的数据仅包括整数个存储地址的数据，且所述至少两个块包括的数据完全不重合。
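以下为按行存储（每16个数据打包为一个存储地址）且在宽度方向分块时，将分割位置对齐到存储地址边界的一种示例性Python做法，对应上文68/60对齐为64/64的例子。

def align_width_split(row_len, approx_split, per_addr=16):
    # 将期望的分割位置向下取整到存储地址边界, 使每块仅包含整数个存储地址的数据
    boundary = (approx_split // per_addr) * per_addr
    return boundary, row_len - boundary

print(align_width_split(128, 68))   # (64, 64): 两块各处理4个存储地址的数据, 互不重合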
应理解,本申请实施例并不限于以上的描述,在数据是按行存储(也即单行中的多个数据进行打包存储到一个存储地址)的情况下,如果在宽度方向进行了块的分割,此时,也可以缓存列数据,例如,在图6中的(c)的划分方式下,块1c的最后至少一列的数据可以进行缓存,用于块2c的处理。由于数据是按行存储(也即单行中的多个数据进行打包存储到一个存储地址)的,则针对各行而言,缓存的数据是至少一个存储地址的数据,例如,对于特定行,块1c的数据中,如果用于块2c的处理的数据属于一个存储地址,则此时可以缓存16列的共16个数据用于块2c的处理,如果块1c的数据中,用于块2c的处理的数据属于两个存储地址,则此时可以缓存32列的共32个数据用于块2c的处理。
数据除了按行进行存储外,也可以按列进行存储,也即同一个存储地址 中的数据为一列内的全部或部分数据,不包括两列及两列以上的数据。
在数据是按列存储（也即单列中的多个数据进行打包存储到一个存储地址）的情况下，如果在高度方向进行了块的分割，此时，可以缓存行数据，例如，在图6中的(a)的划分方式下，块1a的最后至少一行的数据可以进行缓存，用于块2a的处理。由于数据是按列存储的，则针对各列而言，缓存的数据是至少一个存储地址的数据，例如，对于特定列，块1a的数据中，如果用于块2a的处理的数据属于一个存储地址，则此时可以缓存16行的共16个数据用于块2a的处理，如果用于块2a的处理的数据属于两个存储地址，则此时可以缓存32行的共32个数据用于块2a的处理。
在数据是按列存储（也即单列中的多个数据进行打包存储到一个存储地址）的情况下，如果在宽度方向进行了块的分割，此时，可以缓存列数据，例如，在图6中的(c)的划分方式下，块1c的最后至少一列（可以是一列，也可以是多列，列的数量与一个存储地址的数据量无关）的数据可以进行缓存，用于块2c的处理。
基于以上描述,在数据按行存储(也即单行中的多个数据进行打包存储到一个存储地址)时,可以在高度方向进行分割,在数据按列(也即单列中的多个数据进行打包存储到一个存储地址)进行存储时,可以在宽度方向进行分割,以此减少缓存数据。
可选地,对3D特征图的分割方法可以影响卷积神经网络的各数据的处理顺序。
作为示例性地，假设存在一组运算电路，该一组运算电路包括一个卷积电路和一个池化电路，以及一个卷积电路和一个池化电路一次只能处理一个块，则此时块的划分方式影响了数据的处理顺序。
在按照图6中的(a)所示的块的分割方式的情况下,在进行卷积神经网络的计算时,可以依次按照块1a、2a和3a的数据处理顺序进行。
在按照图6中的(b)所示的块的分割方式的情况下,则在进行卷积神经网络的计算时,则可以依次按照块1b、2b和3b的数据处理顺序进行。
在按照图6中的(c)所示的块的分割方式的情况下,则在进行卷积神经网络的计算时,则可以依次按照块1c、2c和3c的数据处理顺序进行。
由此可见,针对不同的块的分割方式,数据的处理顺序可以是不同的。
可选地,在本申请实施例中,在3D特征图的通道方向上分割了多个块的情况下(例如,如图6中的(b)所示的块的分割方式,以及图7中的(a)和(b)所示的块的分割方式),在进行卷积运算时,由于卷积运算对于多个特征上的同一高度位置和宽度位置上的数据需要进行累加计算,则在对部分特征进行了卷积运算之后,可以在运算电路包括的片上存储器(下文称为第二片上存储器)中存储该部分特征的卷积运算结果,在等到所有特征的卷积运算计算完毕之后,结合所有特征的卷积运算结果进行处理,例如累加处理,以得到一个卷积核对应的卷积层的输出结果或一个2D特征图,并将其输出到第一片上存储器中。
可选地，在本申请实施例中，在3D特征图的通道方向上进行分割的情况下，对于具有相同宽度位置和高度位置的至少两个块而言，如果先对该至少两个块的部分块进行了卷积层的处理，则可以先将该部分块的卷积层的输出结果存储到运算电路包括的片上存储器（下文称为第二片上存储器）中，在所述至少两个块的卷积层的处理进行完毕之后，可以将该至少两个块的卷积结果进行累加处理，以得到一个卷积核对应的卷积层的输出结果或一个2D特征图，并将其输出到第一片上存储器中。
具体地,可以先将先完成的块的卷积层的输出结果分别存储到第二片上存储器中,在完成了所有的块的卷积层的处理之后,对所有的块的卷积层的处理结果进行累加处理,并将处理结果输出到第一片上存储器中。
或者，可以先将先完成的两个块的卷积层的输出结果进行累加处理并存储到第二片上存储器中，在又完成了一个块的卷积层的处理之后，可以将上次得到的累加结果与该又一个块的卷积层的输出结果进行累加存储到第二片上存储器中，并删除第二片上存储器之前存储的累加结果，直到累加结果累加了所有的块的卷积层的输出结果，并输出到第一片上存储器。
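以下为在通道方向分块时，用第二片上存储器保存部分和并逐块累加的示意性Python（NumPy）草图；数组形状与块的数量仅为示例性假设。

import numpy as np

def accumulate_channel_blocks(partial_results):
    # partial_results: 依次得到的各通道块在同一输出位置上的卷积部分和
    acc = None                       # 模拟第二片上存储器中保存的累加结果
    for part in partial_results:
        if acc is None:
            acc = part.copy()        # 第一个块的结果先存入第二片上存储器
        else:
            acc = acc + part         # 与新完成块的结果累加, 并覆盖旧的累加结果
    return acc                       # 累加完所有通道块后输出到第一片上存储器

parts = [np.ones((4, 4)) * k for k in (1, 2, 3)]   # 模拟块1b、2b、3b的卷积部分和
print(accumulate_channel_blocks(parts))            # 每个元素为6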
例如，如图6中的(b)所示的块的分割方式下，在得到了块1b和块2b的卷积处理的结果之后，可以在第二片上存储器中存储该块1b和块2b的卷积处理的结果，在得到块3b的卷积处理的结果之后，可以从第二片上存储器中读取块1b和块2b的卷积处理的结果，以及从第二片上存储器中删除块1b和块2b的卷积处理的结果，结合块1b、2b和3b的卷积处理的结果，向第一片上存储器输出最终的卷积处理的结果。
或者，如图6中的(b)所示的块的分割方式下，在得到了块1b的卷积处理的结果之后，可以在第二片上存储器中存储该块1b的卷积处理的结果，以及在得到了块2b的卷积处理的结果之后，可以在第二片上存储器中存储该块1b和块2b的卷积处理的结果的累加结果，并删除在第二片上存储器中存储的块1b的卷积处理的结果，在得到块3b的卷积处理的结果之后，可以从第二片上存储器中读取块1b和块2b的卷积处理的累加结果，以及从第二片上存储器中删除块1b和块2b的卷积处理的累加结果，结合块1b和块2b的卷积处理的累加结果和块3b的卷积处理的结果，向第一片上存储器输出最终的卷积处理的结果。
又例如，如图7中的(a)所示的块的分割方式下，可以按照块1a、块4a、块7a、块2a、块5a、块8a、块3a、块6a和块9a的顺序依次进行卷积层的处理，则在得到了块1a的卷积处理的结果之后，可以在第二片上存储器上存储该块1a的卷积处理的结果，在得到了块4a的卷积处理的结果之后，可以在第二片上存储器上存储该块4a的卷积处理的结果，在得到了块7a的卷积处理的结果之后，可以从第二片上存储器中读取块1a和4a的卷积处理的结果，以及在读取之后从第二片上存储器中删除块1a和4a的卷积处理的结果，并对块1a、4a和7a的卷积处理的结果进行累加运算，并输出到第一片上存储器中；类似地，在得到了块2a的卷积处理的结果之后，可以在第二片上存储器上存储该块2a的卷积处理的结果，在得到了块5a的卷积处理的结果之后，可以在第二片上存储器上存储该块5a的卷积处理的结果，在得到了块8a的卷积处理的结果之后，可以从第二片上存储器中读取块2a和5a的卷积处理的结果，以及在读取之后从第二片上存储器中删除块2a和5a的卷积处理的结果，并可以对块2a、5a和8a的卷积处理的结果进行累加运算，并输出到第一片上存储器中；以及，在得到了块3a的卷积处理的结果之后，可以在第二片上存储器上存储该块3a的卷积处理的结果，在得到了块6a的卷积处理的结果之后，可以在第二片上存储器上存储该块6a的卷积处理的结果，在得到了块9a的卷积处理的结果之后，可以从第二片上存储器中读取块3a和6a的卷积处理的结果，以及在读取之后从第二片上存储器中删除块3a和6a的卷积处理的结果，并可以对块3a、6a和9a的卷积处理的结果进行累加运算，并输出到第一片上存储器中。
或者,如图7中的(a)所示的块的分割方式下,可以按照块1a、块4a、块7a、块2a、块5a、块8a、块3a、块6a和块9a的顺序依进行卷积层的处理,则在得到了块1a的卷积处理的结果之后,可以在第二片上存储器上存储该块1a的卷积处理的结果,在得到了块4a的卷积处理的结果之后,可以在第二片上存储器上存储该块1a和块4a的卷积处理的累加结果,并删除块1a的卷积处理的结果,在得到了块7a的卷积处理的结果之后,可以从第二片上存储器读取块1a和块4a的卷积处理的累加结果,以及在读取之后从第二片上存储器删除块1a和块4a的卷积处理的累加结果,并对块1a和块4a的累加结果和块7a的卷积处理的结果进行累加运算,并输出到第一片上存储器中;类似地,在得到了块2a的卷积处理的结果之后,可以在第二片上存储器上存储该块2a的卷积处理的结果,在得到了块5a的卷积处理的结果之后,可以在第二片上存储器上存储该块2a和块5a的卷积处理的累加结果,并删除块2a的卷积处理的结果,在得到了块8a的卷积处理的结果之后,可以从第二片上存储器读取块2a和块5a的卷积处理的累加结果,以及在读取之后从第二片上存储器删除块2a和块5a的卷积处理的累加结果,并对块2a和块5a的累加结果和块8a的卷积处理的结果进行累加运算,并输出到第一片上存储器中;以及,在得到了块3a的卷积处理的结果之后,可以在第二片上存储器上存储该块3a的卷积处理的结果,在得到了块6a的卷积处理的结果之后,可以在第二片上存储器上存储该块3a和块6a的卷积处理的累加结果,并删除块3a的卷积处理的结果,在得到了块9a的卷积处理的结果之后,可以从第二片上存储器读取块3a和块6a的卷积处理的累加结果,以及在读取之后从第二片上存储器删除块3a和块6a的卷积处理的累加结果,并对块3a和块6a的累加结果和块9a的卷积处理的结果进行累加运算,并输出到第一片上存储器中。
又例如,如图7中的(a)所示的块的分割方式下,可以按照块1a、块2a、块3a、块4a、块5a、块6a、块7a、块8a和块9a的顺序依次进行块的卷积层的处理,则在依次获取了块1a、2a、3a、4a、5a、6a的卷积层的处理结果之后,可以分别在第二片上存储器上存储该块1a、2a、3a、4a、5a、6a的卷积层的处理结果;在获取了块7a的卷积层的处理结果之后,可以从第二片上存储器读取块1a和块4a的卷积处理的结果,以及在读取之后删除块1a和块4a的卷积处理的结果,并对块1a、4a和7a的卷积层的处理结果进 行累加运算,并向第一片上存储器输出运算结果;在获取了块8a的卷积层的处理结果之后,可以从第二片上存储器读取块2a和块5a的卷积处理的结果,以及在读取之后删除块2a和块5a的卷积处理的结果,并可以对块2a、5a和8a的卷积层的处理结果进行累加运算,并向第一片上存储器输出运算结果;以及,在获取了块9a的卷积层的处理结果之后,可以从第二片上存储器读取块3a和块6a的卷积处理的结果,以及在读取之后删除块3a和块6a的卷积处理的结果,并可以对块3a、6a和9a的卷积层的处理结果进行累加运算,并向第一片上存储器输出运算结果。
或者，如图7中的(a)所示的块的分割方式下，可以按照块1a、块2a、块3a、块4a、块5a、块6a、块7a、块8a和块9a的顺序依次进行块的卷积层的处理，则在依次获取了块1a、2a、3a的卷积层的处理结果之后，可以分别在第二片上存储器上存储该块1a、2a、3a的卷积层的处理结果；在获取了块4a的卷积层的处理结果之后，可以对块1a和4a卷积层的处理结果进行累加运算，并存储到第二片上存储器中，以及删除块1a的卷积层的处理结果；在获取了块5a的卷积层的处理结果之后，可以对块2a和5a卷积层的处理结果进行累加运算，并存储到第二片上存储器中，以及删除块2a的卷积层的处理结果；在获取了块6a的卷积层的处理结果之后，可以对块3a和6a卷积层的处理结果进行累加运算，并存储到第二片上存储器中，以及删除块3a的卷积层的处理结果；在获取了块7a的卷积层的处理结果之后，可以对块1a和4a卷积层的处理的累加结果与块7a的卷积层的处理结果进行累加运算，并存储到第一片上存储器中，以及删除第二片上存储器中的块1a和4a的卷积层的处理的累加结果；在获取了块8a的卷积层的处理结果之后，可以对块2a和5a卷积层的处理的累加结果与块8a的卷积层的处理结果进行累加运算，并存储到第一片上存储器中，以及删除第二片上存储器中的块2a和5a的卷积层的处理的累加结果；在获取了块9a的卷积层的处理结果之后，可以对块3a和6a卷积层的处理的累加结果与块9a的卷积层的处理结果进行累加运算，并存储到第一片上存储器中，以及删除第二片上存储器中的块3a和6a的卷积层的处理的累加结果。
从以上的举例可以看出,在按照图7中的(a)所示的块的分割方式下(也即通道方向和宽度方向均进行块的分割),在进行卷积层的处理时,如果优先遍历宽度方向(具体地,可以先处理完具有相同高度位置和/或通道位 置且在不同宽度位置上的所有块,然后处理另外的具有相同高度位置和/或通道位置且在不同宽度位置上的所有块),则需要在第二片上存储器中缓存较多的块的卷积处理的结果,如果优先遍历通道方向(具体地,可以先处理完具有相同高度位置和/或宽度位置且在不同通道位置上的所有块,然后处理另外的具有相同高度位置和/或宽度位置且在不同通道位置上的所有块),则可以在第二片上存储器中缓存较少的块的卷积处理的结果。
类似地,在图7中的(b)所示的块的分割方式下(也即通道方向和高度方向均进行块的分割),在进行卷积层的处理时,如果优先遍历高度方向,则需要在第二片上存储器中缓存较多的块的卷积处理的结果,如果优先遍历通道方向,则可以在第二片上存储器中缓存较少的块的卷积处理的结果。
然而,正如上文所示,在优先遍历高度方向时,可以在第一片上存储器中缓存较少的行数据。
因此,在通道方向和高度方向均进行块的分割的情况下,可以综合考虑用于卷积处理的累加运算所需进行的存储占用的第二片上存储器的资源量,以及行缓存占用的第一片上存储器的资源量,来确定是先遍历通道方向还是先遍历高度方向。
类似地,在通道方向和宽度方向均进行块的分割的情况下,可以综合考虑用于卷积处理的累加运算所需进行的存储占用的第二片上存储器的资源量,以及列缓存占用的第一片上存储器的资源量,来确定是先遍历通道方向还是先遍历宽度方向。
并且,从以上描述可以看出,运算电路包括的第二片上存储器的存储能力也可以影响到块的分割,例如,如果第二片上存储器的存储能力较小,则可以不在通道方向进行分割。
应理解,在图8所示的方案下,块的分割方向可以为高度方向和/或宽度方向,而不包括通道方向,此时,假设某一块在通道方向被分割为了至少两个子块,所述当前层的处理为卷积层的处理;则可以具有以下两种实现方式。
在一种实现方式中,如果先对所述至少两个子块的部分子块进行了卷积层的处理,则将所述部分子块的卷积层的输出结果分别存储到运算电路包括的第二片上存储器中,在所述至少两个子块的卷积层的处理进行完毕之后,将所述至少两个子块的卷积层的处理结果进行累加处理并输出到所述第二 存储空间。
在另一种实现方式中,如果先对所述至少两个子块的部分子块进行了卷积层的处理,先将先完成的子块的卷积层的输出结果进行累加处理并存储到运算电路包括的第二片上存储器中,在完成了又一个子块的卷积层的处理之后,将上次得到的累加结果与所述又一个子块的卷积层的输出结果进行累加存储到所述第二片上存储器中,并删除所述第二片上存储器之前存储的累加结果,直到累加结果累加了所述至少两个子块的卷积层的输出结果,并将所述输出结果存储到第一片上存储器中。
可选地,在本申请实施例中,在进行卷积神经网络的各层的处理时,输入数据的读取方式(例如,滑窗的滑动方式)可以影响到第一片上存储器中的数据的释放。以下是以对块包括的数据按行释放、按列释放或者按照存储地址进行释放为前提的。
在一种实现方式中，假设在宽度方向进行了块的分割且未在高度方向进行块的分割，例如，如图6中的(c)所示的方式，此时，需要在第一片上存储器中存储块1c的至少一列的数据，用于块2c的处理；在进行滑窗的滑动时，如果按照先行再列的方式进行滑动，且滑动的步长为1，则此时需要在块2c的一行的数据被遍历完之后，才能处理下一行的数据，并释放该至少一列的数据中属于该一行的数据；在进行滑窗的滑动时，如果按照先列再行的方式进行滑动，且滑动的步长为1，则此时可以先遍历该至少一列的数据，并释放该至少一列的数据。
因此,在对3D特征图进行了宽度方向的块分割且未进行高度方向的块分割时,读取数据时按照先列再行的方式进行读取。
在另一种实现方式中，假设在高度方向进行了块的分割且未在宽度方向进行块的分割，例如，如图6中的(a)所示的方式，此时，需要在第一片上存储器中存储块1a的至少一行的数据，用于块2a的处理；在进行滑窗的滑动时，如果按照先列再行的方式进行滑动，且滑动的步长为1，则此时需要在块2a的一列的数据被遍历完之后，才能处理下一列的数据，并释放该至少一行的数据中属于该一列的数据；在进行滑窗的滑动时，如果按照先行再列的方式进行滑动，且滑动的步长为1，则此时可以先遍历该至少一行的数据，并释放该至少一行的数据。
因此,在对3D特征图进行了高度方向的块分割且未进行宽度方向的块 分割时,读取数据时按照先行再列的方式进行读取。
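以下草图示意滑窗按"先行再列"与"先列再行"两种顺序遍历时产生的读取次序（步长为1，仅为示例性说明），可据此理解行缓存或列缓存的释放时机。

def window_origins(rows, cols, kernel, stride=1, row_first=True):
    # 返回滑窗左上角坐标的访问顺序
    r_pos = range(0, rows - kernel + 1, stride)
    c_pos = range(0, cols - kernel + 1, stride)
    if row_first:   # 先行再列: 先在一行内滑完, 再换到下一行
        return [(r, c) for r in r_pos for c in c_pos]
    else:           # 先列再行: 先在一列内滑完, 再换到下一列
        return [(r, c) for c in c_pos for r in r_pos]

print(window_origins(4, 4, 3, row_first=True))    # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(window_origins(4, 4, 3, row_first=False))   # [(0, 0), (1, 0), (0, 1), (1, 1)]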
并且，上文已经阐述，在数据按行存储（也即每行中的多个数据进行打包存储到一个存储空间）时，可以在高度方向进行分割，在数据按列（也即每列中的多个数据进行打包存储到一个存储空间）进行存储时，可以在宽度方向进行分割，以此减少第一片上存储器中缓存的数据。
因此,在本申请实施例中,在卷积神经网络的各层的输入数据按行存储,且输入数据的读取方式为先行再列的方式进行读取时,则3D特征图的块的分割方式为在高度方向进行块的分割且在宽度方向不进行块的分割。
以及，由于数据是按行进行存储的，为了避免在宽度方向进行分割时边界处理较复杂（也即上述提到的一个存储地址的数据可能分属于两个块的情况）的问题，可以在高度方向进行分割而不在宽度方向进行分割。
以及,在卷积神经网络的各层的输入数据按列存储,且输入数据的读取方式为先列再行的方式进行读取时,则3D特征图的块的分割方式为在宽度方向进行块的分割且在高度方向不进行块的分割。
以及，由于数据是按列进行存储的，为了避免在高度方向进行分割时边界处理较复杂的问题（也即上述提到的一个存储地址的数据可能分属于两个块的情况），可以在宽度方向进行分割而不在高度方向进行分割。
应理解，以上描述了对片上存储器中各块的数据进行释放时，可以按行释放、按列释放或者以存储空间的地址为单位进行释放，但是本申请实施例并不限于此，也可以按块释放各块的数据，也就是说一个块的数据处理完之后，其占用的片上存储空间即可被释放出来，该种释放方式可以降低控制的复杂度。
可选地,在本申请实施例中,以上提到的块的分割方式、读取顺序、存储空间的复用方式等可以预设在处理设备上的,也可以是由处理设备根据具体情况而定的,例如,可以根据实际所用的卷积神经网络的情况确定的。
例如，在该处理设备可以包括图4所示的处理器100时，可以针对第一运算电路122和第二运算电路124，预设该运算电路需要读取的块的大小，进行数据读取的时间和数据输出的时间；针对DMA130，可以预设从SRAM140中读取数据的时间，读取数据的地址、写入数据的时间以及写入数据的地址等；其中，该预设操作可以是由控制电路110从DDR读取指令之后，对第一运算电路122、第二运算电路124和DMA130的相应操作进行预设的。当然，在本申请实施例中，控制电路110也可以实时实现对其他电路的控制。
本申请实施例对3D特征图按块读取和进行卷积神经网络的处理,可以在片上存储资源或处理能力不足的情况下,实现对3D特征图的处理。
图11是根据本申请实施例的基于卷积神经网络的图像处理设备500的示意性框图。该设备500包括:
读取单元510,用于按块从第一片上存储器读取三维3D特征图;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块的当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输入数据;
处理单元520,用于按块对所述3D特征图进行卷积神经网络的所述当前层的处理;
存储单元530，用于将所述当前层的输出结果存储到所述第一片上存储器；其中，所述第一片上存储器还包括R个第二存储空间，所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据，在其中一个所述第二存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后，在所述一个所述第二存储空间上存储所述L个块中的另一块的输出数据；
其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
可选地,在本申请实施例中,所述处理单元520包括的进行所述当前层的处理的运算电路的数量小于所述S。
可选地,在本申请实施例中,所述当前层的输出结果被存储到所述第二存储空间,一直到下一层从所述第二存储空间中读取所述输出结果。
可选地,在本申请实施例中,所述存储单元530进一步用于:
在除所述下一层的处理之外的其他处理需要采用所述当前层的输出结果的情况下,将所述当前层的输出结果存储到片外存储器。
可选地，在本申请实施例中，所述第i+1层的输入数据从所述第一片上存储器读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器的时间≤所述第i层的输入数据从所述第一片上存储器读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器的时间，其中，i取值从1到n，所述卷积神经网络的处理包括n层。
可选地,在本申请实施例中,在针对所述L个块中的第一块进行当前层的处理所采用的输入数据也需要用到针对另一块进行的当前层的处理时,所述输入数据被存储到所述第一存储空间中,一直到所述数据被用到针对所述另一块进行的处理。
可选地,在本申请实施例中,所述S大于或等于3。
可选地,在本申请实施例中,既需要用到针对所述第一块的处理又需要用到针对所述另一块的处理的数据包括整数个行的数据;
所述3D特征图单个特征的数据在存储时,同一个存储地址中的数据不超出一行的数据。
可选地,在本申请实施例中,所述多个块为对所述3D特征图的高度方向进行分割且对宽度方向未进行分割得到的;在所述多个块中的各个块进行所述当前层的处理时,对输入数据是按照先行再列的方式进行读取的。
可选地,在本申请实施例中,所述处理单元520进一步用于:
在对所述3D特征图进行块的分割的方向包括至少两个方向,且所述至少两个方向包括高度方向的情况下,针对同一层的处理,先处理完具有相同宽度位置和通道位置且在不同高度位置上的所有块,然后处理另外的具有相同宽度位置和通道位置且在不同高度位置上的所有块。
可选地,在本申请实施例中,将所述3D特征图分割为所述L个块的方向包括宽度方向和/或高度方向。
可选地,在本申请实施例中,所述L个块中的第一块在通道方向被分割为了至少两个子块,所述当前层的处理为卷积层的处理;
所述处理单元520进一步用于:
如果先对所述至少两个子块的部分子块进行了卷积层的处理,则将所述部分子块的卷积层的输出结果分别存储到运算电路包括的第二片上存储器中,在所述至少两个子块的卷积层的处理进行完毕之后,将所述至少两个子块的卷积层的处理结果进行累加处理并输出到所述第二存储空间;或者,
如果先对所述至少两个子块的部分子块进行了卷积层的处理,先将先完成的子块的卷积层的输出结果进行累加处理并存储到运算电路包括的第二 片上存储器中,在完成了又一个子块的卷积层的处理之后,将上次得到的累加结果与所述又一个子块的卷积层的输出结果进行累加存储到所述第二片上存储器中,并删除所述第二片上存储器之前存储的累加结果,直到累加结果累加了所述至少两个子块的卷积层的输出结果,并将所述输出结果存储到第一片上存储器中。
可选地,在本申请实施例中,所述处理单元520进一步用于:
基于第一片上存储器中可用的存储容量和/或所述卷积神经网络的处理所采用的参数,确定所述多个块中每个块的大小。
可选地,在本申请实施例中,所述第一片上存储器为静态随机存取存储器SRAM。
可选地,在本申请实施例中,所述卷积神经网络的处理包括卷积层处理和池化层处理。
可选地,在本申请实施例中,所述设备500由现场可编程门阵列FPGA或特定应用的集成电路ASIC实现。
应理解，该图像处理设备500可以实现方法300或400中由处理设备实现的相应操作，为了简洁，在此不再赘述。
还应理解,图像处理设备可以由软件实现,可以由硬件实现,也可以由软硬件结合实现,本申请实施例对此不做具体限定。
图12是根据本申请实施例的基于卷积神经网络的图像处理设备600的示意性框图。该设备600包括第一片上存储器610和运算电路620；其中，所述运算电路620用于：
按块从第一片上存储器610读取三维3D特征图;其中,所述第一片上存储器610包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块的当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输入数据;
按块对所述3D特征图进行卷积神经网络的所述当前层的处理;
将所述当前层的输出结果存储到所述第一片上存储器610；其中，所述第一片上存储器610还包括R个第二存储空间，所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据，在其中一个所述第二存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后，在所述一个所述第二存储空间上存储所述L个块中的另一块的输出数据；
其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
可选地,在本申请实施例中,进行所述当前层的处理的所述运算电路620的数量小于所述S。
可选地,在本申请实施例中,所述当前层的输出结果被存储到所述第二存储空间,一直到下一层从所述第二存储空间中读取所述输出结果。
可选地,在本申请实施例中,如图12所示,该设备600还包括直接内存存取DMA640,用于:
在除所述下一层的处理之外的其他处理需要采用所述当前层的输出结果的情况下,将所述当前层的输出结果存储到片外存储器。
可选地，在本申请实施例中，所述第i+1层的输入数据从所述第一片上存储器610读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器610的时间≤所述第i层的输入数据从所述第一片上存储器610读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器610的时间，其中，i取值从1到n，所述卷积神经网络的处理包括n层。
可选地,在本申请实施例中,在针对所述L个块中的第一块进行当前层的处理所采用的输入数据也需要用到针对另一块进行的当前层的处理时,所述输入数据被存储到所述第一存储空间中,一直到所述数据被用到针对所述另一块进行的处理。
可选地,在本申请实施例中,所述S大于或等于3。
可选地,在本申请实施例中,既需要用到针对所述第一块的处理又需要用到针对所述另一块的处理的数据包括整数个行的数据;
所述3D特征图单个特征的数据在存储时,同一个存储地址中的数据不超出一行的数据。
可选地,在本申请实施例中,所述多个块为对所述3D特征图的高度方向进行分割且对宽度方向未进行分割得到的;在所述多个块中的各个块进行所述当前层的处理时,对输入数据是按照先行再列的方式进行读取的。
可选地,在本申请实施例中,所述运算电路620进一步用于:
在对所述3D特征图进行块的分割的方向包括至少两个方向,且所述至少两个方向包括高度方向的情况下,针对同一层的处理,先处理完具有相同宽度位置和通道位置且在不同高度位置上的所有块,然后处理另外的具有相同宽度位置和通道位置且在不同高度位置上的所有块。
可选地,在本申请实施例中,将所述3D特征图分割为所述L个块的方向包括宽度方向和/或高度方向。
可选地,在本申请实施例中,所述L个块中的第一块在通道方向被分割为了至少两个子块,所述当前层的处理为卷积层的处理;
所述运算电路620进一步用于:
如果先对所述至少两个子块的部分子块进行了卷积层的处理,则将所述部分子块的卷积层的输出结果分别存储到所述运算电路620包括的第二片上存储器中,在所述至少两个子块的卷积层的处理进行完毕之后,将所述至少两个子块的卷积层的处理结果进行累加处理并输出到所述第二存储空间;或者,
如果先对所述至少两个子块的部分子块进行了卷积层的处理,先将先完成的子块的卷积层的输出结果进行累加处理并存储到所述运算电路620包括的第二片上存储器中,在完成了又一个子块的卷积层的处理之后,将上次得到的累加结果与所述又一个子块的卷积层的输出结果进行累加存储到所述第二片上存储器中,并删除所述第二片上存储器之前存储的累加结果,直到累加结果累加了所述至少两个子块的卷积层的输出结果,并将所述输出结果存储到第一片上存储器610中。
可选地，在本申请实施例中，如图12所示，该设备600还包括控制电路630，用于：
基于第一片上存储器610中可用的存储容量和/或所述卷积神经网络的处理所采用的参数,确定所述多个块中每个块的大小。
可选地,在本申请实施例中,所述第一片上存储器610为静态随机存取存储器SRAM。
可选地,在本申请实施例中,所述卷积神经网络的处理包括卷积层处理和池化层处理。
可选地,在本申请实施例中,所述设备600由现场可编程门阵列FPGA 或特定应用的集成电路ASIC实现。
应理解,该图像处理设备600可以实现方法300或400中由处理设备实现的相应操作,为了简洁,在此不再赘述。
还应理解，图像处理设备600可以对应于图4所示的处理器100，为了简洁，在此不再赘述。
本申请实施例的图像处理设备500或600可以用于无人机中。
图13是根据本申请实施例的无人机700的示意性框图。该无人机700可以包括动力***710、传感***720和处理器730。
其中,该动力***710在处理器730的控制下为该无人机700提供动力;该传感***720包括摄像头722,用于拍摄图像帧;该处理器730用于基于该摄像头722拍摄的图像帧生成3D特征图,按块读取三维3D特征图,其中,所述3D特征图包括多个块;按块对所述3D特征图进行卷积神经网络的处理,卷积神经网络的处理的结果可以用于图像识别,从而可以控制无人机的飞行。
其中,该摄像头722还可以称为摄像组件,或者摄像头可以为无人机包括的用于获取图像帧的摄像组件的一部分。
其中,该处理器730可以用于实现上述方法实施例中的图像处理方法,为了简洁,在此不再赘述。
可选地,该处理器730可以置于飞行控制器中。该处理器730可以由多个处理器组成,例如一个处理器可以用于控制无人机的飞行,一个处理器可以用于进行本申请实施例提到的卷积神经网络的处理。
可选地,该无人机还可以包括片外存储器740,存储向处理器730输入的数据,以及可以存储处理器730输出的数据。
以上所述，仅为本申请的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应以权利要求的保护范围为准。

Claims (49)

  1. 一种基于卷积神经网络的图像处理方法,其特征在于,包括:
    按块从第一片上存储器读取3D特征图,所述3D特征图分为L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块作为神经网络当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块;
    按块对所述3D特征图进行卷积神经网络的所述当前层的处理;
    将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输出数据;
    其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
  2. 根据权利要求1所述的方法,其特征在于,进行所述当前层的处理的运算电路的数量小于所述S。
  3. 根据权利要求1或2所述的方法,其特征在于,所述当前层的输出结果被存储到所述第二存储空间,一直到下一层从所述第二存储空间中读取所述输出结果。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    在除所述下一层的处理之外的其他处理需要采用所述当前层的输出结果的情况下,将所述当前层的输出结果存储到片外存储器。
  5. 根据权利要求1至4中任一项所述的方法，其特征在于，所述第i+1层的输入数据从所述第一片上存储器读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器的时间≤所述第i层的输入数据从所述第一片上存储器读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器的时间，其中，i取值从1到n，所述卷积神经网络的处理包括n层。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,在针对所述L个块中的第一块进行当前层的处理所采用的输入数据也需要用到针对另一块进行的当前层的处理时,所述输入数据被存储到所述第一存储空间中,一直到所述数据被用到针对所述另一块进行的处理。
  7. 根据权利要求6所述的方法,其特征在于,所述S大于或等于3。
  8. 根据权利要求6或7所述的方法,其特征在于,既需要用到针对所述第一块的处理又需要用到针对所述另一块的处理的数据包括整数个行的数据;
    所述3D特征图单个特征的数据在存储时,同一个存储地址中的数据不超出一行的数据。
  9. 根据权利要求8所述的方法,其特征在于,所述多个块为对所述3D特征图的高度方向进行分割且对宽度方向未进行分割得到的;在所述多个块中的各个块进行所述当前层的处理时,对输入数据是按照先行再列的方式进行读取的。
  10. 根据权利要求8或9所述的方法,其特征在于,所述按块对所述3D特征图进行卷积神经网络的处理,包括:
    在对所述3D特征图进行块的分割的方向包括至少两个方向,且所述至少两个方向包括高度方向的情况下,针对同一层的处理,先处理完具有相同宽度位置和通道位置且在不同高度位置上的所有块,然后处理另外的具有相同宽度位置和通道位置且在不同高度位置上的所有块。
  11. 根据权利要求1至10中任一项所述的方法，其特征在于，将所述3D特征图分割为所述L个块的方向包括宽度方向和/或高度方向。
  12. 根据权利要求11所述的方法,其特征在于,所述L个块中的第一块在通道方向被分割为了至少两个子块,所述当前层的处理为卷积层的处理;
    所述按块对所述3D特征图进行卷积神经网络的所述当前层的处理,包括:
    如果先对所述至少两个子块的部分子块进行了卷积层的处理,则将所述部分子块的卷积层的输出结果分别存储到运算电路包括的第二片上存储器中,在所述至少两个子块的卷积层的处理进行完毕之后,将所述至少两个子块的卷积层的处理结果进行累加处理并输出到所述第二存储空间;或者,
    如果先对所述至少两个子块的部分子块进行了卷积层的处理,先将先完成的子块的卷积层的输出结果进行累加处理并存储到运算电路包括的第二片上存储器中,在完成了又一个子块的卷积层的处理之后,将上次得到的累加结果与所述又一个子块的卷积层的输出结果进行累加存储到所述第二片上存储器中,并删除所述第二片上存储器之前存储的累加结果,直到累加结果累加了所述至少两个子块的卷积层的输出结果,并将所述输出结果存储到第一片上存储器中。
  13. 根据权利要求1至12中任一项所述的方法,其特征在于,所述方法还包括:
    基于第一片上存储器中可用的存储容量和/或所述卷积神经网络的处理所采用的参数,确定所述多个块中每个块的大小。
  14. 根据权利要求1至13中任一项所述的方法,其特征在于,所述第一片上存储器为静态随机存取存储器SRAM。
  15. 根据权利要求1至14中任一项所述的方法,其特征在于,所述卷积神经网络的处理包括卷积层处理和池化层处理。
  16. 根据权利要求1至15中任一项所述的方法,其特征在于,所述方法由现场可编程门阵列FPGA或特定应用的集成电路ASIC实现。
  17. 一种基于卷积神经网络的图像处理设备,其特征在于,包括:
    读取单元,用于按块从第一片上存储器读取3D特征图,所述3D特征图分为L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块作为神经网络当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块;
    处理单元,用于按块对所述3D特征图进行卷积神经网络的所述当前层的处理;
    存储单元,用于将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L 个块中的另一块的输出数据;
    其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
  18. 根据权利要求17所述的设备,其特征在于,所述处理单元包括的进行所述当前层的处理的运算电路的数量小于所述S。
  19. 根据权利要求17或18所述的设备,其特征在于,所述当前层的输出结果被存储到所述第二存储空间,一直到下一层从所述第二存储空间中读取所述输出结果。
  20. 根据权利要求19所述的设备,其特征在于,所述存储单元进一步用于:
    在除所述下一层的处理之外的其他处理需要采用所述当前层的输出结果的情况下,将所述当前层的输出结果存储到片外存储器。
  21. 根据权利要求17至20中任一项所述的设备，其特征在于，所述第i+1层的输入数据从所述第一片上存储器读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器的时间≤所述第i层的输入数据从所述第一片上存储器读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器的时间，其中，i取值从1到n，所述卷积神经网络的处理包括n层。
  22. 根据权利要求17至21中任一项所述的设备,其特征在于,在针对所述L个块中的第一块进行当前层的处理所采用的输入数据也需要用到针对另一块进行的当前层的处理时,所述输入数据被存储到所述第一存储空间中,一直到所述数据被用到针对所述另一块进行的处理。
  23. 根据权利要求22所述的设备,其特征在于,所述S大于或等于3。
  24. 根据权利要求22或23所述的设备,其特征在于,既需要用到针对所述第一块的处理又需要用到针对所述另一块的处理的数据包括整数个行的数据;
    所述3D特征图单个特征的数据在存储时,同一个存储地址中的数据不超出一行的数据。
  25. 根据权利要求24所述的设备,其特征在于,所述多个块为对所述3D特征图的高度方向进行分割且对宽度方向未进行分割得到的;在所述多个块中的各个块进行所述当前层的处理时,对输入数据是按照先行再列的方 式进行读取的。
  26. 根据权利要求24或25所述的设备,其特征在于,所述处理单元进一步用于:
    在对所述3D特征图进行块的分割的方向包括至少两个方向,且所述至少两个方向包括高度方向的情况下,针对同一层的处理,先处理完具有相同宽度位置和通道位置且在不同高度位置上的所有块,然后处理另外的具有相同宽度位置和通道位置且在不同高度位置上的所有块。
  27. 根据权利要求17至26中任一项所述的设备，其特征在于，将所述3D特征图分割为所述L个块的方向包括宽度方向和/或高度方向。
  28. 根据权利要求27所述的设备,其特征在于,所述L个块中的第一块在通道方向被分割为了至少两个子块,所述当前层的处理为卷积层的处理;
    所述处理单元进一步用于:
    如果先对所述至少两个子块的部分子块进行了卷积层的处理,则将所述部分子块的卷积层的输出结果分别存储到运算电路包括的第二片上存储器中,在所述至少两个子块的卷积层的处理进行完毕之后,将所述至少两个子块的卷积层的处理结果进行累加处理并输出到所述第二存储空间;或者,
    如果先对所述至少两个子块的部分子块进行了卷积层的处理,先将先完成的子块的卷积层的输出结果进行累加处理并存储到运算电路包括的第二片上存储器中,在完成了又一个子块的卷积层的处理之后,将上次得到的累加结果与所述又一个子块的卷积层的输出结果进行累加存储到所述第二片上存储器中,并删除所述第二片上存储器之前存储的累加结果,直到累加结果累加了所述至少两个子块的卷积层的输出结果,并将所述输出结果存储到第一片上存储器中。
  29. 根据权利要求17至28中任一项所述的设备,其特征在于,所述处理单元进一步用于:
    基于第一片上存储器中可用的存储容量和/或所述卷积神经网络的处理所采用的参数,确定所述多个块中每个块的大小。
  30. 根据权利要求17至29中任一项所述的设备,其特征在于,所述第一片上存储器为静态随机存取存储器SRAM。
  31. 根据权利要求17至30中任一项所述的设备,其特征在于,所述卷积神经网络的处理包括卷积层处理和池化层处理。
  32. 根据权利要求17至31中任一项所述的设备,其特征在于,所述设备由现场可编程门阵列FPGA或特定应用的集成电路ASIC实现。
  33. 一种基于卷积神经网络的图像处理设备,其特征在于,包括第一片上存储器和运算电路;其中,所述运算电路用于:
    按块从第一片上存储器读取3D特征图,所述3D特征图分为L个块;其中,所述第一片上存储器包括S个第一存储空间,所述S个第一存储空间中的每个所述第一存储空间分别用于存储所述3D特征图包括的L个块中的一个块作为神经网络当前层的输入数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输入数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块;
    按块对所述3D特征图进行卷积神经网络的所述当前层的处理;
    将所述当前层的输出结果存储到所述第一片上存储器;其中,所述第一片上存储器还包括R个第二存储空间,所述R个第二存储空间中的每个所述第二存储空间分别用于存储所述L个块中一个块的当前层的输出数据,在其中一个所述第一存储空间上存储的所述L个块中的一个块的输出数据被读取完毕之后,在所述一个所述第一存储空间上存储所述L个块中的另一块的输出数据;
    其中,所述L、所述S和所述R为大于或等于2的整数,所述S和所述R小于所述L。
  34. 根据权利要求33所述的设备,其特征在于,进行所述当前层的处理的所述运算电路的数量小于所述S。
  35. 根据权利要求33或34所述的设备,其特征在于,所述当前层的输出结果被存储到所述第二存储空间,一直到下一层从所述第二存储空间中读取所述输出结果。
  36. 根据权利要求35所述的设备,其特征在于,还包括直接内存存取DMA,用于:
    在除所述下一层的处理之外的其他处理需要采用所述当前层的输出结果的情况下,将所述当前层的输出结果存储到片外存储器。
  37. 根据权利要求33至36中任一项所述的设备，其特征在于，所述第i+1层的输入数据从所述第一片上存储器读取的时间+所述第i+1层的计算时间+所述第i+1层的输出数据写入所述第一片上存储器的时间≤所述第i层的输入数据从所述第一片上存储器读取的时间+所述第i层的计算时间+所述第i层的输出数据写入所述第一片上存储器的时间，其中，i取值从1到n，所述卷积神经网络的处理包括n层。
  38. 根据权利要求33至37中任一项所述的设备,其特征在于,在针对所述L个块中的第一块进行当前层的处理所采用的输入数据也需要用到针对另一块进行的当前层的处理时,所述输入数据被存储到所述第一存储空间中,一直到所述数据被用到针对所述另一块进行的处理。
  39. 根据权利要求38所述的设备,其特征在于,所述S大于或等于3。
  40. 根据权利要求38或39所述的设备,其特征在于,既需要用到针对所述第一块的处理又需要用到针对所述另一块的处理的数据包括整数个行的数据;
    所述3D特征图单个特征的数据在存储时,同一个存储地址中的数据不超出一行的数据。
  41. 根据权利要求40所述的设备,其特征在于,所述多个块为对所述3D特征图的高度方向进行分割且对宽度方向未进行分割得到的;在所述多个块中的各个块进行所述当前层的处理时,对输入数据是按照先行再列的方式进行读取的。
  42. 根据权利要求40或41所述的设备,其特征在于,所述运算电路进一步用于:
    在对所述3D特征图进行块的分割的方向包括至少两个方向,且所述至少两个方向包括高度方向的情况下,针对同一层的处理,先处理完具有相同宽度位置和通道位置且在不同高度位置上的所有块,然后处理另外的具有相同宽度位置和通道位置且在不同高度位置上的所有块。
  43. 根据权利要求33至42中任一项所述的设备，其特征在于，将所述3D特征图分割为所述L个块的方向包括宽度方向和/或高度方向。
  44. 根据权利要求43所述的设备,其特征在于,所述L个块中的第一块在通道方向被分割为了至少两个子块,所述当前层的处理为卷积层的处理;
    所述运算电路进一步用于:
    如果先对所述至少两个子块的部分子块进行了卷积层的处理,则将所述部分子块的卷积层的输出结果分别存储到所述运算电路包括的第二片上存储器中,在所述至少两个子块的卷积层的处理进行完毕之后,将所述至少两 个子块的卷积层的处理结果进行累加处理并输出到所述第二存储空间;或者,
    如果先对所述至少两个子块的部分子块进行了卷积层的处理,先将先完成的子块的卷积层的输出结果进行累加处理并存储到所述运算电路包括的第二片上存储器中,在完成了又一个子块的卷积层的处理之后,将上次得到的累加结果与所述又一个子块的卷积层的输出结果进行累加存储到所述第二片上存储器中,并删除所述第二片上存储器之前存储的累加结果,直到累加结果累加了所述至少两个子块的卷积层的输出结果,并将所述输出结果存储到第一片上存储器中。
  45. 根据权利要求33至44中任一项所述的设备,其特征在于,还包括控制电路,用于:
    基于第一片上存储器中可用的存储容量和/或所述卷积神经网络的处理所采用的参数,确定所述多个块中每个块的大小。
  46. 根据权利要求33至45中任一项所述的设备,其特征在于,所述第一片上存储器为静态随机存取存储器SRAM。
  47. 根据权利要求33至46中任一项所述的设备,其特征在于,所述卷积神经网络的处理包括卷积层处理和池化层处理。
  48. 根据权利要求33至47中任一项所述的设备,其特征在于,所述设备由现场可编程门阵列FPGA或特定应用的集成电路ASIC实现。
  49. 一种无人机,其特征在于,包括根据权利要求17至48中任一项所述的基于卷积神经网络的图像处理设备。
PCT/CN2018/109190 2018-09-30 2018-09-30 基于卷积神经网络的图像处理方法和设备,以及无人机 WO2020062284A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/109190 WO2020062284A1 (zh) 2018-09-30 2018-09-30 基于卷积神经网络的图像处理方法和设备,以及无人机
CN201880038969.4A CN110770740A (zh) 2018-09-30 2018-09-30 基于卷积神经网络的图像处理方法和设备,以及无人机
US17/190,378 US20210192246A1 (en) 2018-09-30 2021-03-02 Convolutional neural network-based image processing method and device, and unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109190 WO2020062284A1 (zh) 2018-09-30 2018-09-30 基于卷积神经网络的图像处理方法和设备,以及无人机

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/190,378 Continuation US20210192246A1 (en) 2018-09-30 2021-03-02 Convolutional neural network-based image processing method and device, and unmanned aerial vehicle

Publications (1)

Publication Number Publication Date
WO2020062284A1 true WO2020062284A1 (zh) 2020-04-02

Family

ID=69328774

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109190 WO2020062284A1 (zh) 2018-09-30 2018-09-30 基于卷积神经网络的图像处理方法和设备,以及无人机

Country Status (3)

Country Link
US (1) US20210192246A1 (zh)
CN (1) CN110770740A (zh)
WO (1) WO2020062284A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898081A (zh) * 2020-07-09 2020-11-06 上海兆芯集成电路有限公司 卷积运算方法及卷积运算装置
CN113554157A (zh) * 2020-04-24 2021-10-26 上海商汤智能科技有限公司 数据处理方法及相关产品
CN113949592A (zh) * 2021-12-22 2022-01-18 湖南大学 一种基于fpga的对抗攻击防御***及方法
CN114089911A (zh) * 2021-09-07 2022-02-25 上海新氦类脑智能科技有限公司 基于数据复用的块切分拼接处理方法、装置、设备及介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3089664A1 (fr) * 2018-12-05 2020-06-12 Stmicroelectronics (Rousset) Sas Procédé et dispositif pour réduire la charge de calcul d’un microprocesseur destiné à traiter des données par un réseau de neurones à convolution
US11507831B2 (en) * 2020-02-24 2022-11-22 Stmicroelectronics International N.V. Pooling unit for deep learning acceleration
CN112955908A (zh) * 2020-03-13 2021-06-11 深圳市大疆创新科技有限公司 卷积神经网络的数据处理方法、预测方法、计算装置和存储介质
CN112541929A (zh) * 2021-01-25 2021-03-23 翱捷科技股份有限公司 一种用于卷积神经网络的图像处理方法及***
CN113688069B (zh) * 2021-09-10 2022-08-02 北京百度网讯科技有限公司 数据处理方法、装置、电子设备及介质
CN113946538B (zh) * 2021-09-23 2024-04-12 南京大学 一种基于行缓存机制的卷积层融合存储装置及方法
CN114202067A (zh) * 2021-11-30 2022-03-18 山东产研鲲云人工智能研究院有限公司 面向卷积神经网络加速器的带宽优化方法及相关设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025317A (zh) * 2015-10-07 2017-08-08 阿尔特拉公司 用于实施卷积神经网络加速器上的层的方法和装置
CN107203807A (zh) * 2016-03-16 2017-09-26 中国科学院计算技术研究所 神经网络的计算方法、***及其装置
CN108133270A (zh) * 2018-01-12 2018-06-08 清华大学 卷积神经网络加速方法及装置
CN108427990A (zh) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 神经网络计算***和方法
CN108573305A (zh) * 2017-03-15 2018-09-25 杭州海康威视数字技术股份有限公司 一种数据处理方法、设备及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107025317A (zh) * 2015-10-07 2017-08-08 阿尔特拉公司 用于实施卷积神经网络加速器上的层的方法和装置
CN108427990A (zh) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 神经网络计算***和方法
CN107203807A (zh) * 2016-03-16 2017-09-26 中国科学院计算技术研究所 神经网络的计算方法、***及其装置
CN108573305A (zh) * 2017-03-15 2018-09-25 杭州海康威视数字技术股份有限公司 一种数据处理方法、设备及装置
CN108133270A (zh) * 2018-01-12 2018-06-08 清华大学 卷积神经网络加速方法及装置

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554157A (zh) * 2020-04-24 2021-10-26 上海商汤智能科技有限公司 数据处理方法及相关产品
CN111898081A (zh) * 2020-07-09 2020-11-06 上海兆芯集成电路有限公司 卷积运算方法及卷积运算装置
CN111898081B (zh) * 2020-07-09 2024-02-27 上海兆芯集成电路股份有限公司 卷积运算方法及卷积运算装置
CN114089911A (zh) * 2021-09-07 2022-02-25 上海新氦类脑智能科技有限公司 基于数据复用的块切分拼接处理方法、装置、设备及介质
CN114089911B (zh) * 2021-09-07 2024-01-05 上海新氦类脑智能科技有限公司 基于数据复用的块切分拼接处理方法、装置、设备及介质
CN113949592A (zh) * 2021-12-22 2022-01-18 湖南大学 一种基于fpga的对抗攻击防御***及方法
CN113949592B (zh) * 2021-12-22 2022-03-22 湖南大学 一种基于fpga的对抗攻击防御***及方法

Also Published As

Publication number Publication date
US20210192246A1 (en) 2021-06-24
CN110770740A (zh) 2020-02-07

Similar Documents

Publication Publication Date Title
WO2020062284A1 (zh) 基于卷积神经网络的图像处理方法和设备,以及无人机
US20220261615A1 (en) Neural network devices and methods of operating the same
EP3306478B1 (en) Buffer addressing for a convolutional neural network
KR102642853B1 (ko) 컨볼루션 회로, 그것을 포함하는 어플리케이션 프로세서 및 그것의 동작 방법
US20220365700A1 (en) Matrix transfer accelerator system and method
TWI716108B (zh) 適用於深度神經網路之卷積計算積體電路及其方法
Goetschalckx et al. Breaking high-resolution CNN bandwidth barriers with enhanced depth-first execution
CN108573305B (zh) 一种数据处理方法、设备及装置
US20200118249A1 (en) Device configured to perform neural network operation and method of operating same
EP3161793B1 (en) Adaptive partition mechanism with arbitrary tile shape for tile based rendering gpu architecture
WO2020073801A1 (zh) 一种3d图像处理中数据读写方法及***、存储介质及终端
CN112184587B (zh) 一种边缘数据增强模型、以及基于所述模型的高效边缘数据增强方法及***
CN109416743B (zh) 一种用于识别人为动作的三维卷积装置
US11748010B2 (en) Methods and systems for storing variable length data blocks in memory
US20200356844A1 (en) Neural network processor for compressing featuremap data and computing system including the same
WO2021102946A1 (zh) 计算装置、方法、处理器和可移动设备
US10878592B2 (en) Video data processing
CN111191780B (zh) 均值池化累加电路、装置以及方法
US20210174181A1 (en) Hardware Implementation of a Neural Network
US20210303974A1 (en) Neural network processing
GB2585810A (en) Buffer addressing for a convolutional neural network
CN113538237A (zh) 一种图像拼接***、方法及电子设备
WO2022027818A1 (zh) 数据批处理方法及其批处理装置、存储介质
RU2820172C1 (ru) Способ обработки данных посредством нейронной сети, подвергнутой декомпозиции с учетом объема памяти вычислительного устройства (варианты), и компьютерно-читаемый носитель
US20240031704A1 (en) Hybrid addressing for imaging and vision data

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018934779

Country of ref document: EP

Effective date: 20210419

122 Ep: pct application non-entry in european phase

Ref document number: 18934779

Country of ref document: EP

Kind code of ref document: A1