TWI765336B

TWI765336B - Block-based inference method for memory-efficient convolutional neural network implementation and system thereof

Info

Publication number: TWI765336B
Application number: TW109130493A
Authority: TW
Inventors: 黃朝宗
Original assignee: 國立清華大學
Priority date: 2019-10-08
Filing date: 2020-09-04
Publication date: 2022-05-21
Also published as: TW202115624A; CN112633462A

Abstract

A block-based inference method for memory-efficient convolutional neural network (CNN) implementation is proposed. The block-based inference method for memory-efficient CNN implementation includes a parameter setting step, a dividing step, a block-based inference step and a temporary storing step. The parameter setting step includes setting an inference parameter group. The inference parameter group includes a depth, a block width, a block height and a kernel size. The dividing step includes driving a processing unit to divide the image into a plurality of input block data according to the depth, the block width and the block height. Each of the input block data has an input block size. The block-based inference step includes driving the processing unit to perform a multi-layer convolution operation on each of the input block data to generate an output block data. The multi-layer convolution operation includes a first direction data selecting step, a second direction data selecting step and a convolution operation step. The first direction data selecting step includes selecting a plurality of ith layer recomputing features according to a position of the output block data along a first direction, and then selecting an ith layer recomputing input feature block data according to the position of the output block data and the ith layer recomputing features. i is one of a plurality of positive integers from 1 to the depth. The second direction data selecting step includes selecting a plurality of ith layer reusing features according to the ith layer recomputing input feature block data along a second direction, and then combining the ith layer recomputing input feature block data with the ith layer reusing features to generate an ith layer reusing input feature block data. The convolution operation step includes selecting a plurality of ith layer sub-block input feature groups from the ith layer reusing input feature block data according to an ith layer kernel size, and then performing a convolution operation on each of the ith layer sub-block input feature groups to generate an ith layer sub-block output feature, and combining the ith layer sub-block output features corresponding to the ith layer sub-block input feature groups to form an ith layer output feature block data. The temporary storing step includes driving a block buffer bank to store the output feature block data and the ith layer reusing features. Therefore, the present disclosure reuses the features along the block scanning direction to reduce recomputing overheads and recomputes the features between different scan lines to eliminate the global line buffer, so that the inference flow of the present disclosure can provide great flexibility and good tradeoffs between computing and memory overheads for high-performance and memory-efficient CNN inference.

Description

適用於卷積神經網路之記憶體優化實現之區塊式推論方法及其系統A block-based inference method and system for memory-optimized implementation of convolutional neural networks

本發明是關於一種區塊式推論方法及其系統，特別是關於一種適用於卷積神經網路之記憶體優化實現之區塊式推論方法及其系統。The present invention relates to a block-based inference method and system, in particular to a block-based inference method and system suitable for memory optimization of convolutional neural networks.

當使用卷積神經網路於影像處理應用時，其外部記憶體頻寬需求可能會相當高，而使用區塊式推論流程，可以大幅降低此頻寬需求。然而，區塊間會有重疊的特徵向量，目前已知有兩種不同的處理方法，一種是採重新計算方式，另一種則是採重複利用方式。其中前者會增加計算量而降低輸出像素量，而後者則是需要大量的區塊暫存器來存放重複使用的特徵向量。由此可知，目前市場上缺乏一種能在不增加太多計算量以及區塊暫存器前提下，可大幅降低外部記憶體頻寬需求的適用於卷積神經網路之記憶體優化實現之區塊式推論方法及其系統，故相關業者均在尋求其解決之道。When using Convolutional Neural Networks for image processing applications, the external memory bandwidth requirement can be quite high, and using a block-based inference process can greatly reduce this bandwidth requirement. However, there will be overlapping eigenvectors between blocks, and two different processing methods are known, one is recalculation, and the other is reuse. The former will increase the amount of calculation and reduce the amount of output pixels, while the latter requires a large number of block registers to store the feature vectors that are reused. It can be seen that there is currently a lack of a memory optimization implementation area suitable for convolutional neural networks that can greatly reduce the bandwidth requirements of external memory without increasing too much computation and block registers. The block inference method and its system, so the relevant industry are all looking for its solution.

因此，本發明之目的在於提供一種適用於卷積神經網路之記憶體優化實現之區塊式推論方法及其系統，當進行區塊式推論時，於區塊前行的方向上重複利用已計算過的特徵，而於另一個方向上採用重新計算的方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。Therefore, the purpose of the present invention is to provide a block-based inference method and system suitable for memory optimization of convolutional neural networks. Calculated features and recalculation in the other direction enable block-based inference to significantly reduce the bandwidth requirements of external memory without increasing too much computation and block registers.

依據本發明的方法態樣之一實施方式提供一種適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其用以處理一輸入影像。此適用於卷積神經網路之記憶體優化實現之區塊式推論方法包含參數設定步驟、分割步驟、區塊推論步驟以及暫存步驟，其中參數設定步驟係設定一推論參數組，此推論參數組包含一卷積深度、一區塊寬度、一區塊高度及複數層卷積核大小。分割步驟係驅動一運算處理單元依據卷積深度、區塊寬度、區塊高度及此些層卷積核大小劃分輸入影像成複數輸入區塊資料，各輸入區塊資料具有輸入區塊大小。區塊推論步驟係驅動運算處理單元將各輸入區塊資料執行多層卷積操作而產生輸出區塊資料，此多層卷積操作包含第一方向資料選取步驟、第二方向資料選取步驟及一卷積運算步驟，其中第一方向資料選取步驟係依據輸出區塊資料之一位置沿一掃描換行方向選擇複數第i層重新計算特徵，然後依據輸出區塊資料之位置及此些第i層重新計算特徵選取出一第i層重新計算輸入特徵區塊資料，其中i為1至卷積深度之複數個正整數之其中一者。第二方向資料選取步驟係依據第i層重新計算輸入特徵區塊資料沿一區塊掃描方向選取出複數第i層重複利用特徵，並將第i層重新計算輸入特徵區塊資料及此些第i層重複利用特徵組合而產生一第i層重複利用輸入特徵區塊資料。此外，卷積運算步驟係依據第i層卷積核大小從第i層重複利用輸入特徵區塊資料中選取出複數第i層子區塊輸入特徵群，然後對各第i層子區塊輸入特徵群及一卷積參數組執行一卷積運算而產生一第i層子區塊輸出特徵，並將對應此些第i層子區塊輸入特徵群之此些第i層子區塊輸出特徵組合而形成一第i層輸出特徵區塊資料。暫存步驟係驅動一區塊暫存器暫存第i層輸出特徵區塊資料及此些第i層重複利用特徵。One embodiment of a method aspect according to the present invention provides a block-based inference method suitable for memory-optimized implementation of a convolutional neural network for processing an input image. This block-based inference method suitable for memory optimization of convolutional neural networks includes a parameter setting step, a segmentation step, a block inference step and a temporary storage step, wherein the parameter setting step is to set an inference parameter group, the inference parameter The group includes a convolution depth, a block width, a block height, and the size of the convolution kernels of multiple layers. The dividing step drives an arithmetic processing unit to divide the input image into a plurality of input block data according to the convolution depth, block width, block height and the size of these layers of convolution kernels, and each input block data has an input block size. The block inference step is to drive the arithmetic processing unit to perform a multi-layer convolution operation on each input block data to generate output block data. The multi-layer convolution operation includes a first direction data selection step, a second direction data selection step and a convolution The operation step, wherein the first direction data selection step is to select a plurality of i-th layer recalculation features according to a position of the output block data along a scanning line feed direction, and then recalculate the features according to the position of the output block data and these i-th layers. An i-th layer is selected to recalculate the input feature block data, where i is one of a plurality of positive integers from 1 to the convolution depth. The second direction data selection step is to select a plurality of i-th layers of reused features along a block scanning direction according to the i-th layer recalculated input feature block data, and recalculate the i-th layer of input feature block data and these The i-layer reuses feature combination to generate an i-th layer reuses input feature block data. In addition, the convolution operation step is to select a plurality of i-th layer sub-block input feature groups from the i-th layer reused input feature block data according to the i-th layer convolution kernel size, and then input the input feature group to each i-th layer sub-block The feature group and a convolution parameter group perform a convolution operation to generate an output feature of an i-th layer sub-block, and output the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature group combined to form an i-th layer output feature block data. The step of temporary storage drives a block register to temporarily store the output feature block data of the i-th layer and the reused features of the i-th layer.

藉此，本發明的適用於卷積神經網路之記憶體優化實現之區塊式推論方法透過不同方向使用不同特徵之計算方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。In this way, the block-based inference method of the present invention suitable for memory optimization of convolutional neural networks uses computing methods with different characteristics in different directions, so that block-based inference does not increase too much computation and block temporary storage. Under the premise of the device, the bandwidth requirement of the external memory can still be greatly reduced.

前述實施方式之其他實施例如下：當前述i等於1時，第i層重新計算輸入特徵區塊資料等於各輸入區塊資料。當i等於卷積深度時，第i層輸出特徵區塊資料等於輸出區塊資料。Other examples of the aforementioned embodiments are as follows: when the aforementioned i is equal to 1, the i-th layer recalculates the input feature block data equal to each input block data. When i is equal to the convolution depth, the output feature block data of the i-th layer is equal to the output block data.

前述實施方式之其他實施例如下：前述第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小與一第i層重新計算輸入特徵區塊通道數，第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小與一第i層輸出特徵區塊通道數。第i層輸出特徵區塊大小大於第i層重新計算輸入特徵區塊大小，且第i層重新計算輸入特徵區塊通道數等於第i層輸出特徵區塊通道數。Other examples of the aforementioned embodiments are as follows: the i-th layer recalculated input feature block data has an i-th layer re-calculated input feature block size and an i-th layer re-calculated the number of input feature blocks channels, and the i-th layer outputs The feature block data has an output feature block size of the i-th layer and a channel number of the i-th layer output feature block. The size of the output feature block of the ith layer is larger than the size of the recalculated input feature block of the ith layer, and the number of channels of the recalculated input feature block of the ith layer is equal to the number of channels of the output feature block of the ith layer.

前述實施方式之其他實施例如下：前述區塊掃描方向垂直於掃描換行方向，區塊寬度大於區塊高度，且區塊高度的延伸方向平行於區塊掃描方向。Other examples of the foregoing embodiments are as follows: the block scanning direction is perpendicular to the scanning line feed direction, the block width is greater than the block height, and the extending direction of the block height is parallel to the block scanning direction.

前述實施方式之其他實施例如下：前述卷積深度、區塊寬度及區塊高度均為正整數，第i層卷積核大小為k_Wi ×k_Hi 。此些第i層重複利用特徵沿區塊掃描方向具有一重複利用特徵數量，且重複利用特徵數量等於k_Hi ‒1。Other examples of the aforementioned embodiments are as follows: the aforementioned convolution depth, block width and block height are all positive integers, and the size of the convolution kernel of the i-th layer is k _Wi ×k _Hi . The i-th layer of reused features has a number of reused features along the block scanning direction, and the number of reused features is equal to k _Hi −1.

前述實施方式之其他實施例如下：前述區塊寬度表示為B_W ，卷積深度表示為D，區塊高度表示為B_H 。輸入區塊大小等於B_W ×B_H 。輸出區塊資料具有一輸出區塊大小，且輸出區塊大小等於(B_W ‒2D)×B_H 。第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小，且第i層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H 。第i層重複利用輸入特徵區塊資料具有一第i層重複利用輸入特徵區塊大小，且第i層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)。第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小，且第i層輸出特徵區塊大小等於(B_W ‒2i)×B_H 。卷積深度小於區塊寬度之一半。Other examples of the aforementioned embodiments are as follows: the aforementioned block width is represented by B _W , the convolution depth is represented by D, and the block height is represented by B _H . The input block size is equal to B _W × B _H . The output block data has an output block size, and the output block size is equal to (B _W -2D)×B _H . The i-th layer recomputed input feature block data has an i-th layer re-computed input feature block size, and the i-th layer re-computed input feature block size is equal to (B _W -2i+2)×B _H . The i-th layer reused input feature block data has an i-th layer reused input feature block size, and the i-th layer reused input feature block size is equal to (B _W –2i+2)×(B _H +2) . The i-th layer output feature block data has an i-th layer output feature block size, and the i-th layer output feature block size is equal to (B _W -2i)×B _H . The convolution depth is less than half of the block width.

前述實施方式之其他實施例如下：當前述其中一第i層子區塊輸入特徵群之複數輸入特徵之至少一者位於第i層重複利用輸入特徵區塊資料之外區域時，此其中一第i層子區塊輸入特徵群之此些輸入特徵包含複數個外區塊特徵及複數個第一內區塊特徵。此些外區塊特徵代表已運算之特徵，此些第一內區塊特徵代表未運算之特徵。再者，當其中一第i層子區塊輸入特徵群之此些輸入特徵均位於第i層重複利用輸入特徵區塊資料之內區域時，此其中一第i層子區塊輸入特徵群之此些輸入特徵僅包含複數個第二內區塊特徵。第i層重複利用輸入特徵區塊資料沿區塊掃描方向之排列順序為外區域與內區域。Other examples of the aforementioned implementation manner are as follows: when at least one of the complex input features of the input feature group of a sub-block of the i-th layer is located in an area outside the data of the re-used input feature block of the i-th layer, the one of the first The input features of the i-layer sub-block input feature group include a plurality of outer block features and a plurality of first inner block features. The outer block features represent features that have been computed, and the first inner block features represent features that have not been computed. Furthermore, when the input features of the input feature group of a sub-block of the i-th layer are all located in the area within the data of the input feature block of the i-th layer that is reused, the input feature of the input feature group of the sub-block of the i-th layer is Such input features only include a plurality of second inner block features. The i-th layer reuses the input feature block data in the order of the block scanning direction as the outer area and the inner area.

前述實施方式之其他實施例如下：前述外區塊特徵係儲存於區塊暫存器，此區塊暫存器具有暫存空間，暫存空間透過第i層重新計算輸入特徵區塊資料之寬度、卷積深度、層數、通道數及第i層卷積核大小運算求得。第i層重新計算輸入特徵區塊資料之寬度表示為B_Wi ，卷積深度表示為D，層數表示為i，通道數表示為C，第i層卷積核大小為k_Wi ×k_Hi ，暫存空間表示為LBS且符合下式：

。Other examples of the aforementioned embodiments are as follows: the aforementioned outer block features are stored in a block register, which has a temporary storage space, and the temporary storage space recalculates the width of the input feature block data through the i-th layer , the convolution depth, the number of layers, the number of channels, and the size of the i-th layer convolution kernel. The width of the recalculated input feature block data of the i-th layer is denoted as B _Wi , the convolution depth is denoted as D, the number of layers is denoted as i, the number of channels is denoted as C, and the size of the convolution kernel of the i-th layer is k _Wi × k _Hi , The temporary storage space is expressed as LBS and conforms to the following formula:

.

依據本發明的結構態樣之一實施方式提供一種適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其用以處理一輸入影像，此適用於卷積神經網路之記憶體優化實現之區塊式推論系統包含區塊暫存器以及運算處理單元，其中區塊暫存器用以存取第i層輸出特徵區塊資料及複數第i層重複利用特徵。運算處理單元電性連接於區塊暫存器，運算處理單元接收輸入影像並經配置以實施包含以下步驟之操作：參數設定步驟、分割步驟及區塊推論步驟。其中參數設定步驟係設定推論參數組，推論參數組包含卷積深度、區塊寬度、區塊高度及複數層卷積核大小。分割步驟係依據卷積深度、區塊寬度、區塊高度及此些層卷積核大小劃分輸入影像成複數輸入區塊資料，各輸入區塊資料具有輸入區塊大小。此外，區塊推論步驟係將各輸入區塊資料執行一多層卷積操作而產生輸出區塊資料，且此多層卷積操作包含第一方向資料選取步驟、第二方向資料選取步驟及卷積運算步驟。第一方向資料選取步驟係依據輸出區塊資料之位置沿掃描換行方向選擇複數第i層重新計算特徵，然後依據輸出區塊資料之位置及此些第i層重新計算特徵選取出第i層重新計算輸入特徵區塊資料，其中i為1至卷積深度之複數個正整數之其中一者。第二方向資料選取步驟係依據第i層重新計算輸入特徵區塊資料沿區塊掃描方向選取出此些第i層重複利用特徵，並將第i層重新計算輸入特徵區塊資料及此些第i層重複利用特徵組合而產生第i層重複利用輸入特徵區塊資料。卷積運算步驟係依據第i層卷積核大小從第i層重複利用輸入特徵區塊資料中選取出複數第i層子區塊輸入特徵群，然後對各第i層子區塊輸入特徵群及一卷積參數組執行卷積運算而產生第i層子區塊輸出特徵，並將對應此些第i層子區塊輸入特徵群之此些第i層子區塊輸出特徵組合而形成第i層輸出特徵區塊資料。According to one embodiment of the structural aspect of the present invention, a block-based inference system suitable for memory-optimized implementation of convolutional neural networks is provided for processing an input image, which is suitable for memory in convolutional neural networks. The block-based inference system implemented by the volume optimization includes a block register and an operation processing unit, wherein the block register is used to access the i-th layer output feature block data and a plurality of i-th layer reuse features. The arithmetic processing unit is electrically connected to the block register, and the arithmetic processing unit receives the input image and is configured to perform operations including the following steps: a parameter setting step, a segmentation step, and a block inference step. The parameter setting step is to set an inference parameter group, and the inference parameter group includes a convolution depth, a block width, a block height, and the size of the convolution kernel of multiple layers. The dividing step is to divide the input image into a plurality of input block data according to the convolution depth, block width, block height and the size of these layers of convolution kernels, and each input block data has the input block size. In addition, the block inference step is to perform a multi-layer convolution operation on each input block data to generate output block data, and the multi-layer convolution operation includes a first direction data selection step, a second direction data selection step and convolution Operation steps. The first direction data selection step is to select a plurality of i-th layer recalculation features along the scanning line feed direction according to the position of the output block data, and then select the i-th layer recalculation feature according to the position of the output block data and these i-th layer recalculation features. Calculate the input feature block data, where i is one of a plurality of positive integers from 1 to the convolution depth. The second direction data selection step is to select the i-th layer of reused features along the block scanning direction according to the i-th layer recalculated input feature block data, and recalculate the i-th layer of input feature block data and these The i-th layer reuses the feature combination to generate the i-th layer reuses the input feature block data. The convolution operation step is to select a plurality of i-th layer sub-block input feature groups from the i-th layer reused input feature block data according to the size of the i-th layer convolution kernel, and then input feature groups to each i-th layer sub-block. and a convolution parameter group to perform convolution operation to generate the output features of the i-th layer sub-blocks, and combine the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature groups to form the i-th layer sub-block output features The i-layer outputs feature block data.

藉此，本發明之適用於卷積神經網路之記憶體優化實現之區塊式推論系統藉由不同方向使用不同特徵之計算方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。Therefore, the block-based inference system of the present invention, which is suitable for memory optimization of convolutional neural networks, uses computing methods with different characteristics in different directions, so that block-based inference does not increase too much computation and block temporary Under the premise of memory, it can still greatly reduce the bandwidth requirement of external memory.

前述實施方式之其他實施例如下：當前述i等於1時，第i層重新計算輸入特徵區塊資料等於各輸入區塊資料。而當i等於卷積深度時，第i層輸出特徵區塊資料等於輸出區塊資料。Other examples of the aforementioned embodiments are as follows: when the aforementioned i is equal to 1, the i-th layer recalculates the input feature block data equal to each input block data. When i is equal to the convolution depth, the output feature block data of the i-th layer is equal to the output block data.

前述實施方式之其他實施例如下：前述第i層重新計算輸入特徵區塊資料具有第i層重新計算輸入特徵區塊大小與第i層重新計算輸入特徵區塊通道數，第i層輸出特徵區塊資料具有第i層輸出特徵區塊大小與第i層輸出特徵區塊通道數。第i層輸出特徵區塊大小大於第i層重新計算輸入特徵區塊大小，且第i層重新計算輸入特徵區塊通道數等於第i層輸出特徵區塊通道數。Other examples of the aforementioned embodiments are as follows: the i-th layer recalculates the input feature block data, the i-th layer re-calculates the input feature block size and the i-th layer re-calculates the input feature block channel number, and the i-th layer outputs the feature region The block data has the i-th layer output feature block size and the i-th layer output feature block channel number. The size of the output feature block of the ith layer is larger than the size of the recalculated input feature block of the ith layer, and the number of channels of the recalculated input feature block of the ith layer is equal to the number of channels of the output feature block of the ith layer.

前述實施方式之其他實施例如下：前述卷積深度、區塊寬度及區塊高度均為正整數，第i層卷積核大小為k_Wi ×k_Hi ，此些第i層重複利用特徵沿區塊掃描方向具有重複利用特徵數量，且重複利用特徵數量等於k_Hi ‒1。Other examples of the aforementioned embodiments are as follows: the aforementioned convolution depth, block width, and block height are all positive integers, the size of the convolution kernel of the i-th layer is k _Wi × k _Hi , and the i-th layer reuses the feature edge region The block scan direction has the number of reused features, and the number of reused features is equal to k _Hi ‒1.

前述實施方式之其他實施例如下：前述區塊寬度表示為B_W ，卷積深度表示為D，區塊高度表示為B_H 。輸入區塊大小等於B_W ×B_H 。輸出區塊資料具有輸出區塊大小，且輸出區塊大小等於(B_W ‒2D)×B_H 。第i層重新計算輸入特徵區塊資料具有第i層重新計算輸入特徵區塊大小，且第i層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H 。第i層重複利用輸入特徵區塊資料具有第i層重複利用輸入特徵區塊大小，且第i層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)。第i層輸出特徵區塊資料具有第i層輸出特徵區塊大小，且第i層輸出特徵區塊大小等於(B_W ‒2i)×B_H 。卷積深度小於區塊寬度之一半。Other examples of the aforementioned embodiments are as follows: the aforementioned block width is represented by B _W , the convolution depth is represented by D, and the block height is represented by B _H . The input block size is equal to B _W × B _H . The output block data has the output block size, and the output block size is equal to (B _W -2D)×B _H . The i-th layer recomputed input feature block data has the i-th layer re-computed input feature block size, and the i-th layer re-computed input feature block size is equal to (B _W -2i+2)×B _H . The i-th layer reused input feature block data has the i-th layer reused input feature block size, and the i-th layer reused input feature block size is equal to (B _W –2i+2)×(B _H +2). The i-th layer output feature block data has the i-th layer output feature block size, and the i-th layer output feature block size is equal to (B _W -2i)×B _H . The convolution depth is less than half of the block width.

前述實施方式之其他實施例如下：前述外區塊特徵係儲存於區塊暫存器，區塊暫存器具有暫存空間，暫存空間透過第i層重新計算輸入特徵區塊資料之寬度、卷積深度、層數、通道數及第i層卷積核大小運算求得。第i層重新計算輸入特徵區塊資料之寬度表示為B_Wi ，卷積深度表示為D，層數表示為i，通道數表示為C，第i層卷積核大小為k_Wi ×k_Hi ，暫存空間表示為LBS且符合下式：

。Other examples of the aforementioned embodiments are as follows: the aforementioned outer block features are stored in a block register, which has a temporary storage space, and the temporary storage space recalculates the width of the input feature block data through the i-th layer, The convolution depth, the number of layers, the number of channels and the size of the i-th layer convolution kernel are calculated. The width of the recalculated input feature block data of the i-th layer is denoted as B _Wi , the convolution depth is denoted as D, the number of layers is denoted as i, the number of channels is denoted as C, and the size of the convolution kernel of the i-th layer is k _Wi × k _Hi , The temporary storage space is expressed as LBS and conforms to the following formula:

.

以下將參照圖式說明本發明之複數個實施例。為明確說明起見，許多實務上的細節將在以下敘述中一併說明。然而，應瞭解到，這些實務上的細節不應用以限制本發明。也就是說，在本發明部分實施例中，這些實務上的細節是非必要的。此外，為簡化圖式起見，一些習知慣用的結構與元件在圖式中將以簡單示意的方式繪示之；並且重複之元件將可能使用相同的編號表示之。Several embodiments of the present invention will be described below with reference to the drawings. For the sake of clarity, many practical details are set forth in the following description. It should be understood, however, that these practical details should not be used to limit the invention. That is, in some embodiments of the present invention, these practical details are unnecessary. In addition, for the purpose of simplifying the drawings, some well-known and conventional structures and elements will be shown in a simplified and schematic manner in the drawings; and repeated elements may be denoted by the same reference numerals.

此外，本文中當某一元件(或單元或模組等)「連接」於另一元件，可指所述元件是直接連接於另一元件，亦可指某一元件是間接連接於另一元件，意即，有其他元件介於所述元件及另一元件之間。而當有明示某一元件是「直接連接」於另一元件時，才表示沒有其他元件介於所述元件及另一元件之間。而第一、第二、第三等用語只是用來描述不同元件，而對元件本身並無限制，因此，第一元件亦可改稱為第二元件。且本文中之元件/單元/電路之組合非此領域中之一般周知、常規或習知之組合，不能以元件/單元/電路本身是否為習知，來判定其組合關係是否容易被技術領域中之通常知識者輕易完成。In addition, when a certain element (or unit or module, etc.) is "connected" to another element herein, it may mean that the element is directly connected to another element, or it may also mean that a certain element is indirectly connected to another element , that is, there are other elements interposed between said element and another element. When it is expressly stated that an element is "directly connected" to another element, it means that no other element is interposed between the element and the other element. The terms first, second, third, etc. are only used to describe different elements, and do not limit the elements themselves. Therefore, the first element can also be renamed as the second element. And the combination of elements/units/circuits in this article is not a commonly known, conventional or well-known combination in this field, and it cannot be determined whether the combination relationship of the elements/units/circuits is well-known or not easily understood by those in the technical field. Usually the knowledgeable can do it easily.

請參閱第1圖，第1圖係繪示本發明第一實施例之適用於卷積神經網路之記憶體優化實現之區塊式推論方法100的流程示意圖。此適用於卷積神經網路之記憶體優化實現之區塊式推論方法100用以處理一輸入影像而產生一輸出影像，且包含一參數設定步驟S02、分割步驟S04、區塊推論步驟S06以及暫存步驟S08。Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a block-based inference method 100 suitable for implementing memory optimization of a convolutional neural network according to a first embodiment of the present invention. The block-based inference method 100 suitable for memory-optimized implementation of convolutional neural networks is used to process an input image to generate an output image, and includes a parameter setting step S02, a segmentation step S04, a block inference step S06 and Step S08 is temporarily stored.

參數設定步驟S02係設定一推論參數組，此推論參數組包含卷積深度(depth)、區塊寬度、區塊高度及複數層卷積核大小(kernel size)。此些層卷積核大小的層數等於卷積深度。The parameter setting step S02 is to set an inference parameter group. The inference parameter group includes a convolution depth (depth), a block width, a block height, and a complex-layer convolution kernel size (kernel size). The number of layers with the kernel size of these layers is equal to the convolution depth.

分割步驟S04係驅動運算處理單元依據卷積深度、區塊寬度、區塊高度及此些層卷積核大小劃分輸入影像成複數個輸入區塊資料，各輸入區塊資料具有一輸入區塊大小。The dividing step S04 is to drive the arithmetic processing unit to divide the input image into a plurality of input block data according to the convolution depth, block width, block height and the size of these layers of convolution kernels, and each input block data has an input block size .

區塊推論步驟S06係驅動運算處理單元將各輸入區塊資料執行一多層卷積操作而產生輸出區塊資料，且多層卷積操作包含第一方向資料選取步驟S062、第二方向資料選取步驟S064及卷積運算步驟S066。其中第一方向資料選取步驟S062係依據輸出區塊資料之位置沿掃描換行方向選擇複數個第i層重新計算特徵，然後依據輸出區塊資料之位置及此些第i層重新計算特徵選取出一第i層重新計算輸入特徵區塊資料，其中i為1至卷積深度之複數個正整數之其中一者。此外，第二方向資料選取步驟S064係依據第i層重新計算輸入特徵區塊資料沿區塊掃描方向選取出複數個第i層重複利用特徵，並將第i層重新計算輸入特徵區塊資料及此些第i層重複利用特徵組合而產生一第i層重複利用輸入特徵區塊資料。再者，卷積運算步驟S066係依據第i層卷積核大小從第i層重複利用輸入特徵區塊資料中選取出複數個第i層子區塊輸入特徵群，然後對各第i層子區塊輸入特徵群及卷積參數組執行卷積運算而產生第i層子區塊輸出特徵，並將對應此些第i層子區塊輸入特徵群之此些第i層子區塊輸出特徵組合而形成第i層輸出特徵區塊資料。卷積參數組包含權重參數(weight parameter)與偏差參數(bias parameter)。The block inference step S06 drives the arithmetic processing unit to perform a multi-layer convolution operation on each input block data to generate output block data, and the multi-layer convolution operation includes a first direction data selection step S062 and a second direction data selection step S062 S064 and convolution operation step S066. The first direction data selection step S062 is to select a plurality of i-th layer recalculation features along the scanning line feed direction according to the position of the output block data, and then select a The i-th layer recalculates the input feature block data, where i is one of a plurality of positive integers from 1 to the convolution depth. In addition, the second direction data selection step S064 is to select a plurality of i-th layer reuse features along the block scanning direction according to the i-th layer recalculated input feature block data, and recalculate the i-th layer of input feature block data and The i-th layer reused features are combined to generate an i-th layer of reused input feature block data. Furthermore, the convolution operation step S066 is to select a plurality of i-th layer sub-block input feature groups from the i-th layer reused input feature block data according to the size of the i-th layer convolution kernel, and then perform a The block input feature group and the convolution parameter group perform a convolution operation to generate the i-th layer sub-block output features, and the i-th layer sub-block output features corresponding to the i-th layer sub-block input feature group combined to form the i-th layer output feature block data. The convolution parameter group includes weight parameters and bias parameters.

暫存步驟S08係驅動區塊暫存器(Block buffer bank)暫存第i層輸出特徵區塊資料及此些第i層重複利用特徵。The temporary storage step S08 is to drive a block buffer bank to temporarily store the i-th layer output feature block data and the i-th layer reuse features.

藉此，本發明的適用於卷積神經網路之記憶體優化實現之區塊式推論方法100透過不同方向使用不同特徵之計算方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。以下將透過較詳細的實施例來說明上述各步驟之細節。Therefore, the block-based inference method 100 of the present invention, which is suitable for memory optimization of convolutional neural networks, uses computing methods with different characteristics in different directions, so that block-based inference does not increase excessive computation and block temporary Under the premise of memory, it can still greatly reduce the bandwidth requirement of external memory. The details of the above steps will be described below through more detailed embodiments.

請一併參閱第1圖至第6圖，其中第2圖係繪示第1圖之分割步驟S04的示意圖；第3圖係繪示第1圖之區塊推論步驟S06的多層卷積操作之輸入區塊資料IB與輸出區塊資料OB的立體示意圖；第4圖係繪示第1圖之第一方向資料選取步驟S062的示意圖；第5圖係繪示第1圖之第二方向資料選取步驟S064的示意圖；以及第6圖係繪示第3圖之第1層重複利用輸入特徵區塊資料L1FU_I的示意圖。如圖所示，此實施例是於每一層(即第i層之i=1~D)均執行第一方向資料選取步驟S062、第二方向資料選取步驟S064及卷積運算步驟S066。卷積深度D、區塊寬度B_W 及區塊高度B_H 均為正整數。第i層卷積核大小為k_Wi ×k_Hi ，其中k_Wi 、k_Hi 均為正整數。掃描換行方向D1為水平方向，區塊掃描方向D2為垂直方向；換言之，區塊掃描方向D2垂直於掃描換行方向D1。區塊寬度B_W 大於區塊高度B_H ，且區塊高度B_H 的延伸方向平行於區塊掃描方向D2。輸入區塊大小等於B_W ×B_H 。輸出區塊資料OB具有一輸出區塊大小，且輸出區塊大小等於(B_W ‒2D)×B_H 。第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小，且第i層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H 。第i層重複利用輸入特徵區塊資料具有一第i層重複利用輸入特徵區塊大小，且第i層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)。第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小，且第i層輸出特徵區塊大小等於(B_W ‒2i)×B_H 。第i層輸出特徵區塊資料代表第i層執行完卷積運算之輸出特徵，其用於同一區塊之下一層(第i+1層)之重新計算。卷積深度D小於區塊寬度B_W 之一半。再者，第i層重複利用特徵沿區塊掃描方向D2具有一重複利用特徵數量，且重複利用特徵數量等於k_Hi ‒1(即k‒1)。第i層重複利用特徵是用於下一區塊之同一層(第i層)之重複利用。當i等於1時，第i層重新計算輸入特徵區塊資料等於各輸入區塊資料IB；當i等於卷積深度D時，第i層輸出特徵區塊資料等於輸出區塊資料OB。Please refer to Fig. 1 to Fig. 6 together, wherein Fig. 2 is a schematic diagram of the segmentation step S04 of Fig. 1; Fig. 3 is a schematic diagram of the multi-layer convolution operation of the block inference step S06 of Fig. 1 A three-dimensional schematic diagram of input block data IB and output block data OB; FIG. 4 is a schematic diagram of the first direction data selection step S062 in FIG. 1; FIG. 5 is a second direction data selection in FIG. 1 A schematic diagram of step S064; and FIG. 6 is a schematic diagram of the first layer of FIG. 3 reusing the input feature block data L1FU_I. As shown in the figure, in this embodiment, the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066 are performed in each layer (ie, i=1~D of the i-th layer). The convolution depth D, the block width B _W , and the block height B _H are all positive integers. The size of the convolution kernel of the i-th layer is k _Wi ×k _Hi , where k _Wi and k _Hi are both positive integers. The scanning line feed direction D1 is a horizontal direction, and the block scanning direction D2 is a vertical direction; in other words, the block scanning direction D2 is perpendicular to the scanning line feed direction D1. The block width B _W is greater than the block height B _H , and the extending direction of the block height B _H is parallel to the block scanning direction D2 . The input block size is equal to B _W × B _H . The output block data OB has an output block size, and the output block size is equal to (B _W −2D)×B _H . The i-th layer recomputed input feature block data has an i-th layer re-computed input feature block size, and the i-th layer re-computed input feature block size is equal to (B _W -2i+2)×B _H . The i-th layer reused input feature block data has an i-th layer reused input feature block size, and the i-th layer reused input feature block size is equal to (B _W –2i+2)×(B _H +2) . The i-th layer output feature block data has an i-th layer output feature block size, and the i-th layer output feature block size is equal to (B _W -2i)×B _H . The i-th layer output feature block data represents the output feature of the i-th layer after performing the convolution operation, which is used for the recalculation of the next layer (i+1-th layer) of the same block. The convolution depth D is less than half of the block width _BW . Furthermore, the i-th layer of reused features has a number of reused features along the block scanning direction D2, and the number of reused features is equal to k _Hi −1 (ie, k−1). The i-th layer reuse feature is for reuse of the same layer (i-th layer) of the next block. When i is equal to 1, the recalculated input feature block data of the i-th layer is equal to the input block data IB; when i is equal to the convolution depth D, the output feature block data of the i-th layer is equal to the output block data OB.

在第3圖至第6圖中，卷積深度D為3，區塊寬度B_W 為10、區塊高度B_H 為4，第i層卷積核大小為3×3，即k_Wi =k_Hi =k且均為3。卷積深度D為3代表有3層卷積操作，故多層卷積操作包含第1層卷積操作、第2層卷積操作及第3層卷積操作(即i=1、2及3)。In Figures 3 to 6, the convolution depth D is 3, the block width B _W is 10, the block height B _H is 4, and the i-th layer convolution kernel size is 3 × 3, that is, k _Wi = k _Hi = k and both are 3. The convolution depth D is 3, which means that there are 3 layers of convolution operations, so the multi-layer convolution operation includes the first layer convolution operation, the second layer convolution operation and the third layer convolution operation (ie i=1, 2 and 3) .

第1層卷積操作(i=1)包含第一方向資料選取步驟S062、第二方向資料選取步驟S064及卷積運算步驟S066。其中第一方向資料選取步驟S062係依據輸出區塊資料OB(即第3層輸出特徵區塊資料L3_O)之位置沿掃描換行方向D1選擇6個第1層重新計算特徵L1FC(即(D‒i+1)×(k‒1)個)，然後依據輸出區塊資料OB之位置及此些第1層重新計算特徵L1FC選取出一第1層重新計算輸入特徵區塊資料L1FC_I。此第1層重新計算輸入特徵區塊資料L1FC_I等於輸入區塊資料IB，輸入區塊資料IB之輸入區塊大小等於第1層重新計算輸入特徵區塊資料L1FC_I之第1層重新計算輸入特徵區塊大小，且均等於(B_W ‒2i+2)×B_H =(10‒2+2)×4=10×4，如第3圖之第1層L1、第4圖之第1層L1及第6圖所示。再者，第二方向資料選取步驟S064依據第1層重新計算輸入特徵區塊資料L1FC_I沿區塊掃描方向D2選取出2個第1層重複利用特徵L1FU，並將第1層重新計算輸入特徵區塊資料L1FC_I及此些第1層重複利用特徵L1FU組合而產生一第1層重複利用輸入特徵區塊資料L1FU_I。第1層重複利用輸入特徵區塊資料L1FU_I之第1層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)=(10‒2+2)×(4+2)=10×6，如第3圖之第1層L1、第5圖之第1層L1及第6圖所示。此外，卷積運算步驟S066係依據第i層卷積核大小(即3×3)從第1層重複利用輸入特徵區塊資料L1FU_I中選取出複數個第1層子區塊輸入特徵群SBG1(即3×3特徵)，然後對各第1層子區塊輸入特徵群SBG1及卷積參數組執行卷積運算而產生第1層子區塊輸出特徵，並將對應此些第1層子區塊輸入特徵群SBG1之此些第1層子區塊輸出特徵組合而形成第1層輸出特徵區塊資料L1_O。第1層輸出特徵區塊資料L1_O之第1層輸出特徵區塊大小等於(B_W ‒2i)×B_H =(10‒2)×4=8×4，如第3圖與第5圖之第1層L1所示。The first layer convolution operation (i=1) includes a first direction data selection step S062, a second direction data selection step S064 and a convolution operation step S066. The first direction data selection step S062 is to select 6 first layer recalculation features L1FC (ie (D-i +1)×(k−1)), and then according to the position of the output block data OB and these first layer recalculation features L1FC, a first layer recalculation input feature block data L1FC_I is selected. This layer 1 recalculated input feature block data L1FC_I is equal to the input block data IB, and the input block size of the input block data IB is equal to the layer 1 recalculated input feature block data L1FC_I of the first layer recalculated input feature area The block size is equal to (B _W ‒2i+2)×B _H =(10‒2+2)×4=10×4, such as the first layer L1 in Figure 3 and the first layer L1 in Figure 4 and shown in Figure 6. Furthermore, the second direction data selection step S064 selects two first layer reuse features L1FU according to the first layer recalculation input feature block data L1FC_I along the block scanning direction D2, and recalculates the first layer input feature area The block data L1FC_I and the layer 1 reuse features L1FU are combined to generate a layer 1 reuse input feature block data L1FU_I. The first layer reuses the input feature block data L1FU_I The first layer reuses the input feature block size is equal to (B _W –2i+2)×(B _H +2)=(10–2+2)×(4+ 2)=10×6, as shown in the first layer L1 in Figure 3, the first layer L1 in Figure 5 and Figure 6. In addition, the convolution operation step S066 selects a plurality of first-layer sub-block input feature groups SBG1( That is, 3×3 features), and then perform convolution operation on the input feature group SBG1 and the convolution parameter group of each first-layer sub-block to generate the first-layer sub-block output features, which will correspond to these first-layer sub-regions The first-level sub-block output features of the block input feature group SBG1 are combined to form the first-level output feature block data L1_O. The size of the first layer output feature block of the first layer output feature block data L1_O is equal to (B _W –2i)×B _H =(10–2)×4=8×4, as shown in Figures 3 and 5 Layer 1 is shown as L1.

第2層卷積操作(i=2)包含第一方向資料選取步驟S062、第二方向資料選取步驟S064及卷積運算步驟S066。其中第一方向資料選取步驟S062係依據輸出區塊資料OB(即第3層輸出特徵區塊資料L3_O)之位置沿掃描換行方向D1選擇4個第2層重新計算特徵L2FC(即(D‒i+1)×(k‒1)個)，然後依據輸出區塊資料OB之位置及此些第2層重新計算特徵L2FC選取出一第2層重新計算輸入特徵區塊資料L2FC_I。第2層重新計算輸入特徵區塊資料L2FC_I等於第1層輸出特徵區塊資料L1_O。第2層重新計算輸入特徵區塊資料L2FC_I之第2層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H =(10‒4+2)×4=8×4，如第3圖與第4圖之第2層L2所示。再者，第二方向資料選取步驟S064依據第2層重新計算輸入特徵區塊資料L2FC_I沿區塊掃描方向D2選取出2個第2層重複利用特徵L2FU，並將第2層重新計算輸入特徵區塊資料L2FC_I及此些第2層重複利用特徵L2FU組合而產生一第2層重複利用輸入特徵區塊資料L2FU_I。第2層重複利用輸入特徵區塊資料L2FU_I之第2層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)=(10‒4+2)×(4+2)=8×6，如第3圖與第5圖之第2層L2所示。此外，卷積運算步驟S066係依據第i層卷積核大小(即3×3)從第2層重複利用輸入特徵區塊資料L2FU_I中選取出複數個第2層子區塊輸入特徵群SBG2(即3×3特徵)，然後對各第2層子區塊輸入特徵群SBG2及卷積參數組執行卷積運算而產生第2層子區塊輸出特徵，並將對應此些第2層子區塊輸入特徵群SBG2之此些第2層子區塊輸出特徵組合而形成第2層輸出特徵區塊資料L2_O。第2層輸出特徵區塊資料L2_O之第2層輸出特徵區塊大小等於(B_W ‒2i)×B_H =(10‒4)×4=6×4，如第3圖與第5圖之第2層L2所示。The second layer convolution operation (i=2) includes a first direction data selection step S062, a second direction data selection step S064 and a convolution operation step S066. The first direction data selection step S062 is to select four layer 2 recalculation features L2FC (ie (D-i +1)×(k−1)), and then according to the position of the output block data OB and these layer 2 recalculation features L2FC, a second layer recalculation input feature block data L2FC_I is selected. The second layer recalculates the input feature block data L2FC_I equal to the first layer output feature block data L1_O. The second layer recalculates the input feature block data L2FC_I The second layer recalculates the input feature block size is equal to (B _W – 2i+2) × B _H = (10 – 4+2) × 4 = 8 × 4, such as The second layer L2 is shown in FIG. 3 and FIG. 4 . Furthermore, the second direction data selection step S064 selects two second layer reuse features L2FU according to the second layer recalculation input feature block data L2FC_I along the block scanning direction D2, and recalculates the second layer input feature area The block data L2FC_I and these layer 2 reuse features L2FU are combined to generate a layer 2 reuse input feature block data L2FU_I. The second layer reuses the input feature block data L2FU_I The second layer reuses the input feature block size is equal to (B _W –2i+2)×(B _H +2)=(10–4+2)×(4+ 2)=8×6, as shown in the second layer L2 in Figures 3 and 5. In addition, the convolution operation step S066 selects a plurality of second-layer sub-block input feature groups SBG2( That is, 3×3 features), and then perform convolution operation on the input feature group SBG2 and the convolution parameter group of each second-layer sub-block to generate the second-layer sub-block output features, which will correspond to these second-layer sub-regions The second-level sub-block output features of the block input feature group SBG2 are combined to form the second-level output feature block data L2_O. The size of the second layer output feature block of the second layer output feature block data L2_O is equal to (B _W –2i)×B _H =(10–4)×4=6×4, as shown in Figures 3 and 5 Layer 2 L2 is shown.

第3層卷積操作(i=3)包含第一方向資料選取步驟S062、第二方向資料選取步驟S064及卷積運算步驟S066。其中第一方向資料選取步驟S062係依據輸出區塊資料OB(即第3層輸出特徵區塊資料L3_O)之位置沿掃描換行方向D1選擇2個第3層重新計算特徵L3FC(即(D‒i+1)×(k‒1)個)，然後依據輸出區塊資料OB之位置及此些第3層重新計算特徵L3FC選取出一第3層重新計算輸入特徵區塊資料L3FC_I。第3層重新計算輸入特徵區塊資料L3FC_I等於第2層輸出特徵區塊資料L2_O。第3層重新計算輸入特徵區塊資料L3FC_I之第3層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H =(10‒6+2)×4=6×4，如第3圖與第4圖之第3層L3所示。再者，第二方向資料選取步驟S064依據第3層重新計算輸入特徵區塊資料L3FC_I沿區塊掃描方向D2選取出2個第3層重複利用特徵L3FU，並將第3層重新計算輸入特徵區塊資料L3FC_I及此些第3層重複利用特徵L3FU組合而產生一第3層重複利用輸入特徵區塊資料L3FU_I。第3層重複利用輸入特徵區塊資料L3FU_I之第3層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)=(10‒6+2)×(4+2)=6×6，如第3圖與第5圖之第3層L3所示。此外，卷積運算步驟S066係依據第i層卷積核大小(即3×3)從第3層重複利用輸入特徵區塊資料L3FU_I中選取出複數個第3層子區塊輸入特徵群SBG3(即3×3特徵)，然後對各第3層子區塊輸入特徵群SBG3及卷積參數組執行卷積運算而產生第3層子區塊輸出特徵，並將對應此些第3層子區塊輸入特徵群SBG3之此些第3層子區塊輸出特徵組合而形成第3層輸出特徵區塊資料L3_O。第3層輸出特徵區塊資料L3_O等於輸出區塊資料OB。第3層輸出特徵區塊資料L3_O之第3層輸出特徵區塊大小等於(B_W ‒2i)×B_H =(10‒6)×4=4×4，而輸出區塊資料OB之輸出區塊大小等於(B_W ‒2D)×B_H =(10‒6)×4=4×4，如第3圖與第5圖之第3層L3所示。The third layer convolution operation (i=3) includes a first direction data selection step S062, a second direction data selection step S064 and a convolution operation step S066. The first direction data selection step S062 is to select two third layer recalculation features L3FC (ie (D−i) along the scanning line feed direction D1 according to the position of the output block data OB (ie, the third layer output feature block data L3_O) +1)×(k−1)), and then according to the position of the output block data OB and these layer 3 recalculation features L3FC, a layer 3 recalculation input feature block data L3FC_I is selected. Layer 3 recalculates input feature block data L3FC_I equal to layer 2 output feature block data L2_O. The third layer recalculates the input feature block data L3FC_I The third layer recalculates the input feature block size is equal to (B _W – 2i+2) × B _H = (10 – 6 + 2) × 4 = 6 × 4, such as The third layer L3 is shown in FIG. 3 and FIG. 4 . Furthermore, the second direction data selection step S064 selects two third layer reuse features L3FU according to the third layer recalculation input feature block data L3FC_1 along the block scanning direction D2, and recalculates the third layer input feature area The block data L3FC_I and these layer 3 reuse features L3FU are combined to generate a layer 3 reuse input feature block data L3FU_I. The third layer reuses the input feature block data L3FU_I The third layer reuses the input feature block size is equal to (B _W –2i+2)×(B _H +2)=(10–6+2)×(4+ 2)=6×6, as shown in the third layer L3 in Figure 3 and Figure 5. In addition, the convolution operation step S066 selects a plurality of third-layer sub-block input feature groups SBG3( That is, 3×3 features), and then perform convolution operation on the input feature group SBG3 and the convolution parameter group of each third-layer sub-block to generate the third-layer sub-block output features, which will correspond to these third-layer sub-regions The third-level sub-block output features of the block input feature group SBG3 are combined to form the third-level output feature block data L3_O. The layer 3 output feature block data L3_O is equal to the output block data OB. The layer 3 output feature block size of the third layer output feature block data L3_O is equal to (B _W – 2i) × B _H = (10 – 6) × 4 = 4 × 4, and the output area of the output block data OB The block size is equal to (B _W – 2D) × B _H = (10 – 6) × 4 = 4 × 4, as shown in Layer 3 L3 in Figures 3 and 5.

在本發明的適用於卷積神經網路之記憶體優化實現之區塊式推論方法100中，當其中一第i層子區塊輸入特徵群之複數個輸入特徵之至少一者位於第i層重複利用輸入特徵區塊資料之外區域時，此其中一第i層子區塊輸入特徵群之輸入特徵包含複數個外區塊特徵及複數個第一內區塊特徵。外區塊特徵代表前一區塊已運算之特徵，而第一內區塊特徵代表目前區塊未運算之特徵。另外，當其中一第i層子區塊輸入特徵群之輸入特徵均位於第i層重複利用輸入特徵區塊資料之內區域時，此其中一第i層子區塊輸入特徵群之輸入特徵僅包含複數第二內區塊特徵，第二內區塊特徵代表目前區塊未運算之特徵。第i層重複利用輸入特徵區塊資料沿區塊掃描方向D2之排列順序為外區域與內區域。舉第6圖為例，當第1層子區塊輸入特徵群SBG11之9個輸入特徵之6個位於第1層重複利用輸入特徵區塊資料L1FU_I之外區域OR時，此第1層子區塊輸入特徵群SBG11之9個輸入特徵包含6個外區塊特徵及3個內區塊特徵。外區塊特徵代表已運算之特徵且位於外區域OR，而內區塊特徵代表未運算之特徵且位於內區域IR。另外，當第1層子區塊輸入特徵群SBG12之9個輸入特徵均位於第1層重複利用輸入特徵區塊資料L1FU_I之內區域IR時，此第1層子區塊輸入特徵群SBG12之9個輸入特徵僅包含9個內區塊特徵，亦即9個輸入特徵均為內區塊特徵。此外，第1層重複利用輸入特徵區塊資料L1FU_I沿區塊掃描方向D2之排列順序為外區域OR與內區域IR。In the block-based inference method 100 of the present invention suitable for memory optimization of convolutional neural networks, when at least one of the plurality of input features of the sub-block input feature group of the i-th layer is located in the i-th layer When the area outside the input feature block data is reused, the input features of the input feature group of an i-th layer sub-block include a plurality of outer block features and a plurality of first inner block features. The features of the outer block represent the features that have been computed in the previous block, and the features of the first inner block represent the features that have not been computed in the current block. In addition, when the input features of the input feature group of a sub-block of the i-th layer are all located in the area within the data of the re-used input feature block of the i-th layer, the input features of the input feature group of the sub-block of the i-th layer are only It includes a plurality of second inner block features, and the second inner block features represent the features that are not calculated in the current block. The i-th layer reuses the input feature block data along the block scanning direction D2 in the order of an outer area and an inner area. Taking Figure 6 as an example, when 6 of the 9 input features of the input feature group SBG11 of the first-layer sub-block are located in the area OR outside the first-layer reused input feature block data L1FU_I, the first-layer sub-region The 9 input features of the block input feature group SBG11 include 6 outer block features and 3 inner block features. Outer block features represent computed features and are located in outer region OR, while inner block features represent uncomputed features and are located in inner region IR. In addition, when the 9 input features of the first-layer sub-block input feature group SBG12 are all located in the region IR within the first-layer reused input feature block data L1FU_I, 9 of the first-layer sub-block input feature group SBG12 Each input feature contains only 9 inner-block features, that is, the 9 input features are all inner-block features. In addition, the arrangement sequence of the first layer reused input feature block data L1FU_I along the block scanning direction D2 is the outer area OR and the inner area IR.

另外值得一提的是，在暫存步驟S08中，第i層的LiFC_I的最下面k_Hi ‒1行存到區塊暫存器內供下一區塊使用，變成下一區塊的LiFU。舉例來說，當區塊推論步驟S06之第1層卷積操作執行後，暫存步驟S08被執行，其為第1層重新計算輸入特徵區塊資料L1FC_I的最下面k_Hi ‒1行存到區塊暫存器內供下一區塊使用，亦即變成下一區塊的第1層重複利用特徵L1FU。當區塊推論步驟S06之第2層卷積操作執行後，暫存步驟S08被執行，其為第2層重新計算輸入特徵區塊資料L2FC_I的最下面k_Hi ‒1行存到區塊暫存器內供下一區塊使用，亦即變成下一區塊的第2層重複利用特徵L2FU。當區塊推論步驟S06之第3層卷積操作執行後，暫存步驟S08被執行，其為第3層重新計算輸入特徵區塊資料L3FC_I的最下面k_Hi ‒1行存到區塊暫存器內供下一區塊使用，亦即變成下一區塊的第3層重複利用特徵L3FU。藉此，可大幅降低計算量。It is also worth mentioning that in the temporary storage step S08, the bottom k _Hi -1 row of the LiFC_I of the i-th layer is stored in the block temporary register for use in the next block, and becomes the LiFU of the next block. For example, after the first-layer convolution operation in the block inference step S06 is performed, the temporary storage step S08 is performed, which is to store the bottom k _Hi -1 row of the first-layer recalculated input feature block data L1FC_I into The block register is used for the next block, that is, the layer 1 reuse feature L1FU of the next block. After the second layer convolution operation of the block inference step S06 is performed, the temporary storage step S08 is performed, which is to recalculate the bottom k _Hi -1 row of the input feature block data L2FC_I for the second layer and store it in the block temporary storage The storage is used for the next block, that is, it becomes the layer 2 reuse feature L2FU of the next block. After the layer 3 convolution operation of the block inference step S06 is performed, the temporary storage step S08 is performed, which is to recalculate the bottom k _Hi -1 row of the input feature block data L3FC_I for the third layer and store it in the block temporary storage The storage is used for the next block, that is, it becomes the layer 3 reuse feature L3FU of the next block. Thereby, the amount of calculation can be greatly reduced.

請一併參閱第1圖至第7圖，其中第7圖係繪示本發明第二實施例之通道混洗(shuffle)之示意圖。本發明之推論流程可應用於通道混洗之運算，第i層重複利用輸入特徵區塊資料LiFU_I具有一第i層重複利用輸入特徵區塊大小W1×H1與一第i層重複利用輸入特徵區塊通道數C1。第i層中間區塊資料Li_M具有一第i層中間特徵區塊大小W2×H2與一第i層中間特徵區塊通道數C2。第i層輸出特徵區塊資料Li_O具有一第i層輸出特徵區塊大小W3×H3與一第i層輸出特徵區塊通道數C3。第i層輸出特徵區塊大小W3×H3大於第i層重複利用輸入特徵區塊大小W1×H1，且第i層重複利用輸入特徵區塊大小W1×H1大於第i層中間特徵區塊大小W2×H2。其中W1、W2及W3為區塊寬度，H1、H2及H3為區塊高度。此外，第i層重複利用輸入特徵區塊通道數C1等於第i層輸出特徵區塊通道數C3，且第i層中間特徵區塊通道數C2大於第i層重複利用輸入特徵區塊通道數C1。舉例來說，第i層重複利用輸入特徵區塊大小W1×H1、第i層中間特徵區塊大小W2×H2及第i層輸出特徵區塊大小W3×H3可分別為10×10、8×8及16×16，而第i層重複利用輸入特徵區塊通道數C1、第i層中間特徵區塊通道數C2及第i層輸出特徵區塊通道數C3可分別為32、128及32，但本發明不以此為限。Please refer to FIG. 1 to FIG. 7 together, wherein FIG. 7 is a schematic diagram of channel shuffling according to the second embodiment of the present invention. The inference process of the present invention can be applied to the operation of channel shuffling. The i-th layer reused input feature block data LiFU_I has an i-th layer reused input feature block size W1×H1 and an i-th layer reused input feature region Block channel number C1. The i-th layer intermediate block data Li_M has a i-th layer intermediate characteristic block size W2×H2 and a i-th layer intermediate characteristic block channel number C2. The i-th layer output feature block data Li_O has a i-th layer output feature block size W3×H3 and a i-th layer output feature block channel number C3. The i-th layer output feature block size W3×H3 is larger than the i-th layer reused input feature block size W1×H1, and the i-th layer reused input feature block size W1×H1 is larger than the i-th layer intermediate feature block size W2 ×H2. Wherein W1, W2 and W3 are block widths, and H1, H2 and H3 are block heights. In addition, the number of channels C1 of input feature blocks reused in the ith layer is equal to the number of channels C3 of output feature blocks in the ith layer, and the number of channels C2 of intermediate feature blocks in the ith layer is greater than the number of channels C1 of reused input feature blocks in the ith layer. . For example, the i-th layer reuses the input feature block size W1×H1, the i-th layer intermediate feature block size W2×H2, and the i-th layer output feature block size W3×H3 can be respectively 10×10 and 8× 8 and 16×16, while the number of channels C1 of the input feature block in the i-th layer, the number of channels of the i-th layer of intermediate feature blocks C2 and the number of channels of the i-th layer of output feature blocks can be 32, 128 and 32, respectively. However, the present invention is not limited to this.

藉此，本發明可實現特定之多層卷積操作，當進行區塊式推論時，於區塊前行的方向(即區塊掃描方向D2)上重複利用已計算過的特徵，而於另一個方向(即掃描換行方向D1)上採用重新計算的方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。Thereby, the present invention can realize a specific multi-layer convolution operation. When performing block-based inference, the calculated features are reused in the forward direction of the block (ie, the block scanning direction D2), and the other The recalculation method is adopted in the direction (ie, the scan line feed direction D1), so that the block-based inference can still greatly reduce the bandwidth requirement of the external memory without increasing the excessive calculation amount and the block register.

請一併參閱第1圖、第2圖、第8圖及第9圖，其中第8圖係繪示本發明第三實施例之適用於卷積神經網路之記憶體優化實現之區塊式推論系統200的方塊示意圖；以及第9圖係繪示本發明第三實施例之具有3×3濾波器之多層卷積操作的流程示意圖。如圖所示，適用於卷積神經網路之記憶體優化實現之區塊式推論系統200用以處理輸入影像而產生輸出影像110，並包含區塊暫存器220以及運算處理單元230。輸入區塊資料IB、推論參數組212及卷積參數組214輸入至運算處理單元230，輸出區塊資料OB輸出會組成輸出影像110。區塊暫存器220用以存取第i層輸出特徵區塊資料及複數個第i層重複利用特徵，且此兩種暫存是使用區塊暫存器220中不同位置的區域暫存。此外，運算處理單元230電性連接於區塊暫存器220，運算處理單元230接收輸入影像並經配置以實施第1圖之適用於卷積神經網路之記憶體優化實現之區塊式推論方法100。運算處理單元230包含卷積引擎232(Convolution Engine)，卷積引擎232用以執行卷積運算。運算處理單元230可為微處理器、中央處理器或影像處理器，但本發明不以此為限。L1、L2及LD分別代表第1層、第2層及第D層，第1層L1至第D層LD均透過運算處理單元230之卷積引擎232進行運算。此外，區塊暫存器220可儲存外區塊特徵，區塊暫存器220具有一暫存空間，此暫存空間可透過第i層重新計算輸入特徵區塊資料之寬度B_Wi 、卷積深度D、層數i、通道數C及第i層卷積核大小k_Wi ×k_Hi 運算求得。暫存空間表示為LBS(Line Buffer Size)且符合下列式子(1)：

(1)。Please refer to Fig. 1, Fig. 2, Fig. 8 and Fig. 9 together, wherein Fig. 8 shows a block-based implementation of memory optimization suitable for convolutional neural networks according to the third embodiment of the present invention A block diagram of the inference system 200 ; and FIG. 9 is a schematic flowchart of a multi-layer convolution operation with 3×3 filters according to the third embodiment of the present invention. As shown in the figure, a block-based inference system 200 suitable for memory-optimized implementation of a convolutional neural network is used to process an input image to generate an output image 110 , and includes a block register 220 and an arithmetic processing unit 230 . The input block data IB, the inference parameter set 212 and the convolution parameter set 214 are input to the operation processing unit 230 , and the output block data OB is output to form the output image 110 . The block register 220 is used for accessing the i-th layer output feature block data and a plurality of i-th layer reuse features, and these two kinds of temporary storage are temporary storage in different positions in the block register 220 . In addition, the arithmetic processing unit 230 is electrically connected to the block register 220, and the arithmetic processing unit 230 receives the input image and is configured to implement the block-based inference of FIG. 1 suitable for the memory-optimized implementation of the convolutional neural network Method 100. The operation processing unit 230 includes a convolution engine 232 (Convolution Engine), and the convolution engine 232 is used for performing convolution operations. The arithmetic processing unit 230 may be a microprocessor, a central processing unit or an image processor, but the invention is not limited thereto. L1 , L2 and LD represent the first layer, the second layer and the Dth layer, respectively. The first layer L1 to the D layer LD are all operated by the convolution engine 232 of the operation processing unit 230 . In addition, the block register 220 can store the features of external blocks, and the block register 220 has a temporary storage space, which can recalculate the width B _Wi of the input feature block data, the convolution through the i-th layer The depth D, the number of layers i, the number of channels C, and the size of the convolution kernel of the i-th layer k _Wi ×k _Hi are calculated. The temporary storage space is expressed as LBS (Line Buffer Size) and conforms to the following formula (1):

(1).

舉例來說，若每一層(即第i層之i=1~D)均執行第一方向資料選取步驟S062、第二方向資料選取步驟S064及卷積運算步驟S066，且k_Wi =k_Hi =k且均等於3，則暫存空間符合下列式子(2)：

(2)。For example, if each layer (i.e. i=1~D of the i-th layer) performs the first direction data selection step S062, the second direction data selection step S064 and the convolution operation step S066, and k _Wi =k _Hi = If k is equal to 3, then the temporary storage space conforms to the following formula (2):

(2).

藉此，本發明之適用於卷積神經網路之記憶體優化實現之區塊式推論系統200藉由不同方向使用不同特徵之計算方式，使區塊式推論在不增加過多計算量以及區塊暫存器220之前提下，依然能大幅降低外部記憶體對輸入區塊資料IB和輸出區塊資料OB之頻寬需求。Therefore, the block-based inference system 200 of the present invention suitable for memory optimization of convolutional neural networks uses different computing methods with different characteristics in different directions, so that block-based inference does not increase too much calculation amount and block size. On the premise of the register 220, the bandwidth requirements of the external memory for the input block data IB and the output block data OB can still be greatly reduced.

請一併參閱第1圖與第10圖，其中第10圖係繪示重新計算(Feature-reComputing；FC)、重複利用(Feature-reUsing；FU)及本發明之重新計算併重複利用(FCFU)的比較結果示意圖。其參數設定條件為乘積值A設為64² ，輸出影像110的大小為960×540，k_Wi =k_Hi =k。乘積值A為區塊寬度B_W 與區塊高度B_H 相乘的數值之最小值。本發明之多層卷積操作具有一標準化吞吐率(Normalized Throughput Ratio；NTR)，標準化吞吐率NTR透過卷積深度D與標準化運算率(Normalized Computing Ratio；NCR)運算求得，而標準化運算率透過區塊寬度B_W 、區塊高度B_H 、卷積深度D及變數h運算求得。對於本發明之標準化吞吐率NTR與標準化運算率NCR分別符合下列式子(3)與(4)：

(3)；

(4)。Please refer to Fig. 1 and Fig. 10 together, wherein Fig. 10 shows recalculation (Feature-reComputing; FC), reuse (Feature-reUsing; FU), and recalculation and reuse (FCFU) of the present invention Schematic diagram of the comparison results. The parameter setting conditions are that the product value A is set to 64 ² , the size of the output image 110 is 960×540, and k _Wi =k _Hi =k. The product value A is the minimum value of the values obtained by multiplying the block width _BW and the block height _BH . The multi-layer convolution operation of the present invention has a normalized throughput ratio (NTR), the normalized throughput rate NTR is obtained through the operation of the convolution depth D and the normalized computing ratio (NCR), and the normalized computing ratio passes through the area The block width B _W , the block height B _H , the convolution depth D and the variable h are calculated and obtained. For the normalized throughput rate NTR and normalized operation rate NCR of the present invention, the following equations (3) and (4) are respectively satisfied:

(3);

(4).

由第10圖可知，若對於區塊暫存器220有區塊暫存器大小限制S，則重複利用FU所能支持的最大支援卷積深度D_max 在三者中為最淺；相反地，重新計算FC雖能支持寬廣之模型卷積深度範圍，但因其需較高之計算複雜度而導致標準化吞吐率NTR大幅降低。而本發明之重新計算併重複利用FCFU不僅較重複利用FU能支持較寬的模型卷積深度範圍，而且還能提供較重新計算FC更好的標準化吞吐率NTR。As can be seen from FIG. 10, if there is a block register size limit S for the block register 220, the maximum supported convolution depth _Dmax that can be supported by the reuse FU is the shallowest among the three; on the contrary, Although recomputing FC can support a wide range of model convolution depths, it requires a high computational complexity, resulting in a significant reduction in the normalized throughput rate NTR. However, the recalculated and reused FCFU of the present invention not only supports a wider model convolution depth range than the reused FU, but also provides a better normalized throughput rate NTR than the recalculated FC.

由上述實施方式可知，本發明具有下列優點：其一，本發明的適用於卷積神經網路之記憶體優化實現之區塊式推論方法透過不同方向使用不同特徵之計算方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。其二，本發明之適用於卷積神經網路之記憶體優化實現之區塊式推論系統藉由不同方向使用不同特徵之計算方式，使區塊式推論在不增加過多計算量以及區塊暫存器之前提下，依然能大幅降低外部記憶體之頻寬需求。其三，本發明之重新計算併重複利用不僅較重複利用能支持較寬的模型卷積深度範圍，而且還能提供較重新計算更好的標準化吞吐率。It can be seen from the above-mentioned embodiments that the present invention has the following advantages: First, the block-based inference method of the present invention, which is suitable for memory optimization of convolutional neural networks, uses computing methods with different characteristics in different directions, so that the block-based inference method The inference can still greatly reduce the bandwidth requirement of external memory without increasing too much computation and block registers. Second, the block-based inference system of the present invention, which is suitable for memory optimization of convolutional neural networks, uses computing methods with different characteristics in different directions, so that block-based inference does not increase too much computation and block temporary. Under the premise of memory, it can still greatly reduce the bandwidth requirement of external memory. Third, the recalculation and reuse of the present invention not only supports a wider range of model convolution depths than reuse, but also provides better normalized throughput than recalculation.

雖然本發明已以實施方式揭露如上，然其並非用以限定本發明，任何熟習此技藝者，在不脫離本發明之精神和範圍內，當可作各種之更動與潤飾，因此本發明之保護範圍當視後附之申請專利範圍所界定者為準。Although the present invention has been disclosed in the above embodiments, it is not intended to limit the present invention. Anyone skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the protection of the present invention The scope shall be determined by the scope of the appended patent application.

100:適用於卷積神經網路之記憶體優化實現之區塊式推論方法 S02:參數設定步驟 S04:分割步驟 S06:區塊推論步驟 S062:第一方向資料選取步驟 S064:第二方向資料選取步驟 S066:卷積運算步驟 S08:暫存步驟 110:輸出影像 200:適用於卷積神經網路之記憶體優化實現之區塊式推論系統 212:推論參數組 214:卷積參數組 220:區塊暫存器 230:運算處理單元 232:卷積引擎 B_W ,W1,W2,W3:區塊寬度 B_H ,H1,H2,H3:區塊高度 C1:第i層重複利用輸入特徵區塊通道數 C2:第i層中間特徵區塊通道數 C3:第i層輸出特徵區塊通道數 D:卷積深度 D_max ：最大支援卷積深度 D1:掃描換行方向 D2:區塊掃描方向 FC:重新計算 FU:重複利用 FCFU:重新計算併重複利用 IB:輸入區塊資料 IR:內區域 k‒1:重複利用特徵數量 L1:第1層 L1FC:第1層重新計算特徵 L1FC_I:第1層重新計算輸入特徵區塊資料 L1FU:第1層重複利用特徵 L1FU_I:第1層重複利用輸入特徵區塊資料 L1_O:第1層輸出特徵區塊資料 L2:第2層 L2FC:第2層重新計算特徵 L2FC_I:第2層重新計算輸入特徵區塊資料 L2FU:第2層重複利用特徵 L2FU_I:第2層重複利用輸入特徵區塊資料 L2_O:第2層輸出特徵區塊資料 L3:第3層 L3FC:第3層重新計算特徵 L3FC_I:第3層重新計算輸入特徵區塊資料 L3FU:第3層重複利用特徵 L3FU_I:第3層重複利用輸入特徵區塊資料 L3_O:第3層輸出特徵區塊資料 LD:第D層 LiFU_I:第i層重複利用輸入特徵區塊資料 Li_M:第i層中間區塊資料 Li_O:第i層輸出特徵區塊資料 NTR:標準化吞吐率 OB:輸出區塊資料 OR:外區域 S:區塊暫存器大小限制 SBG1,SBG11,SBG12:第1層子區塊輸入特徵群 SBG2:第2層子區塊輸入特徵群 SBG3:第3層子區塊輸入特徵群100: block inference method suitable for memory optimization implementation of convolutional neural network S02: parameter setting step S04: segmentation step S06: block inference step S062: first direction data selection step S064: second direction data selection Step S066: Convolution operation Step S08: Temporary storage Step 110: Output image 200: Block-based inference system suitable for memory optimization of convolutional neural network 212: Inference parameter group 214: Convolution parameter group 220: Area Block register 230: Operation processing unit 232: Convolution engine B _W , W1, W2, W3: Block width B _H , H1, H2, H3: Block height C1: The i-th layer reuses the input feature block channel Number C2: Number of channels in the middle feature block of the i-th layer C3: Number of channels of the output feature block of the i-th layer D: Convolution depth D _max : Maximum supported convolution depth D1: Scanning line feed direction D2: Block scanning direction FC: Re Calculate FU: Reuse FCFU: Recalculate and reuse IB: Input block data IR: Inner region k–1: Reuse number of features L1: Layer 1 L1FC: Recalculate features at layer 1 L1FC_I: Recalculate at layer 1 Input feature block data L1FU: Layer 1 reuse feature L1FU_I: Layer 1 reuse input feature block data L1_O: Layer 1 output feature block data L2: Layer 2 L2FC: Layer 2 recalculation feature L2FC_I: Layer 2 recalculates input feature block data L2FU: Layer 2 reuse feature L2FU_I: Layer 2 reuses input feature block data L2_O: Layer 2 output feature block data L3: Layer 3 L3FC: Layer 3 Recalculated features L3FC_I: Layer 3 recalculates input feature block data L3FU: Layer 3 reuses features L3FU_I: Layer 3 reuses input feature block data L3_O: Layer 3 output feature block data LD: Layer D LiFU_I: i-th layer reuses input feature block data Li_M: i-th layer intermediate block data Li_O: i-th layer output feature block data NTR: normalized throughput rate OB: output block data OR: outer area S: block Register size limit SBG1, SBG11, SBG12: 1st layer sub-block input feature group SBG2: 2nd layer sub-block input feature group SBG3: 3rd layer sub-block input feature group

第1圖係繪示本發明第一實施例之適用於卷積神經網路之記憶體優化實現之區塊式推論方法的流程示意圖；第2圖係繪示第1圖之分割步驟的示意圖；第3圖係繪示第1圖之區塊推論步驟的多層卷積操作之輸入區塊資料與輸出區塊資料的立體示意圖；第4圖係繪示第1圖之第一方向資料選取步驟的示意圖；第5圖係繪示第1圖之第二方向資料選取步驟的示意圖；第6圖係繪示第3圖之第1層重複利用輸入特徵區塊資料的示意圖；第7圖係繪示本發明第二實施例之通道混洗之示意圖；第8圖係繪示本發明第三實施例之適用於卷積神經網路之記憶體優化實現之區塊式推論系統的方塊示意圖；第9圖係繪示本發明第四實施例之具有3×3濾波器之多層卷積操作的流程示意圖；以及第10圖係繪示重新計算、重複利用及本發明之重新計算併重複利用的模擬結果示意圖。FIG. 1 is a schematic flowchart of a block-based inference method suitable for memory optimization of a convolutional neural network according to a first embodiment of the present invention; FIG. 2 is a schematic diagram illustrating the dividing step of FIG. 1; FIG. 3 is a three-dimensional schematic diagram illustrating the input block data and the output block data of the multi-layer convolution operation in the block inference step of FIG. 1; FIG. 4 is a schematic diagram illustrating the first direction data selection step of FIG. 1; FIG. 5 is a schematic diagram illustrating a step of selecting the second direction data in FIG. 1; FIG. 6 is a schematic diagram illustrating the reuse of input feature block data in the first layer of FIG. 3; FIG. 7 is a schematic diagram of channel shuffling according to the second embodiment of the present invention; FIG. 8 is a block diagram illustrating a block-based inference system suitable for memory optimization of a convolutional neural network according to a third embodiment of the present invention; FIG. 9 is a schematic flowchart illustrating a multi-layer convolution operation with 3×3 filters according to a fourth embodiment of the present invention; and FIG. 10 is a schematic diagram showing the simulation results of recalculation, reuse, and recalculation and reuse of the present invention.

100:適用於卷積神經網路之記憶體優化實現之區塊式推論方法100: A block-based inference method for memory-optimized implementation of convolutional neural networks

S02:參數設定步驟S02: Parameter setting steps

S04:分割步驟S04: Segmentation step

S06:區塊推論步驟S06: Block inference steps

S062:第一方向資料選取步驟S062: The first direction data selection step

S064:第二方向資料選取步驟S064: The second direction data selection step

S066:卷積運算步驟S066: Convolution operation steps

S08:暫存步驟S08: Temporary storage step

Claims

一種適用於卷積神經網路之記憶體優化實現之區塊式推論方法，用以處理一輸入影像，該適用於卷積神經網路之記憶體優化實現之區塊式推論方法包含以下步驟：一參數設定步驟，係設定一推論參數組，該推論參數組包含一卷積深度、一區塊寬度、一區塊高度及複數層卷積核大小；一分割步驟，係驅動一運算處理單元依據該卷積深度、該區塊寬度、該區塊高度及該些層卷積核大小劃分該輸入影像成複數輸入區塊資料，各該輸入區塊資料具有一輸入區塊大小；一區塊推論步驟，係驅動該運算處理單元將各該輸入區塊資料執行一多層卷積操作而產生一輸出區塊資料，且該多層卷積操作包含：一第一方向資料選取步驟，係依據該輸出區塊資料之一位置沿一掃描換行方向選擇複數第i層重新計算特徵，然後依據該輸出區塊資料之該位置及該些第i層重新計算特徵選取出一第i層重新計算輸入特徵區塊資料，其中i為1至該卷積深度之複數個正整數之其中一者；一第二方向資料選取步驟，係依據該第i層重新計算輸入特徵區塊資料沿一區塊掃描方向選取出複數第i層重複利用特徵，並將該第i層重新計算輸入特徵區塊資料及該些第i層重複利用特徵組合而產生一第i層重複利用輸入特徵區塊資料；及一卷積運算步驟，係依據一第i層卷積核大小從該第i層重複利用輸入特徵區塊資料中選取出複數第i層子區塊輸入特徵群，然後對各該第i層子區塊輸入特徵群及一卷積參數組執行一卷積運算而產生一第i層子區塊輸出特徵，並將對應該些第i層子區塊輸入特徵群之該些第i層子區塊輸出特徵組合而形成一第i層輸出特徵區塊資料；以及一暫存步驟，係驅動一區塊暫存器暫存該第i層輸出特徵區塊資料及該些第i層重複利用特徵。A block-based inference method suitable for memory-optimized implementation of convolutional neural networks is used to process an input image. The block-based inference method suitable for memory-optimized implementation of convolutional neural networks includes the following steps: A parameter setting step is to set an inference parameter group, the inference parameter group includes a convolution depth, a block width, a block height and the size of the convolution kernel of multiple layers; a dividing step, driving an arithmetic processing unit to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the size of the convolution kernels of the layers, and each of the input block data has an input block size; A block inference step drives the operation processing unit to perform a multi-layer convolution operation on each of the input block data to generate an output block data, and the multi-layer convolution operation includes: A first direction data selection step is to select a plurality of i-th layer recalculation features along a scan line feed direction according to a position of the output block data, and then recalculate the i-th layers according to the position of the output block data and the i-th layers The feature selects a layer i to recalculate the input feature block data, wherein i is one of a plurality of positive integers from 1 to the depth of the convolution; A second direction data selection step is to select a plurality of i-th layer reuse features along a block scanning direction according to the i-th layer recalculated input feature block data, and recalculate the i-th layer input feature block data and the i-th layer reused features are combined to generate an i-th layer of reused input feature block data; and A convolution operation step is to select a plurality of i-th layer sub-block input feature groups from the i-th layer reused input feature block data according to an i-th layer convolution kernel size, and then perform a The block input feature group and a convolution parameter group perform a convolution operation to generate an i-th layer sub-block output feature, which corresponds to the i-th layer sub-regions of the i-th layer sub-block input feature group combining the block output features to form an i-th layer of output feature block data; and A temporary storage step is to drive a block register to temporarily store the output feature block data of the i-th layer and the reuse features of the i-th layer.

如請求項1所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中，當i等於1時，該第i層重新計算輸入特徵區塊資料等於各該輸入區塊資料；及當i等於該卷積深度時，該第i層輸出特徵區塊資料等於該輸出區塊資料。The block-based inference method applicable to memory-optimized implementation of convolutional neural networks as described in claim 1, wherein, When i is equal to 1, the i-th layer recomputed input feature block data is equal to each of the input block data; and When i is equal to the convolution depth, the i-th layer output feature block data is equal to the output block data.

如請求項1所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中該第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小與一第i層重新計算輸入特徵區塊通道數，該第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小與一第i層輸出特徵區塊通道數，該第i層輸出特徵區塊大小大於該第i層重新計算輸入特徵區塊大小，且該第i層重新計算輸入特徵區塊通道數等於該第i層輸出特徵區塊通道數。The block-based inference method suitable for memory-optimized implementation of convolutional neural networks as described in claim 1, wherein the i-th layer recalculated input feature block data has an i-th layer recalculated input feature block size Recalculate the input feature block channel number with an i-th layer, the i-th layer output feature block data has an i-th layer output feature block size and an i-th layer output feature block channel number, the i-th layer output The feature block size is larger than the recalculated input feature block size of the ith layer, and the number of channels of the recalculated input feature block of the ith layer is equal to the number of output feature block channels of the ith layer.

如請求項1所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中該區塊掃描方向垂直於該掃描換行方向，該區塊寬度大於該區塊高度，且該區塊高度的一延伸方向平行於該區塊掃描方向。The block-based inference method suitable for memory optimization implementation of a convolutional neural network as described in claim 1, wherein the block scanning direction is perpendicular to the scanning line feed direction, the block width is greater than the block height, and An extension direction of the block height is parallel to the block scanning direction.

如請求項1所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中該卷積深度、該區塊寬度及該區塊高度均為正整數，該第i層卷積核大小為k_Wi ×k_Hi ，該些第i層重複利用特徵沿該區塊掃描方向具有一重複利用特徵數量，且該重複利用特徵數量等於k_Hi ‒1。The block-based inference method suitable for memory optimization implementation of convolutional neural networks as described in claim 1, wherein the convolution depth, the block width and the block height are all positive integers, and the i-th layer The size of the convolution kernel is k _Wi ×k _Hi , the i-th layer reused features have a number of reused features along the block scanning direction, and the number of reused features is equal to k _Hi −1.

如請求項1所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中該區塊寬度表示為B_W ，該卷積深度表示為D，該區塊高度表示為B_H ；該輸入區塊大小等於B_W ×B_H ；該輸出區塊資料具有一輸出區塊大小，且該輸出區塊大小等於(B_W ‒2D)×B_H ；該第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小，且該第i層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H ；該第i層重複利用輸入特徵區塊資料具有一第i層重複利用輸入特徵區塊大小，且該第i層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)；該第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小，且該第i層輸出特徵區塊大小等於(B_W ‒2i)×B_H ；及該卷積深度小於該區塊寬度之一半。The block-based inference method applicable to memory optimization implementation of convolutional neural network as described in claim 1, wherein the block width is represented by B _W , the convolution depth is represented by D, and the block height is represented by B _H ; the input block size is equal to B _W ×B _H ; the output block data has an output block size equal to (B _W -2D)×B _H ; the i-th layer recalculates The input feature block data has an i-th layer recalculated input feature block size, and the i-th layer recalculated input feature block size is equal to (B _W -2i+2)×B _H ; the i-th layer reuses the input The feature block data has an i-th layer reuses the input feature block size, and the i-th layer reuses the input feature block size equal to (B _W -2i+2)×(B _H +2); the i-th layer reuses the input feature block size The output feature block data has an output feature block size of the i-th layer, and the output feature block size of the i-th layer is equal to (B _W -2i)×B _H ; and the convolution depth is less than half of the block width.

如請求項1所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中，當其中一該第i層子區塊輸入特徵群之複數輸入特徵之至少一者位於該第i層重複利用輸入特徵區塊資料之一外區域時，該其中一第i層子區塊輸入特徵群之該些輸入特徵包含複數外區塊特徵及複數第一內區塊特徵，該些外區塊特徵代表已運算之特徵，該些第一內區塊特徵代表未運算之特徵；當其中一該第i層子區塊輸入特徵群之該些輸入特徵均位於該第i層重複利用輸入特徵區塊資料之一內區域時，該其中一第i層子區塊輸入特徵群之該些輸入特徵僅包含複數第二內區塊特徵；及該第i層重複利用輸入特徵區塊資料沿該區塊掃描方向之排列順序為該外區域與該內區域。The block-based inference method applicable to memory-optimized implementation of convolutional neural networks as described in claim 1, wherein, When at least one of the complex input features of the i-th layer sub-block input feature group is located in an outer area of the i-th layer reused input feature block data, the i-th layer sub-block input feature The input features of the group include a plurality of outer block features and a plurality of first inner block features, the outer block features represent computed features, and the first inner block features represent uncomputed features; When the input features of one of the i-th layer sub-block input feature groups are all located in an inner area of the i-th layer reused input feature block data, the one of the i-th layer sub-block input feature groups the input features include only the plurality of second inner block features; and The arrangement sequence of the i-th layer reused input feature block data along the block scanning direction is the outer area and the inner area.

如請求項7所述之適用於卷積神經網路之記憶體優化實現之區塊式推論方法，其中該些外區塊特徵係儲存於該區塊暫存器，該區塊暫存器具有一暫存空間，該暫存空間透過該第i層重新計算輸入特徵區塊資料之一寬度、該卷積深度、一層數、一通道數及該第i層卷積核大小運算求得，該第i層重新計算輸入特徵區塊資料之該寬度表示為B_Wi ，該卷積深度表示為D，該層數表示為i，該通道數表示為C，該第i層卷積核大小為k_Wi ×k_Hi ，該暫存空間表示為LBS且符合下式：

。The block-based inference method suitable for memory optimization implementation of convolutional neural network according to claim 7, wherein the outer block features are stored in the block register, and the block register has a Temporary storage space, the temporary storage space is obtained by recalculating a width of the input feature block data, the convolution depth, the number of layers, the number of a channel and the size of the convolution kernel of the i-th layer through the calculation of the i-th layer. The width of the i-layer recalculated input feature block data is represented by B _Wi , the convolution depth is represented by D, the number of layers is represented by i, the number of channels is represented by C, and the size of the convolution kernel of the i-th layer is k _Wi ×k _Hi , the temporary storage space is expressed as LBS and conforms to the following formula:

.

一種適用於卷積神經網路之記憶體優化實現之區塊式推論系統，用以處理一輸入影像，該適用於卷積神經網路之記憶體優化實現之區塊式推論系統包含：一區塊暫存器，用以存取一第i層輸出特徵區塊資料及複數第i層重複利用特徵；以及一運算處理單元，電性連接於該區塊暫存器，該運算處理單元接收該輸入影像並經配置以實施包含以下步驟之操作：一參數設定步驟，係設定一推論參數組，該推論參數組包含一卷積深度、一區塊寬度、一區塊高度及複數層卷積核大小；一分割步驟，係依據該卷積深度、該區塊寬度、該區塊高度及該些層卷積核大小劃分該輸入影像成複數輸入區塊資料，各該輸入區塊資料具有一輸入區塊大小；及一區塊推論步驟，係將各該輸入區塊資料執行一多層卷積操作而產生一輸出區塊資料，且該多層卷積操作包含：一第一方向資料選取步驟，係依據該輸出區塊資料之一位置沿一掃描換行方向選擇複數第i層重新計算特徵，然後依據該輸出區塊資料之該位置及該些第i層重新計算特徵選取出一第i層重新計算輸入特徵區塊資料，其中i為1至該卷積深度之複數個正整數之其中一者；一第二方向資料選取步驟，係依據該第i層重新計算輸入特徵區塊資料沿一區塊掃描方向選取出該些第i層重複利用特徵，並將該第i層重新計算輸入特徵區塊資料及該些第i層重複利用特徵組合而產生一第i層重複利用輸入特徵區塊資料；及一卷積運算步驟，係依據一第i層卷積核大小從該第i層重複利用輸入特徵區塊資料中選取出複數第i層子區塊輸入特徵群，然後對各該第i層子區塊輸入特徵群及一卷積參數組執行一卷積運算而產生一第i層子區塊輸出特徵，並將對應該些第i層子區塊輸入特徵群之該些第i層子區塊輸出特徵組合而形成該第i層輸出特徵區塊資料。A block-based inference system suitable for memory-optimized implementation of convolutional neural networks is used to process an input image. The block-based inference system suitable for memory-optimized implementation of convolutional neural networks includes: a block register for accessing an i-th layer output feature block data and a plurality of i-th layer reuse features; and an arithmetic processing unit, electrically connected to the block register, the arithmetic processing unit receives the input image and is configured to perform operations including the following steps: A parameter setting step is to set an inference parameter group, the inference parameter group includes a convolution depth, a block width, a block height and the size of the convolution kernel of multiple layers; A dividing step is to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height and the size of the convolution kernels, and each of the input block data has an input block size; and A block inference step is to perform a multi-layer convolution operation on each of the input block data to generate an output block data, and the multi-layer convolution operation includes: A first direction data selection step is to select a plurality of i-th layer recalculation features along a scan line feed direction according to a position of the output block data, and then recalculate the i-th layers according to the position of the output block data and the i-th layers The feature selects a layer i to recalculate the input feature block data, wherein i is one of a plurality of positive integers from 1 to the depth of the convolution; A second direction data selection step is to select the i-th layer reuse features along a block scanning direction according to the i-th layer recalculated input feature block data, and recalculate the i-th layer input feature block the data and the layer i reuse features are combined to generate a layer i reuse input feature block data; and A convolution operation step is to select a plurality of i-th layer sub-block input feature groups from the i-th layer reused input feature block data according to an i-th layer convolution kernel size, and then perform a The block input feature group and a convolution parameter group perform a convolution operation to generate an i-th layer sub-block output feature, which corresponds to the i-th layer sub-regions of the i-th layer sub-block input feature group The block output features are combined to form the i-th layer output feature block data.

如請求項9所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中，當i等於1時，該第i層重新計算輸入特徵區塊資料等於各該輸入區塊資料；及當i等於該卷積深度時，該第i層輸出特徵區塊資料等於該輸出區塊資料。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 9, wherein, When i is equal to 1, the i-th layer recomputed input feature block data is equal to each of the input block data; and When i is equal to the convolution depth, the i-th layer output feature block data is equal to the output block data.

如請求項9所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中該第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小與一第i層重新計算輸入特徵區塊通道數，該第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小與一第i層輸出特徵區塊通道數，該第i層輸出特徵區塊大小大於該第i層重新計算輸入特徵區塊大小，且該第i層重新計算輸入特徵區塊通道數等於該第i層輸出特徵區塊通道數。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 9, wherein the ith layer recalculated input feature block data has an ith layer recalculated input feature block size Recalculate the input feature block channel number with an i-th layer, the i-th layer output feature block data has an i-th layer output feature block size and an i-th layer output feature block channel number, the i-th layer output The feature block size is larger than the recalculated input feature block size of the ith layer, and the number of channels of the recalculated input feature block of the ith layer is equal to the number of output feature block channels of the ith layer.

如請求項9所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中該區塊掃描方向垂直於該掃描換行方向，該區塊寬度大於該區塊高度，且該區塊高度的一延伸方向平行於該區塊掃描方向。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 9, wherein the block scan direction is perpendicular to the scan line feed direction, the block width is greater than the block height, and An extension direction of the block height is parallel to the block scanning direction.

如請求項9所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中該卷積深度、該區塊寬度及該區塊高度均為正整數，該第i層卷積核大小為k_Wi ×k_Hi ，該些第i層重複利用特徵沿該區塊掃描方向具有一重複利用特徵數量，且該重複利用特徵數量等於k_Hi ‒1。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 9, wherein the convolution depth, the block width and the block height are all positive integers, and the i-th layer The size of the convolution kernel is k _Wi ×k _Hi , the i-th layer reused features have a number of reused features along the block scanning direction, and the number of reused features is equal to k _Hi −1.

如請求項9所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中該區塊寬度表示為B_W ，該卷積深度表示為D，該區塊高度表示為B_H ；該輸入區塊大小等於B_W ×B_H ；該輸出區塊資料具有一輸出區塊大小，且該輸出區塊大小等於(B_W ‒2D)×B_H ；該第i層重新計算輸入特徵區塊資料具有一第i層重新計算輸入特徵區塊大小，且該第i層重新計算輸入特徵區塊大小等於(B_W ‒2i+2)×B_H ；該第i層重複利用輸入特徵區塊資料具有一第i層重複利用輸入特徵區塊大小，且該第i層重複利用輸入特徵區塊大小等於(B_W ‒2i+2)×(B_H +2)；該第i層輸出特徵區塊資料具有一第i層輸出特徵區塊大小，且該第i層輸出特徵區塊大小等於(B_W ‒2i)×B_H ；及該卷積深度小於該區塊寬度之一半。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 9, wherein the block width is represented by B _W , the convolution depth is represented by D, and the block height is represented by B _H ; the input block size is equal to B _W ×B _H ; the output block data has an output block size equal to (B _W -2D)×B _H ; the i-th layer recalculates The input feature block data has an i-th layer recalculated input feature block size, and the i-th layer recalculated input feature block size is equal to (B _W -2i+2)×B _H ; the i-th layer reuses the input The feature block data has an i-th layer reuses the input feature block size, and the i-th layer reuses the input feature block size equal to (B _W -2i+2)×(B _H +2); the i-th layer reuses the input feature block size The output feature block data has an output feature block size of the i-th layer, and the output feature block size of the i-th layer is equal to (B _W -2i)×B _H ; and the convolution depth is less than half of the block width.

如請求項9所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中，當其中一該第i層子區塊輸入特徵群之複數輸入特徵之至少一者位於該第i層重複利用輸入特徵區塊資料之一外區域時，該其中一第i層子區塊輸入特徵群之該些輸入特徵包含複數外區塊特徵及複數第一內區塊特徵，該些外區塊特徵代表已運算之特徵，該些第一內區塊特徵代表未運算之特徵；當其中一該第i層子區塊輸入特徵群之該些輸入特徵均位於該第i層重複利用輸入特徵區塊資料之一內區域時，該其中一第i層子區塊輸入特徵群之該些輸入特徵僅包含複數第二內區塊特徵；及該第i層重複利用輸入特徵區塊資料沿該區塊掃描方向之排列順序為該外區域與該內區域。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 9, wherein, When at least one of the complex input features of the i-th layer sub-block input feature group is located in an outer area of the i-th layer reused input feature block data, the i-th layer sub-block input feature The input features of the group include a plurality of outer block features and a plurality of first inner block features, the outer block features represent computed features, and the first inner block features represent uncomputed features; When the input features of one of the i-th layer sub-block input feature groups are all located in an inner area of the i-th layer reused input feature block data, the one of the i-th layer sub-block input feature groups the input features include only the plurality of second inner block features; and The arrangement sequence of the i-th layer reused input feature block data along the block scanning direction is the outer area and the inner area.

如請求項15所述之適用於卷積神經網路之記憶體優化實現之區塊式推論系統，其中該些外區塊特徵係儲存於該區塊暫存器，該區塊暫存器具有一暫存空間，該暫存空間透過該第i層重新計算輸入特徵區塊資料之一寬度、該卷積深度、一層數、一通道數及該第i層卷積核大小運算求得，該第i層重新計算輸入特徵區塊資料之該寬度表示為B_Wi ，該卷積深度表示為D，該層數表示為i，該通道數表示為C，該第i層卷積核大小為k_Wi ×k_Hi ，該暫存空間表示為LBS且符合下式：

。The block-based inference system suitable for memory-optimized implementation of convolutional neural networks as described in claim 15, wherein the outer block features are stored in the block register, and the block register has a Temporary storage space, the temporary storage space is obtained by recalculating a width of the input feature block data, the convolution depth, the number of layers, the number of a channel and the size of the convolution kernel of the i-th layer through the calculation of the i-th layer. The width of the i-layer recalculated input feature block data is represented by B _Wi , the convolution depth is represented by D, the number of layers is represented by i, the number of channels is represented by C, and the size of the convolution kernel of the i-th layer is k _Wi ×k _Hi , the temporary storage space is expressed as LBS and conforms to the following formula:

.