TWI765336B - Block-based inference method for memory-efficient convolutional neural network implementation and system thereof - Google Patents
- Publication number: TWI765336B
- Application number: TW109130493A
- Authority: TW (Taiwan)
- Prior art keywords
- block
- layer
- input
- input feature
- features
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
Abstract
Description
The present invention relates to a block-based inference method and system, and in particular to a block-based inference method and system for memory-efficient implementation of convolutional neural networks.
When convolutional neural networks are used in image-processing applications, the external memory bandwidth requirement can be very high; a block-based inference flow can greatly reduce this requirement. However, feature vectors overlap between blocks, and two approaches are known for handling this overlap: recomputation and reuse. The former increases the amount of computation and reduces the number of output pixels, while the latter requires large block buffers to hold the reused feature vectors. The market therefore lacks a block-based inference method and system for memory-efficient CNN implementation that can greatly reduce the external memory bandwidth requirement without adding much computation or block-buffer storage, and the industry is seeking such a solution.
Accordingly, the object of the present invention is to provide a block-based inference method and system for memory-efficient CNN implementation that, during block-based inference, reuses already-computed features along the direction in which blocks advance and recomputes features along the other direction, so that block-based inference can greatly reduce the external memory bandwidth requirement without adding excessive computation or block-buffer storage.
According to one embodiment of the method aspect of the present invention, a block-based inference method for memory-efficient CNN implementation is provided for processing an input image. The method includes a parameter-setting step, a dividing step, a block-inference step, and a buffering step. The parameter-setting step sets an inference parameter set that includes a convolution depth, a block width, a block height, and a plurality of layer kernel sizes. The dividing step drives a processing unit to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height, and the layer kernel sizes, each input block datum having an input block size. The block-inference step drives the processing unit to perform a multi-layer convolution operation on each input block datum to produce output block data. The multi-layer convolution operation includes a first-direction data-selection step, a second-direction data-selection step, and a convolution step. The first-direction data-selection step selects a plurality of layer-i recomputed features along a scan-linefeed direction according to a position of the output block data, and then selects layer-i recomputed input feature block data according to the position of the output block data and the layer-i recomputed features, where i is one of the positive integers from 1 to the convolution depth. The second-direction data-selection step selects a plurality of layer-i reused features along a block-scan direction according to the layer-i recomputed input feature block data, and combines the layer-i recomputed input feature block data with the layer-i reused features to produce layer-i reused input feature block data. The convolution step selects a plurality of layer-i sub-block input feature groups from the layer-i reused input feature block data according to the layer-i kernel size, performs a convolution on each layer-i sub-block input feature group and a convolution parameter set to produce a layer-i sub-block output feature, and combines the layer-i sub-block output features corresponding to the layer-i sub-block input feature groups to form layer-i output feature block data. The buffering step drives a block buffer to temporarily store the layer-i output feature block data and the layer-i reused features.
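The claimed flow can be pictured with a small, single-channel sketch: recomputation along the scan-linefeed direction is modeled by handing each block an input already widened by the recomputed features, while reuse along the block-scan direction is modeled by prepending the k − 1 rows buffered from the previous block at each layer. All names (`conv2d_valid`, `block_inference`) and the flat list-of-lists feature maps are illustrative assumptions, not part of the patent:

```python
def conv2d_valid(x, k):
    """Plain 'valid' 2-D convolution on a list-of-lists feature map;
    the output shrinks by (kernel size - 1) in each dimension."""
    kh, kw = len(k), len(k[0])
    h, w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + dr][c + dc] * k[dr][dc]
                 for dr in range(kh) for dc in range(kw))
             for c in range(w)]
            for r in range(h)]

def block_inference(block, kernels, reuse_bufs):
    """Push one input block through D convolution layers.

    block      -- layer-1 recomputed input feature block (already widened
                  along the scan-linefeed direction), rows x cols
    kernels    -- list of D square kernels
    reuse_bufs -- per-layer buffers holding the last k-1 rows of that
                  layer's input from the previous block along the
                  block-scan direction (None for a boundary block)
    Returns the output feature block and the updated reuse buffers.
    """
    feat = block
    new_bufs = []
    for i, kern in enumerate(kernels):
        kh = len(kern)
        # second-direction selection: prepend k-1 reused rows from the
        # previous block at this layer
        if reuse_bufs[i] is not None:
            feat = [row[:] for row in reuse_bufs[i]] + feat
        # rows the next block will reuse at this same layer
        new_bufs.append([row[:] for row in feat[-(kh - 1):]])
        # convolution step: valid convolution over the reused input
        feat = conv2d_valid(feat, kern)
    return feat, new_bufs
```

With B_W = 10, B_H = 4, D = 3 and 3×3 kernels, the output block comes out (B_W − 2D) × B_H = 4 × 4, consistent with the size formulas of the disclosure.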
Thereby, the block-based inference method of the present invention for memory-efficient CNN implementation uses different computation strategies along different directions, so that block-based inference can greatly reduce the external memory bandwidth requirement without adding excessive computation or block-buffer storage.
Other examples of the foregoing embodiment are as follows: when i equals 1, the layer-i recomputed input feature block data equals the corresponding input block data; when i equals the convolution depth, the layer-i output feature block data equals the output block data.
Other examples of the foregoing embodiment are as follows: the layer-i recomputed input feature block data has a layer-i recomputed input feature block size and a layer-i recomputed input feature block channel count, and the layer-i output feature block data has a layer-i output feature block size and a layer-i output feature block channel count. The layer-i recomputed input feature block size is larger than the layer-i output feature block size, since each valid convolution shrinks the block, and the layer-i recomputed input feature block channel count equals the layer-i output feature block channel count.
Other examples of the foregoing embodiment are as follows: the block-scan direction is perpendicular to the scan-linefeed direction, the block width is greater than the block height, and the direction along which the block height extends is parallel to the block-scan direction.
Other examples of the foregoing embodiment are as follows: the convolution depth, the block width, and the block height are all positive integers, and the layer-i kernel size is k_Wi × k_Hi. The layer-i reused features have a reused-feature count along the block-scan direction equal to k_Hi − 1.
Other examples of the foregoing embodiment are as follows: the block width is denoted B_W, the convolution depth D, and the block height B_H. The input block size equals B_W × B_H. The output block data has an output block size equal to (B_W − 2D) × B_H. The layer-i recomputed input feature block data has a layer-i recomputed input feature block size equal to (B_W − 2i + 2) × B_H. The layer-i reused input feature block data has a layer-i reused input feature block size equal to (B_W − 2i + 2) × (B_H + 2). The layer-i output feature block data has a layer-i output feature block size equal to (B_W − 2i) × B_H. The convolution depth is less than half the block width.
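The size formulas above can be collected into a small helper for checking a configuration; a minimal sketch assuming 3×3 kernels and unit stride, with illustrative names:

```python
def block_sizes(bw, bh, depth):
    """Per-layer block sizes from the formulas above (3x3 kernels assumed).

    For each layer i = 1..depth, returns the recomputed-input,
    reused-input, and output feature block sizes as (width, height).
    """
    # the disclosure requires the convolution depth to be below half
    # the block width, so the output width (bw - 2*depth) stays positive
    assert depth < bw / 2
    sizes = []
    for i in range(1, depth + 1):
        sizes.append({
            "recompute_in": (bw - 2 * i + 2, bh),        # (B_W - 2i + 2) x B_H
            "reused_in":    (bw - 2 * i + 2, bh + 2),    # (B_W - 2i + 2) x (B_H + 2)
            "out":          (bw - 2 * i, bh),            # (B_W - 2i) x B_H
        })
    return sizes
```

For B_W = 10, B_H = 4, D = 3 this yields (10, 4), (10, 6), and (8, 4) for layer 1, and (4, 4) = (B_W − 2D) × B_H for the final output.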
Other examples of the foregoing embodiment are as follows: when at least one of the input features of a layer-i sub-block input feature group lies in the outer region of the layer-i reused input feature block data, the input features of that group include a plurality of outer-block features and a plurality of first inner-block features; the outer-block features are features that have already been computed, and the first inner-block features are features that have not yet been computed. When all the input features of a layer-i sub-block input feature group lie in the inner region of the layer-i reused input feature block data, the input features of that group include only a plurality of second inner-block features. Along the block-scan direction, the layer-i reused input feature block data is arranged as the outer region followed by the inner region.
Other examples of the foregoing embodiment are as follows: the outer-block features are stored in a block buffer that has a buffer space, and the buffer space is computed from the width of the layer-i recomputed input feature block data, the convolution depth, the layer index, the channel count, and the layer-i kernel size. With the width of the layer-i recomputed input feature block data denoted B_Wi, the convolution depth denoted D, the layer index denoted i, the channel count denoted C, and the layer-i kernel size k_Wi × k_Hi, the buffer space is denoted LBS and satisfies the following formula:
According to one embodiment of the structural aspect of the present invention, a block-based inference system for memory-efficient CNN implementation is provided for processing an input image. The system includes a block buffer and a processing unit. The block buffer is used to access the layer-i output feature block data and a plurality of layer-i reused features. The processing unit is electrically connected to the block buffer, receives the input image, and is configured to perform operations including a parameter-setting step, a dividing step, and a block-inference step. The parameter-setting step sets an inference parameter set that includes a convolution depth, a block width, a block height, and a plurality of layer kernel sizes. The dividing step divides the input image into a plurality of input block data according to the convolution depth, the block width, the block height, and the layer kernel sizes, each input block datum having the input block size. The block-inference step performs a multi-layer convolution operation on each input block datum to produce output block data, and the multi-layer convolution operation includes a first-direction data-selection step, a second-direction data-selection step, and a convolution step. The first-direction data-selection step selects a plurality of layer-i recomputed features along the scan-linefeed direction according to the position of the output block data, and then selects the layer-i recomputed input feature block data according to that position and the layer-i recomputed features, where i is one of the positive integers from 1 to the convolution depth. The second-direction data-selection step selects the layer-i reused features along the block-scan direction according to the layer-i recomputed input feature block data, and combines the layer-i recomputed input feature block data with the layer-i reused features to produce the layer-i reused input feature block data. The convolution step selects a plurality of layer-i sub-block input feature groups from the layer-i reused input feature block data according to the layer-i kernel size, performs a convolution on each layer-i sub-block input feature group and a convolution parameter set to produce a layer-i sub-block output feature, and combines the layer-i sub-block output features corresponding to the layer-i sub-block input feature groups to form the layer-i output feature block data.
Thereby, the block-based inference system of the present invention for memory-efficient CNN implementation uses different computation strategies along different directions, so that block-based inference can greatly reduce the external memory bandwidth requirement without adding excessive computation or block-buffer storage.
Other examples of the foregoing embodiment are as follows: when i equals 1, the layer-i recomputed input feature block data equals the corresponding input block data; when i equals the convolution depth, the layer-i output feature block data equals the output block data.
Other examples of the foregoing embodiment are as follows: the layer-i recomputed input feature block data has a layer-i recomputed input feature block size and a layer-i recomputed input feature block channel count, and the layer-i output feature block data has a layer-i output feature block size and a layer-i output feature block channel count. The layer-i recomputed input feature block size is larger than the layer-i output feature block size, since each valid convolution shrinks the block, and the layer-i recomputed input feature block channel count equals the layer-i output feature block channel count.
Other examples of the foregoing embodiment are as follows: the block-scan direction is perpendicular to the scan-linefeed direction, the block width is greater than the block height, and the direction along which the block height extends is parallel to the block-scan direction.
Other examples of the foregoing embodiment are as follows: the convolution depth, the block width, and the block height are all positive integers, the layer-i kernel size is k_Wi × k_Hi, and the layer-i reused features have a reused-feature count along the block-scan direction equal to k_Hi − 1.
Other examples of the foregoing embodiment are as follows: the block width is denoted B_W, the convolution depth D, and the block height B_H. The input block size equals B_W × B_H. The output block data has an output block size equal to (B_W − 2D) × B_H. The layer-i recomputed input feature block data has a layer-i recomputed input feature block size equal to (B_W − 2i + 2) × B_H. The layer-i reused input feature block data has a layer-i reused input feature block size equal to (B_W − 2i + 2) × (B_H + 2). The layer-i output feature block data has a layer-i output feature block size equal to (B_W − 2i) × B_H. The convolution depth is less than half the block width.
Other examples of the foregoing embodiment are as follows: when at least one of the input features of a layer-i sub-block input feature group lies in the outer region of the layer-i reused input feature block data, the input features of that group include a plurality of outer-block features and a plurality of first inner-block features; the outer-block features are features that have already been computed, and the first inner-block features are features that have not yet been computed. When all the input features of a layer-i sub-block input feature group lie in the inner region of the layer-i reused input feature block data, the input features of that group include only a plurality of second inner-block features. Along the block-scan direction, the layer-i reused input feature block data is arranged as the outer region followed by the inner region.
Other examples of the foregoing embodiment are as follows: the outer-block features are stored in a block buffer that has a buffer space, and the buffer space is computed from the width of the layer-i recomputed input feature block data, the convolution depth, the layer index, the channel count, and the layer-i kernel size. With the width of the layer-i recomputed input feature block data denoted B_Wi, the convolution depth denoted D, the layer index denoted i, the channel count denoted C, and the layer-i kernel size k_Wi × k_Hi, the buffer space is denoted LBS and satisfies the following formula:
Several embodiments of the present invention will be described below with reference to the drawings. For clarity, many practical details are explained in the following description. It should be understood, however, that these practical details are not intended to limit the present invention; that is, in some embodiments of the present invention, these practical details are unnecessary. In addition, to simplify the drawings, some well-known structures and elements are shown schematically, and repeated elements may be denoted by the same reference numerals.
In addition, when an element (or a unit, module, etc.) is described herein as being "connected" to another element, it may be directly connected to the other element, or indirectly connected, meaning that other elements are interposed between the two. Only when an element is expressly described as being "directly connected" to another element is no intervening element present. The terms first, second, third, and so on merely distinguish different elements and impose no limitation on the elements themselves, so a first element may also be renamed a second element. Moreover, the combinations of elements/units/circuits herein are not combinations generally known, conventional, or customary in this field; whether an element/unit/circuit itself is known cannot determine whether its combination would be readily accomplished by a person having ordinary skill in the art.
Please refer to Fig. 1, which is a schematic flowchart of a block-based inference method 100 for memory-efficient implementation of a convolutional neural network according to the first embodiment of the present invention. The block-based inference method 100 processes an input image to produce an output image and includes a parameter-setting step S02, a dividing step S04, a block-inference step S06, and a buffering step S08.
The parameter-setting step S02 sets an inference parameter set that includes a convolution depth, a block width, a block height, and a plurality of layer kernel sizes; the number of layer kernel sizes equals the convolution depth.
The dividing step S04 drives the processing unit to divide the input image into a plurality of input block data according to the convolution depth, the block width, the block height, and the layer kernel sizes; each input block datum has an input block size.
The block-inference step S06 drives the processing unit to perform a multi-layer convolution operation on each input block datum to produce output block data, and the multi-layer convolution operation includes a first-direction data-selection step S062, a second-direction data-selection step S064, and a convolution step S066. The first-direction data-selection step S062 selects a plurality of layer-i recomputed features along the scan-linefeed direction according to the position of the output block data, and then selects layer-i recomputed input feature block data according to that position and the layer-i recomputed features, where i is one of the positive integers from 1 to the convolution depth. The second-direction data-selection step S064 selects a plurality of layer-i reused features along the block-scan direction according to the layer-i recomputed input feature block data, and combines the layer-i recomputed input feature block data with the layer-i reused features to produce layer-i reused input feature block data. The convolution step S066 selects a plurality of layer-i sub-block input feature groups from the layer-i reused input feature block data according to the layer-i kernel size, performs a convolution on each layer-i sub-block input feature group and a convolution parameter set to produce a layer-i sub-block output feature, and combines the layer-i sub-block output features corresponding to the layer-i sub-block input feature groups to form layer-i output feature block data. The convolution parameter set includes weight parameters and bias parameters.
The buffering step S08 drives a block buffer bank to temporarily store the layer-i output feature block data and the layer-i reused features.
Thereby, the block-based inference method 100 for memory-efficient CNN implementation uses different computation strategies along different directions, so that block-based inference can greatly reduce the external memory bandwidth requirement without adding excessive computation or block-buffer storage. The details of the above steps are described below through more detailed embodiments.
Please refer to Figs. 1 to 6. Fig. 2 is a schematic diagram of the dividing step S04 of Fig. 1; Fig. 3 is a three-dimensional schematic diagram of the input block data IB and the output block data OB of the multi-layer convolution operation in the block-inference step S06 of Fig. 1; Fig. 4 is a schematic diagram of the first-direction data-selection step S062 of Fig. 1; Fig. 5 is a schematic diagram of the second-direction data-selection step S064 of Fig. 1; and Fig. 6 is a schematic diagram of the layer-1 reused input feature block data L1FU_I of Fig. 3. As shown, in this embodiment the first-direction data-selection step S062, the second-direction data-selection step S064, and the convolution step S066 are performed at every layer (i.e., i = 1 to D). The convolution depth D, the block width B_W, and the block height B_H are all positive integers. The layer-i kernel size is k_Wi × k_Hi, where k_Wi and k_Hi are positive integers. The scan-linefeed direction D1 is horizontal and the block-scan direction D2 is vertical; in other words, the block-scan direction D2 is perpendicular to the scan-linefeed direction D1. The block width B_W is greater than the block height B_H, and the direction along which the block height B_H extends is parallel to the block-scan direction D2. The input block size equals B_W × B_H. The output block data OB has an output block size equal to (B_W − 2D) × B_H. The layer-i recomputed input feature block data has a layer-i recomputed input feature block size equal to (B_W − 2i + 2) × B_H. The layer-i reused input feature block data has a layer-i reused input feature block size equal to (B_W − 2i + 2) × (B_H + 2). The layer-i output feature block data has a layer-i output feature block size equal to (B_W − 2i) × B_H; it represents the output features of layer i after the convolution and is used for the recomputation of the next layer (layer i + 1) of the same block. The convolution depth D is less than half the block width B_W. Furthermore, the layer-i reused features have a reused-feature count along the block-scan direction D2 equal to k_Hi − 1 (i.e., k − 1); they are reused by the same layer (layer i) of the next block. When i equals 1, the layer-i recomputed input feature block data equals the input block data IB; when i equals the convolution depth D, the layer-i output feature block data equals the output block data OB.
In Fig. 3 through Fig. 6, the convolution depth D is 3, the block width B_W is 10, the block height B_H is 4, and the i-th-layer kernel size is 3 × 3, i.e., k_Wi = k_Hi = k = 3. A convolution depth of 3 means there are three layers of convolution operations, so the multi-layer convolution operation includes a layer-1 convolution operation, a layer-2 convolution operation, and a layer-3 convolution operation (i.e., i = 1, 2, and 3).
The layer-1 convolution operation (i = 1) includes the first-direction data selection step S062, the second-direction data selection step S064, and the convolution operation step S066. The first-direction data selection step S062 selects, according to the position of the output block data OB (i.e., the layer-3 output feature block data L3_O), six layer-1 recomputed features L1FC along the scan line-feed direction D1 (i.e., (D − i + 1) × (k − 1) of them), and then selects a layer-1 recomputed input feature block data L1FC_I according to the position of the output block data OB and these layer-1 recomputed features L1FC. The layer-1 recomputed input feature block data L1FC_I equals the input block data IB, and the input block size of IB equals the layer-1 recomputed input feature block size of L1FC_I, both being (B_W − 2i + 2) × B_H = (10 − 2 + 2) × 4 = 10 × 4, as shown at layer L1 of Fig. 3, layer L1 of Fig. 4, and Fig. 6. Furthermore, the second-direction data selection step S064 selects two layer-1 reused features L1FU along the block scan direction D2 according to the layer-1 recomputed input feature block data L1FC_I, and combines L1FC_I with these layer-1 reused features L1FU to produce a layer-1 reused input feature block data L1FU_I. The layer-1 reused input feature block size of L1FU_I equals (B_W − 2i + 2) × (B_H + 2) = (10 − 2 + 2) × (4 + 2) = 10 × 6, as shown at layer L1 of Fig. 3, layer L1 of Fig. 5, and Fig. 6. In addition, the convolution operation step S066 selects a plurality of layer-1 sub-block input feature groups SBG1 (i.e., 3 × 3 features each) from the layer-1 reused input feature block data L1FU_I according to the i-th-layer kernel size (i.e., 3 × 3), performs a convolution on each layer-1 sub-block input feature group SBG1 with the convolution parameter set to produce layer-1 sub-block output features, and combines the layer-1 sub-block output features corresponding to these groups SBG1 to form the layer-1 output feature block data L1_O. The layer-1 output feature block size of L1_O equals (B_W − 2i) × B_H = (10 − 2) × 4 = 8 × 4, as shown at layer L1 of Fig. 3 and Fig. 5.
The layer-2 convolution operation (i = 2) includes the first-direction data selection step S062, the second-direction data selection step S064, and the convolution operation step S066. The first-direction data selection step S062 selects, according to the position of the output block data OB (i.e., the layer-3 output feature block data L3_O), four layer-2 recomputed features L2FC along the scan line-feed direction D1 (i.e., (D − i + 1) × (k − 1) of them), and then selects a layer-2 recomputed input feature block data L2FC_I according to the position of the output block data OB and these layer-2 recomputed features L2FC. The layer-2 recomputed input feature block data L2FC_I equals the layer-1 output feature block data L1_O. The layer-2 recomputed input feature block size of L2FC_I equals (B_W − 2i + 2) × B_H = (10 − 4 + 2) × 4 = 8 × 4, as shown at layer L2 of Fig. 3 and Fig. 4. Furthermore, the second-direction data selection step S064 selects two layer-2 reused features L2FU along the block scan direction D2 according to L2FC_I, and combines L2FC_I with these layer-2 reused features L2FU to produce a layer-2 reused input feature block data L2FU_I. The layer-2 reused input feature block size of L2FU_I equals (B_W − 2i + 2) × (B_H + 2) = (10 − 4 + 2) × (4 + 2) = 8 × 6, as shown at layer L2 of Fig. 3 and Fig. 5. In addition, the convolution operation step S066 selects a plurality of layer-2 sub-block input feature groups SBG2 (i.e., 3 × 3 features each) from L2FU_I according to the i-th-layer kernel size (i.e., 3 × 3), performs a convolution on each layer-2 sub-block input feature group SBG2 with the convolution parameter set to produce layer-2 sub-block output features, and combines the layer-2 sub-block output features corresponding to these groups SBG2 to form the layer-2 output feature block data L2_O. The layer-2 output feature block size of L2_O equals (B_W − 2i) × B_H = (10 − 4) × 4 = 6 × 4, as shown at layer L2 of Fig. 3 and Fig. 5.
The layer-3 convolution operation (i = 3) includes the first-direction data selection step S062, the second-direction data selection step S064, and the convolution operation step S066. The first-direction data selection step S062 selects, according to the position of the output block data OB (i.e., the layer-3 output feature block data L3_O), two layer-3 recomputed features L3FC along the scan line-feed direction D1 (i.e., (D − i + 1) × (k − 1) of them), and then selects a layer-3 recomputed input feature block data L3FC_I according to the position of the output block data OB and these layer-3 recomputed features L3FC. The layer-3 recomputed input feature block data L3FC_I equals the layer-2 output feature block data L2_O. The layer-3 recomputed input feature block size of L3FC_I equals (B_W − 2i + 2) × B_H = (10 − 6 + 2) × 4 = 6 × 4, as shown at layer L3 of Fig. 3 and Fig. 4. Furthermore, the second-direction data selection step S064 selects two layer-3 reused features L3FU along the block scan direction D2 according to L3FC_I, and combines L3FC_I with these layer-3 reused features L3FU to produce a layer-3 reused input feature block data L3FU_I. The layer-3 reused input feature block size of L3FU_I equals (B_W − 2i + 2) × (B_H + 2) = (10 − 6 + 2) × (4 + 2) = 6 × 6, as shown at layer L3 of Fig. 3 and Fig. 5. In addition, the convolution operation step S066 selects a plurality of layer-3 sub-block input feature groups SBG3 (i.e., 3 × 3 features each) from L3FU_I according to the i-th-layer kernel size (i.e., 3 × 3), performs a convolution on each layer-3 sub-block input feature group SBG3 with the convolution parameter set to produce layer-3 sub-block output features, and combines the layer-3 sub-block output features corresponding to these groups SBG3 to form the layer-3 output feature block data L3_O. The layer-3 output feature block data L3_O equals the output block data OB. The layer-3 output feature block size of L3_O equals (B_W − 2i) × B_H = (10 − 6) × 4 = 4 × 4, and the output block size of the output block data OB equals (B_W − 2D) × B_H = (10 − 6) × 4 = 4 × 4, as shown at layer L3 of Fig. 3 and Fig. 5.
In the block-based inference method 100 for a memory-efficient convolutional neural network implementation of the present invention, when at least one of the plurality of input features of an i-th-layer sub-block input feature group is located in the outer region of the i-th-layer reused input feature block data, the input features of that group include a plurality of outer-block features and a plurality of first inner-block features. The outer-block features represent features already computed for the previous block, while the first inner-block features represent features of the current block not yet computed. Moreover, when the input features of an i-th-layer sub-block input feature group are all located in the inner region of the i-th-layer reused input feature block data, the input features of that group include only a plurality of second inner-block features, which represent features of the current block not yet computed. Along the block scan direction D2, the i-th-layer reused input feature block data is arranged in the order of outer region then inner region. Taking Fig. 6 as an example, when 6 of the 9 input features of the layer-1 sub-block input feature group SBG11 are located in the outer region OR of the layer-1 reused input feature block data L1FU_I, the 9 input features of SBG11 include 6 outer-block features and 3 inner-block features. The outer-block features are already-computed features located in the outer region OR, and the inner-block features are not-yet-computed features located in the inner region IR. In addition, when the 9 input features of the layer-1 sub-block input feature group SBG12 are all located in the inner region IR of L1FU_I, the 9 input features of SBG12 include only 9 inner-block features; that is, all 9 input features are inner-block features. Along the block scan direction D2, L1FU_I is arranged in the order of outer region OR then inner region IR.
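The outer/inner classification can be illustrated with a small helper. The function below is hypothetical and for illustration only; `outer_rows` encodes the k − 1 reused rows forming the outer region OR at the top of the reused block.

```python
def classify_window(row, k=3, outer_rows=2):
    """Count outer-region (already computed, reused) vs. inner-region features
    in a k x k input window whose top row index is `row` within the reused
    block.  The first `outer_rows` (= k - 1) rows of the reused block form
    the outer region OR; the remaining rows form the inner region IR."""
    outer = sum(k for r in range(row, row + k) if r < outer_rows)
    inner = k * k - outer
    return outer, inner
```

A window such as SBG11 starting at the top of L1FU_I gives (6, 3) — six outer-block features and three inner-block features — while a window such as SBG12 starting at or below row k − 1 gives (0, 9), i.e., all inner-block features.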
It is also worth noting that, in the temporary storage step S08, the bottom k_Hi − 1 rows of the i-th layer's LiFC_I are stored in the block buffer for use by the next block, becoming the next block's LiFU. For example, after the layer-1 convolution operation of the block inference step S06 is performed, the temporary storage step S08 is performed: the bottom k_Hi − 1 rows of the layer-1 recomputed input feature block data L1FC_I are stored in the block buffer for the next block, i.e., they become the next block's layer-1 reused features L1FU. After the layer-2 convolution operation of the block inference step S06 is performed, the temporary storage step S08 is performed: the bottom k_Hi − 1 rows of the layer-2 recomputed input feature block data L2FC_I are stored in the block buffer for the next block, i.e., they become the next block's layer-2 reused features L2FU. After the layer-3 convolution operation of the block inference step S06 is performed, the temporary storage step S08 is performed: the bottom k_Hi − 1 rows of the layer-3 recomputed input feature block data L3FC_I are stored in the block buffer for the next block, i.e., they become the next block's layer-3 reused features L3FU. In this way, the amount of computation can be greatly reduced.
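The buffering rule above can be checked with a small simulation. To isolate the reuse mechanism along the scan direction D2, the sketch below processes full-width row strips and does not model the horizontal recomputation of step S062; the function names are ours, not the patent's. Each layer stores the bottom k_H − 1 rows of its input for the next strip, and the strip-by-strip result matches a whole-image stack of valid convolutions.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation, single channel."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((max(H - kh + 1, 0), max(W - kw + 1, 0)))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = float((x[r:r + kh, c:c + kw] * k).sum())
    return out

def block_scan_inference(x, kernels, bh):
    """Run a stack of valid convolutions strip-by-strip along the scan
    direction D2.  For each layer i, the bottom k_H - 1 rows of its input
    (the LiFU features) are kept in a per-layer line buffer and prepended
    to that layer's input for the next strip."""
    bufs = [None] * len(kernels)
    outs = []
    for top in range(0, x.shape[0], bh):
        feat = x[top:top + bh]
        for i, k in enumerate(kernels):
            if bufs[i] is not None:
                feat = np.vstack([bufs[i], feat])      # reuse buffered rows
            bufs[i] = feat[-(k.shape[0] - 1):].copy()  # store bottom k_H - 1 rows
            feat = conv2d_valid(feat, k)
        outs.append(feat)
    return np.vstack(outs)
```

With a 12 × 10 input, three 3 × 3 kernels, and strips of height 4, the concatenated strip outputs equal the 6 × 4 result of convolving the whole image at once, which is the equivalence the line buffer is designed to preserve.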
Please refer to Fig. 1 through Fig. 7 together, in which Fig. 7 is a schematic diagram of channel shuffling according to the second embodiment of the present invention. The inference flow of the present invention can be applied to channel-shuffle operations. The i-th-layer reused input feature block data LiFU_I has an i-th-layer reused input feature block size W1 × H1 and an i-th-layer reused input feature block channel count C1. The i-th-layer intermediate block data Li_M has an i-th-layer intermediate feature block size W2 × H2 and an i-th-layer intermediate feature block channel count C2. The i-th-layer output feature block data Li_O has an i-th-layer output feature block size W3 × H3 and an i-th-layer output feature block channel count C3. The i-th-layer output feature block size W3 × H3 is larger than the i-th-layer reused input feature block size W1 × H1, and W1 × H1 is larger than the i-th-layer intermediate feature block size W2 × H2, where W1, W2, and W3 are block widths and H1, H2, and H3 are block heights. In addition, the i-th-layer reused input feature block channel count C1 equals the i-th-layer output feature block channel count C3, and the i-th-layer intermediate feature block channel count C2 is greater than C1. For example, the i-th-layer reused input feature block size W1 × H1, the i-th-layer intermediate feature block size W2 × H2, and the i-th-layer output feature block size W3 × H3 may be 10 × 10, 8 × 8, and 16 × 16, respectively, and the channel counts C1, C2, and C3 may be 32, 128, and 32, respectively, but the present invention is not limited thereto.
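The patent does not spell out the shuffle operation itself; a common form is the group-wise channel shuffle used in ShuffleNet-style networks, sketched below. This is an assumption for illustration, not necessarily the exact operation of Fig. 7.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Group-wise channel shuffle: split the C channels into `groups`
    groups, then interleave channels across groups.  x has shape (C, H, W)."""
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group count"
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))
```

With 8 channels and 2 groups, channel order [0..7] becomes [0, 4, 1, 5, 2, 6, 3, 7]: each output channel alternates between the two groups, which is what lets grouped convolutions exchange information.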
Thereby, the present invention can realize the specified multi-layer convolution operation: during block-based inference, already-computed features are reused along the direction in which the blocks advance (i.e., the block scan direction D2), while recomputation is adopted along the other direction (i.e., the scan line-feed direction D1), so that block-based inference can still greatly reduce the bandwidth demand on external memory without excessively increasing the amount of computation or the size of the block buffer.
Please refer to Fig. 1, Fig. 2, Fig. 8, and Fig. 9 together, in which Fig. 8 is a block diagram of a block-based inference system 200 for a memory-efficient convolutional neural network implementation according to the third embodiment of the present invention, and Fig. 9 is a flow diagram of a multi-layer convolution operation with 3 × 3 filters according to the third embodiment of the present invention. As shown in the figures, the block-based inference system 200 for a memory-efficient convolutional neural network implementation processes an input image to produce an output image 110, and includes a block buffer 220 and an operation processing unit 230. The input block data IB, the inference parameter set 212, and the convolution parameter set 214 are input to the operation processing unit 230, and the output block data OB that is output composes the output image 110. The block buffer 220 stores the i-th-layer output feature block data and the plurality of i-th-layer reused features, and these two kinds of data are buffered in regions at different locations within the block buffer 220. In addition, the operation processing unit 230 is electrically connected to the block buffer 220; it receives the input image and is configured to implement the block-based inference method 100 for a memory-efficient convolutional neural network implementation of Fig. 1. The operation processing unit 230 includes a convolution engine 232 for performing convolution operations, and may be a microprocessor, a central processing unit, or an image processor, but the present invention is not limited thereto. L1, L2, and LD denote layer 1, layer 2, and layer D, respectively; layers L1 through LD all perform their operations through the convolution engine 232 of the operation processing unit 230. In addition, the block buffer 220 can store the outer-block features; the block buffer 220 has a buffer space that can be computed from the width B_Wi of the i-th-layer recomputed input feature block data, the convolution depth D, the layer index i, the channel count C, and the i-th-layer kernel size k_Wi × k_Hi. The buffer space is denoted LBS (Line Buffer Size) and satisfies the following equation (1): (1).
For example, if every layer (i.e., i = 1 to D for the i-th layer) performs the first-direction data selection step S062, the second-direction data selection step S064, and the convolution operation step S066, and k_Wi = k_Hi = k = 3, then the buffer space satisfies the following equation (2): (2).
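The bodies of equations (1) and (2) were lost in extraction. One plausible reconstruction, from the buffering rule that each layer stores the bottom k_Hi − 1 rows of its recomputed input (width B_W − 2i + 2) over C channels, is LBS = Σ_{i=1}^{D} C · (k_Hi − 1) · (B_W − 2i + 2), which for k = 3 reduces to 2 · C · D · (B_W + 1 − D). This is our reading, not the patent's formula.

```python
def line_buffer_size(bw, c, depth, k=3):
    """HYPOTHETICAL reconstruction of the line-buffer size LBS: each layer i
    stores the bottom (k - 1) rows of its recomputed input block, whose
    width is B_W - 2i + 2, across C channels.  The patent's equations (1)
    and (2) were lost in extraction; this is one plausible reading only."""
    return sum((k - 1) * (bw - 2 * i + 2) * c for i in range(1, depth + 1))
```

Under this reading, the running example (B_W = 10, C = 32, D = 3, k = 3) needs 2 · 32 · 3 · 8 = 1536 buffered feature entries.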
Thereby, the block-based inference system 200 for a memory-efficient convolutional neural network implementation of the present invention, by computing with different feature-handling schemes in different directions, enables block-based inference to greatly reduce the external-memory bandwidth demand for the input block data IB and the output block data OB without excessively increasing the amount of computation or the size of the block buffer 220.
Please refer to Fig. 1 and Fig. 10 together, in which Fig. 10 shows the comparison results of feature recomputing (FC), feature reusing (FU), and the feature recomputing-and-reusing (FCFU) of the present invention. The parameter settings are: the product value A is set to 64², the size of the output image 110 is 960 × 540, and k_Wi = k_Hi = k. The product value A is the minimum of the product of the block width B_W and the block height B_H. The multi-layer convolution operation of the present invention has a normalized throughput ratio (NTR), which is obtained from the convolution depth D and the normalized computing ratio (NCR), while the NCR is obtained from the block width B_W, the block height B_H, the convolution depth D, and the variable h. The NTR and NCR of the present invention satisfy the following equations (3) and (4), respectively: (3); (4).
As can be seen from Fig. 10, if the block buffer 220 is subject to a block buffer size limit S, the maximum supported convolution depth Dmax of feature reusing (FU) is the shallowest of the three. Conversely, feature recomputing (FC) can support a wide range of model convolution depths, but its higher computational complexity greatly reduces the normalized throughput ratio NTR. The feature recomputing-and-reusing (FCFU) of the present invention not only supports a wider range of model convolution depths than feature reusing (FU), but also provides a better normalized throughput ratio NTR than feature recomputing (FC).
As can be seen from the above embodiments, the present invention has the following advantages. First, the block-based inference method for a memory-efficient convolutional neural network implementation of the present invention, by computing with different feature-handling schemes in different directions, enables block-based inference to greatly reduce the external-memory bandwidth demand without excessively increasing the amount of computation or the size of the block buffer. Second, the block-based inference system for a memory-efficient convolutional neural network implementation of the present invention achieves the same reduction in external-memory bandwidth demand by the same principle. Third, the recomputing-and-reusing of the present invention not only supports a wider range of model convolution depths than reusing alone, but also provides a better normalized throughput ratio than recomputing alone.
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention; therefore, the scope of protection of the present invention shall be defined by the appended claims.
100: block-based inference method for memory-efficient convolutional neural network implementation
S02: parameter setting step
S04: segmentation step
S06: block inference step
S062: first-direction data selection step
S064: second-direction data selection step
S066: convolution operation step
S08: temporary storage step
110: output image
200: block-based inference system for memory-efficient convolutional neural network implementation
212: inference parameter set
214: convolution parameter set
220: block buffer
230: operation processing unit
232: convolution engine
B_W, W1, W2, W3: block width
B_H, H1, H2, H3: block height
C1: i-th-layer reused input feature block channel count
C2: i-th-layer intermediate feature block channel count
C3: i-th-layer output feature block channel count
D: convolution depth
Dmax: maximum supported convolution depth
D1: scan line-feed direction
D2: block scan direction
FC: feature recomputing
FU: feature reusing
FCFU: feature recomputing and reusing
IB: input block data
IR: inner region
k−1: reused-feature count
L1: layer 1
L1FC: layer-1 recomputed features
L1FC_I: layer-1 recomputed input feature block data
L1FU: layer-1 reused features
L1FU_I: layer-1 reused input feature block data
L1_O: layer-1 output feature block data
L2: layer 2
L2FC: layer-2 recomputed features
L2FC_I: layer-2 recomputed input feature block data
L2FU: layer-2 reused features
L2FU_I: layer-2 reused input feature block data
L2_O: layer-2 output feature block data
L3: layer 3
L3FC: layer-3 recomputed features
L3FC_I: layer-3 recomputed input feature block data
L3FU: layer-3 reused features
L3FU_I: layer-3 reused input feature block data
L3_O: layer-3 output feature block data
LD: layer D
LiFU_I: i-th-layer reused input feature block data
Li_M: i-th-layer intermediate block data
Li_O: i-th-layer output feature block data
NTR: normalized throughput ratio
OB: output block data
OR: outer region
S: block buffer size limit
SBG1, SBG11, SBG12: layer-1 sub-block input feature groups
SBG2: layer-2 sub-block input feature group
SBG3: layer-3 sub-block input feature group
Fig. 1 is a flow diagram of the block-based inference method for a memory-efficient convolutional neural network implementation according to the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the segmentation step of Fig. 1;
Fig. 3 is a three-dimensional schematic diagram of the input block data and the output block data of the multi-layer convolution operation of the block inference step of Fig. 1;
Fig. 4 is a schematic diagram of the first-direction data selection step of Fig. 1;
Fig. 5 is a schematic diagram of the second-direction data selection step of Fig. 1;
Fig. 6 is a schematic diagram of the layer-1 reused input feature block data of Fig. 3;
Fig. 7 is a schematic diagram of channel shuffling according to the second embodiment of the present invention;
Fig. 8 is a block diagram of the block-based inference system for a memory-efficient convolutional neural network implementation according to the third embodiment of the present invention;
Fig. 9 is a flow diagram of a multi-layer convolution operation with 3 × 3 filters according to the fourth embodiment of the present invention; and
Fig. 10 is a schematic diagram of the simulation results of recomputing, reusing, and the recomputing-and-reusing of the present invention.
100: block-based inference method for memory-efficient convolutional neural network implementation
S02: parameter setting step
S04: segmentation step
S06: block inference step
S062: first-direction data selection step
S064: second-direction data selection step
S066: convolution operation step
S08: temporary storage step
Claims (16)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/064,561 US20210103793A1 (en) | 2019-10-08 | 2020-10-06 | Block-based inference method for memory-efficient convolutional neural network implementation and system thereof |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962912630P | 2019-10-08 | 2019-10-08 | |
US62/912,630 | 2019-10-08 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202115624A TW202115624A (en) | 2021-04-16 |
TWI765336B true TWI765336B (en) | 2022-05-21 |
Family
ID=75300104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW109130493A TWI765336B (en) | 2019-10-08 | 2020-09-04 | Block-based inference method for memory-efficient convolutional neural network implementation and system thereof |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112633462A (en) |
TW (1) | TWI765336B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114118389B (en) * | 2022-01-28 | 2022-05-10 | 深圳鲲云信息科技有限公司 | Neural network data processing method, device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI622939B (en) * | 2015-05-21 | 2018-05-01 | 谷歌有限責任公司 | Method,system and computer-readable medium for performing neural network computations for a neural network |
US20190012559A1 (en) * | 2017-07-06 | 2019-01-10 | Texas Instruments Incorporated | Dynamic quantization for deep neural network inference system and method |
CN110175636A (en) * | 2019-05-08 | 2019-08-27 | 深圳欧翼思特科技有限公司 | A kind of Internet of Things deep neural network distribution differentiation inference system and method |
TW201935327A (en) * | 2018-02-09 | 2019-09-01 | 宏達國際電子股份有限公司 | Adjustment method for convolutional neural network and electronic apparatus |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017015649A1 (en) * | 2015-07-23 | 2017-01-26 | Mireplica Technology, Llc | Performance enhancement for two-dimensional array processor |
US10048826B2 (en) * | 2016-10-04 | 2018-08-14 | Sas Institute Inc. | Interactive visualizations of a convolutional neural network |
US20180096249A1 (en) * | 2016-10-04 | 2018-04-05 | Electronics And Telecommunications Research Institute | Convolutional neural network system using adaptive pruning and weight sharing and operation method thereof |
US20180131946A1 (en) * | 2016-11-07 | 2018-05-10 | Electronics And Telecommunications Research Institute | Convolution neural network system and method for compressing synapse data of convolution neural network |
CN106779146A (en) * | 2016-11-15 | 2017-05-31 | 广州铁路职业技术学院 | A kind of tourism service system for providing recommendation tourism route |
CN108415881A (en) * | 2017-02-10 | 2018-08-17 | 耐能股份有限公司 | The arithmetic unit and method of convolutional neural networks |
KR101847874B1 (en) * | 2017-06-28 | 2018-05-25 | 서경대학교 산학협력단 | Image recognition method using convolution neural network and recording medium thereof |
CN107437110B (en) * | 2017-07-11 | 2021-04-02 | 中国科学院自动化研究所 | Block convolution optimization method and device of convolutional neural network |
JP6778842B2 (en) * | 2017-07-21 | 2020-11-04 | センスタイム グループ リミテッド | Image processing methods and systems, storage media and computing devices |
KR102561261B1 (en) * | 2017-11-14 | 2023-07-28 | 삼성전자주식회사 | Apparatus and method for processing convolution operation using kernel |
US11227214B2 (en) * | 2017-11-14 | 2022-01-18 | Advanced Micro Devices, Inc. | Memory bandwidth reduction techniques for low power convolutional neural network inference applications |
US10565285B2 (en) * | 2017-12-18 | 2020-02-18 | International Business Machines Corporation | Processor and memory transparent convolutional lowering and auto zero padding for deep neural network implementations |
Legal events: 2020-09-04 — TW application TW109130493A, patent TWI765336B (active); 2020-09-04 — CN application CN202010922472.8A, publication CN112633462A (pending).
Also Published As
Publication number | Publication date |
---|---|
TW202115624A (en) | 2021-04-16 |
CN112633462A (en) | 2021-04-09 |