KR102023855B1

KR102023855B1 - Deep Learning Running Hardware Accelerator

Info

Publication number: KR102023855B1
Application number: KR1020180154989A
Authority: KR
Inventors: 이상설; 장성준
Original assignee: 전자부품연구원
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2019-09-20
Also published as: WO2020116672A1

Abstract

Provided is a deep learning hardware accelerator that processes data using only an input image, without a preprocessor such as the image-to-column operation (im2col). According to an embodiment of the present invention, the deep learning hardware accelerator comprises: first registers storing some of the pixels of the lines of an input image; first memories storing the remaining pixels of the lines of the input image; and a first kernel performing an operation using the pixels stored in the registers. Accordingly, data can be processed using only the input image, without a preprocessor such as im2col, so that a deep learning hardware accelerator suited to a hardware structure can be implemented.

Description

Deep Learning Hardware Accelerator {Deep Learning Running Hardware Accelerator}

The present invention relates to System on Chip (SoC) technology for image processing, and more particularly to the structure and design of a hardware accelerator for deep learning processing of an input image.

FIG. 1 is a diagram illustrating the concept of im2col (image to column operation). im2col produces the data for convolution by performing a function similar to a sliding window, and it relies on a GPU (Graphics Processing Unit).

im2col generates, for each channel, the data corresponding to the kernel from the input image, stores it in internal or external storage, and then loads that data to perform the operation. Using im2col is said to allow data operations to run at high speed.
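The im2col procedure described above can be sketched in a few lines. This is a minimal single-channel illustration, assuming stride 1 and no padding; the function names `im2col` and `conv_via_im2col` are hypothetical and do not come from the patent or from any particular library.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw window of a 2-D image (list of rows) into a column.

    Assumes stride 1 and no padding. Each returned column holds the kh*kw
    pixels of one window in row-major order.
    """
    height, width = len(image), len(image[0])
    columns = []
    for top in range(height - kh + 1):
        for left in range(width - kw + 1):
            window = [image[top + r][left + c]
                      for r in range(kh) for c in range(kw)]
            columns.append(window)
    return columns


def conv_via_im2col(image, kernel):
    """Apply the kernel as a dot product between each column and the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flat = [w for row in kernel for w in row]
    return [sum(p * w for p, w in zip(col, flat))
            for col in im2col(image, kh, kw)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
cols = im2col(image, 2, 2)
print(cols)  # [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
```

Note that the nine input pixels become sixteen stored values: the pre-built columns duplicate every pixel that is shared by neighbouring windows, and that intermediate data has to be stored somewhere before the operation runs.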

However, this approach cannot be used in an SoC or other hardware device, because hardware is inherently limited in memory. In particular, when the input image (input feature map) is large, the generated data becomes so large that storing it is not feasible.

Consequently, a hardware implementation of im2col has to store this data in external high-capacity, low-speed storage and fetch it from that external storage every time, which rules out high-speed processing.

In addition, relative to the input image, each feature map requires 32 bits, so the memory space for storing this data is large and loading it requires a lot of bandwidth.

The present invention has been devised to solve the above problems, and its object is to provide a deep learning hardware accelerator that processes data using only the input image, without a preprocessor such as im2col.

To achieve the above object, a deep learning hardware accelerator according to an embodiment of the present invention includes: first registers that store some of the pixels of the lines of an input image; first memories that store the remaining pixels of the lines of the input image; and a first kernel that performs an operation using the pixels stored in the registers.

The first registers may be divided in pixel units.

The first memories may be divided in line units.

Each of the first registers may be located in front of a respective one of the first memories, so that data stored in the first registers can be shifted into the first memories.

Data stored in the first memories may be shifted into the first registers located in the next row.

The row × column arrangement of the first registers may be determined by the row × column size of the filter that the kernel uses for the operation.

The deep learning hardware accelerator according to the present invention may further include: second registers that store data generated by the operation of the first kernel; and second memories, each located after a respective one of the second registers, that store the shifted data.

According to another aspect of the present invention, a deep learning hardware acceleration method includes: storing, by first registers, some of the pixels of the lines of an input image; storing, by first memories, the remaining pixels of the lines of the input image; and performing, by a first kernel, an operation using the pixels stored in the registers.

As described above, according to embodiments of the present invention, a deep learning hardware accelerator suited to a hardware structure can be implemented by processing data using only the input image, without a preprocessor such as im2col.

In addition, according to embodiments of the present invention, the same hardware structure can be applied without increasing memory or reducing the bit width, regardless of how large the image resolution becomes, which allows flexible hardware block design.

Furthermore, according to embodiments of the present invention, a speed improvement can be expected because fewer accesses are needed per pixel, which reduces the number of memory fetches.

FIG. 1 is a diagram illustrating the concept of im2col;
FIG. 2 is a block diagram of a deep learning processing system according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the storage space of the input image memory shown in FIG. 2; and
FIG. 4 is a diagram provided to explain the concept of a deep learning hardware accelerator according to another embodiment of the present invention.

Hereinafter, the present invention is described in more detail with reference to the drawings.

As with the conventional image-generating preprocessor shown in FIG. 1, if the kernel for generating an image is assumed to use three filters (2x2), the image data must be arranged in advance to match the filter size, and in generating the data for the next kernel computation, data of 2 pixels on average must be stored redundantly. This amounts to 2.25 times the total storage space, which is a large storage requirement.
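The 2.25 figure can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions: FIG. 1 is not reproduced in this text, so the 4 × 4 input size, stride 1 and absence of padding are assumptions chosen because they yield 2.25 for a 2x2 filter; the helper name `im2col_expansion` is hypothetical.

```python
from fractions import Fraction


def im2col_expansion(height, width, kh, kw):
    """Ratio of values stored after im2col to pixels in the original image.

    Assumes stride 1 and no padding, so there are
    (height - kh + 1) * (width - kw + 1) windows of kh * kw pixels each.
    """
    windows = (height - kh + 1) * (width - kw + 1)
    return Fraction(windows * kh * kw, height * width)


# Assumed example: 4 x 4 input, 2 x 2 filter -> 9 windows * 4 pixels / 16 pixels
ratio = im2col_expansion(4, 4, 2, 2)
print(float(ratio))  # 2.25
```

Under the same assumptions the ratio grows toward kh × kw as the image gets larger, so a 3 × 3 filter over a 608 × 608 image stores close to nine times the original data.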

In contrast, the deep learning hardware accelerator according to an embodiment of the present invention works only with the input image or the processed image, not a preprocessed image, so in terms of hardware memory structure it can be implemented with only the basic required area.

FIG. 2 is a block diagram of a deep learning processing system according to an embodiment of the present invention. As shown in FIG. 2, the deep learning processing system includes an input image memory 110, a deep learning accelerator 120, and a processing unit 130.

The input image memory 110 is a memory that stores the input image. FIG. 3 illustrates its storage space, assuming an input image memory 110 for storing a 608 × 608 image.

The storage space shown in FIG. 3 is the memory allocated for one input channel. Addresses are assigned as predictable values derived from (line - 1) (2nd line: 0x00001xxx, 608th line: 0x0025Fxxx).
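The address scheme above can be checked with a small calculation. This sketch assumes that the three "x" hex digits in the example addresses are a 12-bit offset within a line, so that (line - 1) occupies the bits above them; the names `LINE_SHIFT` and `line_base_address` are hypothetical.

```python
LINE_SHIFT = 12  # assumed: "xxx" = three hex digits = 12 offset bits per line


def line_base_address(line):
    """Base address of a 1-indexed image line: (line - 1) placed above the offset bits."""
    if line < 1:
        raise ValueError("lines are numbered from 1")
    return (line - 1) << LINE_SHIFT


print(f"0x{line_base_address(2):08X}")    # 0x00001000
print(f"0x{line_base_address(608):08X}")  # 0x0025F000
```

Because the line index maps directly onto the upper address bits, the start of any line can be computed without a lookup table.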

The assumed image size is only an example for explanation; the technical idea of the present invention applies equally to other image sizes.

The description now returns to FIG. 2.

The deep learning accelerator 120 performs operations using only the input image supplied from the input image memory 110, without a preprocessor such as im2col. The detailed structure of the deep learning accelerator 120 is described with reference to FIG. 4.

The processing unit 130 performs the necessary subsequent processing on the operation result (feature map) output from the deep learning accelerator 120.

The deep learning accelerator 120 shown in FIG. 2 is now described in detail with reference to FIG. 4. FIG. 4 is a diagram provided to explain the concept of a deep learning hardware accelerator according to another embodiment of the present invention.

As shown in FIG. 4, the deep learning accelerator 120 according to an embodiment of the present invention includes a buffer 121, register set #1 (122), block RAM set #1 (123), filter #1 (124), register set #2 (125), and block RAM set #2 (126).

The buffer 121 is a storage space that takes in the input image stored in the input image memory 110 and passes it to register set #1 (122) pixel by pixel.

Register set #1 (122) consists of nine registers (IF22 to IF00) arranged 3 × 3. Block RAM set #1 (123) consists of two block RAMs (Block RAM0, Block RAM1) arranged in line units.

An image pixel output from the buffer 121 is shifted, one position per clock, through the registers in row 1 (IF22, IF21, IF20), then shifted through the block RAM in row 1 (Block RAM0), and then moves to the register in row 2, column 1 (IF12).

The pixel is then shifted, one position per clock, through the registers in row 2 (IF12, IF11, IF10), then shifted through the block RAM in row 2 (Block RAM1), and then moves to the register in row 3, column 1 (IF02).

Finally, the pixel is shifted, one position per clock, through the registers in row 3 (IF02, IF01, IF00).

During this process, the kernel (not shown) performs an operation with filter #1 (124) every clock, using the nine image pixel values stored in register set #1 (122).
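The shift path described above can be modelled in software as rows of registers separated by line FIFOs. The following is a behavioural sketch, not the patented circuit: it assumes each block RAM holds the width - K pixels of a line that are not in registers, that register IFrc holds the pixel at row r, column c of the current window (IF00 oldest, IF22 newest), and that the operation is a multiply-accumulate. The names `LineBufferWindow` and `stream_convolve` are hypothetical.

```python
from collections import deque


class LineBufferWindow:
    """K x K register set with K-1 line FIFOs standing in for the block RAMs."""

    def __init__(self, width, k):
        if width <= k:
            raise ValueError("line width must exceed the window size")
        self.k = k
        # regs[0] is the row fed by the buffer (IF22, IF21, IF20 when k == 3).
        self.regs = [deque([0] * k) for _ in range(k)]
        # Assumed depth: each FIFO holds the remaining width - k pixels of a line.
        self.brams = [deque([0] * (width - k)) for _ in range(k - 1)]

    def clock(self, pixel):
        """Advance one clock: every register row and FIFO shifts by one position."""
        carry = pixel
        for row in range(self.k):
            self.regs[row].appendleft(carry)
            carry = self.regs[row].pop()       # value leaving this register row
            if row < self.k - 1:
                self.brams[row].appendleft(carry)
                carry = self.brams[row].pop()  # value entering the next register row

    def window(self):
        """Return the K x K window with the oldest pixel at the top-left."""
        k = self.k
        return [[self.regs[k - 1 - r][k - 1 - c] for c in range(k)]
                for r in range(k)]


def stream_convolve(image, kernel):
    """Feed pixels in raster order, one per clock, and apply the kernel to each full window."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    lb = LineBufferWindow(width, k)
    out = []
    for y in range(height):
        for x in range(width):
            lb.clock(image[y][x])
            if y >= k - 1 and x >= k - 1:  # registers hold a window inside the image
                win = lb.window()
                out.append(sum(win[r][c] * kernel[r][c]
                               for r in range(k) for c in range(k)))
    return out


image = [[y * 5 + x for x in range(5)] for y in range(4)]  # 4 lines of 5 pixels
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(stream_convolve(image, box))  # [54, 63, 72, 99, 108, 117]
```

Each pixel is read from the input once and then only shifts through local storage, which is the contrast with an im2col-style approach that writes out and re-reads a pre-built copy of every window.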

As shown in FIG. 4, the deep learning accelerator 120 according to an embodiment of the present invention places register set #1 (122) in front of block RAM set #1 (123), so that some pixels of each input image line are held in register set #1 (122) and used by the kernel for the operation.

The remaining pixels, which are needed for subsequent operations, are shifted while held in block RAM set #1 (123), located after register set #1 (122), and are then passed back into register set #1 (122).

Unlike the block RAMs (Block RAM0, Block RAM1) of block RAM set #1 (123), which are divided in line units, the registers (IF22 to IF00) of register set #1 (122) are implemented as separate per-pixel registers, so that the kernel can read all of the pixel data in the same clock.

The 3 × 3 size of register set #1 (122) matches the 3 × 3 size of the filter 124 used for the kernel operation. If the filter 124 used for the kernel operation were 4 × 4, register set #1 (122) would also have to be implemented with 4 × 4 registers.
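The dependence of the storage on the filter size can be made concrete with a rough sizing sketch. The patent gives the register and block RAM counts (nine registers with two block RAMs for 3 × 3, sixteen registers with three block RAMs for 4 × 4) but not the block RAM depths; the depth of width - K pixels used here is an assumption based on the registers holding K pixels of each line and the memories holding the rest. The helper name `line_buffer_storage` is hypothetical.

```python
def line_buffer_storage(width, k):
    """Estimate storage for a K x K register set plus K-1 line-sized block RAMs.

    Assumption: each block RAM holds the width - k pixels of a line that are
    not currently held in that row's k registers.
    """
    registers = k * k
    block_rams = k - 1
    pixels_per_block_ram = width - k
    return {
        "registers": registers,
        "block_rams": block_rams,
        "pixels_per_block_ram": pixels_per_block_ram,
        "total_pixels": registers + block_rams * pixels_per_block_ram,
    }


print(line_buffer_storage(608, 3))
# {'registers': 9, 'block_rams': 2, 'pixels_per_block_ram': 605, 'total_pixels': 1219}
print(line_buffer_storage(608, 4)["total_pixels"])  # 1828
```

Under these assumptions the total is (K - 1) × width + K pixels, so it scales with the line width and filter size rather than with the number of lines in the image.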

Register set #2 (125) consists of 16 registers (F2321 to F0000) arranged 4 × 4. Block RAM set #2 (126) consists of three block RAMs (Block RAM0, Block RAM1, Block RAM2) arranged in line units.

The operation result of the kernel (feature map data) is shifted, one position per clock, through the registers in row 1, then shifted through the block RAM in row 1, and then moves to the register in row 2, column 1.

The shifted operation result is then shifted, one position per clock, through the registers in row 2, then shifted through the block RAM in row 2, and then moves to the register in row 3, column 1.

The shifted operation result is then shifted, one position per clock, through the registers in row 3, then shifted through the block RAM in row 3, and then moves to the register in row 4, column 1.

Next, the shifted data is shifted, one position per clock, through the registers in row 4.

During this process, the kernel (not shown) performs an operation with a 4 × 4 filter #2 (not shown) every clock, using the 16 image pixels stored in register set #2 (125).

FIG. 4 shows only part of the overall configuration of the deep learning accelerator 120. Further register sets and block RAM sets may of course be added after the illustrated stages for subsequent operations.

A preferred embodiment of the deep learning hardware accelerator has been described in detail above.

The embodiments of the present invention present a design for a hardware accelerator structure for deep learning processing of an input image. Specifically, they present a deep learning hardware accelerator that uses no preprocessor such as im2col, has a simple structure that needs no redesign when the algorithm changes, and improves processing speed by eliminating unnecessary image generation operations.

As a result, the deep learning accelerator processes data using only the incoming image rather than preprocessed data, which reduces the number of accesses to large-capacity memory. Kernel-based processing can also be expected to improve processing speed and eliminate the operations that generate preprocessed images.

In addition, processing data using only the input image, without a preprocessor such as im2col, yields a deep learning hardware accelerator suited to a hardware structure. The same hardware structure can therefore be applied without increasing memory or reducing the bit width, regardless of how large the image resolution becomes, which allows flexible hardware block design, and speed can be improved because fewer accesses per pixel mean fewer memory fetches.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Those of ordinary skill in the art to which the invention pertains may make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

110: input image memory
120: deep learning accelerator
121: buffer
122: register #1
123: block RAM #1
124: filter #1
125: register #2
126: block RAM #2

Claims

A deep learning hardware accelerator comprising:
first registers that store some of the pixels of the lines of an input image;
first memories that store the remaining pixels of the lines of the input image; and
a first kernel that performs an operation using the pixels stored in the registers,
wherein the first registers are divided in pixel units,
the first memories are divided in line units, and
each of the first registers is located in front of a respective one of the first memories, so that data stored in the first registers is shifted into the first memories.

(Deleted)

The deep learning hardware accelerator of claim 1,
wherein data stored in the first memories
is shifted into the first registers located in the next row.

The deep learning hardware accelerator of claim 1,
wherein the row × column arrangement of the first registers
is determined by the row × column size of the filter that the kernel uses for the operation.

The deep learning hardware accelerator of claim 6, further comprising:
second registers that store data generated by the operation of the first kernel; and
second memories, each located after a respective one of the second registers, that store the shifted data.

A deep learning hardware acceleration method comprising:
storing, by first registers, some of the pixels of the lines of an input image;
storing, by first memories, the remaining pixels of the lines of the input image; and
performing, by a first kernel, an operation using the pixels stored in the registers,
wherein the first registers are divided in pixel units,
the first memories are divided in line units, and
each of the first registers is located in front of a respective one of the first memories, so that data stored in the first registers is shifted into the first memories.