KR102518066B1

KR102518066B1 - Neural network circuit having round robin data transfer scheme

Info

Publication number: KR102518066B1
Application number: KR1020190122458A
Authority: KR
Inventors: 박내범; 김재준
Original assignee: 포항공과대학교 산학협력단
Priority date: 2019-10-02
Filing date: 2019-10-02
Publication date: 2023-04-04
Also published as: KR20210039801A

Abstract

A neural network circuit according to the present embodiment includes a memory stack in which a plurality of dies on which memory is formed are stacked, and a logic die on which are formed a plurality of engines that perform neural network operations and a data providing unit that obtains data stored in the memory stack and provides it to the plurality of engines. The data providing unit obtains, from the memory stack, data to be provided to any one of the plurality of engines and provides the data to that engine.

Description

Neural network circuit having round robin data transfer scheme {NEURAL NETWORK CIRCUIT HAVING ROUND ROBIN DATA TRANSFER SCHEME}

The present technology relates to a neural network circuit that uses a round robin data transfer scheme.

To overcome the structural limitations of chips based on the conventional von Neumann architecture, IC chip developers have been developing neural network hardware, or neuromorphic hardware, based on neural networks composed of neurons, the basic units of the human brain, and the synapses that connect those neurons. Neural networks go beyond the limits of earlier machine learning algorithms and show image, video, and pattern learning and recognition capabilities close to those of humans, and they are already used in many fields. Many companies and researchers have been developing dedicated ASIC chips to perform neural network computations faster and at lower power.

Neural network training and inference are performed using customized hardware because running a neural network training program (or inference program) means performing a large number of parallel operations. That is, the weights between neurons can be expressed as a weight matrix, and the computation can be expressed as multiplying an input vector by the weight matrix to obtain an output vector. Even so, processing a neural network requires an enormous amount of computation.

The more operators (arithmetic units) are used, the higher the overall processing speed of the neural network. However, overall processing speed is not necessarily proportional to the number of operators. The speed gain from additional operators can be guaranteed only if the values needed for computation can be delivered to those operators without interruption. That is, when the number of operators is increased, the bandwidth of the data delivered to the operators must increase by the same proportion.
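The relationship between operator count and data bandwidth can be illustrated with a minimal sketch. It is not part of the disclosed embodiment: the function name, the parameters, and all numeric values are hypothetical and chosen only for illustration. The sketch assumes that achievable throughput is limited by whichever is smaller, the aggregate compute rate or the rate at which operands can be supplied.

```python
def achievable_throughput(num_engines, ops_per_engine_per_s,
                          bandwidth_bytes_per_s, bytes_per_op):
    """Upper bound on operations per second (hypothetical model).

    The bound is the smaller of what the engines can compute and
    what the memory interface can feed them.
    """
    compute_bound = num_engines * ops_per_engine_per_s
    supply_bound = bandwidth_bytes_per_s / bytes_per_op
    return min(compute_bound, supply_bound)


engine_counts = (1, 2, 4, 8, 16)

# Case 1: bandwidth stays fixed while engines are added (illustrative values).
fixed_bw = [achievable_throughput(n, 1e9, 8e9, 2) for n in engine_counts]

# Case 2: bandwidth grows in proportion to the number of engines.
scaled_bw = [achievable_throughput(n, 1e9, 2e9 * n, 2) for n in engine_counts]

for n, f, s in zip(engine_counts, fixed_bw, scaled_bw):
    print(f"{n:2d} engines: fixed bandwidth {f:.1e} ops/s, scaled bandwidth {s:.1e} ops/s")
```

Under these assumed numbers, throughput with fixed bandwidth stops improving once four engines are in use, whereas throughput with proportionally scaled bandwidth keeps growing with the engine count.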

The present technology is intended to overcome the above disadvantages of prior art neural network devices and to improve the processing speed of a neural network circuit.

In one aspect of the neural network circuit according to the present embodiment, the plurality of stacked dies of the memory stack include through silicon vias (TSVs) formed at the same positions as one another.

In one aspect of the neural network circuit according to the present embodiment, the data providing unit obtains data from the memory stack through a plurality of through silicon vias.

In one aspect of the neural network circuit according to the present embodiment, the data providing unit includes a register that latches data provided through the through silicon vias.

In one aspect of the neural network circuit according to the present embodiment, the data providing unit obtains, from the memory stack, data to be provided to any one of the plurality of engines and provides it to that engine. While that engine performs an operation using the provided data, the data providing unit obtains, from the memory stack, data to be provided to another one of the plurality of engines and provides it to the other engine.

According to the present embodiment, data is provided to an engine at high bandwidth, and results computed by an engine are provided to the memory stack at high bandwidth, so that a faster processing speed and higher throughput can be obtained than with the prior art (distributed data fetching scheme).

FIG. 1 is a schematic diagram of a neural network circuit according to the present embodiment.
FIG. 2 is a plan view of one memory die included in the memory stack.
FIG. 3 is a plan view of the logic die.
FIG. 4 is a schematic diagram for explaining the function of the data providing unit included in the logic die.
FIG. 5 is a timing diagram showing the time taken to process the third layer of AlexNet, one example of a neural network, using the prior art and the present embodiment.

Hereinafter, the neural network circuit according to the present embodiment is described with reference to the accompanying drawings. FIG. 1 is a schematic diagram of a neural network circuit 1 according to the present embodiment. Referring to FIG. 1, the neural network circuit 1 includes a memory stack 100 in which a plurality of memory dies 100a on which memory is formed are stacked, and a logic die 200 on which are formed a plurality of engines 210 (see FIG. 3) that perform neural network operations and a data providing unit 230 (see FIG. 3) that obtains data stored in the memory stack and provides it to the plurality of engines.

FIG. 2 is a plan view of one memory die 100a included in the memory stack 100. Referring to FIGS. 1 and 2, the memory die 100a includes a memory area 110, which is the area where memory is formed, and a plurality of through silicon vias (TSVs) 120 that transfer data between the memory area 110 and the logic die 200. In one embodiment, a dynamic RAM (DRAM), which reads or writes data by charging or not charging a capacitor, may be formed in the memory area 110. As another example, the memory may be any one of a static RAM (SRAM), a magnetic random access memory (MRAM) that uses the magnetoresistive characteristics of a magnetic tunnel junction (MTJ), and similar memories.

The plurality of through silicon vias 120 receive data from the memory formed in the memory area 110 and provide it to the data providing unit 230 (see FIG. 3) formed on the logic die 200. Through silicon vias 120 may also be formed in the logic die 200 at positions corresponding to the through silicon vias 120 formed in the memory stack 100.

In one embodiment, the through silicon vias 120 may be located at the center of the memory die 100a. Because the through silicon vias 120 are located at the center of the memory die 100a, timing problems caused by positional relationships within the memory die 100a during data transfer can be minimized.

FIG. 3 is a plan view of the logic die 200, and FIG. 4 is a schematic diagram for explaining the function of the data providing unit 230 included in the logic die 200. Referring to FIGS. 3 and 4, the logic die 200 contains the data providing unit 230, which outputs data provided through the through silicon vias 120 to a target engine 210, and a plurality of engines 210, which receive data from the data providing unit 230 and perform neural network operations.

The data providing unit 230 receives data from the memory stack 100 through the through silicon vias 120, latches the received data in synchronization with a clock signal (CLK), and outputs it.
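The clocked capture described above can be pictured with a small behavioral model. This is only an illustrative sketch under assumed names: the class `DataRegister` and its methods `drive` and `rising_edge` are hypothetical and do not appear in the disclosure, and the model ignores setup and hold timing, data width, and the number of registers actually used.

```python
class DataRegister:
    """Toy model of a clocked register between the TSVs and the engines."""

    def __init__(self):
        self.d = None  # input side: value currently presented by the TSVs
        self.q = None  # output side: value currently held for the engine

    def drive(self, value):
        """The TSVs present a new value; the output does not change yet."""
        self.d = value

    def rising_edge(self):
        """On a clock edge the register captures its input and outputs it."""
        self.q = self.d
        return self.q


reg = DataRegister()
reg.drive(0xA1)
before_edge = reg.q         # nothing has been captured yet
first = reg.rising_edge()   # the first word is captured on the clock edge
reg.drive(0xB2)
held = reg.q                # the output holds the first word until the next edge
second = reg.rising_edge()  # the second word is captured
```

The point of the model is that the value seen by the engine changes only at clock edges, even though the value on the vias may change in between.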

The data providing unit 230 provides data to each engine by sequentially delivering high-bandwidth data to one neural network engine at a time (round-robin data fetching scheme). For example, when the data providing unit 230 provides data to one neural network engine 210, that neural network engine 210 performs an operation using the provided data. While that neural network engine 210 performs the operation on the provided data, the data providing unit 230 receives data from the memory stack 100 and provides it to another neural network engine 210.
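The serving order of the round-robin scheme can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `round_robin_dispatch`, the block labels, and the use of four engines are assumptions made for the example. Each block stands for the complete data one engine needs for one operation, delivered whole to a single engine.

```python
from itertools import cycle


def round_robin_dispatch(blocks, num_engines):
    """Assign fetched blocks to engines 0, 1, ..., n-1, 0, 1, ... in turn.

    Returns a dict mapping each engine index to the list of blocks it
    received, in delivery order.
    """
    inbox = {engine: [] for engine in range(num_engines)}
    # zip stops when the blocks run out; cycle repeats the engine order.
    for engine, block in zip(cycle(range(num_engines)), blocks):
        inbox[engine].append(block)
    return inbox


# Hypothetical block labels, for illustration only.
inbox = round_robin_dispatch(
    ["blk0", "blk1", "blk2", "blk3", "blk4", "blk5"], num_engines=4
)
for engine, received in inbox.items():
    print(f"engine {engine}: {received}")
```

In this sketch an engine that has received its block is free to compute while the following blocks are delivered to the other engines, which is the overlap the paragraph above describes.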

In addition, when an operation using the provided data is complete, the neural network engine 210 provides the computed data to the data providing unit 230, and the data providing unit 230 transfers the data computed by that neural network engine 210 to the memory stack 100 through the through silicon vias 120.

The prior art used a scheme in which the data to be provided to all engines is read from the memory stack 100 in equal shares and provided to them (distributed data fetching scheme). Consequently, computation began only after all engines had received their data in full. Furthermore, the data computed by each engine was also transferred to the memory stack simultaneously.

Although the number of engines was increased to improve computation speed, the bandwidth available for transferring data from the memory stack to any one engine did not increase accordingly. A data bottleneck therefore occurred, and the resulting improvement in computation speed was marginal.

In contrast, the data providing unit 230 according to the present embodiment first obtains the data needed for the operation of one engine 210a and provides it to that engine 210a. Because the data provided from the memory stack 100 is transferred over a wide bandwidth, it is transferred at high speed. The data providing unit 230 provides the transferred data to the target engine 210a, and the engine 210a that has received the data starts its operation. The data providing unit 230 then obtains the data needed for the operation of another engine 210b and provides it to that engine 210b so that it can perform its operation.

In this way, the data providing unit 230 obtains, from the memory stack 100, the data to be provided to one engine and provides it to that engine so that the engine begins its operation first. While the engine that has received data performs its operation, the data providing unit 230 obtains, from the memory stack 100, the data to be provided to another engine and provides it to that engine so that it can perform its operation.

Subsequently, the data providing unit 230 receives data from one engine 210 that has completed its operation and stores the data in the memory stack 100 through the through silicon vias 120. When that store to the memory stack 100 is complete, the data providing unit 230 receives data from another engine 210 that has completed its operation and stores it in the memory stack 100 through the through silicon vias 120.

That is, the data providing unit 230 according to the present embodiment reads data from the memory stack and provides it to one engine at a time in a round-robin manner, and likewise provides the results computed by an engine to the memory stack. Data can therefore be transferred at high bandwidth, which yields a faster processing speed and higher throughput than the prior art (distributed data fetching scheme).

Simulation results

FIG. 5 is a timing diagram showing the time taken to process the third layer of AlexNet, one example of a neural network, using the prior art and the present embodiment. Both the prior art and the present embodiment performed the computation using 16 neural network engines, and DRAM (dynamic RAM) was used as the memory. Referring to FIG. 5, the processing performed by each engine can be broadly divided into four phases: IDLE, DRAM to Engine, Engine Processing, and Engine to DRAM. With the prior art distributed data fetching, the 16 neural network engines simultaneously receive the same amount of data and then perform the same amount of computation, so all neural network engines compute for the same length of time.

In contrast, with the round-robin data fetching scheme according to the present embodiment, data is delivered in order from engine #0 to engine #15. Because data is transferred with 16 times the bandwidth of the prior art, the time taken by the DRAM to Engine phase is reduced to one sixteenth. Each engine performs the Engine Processing phase after receiving all the data it needs, and this phase takes the same time under both distributed and round-robin data fetching.

The Engine to DRAM phase likewise uses 16 times the bandwidth under round-robin data fetching compared with distributed data fetching, so the time taken by this phase can also be reduced to one sixteenth. That is, round-robin data fetching delivers data to each engine more quickly, so the Engine Processing phase can start sooner. In addition, once an engine has completed its Engine Processing phase, it can proceed to the next operation regardless of whether the other engines have completed theirs, so computation can be performed faster than with the distributed data fetching scheme.
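The phase timings discussed above can be reproduced in a toy timeline model. The sketch below is illustrative only and is not the simulation behind FIG. 5: the phase durations (16, 32, and 16 time units) are hypothetical, the function names are assumptions, and the model further assumes a single pass, a single shared link that serves loads first and then stores in engine order, and no other overheads.

```python
NUM_ENGINES = 16


def distributed_timeline(load, process, store, n=NUM_ENGINES):
    """Every engine gets 1/n of the link, so all engines share one timeline.

    Each entry is (load_start, load_end, process_end, store_end).
    """
    return [(0.0, load, load + process, load + process + store)
            for _ in range(n)]


def round_robin_timeline(load, process, store, n=NUM_ENGINES):
    """One engine at a time gets the whole link, so each transfer takes 1/n
    of the distributed transfer time (assumed model)."""
    fast_load, fast_store = load / n, store / n
    link_free = 0.0
    loaded = []
    for _ in range(n):  # DRAM to Engine, engine #0 first
        load_start = link_free
        load_end = load_start + fast_load
        link_free = load_end
        loaded.append((load_start, load_end, load_end + process))
    timeline = []
    for load_start, load_end, process_end in loaded:  # Engine to DRAM
        # A store waits for both the engine's result and a free link.
        store_start = max(process_end, link_free)
        store_end = store_start + fast_store
        link_free = store_end
        timeline.append((load_start, load_end, process_end, store_end))
    return timeline


# Hypothetical durations, measured at 1/16 of the link bandwidth.
dist = distributed_timeline(load=16.0, process=32.0, store=16.0)
rr = round_robin_timeline(load=16.0, process=32.0, store=16.0)

print("distributed, every engine:", dist[0])
print("round-robin, engine #0:  ", rr[0])
print("round-robin, engine #15: ", rr[-1])
```

With these assumed durations, engine #0 begins Engine Processing after 1 time unit under round-robin fetching instead of 16 under distributed fetching, and each individual transfer occupies one sixteenth of the time. The model does not cover the further benefit noted above, namely that an engine can proceed to its next operation without waiting for the others.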

The present invention has been described with reference to the embodiments shown in the drawings to aid understanding. These embodiments are merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. Therefore, the true technical scope of protection of the present invention should be defined by the appended claims.

1: neural network circuit 100: memory stack
100a: memory die 110: memory area
120: through silicon via 200: logic die
210: engine 230: data providing unit

Claims

A neural network circuit comprising:
a memory stack in which a plurality of dies on which memory is formed are stacked; and
a logic die on which are formed a plurality of engines that perform operations of a neural network and a data providing unit that obtains data stored in the memory stack and provides the data to the plurality of engines,
wherein the data providing unit obtains, from the memory stack, data to be provided to any one engine among the plurality of engines and provides the data to the one engine, and
wherein, when the one engine performs an operation, the data providing unit obtains from the memory stack only the data necessary for the operation of the one engine and provides the data to the one engine.

The neural network circuit of claim 1,
wherein the plurality of stacked dies of the memory stack include through silicon vias (TSVs) formed at the same positions as one another.

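The neural network circuit of claim 1,
wherein the data providing unit obtains the data from the memory stack through a plurality of through silicon vias that are electrically connected to the dies on which the memory is formed and that penetrate the dies so as to transmit the data stored in the memory.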

The neural network circuit of claim 3,
wherein the data providing unit includes a register that latches data provided through the through silicon vias.

The neural network circuit of claim 1,
wherein, while the one engine performs an operation using the provided data, the data providing unit obtains, from the memory stack, data to be provided to another engine among the plurality of engines and provides the data to the other engine.