TWI688895B - Fast vector multiplication and accumulation circuit - Google Patents


Info

Publication number: TWI688895B
Application number: TW107114790A
Authority: TW (Taiwan)
Prior art keywords: input port, shifter, adder, fast, compressor
Other languages: Chinese (zh)
Other versions: TW201939266A
Inventors: 林永隆, 李道一
Original assignee: 國立清華大學 (National Tsing Hua University)
Application filed by 國立清華大學
Priority to US16/190,129 (granted as US10908879B2)
Publication of TW201939266A
Application granted
Publication of TWI688895B

Abstract

A fast vector multiplication and accumulation circuit is proposed. The circuit is applied to an artificial neural network accelerator and is used to calculate the inner product of a multiplier vector and a multiplicand vector. The fast vector multiplication and accumulation circuit includes a scheduler, a self-accumulating adder and an adder. It utilizes a multi-bit compressor in the self-accumulating adder and binary arithmetic coding in the scheduler to greatly raise the degree of vector parallelism of a long-vector inner-product operation. The self-accumulating adder, combined with an application-specific integrated circuit (ASIC), therefore accomplishes a fast inner-product operation, greatly reducing computational complexity, latency and power consumption.

Description

Fast vector multiply-accumulate circuit

The present invention relates to a fast vector multiply-accumulate circuit, and more particularly to a fast vector multiply-accumulate circuit applied to a neural network hardware accelerator.

A neural network is a machine learning model that employs one or more layers to generate an output (e.g., a classification) for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as the input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input according to the current values of a respective set of parameters.

Some neural networks include one or more convolutional neural network layers. Each convolutional layer has an associated set of kernels. Each kernel contains values established by a neural network model created by a user. In some embodiments, the kernels identify particular image contours, shapes or colors. A kernel can be represented as a matrix structure of weight inputs. Each convolutional layer can also process a set of activation inputs, which can likewise be represented as a matrix structure.

Some conventional systems perform the computation for a given convolutional layer in software. For example, the software can apply each kernel of each layer to the set of activation inputs. In other words, for each kernel, the software overlays the kernel, which can be represented multi-dimensionally, over a first portion of the activation inputs, which can likewise be represented multi-dimensionally. The software then computes an inner product from the overlapping elements. The inner product corresponds to a single activation input, namely the activation input element at the upper-left position of the overlapped multidimensional region. Using a sliding window, the software shifts the kernel to overlay a second portion of the activation inputs and computes another inner product corresponding to another activation input. The software repeats this procedure until every activation input has a corresponding inner product. In some other implementations, the inner products are fed to an activation function, which produces activation values; the activation values may be combined (i.e., pooled) before being sent to a subsequent layer of the neural network.
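The sliding-window procedure described above can be sketched in a few lines. The following Python model is an illustration only (function and variable names are ours, not the patent's); it computes one inner product per overlap position of a 2-D kernel over a 2-D activation matrix:

```python
def conv2d_inner_products(kernel, activations):
    """Slide the kernel over the activation matrix and compute one
    inner product per overlap position (valid convolution, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    ah, aw = len(activations), len(activations[0])
    out = []
    for i in range(ah - kh + 1):
        row = []
        for j in range(aw - kw + 1):
            # Inner product of the kernel with the overlapped patch;
            # the result corresponds to the top-left activation element.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += kernel[di][dj] * activations[i + di][j + dj]
            row.append(acc)
        out.append(row)
    return out

kernel = [[1, 0], [0, 1]]
acts = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(conv2d_inner_products(kernel, acts))  # [[6, 8], [12, 14]]
```

Each inner loop is exactly the "inner product of overlapping elements" that the hardware described below accelerates.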

One way of computing a convolution requires a large buffer to hold the activation tensor and the kernel tensor. A general-purpose processor computes the matrix multiplication through directly implemented multipliers. Although matrix multiplication is compute-intensive and time-intensive, the processor repeatedly computes the individual sums and products of the convolution; this limits parallelization and greatly increases computational complexity and power consumption.

It follows that the market currently lacks a fast vector multiply-accumulate circuit that can greatly increase vectorization and reduce power consumption, and those in the industry are seeking a solution.

Therefore, an object of the present invention is to provide a fast vector multiply-accumulate circuit that is suitable for neural network hardware accelerators and that realizes the inner-product operation of vectors through a specific self-accumulator combined with an application-specific integrated circuit, thereby greatly increasing vectorization while greatly reducing the complexity, latency and power consumption of the computation hardware.

According to one embodiment of a structural aspect of the present invention, a fast vector multiply-accumulate circuit is provided, which is applied to a neural network hardware accelerator and is used to compute an inner product of a multiplier vector and a multiplicand vector. The fast vector multiply-accumulate circuit includes an arrangement shifter, a self-accumulator and an adder. The arrangement shifter arranges the plurality of multiplicands of the multiplicand vector into a plurality of arranged-shift operands according to the plurality of multipliers of the multiplier vector. The self-accumulator is signally connected to the arrangement shifter and includes a compressor, at least two delay elements and at least one shifter. The compressor has a plurality of input ports and a plurality of output ports; one of the input ports sequentially receives the arranged-shift operands, and the compressor adds the arranged-shift operands to produce a plurality of compressed operands, which are respectively output from the output ports. The two delay elements are signally connected to another two of the input ports, and one of the delay elements is signally connected to one of the output ports. The shifter is connected between another one of the output ports and the other delay element, and shifts one of the compressed operands. Furthermore, the adder is signally connected to the output ports of the self-accumulator and adds the compressed operands to produce the inner product.

In this way, the fast vector multiply-accumulate circuit of the present invention realizes the inner-product operation of vectors through a specific self-accumulator combined with an application-specific integrated circuit, which can greatly increase vectorization and greatly reduce the complexity, latency and power consumption of the computation hardware.

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the fast vector multiply-accumulate circuit may include an activation unit signally connected to the adder; the activation unit receives the inner product and performs a nonlinear operation.

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the nonlinear operation may include a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function or a hyperbolic tangent (tanh) function.
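For reference, the nonlinear functions listed above can be written out directly. These are plain floating-point forms (the fixed-point versions a hardware activation unit would actually use are not specified here):

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-x))

def signum(x):
    """Sign function: -1, 0 or 1."""
    return (x > 0) - (x < 0)

def threshold(x, t=0.0):
    """Threshold (step-style) function with threshold t."""
    return 1.0 if x >= t else 0.0

def piecewise_linear(x):
    """A hard-sigmoid-style piecewise-linear function, clamped to [0, 1]."""
    return max(0.0, min(1.0, 0.5 * x + 0.5))

# tanh is available directly as math.tanh
print(sigmoid(0.0), signum(-3), threshold(0.2), piecewise_linear(2.0), math.tanh(0.0))
```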

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the nonlinear operation may be implemented according to a piecewise quadratic approximation.

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the compressor may be a full adder having a first input port, a second input port, a third input port, a first output port and a second output port. One of the delay elements is disposed between the first input port and the first output port, the other delay element and the shifter are disposed between the second input port and the second output port, and the third input port is signally connected to the arrangement shifter.
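Behaviorally, a full adder used this way performs one 3:2 carry-save compression per cycle: three operands in, a sum word and a carry word out, with the carry word shifted left by one bit before being fed back. Below is a minimal Python sketch of that behavior under our own naming (a software model, not the gate-level circuit; operands here are fed in one per cycle, whereas the worked example later in the description folds the first three into one step — the final result is the same either way):

```python
def carry_save_step(a, b, c):
    """One 3:2 compression: three operands in, a sum word and a
    carry word out, with the carry word shifted left by one bit."""
    s = a ^ b ^ c                                  # bitwise sum, no carries
    carry = ((a & b) | (b & c) | (a & c)) << 1     # carry bits, weight x2
    return s, carry

def self_accumulate(operands):
    """Feed operands one per cycle through the 3:2 compressor, with the
    sum and (shifted) carry words fed back as the other two inputs."""
    s, c = 0, 0
    for m in operands:
        s, c = carry_save_step(s, c, m)
    return s, c   # the final answer is s + c, produced by the external adder

s, c = self_accumulate([10, 20, 40])
print(s, c, s + c)   # 54 16 70
```

The invariant is that s + c always equals the running total, which is why the expensive carry propagation can be deferred to a single final addition.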

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the compressor may be a 7-3 compressor (7 to 3 compressor) having a first input port, a second input port, a third input port, a fourth input port, a fifth input port, a sixth input port, a seventh input port, a first output port, a second output port and a third output port. The two delay elements are respectively a first delay element and a second delay element, the shifter is a first shifter, and the self-accumulator further includes a third delay element and a second shifter. The first delay element is disposed between the first input port and the first output port, the second delay element and the second shifter are disposed between the second input port and the third output port, and the third delay element and the first shifter are disposed between the third input port and the second output port; the fourth, fifth, sixth and seventh input ports are all signally connected to the arrangement shifter.
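A 7-3 compressor counts its seven input bits and emits the count on three outputs with weights 1, 2 and 4. The sketch below is a behavioral model under our own naming (not the patent's gate-level design), first for a single bit column and then lifted to whole words:

```python
def compressor_7_3(bits):
    """Single-column 7:3 compressor: count how many of the seven input
    bits are 1 and return the count as three bits (weights 1, 2, 4)."""
    assert len(bits) == 7 and all(b in (0, 1) for b in bits)
    total = sum(bits)
    return total & 1, (total >> 1) & 1, (total >> 2) & 1

def compress_words_7_3(words, width=16):
    """Word-level 7:3 compression: seven operand words in, three words
    out with relative weights 1, 2 and 4, so y0 + y1 + y2 == sum(words)."""
    assert len(words) == 7
    y0 = y1 = y2 = 0
    for pos in range(width):
        col = [(w >> pos) & 1 for w in words]
        b0, b1, b2 = compressor_7_3(col)
        y0 |= b0 << pos
        y1 |= b1 << (pos + 1)   # weight 2
        y2 |= b2 << (pos + 2)   # weight 4
    return y0, y1, y2

print(compressor_7_3([1, 1, 1, 0, 1, 0, 1]))  # 5 ones -> (1, 0, 1)
```

Compared with the 3:2 full adder, the 7:3 variant lets four new arranged-shift operands enter per cycle alongside the three fed-back words, which is the source of the higher parallelism.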

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the adder may be implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder or a ripple carry adder.
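Of the listed options, the ripple carry adder is the simplest to model: a chain of full adders in which each stage's carry-out feeds the next stage's carry-in. A behavioral Python sketch (illustrative only, not an RTL description):

```python
def ripple_carry_add(x, y, width=16):
    """Ripple-carry addition: propagate the carry bit position by
    position through a chain of full adders."""
    result, carry = 0, 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = a ^ b ^ carry                             # full-adder sum bit
        carry = (a & b) | (b & carry) | (a & carry)   # full-adder carry-out
        result |= s << i
    return result

print(ripple_carry_add(0b01111101, 0b00100000))  # 125 + 32 = 157
```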

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the neural network hardware accelerator may include a first-layer processing module and a second-layer processing module. The first-layer processing module has a first-layer output terminal and the second-layer processing module has a second-layer input terminal; the fast vector multiply-accumulate circuit is disposed between the first-layer output terminal of the first-layer processing module and the second-layer input terminal of the second-layer processing module.

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the fast vector multiply-accumulate circuit may include a control processor that is signally connected to and controls the arrangement shifter, the self-accumulator and the adder. The neural network hardware accelerator includes a plurality of layer processing modules; the control processor is signally connected to and monitors these layer processing modules, and according to their processing results generates a plurality of control signals to the arrangement shifter, the self-accumulator and the adder to determine scheduling or to halt operation.

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the arrangement shifter includes at least one priority encoder and at least one barrel shifter. The priority encoder sequentially receives the multipliers of the multiplier vector and determines at least one significant-bit position of each multiplier. The barrel shifter sequentially receives the multiplicands of the multiplicand vector and is signally connected to the priority encoder; the barrel shifter shifts each corresponding multiplicand according to the significant-bit positions so as to arrange the multiplicands into the arranged-shift operands.
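The division of labor described here — the priority encoder reporting the set-bit positions of a multiplier, the barrel shifter shifting the multiplicand by each reported amount — can be modeled as follows. This is a behavioral sketch with our own function names; the hardware emits one position per cycle rather than a Python generator:

```python
def significant_bit_positions(multiplier):
    """Behavioral model of the priority encoder: yield each set-bit
    position of the multiplier, lowest bit first."""
    pos = 0
    while multiplier:
        if multiplier & 1:
            yield pos
        multiplier >>= 1
        pos += 1

def barrel_shift(multiplicand, amount):
    """Behavioral model of the barrel shifter: shift by any amount in
    a single step (a real barrel shifter does this combinationally)."""
    return multiplicand << amount

def arranged_shift_operands(multiplicand, multiplier):
    """Arrange one multiplicand into its arranged-shift operands."""
    return [barrel_shift(multiplicand, p)
            for p in significant_bit_positions(multiplier)]

print([bin(v) for v in arranged_shift_operands(10, 7)])
# bits 0, 1, 2 of 7 (=0b111) are set: ['0b1010', '0b10100', '0b101000']
```

Summing the arranged-shift operands is equivalent to the multiplication, which is why no multiplier array is needed.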

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the fast vector multiply-accumulate circuit can be implemented in a semiconductor process of an application-specific integrated circuit (ASIC), the semiconductor process including complementary metal-oxide-semiconductor (CMOS) or silicon-on-insulator (SOI) technology.

According to other embodiments of the fast vector multiply-accumulate circuit of the foregoing embodiment, the fast vector multiply-accumulate circuit can be implemented by a field programmable gate array (FPGA).

100‧‧‧fast vector multiply-accumulate circuit

102‧‧‧dynamic random access memory (DRAM)

104‧‧‧global buffer memory (GLB)

110, 110a‧‧‧neural network hardware accelerator

200‧‧‧arrangement shifter

210‧‧‧priority encoder

220a, 220b‧‧‧barrel shifter

230‧‧‧delay element

240‧‧‧switch element

300, 300a‧‧‧self-accumulator

310‧‧‧compressor

310a‧‧‧7-3 compressor

320a, 320b, 320c‧‧‧delay element

330, 330a, 330b‧‧‧shifter

400, 400a‧‧‧adder

410a, 410b‧‧‧parallel-in/serial-out module

420‧‧‧full adder

430‧‧‧serial-in/parallel-out module

440‧‧‧XOR gate

450‧‧‧priority encoder

460‧‧‧counter

470‧‧‧comparator

EP, EP0, EP1, EP2, EP3, EP4, EP5, EP6, EP7, DONE‧‧‧priority-encoder output ports

x, x0, x1, x2, x3, x4, x5, x6, x7‧‧‧barrel-shifter input ports

y, y0, y1, y2, y3, y4, y5, y6, y7‧‧‧barrel-shifter output ports

w, w0, w1, w2, w3, w4, w5, w6, w7‧‧‧barrel-shifter control ports

X, Y, Cin, X0, X1, X2, X3, X4, X5, X6, pi[15:0], Si, FSM‧‧‧input ports

S, Cout, Y0, Y1, Y2, So, po[15:0], EQ‧‧‧output ports

Z, Z[15:0], Z[MSB:1], Z[MSB:ni+1]‧‧‧inner product

READY, LOAD, PROC, FIFO_WEN, RST, COUNTER, q[15:0], q[15], CX, xser, yser, Lx0, Lx1, Wx[0]0, Wx[1]0, Wx[2]0, Wx[n+3]0, Wx[0]1, Wx[1]1, Ly0, Ly1, Wy[0]0, Wy[1]0, Wy[2]0, Wy[n+3]0, Wy[0]1, Wy[1]1, CZ[0]0, CZ[1]0, CZ[2]0, CZ[n+3]0, CZ[0]1, CZ[1]1, CEP0, CEP1, WZ[0]0, WZ[1]0, WZ[n+2]0, WZ[n+3]0, WZ[0]1‧‧‧signals

500‧‧‧control processor

600‧‧‧activation unit

700, 700a‧‧‧fast vector multiply-accumulate method

S12, S22‧‧‧arrangement-shift step

S14, S24‧‧‧self-accumulation step

S16, S26‧‧‧addition step

S222‧‧‧priority-encoding step

S224‧‧‧barrel-shift step

S242‧‧‧summation step

S244‧‧‧delay step

S246‧‧‧shift step

S28‧‧‧activation step

Mr‧‧‧multiplier vector

Mc‧‧‧multiplicand vector

Z^-1‧‧‧delay symbol

Mc[0], Mc[1], Mc[2]‧‧‧multiplicands

Ms, Ms0, Ms1, Ms2, Ms3‧‧‧arranged-shift operands

S[n], Cout[n], S[0], Cout[0], S[1], Cout[1], S[2], Cout[2], S[3], Cout[3], S[15:0], Cout[15:0]‧‧‧compressed operands

M, M0, M1, M2, M3, M4, M5, M6, M7‧‧‧priority-encoder input ports

P0, P1, P2, P3, P4, P5, P6, P7, P8, Pn, Pn+1‧‧‧priority control signals

n‧‧‧integer

NOP‧‧‧no operation

FIG. 1 is a circuit architecture diagram of a neural network hardware accelerator according to an embodiment of the present invention.

FIG. 2 is a circuit architecture diagram of the fast vector multiply-accumulate circuit of the embodiment of FIG. 1.

FIG. 3A is a circuit architecture diagram of the arrangement shifter of FIG. 2.

FIG. 3B is a circuit architecture diagram of the priority encoder of FIG. 3A.

FIG. 3C is a circuit architecture diagram of the barrel shifter of FIG. 3A.

FIG. 3D is a pipeline timing diagram of the arrangement shifter of FIG. 3A.

FIG. 4A is a circuit architecture diagram of the self-accumulator of the embodiment of FIG. 2.

FIG. 4B is a pipeline timing diagram of the self-accumulator of FIG. 4A.

FIG. 5 is a circuit architecture diagram of the adder of the embodiment of FIG. 2.

FIG. 6 is a circuit architecture diagram of an adder according to another embodiment of FIG. 2.

FIG. 7 is a pipeline timing diagram of the adder of FIG. 6.

FIG. 8 is a flow diagram of a fast vector multiply-accumulate method according to an embodiment of the present invention.

FIG. 9 is a circuit architecture diagram of a neural network hardware accelerator according to another embodiment of FIG. 1.

FIG. 10 is a circuit architecture diagram of a self-accumulator according to another embodiment of FIG. 2.

FIG. 11 is a flow diagram of a fast vector multiply-accumulate method according to another embodiment of the present invention.

A plurality of embodiments of the present invention will be described below with reference to the drawings. For the sake of clarity, many practical details are explained in the following description. However, it should be understood that these practical details are not intended to limit the present invention; that is, in some embodiments of the present invention, these practical details are unnecessary. In addition, to simplify the drawings, some conventional structures and elements are shown in a simple schematic manner, and repeated elements may be denoted by the same reference numerals.

Please refer to FIG. 1 and FIG. 2 together. FIG. 1 shows a circuit architecture diagram of a neural network hardware accelerator 110 according to an embodiment of the present invention, and FIG. 2 shows a circuit architecture diagram of the fast vector multiply-accumulate circuit 100 of the neural network hardware accelerator 110 of FIG. 1. As shown in the figures, the neural network hardware accelerator 110 includes a dynamic random access memory 102 (DRAM), a global buffer memory 104 (GLB), a plurality of fast vector multiply-accumulate circuits 100 and a control processor 500. The fast vector multiply-accumulate circuit 100 is applied to the neural network hardware accelerator 110 and is used to compute the inner product Z of a multiplier vector Mr and a multiplicand vector Mc. The fast vector multiply-accumulate circuit 100 includes an arrangement shifter 200, a self-accumulator 300 and an adder 400.

The arrangement shifter 200 arranges the plurality of multiplicands of the multiplicand vector Mc into a plurality of arranged-shift operands Ms according to the plurality of multipliers of the multiplier vector Mr. Equation (1) and Table 1 give an example: equation (1) expresses the inner-product operation of a multiplier vector Mr and a multiplicand vector Mc, and Table 1 lists the numerical results obtained when the inner product of equation (1) is computed by the fast vector multiply-accumulate circuit 100 of FIG. 2, as follows:

Z = Mr · Mc = 7×10 + 4×15 + 9×3 = 157  (1)

Table 1
Row  Operand               Binary
1    Mc[0]                 00001010
2    Mc[0] << 1            00010100
3    Mc[0] << 2            00101000
4    S[0]                  00110110
5    Cout[0]               00010000
6    Mc[1] << 2            00111100
7    S[1]                  00011010
8    Cout[1]               01101000
9    Mc[2]                 00000011
10   S[2]                  01110001
11   Cout[2]               00010100
12   Mc[2] << 3            00011000
13   S[3]                  01111101
14   Cout[3]               00100000
15   Z = S[3] + Cout[3]    10011101 (= 157)

Referring to equation (1) and Table 1, assume the multiplicand vector Mc has three multiplicands Mc[0], Mc[1] and Mc[2], whose decimal values are 10, 15 and 3 and whose binary representations are "00001010", "00001111" and "00000011", respectively. The multiplier vector Mr has three multipliers whose decimal values are 7, 4 and 9 and whose binary representations are "00000111", "00000100" and "00001001", respectively. When the first multiplicand Mc[0] (10dec; 00001010bin) is multiplied by the multiplier (7dec; 00000111bin), the arrangement shifter 200 performs arranged shifts on the multiplicand Mc[0] according to the three 1s in the multiplier "00000111", producing three arranged-shift operands Ms: "00001010", "00010100" and "00101000". The first arranged-shift operand Ms is the multiplicand Mc[0] itself, the second is Mc[0] shifted left by one bit, and the third is Mc[0] shifted left by two bits, as shown in rows 1-3 of Table 1. Likewise, when the second multiplicand Mc[1] (15dec; 00001111bin) is multiplied by the multiplier (4dec; 00000100bin), the arrangement shifter 200 performs an arranged shift on Mc[1] according to the single 1 in the multiplier "00000100", producing one arranged-shift operand Ms, "00111100", i.e., Mc[1] shifted left by two bits, as shown in row 6 of Table 1. Finally, when the third multiplicand Mc[2] (3dec; 00000011bin) is multiplied by the multiplier (9dec; 00001001bin), the arrangement shifter 200 performs arranged shifts on Mc[2] according to the two 1s in the multiplier "00001001", producing two arranged-shift operands Ms, "00000011" and "00011000". The first is the multiplicand Mc[2] itself and the second is Mc[2] shifted left by three bits, as shown in rows 9 and 12 of Table 1.

The self-accumulator 300 is signally connected to the arrangement shifter 200, and adds the arranged-shift operands Ms to produce a plurality of compressed operands S[n] and Cout[n], where n is an integer greater than or equal to 0. In the example of equation (1) and Table 1, the self-accumulator 300 performs four additions in sequence. In the first addition, the self-accumulator 300 adds the three arranged-shift operands Ms (i.e., Mc[0]=00001010, Mc[0](<<1)=00010100 and Mc[0](<<2)=00101000) to produce two compressed operands S[0] and Cout[0], as shown in rows 4 and 5 of Table 1. In the second addition, it adds the two compressed operands S[0] and Cout[0] and one arranged-shift operand Ms (i.e., Mc[1](<<2)=00111100) to produce two compressed operands S[1] and Cout[1], as shown in rows 7 and 8 of Table 1. In the third addition, it adds S[1], Cout[1] and one arranged-shift operand Ms (i.e., Mc[2]=00000011) to produce S[2] and Cout[2], as shown in rows 10 and 11 of Table 1. In the fourth addition, it adds S[2], Cout[2] and one arranged-shift operand Ms (i.e., Mc[2](<<3)=00011000) to produce S[3] and Cout[3], as shown in rows 13 and 14 of Table 1.
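The additions above, together with the arrangement shifting, can be checked end to end in software. The model below is ours, not the patent's RTL: it decomposes each multiplier into set-bit shifts of its multiplicand, carry-save accumulates the arranged-shift operands, and leaves one final addition at the end. Because operands are folded in one at a time, the intermediate sum and carry words differ from Table 1 (which folds the first three operands into one step), but the final inner product matches:

```python
def fast_vector_mac(multipliers, multiplicands):
    """Behavioral end-to-end model: arrangement shifting followed by
    carry-save self-accumulation and one final carry-propagating add."""
    s = c = 0
    for mr, mc in zip(multipliers, multiplicands):
        pos = 0
        while mr:
            if mr & 1:
                operand = mc << pos              # arranged-shift operand
                # one 3:2 compression per operand (full-adder feedback)
                new_s = s ^ c ^ operand
                c = ((s & c) | (c & operand) | (s & operand)) << 1
                s = new_s
            mr >>= 1
            pos += 1
    return s + c    # the external adder produces the inner product

print(fast_vector_mac([7, 4, 9], [10, 15, 3]))  # 7*10 + 4*15 + 9*3 = 157
```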

The adder 400 is signally connected to the output ports S, Cout of the self-accumulator 300; the adder 400 adds the compression operands S[3], Cout[3] to generate the inner product Z, as shown in row 15 of Table 1. Furthermore, the adder 400 may be implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder, or a ripple carry adder.

The control processor 500 is signally connected to and controls the arrangement shifter 200, the self-accumulator 300 and the adder 400. The control processor 500 may be a central processing unit (CPU), a microcontroller (MCU) or other control logic. It is also worth mentioning that the artificial neural network hardware accelerator 110 includes a plurality of layer processing modules (not shown); the control processor 500 is signally connected to and monitors these layer processing modules, and generates a plurality of control signals to the arrangement shifter 200, the self-accumulator 300 and the adder 400 according to the processing results of the layer processing modules, so as to determine scheduling or stop operation. In addition, in other embodiments, the artificial neural network hardware accelerator 110 may include a first-layer processing module having a first-layer output terminal and a second-layer processing module having a second-layer input terminal; the fast vector multiply-accumulate circuit 100 is disposed between the first-layer output terminal of the first-layer processing module and the second-layer input terminal of the second-layer processing module, so as to process the output signal of the first-layer processing module. It is also worth mentioning that the fast vector multiply-accumulate circuit 100 can be implemented by a semiconductor process of an application-specific integrated circuit (ASIC), the semiconductor process including complementary metal-oxide-semiconductor (CMOS) or silicon-on-insulator (SOI) technology. Furthermore, the fast vector multiply-accumulate circuit 100 can be implemented by a field programmable gate array (FPGA). Accordingly, the fast vector multiply-accumulate circuit 100 of the present invention is well suited to the artificial neural network hardware accelerator 110, and by implementing the vector inner product operation with the specific self-accumulator 300 combined with an application-specific integrated circuit, it greatly reduces computational hardware complexity, latency and power consumption.

Please refer to FIGS. 2, 3A, 3B, 3C and 3D together. FIG. 3A shows a circuit architecture diagram of the arrangement shifter 200 of FIG. 2. FIG. 3B shows a circuit architecture diagram of the priority encoder 210 of FIG. 3A. FIG. 3C shows a circuit architecture diagram of the barrel shifter 220a of FIG. 3A. FIG. 3D shows a pipeline timing diagram of the arrangement shifter 200 of FIG. 3A. As shown, the arrangement shifter 200 includes one priority encoder 210, two barrel shifters 220a, 220b, five delay elements 230 and four switch elements 240.

The priority encoder 210 sequentially receives the multipliers of the multiplier vector Mr and determines at least one valid bit position of each multiplier, that is, the positions in the multiplier whose value is 1. The priority encoder 210 includes eight priority encoding input ports M0, M1, M2, M3, M4, M5, M6, M7, nine priority control signals P0, P1, P2, P3, P4, P5, P6, P7, P8, eight priority encoding output ports EP0, EP1, EP2, EP3, EP4, EP5, EP6, EP7, and one signal READY. The priority encoding input ports M0–M7 receive the multipliers of the multiplier vector Mr. The priority control signals P0–P8 are internal signals of the priority encoder 210 representing the priority status: when Pn is 0, its successors Pn+1–P8 can no longer obtain the priority status, and P0 is always 1 (logic true). In addition, the priority encoder 210 includes nineteen AND gates and nine inverters, connected as shown in FIG. 3B. Through this cascade of AND gates and inverters, the priority encoding output ports EP0–EP7 identify a position of a 1 in the multiplier. Taking equation (1) as an example, if a multiplier of the multiplier vector Mr is 7 (00000111), the corresponding signals at the priority encoding input ports M0, M1, M2, M3, M4, M5, M6, M7 are 1, 1, 1, 0, 0, 0, 0, 0, and the signals obtained at the priority encoding output ports EP0, EP1, EP2, EP3, EP4, EP5, EP6, EP7 after the priority encoder 210 are 1, 0, 0, 0, 0, 0, 0, 0. In other words, if the multiplier is nonzero, exactly one of the priority encoding output ports EP0–EP7 is 1 and the rest are 0; if the multiplier is zero, the priority encoding output ports EP0–EP7 are all 0.
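Functionally, the priority encoder isolates the lowest-order 1 of its input as a one-hot word; in two's-complement arithmetic this is `m & -m`. The following is an illustrative Python model of that behavior, not the gate netlist of FIG. 3B:

```python
def priority_encode(m: int) -> int:
    """One-hot word marking the lowest set bit of m; 0 when m == 0 (all EP low)."""
    return m & -m

assert priority_encode(0b00000111) == 0b00000001  # multiplier 7: only EP0 is 1
assert priority_encode(0b00001001) == 0b00000001  # lowest 1 of 9 is at bit 0
assert priority_encode(0) == 0                    # zero multiplier: EP0..EP7 all 0
```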

The barrel shifter 220a and the barrel shifter 220b have the same structure, and each includes a plurality of tri-state buffers, eight barrel shift input ports x0, x1, x2, x3, x4, x5, x6, x7, eight barrel shift output ports y0, y1, y2, y3, y4, y5, y6, y7, and eight barrel shift control ports w0, w1, w2, w3, w4, w5, w6, w7, connected as shown in FIG. 3C. The barrel shift control ports w0–w7 are respectively connected to the priority encoding output ports EP0–EP7 of FIG. 3B. The barrel shifter 220a sequentially receives the multipliers of the multiplier vector Mr and is signally connected to the priority encoder 210; the barrel shifter 220a shifts each corresponding multiplier according to the valid bit position. The barrel shifter 220b sequentially receives the multiplicands Mc[0], Mc[1], Mc[2] of the multiplicand vector Mc and is signally connected to the priority encoder 210; the barrel shifter 220b shifts each corresponding multiplicand Mc[0], Mc[1], Mc[2] according to the valid bit position to form the arrangement shift operands Ms. In addition, the multiplier vector Mr and the multiplicand vector Mc are shifted a plurality of times according to the priority encoding result of the multiplier vector Mr; each shift is determined by the switch elements 240, and the result can be output once each shift completes. Furthermore, the signal LOAD controls the arrangement shifter 200 and indicates loading of a new multiplier vector Mr and multiplicand vector Mc, and the arrangement shifter 200 generates the signals READY, PROC and FIFO_WEN so that the arrangement shift operands Ms are correctly arranged and output to the self-accumulator 300. The signal READY indicates that all shifts are complete; the signal PROC indicates a shift operation in progress; the signal FIFO_WEN indicates that one shift has completed and a group of shift operands is written to the next-stage input.
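The barrel shifter driven by the one-hot priority code, together with the feedback path that clears each processed multiplier bit, can be modeled as below. This Python sketch is behavioral only; the function names are illustrative and the bit-clearing step stands in for the switch elements 240:

```python
def barrel_shift(x: int, onehot: int) -> int:
    """Shift x left by k, where the one-hot control word has its single 1 at bit k."""
    return x << (onehot.bit_length() - 1) if onehot else 0

def arrangement_shift(multiplicand: int, multiplier: int):
    """Feedback loop: peel off one 1-bit of the multiplier per pass."""
    while multiplier:
        ep = multiplier & -multiplier          # priority-encoded one-hot control
        yield barrel_shift(multiplicand, ep)   # one arrangement shift operand Ms
        multiplier &= multiplier - 1           # clear the processed bit (feedback)

assert list(arrangement_shift(3, 9)) == [0b00000011, 0b00011000]
```

Each pass corresponds to one PROC cycle, and the loop terminating corresponds to READY being asserted.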

Both the delay elements 230 and the switch elements 240 are controlled by the control processor 500; proper control allows the signals at the input and output ports of the priority encoder 210 and the barrel shifters 220a, 220b to correspond correctly in timing, thereby increasing pipeline execution efficiency. The delay elements 230 delay signals, and the switch elements 240 determine whether to load a new multiplier vector Mr and multiplicand vector Mc or to continue shifting with the values on the feedback path. Moreover, as the pipeline timing of FIG. 3D shows, when one multiplicand of the multiplicand vector Mc and one multiplier of the multiplier vector Mr are input to the arrangement shifter 200 at the current clock cycle (e.g., cycle=1), the priority encoder 210, the barrel shifters 220a, 220b and the arrangement shift operand Ms all produce their corresponding outputs at the next clock cycle (e.g., cycle=2). Among the signals of FIG. 3D, L represents "Load", C represents "Compute (ALU operation)", and W represents "Write".

Please refer to FIGS. 2, 4A and 4B together. FIG. 4A shows a circuit architecture diagram of the self-accumulator 300 of the embodiment of FIG. 2. FIG. 4B shows a pipeline timing diagram of the self-accumulator 300 of FIG. 4A. As shown, the self-accumulator 300 of the present invention includes a compressor 310, at least two delay elements 320a, 320b and at least one shifter 330. The compressor 310 has a plurality of input ports X, Y, Cin and a plurality of output ports S, Cout; the input port Cin sequentially receives the arrangement shift operands Ms. The compressor 310 adds the arrangement shift operands Ms to generate a plurality of compression operands S[n], Cout[n], which are output from the output ports S and Cout, respectively. The two delay elements 320a, 320b are signally connected to the other two input ports X and Y of the compressor 310, and the delay element 320a is signally connected to the output port S. The shifter 330 is connected between the output port Cout and the delay element 320b, and shifts the compression operand Cout[n]. In detail, the compressor 310 of this embodiment is a full adder (FA) having a first input port X, a second input port Y, a third input port Cin, a first output port S and a second output port Cout; this full adder is a 3-to-2 compressor, whose truth table is shown in Table 2. One delay element 320a is disposed between the first input port X and the first output port S, the other delay element 320b together with the shifter 330 is disposed between the second input port Y and the second output port Cout, and the third input port Cin is signally connected to the arrangement shifter 200. In addition, as the pipeline timing of FIG. 4B shows, after n+5 clock cycles the first output port S and the second output port Cout output the correct compression operands S[n], Cout[n] for use by subsequent circuits (such as the adder 400), the compression operands S[n], Cout[n] corresponding to the signals shown in FIG. 4B. The input register and the output register (FIFO) are respectively coupled to the input and output of the self-accumulator 300, and both registers are controlled by the control processor 500.
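Table 2 is the standard full-adder truth table; its defining property, three input bits compressed into a sum bit and a carry bit of double weight, can be checked exhaustively. This sketch operates on single bits for clarity, whereas the hardware applies the same logic bit-parallel across whole words:

```python
def full_adder(x: int, y: int, c_in: int):
    """3-to-2 compressor on single bits: sum and carry of x + y + c_in."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out

# All eight rows of the truth table satisfy x + y + c_in == s + 2*c_out.
for x in (0, 1):
    for y in (0, 1):
        for c in (0, 1):
            s, c_out = full_adder(x, y, c)
            assert x + y + c == s + 2 * c_out
```

The factor of 2 on the carry bit is exactly why the shifter 330 shifts Cout[n] left by one bit before feeding it back.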

[Table 2: truth table of the 3-to-2 compressor (full adder); shown as an image in the original document.]

Please refer to FIGS. 2, 5, 6 and 7 together. FIG. 5 shows a circuit architecture diagram of the adder 400 of the embodiment of FIG. 2. FIG. 6 shows a circuit architecture diagram of an adder 400a of another embodiment of FIG. 2. FIG. 7 shows a pipeline timing diagram of the adder 400a of FIG. 6. As shown, the adder 400 of FIG. 5 includes two parallel-in/serial-out (PISO) modules 410a, 410b, a full adder 420 and a serial-in/parallel-out (SIPO) module 430. The full adder 420 is connected between the parallel-in/serial-out modules 410a, 410b and the serial-in/parallel-out module 430. In addition, the adder 400a of FIG. 6 further includes an exclusive-OR (XOR) gate 440, a priority encoder 450, a counter 460 and a comparator 470. The XOR gate 440 is coupled to the first output port S and the second output port Cout, and outputs to the priority encoder 450 and the serial-in/parallel-out module 430. The priority encoder 450 and the counter 460 are both connected to the comparator 470. In the comparator 470, when the value at input port X equals the value at input port Y, the output port EQ is 1; when the values differ, the output port EQ is 0. Together, the XOR gate 440, the priority encoder 450, the counter 460 and the comparator 470 use the compression operands S[15:0], Cout[15:0] to generate a signal READY that identifies the most significant processing bit among the 16 bits of the signal q[15:0]. If the signal READY is 0, the most significant processing bit of q[15:0] has not yet been found; if the signal READY is 1, the most significant processing bit of q[15:0] has been found, which can be used to stop the adder 400a early, further saving computation and power consumption. For example, if q[15:0]=0000000011111111, then q[7] is the most significant processing bit and the compression operands S[15:8], Cout[15:8] are all 0, so the adder no longer needs to process the addition of S[15:8] and Cout[15:8]. In addition, as the pipeline timing of FIG. 7 shows, after n+5 clock cycles the serial-in/parallel-out module 430 outputs the correct inner product Z for use by subsequent circuits (such as the activation unit 600). In FIG. 7, the signal RST represents "Reset". Thereby, through these signal determinations in the specific circuit, the adders 400, 400a of the present invention greatly save computation and power consumption.
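The early-termination idea can be sketched behaviorally: a bit-serial ripple addition of S and Cout that stops as soon as all remaining bit-slices of both operands are zero and no carry is pending. This Python model is an assumption-laden simplification (it does not reproduce the counter/comparator wiring of FIG. 6):

```python
def serial_add_early_stop(s: int, c_out: int, width: int = 16):
    """Bit-serial addition of S and Cout with early stop past the most
    significant processing bit (cf. the READY signal of adder 400a)."""
    msb = (s | c_out).bit_length()          # most significant processing bit
    z, carry, cycles = 0, 0, 0
    for i in range(width):
        if i >= msb and carry == 0:         # READY: remaining slices contribute nothing
            break
        a, b = (s >> i) & 1, (c_out >> i) & 1
        total = a + b + carry
        z |= (total & 1) << i
        carry = total >> 1
        cycles += 1
    return z, cycles

z, cycles = serial_add_early_stop(125, 32)  # S[3], Cout[3] from the running example
assert z == 157
assert cycles < 16                          # stopped before all 16 bit-slices
```

The saved cycles translate directly into saved switching activity, which is the power argument made above.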

Please refer to FIGS. 2 and 8 together. FIG. 8 shows a flowchart of a fast vector multiply-accumulate method 700 according to an embodiment of the present invention, which can be used with the fast vector multiply-accumulate circuit 100 of FIG. 2, although the present invention is not limited thereto. The fast vector multiply-accumulate method 700 includes an arrangement shift step S12, a self-accumulation step S14 and an addition step S16. In the arrangement shift step S12, the arrangement shifter 200 arranges the multiplicands of the multiplicand vector Mc into the arrangement shift operands Ms according to the multipliers of the multiplier vector Mr. In the self-accumulation step S14, the self-accumulator 300 adds the arrangement shift operands Ms to generate the compression operands S[n], Cout[n]. In the addition step S16, the adder 400 adds the compression operands S[n], Cout[n] to generate the inner product Z. Thereby, the fast vector multiply-accumulate method 700 of the present invention is well suited to inner product operations of artificial neural networks, and by implementing the vector inner product operation through the specific arrangement shift step S12 combined with the self-accumulation step S14, it greatly reduces computational hardware complexity, latency and power consumption.
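The three steps S12, S14 and S16 compose into a complete inner product. A minimal end-to-end sketch (function name illustrative; carry-save order simplified relative to the hardware, which is sound because each compression step preserves the running sum):

```python
def fast_vector_mac(mc, mr):
    """S12: arrangement shift; S14: carry-save self-accumulation; S16: final add."""
    # S12: one shifted operand per set bit of each multiplier
    ops = [m << b for m, r in zip(mc, mr)
           for b in range(r.bit_length()) if (r >> b) & 1]
    # S14: fold every operand through the 3-to-2 compressor
    s, c = 0, 0
    for m in ops:
        x = s ^ c ^ m
        c = ((s & c) | (s & m) | (c & m)) << 1
        s = x
    # S16: the adder produces the inner product Z
    return s + c

assert fast_vector_mac([10, 15, 3], [7, 4, 9]) == 157  # equation (1)
```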

Please refer to FIGS. 1, 2 and 9 together. FIG. 9 shows a circuit architecture diagram of an artificial neural network hardware accelerator 110a of another embodiment of FIG. 1. The artificial neural network hardware accelerator 110a includes the fast vector multiply-accumulate circuit 100, the control processor 500 and an activation unit 600.

In the embodiment of FIG. 9, the fast vector multiply-accumulate circuit 100 and the control processor 500 are the same as those in FIG. 2 and are not described again. In particular, the artificial neural network hardware accelerator 110a further includes an activation unit 600 signally connected to the adder 400; the activation unit 600 receives the inner product Z and performs a nonlinear operation. The nonlinear operation includes a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function or a hyperbolic tangent (tanh) function. In addition, the nonlinear operation can be implemented according to a piecewise quadratic approximation.

Please refer to FIGS. 2, 9 and 10 together. FIG. 10 shows a circuit architecture diagram of a self-accumulator 300a of another embodiment of FIG. 2. This self-accumulator 300a can process a larger amount of data of the multiplier vector Mr and the multiplicand vector Mc at a time. The self-accumulator 300a includes a 7-to-3 compressor 310a, a first delay element 320a, a second delay element 320b, a third delay element 320c, a first shifter 330a and a second shifter 330b. The 7-to-3 compressor 310a has a first input port X0, a second input port X1, a third input port X2, a fourth input port X3, a fifth input port X4, a sixth input port X5, a seventh input port X6, a first output port Y0, a second output port Y1 and a third output port Y2. The first delay element 320a is disposed between the first input port X0 and the first output port Y0; the second delay element 320b and the second shifter 330b are disposed between the second input port X1 and the third output port Y2; and the third delay element 320c and the first shifter 330a are disposed between the third input port X2 and the second output port Y1. The fourth input port X3, the fifth input port X4, the sixth input port X5 and the seventh input port X6 are all signally connected to the arrangement shifter 200. The truth table of the 7-to-3 compressor 310a of this embodiment is shown in Table 3.
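A 7-to-3 compressor outputs the binary count of 1s among its seven inputs, so its behavior can be checked exhaustively. The decomposition into four full adders below is one possible internal structure, offered as an assumption; the patented circuit's actual structure is given by Table 3:

```python
from itertools import product

def compressor_7_3(x):
    """7-to-3 compressor: compresses seven input bits into a 3-bit count.
    Built here from four full adders (one possible decomposition)."""
    fa = lambda a, b, c: (a ^ b ^ c, (a & b) | (a & c) | (b & c))
    s0, c0 = fa(x[0], x[1], x[2])   # weight-1 column
    s1, c1 = fa(x[3], x[4], x[5])
    s2, c2 = fa(s0, s1, x[6])
    s3, c3 = fa(c0, c1, c2)         # weight-2 column
    return c3, s3, s2               # (Y2, Y1, Y0) = binary count of the inputs

# Exhaustive check over all 128 input combinations (the rows of Table 3):
for bits in product((0, 1), repeat=7):
    y2, y1, y0 = compressor_7_3(bits)
    assert sum(bits) == 4 * y2 + 2 * y1 + y0
```

Because one pass now absorbs four new operands (ports X3–X6) instead of one, this variant raises the level of vector parallelism noted in the abstract.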

[Table 3: truth table of the 7-to-3 compressor; shown as images in the original document.]

Please refer to FIGS. 3A, 4A, 9, 10 and 11 together. FIG. 11 shows a flowchart of a fast vector multiply-accumulate method 700a according to another embodiment of the present invention. The fast vector multiply-accumulate method 700a includes an arrangement shift step S22, a self-accumulation step S24, an addition step S26 and an activation step S28.

In the arrangement shift step S22, the arrangement shifter 200 arranges the multiplicands of the multiplicand vector Mc into the arrangement shift operands Ms according to the multipliers of the multiplier vector Mr. In detail, the arrangement shift step S22 includes a priority encoding step S222 and a barrel shift step S224. In the priority encoding step S222, the priority encoder 210 sequentially receives the multipliers of the multiplier vector Mr and determines at least one valid bit position of each multiplier. In the barrel shift step S224, the barrel shifter 220b sequentially receives the multiplicands of the multiplicand vector Mc and shifts each corresponding multiplicand according to the valid bit positions to form the arrangement shift operands Ms.

In the self-accumulation step S24, the self-accumulator 300 adds the arrangement shift operands Ms to generate the compression operands S[n], Cout[n]. In detail, the self-accumulation step S24 includes a summing step S242, a delay step S244 and a shift step S246. In the summing step S242, the compressor 310 adds the arrangement shift operands Ms to generate the compression operands S[n], Cout[n]. In the delay step S244, the delay elements 320a, 320b respectively delay the compression operands S[n], Cout[n] and transmit them to the compressor 310. In the shift step S246, the shifter 330 shifts the compression operand Cout[n] and transmits it to the delay element 320b. In addition, the self-accumulation step S24 may adopt a 3-to-2 compressor, a 7-to-3 compressor or another form of adder as the compressor 310; the structures of the 3-to-2 compressor and the 7-to-3 compressor are shown in FIGS. 4A and 10, respectively, and are not described again.

In the addition step S26, the adder 400 adds the compression operands S[n], Cout[n] to generate the inner product Z. The structure of the adder 400 is shown in FIGS. 5 and 6 and is not described again.

In the activation step S28, the activation unit 600 receives the inner product Z and performs a nonlinear operation; the nonlinear operation includes a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function or a hyperbolic tangent function. Thereby, the fast vector multiply-accumulate method 700a of the present invention is well suited to inner product operations of artificial neural networks, and by implementing the vector multiply-accumulate operation through the specific arrangement shift step S22 combined with the self-accumulation step S24, it not only greatly reduces computational hardware complexity, latency and power consumption, but also shrinks chip area and saves manufacturing cost. Table 4 shows that the total number of full adders used by the hardware of the present invention is lower than that of a conventional direct multiply-accumulate implementation.

[Table 4: comparison of the total number of full adders against a conventional direct multiply-accumulate implementation; shown as an image in the original document.]

As can be seen from the above embodiments, the present invention has the following advantages. First, implementing the vector inner product operation through the specific self-accumulator combined with an application-specific integrated circuit greatly reduces computational hardware complexity, latency and power consumption; moreover, using a multi-bit compressor combined with a bit-serial arithmetic algorithm greatly enhances the level of vector parallelism of long-vector inner products. Second, the fast vector multiply-accumulate circuit is well suited to artificial neural network hardware accelerators. Third, implementing the vector multiply-accumulate operation through the specific arrangement shift step combined with the self-accumulation step not only greatly reduces computational hardware complexity, latency and power consumption, but also shrinks chip area and saves manufacturing cost.

Although the present invention has been disclosed above by way of embodiments, these are not intended to limit the present invention. Anyone skilled in the art may make various changes and refinements without departing from the spirit and scope of the present invention; therefore, the scope of protection of the present invention shall be defined by the appended claims.

300‧‧‧self-accumulator
310‧‧‧compressor
320a, 320b‧‧‧delay elements
330‧‧‧shifter
Ms‧‧‧arrangement shift operand
X, Y, Cin‧‧‧input ports
S, Cout‧‧‧output ports
Z⁻¹‧‧‧delay symbol

Claims (12)

1. A fast vector multiply-accumulate circuit, applied to a neural network hardware accelerator and configured to compute an inner product of a multiplier vector and a multiplicand vector, the fast vector multiply-accumulate circuit comprising: a scheduler configured to arrange a plurality of multiplicands of the multiplicand vector into a plurality of arranged-and-shifted operands according to a plurality of multipliers of the multiplier vector; a self-accumulating adder signally connected to the scheduler, the self-accumulating adder comprising: a compressor having a plurality of input ports and a plurality of output ports, wherein one of the input ports sequentially receives the arranged-and-shifted operands, the compressor adds the arranged-and-shifted operands to generate a plurality of compressed operands, and the compressed operands are respectively output from the output ports; at least one shifter having a shift input port and a shift output port, the shift input port being connected to one of the output ports of the compressor, the shifter shifting one of the compressed operands; and at least two delay elements, each of the delay elements having a delay input port and a delay output port, wherein the delay input port and the delay output port of one of the delay elements are respectively connected to another one of the output ports of the compressor and one of another two of the input ports of the compressor, and the delay input port and the delay output port of the other one of the delay elements are respectively connected to the shift output port of the shifter and the other one of the another two of the input ports of the compressor; and an adder signally connected to the output ports of the self-accumulating adder, the adder adding the compressed operands to generate the inner product.

2. The fast vector multiply-accumulate circuit of claim 1, further comprising: an activation unit signally connected to the adder, the activation unit receiving the inner product and performing a nonlinear operation.

3. The fast vector multiply-accumulate circuit of claim 2, wherein the nonlinear operation comprises a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function, or a hyperbolic tangent (tanh) function.

4. The fast vector multiply-accumulate circuit of claim 2, wherein the nonlinear operation is implemented according to a piecewise quadratic approximation.
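The scheduler of claim 1 replaces per-element multiplications with shifts: every set bit of a multiplier selects a shifted copy of its multiplicand, and summing all such copies yields the inner product. The following Python sketch is a behavioral model only, not the claimed hardware; it assumes non-negative integer operands, and the function names are illustrative.

```python
def schedule_operands(multipliers, multiplicands):
    """Behavioral model of the scheduler: for every set bit of each
    multiplier (located by a priority encoder in hardware), emit the
    corresponding multiplicand shifted left by that bit position
    (a barrel shifter in hardware)."""
    for a, b in zip(multipliers, multiplicands):
        while a:
            pos = a.bit_length() - 1   # most-significant set bit
            yield b << pos             # arranged-and-shifted operand
            a &= ~(1 << pos)           # clear the encoded bit

def inner_product(multipliers, multiplicands):
    """Accumulate the streamed operands; in hardware this role is
    played by the self-accumulating adder and the final adder."""
    acc = 0
    for operand in schedule_operands(multipliers, multiplicands):
        acc += operand
    return acc

# Matches the ordinary dot product: 3*4 + 5*7 = 47
assert inner_product([3, 5], [4, 7]) == 47
```

Because only the nonzero bits of each multiplier generate work, sparse or low-precision multipliers shorten the operand stream, which is the source of the latency reduction claimed for the circuit.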
5. The fast vector multiply-accumulate circuit of claim 1, wherein the compressor is a full adder having a first input port, a second input port, a third input port, a first output port, and a second output port, wherein one of the delay elements is disposed between the first input port and the first output port, the other one of the delay elements and the shifter are disposed between the second input port and the second output port, and the third input port is signally connected to the scheduler.

6. The fast vector multiply-accumulate circuit of claim 1, wherein the compressor is a 7-to-3 compressor having a first input port, a second input port, a third input port, a fourth input port, a fifth input port, a sixth input port, a seventh input port, a first output port, a second output port, and a third output port; the two delay elements respectively represent a first delay element and a second delay element, the shifter represents a first shifter, and the self-accumulating adder further comprises a third delay element and a second shifter; the first delay element is disposed between the first input port and the first output port, the second delay element and the second shifter are disposed between the second input port and the third output port, the third delay element and the first shifter are disposed between the third input port and the second output port, and the fourth input port, the fifth input port, the sixth input port, and the seventh input port are all signally connected to the scheduler.

7. The fast vector multiply-accumulate circuit of claim 1, wherein the adder is implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder, or a ripple carry adder.

8. The fast vector multiply-accumulate circuit of claim 1, wherein the neural network hardware accelerator comprises a first-layer processing module having a first-layer output and a second-layer processing module having a second-layer input, and the fast vector multiply-accumulate circuit is disposed between the first-layer output of the first-layer processing module and the second-layer input of the second-layer processing module.

9. The fast vector multiply-accumulate circuit of claim 1, further comprising: a control processor signally connected to and controlling the scheduler, the self-accumulating adder, and the adder; wherein the neural network hardware accelerator comprises a plurality of layer processing modules, the control processor is signally connected to and monitors the layer processing modules, and the control processor generates a plurality of control signals to the scheduler, the self-accumulating adder, and the adder according to processing results of the layer processing modules, so as to determine scheduling or stop operation.

10. The fast vector multiply-accumulate circuit of claim 1, wherein the scheduler comprises: at least one priority encoder sequentially receiving the multipliers of the multiplier vector, the priority encoder determining at least one significant-bit position of each of the multipliers; and at least one barrel shifter sequentially receiving the multiplicands of the multiplicand vector and signally connected to the priority encoder, the barrel shifter shifting each corresponding multiplicand according to the significant-bit position to form the arranged-and-shifted operands.

11. The fast vector multiply-accumulate circuit of claim 1, wherein the fast vector multiply-accumulate circuit is implemented by a semiconductor process of an application-specific integrated circuit (ASIC), the semiconductor process comprising complementary metal-oxide-semiconductor (CMOS) or silicon-on-insulator (SOI).

12. The fast vector multiply-accumulate circuit of claim 1, wherein the fast vector multiply-accumulate circuit is implemented by a field-programmable gate array (FPGA).
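Claim 5's full-adder compressor accumulates in carry-save form: the sum output feeds back through one delay element, the carry output feeds back through the other delay element and a shift-left-by-one shifter, and a new operand enters the third input each cycle; the final adder (claim 7) resolves the redundant sum/carry pair once the stream ends. The sketch below is a behavioral model of that loop under the same illustrative naming as above, not the claimed circuit itself.

```python
def full_adder_word(x, y, z):
    """Bitwise full adder over whole words: returns (sum, carry)
    without propagating carries -- the 3:2 compressor of claim 5."""
    s = x ^ y ^ z
    c = (x & y) | (y & z) | (x & z)
    return s, c

def self_accumulate(operands):
    """Carry-save self-accumulation loop: the registered sum and the
    registered, left-shifted carry are fed back into two compressor
    inputs while each new operand enters the third input."""
    s = c = 0
    for op in operands:
        s, c = full_adder_word(s, c << 1, op)
    # Final carry-propagating adder resolves the redundant form.
    return s + (c << 1)

# Accumulating the operand stream from the earlier example:
assert self_accumulate([8, 4, 28, 7]) == 47
```

Because the feedback path contains only one full-adder delay rather than a full carry-propagating addition, the loop can accept one operand per cycle regardless of word width, which is the latency advantage the claims attribute to the self-accumulating adder.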
TW107114790A 2018-03-02 2018-05-01 Fast vector multiplication and accumulation circuit TWI688895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/190,129 US10908879B2 (en) 2018-03-02 2018-11-13 Fast vector multiplication and accumulation circuit

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862637399P 2018-03-02 2018-03-02
US62/637,399 2018-03-02

Publications (2)

Publication Number Publication Date
TW201939266A TW201939266A (en) 2019-10-01
TWI688895B true TWI688895B (en) 2020-03-21

Family

ID=69023134

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107114790A TWI688895B (en) 2018-03-02 2018-05-01 Fast vector multiplication and accumulation circuit

Country Status (1)

Country Link
TW (1) TWI688895B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI770668B (en) * 2019-11-25 2022-07-11 旺宏電子股份有限公司 Operation method for artificial neural network
TWI777231B (en) * 2020-08-28 2022-09-11 國立中正大學 Device for computing an inner product of vectors

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3752971A (en) * 1971-10-18 1973-08-14 Hughes Aircraft Co Expandable sum of cross product multiplier/adder module
CN1278341A (en) * 1997-10-28 2000-12-27 爱特梅尔股份有限公司 Fast regular multiplier architecture
US20090063608A1 (en) * 2007-09-04 2009-03-05 Eric Oliver Mejdrich Full Vector Width Cross Product Using Recirculation for Area Optimization
US8959137B1 (en) * 2008-02-20 2015-02-17 Altera Corporation Implementing large multipliers in a programmable integrated circuit device


Also Published As

Publication number Publication date
TW201939266A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
Jang et al. Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC
Samimi et al. Res-DNN: A residue number system-based DNN accelerator unit
Tenca et al. High-radix design of a scalable modular multiplier
US7774400B2 (en) Method and system for performing calculation operations and a device
Dadda et al. Pipelined adders
US11599779B2 (en) Neural network circuitry having approximate multiplier units
Olivieri Design of synchronous and asynchronous variable-latency pipelined multipliers
TWI688895B (en) Fast vector multiplication and accumulation circuit
Filippas et al. Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines
US6513054B1 (en) Asynchronous parallel arithmetic processor utilizing coefficient polynomial arithmetic (CPA)
Lee et al. Energy-efficient high-speed ASIC implementation of convolutional neural network using novel reduced critical-path design
US10908879B2 (en) Fast vector multiplication and accumulation circuit
CN109284085B (en) High-speed modular multiplication and modular exponentiation operation method and device based on FPGA
Mehta et al. High speed SRT divider for intelligent embedded system
Devic et al. Highly-adaptive mixed-precision MAC unit for smart and low-power edge computing
Pawar et al. Review on multiply-accumulate unit
EP3610367B1 (en) Energy-efficient variable power adder and methods of use thereof
Hazarika et al. Shift and accumulate convolution processing unit
Wu et al. High-speed power-efficient coarse-grained convolver architecture using depth-first compression scheme
Samanth et al. A novel approach to develop low power MACs for 2D image filtering
Bhadra et al. Design and Analysis of High-Throughput Two-Cycle Multiply-Accumulate (MAC) Architectures for Fixed-Point Arithmetic
Kannappan et al. A Survey on Multi-operand Adder
Ramezani et al. An efficient look up table based approximate adder for field programmable gate array
Dhanya et al. Vedic Multiplier and Wallace Tree Adders Based Optimised Processing Element Unit for CNN on FPGA
Prathyusha et al. Designing a Mac Unit Using Approximate Multiplier