TWI688895B - Fast vector multiplication and accumulation circuit - Google Patents
- Publication number: TWI688895B
- Application number: TW107114790A
- Authority: TW (Taiwan)
Abstract
Description
The present invention relates to a fast vector multiply-accumulate circuit, and in particular to a fast vector multiply-accumulate circuit for use in a neural network hardware accelerator.
A neural network is a machine learning model that uses one or more layers to produce an output (for example, a classification) for a received input. Some neural networks contain one or more hidden layers in addition to the output layer. The output of each hidden layer serves as the input to the next layer in the network (the next hidden layer or the output layer). Each layer of the network generates an output from a received input according to the current values of its own set of parameters.
Some neural networks contain one or more convolutional layers. Each convolutional layer has an associated set of kernels, and each kernel contains values established by a neural network model created by a user. In some embodiments, a kernel identifies a particular image contour, shape, or color. A kernel can be represented as a matrix of weight inputs. Each convolutional layer can also process a set of activation inputs, which can likewise be represented as a matrix.
Some conventional systems perform the computation of a given convolutional layer in software. For example, the software can apply each kernel of a layer to the set of activation inputs: for each kernel, the software overlays the kernel (which can be represented multi-dimensionally) on a first portion of the activation inputs (which can also be represented multi-dimensionally), and then computes an inner product from the overlapping elements. The inner product corresponds to a single activation input, namely the activation input element at the top-left position of the overlapped multi-dimensional space. Using a sliding window, the software then shifts the kernel to overlay a second portion of the activation inputs and computes another inner product corresponding to another activation input, repeating this procedure until every activation input has a corresponding inner product. In some other embodiments, the inner-product results are fed to an activation function that produces activation values, which may be combined (i.e., pooled) before being sent to a subsequent layer of the neural network.
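As an illustration of the sliding-window procedure just described, the following is a minimal NumPy sketch written for this description (it is not taken from the patent) of a stride-1 "valid" convolution, where each output element is the inner product of the kernel with one overlapped patch:

```python
import numpy as np

def conv2d_valid(activations, kernel):
    """Slide the kernel over the activation map with stride 1 and take
    an inner product at each overlap ("valid" convolution, no padding)."""
    kh, kw = kernel.shape
    ah, aw = activations.shape
    out = np.empty((ah - kh + 1, aw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product of the kernel and the overlapped patch;
            # the result belongs to the top-left (i, j) of the overlap
            out[i, j] = np.sum(activations[i:i + kh, j:j + kw] * kernel)
    return out
```

This nested-loop form makes the repeated inner products explicit, which is exactly the cost the circuit described below is designed to reduce.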
One way to compute a convolution requires a large buffer to hold the activation tensor and the kernel tensor. A general-purpose processor computes the matrix multiplication through directly implemented multipliers. Although matrix multiplication is compute- and time-intensive, the processor must repeatedly compute the individual products and sums of the convolution, which limits parallelization and greatly increases computational complexity and power consumption.
There is therefore a need for a fast vector multiply-accumulate circuit that can greatly increase vectorization and reduce power consumption, and practitioners in the field are seeking such a solution.
Accordingly, an object of the present invention is to provide a fast vector multiply-accumulate circuit that is suitable for neural network hardware accelerators and that realizes the vector inner product through a specific self-accumulator combined with an application-specific integrated circuit, greatly increasing vectorization while greatly reducing hardware complexity, latency, and power consumption.
According to one structural embodiment of the present invention, a fast vector multiply-accumulate circuit is provided, which is applied to a neural network hardware accelerator and computes an inner product of a multiplier vector and a multiplicand vector. The fast vector multiply-accumulate circuit includes an arrangement shifter, a self-accumulator, and an adder. The arrangement shifter arranges the multiplicands of the multiplicand vector into a plurality of arrangement-shift operands according to the multipliers of the multiplier vector. The self-accumulator is signal-connected to the arrangement shifter and includes a compressor, at least two delay elements, and at least one shifter. The compressor has a plurality of input ports and a plurality of output ports; one of the input ports receives the arrangement-shift operands in sequence, and the compressor adds them to produce a plurality of compressed operands, which are output from the output ports. The two delay elements are signal-connected to two other input ports of the compressor, and one of the delay elements is signal-connected to one of the output ports. The shifter is connected between the other output port and the other delay element and shifts one of the compressed operands. Finally, the adder is signal-connected to the output ports of the self-accumulator and adds the compressed operands to produce the inner product.
In this way, the fast vector multiply-accumulate circuit of the present invention realizes the vector inner product through a specific self-accumulator combined with an application-specific integrated circuit, which both greatly increases vectorization and greatly reduces hardware complexity, latency, and power consumption.
In other embodiments of the foregoing fast vector multiply-accumulate circuit, the circuit may include an activation unit signal-connected to the adder; the activation unit receives the inner product and performs a nonlinear operation.
In other embodiments, the nonlinear operation may include a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function, or a hyperbolic tangent (tanh) function.
In other embodiments, the nonlinear operation may be implemented by a piecewise quadratic approximation.
In other embodiments, the compressor may be a full adder having a first input port, a second input port, a third input port, a first output port, and a second output port. One delay element is disposed between the first input port and the first output port, the other delay element and the shifter are disposed between the second input port and the second output port, and the third input port is signal-connected to the arrangement shifter.
In other embodiments, the compressor may be a 7-to-3 compressor having first through seventh input ports and first through third output ports. The two delay elements are a first delay element and a second delay element, the shifter is a first shifter, and the self-accumulator further includes a third delay element and a second shifter. The first delay element is disposed between the first input port and the first output port, the second delay element and the second shifter are disposed between the second input port and the third output port, and the third delay element and the first shifter are disposed between the third input port and the second output port. The fourth through seventh input ports are all signal-connected to the arrangement shifter.
In other embodiments, the adder may be implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder, or a ripple carry adder.
In other embodiments, the neural network hardware accelerator may include a first-layer processing module having a first-layer output and a second-layer processing module having a second-layer input, with the fast vector multiply-accumulate circuit disposed between the first-layer output of the first-layer processing module and the second-layer input of the second-layer processing module.
In other embodiments, the fast vector multiply-accumulate circuit may include a control processor that is signal-connected to and controls the arrangement shifter, the self-accumulator, and the adder. The neural network hardware accelerator includes a plurality of layer processing modules, which the control processor is signal-connected to and monitors. Based on the processing results of these layer processing modules, the control processor issues control signals to the arrangement shifter, the self-accumulator, and the adder to determine scheduling or to halt operation.
In other embodiments, the arrangement shifter may include at least one priority encoder and at least one barrel shifter. The priority encoder receives the multipliers of the multiplier vector in sequence and determines at least one significant-bit position of each multiplier. The barrel shifter receives the multiplicands of the multiplicand vector in sequence and is signal-connected to the priority encoder; it shifts each corresponding multiplicand according to the significant-bit positions to form the arrangement-shift operands.
In other embodiments, the fast vector multiply-accumulate circuit may be realized in a semiconductor process of an application-specific integrated circuit (ASIC), such as a complementary metal-oxide-semiconductor (CMOS) or silicon-on-insulator (SOI) process.
In other embodiments, the fast vector multiply-accumulate circuit may be realized in a field-programmable gate array (FPGA).
100‧‧‧fast vector multiply-accumulate circuit
102‧‧‧dynamic random access memory (DRAM)
104‧‧‧global buffer memory (GLB)
110, 110a‧‧‧neural network hardware accelerator
200‧‧‧arrangement shifter
210‧‧‧priority encoder
220a, 220b‧‧‧barrel shifter
230‧‧‧delay element
240‧‧‧switch element
300, 300a‧‧‧self-accumulator
310‧‧‧compressor
310a‧‧‧7-3 compressor
320a, 320b, 320c‧‧‧delay element
330, 330a, 330b‧‧‧shifter
400, 400a‧‧‧adder
410a, 410b‧‧‧parallel-in/serial-out module
420‧‧‧full adder
430‧‧‧serial-in/parallel-out module
440‧‧‧XOR gate
450‧‧‧priority encoder
460‧‧‧counter
470‧‧‧comparator
EP, EP0–EP7, DONE‧‧‧priority-encoder output ports
x, x0–x7‧‧‧barrel-shift input ports
y, y0–y7‧‧‧barrel-shift output ports
w, w0–w7‧‧‧barrel-shift control ports
X, Y, Cin, X0–X6, pi[15:0], Si, FSM‧‧‧input ports
S, Cout, Y0, Y1, Y2, So, po[15:0], EQ‧‧‧output ports
Z, Z[15:0], Z[MSB:1], Z[MSB:ni+1]‧‧‧inner product
READY, LOAD, PROC, FIFO_WEN, RST, COUNTER, q[15:0], q[15], CX, xser, yser, Lx0, Lx1, Wx[0]0–Wx[n+3]0, Wx[0]1, Wx[1]1, Ly0, Ly1, Wy[0]0–Wy[n+3]0, Wy[0]1, Wy[1]1, CZ[0]0–CZ[n+3]0, CZ[0]1, CZ[1]1, CEP0, CEP1, WZ[0]0–WZ[n+3]0, WZ[0]1‧‧‧signals
500‧‧‧control processor
600‧‧‧activation unit
700, 700a‧‧‧fast vector multiply-accumulate method
S12, S22‧‧‧arrangement-shift step
S14, S24‧‧‧self-accumulation step
S16, S26‧‧‧addition step
S222‧‧‧priority-encoding step
S224‧‧‧barrel-shift step
S242‧‧‧summing step
S244‧‧‧delay step
S246‧‧‧shift step
S28‧‧‧activation step
Mr‧‧‧multiplier vector
Mc‧‧‧multiplicand vector
Z⁻¹‧‧‧delay element symbol
Mc[0], Mc[1], Mc[2]‧‧‧multiplicands
Ms, Ms0–Ms3‧‧‧arrangement-shift operands
S[n], Cout[n], S[0]–S[3], Cout[0]–Cout[3], S[15:0], Cout[15:0]‧‧‧compressed operands
M, M0–M7‧‧‧priority-encoder input ports
P0–P8, Pn, Pn+1‧‧‧priority control signals
n‧‧‧integer
NOP‧‧‧no operation
FIG. 1 is a circuit architecture diagram of a neural network hardware accelerator according to an embodiment of the present invention.
FIG. 2 is a circuit architecture diagram of the fast vector multiply-accumulate circuit of the embodiment of FIG. 1.
FIG. 3A is a circuit architecture diagram of the arrangement shifter of FIG. 2.
FIG. 3B is a circuit architecture diagram of the priority encoder of FIG. 3A.
FIG. 3C is a circuit architecture diagram of the barrel shifter of FIG. 3A.
FIG. 3D is a pipeline timing diagram of the arrangement shifter of FIG. 3A.
FIG. 4A is a circuit architecture diagram of the self-accumulator according to the embodiment of FIG. 2.
FIG. 4B is a pipeline timing diagram of the self-accumulator of FIG. 4A.
FIG. 5 is a circuit architecture diagram of the adder of the embodiment of FIG. 2.
FIG. 6 is a circuit architecture diagram of an adder according to another embodiment of FIG. 2.
FIG. 7 is a pipeline timing diagram of the adder of FIG. 6.
FIG. 8 is a flowchart of a fast vector multiply-accumulate method according to an embodiment of the present invention.
FIG. 9 is a circuit architecture diagram of a neural network hardware accelerator according to another embodiment of FIG. 1.
FIG. 10 is a circuit architecture diagram of a self-accumulator according to another embodiment of FIG. 2.
FIG. 11 is a flowchart of a fast vector multiply-accumulate method according to another embodiment of the present invention.
A plurality of embodiments of the present invention will be described below with reference to the drawings. For clarity, many practical details are explained in the following description. It should be understood, however, that these practical details are not intended to limit the present invention; in some embodiments they are unnecessary. In addition, to simplify the drawings, some conventional structures and elements are shown schematically, and repeated elements may be denoted by the same reference numerals.
Please refer to FIGS. 1 and 2. FIG. 1 is a circuit architecture diagram of a neural network hardware accelerator 110 according to an embodiment of the present invention, and FIG. 2 is a circuit architecture diagram of the fast vector multiply-accumulate circuit 100 of the accelerator of FIG. 1. As shown, the neural network hardware accelerator 110 includes a dynamic random access memory 102 (DRAM), a global buffer memory 104 (GLB), a plurality of fast vector multiply-accumulate circuits 100, and a control processor 500. The fast vector multiply-accumulate circuit 100 is applied to the neural network hardware accelerator 110 and computes the inner product Z of a multiplier vector Mr and a multiplicand vector Mc. The fast vector multiply-accumulate circuit 100 includes an arrangement shifter 200, a self-accumulator 300, and an adder 400.
The arrangement shifter 200 arranges the multiplicands of the multiplicand vector Mc into a plurality of arrangement-shift operands Ms according to the respective multipliers of the multiplier vector Mr. Equation (1) and Table 1 illustrate this. Equation (1) expresses the inner product of a multiplier vector Mr and a multiplicand vector Mc:

Z = Mr · Mc = Σi Mr[i] · Mc[i]  (1)

Table 1 (an image in the original publication, not reproduced here) lists the numerical results obtained when the inner product of equation (1) is computed with the fast vector multiply-accumulate circuit 100 of FIG. 2; its rows are referenced in the example below.
Referring to equation (1) and Table 1, suppose the multiplicand vector Mc has three multiplicands Mc[0], Mc[1], and Mc[2], whose decimal values are 10, 15, and 3 and whose binary representations are 00001010, 00001111, and 00000011, respectively. The multiplier vector Mr has three multipliers whose decimal values are 7, 4, and 9 and whose binary representations are 00000111, 00000100, and 00001001. When the first multiplicand Mc[0] (10dec; 00001010bin) is multiplied by its multiplier (7dec; 00000111bin), the arrangement shifter 200 performs an arrangement shift on Mc[0] for each of the three 1s in the multiplier 00000111, producing three arrangement-shift operands Ms: 00001010, 00010100, and 00101000. The first operand is Mc[0] itself, the second is Mc[0] shifted left by one bit, and the third is Mc[0] shifted left by two bits, as shown in rows 1–3 of Table 1. When the second multiplicand Mc[1] (15dec; 00001111bin) is multiplied by its multiplier (4dec; 00000100bin), the arrangement shifter 200 produces, for the single 1 in 00000100, one arrangement-shift operand Ms of 00111100 — Mc[1] shifted left by two bits — as shown in row 6 of Table 1. When the third multiplicand Mc[2] (3dec; 00000011bin) is multiplied by its multiplier (9dec; 00001001bin), the arrangement shifter 200 produces, for the two 1s in 00001001, two arrangement-shift operands Ms: 00000011 and 00011000. The first is Mc[2] itself and the second is Mc[2] shifted left by three bits, as shown in rows 9 and 12 of Table 1.
The self-accumulator 300 is signal-connected to the arrangement shifter 200 and adds the arrangement-shift operands Ms to produce compressed operands S[n] and Cout[n], where n is an integer greater than or equal to 0. Continuing the example of equation (1) and Table 1, the self-accumulator 300 performs four successive additions. The first addition adds the three arrangement-shift operands Ms (Mc[0] = 00001010, Mc[0]<<1 = 00010100, and Mc[0]<<2 = 00101000) to produce the two compressed operands S[0] and Cout[0], as shown in rows 4 and 5 of Table 1. The second addition adds the two compressed operands S[0] and Cout[0] and the arrangement-shift operand Mc[1]<<2 = 00111100 to produce S[1] and Cout[1], as shown in rows 7 and 8. The third addition adds S[1], Cout[1], and the operand Mc[2] = 00000011 to produce S[2] and Cout[2], as shown in rows 10 and 11. The fourth addition adds S[2], Cout[2], and the operand Mc[2]<<3 = 00011000 to produce S[3] and Cout[3], as shown in rows 13 and 14.
The adder 400 is signal-connected to the output ports S and Cout of the self-accumulator 300 and adds the compressed operands S[3] and Cout[3] to produce the inner product Z, as shown in row 15 of Table 1. The adder 400 may be implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder, or a ripple carry adder.
The control processor 500 is signal-connected to and controls the arrangement shifter 200, the self-accumulator 300, and the adder 400. The control processor 500 may be a central processing unit (CPU), a micro-control unit (MCU), or other control logic. It is worth noting that the neural network hardware accelerator 110 includes a plurality of layer processing modules (not shown); the control processor 500 is signal-connected to and monitors these modules and, based on their processing results, issues control signals to the arrangement shifter 200, the self-accumulator 300, and the adder 400 to determine scheduling or to halt operation. In other embodiments, the neural network hardware accelerator 110 may include a first-layer processing module with a first-layer output and a second-layer processing module with a second-layer input, with the fast vector multiply-accumulate circuit 100 disposed between them to process the output signals of the first-layer processing module. It is also worth noting that the fast vector multiply-accumulate circuit 100 may be realized in a semiconductor process of an application-specific integrated circuit (ASIC), such as a complementary metal-oxide-semiconductor (CMOS) or silicon-on-insulator (SOI) process, or in a field-programmable gate array (FPGA). The fast vector multiply-accumulate circuit 100 of the present invention is thus well suited to neural network hardware accelerators 110; by realizing the vector inner product through a specific self-accumulator 300 combined with an application-specific integrated circuit, it greatly reduces hardware complexity, latency, and power consumption.
Please refer to FIGS. 2, 3A, 3B, 3C, and 3D. FIG. 3A is a circuit architecture diagram of the arrangement shifter 200 of FIG. 2; FIG. 3B is a circuit architecture diagram of its priority encoder 210; FIG. 3C is a circuit architecture diagram of its barrel shifter 220a; and FIG. 3D is a pipeline timing diagram of the arrangement shifter 200. As shown, the arrangement shifter 200 includes one priority encoder 210, two barrel shifters 220a and 220b, five delay elements 230, and four switch elements 240.
The priority encoder 210 receives the multipliers of the multiplier vector Mr in sequence and determines at least one significant-bit position of each multiplier, that is, the positions in the multiplier whose value is 1. The priority encoder 210 has eight input ports M0–M7, nine priority control signals P0–P8, eight output ports EP0–EP7, and a READY signal. The input ports M0–M7 receive a multiplier of the multiplier vector Mr. The priority control signals P0–P8 are internal signals of the priority encoder 210 representing the priority state: once Pn is 0, the subsequent signals Pn+1–P8 can no longer acquire priority, and P0 equals 1 (logic true). The priority encoder 210 consists of nineteen AND gates and nine inverters connected as shown in FIG. 3B; through this chain of AND gates and inverters, the output ports EP0–EP7 indicate a position in the multiplier whose value is 1. Taking equation (1) as an example, if the multiplier is 7 (00000111), the corresponding input ports M0–M7 carry 1, 1, 1, 0, 0, 0, 0, 0, and the priority encoder 210 outputs EP0–EP7 of 1, 0, 0, 0, 0, 0, 0, 0. In other words, if the multiplier is nonzero, exactly one of the outputs EP0–EP7 is 1 and the rest are 0; if the multiplier is zero, EP0–EP7 are all 0.
The barrel shifters 220a and 220b have identical structures; each includes a plurality of tri-state buffers, eight shift input ports x0–x7, eight shift output ports y0–y7, and eight shift control ports w0–w7, connected as shown in FIG. 3C. The shift control ports w0–w7 are connected to the priority-encoder output ports EP0–EP7 of FIG. 3B, respectively. The barrel shifter 220a receives the multipliers of the multiplier vector Mr in sequence and is signal-connected to the priority encoder 210; it shifts each multiplier according to the significant-bit position. The barrel shifter 220b receives the multiplicands Mc[0], Mc[1], and Mc[2] of the multiplicand vector Mc in sequence and is likewise signal-connected to the priority encoder 210; it shifts each corresponding multiplicand according to the significant-bit position to form the arrangement-shift operands Ms. The multiplier vector Mr and the multiplicand vector Mc are shifted a plurality of times according to the priority-encoding result of Mr; each shift is determined by the switch elements 240, and the result is output once a shift completes. The LOAD signal controls the arrangement shifter 200 and indicates the loading of a new multiplier vector Mr and multiplicand vector Mc, and the arrangement shifter 200 generates the signals READY, PROC, and FIFO_WEN so that the arrangement-shift operands Ms are arranged correctly and output to the self-accumulator 300. READY indicates that all shifts have completed; PROC indicates a shift operation in progress; FIFO_WEN indicates that one shift has completed and a set of shift operands is written to the next-stage input.
The delay elements 230 and the switch elements 240 are both controlled by the control processor 500; proper control keeps the signals at the input and output ports of the priority encoder 210 and the barrel shifters 220a and 220b correctly aligned in time, increasing pipeline efficiency. The delay elements 230 delay signals, while the switch elements 240 decide whether to load a new multiplier vector Mr and multiplicand vector Mc or to continue shifting the values on the feedback path. As the pipeline timing of FIG. 3D shows, after one multiplicand of the multiplicand vector Mc and one multiplier of the multiplier vector Mr are input to the arrangement shifter 200 in one clock cycle (e.g., cycle = 1), the priority encoder 210, the barrel shifters 220a and 220b, and the arrangement-shift operands Ms all produce their corresponding outputs in the next clock cycle (e.g., cycle = 2). In FIG. 3D, the signal label L denotes Load, C denotes Compute (ALU), and W denotes Write.
Please refer to FIGS. 2, 4A, and 4B. FIG. 4A is a circuit architecture diagram of the self-accumulator 300 of the embodiment of FIG. 2, and FIG. 4B is its pipeline timing diagram. As shown, the self-accumulator 300 of the present invention includes one compressor 310, at least two delay elements 320a and 320b, and at least one shifter 330. The compressor 310 has input ports X, Y, and Cin and output ports S and Cout; the input port Cin receives the arrangement-shift operands Ms in sequence. The compressor 310 adds the arrangement-shift operands Ms to produce compressed operands S[n] and Cout[n], which are output from the ports S and Cout, respectively. The two delay elements 320a and 320b are signal-connected to the other two input ports X and Y, with the delay element 320a signal-connected to the output port S; the shifter 330 is connected between the output port Cout and the delay element 320b and shifts the compressed operand Cout[n]. In detail, the compressor 310 of this embodiment is a full adder (FA) having a first input port X, a second input port Y, a third input port Cin, a first output port S, and a second output port Cout. This full adder is a 3-to-2 compressor whose truth table is given in Table 2 (an image in the original publication; it is the standard full-adder truth table, with S = X ⊕ Y ⊕ Cin and Cout = majority(X, Y, Cin)). The delay element 320a is disposed between the first input port X and the first output port S, the delay element 320b and the shifter 330 are disposed between the second input port Y and the second output port Cout, and the third input port Cin is signal-connected to the arrangement shifter 200. As the pipeline timing of FIG. 4B shows, after n+5 clock cycles the first output port S and the second output port Cout deliver the correct compressed operands S[n] and Cout[n] for use by the subsequent circuit (e.g., the adder 400). An input register and an output register (FIFO) are coupled to the input and output of the self-accumulator 300, respectively, and both are controlled by the control processor 500.
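Since Table 2 is reproduced only as an image, the 3-to-2 compression can be stated directly: a full adder maps three input bits to a sum bit and a carry bit while preserving their arithmetic total. A Python check of that identity (written for this description):

```python
def full_adder(x, y, cin):
    """3-to-2 compressor on single bits: S = X xor Y xor Cin,
    Cout = majority(X, Y, Cin)."""
    return x ^ y ^ cin, (x & y) | (x & cin) | (y & cin)

# All eight rows of the truth table satisfy X + Y + Cin = S + 2*Cout,
# which is exactly why the compression preserves the accumulated value.
for x in (0, 1):
    for y in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(x, y, cin)
            assert x + y + cin == s + 2 * cout
```

The factor of 2 on Cout is realized in hardware by the shifter 330, which moves the carry word one bit position to the left before it re-enters the compressor.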
Please refer to FIGS. 2, 5, 6, and 7. FIG. 5 is a circuit architecture diagram of the adder 400 of the embodiment of FIG. 2, FIG. 6 is a circuit architecture diagram of an adder 400a of another embodiment, and FIG. 7 is the pipeline timing diagram of the adder 400a. As shown in FIG. 5, the adder 400 includes two parallel-in/serial-out (PISO) modules 410a and 410b, one full adder 420, and one serial-in/parallel-out (SIPO) module 430, with the full adder 420 connected between the PISO modules 410a and 410b and the SIPO module 430. The adder 400a of FIG. 6 further includes an XOR gate 440, a priority encoder 450, a counter 460, and a comparator 470. The XOR gate 440 is coupled to the first output port S and the second output port Cout and outputs to the priority encoder 450 and the SIPO module 430; the priority encoder 450 and the counter 460 are both connected to the comparator 470. In the comparator 470, the output port EQ equals 1 when the value at input port X equals the value at input port Y, and 0 otherwise. The XOR gate 440, priority encoder 450, counter 460, and comparator 470 together use the compressed operands S[15:0] and Cout[15:0] to generate a READY signal that identifies the most significant active bit among the 16 bits of the signal q[15:0]. READY = 0 means the most significant active bit of q[15:0] has not yet been found; READY = 1 means it has been found, which allows the adder 400 to stop early and thereby save computation and power. For example, if q[15:0] = 0000000011111111, then q[7] is the most significant active bit and the values of S[15:8] and Cout[15:8] are all 0, so the adder 400 need not process the addition of S[15:8] and Cout[15:8]. As the pipeline timing of FIG. 7 shows, after n+5 clock cycles the SIPO module 430 outputs the correct inner product Z for use by subsequent circuits (e.g., the activation unit 600); in FIG. 7 the signal RST denotes Reset. Through this signal-based decision in a dedicated circuit, the adders 400 and 400a of the present invention save substantial computation and power.
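The early-termination idea can be modeled in software as a bit-serial ripple addition of S and the re-aligned Cout that stops as soon as no operand bits or pending carry remain above the current position. This is a behavioral analogue written for this description — the actual circuit derives READY from S XOR Cout through the priority encoder 450, counter 460, and comparator 470 — and it assumes the operands fit within the stated width:

```python
def serial_add_early_stop(s, c, width=16):
    """Bit-serial addition of S and Cout (carry word re-entering one bit
    to the left), stopping once the remaining high bits of both operands
    and the ripple carry are all zero."""
    y = c << 1
    z = carry = 0
    cycles = 0
    for i in range(width):
        if (s >> i) == 0 and (y >> i) == 0 and carry == 0:
            break  # most significant active bit already processed
        a, b = (s >> i) & 1, (y >> i) & 1
        z |= (a ^ b ^ carry) << i
        carry = (a & b) | (a & carry) | (b & carry)
        cycles += 1
    return z, cycles
```

For operands whose active bits sit well below the top of the word, the loop exits after only a few iterations, mirroring the cycles (and switching power) the READY logic saves in hardware.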
Please refer to FIGS. 2 and 8. FIG. 8 is a flowchart of a fast vector multiply-accumulate method 700 according to an embodiment of the present invention; it can be used with the fast vector multiply-accumulate circuit 100 of FIG. 2, although the invention is not limited thereto. The fast vector multiply-accumulate method 700 includes an arrangement-shift step S12, a self-accumulation step S14, and an addition step S16. The arrangement-shift step S12 uses the arrangement shifter 200 to arrange the multiplicands of the multiplicand vector Mc into arrangement-shift operands Ms according to the multipliers of the multiplier vector Mr. The self-accumulation step S14 uses the self-accumulator 300 to add the arrangement-shift operands Ms into compressed operands S[n] and Cout[n]. The addition step S16 uses the adder 400 to add the compressed operands S[n] and Cout[n] into the inner product Z. The fast vector multiply-accumulate method 700 of the present invention is thus very suitable for neural network inner products; by pairing the specific arrangement-shift step S12 with the self-accumulation step S14 to realize the vector inner product, it greatly reduces hardware complexity, latency, and power consumption.
Please refer to FIGS. 1, 2, and 9. FIG. 9 is a circuit architecture diagram of a neural network hardware accelerator 110a according to another embodiment of FIG. 1. The neural network hardware accelerator 110a includes the fast vector multiply-accumulate circuit 100, the control processor 500, and an activation unit 600.
In the embodiment of FIG. 9, the fast vector multiply-accumulate circuit 100 and the control processor 500 are the same as those of FIG. 2 and are not described again. Notably, the neural network hardware accelerator 110a of FIG. 9 further includes an activation unit 600 signal-connected to the adder 400; the activation unit 600 receives the inner product Z and performs a nonlinear operation. The nonlinear operation includes a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function, or a hyperbolic tangent (tanh) function. In addition, the nonlinear operation may be implemented by a piecewise quadratic approximation.
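As an illustration of the piecewise quadratic idea, below is one common hardware-friendly approximation of the sigmoid on [-4, 4] with saturation outside that range. The segment boundaries and coefficients here are an assumption for illustration — the patent does not give the ones used by the activation unit 600:

```python
def sigmoid_pwq(x):
    """Piecewise quadratic approximation of the sigmoid: two quadratic
    segments on [-4, 0) and [0, 4], saturating to 0 and 1 outside."""
    if x <= -4.0:
        return 0.0
    if x >= 4.0:
        return 1.0
    if x < 0.0:
        return 0.5 * (1.0 + x / 4.0) ** 2
    return 1.0 - 0.5 * (1.0 - x / 4.0) ** 2
```

The two segments meet at (0, 0.5) and preserve the sigmoid's symmetry σ(x) + σ(-x) = 1, while needing only shifts, multiplies, and adds — no exponential — which is what makes this family of approximations attractive in hardware.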
Please refer to FIGS. 2, 9, and 10. FIG. 10 is a circuit architecture diagram of a self-accumulator 300a according to another embodiment of FIG. 2. This self-accumulator 300a can process multiplier vectors Mr and multiplicand vectors Mc with a larger amount of data at a time. The self-accumulator 300a includes one 7-to-3 compressor 310a, a first delay element 320a, a second delay element 320b, a third delay element 320c, a first shifter 330a, and a second shifter 330b. The 7-to-3 compressor 310a has a first input port X0, a second input port X1, a third input port X2, a fourth input port X3, a fifth input port X4, a sixth input port X5, a seventh input port X6, a first output port Y0, a second output port Y1, and a third output port Y2. The first delay element 320a is disposed between the first input port X0 and the first output port Y0; the second delay element 320b and the second shifter 330b are disposed between the second input port X1 and the third output port Y2; and the third delay element 320c and the first shifter 330a are disposed between the third input port X2 and the second output port Y1. The fourth input port X3, fifth input port X4, sixth input port X5, and seventh input port X6 are all signal-connected to the arrangement shifter 200. The truth table of the 7-to-3 compressor 310a of this embodiment is given in Table 3 (an image in the original publication).
Please refer to FIGS. 3A, 4A, 9, 10, and 11. FIG. 11 is a flowchart of a fast vector multiply-accumulate method 700a according to another embodiment of the present invention. The fast vector multiply-accumulate method 700a includes an arrangement-shift step S22, a self-accumulation step S24, an addition step S26, and an activation step S28.
The arrangement-shift step S22 uses the arrangement shifter 200 to arrange the multiplicands of the multiplicand vector Mc into arrangement-shift operands Ms according to the multipliers of the multiplier vector Mr. In detail, the arrangement-shift step S22 includes a priority-encoding step S222 and a barrel-shift step S224. The priority-encoding step S222 uses the priority encoder 210 to receive the multipliers of the multiplier vector Mr in sequence and determine at least one significant-bit position of each multiplier. The barrel-shift step S224 uses the barrel shifter 220b to receive the multiplicands of the multiplicand vector Mc in sequence and shift each corresponding multiplicand according to the significant-bit positions into the arrangement-shift operands Ms.
The self-accumulation step S24 uses the self-accumulator 300 to add the arrangement-shift operands Ms into compressed operands S[n] and Cout[n]. In detail, the self-accumulation step S24 includes a summing step S242, a delay step S244, and a shift step S246. The summing step S242 uses the compressor 310 to add the arrangement-shift operands Ms into the compressed operands S[n] and Cout[n]. The delay step S244 uses the delay elements 320a and 320b to delay the compressed operands S[n] and Cout[n], respectively, and feed them back to the compressor 310. The shift step S246 uses the shifter 330 to shift the compressed operand Cout[n] and pass it to the delay element 320b. The self-accumulation step S24 may employ a 3-to-2 compressor, a 7-to-3 compressor, or another form of adder as the compressor 310; the structures of the 3-to-2 and 7-to-3 compressors are shown in FIGS. 4A and 10, respectively, and are not described again.
The addition step S26 uses the adder 400 to add the compressed operands S[n] and Cout[n] into the inner product Z. The structure of the adder 400 is shown in FIGS. 5 and 6 and is not described again.
The activation step S28 uses the activation unit 600 to receive the inner product Z and perform a nonlinear operation, which includes a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function, or a hyperbolic tangent function. The fast vector multiply-accumulate method 700a of the present invention is thus well suited to neural network inner products; by pairing the specific arrangement-shift step S22 with the self-accumulation step S24 to realize the vector multiply-accumulate operation, it not only greatly reduces hardware complexity, latency, and power consumption but also shrinks chip area and saves manufacturing cost. Table 4 (an image in the original publication) shows that the total number of full adders used by the present invention is lower than that of a conventional direct multiply-accumulate implementation.
As the above embodiments show, the present invention has the following advantages. First, realizing the vector inner product through a specific self-accumulator combined with an application-specific integrated circuit greatly reduces hardware complexity, latency, and power consumption; moreover, combining a multi-bit compressor with a bit-serial arithmetic algorithm greatly raises the degree of vector parallelism for long-vector inner products. Second, the fast vector multiply-accumulate circuit is very suitable for neural network hardware accelerators. Third, pairing the specific arrangement-shift step with the self-accumulation step to realize the vector multiply-accumulate operation not only greatly reduces hardware complexity, latency, and power consumption, but also shrinks chip area and saves manufacturing cost.
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone skilled in the art may make various modifications and refinements without departing from the spirit and scope of the present invention; therefore, the scope of protection of the present invention shall be defined by the appended claims.
300‧‧‧self-accumulator
310‧‧‧compressor
320a, 320b‧‧‧delay elements
330‧‧‧shifter
Ms‧‧‧arranged displacement operands
X, Y, Cin‧‧‧input ports
S, Cout‧‧‧output ports
Z-1‧‧‧delay symbol
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/190,129 US10908879B2 (en) | 2018-03-02 | 2018-11-13 | Fast vector multiplication and accumulation circuit |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862637399P | 2018-03-02 | 2018-03-02 | |
US62/637,399 | 2018-03-02 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201939266A TW201939266A (en) | 2019-10-01 |
TWI688895B true TWI688895B (en) | 2020-03-21 |
Family
ID=69023134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW107114790A TWI688895B (en) | 2018-03-02 | 2018-05-01 | Fast vector multiplication and accumulation circuit |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI688895B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI770668B (en) * | 2019-11-25 | 2022-07-11 | 旺宏電子股份有限公司 | Operation method for artificial neural network |
TWI777231B (en) * | 2020-08-28 | 2022-09-11 | 國立中正大學 | Device for computing an inner product of vectors |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3752971A (en) * | 1971-10-18 | 1973-08-14 | Hughes Aircraft Co | Expandable sum of cross product multiplier/adder module |
CN1278341A (en) * | 1997-10-28 | 2000-12-27 | 爱特梅尔股份有限公司 | Fast regular multiplier architecture |
US20090063608A1 (en) * | 2007-09-04 | 2009-03-05 | Eric Oliver Mejdrich | Full Vector Width Cross Product Using Recirculation for Area Optimization |
US8959137B1 (en) * | 2008-02-20 | 2015-02-17 | Altera Corporation | Implementing large multipliers in a programmable integrated circuit device |
Also Published As
Publication number | Publication date |
---|---|
TW201939266A (en) | 2019-10-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jang et al. | Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC | |
Samimi et al. | Res-DNN: A residue number system-based DNN accelerator unit | |
Tenca et al. | High-radix design of a scalable modular multiplier | |
US7774400B2 (en) | Method and system for performing calculation operations and a device | |
Dadda et al. | Pipelined adders | |
US11599779B2 (en) | Neural network circuitry having approximate multiplier units | |
Olivieri | Design of synchronous and asynchronous variable-latency pipelined multipliers | |
TWI688895B (en) | Fast vector multiplication and accumulation circuit | |
Filippas et al. | Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines | |
US6513054B1 (en) | Asynchronous parallel arithmetic processor utilizing coefficient polynomial arithmetic (CPA) | |
Lee et al. | Energy-efficient high-speed ASIC implementation of convolutional neural network using novel reduced critical-path design | |
US10908879B2 (en) | Fast vector multiplication and accumulation circuit | |
CN109284085B (en) | High-speed modular multiplication and modular exponentiation operation method and device based on FPGA | |
Mehta et al. | High speed SRT divider for intelligent embedded system | |
Devic et al. | Highly-adaptive mixed-precision MAC unit for smart and low-power edge computing | |
Pawar et al. | Review on multiply-accumulate unit | |
EP3610367B1 (en) | Energy-efficient variable power adder and methods of use thereof | |
Hazarika et al. | Shift and accumulate convolution processing unit | |
Wu et al. | High-speed power-efficient coarse-grained convolver architecture using depth-first compression scheme | |
Samanth et al. | A novel approach to develop low power MACs for 2D image filtering | |
Bhadra et al. | Design and Analysis of High-Throughput Two-Cycle Multiply-Accumulate (MAC) Architectures for Fixed-Point Arithmetic | |
Kannappan et al. | A Survey on Multi-operand Adder | |
Ramezani et al. | An efficient look up table based approximate adder for field programmable gate array | |
Dhanya et al. | Vedic Multiplier and Wallace Tree Adders Based Optimised Processing Element Unit for CNN on FPGA | |
Prathyusha et al. | Designing a Mac Unit Using Approximate Multiplier |