CN111915003A - Neural network hardware accelerator - Google Patents

Neural network hardware accelerator

Info

Publication number
CN111915003A
Authority
CN
China
Prior art keywords
neural network
module
hardware accelerator
instruction
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910386225.8A
Other languages
Chinese (zh)
Other versions
CN111915003B (en)
Inventor
李文江
黄运新
冯涛
徐斌
王岩
李卫军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dapu Microelectronics Co Ltd
Original Assignee
Shenzhen Dapu Microelectronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dapu Microelectronics Co Ltd filed Critical Shenzhen Dapu Microelectronics Co Ltd
Priority to CN201910386225.8A priority Critical patent/CN111915003B/en
Priority to PCT/CN2020/087999 priority patent/WO2020224516A1/en
Publication of CN111915003A publication Critical patent/CN111915003A/en
Application granted granted Critical
Publication of CN111915003B publication Critical patent/CN111915003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30141Implementation provisions of register files, e.g. ports
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The application discloses a neural network hardware accelerator. The pipeline architecture of the hardware accelerator includes: an instruction acquisition module for acquiring instructions; an instruction decoding module for performing instruction decoding operations; a half-precision floating point operation module for performing one-dimensional vector operations; an activation function calculation module for calculating activation functions by means of a lookup table; a floating point post-processing unit for performing floating point operations on data produced after the activation function calculation; a cache module for caching intermediate data generated while the neural network algorithm is executed; and register files located at the same pipeline stage as the instruction decoding module, for temporarily storing relevant instructions, data and addresses used while the neural network algorithm is executed. The present application can greatly improve the hardware resource utilization of the hardware accelerator when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm.

Description

Neural network hardware accelerator
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a neural network hardware accelerator.
Background
Current hardware accelerators for neural networks include Google's TPU, NVIDIA's NVDLA, Cambricon, and so on. Mainstream neural network hardware accelerators perform a large amount of calculation optimization for CNN (Convolutional Neural Network) networks, with targeted hardware optimization of convolution operations and of convolution kernels of different sizes.
It can be seen that the architecture of current neural network hardware accelerators is biased towards optimization for CNN, which is indeed the most computation-intensive part of neural network algorithms. Although these neural network hardware accelerators can also be used for RNN (Recurrent Neural Network) networks, they contain very little calculation optimization for RNNs, and their calculation efficiency is poor. That is, the same hardware wastes few resources when computing a CNN but wastes many resources when computing an RNN. Wasted resources translate into relatively poor power consumption efficiency, so such a neural network hardware accelerator architecture has no advantage in power efficiency or cost.
Disclosure of Invention
In view of this, an object of the present application is to provide a neural network hardware accelerator that can greatly improve the utilization of the accelerator's hardware resources when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm. The specific scheme is as follows:
a neural network hardware accelerator, a pipeline architecture of the hardware accelerator comprising:
the instruction acquisition module is used for acquiring an instruction;
the instruction decoding module is used for carrying out instruction decoding operation;
the half-precision floating point operation module is used for performing one-dimensional vector operation in a neural network algorithm to obtain a corresponding one-dimensional vector operation result;
the activation function calculation module is used for calculating the activation function in the neural network algorithm in a lookup table mode to obtain a corresponding activation function calculation result;
the floating point post-processing unit is used for performing floating point operations on data output by the activation function calculation module during execution of the neural network algorithm;
the cache module is used for caching the intermediate data in the implementation process of the neural network algorithm;
the register files distributed in the same stage of the assembly line as the instruction decoding module are used for temporarily storing relevant instructions, data and addresses in the implementation process of the neural network algorithm;
and the write-back level module is used for performing data write-back operation.
Optionally, the register file includes:
the device comprises a vector register group, an address register group used for being matched with an adder to assist the instruction decoding module in completing address addressing and calculation processing of addresses, a common register group used for storing non-vector calculation results, a functional unit register group used for providing services for the semi-precision floating-point operation module and the activation function calculation module to reduce pipeline waiting time, and a loop register group used for providing services for loop instructions and jump instructions.
Optionally, the instruction decoding module includes:
the first-level decoding unit is used for completing addressing and calculation processing of addresses, updating numerical values of address registers and initiating access requests to the cache module;
and the second-level decoding unit is used for reading the data in the cache module to a corresponding register and writing the register data to the cache module.
Optionally, the half-precision floating-point operation module includes:
one or more half-precision floating-point calculation sub-modules, wherein each half-precision floating-point calculation sub-module is used for implementing a one-dimensional vector multiplication with a maximum length of 64;
and the result accumulation submodule is used for accumulating the calculation results of the plurality of half-precision floating point calculation submodules according to the instruction decoding result so as to obtain the corresponding one-dimensional vector operation result.
Optionally, the half-precision floating-point calculation submodule includes:
The floating-point multiplication unit is used for realizing 64 floating-point multiplication operations to obtain 64 floating-point multiplication results;
and the floating-point multiplication result accumulation unit is used for accumulating the 64 floating-point multiplication results to obtain corresponding one-dimensional vector multiplication calculation results.
Optionally, the activation function calculation module further includes:
a look-up table configuration module; wherein the look-up table configuration module comprises a look-up table.
Optionally, the cache module includes:
the execution instruction cache unit is used for caching the instruction parameters of the execution instruction;
an input data caching unit for caching input data;
the weight vector caching unit is used for caching weight vectors in the neural network algorithm;
the offset parameter caching unit is used for caching the offset parameters of the weight vector;
the temporary data caching unit is used for caching temporary non-vector data;
and the output data caching unit is used for caching the output data.
Optionally, the instruction set of the hardware accelerator includes:
null instructions, vector multiply operation instructions, data transfer instructions, loop instructions, and register clear instructions.
Optionally, the neural network algorithm is an RNN neural network algorithm.
Optionally, the RNN neural network algorithm includes an LSTM algorithm, a GRN algorithm, or a bidirectional LSTM algorithm.
The pipeline architecture of the neural network hardware accelerator of the present application comprises an instruction acquisition module, an instruction decoding module, a half-precision floating point operation module, an activation function calculation module, a floating point post-processing unit, a cache module, register files and a write-back stage module. The half-precision floating point operation module performs one-dimensional vector operations in the neural network algorithm; the activation function calculation module calculates activation functions in the neural network algorithm by means of a lookup table; the floating point post-processing unit performs floating point operations on data output by the activation function calculation module during execution of the neural network algorithm; the cache module caches intermediate data generated during execution of the neural network algorithm; and the register files, located in the pipeline at the same stage as the instruction decoding module, temporarily store relevant instructions, data and addresses used during execution of the neural network algorithm. Through this pipeline architecture design, the hardware accelerator of the present application better matches the operational characteristics of RNNs, and the pipeline relies on hardware scheduling as much as possible, avoiding software intervention. The present application can therefore greatly improve the utilization of the accelerator's hardware resources when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm, while also improving the utilization of each unit of hardware resource.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a neural network hardware accelerator pipeline architecture according to the present disclosure;
FIG. 2 is a diagram of a specific neural network hardware accelerator pipeline architecture disclosed herein;
FIG. 3 is a schematic diagram of an activation function lookup table;
FIG. 4 is a diagram of a register file.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, the architecture of neural network hardware accelerators is biased towards optimization for CNN. Although these neural network hardware accelerators can also be used for RNN neural networks, they contain very little calculation optimization for RNNs, their calculation efficiency is poor, and many resources are wasted. Wasted resources translate into relatively poor power consumption efficiency, so such an architecture has no advantage in power efficiency or cost. The present application therefore provides a neural network hardware accelerator that can overcome the above technical problems.
Referring to fig. 1, an embodiment of the present application discloses a neural network hardware accelerator, where a pipeline architecture of the hardware accelerator includes:
an instruction obtaining module 11, configured to obtain an instruction;
an instruction decoding module 12, configured to perform an instruction decoding operation;
the half-precision floating point operation module 13 is used for performing one-dimensional vector operation in a neural network algorithm to obtain a corresponding one-dimensional vector operation result;
an activation function calculation module 14, configured to calculate an activation function in the neural network algorithm in a lookup table manner to obtain a corresponding activation function calculation result;
a floating point post-processing unit 15, configured to perform floating point operations on data output by the activation function calculation module 14 during execution of the neural network algorithm;
the cache module 16 is used for caching the intermediate data in the implementation process of the neural network algorithm;
the register files 17 distributed in the same stage of the pipeline as the instruction decoding module 12 are used for temporarily storing relevant instructions, data and addresses in the implementation process of the neural network algorithm;
and a write-back stage module 18 for performing data write-back operations.
It should be noted that the neural network algorithm in this embodiment mainly refers to an RNN neural network algorithm, and may specifically include, but is not limited to, an LSTM (Long Short-Term Memory) algorithm, a GRN (Generalized Regression Neural) algorithm, or a bidirectional LSTM algorithm.
In addition, it should be noted that, considering that the calculations in an RNN neural network algorithm vary widely and that many calculations require further data operations after the activation function, a floating point post-processing unit 15 is added to the neural network hardware accelerator of this embodiment to complete post-processing after the activation function has been calculated. The post-processing may include:
1. scaling up or down by a given factor, where the floating point post-processing unit 15 may scale by a power-of-two factor; 2. multiplication by a fixed (floating point) coefficient, which can be realized by floating point multiplication in the floating point post-processing unit 15; 3. addition of a fixed (floating point) offset, which can be realized by floating point addition in the floating point post-processing unit 15; 4. multiply-and-add with a variable, which the floating point post-processing unit 15 can realize by multiplying with a variable (half-precision floating point) stored in the register file. A behavioural sketch of these four modes is given below.

It can be seen that the pipeline architecture of the neural network hardware accelerator in the embodiment of the present application includes an instruction acquisition module, an instruction decoding module, a half-precision floating point operation module, an activation function calculation module, a floating point post-processing unit, a cache module, register files and a write-back stage module. The half-precision floating point operation module performs one-dimensional vector operations in the neural network algorithm; the activation function calculation module calculates activation functions in the neural network algorithm by means of a lookup table; the floating point post-processing unit performs floating point operations on data output by the activation function calculation module during execution of the neural network algorithm; the cache module caches intermediate data generated during execution of the neural network algorithm; and the register files, located in the pipeline at the same stage as the instruction decoding module, temporarily store relevant instructions, data and addresses used during execution of the neural network algorithm. Through this pipeline architecture design, the hardware accelerator better matches the operational characteristics of RNNs, and the pipeline relies on hardware scheduling as much as possible, avoiding software intervention. The present application can therefore greatly improve the utilization of the accelerator's hardware resources when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm, while also improving the utilization of each unit of hardware resource.
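The following is a minimal behavioural sketch of the four post-processing modes enumerated above, written in Python for illustration only; it is not the patent's hardware implementation, and the names `post_process` and `mode` and the parameter set are hypothetical.

```python
# Illustrative model (not the patent's RTL) of the four post-processing modes
# of the floating point post-processing unit described above.
import numpy as np

def post_process(y, mode, k=0, coeff=np.float16(1.0), offset=np.float16(0.0), var=np.float16(1.0)):
    y = np.float16(y)
    if mode == "scale_pow2":      # 1. scale up/down by a power-of-two factor 2**k
        return np.float16(y * np.float16(2.0 ** k))
    if mode == "mul_const":       # 2. multiply by a fixed floating point coefficient
        return np.float16(y * coeff)
    if mode == "add_const":       # 3. add a fixed floating point offset
        return np.float16(y + offset)
    if mode == "mul_add_var":     # 4. multiply-add with a half-precision variable from the register file
        return np.float16(y * var + offset)
    return y
```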
On the basis of the above embodiments, the embodiments of the present application further explain and optimize the technical solutions. Specifically, the method comprises the following steps:
in this embodiment, the register file 17 may specifically include:
a vector register group; an address register group used, in cooperation with adders, to assist the instruction decoding module 12 in addressing and computing addresses; a general register group used for storing non-vector calculation results; a functional block register group used for serving the half-precision floating-point operation module 13 and the activation function calculation module 14 so as to reduce pipeline latency; and a loop register group used for serving loop instructions and jump instructions.
It should be noted that in an ordinary RISC processor, data dependencies are typically resolved by bypass circuits and pipeline stalls. In the neural network accelerator, however, the number of pipeline stages is large, so resolving data dependencies between earlier and later instructions through bypassing would be too costly and would limit the maximum operating frequency of the whole pipeline. Therefore, registers can be placed directly in the half-precision floating point calculation pipeline and in the activation function calculation pipeline, which partially solves the pipeline waiting problem, raises the pipeline frequency and reduces pipeline bubbles.
In addition, the instruction decoding module in this embodiment may specifically include:
the first-level decoding unit is used for completing addressing and calculation processing of addresses, updating numerical values of address registers and initiating access requests to the cache module 16;
and the second-level decoding unit is used for reading the data in the cache module 16 to a corresponding register and writing the register data to the cache module 16.
Therefore, the embodiment of the present application adopts two decoding stages, so that cache reads and memory accesses are handled within the decoding stages.
In this embodiment, the two decoding stages cooperate with a hardware architecture that needs to access Memory resources in the cache module 16 and to perform Memory read/write operations within the decoding units. Generally, when a Memory is read or written, a corresponding Memory address must be provided, and this embodiment determines the Memory address in the two decoding stages. Normally, a Memory address may be obtained as follows: a variable A stored in the register file is operated on in the manner specified by the instruction (the operations include applying a fixed offset, adding or subtracting an offset, and incrementing or decrementing an offset); a direct address is obtained from the instruction code; or a destination address is obtained by adding or subtracting the variable A stored in the register file and the direct address obtained from the instruction code. The first decoding stage therefore produces an address, which may be a Memory address or the address of a variable in the register file, and the second decoding stage is responsible for fetching the data corresponding to that address into the pipeline, providing input for the operation of the next pipeline stage. In addition, the two decoding stages in this embodiment may include a number of functional units for address maintenance, such as adders and subtractors for address offset calculation and an address register addressing unit.
It should be noted that in existing general-purpose CPUs and DPUs, the address calculation result is generally obtained in a subsequent calculation unit, and only then is the Memory resource accessed. If the pipeline architecture of the present embodiment were designed in that conventional way, pipeline efficiency would drop by about 50%. By designing two decoding stages, access to the storage resources can be completed in advance within the decoding units, so that reading a storage resource and performing the calculation can be completed in the same instruction.
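A minimal behavioural sketch of the two decoding stages described above, assuming a simplified instruction format: ID0 resolves the Memory address from an address register and the addressing mode, and ID1 fetches the operand so that the memory read and the calculation can share one instruction. The field names (`areg`, `mode`, `imm`, `step`) and data structures are assumptions for illustration, not the patent's encoding.

```python
# Illustrative two-stage decode: ID0 computes the address, ID1 reads the operand.
def id0(instr, addr_regs):
    base = addr_regs[instr["areg"]]
    mode = instr["mode"]
    if mode == "direct":                      # direct address taken from the instruction code
        addr = instr["imm"]
    elif mode == "offset":                    # base plus a fixed offset from the instruction
        addr = base + instr["imm"]
    elif mode == "post_inc":                  # incremental addressing: use base, then update the address register
        addr = base
        addr_regs[instr["areg"]] = base + instr["step"]
    else:                                     # post-decrement addressing
        addr = base
        addr_regs[instr["areg"]] = base - instr["step"]
    return addr

def id1(addr, memory):
    return memory[addr]                       # fetch the operand from the cache/Memory into the pipeline
```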
In this embodiment, the half-precision floating-point operation module 13 may specifically include:
one or more half-precision floating-point calculation sub-modules, wherein each half-precision floating-point calculation sub-module is used for implementing a one-dimensional vector multiplication with a maximum length of 64;
and the result accumulation submodule is used for accumulating the calculation results of the plurality of half-precision floating point calculation submodules according to the instruction decoding result so as to obtain the corresponding one-dimensional vector operation result.
It can be understood that, according to the different requirements of the algorithm, this embodiment may select a corresponding number of half-precision floating-point calculation sub-modules to complete the final one-dimensional vector operation. It can be seen that the half-precision floating-point operation capability of the hardware accelerator in this embodiment is scalable.
The half-precision floating-point calculation submodule may specifically include:
The floating-point multiplication unit is used for realizing 64 floating-point multiplication operations to obtain 64 floating-point multiplication results;
and the floating-point multiplication result accumulation unit is used for accumulating the 64 floating-point multiplication results to obtain corresponding one-dimensional vector multiplication calculation results.
In order to reduce the overhead of table lookup during floating point operations, the activation function calculation module in this embodiment may further include:
a look-up table configuration module, wherein the look-up table configuration module comprises a look-up table. In this embodiment, given an X-axis start point A, an X-axis end point B and a step size step, the lookup table values are Y[n] = activation_function(A + step × n).
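As an illustration of the table configuration above, the sketch below fills a lookup table according to Y[n] = activation_function(A + step × n); the chosen activation function, range and exponent t are example assumptions, not values taken from the patent.

```python
# Illustrative lookup table construction; step is a binary-exponent value 2**(-t), t > 0.
import math

def build_lut(activation_function, A, B, t):
    step = 2.0 ** (-t)
    n_entries = int((B - A) / step) + 1
    return [activation_function(A + step * n) for n in range(n_entries)]

# Example: a sigmoid table over [-4, 4] with step 2**-3 (assumed values).
sigmoid_lut = build_lut(lambda x: 1.0 / (1.0 + math.exp(-x)), A=-4.0, B=4.0, t=3)
```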
Further, the cache module may specifically include:
the execution instruction cache unit is used for caching the instruction parameters of the execution instruction;
an input data caching unit for caching input data;
the weight vector caching unit is used for caching weight vectors in the neural network algorithm;
the offset parameter caching unit is used for caching the offset parameters of the weight vector;
the temporary data caching unit is used for caching temporary non-vector data;
and the output data caching unit is used for caching the output data.
In addition, the instruction set of the hardware accelerator in this embodiment may specifically include: null instructions, vector multiply operation instructions, data transfer instructions, loop instructions, and register clear instructions.
It is to be understood that the instruction set in the present embodiment may specifically be an instruction set of a CISC-like architecture.
Referring to fig. 2, the present embodiment discloses a more specific neural network hardware accelerator. The hardware accelerator pipeline mainly comprises the following parts:
a. IF (i.e., Instruction Fetch) for fetching instructions.
b. ID (i.e., Instruction Decode), for decoding instructions; this pipeline stage contains two sub-stages, ID0 and ID1. In hardware, the register file of the entire hardware accelerator (core) resides at the instruction decoding stage, and all operations related to the registers are performed at this stage.
c. MAC64; the MAC64 is an FP16 (half-precision floating point) calculation module that implements a one-dimensional vector multiplication with a maximum length of 64 (Y = x0·w0 + x1·w1 + x2·w2 + … + x63·w63). Within the calculation pipeline, multiple MAC64 units may be combined to implement one-dimensional vector operations of greater length. At the same time, a MAC64 can produce one-dimensional vector partial results of length 16, 32 or 64; for a length-16 one-dimensional vector multiplication, 4 results can be output, thereby improving calculation parallelism. The calculation of the MAC64 is divided into several pipeline stages: the first stage implements the 64 floating-point multiplications, and the subsequent 3 stages accumulate the 64 floating-point multiplication results to obtain the result required by the whole MAC64. In a hardware implementation, this embodiment may instantiate multiple MAC64 units in a configurable manner, i.e. N×MAC64, to meet the computing bandwidth requirements of different applications. In addition, the result of the MAC64 calculation is fed to the ACC for accumulation. For most applications, the length of a one-dimensional vector operation is often not 64 but 128, 256, 1024 or longer; the ACC is a half-precision floating-point accumulation module that accumulates the results of multiple MAC64 operations according to the instruction decoding.
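The following Python sketch models the behaviour (not the hardware) of one MAC64 dot product Y = x0·w0 + … + x63·w63 in FP16, and of the ACC accumulating several MAC64 partial results for vectors longer than 64 elements; the function names are illustrative.

```python
# Behavioural model of MAC64 and ACC, for illustration only.
import numpy as np

def mac64(x, w):
    x = np.asarray(x, dtype=np.float16)[:64]
    w = np.asarray(w, dtype=np.float16)[:64]
    products = x * w                       # first pipeline stage: 64 FP16 multiplications
    return np.float16(products.sum())      # subsequent stages: accumulate the 64 products

def acc(x, w):
    # Split a longer vector (e.g. 128, 256, 1024 elements) into 64-wide chunks
    # and accumulate the MAC64 partial results in half precision.
    total = np.float16(0.0)
    for i in range(0, len(x), 64):
        total = np.float16(total + mac64(x[i:i + 64], w[i:i + 64]))
    return total
```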
d. LUTxN; in recurrent neural network algorithms there are many types of activation function. In order to support multiple activation functions simply and effectively, this embodiment calculates activation functions by means of a lookup table. The lookup table is realized by dividing the whole activation function into multiple segments, each segment corresponding to one lookup table, and the remaining values are obtained by linear interpolation of the lookup results. The step size of the lookup table is 2^(-t) (t > 0); using a binary-exponent step size greatly simplifies the hardware implementation while having little impact on the calculation precision of the activation function. The lookup table (LUT) calculation may specifically include the following steps. The minimum of each table's expression range (X axis) is compared with the input X to determine which lookup table the input X falls into; assume it falls into the table LUTx. Then X − LUTxmin is calculated, where LUTxmin is the minimum item (X axis) of the LUTx table, and divided by the step size (Step) of the lookup table on the X axis; the integer part of the result is the entry index of the lookup table, i.e. the i-th entry, and the fractional part of the division gives the ratio. Since the step size is 2^(-t), this can be done by shifting rather than by floating-point division, thereby avoiding floating-point division as well as floating-point addition and subtraction operations. After the entry index is obtained, the result is computed from the formula Y = LUTx[i] × ratio + LUTx[i+1] × (1 − ratio), and the calculation result is then floating-point normalized to obtain the final table lookup result. This formula can be implemented with two half-precision floating-point multipliers and one half-precision floating-point addition module. This table lookup design greatly reduces the number of floating point operation units and simplifies the table-lookup floating point operations, and because a lookup module is used, precision and hardware overhead can be balanced by configuring the number of lookup entries and tables of different sizes. The lookup table process is illustrated in FIG. 3. In FIG. 3, the first row shows the contents of a LUT lookup table, including Step, i.e. the X-axis width of each cell, which is defined as an exponential power of 2 to simplify calculation; LUT0Min is the start point of the LUT X axis and LUT0Max is its end point, so (LUTXMin, LUTXMax) is the table's range. After a value x is obtained, the function result is obtained by table lookup: (x − LUTXMin)/(LUTXMax − LUTXMin) determines which segment x falls on, where the number of steps in the segment is (LUTXMax − LUTXMin)/Step. Once the ratio is obtained, the result is calculated by linear interpolation using Y = LUT[i] × ratio + LUT[i+1] × (1 − ratio). In addition, the lookup table supports multi-segment lookup combined with linear fitting; linear fitting is generally used at the beginning or end of the function curve, where the slope changes little and a lookup table is not needed, depending on the specific use environment.
The obtained Y then undergoes FP16 normalization and activation function post-processing, which completes the calculation process of the whole activation function.
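A behavioural sketch of the lookup procedure above. Because the translated text is ambiguous about the exact definition of ratio, the sketch defines ratio as the complement of the fractional part so that Y = LUT[i] × ratio + LUT[i+1] × (1 − ratio) interpolates correctly; the division by the step 2^(-t) is written as a multiplication by 2^t, which corresponds to the shift used in hardware. All names are illustrative.

```python
# Illustrative LUT lookup with linear interpolation; assumes x lies within the table range.
def lut_lookup(x, lut, lut_min, t):
    scaled = (x - lut_min) * (2 ** t)   # divide by step = 2**(-t); realized as a shift in hardware
    i = int(scaled)                     # integer part: entry index into the table
    frac = scaled - i                   # fractional position within the segment
    ratio = 1.0 - frac                  # assumed definition so the formula below interpolates correctly
    return lut[i] * ratio + lut[i + 1] * (1.0 - ratio)
```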
e. FALU; a floating-point ALU unit supporting floating point multiplication, addition, subtraction, floating point precision truncation and similar operations. The purpose of the FALU is to handle the more varied calculations in RNN network algorithms, where many calculations require further data operations after the activation function; the FALU (1/2/4/8 lanes) is added at the end of the pipeline calculation units, and the data produced by the activation function pipeline unit is post-processed according to the requirements of the algorithm.
f. The memory interface of the pipeline; a recurrent neural network algorithm needs many weights and a certain amount of intermediate data caching, and when there is much intermediate data it can be stored in fast SRAM. The SRAM is divided into several parts:
Ins-Ram, the instruction Memory, stores execution instructions;
Input-Ram, the input Memory, stores input data;
Weight-Ram, the weight Memory, stores the weight vector parameters of the neural network algorithm;
the bias parameters of the weight vectors in the neural network algorithm and some temporary non-vector data are also stored;
Output-Ram, the output Memory, stores output data.
g. Register File; the register file is located in the pipeline. It contains the register definitions of many parts of the hardware accelerator, including the vector register group, the general register group and the address register group, as well as the MAC intermediate value registers in the last sub-stage of the MAC calculation pipeline (including the ACC) and similar registers in the LUT; the register groups designed inside this class of functional blocks are called functional block register groups. The register file in this embodiment may specifically include:
vector register group: the length of a vector register is consistent with the maximum vector length of the MAC operation; vector register groups can be configured as needed, generally at least 4, and fewer vector registers save more resources;
address register group: each address register can be paired with an adder to perform address addressing and address arithmetic. Unlike an ordinary RISC processor, which uses the ALU to calculate addresses, in the neural network hardware accelerator of this embodiment each address register is provided with an adder to calculate addresses, and the address registers are 16 bits wide;
general register group: in neural network algorithms such as RNN/LSTM/GRN there are many gate calculations, and the results of these gate calculations are only intermediate results, so general registers can be provided for storing such non-vector calculation results;
component register group: the registers in the MAC and in the LUT are provided to reduce the latency of data-dependent instructions. In a conventional RISC processor, data dependencies are typically resolved by bypass circuits and pipeline stalls; in the neural network accelerator, because there are too many pipeline stages, resolving data dependencies between earlier and later instructions through bypassing is too costly and limits the maximum operating frequency of the whole pipeline. This embodiment therefore partially solves the pipeline waiting problem by placing registers directly in the MAC and LUT pipelines. In addition, pipeline waiting is also addressed by appropriately unrolling instructions in parallel, since loop unrolling at the algorithm level is relatively easy for RNN/LSTM/GRN and other recurrent neural networks and their variants;
loop register group: a hidden loop register group serves loop instructions and jump instructions.
Referring to FIG. 4, FIG. 4 shows a specific register file. In FIG. 4, VERGx are the vector registers, 128 × 16; Adr0 is an address register dedicated to address access; ix, fx, kx and tx are general registers, i.e. ordinary half-precision floating-point registers used to store pipeline calculation results or intermediate calculation results. MacRx are register resources designed at the last stage of the MAC calculation pipeline, mainly intended to reduce register access waiting: without these registers, the result of a MAC calculation would have to pass through the subsequent activation function calculation module (LUT), the floating point post-processing unit and the write-back stage module of the whole pipeline before being written back to the register file, which would invisibly increase the data waiting time; if instead the result is kept in the MAC pipeline, it can be used by subsequent instructions immediately after it is calculated. LUTRx are register resources designed at the last stage of the LUT calculation pipeline, likewise intended to reduce register access waiting and improve pipeline efficiency.
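A schematic data model of the register file of FIG. 4 is sketched below; the register widths follow the description above, while the register counts and field names are assumptions made only for illustration.

```python
# Illustrative register file layout (counts are assumed, not taken from the patent).
import numpy as np

class RegisterFile:
    def __init__(self, n_vreg=4, vlen=128):
        self.vreg  = np.zeros((n_vreg, vlen), dtype=np.float16)  # VERGx: vector registers, 128 x 16
        self.adr   = np.zeros(8, dtype=np.uint16)                # Adrx: 16-bit address registers, each paired with an adder
        self.gpr   = {}                                          # ix/fx/kx/tx: general half-precision registers
        self.mac_r = np.zeros(4, dtype=np.float16)               # MacRx: registers at the last MAC pipeline stage
        self.lut_r = np.zeros(4, dtype=np.float16)               # LUTRx: registers at the last LUT pipeline stage
```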
h. Write Back Stage; similar to the write-back stage in a RISC processor, it writes results back to the register file. The difference here is that the neural network hardware accelerator uses Memory as an intermediate storage module rather than only a register file, so the write-back stage also takes on the task of writing back the part of the intermediate data that needs to be written to Memory; at the same time, part of the output data is also written to the output Memory at the write-back stage.
In addition, the instruction set in the present embodiment may specifically include the following five types of instruction: the null instruction NULL, the vector multiply operation instruction VECM, the data transfer instruction DTRN, the loop instruction LOOP, and the register clear instruction CLER.
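For illustration, the five instruction types can be expressed as a simple enumeration; only the mnemonics come from the text, and the numeric encodings are hypothetical.

```python
# Illustrative enumeration of the five instruction types listed above.
from enum import Enum

class Opcode(Enum):
    NULL = 0   # null instruction
    VECM = 1   # vector multiply operation
    DTRN = 2   # data transfer
    LOOP = 3   # loop
    CLER = 4   # register clear
```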
In summary, in this embodiment the basic calculations of various time-series neural network algorithms are decomposed into different types of elementary operations, and different combinations of instructions are then used to compose these atomic calculations. Maximizing calculation bandwidth is taken as the design and optimization goal: pipeline bubbles are reduced as much as possible, execution efficiency along the pipeline is maximized, hardware calculation resources are used to the fullest, and recurrent neural network algorithms and their variants are supported with a small hardware overhead.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a neural network hardware accelerator. A specific example has been used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A neural network hardware accelerator, wherein a pipeline architecture of the hardware accelerator comprises:
the instruction acquisition module is used for acquiring an instruction;
the instruction decoding module is used for carrying out instruction decoding operation;
the half-precision floating point operation module is used for performing one-dimensional vector operation in a neural network algorithm to obtain a corresponding one-dimensional vector operation result;
the activation function calculation module is used for calculating the activation function in the neural network algorithm in a lookup table mode to obtain a corresponding activation function calculation result;
the floating point post-processing unit is used for performing floating point operations on data output by the activation function calculation module during execution of the neural network algorithm;
the cache module is used for caching the intermediate data in the implementation process of the neural network algorithm;
the register files distributed in the same stage of the assembly line as the instruction decoding module are used for temporarily storing relevant instructions, data and addresses in the implementation process of the neural network algorithm;
and the write-back level module is used for performing data write-back operation.
2. The neural network hardware accelerator of claim 1, wherein the register file comprises:
the device comprises a vector register group, an address register group used for being matched with an adder to assist the instruction decoding module in completing address addressing and calculation processing of addresses, a common register group used for storing non-vector calculation results, a functional unit register group used for providing services for the semi-precision floating-point operation module and the activation function calculation module to reduce pipeline waiting time, and a loop register group used for providing services for loop instructions and jump instructions.
3. The neural network hardware accelerator of claim 1, wherein the instruction decode module comprises:
the first-level decoding unit is used for completing addressing and calculation processing of addresses, updating numerical values of address registers and initiating access requests to the cache module;
and the second-level decoding unit is used for reading the data in the cache module to a corresponding register and writing the register data to the cache module.
4. The neural network hardware accelerator of claim 1, wherein the half-precision floating-point arithmetic module comprises:
one or more half-precision floating-point calculation sub-modules, wherein each half-precision floating-point calculation sub-module is used for implementing a one-dimensional vector multiplication with a maximum length of 64;
and the result accumulation submodule is used for accumulating the calculation results of the plurality of half-precision floating point calculation submodules according to the instruction decoding result so as to obtain the corresponding one-dimensional vector operation result.
5. The neural network hardware accelerator of claim 4, wherein the half-precision floating-point computation submodule comprises:
The floating-point multiplication unit is used for realizing 64 floating-point multiplication operations to obtain 64 floating-point multiplication results;
and the floating-point multiplication result accumulation unit is used for accumulating the 64 floating-point multiplication results to obtain corresponding one-dimensional vector multiplication calculation results.
6. The neural network hardware accelerator of claim 1, wherein the activation function computation module further comprises:
a look-up table configuration module; wherein the look-up table configuration module comprises a look-up table.
7. The neural network hardware accelerator of claim 1, wherein the cache module comprises:
the execution instruction cache unit is used for caching the instruction parameters of the execution instruction;
an input data caching unit for caching input data;
the weight vector caching unit is used for caching weight vectors in the neural network algorithm;
the offset parameter caching unit is used for caching the offset parameters of the weight vector;
the temporary data caching unit is used for caching temporary non-vector data;
and the output data caching unit is used for caching the output data.
8. The neural network hardware accelerator of claim 1, wherein the instruction set of the hardware accelerator comprises:
null instructions, vector multiply operation instructions, data transfer instructions, loop instructions, and register clear instructions.
9. The neural network hardware accelerator of any one of claims 1 to 8, wherein the neural network algorithm is an RNN neural network algorithm.
10. The neural network hardware accelerator of claim 9, wherein the RNN neural network algorithm comprises an LSTM algorithm, a GRN algorithm, or a bi-directional LSTM algorithm.
CN201910386225.8A 2019-05-09 2019-05-09 Neural network hardware accelerator Active CN111915003B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910386225.8A CN111915003B (en) 2019-05-09 2019-05-09 Neural network hardware accelerator
PCT/CN2020/087999 WO2020224516A1 (en) 2019-05-09 2020-04-30 Neural network hardware accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910386225.8A CN111915003B (en) 2019-05-09 2019-05-09 Neural network hardware accelerator

Publications (2)

Publication Number Publication Date
CN111915003A true CN111915003A (en) 2020-11-10
CN111915003B CN111915003B (en) 2024-03-22

Family

ID=73051454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910386225.8A Active CN111915003B (en) 2019-05-09 2019-05-09 Neural network hardware accelerator

Country Status (2)

Country Link
CN (1) CN111915003B (en)
WO (1) WO2020224516A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651496A (en) * 2020-12-30 2021-04-13 深圳大普微电子科技有限公司 Hardware circuit and chip for processing activation function
CN112712168A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for realizing high-efficiency calculation of neural network
CN113157638A (en) * 2021-01-27 2021-07-23 浙江大学 Low-power-consumption in-memory calculation processor and processing operation method
CN113313244A (en) * 2021-06-17 2021-08-27 东南大学 Near-storage neural network accelerator facing to addition network and acceleration method thereof
CN113469349A (en) * 2021-07-02 2021-10-01 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
TWI792665B (en) * 2021-01-21 2023-02-11 創惟科技股份有限公司 Ai algorithm operation accelerator and method thereof, computing system and non-transitory computer readable media

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220630B (en) * 2021-05-19 2024-05-10 西安交通大学 Reconfigurable array optimization method and automatic optimization method for hardware accelerator
CN114969446B (en) * 2022-06-02 2023-05-05 中国人民解放军战略支援部队信息工程大学 Grouping hybrid precision configuration scheme searching method based on sensitivity model
CN116863490B (en) * 2023-09-04 2023-12-12 之江实验室 Digital identification method and hardware accelerator for FeFET memory array

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN108008948A (en) * 2016-11-30 2018-05-08 上海寒武纪信息科技有限公司 A kind of multiplexer and method, processing unit for instructing generating process
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108351778A (en) * 2015-12-20 2018-07-31 英特尔公司 Instruction for detecting floating-point cancellation effect and logic
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN109427033A (en) * 2017-08-22 2019-03-05 英特尔公司 For realizing the efficient memory layout of intelligent data compression under machine learning environment
CN109726822A (en) * 2018-12-14 2019-05-07 北京中科寒武纪科技有限公司 Operation method, device and Related product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)
CN108446761B (en) * 2018-03-23 2021-07-20 中国科学院计算技术研究所 Neural network accelerator and data processing method
CN108763159A (en) * 2018-05-22 2018-11-06 中国科学技术大学苏州研究院 To arithmetic accelerator before a kind of LSTM based on FPGA

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108351778A (en) * 2015-12-20 2018-07-31 英特尔公司 Instruction for detecting floating-point cancellation effect and logic
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN108008948A (en) * 2016-11-30 2018-05-08 上海寒武纪信息科技有限公司 A kind of multiplexer and method, processing unit for instructing generating process
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN109427033A (en) * 2017-08-22 2019-03-05 英特尔公司 For realizing the efficient memory layout of intelligent data compression under machine learning environment
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN109726822A (en) * 2018-12-14 2019-05-07 北京中科寒武纪科技有限公司 Operation method, device and Related product

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651496A (en) * 2020-12-30 2021-04-13 深圳大普微电子科技有限公司 Hardware circuit and chip for processing activation function
CN112712168A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for realizing high-efficiency calculation of neural network
TWI792665B (en) * 2021-01-21 2023-02-11 創惟科技股份有限公司 Ai algorithm operation accelerator and method thereof, computing system and non-transitory computer readable media
CN113157638A (en) * 2021-01-27 2021-07-23 浙江大学 Low-power-consumption in-memory calculation processor and processing operation method
CN113157638B (en) * 2021-01-27 2022-06-21 浙江大学 Low-power-consumption in-memory calculation processor and processing operation method
CN113313244A (en) * 2021-06-17 2021-08-27 东南大学 Near-storage neural network accelerator facing to addition network and acceleration method thereof
CN113313244B (en) * 2021-06-17 2024-04-09 东南大学 Near-storage neural network accelerator for addition network and acceleration method thereof
CN113469349A (en) * 2021-07-02 2021-10-01 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN113469349B (en) * 2021-07-02 2022-11-08 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
CN115599442B (en) * 2022-12-14 2023-03-10 成都登临科技有限公司 AI chip, electronic equipment and tensor processing method

Also Published As

Publication number Publication date
CN111915003B (en) 2024-03-22
WO2020224516A1 (en) 2020-11-12

Similar Documents

Publication Publication Date Title
CN111915003A (en) Neural network hardware accelerator
US20210264273A1 (en) Neural network processor
KR20210082058A (en) Configurable processor element arrays for implementing convolutional neural networks
US20190095175A1 (en) Arithmetic processing device and arithmetic processing method
JPWO2006112045A1 (en) Arithmetic processing unit
US20140351566A1 (en) Moving average processing in processor and processor
US20190235834A1 (en) Optimization apparatus and control method thereof
CN111859277B (en) Sparse matrix vector multiplication vectorization implementation method
WO2022142479A1 (en) Hardware accelerator, data processing method, system-level chip, and medium
CN115526301A (en) Method and apparatus for loading data within a machine learning accelerator
CN113366462A (en) Vector processor with first and multi-channel configuration
CN112540946A (en) Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
US11551087B2 (en) Information processor, information processing method, and storage medium
EP3671432B1 (en) Arithmetic processing device and method of controlling arithmetic processing device
CN112559954A (en) FFT algorithm processing method and device based on software-defined reconfigurable processor
US11836492B2 (en) Extended pointer register for configuring execution of a store and pack instruction and a load and unpack instruction
US20130262819A1 (en) Single cycle compare and select operations
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
WO2023146519A1 (en) Parallel decode instruction set computer architecture with variable-length instructions
CN114330669A (en) Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
CN116382782A (en) Vector operation method, vector operator, electronic device, and storage medium
CN112232496A (en) Method, system, equipment and medium for processing int4 data type based on Tenscorore
WO2022126630A1 (en) Reconfigurable processor and method for computing multiple neural network activation functions thereon
EP4345600A1 (en) Multiplication hardware block with adaptive fidelity control system
US20220222251A1 (en) Semiconducor device for computing non-linear function using a look-up table

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 room 3501, venture capital building, No. 9, Tengfei Road, Longgang District, Shenzhen, Guangdong Province

Applicant after: SHENZHEN DAPU MICROELECTRONICS Co.,Ltd.

Address before: 518000 Guangdong province Shenzhen Longgang District Bantian Street five and Avenue North 4012 Yuan Zheng Industrial Park.

Applicant before: SHENZHEN DAPU MICROELECTRONICS Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant