CN111915003A - Neural network hardware accelerator - Google Patents

Neural network hardware accelerator

Info

Publication number
CN111915003A
Authority
CN
China
Prior art keywords
neural network
module
hardware accelerator
instruction
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910386225.8A
Other languages
Chinese (zh)
Other versions
CN111915003B (en)
Inventor
李文江
黄运新
冯涛
徐斌
王岩
李卫军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dapu Microelectronics Co Ltd
Original Assignee
Shenzhen Dapu Microelectronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dapu Microelectronics Co Ltd filed Critical Shenzhen Dapu Microelectronics Co Ltd
Priority to CN201910386225.8A priority Critical patent/CN111915003B/en
Priority to PCT/CN2020/087999 priority patent/WO2020224516A1/en
Publication of CN111915003A publication Critical patent/CN111915003A/en
Application granted granted Critical
Publication of CN111915003B publication Critical patent/CN111915003B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30141Implementation provisions of register files, e.g. ports
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The application discloses a neural network hardware accelerator. The pipeline architecture of the hardware accelerator includes: an instruction acquisition module for acquiring instructions; an instruction decoding module for performing instruction decoding operations; a half-precision floating point operation module for performing one-dimensional vector operations; an activation function calculation module for calculating activation functions by means of a lookup table; a floating point post-processing unit for performing floating point operations on data produced after the activation function calculation; a cache module for caching intermediate data generated while the neural network algorithm is executed; and register files located at the same pipeline stage as the instruction decoding module, for temporarily storing relevant instructions, data and addresses used while the neural network algorithm is executed. The present application can greatly improve the hardware resource utilization of the hardware accelerator when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm.

Description

Neural network hardware accelerator
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a neural network hardware accelerator.
Background
Current hardware accelerators for neural networks include Google's TPU, NVIDIA's NVDLA, Cambricon, and so on. Mainstream neural network hardware accelerators perform a large amount of calculation optimization for CNN (Convolutional Neural Network) networks, with targeted hardware optimization of convolution operations and of convolution kernels of different sizes.
It can be seen that the architecture of current neural network hardware accelerators is biased towards optimization for CNN, which is indeed the most computation-intensive part of neural network algorithms. Although these neural network hardware accelerators can also be used for RNN (Recurrent Neural Network) networks, they contain very little calculation optimization for RNNs, and their calculation efficiency is poor. That is, the same hardware wastes few resources when computing a CNN but wastes many resources when computing an RNN. Wasted resources translate into relatively poor power consumption efficiency, so such a neural network hardware accelerator architecture has no advantage in power efficiency or cost.
Disclosure of Invention
In view of this, an object of the present application is to provide a neural network hardware accelerator that can greatly improve the utilization of the accelerator's hardware resources when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm. The specific scheme is as follows:
a neural network hardware accelerator, a pipeline architecture of the hardware accelerator comprising:
the instruction acquisition module is used for acquiring an instruction;
the instruction decoding module is used for carrying out instruction decoding operation;
the half-precision floating point operation module is used for performing one-dimensional vector operation in a neural network algorithm to obtain a corresponding one-dimensional vector operation result;
the activation function calculation module is used for calculating the activation function in the neural network algorithm in a lookup table mode to obtain a corresponding activation function calculation result;
the floating point post-processing unit is used for performing floating point operations on data output by the activation function calculation module during execution of the neural network algorithm;
the cache module is used for caching the intermediate data in the implementation process of the neural network algorithm;
the register files distributed in the same stage of the assembly line as the instruction decoding module are used for temporarily storing relevant instructions, data and addresses in the implementation process of the neural network algorithm;
and the write-back level module is used for performing data write-back operation.
Optionally, the register file includes:
the device comprises a vector register group, an address register group used for being matched with an adder to assist the instruction decoding module in completing address addressing and calculation processing of addresses, a common register group used for storing non-vector calculation results, a functional unit register group used for providing services for the semi-precision floating-point operation module and the activation function calculation module to reduce pipeline waiting time, and a loop register group used for providing services for loop instructions and jump instructions.
Optionally, the instruction decoding module includes:
the first-level decoding unit is used for completing addressing and calculation processing of addresses, updating numerical values of address registers and initiating access requests to the cache module;
and the second-level decoding unit is used for reading the data in the cache module to a corresponding register and writing the register data to the cache module.
Optionally, the half-precision floating-point operation module includes:
one or more half-precision floating-point calculation sub-modules, wherein each half-precision floating-point calculation sub-module is used for implementing a one-dimensional vector multiplication with a maximum length of 64;
and the result accumulation submodule is used for accumulating the calculation results of the plurality of half-precision floating point calculation submodules according to the instruction decoding result so as to obtain the corresponding one-dimensional vector operation result.
Optionally, the half-precision floating-point calculation submodule includes:
The floating-point multiplication unit is used for realizing 64 floating-point multiplication operations to obtain 64 floating-point multiplication results;
and the floating-point multiplication result accumulation unit is used for accumulating the 64 floating-point multiplication results to obtain corresponding one-dimensional vector multiplication calculation results.
Optionally, the activation function calculation module further includes:
a look-up table configuration module; wherein the look-up table configuration module comprises a look-up table.
Optionally, the cache module includes:
the execution instruction cache unit is used for caching the instruction parameters of the execution instruction;
an input data caching unit for caching input data;
the weight vector caching unit is used for caching weight vectors in the neural network algorithm;
the offset parameter caching unit is used for caching the offset parameters of the weight vector;
the temporary data caching unit is used for caching temporary non-vector data;
and the output data caching unit is used for caching the output data.
Optionally, the instruction set of the hardware accelerator includes:
null instructions, vector multiply operation instructions, data transfer instructions, loop instructions, and register clear instructions.
Optionally, the neural network algorithm is an RNN neural network algorithm.
Optionally, the RNN neural network algorithm includes an LSTM algorithm, a GRN algorithm, or a bidirectional LSTM algorithm.
The pipeline architecture of the neural network hardware accelerator of the present application comprises an instruction acquisition module, an instruction decoding module, a half-precision floating point operation module, an activation function calculation module, a floating point post-processing unit, a cache module, register files and a write-back stage module. The half-precision floating point operation module performs one-dimensional vector operations in the neural network algorithm; the activation function calculation module calculates activation functions in the neural network algorithm by means of a lookup table; the floating point post-processing unit performs floating point operations on data output by the activation function calculation module during execution of the neural network algorithm; the cache module caches intermediate data generated during execution of the neural network algorithm; and the register files, located in the pipeline at the same stage as the instruction decoding module, temporarily store relevant instructions, data and addresses used during execution of the neural network algorithm. Through this pipeline architecture design, the hardware accelerator of the present application better matches the operational characteristics of RNNs, and the pipeline relies on hardware scheduling as much as possible, avoiding software intervention. The present application can therefore greatly improve the utilization of the accelerator's hardware resources when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm, while also improving the utilization of each unit of hardware resource.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a neural network hardware accelerator pipeline architecture according to the present disclosure;
FIG. 2 is a diagram of a specific neural network hardware accelerator pipeline architecture disclosed herein;
FIG. 3 is a schematic diagram of an activation function lookup table;
FIG. 4 is a diagram of a register file.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, the architecture of neural network hardware accelerators is biased towards optimization for CNN. Although these neural network hardware accelerators can also be used for RNN neural networks, they contain very little calculation optimization for RNNs, their calculation efficiency is poor, and many resources are wasted. Wasted resources translate into relatively poor power consumption efficiency, so such an architecture has no advantage in power efficiency or cost. The present application therefore provides a neural network hardware accelerator that can overcome the above technical problems.
Referring to fig. 1, an embodiment of the present application discloses a neural network hardware accelerator, where a pipeline architecture of the hardware accelerator includes:
an instruction obtaining module 11, configured to obtain an instruction;
an instruction decoding module 12, configured to perform an instruction decoding operation;
the half-precision floating point operation module 13 is used for performing one-dimensional vector operation in a neural network algorithm to obtain a corresponding one-dimensional vector operation result;
an activation function calculation module 14, configured to calculate an activation function in the neural network algorithm in a lookup table manner to obtain a corresponding activation function calculation result;
a floating point post-processing unit 15, configured to perform floating point operations on data output by the activation function calculation module 14 during execution of the neural network algorithm;
the cache module 16 is used for caching the intermediate data in the implementation process of the neural network algorithm;
the register files 17 distributed in the same stage of the pipeline as the instruction decoding module 12 are used for temporarily storing relevant instructions, data and addresses in the implementation process of the neural network algorithm;
and a write-back stage module 18 for performing data write-back operations.
It should be noted that the neural network algorithm in this embodiment mainly refers to an RNN neural network algorithm, and may specifically include, but is not limited to, an LSTM (Long Short-Term Memory) algorithm, a GRN (Generalized Regression Neural) algorithm, or a bidirectional LSTM algorithm.
In addition, it should be noted that, considering that the calculations in an RNN neural network algorithm vary widely and that many calculations require further data operations after the activation function, a floating point post-processing unit 15 is added to the neural network hardware accelerator of this embodiment to complete post-processing after the activation function has been calculated. The post-processing may include:
1. scaling up or down by a given factor, where the floating point post-processing unit 15 may scale by a power-of-two factor; 2. multiplication by a fixed (floating point) coefficient, which can be realized by floating point multiplication in the floating point post-processing unit 15; 3. addition of a fixed (floating point) offset, which can be realized by floating point addition in the floating point post-processing unit 15; 4. multiply-and-add with a variable, which the floating point post-processing unit 15 can realize by multiplying with a variable (half-precision floating point) stored in the register file. A behavioural sketch of these four modes is given below.

It can be seen that the pipeline architecture of the neural network hardware accelerator in the embodiment of the present application includes an instruction acquisition module, an instruction decoding module, a half-precision floating point operation module, an activation function calculation module, a floating point post-processing unit, a cache module, register files and a write-back stage module. The half-precision floating point operation module performs one-dimensional vector operations in the neural network algorithm; the activation function calculation module calculates activation functions in the neural network algorithm by means of a lookup table; the floating point post-processing unit performs floating point operations on data output by the activation function calculation module during execution of the neural network algorithm; the cache module caches intermediate data generated during execution of the neural network algorithm; and the register files, located in the pipeline at the same stage as the instruction decoding module, temporarily store relevant instructions, data and addresses used during execution of the neural network algorithm. Through this pipeline architecture design, the hardware accelerator better matches the operational characteristics of RNNs, and the pipeline relies on hardware scheduling as much as possible, avoiding software intervention. The present application can therefore greatly improve the utilization of the accelerator's hardware resources when implementing an RNN neural network algorithm, thereby improving the power consumption efficiency per unit of computation per unit of time when the hardware accelerator runs the RNN neural network algorithm, while also improving the utilization of each unit of hardware resource.
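The following is a minimal behavioural sketch of the four post-processing modes enumerated above, written in Python for illustration only; it is not the patent's hardware implementation, and the names `post_process` and `mode` and the parameter set are hypothetical.

```python
# Illustrative model (not the patent's RTL) of the four post-processing modes
# of the floating point post-processing unit described above.
import numpy as np

def post_process(y, mode, k=0, coeff=np.float16(1.0), offset=np.float16(0.0), var=np.float16(1.0)):
    y = np.float16(y)
    if mode == "scale_pow2":      # 1. scale up/down by a power-of-two factor 2**k
        return np.float16(y * np.float16(2.0 ** k))
    if mode == "mul_const":       # 2. multiply by a fixed floating point coefficient
        return np.float16(y * coeff)
    if mode == "add_const":       # 3. add a fixed floating point offset
        return np.float16(y + offset)
    if mode == "mul_add_var":     # 4. multiply-add with a half-precision variable from the register file
        return np.float16(y * var + offset)
    return y
```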
On the basis of the above embodiments, the embodiments of the present application further explain and optimize the technical solutions. Specifically, the method comprises the following steps:
in this embodiment, the register file 17 may specifically include:
a vector register group; an address register group used, in cooperation with adders, to assist the instruction decoding module 12 in addressing and computing addresses; a general register group used for storing non-vector calculation results; a functional block register group used for serving the half-precision floating-point operation module 13 and the activation function calculation module 14 so as to reduce pipeline latency; and a loop register group used for serving loop instructions and jump instructions.
It should be noted that in an ordinary RISC processor, data dependencies are typically resolved by bypass circuits and pipeline stalls. In the neural network accelerator, however, the number of pipeline stages is large, so resolving data dependencies between earlier and later instructions through bypassing would be too costly and would limit the maximum operating frequency of the whole pipeline. Therefore, registers can be placed directly in the half-precision floating point calculation pipeline and in the activation function calculation pipeline, which partially solves the pipeline waiting problem, raises the pipeline frequency and reduces pipeline bubbles.
In addition, the instruction decoding module in this embodiment may specifically include:
the first-level decoding unit is used for completing addressing and calculation processing of addresses, updating numerical values of address registers and initiating access requests to the cache module 16;
and the second-level decoding unit is used for reading the data in the cache module 16 to a corresponding register and writing the register data to the cache module 16.
Therefore, the embodiment of the present application adopts two decoding stages, so that cache reads and memory accesses are handled within the decoding stages.
In this embodiment, the two decoding stages cooperate with a hardware architecture that needs to access Memory resources in the cache module 16 and to perform Memory read/write operations within the decoding units. Generally, when a Memory is read or written, a corresponding Memory address must be provided, and this embodiment determines the Memory address in the two decoding stages. Normally, a Memory address may be obtained as follows: a variable A stored in the register file is operated on in the manner specified by the instruction (the operations include applying a fixed offset, adding or subtracting an offset, and incrementing or decrementing an offset); a direct address is obtained from the instruction code; or a destination address is obtained by adding or subtracting the variable A stored in the register file and the direct address obtained from the instruction code. The first decoding stage therefore produces an address, which may be a Memory address or the address of a variable in the register file, and the second decoding stage is responsible for fetching the data corresponding to that address into the pipeline, providing input for the operation of the next pipeline stage. In addition, the two decoding stages in this embodiment may include a number of functional units for address maintenance, such as adders and subtractors for address offset calculation and an address register addressing unit.
It should be noted that in existing general-purpose CPUs and DPUs, the address calculation result is generally obtained in a subsequent calculation unit, and only then is the Memory resource accessed. If the pipeline architecture of the present embodiment were designed in that conventional way, pipeline efficiency would drop by about 50%. By designing two decoding stages, access to the storage resources can be completed in advance within the decoding units, so that reading a storage resource and performing the calculation can be completed in the same instruction.
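A minimal behavioural sketch of the two decoding stages described above, assuming a simplified instruction format: ID0 resolves the Memory address from an address register and the addressing mode, and ID1 fetches the operand so that the memory read and the calculation can share one instruction. The field names (`areg`, `mode`, `imm`, `step`) and data structures are assumptions for illustration, not the patent's encoding.

```python
# Illustrative two-stage decode: ID0 computes the address, ID1 reads the operand.
def id0(instr, addr_regs):
    base = addr_regs[instr["areg"]]
    mode = instr["mode"]
    if mode == "direct":                      # direct address taken from the instruction code
        addr = instr["imm"]
    elif mode == "offset":                    # base plus a fixed offset from the instruction
        addr = base + instr["imm"]
    elif mode == "post_inc":                  # incremental addressing: use base, then update the address register
        addr = base
        addr_regs[instr["areg"]] = base + instr["step"]
    else:                                     # post-decrement addressing
        addr = base
        addr_regs[instr["areg"]] = base - instr["step"]
    return addr

def id1(addr, memory):
    return memory[addr]                       # fetch the operand from the cache/Memory into the pipeline
```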
In this embodiment, the half-precision floating-point operation module 13 may specifically include:
one or more half-precision floating-point calculation sub-modules, wherein each half-precision floating-point calculation sub-module is used for implementing a one-dimensional vector multiplication with a maximum length of 64;
and the result accumulation submodule is used for accumulating the calculation results of the plurality of half-precision floating point calculation submodules according to the instruction decoding result so as to obtain the corresponding one-dimensional vector operation result.
It can be understood that, according to the different requirements of the algorithm, this embodiment may select a corresponding number of half-precision floating-point calculation sub-modules to complete the final one-dimensional vector operation. It can be seen that the half-precision floating-point operation capability of the hardware accelerator in this embodiment is scalable.
The half-precision floating-point calculation submodule may specifically include:
The floating-point multiplication unit is used for realizing 64 floating-point multiplication operations to obtain 64 floating-point multiplication results;
and the floating-point multiplication result accumulation unit is used for accumulating the 64 floating-point multiplication results to obtain corresponding one-dimensional vector multiplication calculation results.
In order to reduce the overhead of table lookup during floating point operations, the activation function calculation module in this embodiment may further include:
a look-up table configuration module, wherein the look-up table configuration module comprises a look-up table. In this embodiment, given an X-axis start point A, an X-axis end point B and a step size step, the lookup table values are Y[n] = activation_function(A + step × n).
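As an illustration of the table configuration above, the sketch below fills a lookup table according to Y[n] = activation_function(A + step × n); the chosen activation function, range and exponent t are example assumptions, not values taken from the patent.

```python
# Illustrative lookup table construction; step is a binary-exponent value 2**(-t), t > 0.
import math

def build_lut(activation_function, A, B, t):
    step = 2.0 ** (-t)
    n_entries = int((B - A) / step) + 1
    return [activation_function(A + step * n) for n in range(n_entries)]

# Example: a sigmoid table over [-4, 4] with step 2**-3 (assumed values).
sigmoid_lut = build_lut(lambda x: 1.0 / (1.0 + math.exp(-x)), A=-4.0, B=4.0, t=3)
```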
Further, the cache module may specifically include:
the execution instruction cache unit is used for caching the instruction parameters of the execution instruction;
an input data caching unit for caching input data;
the weight vector caching unit is used for caching weight vectors in the neural network algorithm;
the offset parameter caching unit is used for caching the offset parameters of the weight vector;
the temporary data caching unit is used for caching temporary non-vector data;
and the output data caching unit is used for caching the output data.
In addition, the instruction set of the hardware accelerator in this embodiment may specifically include: null instructions, vector multiply operation instructions, data transfer instructions, loop instructions, and register clear instructions.
It is to be understood that the instruction set in the present embodiment may specifically be an instruction set of a CISC-like architecture.
Referring to fig. 2, the present embodiment discloses a more specific neural network hardware accelerator. The hardware accelerator pipeline mainly comprises the following parts:
a. IF (i.e., Instruction Fetch) for fetching instructions.
b. ID (i.e., Instruction Decode), for decoding instructions; this pipeline stage contains two sub-stages, ID0 and ID1. In hardware, the register file of the entire hardware accelerator (core) resides at the instruction decoding stage, and all operations related to the registers are performed at this stage.
c. MAC64; the MAC64 is an FP16 (half-precision floating point) calculation module that implements a one-dimensional vector multiplication with a maximum length of 64 (Y = x0·w0 + x1·w1 + x2·w2 + … + x63·w63). Within the calculation pipeline, multiple MAC64 units may be combined to implement one-dimensional vector operations of greater length. At the same time, a MAC64 can produce one-dimensional vector partial results of length 16, 32 or 64; for a length-16 one-dimensional vector multiplication, 4 results can be output, thereby improving calculation parallelism. The calculation of the MAC64 is divided into several pipeline stages: the first stage implements the 64 floating-point multiplications, and the subsequent 3 stages accumulate the 64 floating-point multiplication results to obtain the result required by the whole MAC64. In a hardware implementation, this embodiment may instantiate multiple MAC64 units in a configurable manner, i.e. N×MAC64, to meet the computing bandwidth requirements of different applications. In addition, the result of the MAC64 calculation is fed to the ACC for accumulation. For most applications, the length of a one-dimensional vector operation is often not 64 but 128, 256, 1024 or longer; the ACC is a half-precision floating-point accumulation module that accumulates the results of multiple MAC64 operations according to the instruction decoding.
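The following Python sketch models the behaviour (not the hardware) of one MAC64 dot product Y = x0·w0 + … + x63·w63 in FP16, and of the ACC accumulating several MAC64 partial results for vectors longer than 64 elements; the function names are illustrative.

```python
# Behavioural model of MAC64 and ACC, for illustration only.
import numpy as np

def mac64(x, w):
    x = np.asarray(x, dtype=np.float16)[:64]
    w = np.asarray(w, dtype=np.float16)[:64]
    products = x * w                       # first pipeline stage: 64 FP16 multiplications
    return np.float16(products.sum())      # subsequent stages: accumulate the 64 products

def acc(x, w):
    # Split a longer vector (e.g. 128, 256, 1024 elements) into 64-wide chunks
    # and accumulate the MAC64 partial results in half precision.
    total = np.float16(0.0)
    for i in range(0, len(x), 64):
        total = np.float16(total + mac64(x[i:i + 64], w[i:i + 64]))
    return total
```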
d. LUTxN; in recurrent neural network algorithms there are many types of activation function. In order to support multiple activation functions simply and effectively, this embodiment calculates activation functions by means of a lookup table. The lookup table is realized by dividing the whole activation function into multiple segments, each segment corresponding to one lookup table, and the remaining values are obtained by linear interpolation of the lookup results. The step size of the lookup table is 2^(-t) (t > 0); using a binary-exponent step size greatly simplifies the hardware implementation while having little impact on the calculation precision of the activation function. The lookup table (LUT) calculation may specifically include the following steps. The minimum of each table's expression range (X axis) is compared with the input X to determine which lookup table the input X falls into; assume it falls into the table LUTx. Then X − LUTxmin is calculated, where LUTxmin is the minimum item (X axis) of the LUTx table, and divided by the step size (Step) of the lookup table on the X axis; the integer part of the result is the entry index of the lookup table, i.e. the i-th entry, and the fractional part of the division gives the ratio. Since the step size is 2^(-t), this can be done by shifting rather than by floating-point division, thereby avoiding floating-point division as well as floating-point addition and subtraction operations. After the entry index is obtained, the result is computed from the formula Y = LUTx[i] × ratio + LUTx[i+1] × (1 − ratio), and the calculation result is then floating-point normalized to obtain the final table lookup result. This formula can be implemented with two half-precision floating-point multipliers and one half-precision floating-point addition module. This table lookup design greatly reduces the number of floating point operation units and simplifies the table-lookup floating point operations, and because a lookup module is used, precision and hardware overhead can be balanced by configuring the number of lookup entries and tables of different sizes. The lookup table process is illustrated in FIG. 3. In FIG. 3, the first row shows the contents of a LUT lookup table, including Step, i.e. the X-axis width of each cell, which is defined as an exponential power of 2 to simplify calculation; LUT0Min is the start point of the LUT X axis and LUT0Max is its end point, so (LUTXMin, LUTXMax) is the table's range. After a value x is obtained, the function result is obtained by table lookup: (x − LUTXMin)/(LUTXMax − LUTXMin) determines which segment x falls on, where the number of steps in the segment is (LUTXMax − LUTXMin)/Step. Once the ratio is obtained, the result is calculated by linear interpolation using Y = LUT[i] × ratio + LUT[i+1] × (1 − ratio). In addition, the lookup table supports multi-segment lookup combined with linear fitting; linear fitting is generally used at the beginning or end of the function curve, where the slope changes little and a lookup table is not needed, depending on the specific use environment.
The obtained Y then undergoes FP16 normalization and activation function post-processing, which completes the calculation process of the whole activation function.
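A behavioural sketch of the lookup procedure above. Because the translated text is ambiguous about the exact definition of ratio, the sketch defines ratio as the complement of the fractional part so that Y = LUT[i] × ratio + LUT[i+1] × (1 − ratio) interpolates correctly; the division by the step 2^(-t) is written as a multiplication by 2^t, which corresponds to the shift used in hardware. All names are illustrative.

```python
# Illustrative LUT lookup with linear interpolation; assumes x lies within the table range.
def lut_lookup(x, lut, lut_min, t):
    scaled = (x - lut_min) * (2 ** t)   # divide by step = 2**(-t); realized as a shift in hardware
    i = int(scaled)                     # integer part: entry index into the table
    frac = scaled - i                   # fractional position within the segment
    ratio = 1.0 - frac                  # assumed definition so the formula below interpolates correctly
    return lut[i] * ratio + lut[i + 1] * (1.0 - ratio)
```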
e. FALU; a floating-point ALU unit supporting floating point multiplication, addition, subtraction, floating point precision truncation and similar operations. The purpose of the FALU is to handle the more varied calculations in RNN network algorithms, where many calculations require further data operations after the activation function; the FALU (1/2/4/8 lanes) is added at the end of the pipeline calculation units, and the data produced by the activation function pipeline unit is post-processed according to the requirements of the algorithm.
f. The memory interface of the pipeline; a recurrent neural network algorithm needs many weights and a certain amount of intermediate data caching, and when there is much intermediate data it can be stored in fast SRAM. The SRAM is divided into several parts:
Ins-Ram, the instruction Memory, stores execution instructions;
Input-Ram, the input Memory, stores input data;
Weight-Ram, the weight Memory, stores the weight vector parameters of the neural network algorithm;
the bias parameters of the weight vectors in the neural network algorithm and some temporary non-vector data are also stored;
Output-Ram, the output Memory, stores output data.
g. Register File; the register file is located in the pipeline. It contains the register definitions of many parts of the hardware accelerator, including the vector register group, the general register group and the address register group, as well as the MAC intermediate value registers in the last sub-stage of the MAC calculation pipeline (including the ACC) and similar registers in the LUT; the register groups designed inside this class of functional blocks are called functional block register groups. The register file in this embodiment may specifically include:
vector register group: the length of a vector register is consistent with the maximum vector length of the MAC operation; vector register groups can be configured as needed, generally at least 4, and fewer vector registers save more resources;
address register group: each address register can be paired with an adder to perform address addressing and address arithmetic. Unlike an ordinary RISC processor, which uses the ALU to calculate addresses, in the neural network hardware accelerator of this embodiment each address register is provided with an adder to calculate addresses, and the address registers are 16 bits wide;
general register group: in neural network algorithms such as RNN/LSTM/GRN there are many gate calculations, and the results of these gate calculations are only intermediate results, so general registers can be provided for storing such non-vector calculation results;
component register group: the registers in the MAC and in the LUT are provided to reduce the latency of data-dependent instructions. In a conventional RISC processor, data dependencies are typically resolved by bypass circuits and pipeline stalls; in the neural network accelerator, because there are too many pipeline stages, resolving data dependencies between earlier and later instructions through bypassing is too costly and limits the maximum operating frequency of the whole pipeline. This embodiment therefore partially solves the pipeline waiting problem by placing registers directly in the MAC and LUT pipelines. In addition, pipeline waiting is also addressed by appropriately unrolling instructions in parallel, since loop unrolling at the algorithm level is relatively easy for RNN/LSTM/GRN and other recurrent neural networks and their variants;
loop register group: a hidden loop register group serves loop instructions and jump instructions.
Referring to FIG. 4, FIG. 4 shows a specific register file. In FIG. 4, VERGx are the vector registers, 128 × 16; Adr0 is an address register dedicated to address access; ix, fx, kx and tx are general registers, i.e. ordinary half-precision floating-point registers used to store pipeline calculation results or intermediate calculation results. MacRx are register resources designed at the last stage of the MAC calculation pipeline, mainly intended to reduce register access waiting: without these registers, the result of a MAC calculation would have to pass through the subsequent activation function calculation module (LUT), the floating point post-processing unit and the write-back stage module of the whole pipeline before being written back to the register file, which would invisibly increase the data waiting time; if instead the result is kept in the MAC pipeline, it can be used by subsequent instructions immediately after it is calculated. LUTRx are register resources designed at the last stage of the LUT calculation pipeline, likewise intended to reduce register access waiting and improve pipeline efficiency.
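A schematic data model of the register file of FIG. 4 is sketched below; the register widths follow the description above, while the register counts and field names are assumptions made only for illustration.

```python
# Illustrative register file layout (counts are assumed, not taken from the patent).
import numpy as np

class RegisterFile:
    def __init__(self, n_vreg=4, vlen=128):
        self.vreg  = np.zeros((n_vreg, vlen), dtype=np.float16)  # VERGx: vector registers, 128 x 16
        self.adr   = np.zeros(8, dtype=np.uint16)                # Adrx: 16-bit address registers, each paired with an adder
        self.gpr   = {}                                          # ix/fx/kx/tx: general half-precision registers
        self.mac_r = np.zeros(4, dtype=np.float16)               # MacRx: registers at the last MAC pipeline stage
        self.lut_r = np.zeros(4, dtype=np.float16)               # LUTRx: registers at the last LUT pipeline stage
```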
h. Write Back Stage; similar to the write-back stage in a RISC processor, it writes results back to the register file. The difference here is that the neural network hardware accelerator uses Memory as an intermediate storage module rather than only a register file, so the write-back stage also takes on the task of writing back the part of the intermediate data that needs to be written to Memory; at the same time, part of the output data is also written to the output Memory at the write-back stage.
In addition, the instruction set in the present embodiment may specifically include the following five types of instruction: the null instruction NULL, the vector multiply operation instruction VECM, the data transfer instruction DTRN, the loop instruction LOOP, and the register clear instruction CLER.
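For illustration, the five instruction types can be expressed as a simple enumeration; only the mnemonics come from the text, and the numeric encodings are hypothetical.

```python
# Illustrative enumeration of the five instruction types listed above.
from enum import Enum

class Opcode(Enum):
    NULL = 0   # null instruction
    VECM = 1   # vector multiply operation
    DTRN = 2   # data transfer
    LOOP = 3   # loop
    CLER = 4   # register clear
```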
In summary, in this embodiment the basic calculations of various time-series neural network algorithms are decomposed into different types of elementary operations, and different combinations of instructions are then used to compose these atomic calculations. Maximizing calculation bandwidth is taken as the design and optimization goal: pipeline bubbles are reduced as much as possible, execution efficiency along the pipeline is maximized, hardware calculation resources are used to the fullest, and recurrent neural network algorithms and their variants are supported with a small hardware overhead.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The present application provides a neural network hardware accelerator. A specific example has been used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A neural network hardware accelerator, wherein a pipeline architecture of the hardware accelerator comprises:
the instruction acquisition module is used for acquiring an instruction;
the instruction decoding module is used for carrying out instruction decoding operation;
the half-precision floating point operation module is used for performing one-dimensional vector operation in a neural network algorithm to obtain a corresponding one-dimensional vector operation result;
the activation function calculation module is used for calculating the activation function in the neural network algorithm in a lookup table mode to obtain a corresponding activation function calculation result;
the floating point post-processing unit is used for performing floating point operations on data output by the activation function calculation module during execution of the neural network algorithm;
the cache module is used for caching the intermediate data in the implementation process of the neural network algorithm;
the register files distributed in the same stage of the assembly line as the instruction decoding module are used for temporarily storing relevant instructions, data and addresses in the implementation process of the neural network algorithm;
and the write-back level module is used for performing data write-back operation.
2. The neural network hardware accelerator of claim 1, wherein the register file comprises:
the device comprises a vector register group, an address register group used for being matched with an adder to assist the instruction decoding module in completing address addressing and calculation processing of addresses, a common register group used for storing non-vector calculation results, a functional unit register group used for providing services for the semi-precision floating-point operation module and the activation function calculation module to reduce pipeline waiting time, and a loop register group used for providing services for loop instructions and jump instructions.
3. The neural network hardware accelerator of claim 1, wherein the instruction decode module comprises:
the first-level decoding unit is used for completing addressing and calculation processing of addresses, updating numerical values of address registers and initiating access requests to the cache module;
and the second-level decoding unit is used for reading the data in the cache module to a corresponding register and writing the register data to the cache module.
4. The neural network hardware accelerator of claim 1, wherein the half-precision floating-point arithmetic module comprises:
one or more half-precision floating-point calculation sub-modules, wherein each half-precision floating-point calculation sub-module is used for implementing a one-dimensional vector multiplication with a maximum length of 64;
and the result accumulation submodule is used for accumulating the calculation results of the plurality of half-precision floating point calculation submodules according to the instruction decoding result so as to obtain the corresponding one-dimensional vector operation result.
5. The neural network hardware accelerator of claim 4, wherein the half-precision floating-point computation submodule comprises:
The floating-point multiplication unit is used for realizing 64 floating-point multiplication operations to obtain 64 floating-point multiplication results;
and the floating-point multiplication result accumulation unit is used for accumulating the 64 floating-point multiplication results to obtain corresponding one-dimensional vector multiplication calculation results.
6. The neural network hardware accelerator of claim 1, wherein the activation function computation module further comprises:
a look-up table configuration module; wherein the look-up table configuration module comprises a look-up table.
7. The neural network hardware accelerator of claim 1, wherein the cache module comprises:
the execution instruction cache unit is used for caching the instruction parameters of the execution instruction;
an input data caching unit for caching input data;
the weight vector caching unit is used for caching weight vectors in the neural network algorithm;
the offset parameter caching unit is used for caching the offset parameters of the weight vector;
the temporary data caching unit is used for caching temporary non-vector data;
and the output data caching unit is used for caching the output data.
8. The neural network hardware accelerator of claim 1, wherein the instruction set of the hardware accelerator comprises:
null instructions, vector multiply operation instructions, data transfer instructions, loop instructions, and register clear instructions.
9. The neural network hardware accelerator of any one of claims 1 to 8, wherein the neural network algorithm is an RNN neural network algorithm.
10. The neural network hardware accelerator of claim 9, wherein the RNN neural network algorithm comprises an LSTM algorithm, a GRN algorithm, or a bi-directional LSTM algorithm.
CN201910386225.8A 2019-05-09 2019-05-09 Neural network hardware accelerator Active CN111915003B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910386225.8A CN111915003B (en) 2019-05-09 2019-05-09 Neural network hardware accelerator
PCT/CN2020/087999 WO2020224516A1 (en) 2019-05-09 2020-04-30 Neural network hardware accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910386225.8A CN111915003B (en) 2019-05-09 2019-05-09 Neural network hardware accelerator

Publications (2)

Publication Number Publication Date
CN111915003A true CN111915003A (en) 2020-11-10
CN111915003B CN111915003B (en) 2024-03-22

Family

ID=73051454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910386225.8A Active CN111915003B (en) 2019-05-09 2019-05-09 Neural network hardware accelerator

Country Status (2)

Country Link
CN (1) CN111915003B (en)
WO (1) WO2020224516A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651496A (en) * 2020-12-30 2021-04-13 深圳大普微电子科技有限公司 Hardware circuit and chip for processing activation function
CN112712168A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for realizing high-efficiency calculation of neural network
CN113157638A (en) * 2021-01-27 2021-07-23 浙江大学 Low-power-consumption in-memory calculation processor and processing operation method
CN113313244A (en) * 2021-06-17 2021-08-27 东南大学 Near-storage neural network accelerator facing to addition network and acceleration method thereof
CN113469349A (en) * 2021-07-02 2021-10-01 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
TWI792665B (en) * 2021-01-21 2023-02-11 創惟科技股份有限公司 Ai algorithm operation accelerator and method thereof, computing system and non-transitory computer readable media

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220630B (en) * 2021-05-19 2024-05-10 西安交通大学 Reconfigurable array optimization method and automatic optimization method for hardware accelerator
CN114969446B (en) * 2022-06-02 2023-05-05 中国人民解放军战略支援部队信息工程大学 Grouping hybrid precision configuration scheme searching method based on sensitivity model
CN116863490B (en) * 2023-09-04 2023-12-12 之江实验室 Digital identification method and hardware accelerator for FeFET memory array

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN108008948A (en) * 2016-11-30 2018-05-08 上海寒武纪信息科技有限公司 A kind of multiplexer and method, processing unit for instructing generating process
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108351778A (en) * 2015-12-20 2018-07-31 英特尔公司 Instruction for detecting floating-point cancellation effect and logic
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN109427033A (en) * 2017-08-22 2019-03-05 英特尔公司 For realizing the efficient memory layout of intelligent data compression under machine learning environment
CN109726822A (en) * 2018-12-14 2019-05-07 北京中科寒武纪科技有限公司 Operation method, device and Related product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)
CN108446761B (en) * 2018-03-23 2021-07-20 中国科学院计算技术研究所 Neural network accelerator and data processing method
CN108763159A (en) * 2018-05-22 2018-11-06 中国科学技术大学苏州研究院 To arithmetic accelerator before a kind of LSTM based on FPGA

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108351778A (en) * 2015-12-20 2018-07-31 英特尔公司 Instruction for detecting floating-point cancellation effect and logic
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
CN108008948A (en) * 2016-11-30 2018-05-08 上海寒武纪信息科技有限公司 A kind of multiplexer and method, processing unit for instructing generating process
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN109388777A (en) * 2017-08-07 2019-02-26 英特尔公司 A kind of system and method for optimized Winograd convolution accelerator
CN109427033A (en) * 2017-08-22 2019-03-05 英特尔公司 For realizing the efficient memory layout of intelligent data compression under machine learning environment
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN109726822A (en) * 2018-12-14 2019-05-07 北京中科寒武纪科技有限公司 Operation method, device and Related product

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651496A (en) * 2020-12-30 2021-04-13 深圳大普微电子科技有限公司 Hardware circuit and chip for processing activation function
CN112712168A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Method and system for realizing high-efficiency calculation of neural network
TWI792665B (en) * 2021-01-21 2023-02-11 創惟科技股份有限公司 Ai algorithm operation accelerator and method thereof, computing system and non-transitory computer readable media
CN113157638A (en) * 2021-01-27 2021-07-23 浙江大学 Low-power-consumption in-memory calculation processor and processing operation method
CN113157638B (en) * 2021-01-27 2022-06-21 浙江大学 Low-power-consumption in-memory calculation processor and processing operation method
CN113313244A (en) * 2021-06-17 2021-08-27 东南大学 Near-storage neural network accelerator facing to addition network and acceleration method thereof
CN113313244B (en) * 2021-06-17 2024-04-09 东南大学 Near-storage neural network accelerator for addition network and acceleration method thereof
CN113469349A (en) * 2021-07-02 2021-10-01 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN113469349B (en) * 2021-07-02 2022-11-08 上海酷芯微电子有限公司 Multi-precision neural network model implementation method and system
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
CN115599442B (en) * 2022-12-14 2023-03-10 成都登临科技有限公司 AI chip, electronic equipment and tensor processing method

Also Published As

Publication number Publication date
CN111915003B (en) 2024-03-22
WO2020224516A1 (en) 2020-11-12

Similar Documents

Publication Publication Date Title
CN111915003A (en) Neural network hardware accelerator
US20210264273A1 (en) Neural network processor
KR20210082058A (en) Configurable processor element arrays for implementing convolutional neural networks
US20190095175A1 (en) Arithmetic processing device and arithmetic processing method
JPWO2006112045A1 (en) Arithmetic processing unit
US20140351566A1 (en) Moving average processing in processor and processor
US20190235834A1 (en) Optimization apparatus and control method thereof
CN111859277B (en) Sparse matrix vector multiplication vectorization implementation method
WO2022142479A1 (en) Hardware accelerator, data processing method, system-level chip, and medium
CN115526301A (en) Method and apparatus for loading data within a machine learning accelerator
CN113366462A (en) Vector processor with first and multi-channel configuration
CN112540946A (en) Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
US11551087B2 (en) Information processor, information processing method, and storage medium
EP3671432B1 (en) Arithmetic processing device and method of controlling arithmetic processing device
CN112559954A (en) FFT algorithm processing method and device based on software-defined reconfigurable processor
US11836492B2 (en) Extended pointer register for configuring execution of a store and pack instruction and a load and unpack instruction
US20130262819A1 (en) Single cycle compare and select operations
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
WO2023146519A1 (en) Parallel decode instruction set computer architecture with variable-length instructions
CN114330669A (en) Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
CN116382782A (en) Vector operation method, vector operator, electronic device, and storage medium
CN112232496A (en) Method, system, equipment and medium for processing int4 data type based on Tenscorore
WO2022126630A1 (en) Reconfigurable processor and method for computing multiple neural network activation functions thereon
EP4345600A1 (en) Multiplication hardware block with adaptive fidelity control system
US20220222251A1 (en) Semiconducor device for computing non-linear function using a look-up table

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 room 3501, venture capital building, No. 9, Tengfei Road, Longgang District, Shenzhen, Guangdong Province

Applicant after: SHENZHEN DAPU MICROELECTRONICS Co.,Ltd.

Address before: 518000 Guangdong province Shenzhen Longgang District Bantian Street five and Avenue North 4012 Yuan Zheng Industrial Park.

Applicant before: SHENZHEN DAPU MICROELECTRONICS Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant