CN113159309A - NAND flash memory-based low-power-consumption neural network accelerator storage architecture - Google Patents

NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Info

Publication number
CN113159309A
Authority
CN
China
Prior art keywords
matrix
cache
weight
neural network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110349392.2A
Other languages
Chinese (zh)
Other versions
CN113159309B (en)
Inventor
姜小波
邓晗珂
莫志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110349392.2A priority Critical patent/CN113159309B/en
Publication of CN113159309A publication Critical patent/CN113159309A/en
Application granted granted Critical
Publication of CN113159309B publication Critical patent/CN113159309B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, for peripheral storage systems, e.g. disk cache
    • G06F 12/0868 - Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0877 - Cache access modes
    • G06F 12/0882 - Page mode
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/20 - Employing a main memory using a specific memory technology
    • G06F 2212/202 - Non-volatile memory
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Semiconductor Memories (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a low-power-consumption neural network accelerator storage architecture based on NAND flash memory. The architecture comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache. During neural network computation, the controller reads the weight data from the off-chip NAND flash memory storage unit and loads it into the weight cache, and loads the input data into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation results of the neural network computing circuit are cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output. The architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection.

Description

NAND flash memory-based low-power-consumption neural network accelerator storage architecture
Technical Field
The invention relates to the technical field of integrated circuit design, in particular to a low-power-consumption neural network accelerator storage architecture based on a NAND flash memory.
Background
With the development of artificial intelligence and the Internet of Things, their combination is accelerating and end-side artificial intelligence is developing rapidly. At the same time, as the performance and complexity of deep learning algorithms grow, their computation needs to be mapped onto dedicated hardware architectures to accelerate the operation speed.
End-side devices also face a variety of complex application scenarios while being subject to power-consumption and cost constraints. In this context, low power consumption and low cost are basic requirements for an end-side neural network accelerator intended to run deep learning inference tasks.
Memories currently in use fall into two categories, volatile and non-volatile. Volatile memories are divided into static random access memory (SRAM) and dynamic random access memory (DRAM); among non-volatile memories, flash memory currently dominates the market. With the development of mobile devices and the Internet of Things, off-chip memory today is mainly DRAM and NAND flash. SRAM has the fastest read speed but the highest cost, so it is used only as a high-speed internal cache. The common memory architecture of existing neural network accelerators is SRAM as the on-chip cache and DRAM as the off-chip memory. DRAM has a higher read/write speed than NAND flash, but it cannot retain data without power and costs more. When an end-side device uses DRAM for off-chip storage, lost data must be re-read from the cloud, wasting a large amount of power, bandwidth and time.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide a low-power-consumption neural network accelerator storage architecture based on NAND flash memory; the architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection.
In order to achieve this purpose, the invention is realized by the following technical scheme: a low-power-consumption neural network accelerator storage architecture based on NAND flash memory, characterized in that it comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when the neural network computation is performed, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads it into the weight cache; the input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output.
Preferably, the architecture is applied to a neural network model with several module layers; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache;
when a subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache; if it is the last module layer, the calculation result of the neural network computing circuit is cached in the output cache and then output as the final result.
Preferably, the architecture is applied to a Transformer encoder neural network model; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps:
S1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position coding weight matrix, the query vector weight matrix W_Q, the key vector weight matrix W_K, the value vector weight matrix W_V, the query vector bias vector B_Q, the key vector bias vector B_K, the value vector bias vector B_V, the feedforward-layer first-layer weight matrix W_F1, the feedforward-layer second-layer weight matrix W_F2, the feedforward-layer first-layer bias vector B_F1, the feedforward-layer second-layer bias vector B_F2, the multi-head attention module output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
S2, the position coding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position coding weight matrix out of the weight cache, and computes the position-coded input matrix X;
the query vector weight matrix W_Q, key vector weight matrix W_K and value vector weight matrix W_V, together with the query vector bias vector B_Q, key vector bias vector B_K and value vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache; the weights and biases are taken out of the weight cache and, together with the position-coded input matrix X, used for the linear-layer computation in the neural network computing circuit to obtain the matrices Q, K and V, which are stored in the intermediate result cache; the calculation formulas are:
Q = W_Q X + B_Q
K = W_K X + B_K
V = W_V X + B_V
the multi-head attention module output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrices Q and K stored in the intermediate result cache are loaded into the neural network computing circuit; the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention score matrix S; the calculation formula is:
S = Softmax(Q K^T / √d_k)
where d_k is the number of columns of the matrices Q and K, corresponding to the dimensionality of the word vectors;
then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied by the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output bias B_O stored in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of all heads of the multi-head attention module are spliced and multiplied (multiply-add) by the multi-head attention output weight matrix W_O, the output bias vector B_O is added to the result to obtain the matrix Z, and the matrix Z is stored in the intermediate result cache; the calculation formulas are:
Z_i = S V
Z = W_O (Z_1 ... Z_i) + B_O
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit and added in the PE array to obtain the matrix A_0, which is processed by the layer normalization module to obtain the matrix L_0;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the weight matrix W_F1 undergo a multiply-add operation, the first-layer bias vector B_F1 is added to the result, and the ReLU activation yields the matrix F_0; the matrix F_0 is stored in the intermediate result cache; the calculation formula is:
F_0 = ReLU(W_F1 L_0 + B_F1)
the feedforward-layer second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix F_0 stored in the intermediate result cache and the second-layer weight matrix W_F2 and bias vector B_F2 stored in the weight cache are loaded into the neural network computing circuit; the PE array performs the multiply-add operation of F_0 and W_F2, and the second-layer bias vector B_F2 is added to the result to obtain the matrix F_1:
F_1 = W_F2 F_0 + B_F2
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit; the matrices F_1 and A_0 are added to obtain the matrix A_1, which is processed by the layer normalization module to obtain the result L_1; the result L_1 is stored in the output cache as the output of the Transformer encoder of the neural network model;
S3, keeping the weight data of the neural network model already stored in the off-chip NAND flash memory storage unit, return to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
Preferably, the neural network computing circuit comprises a PE (basic operation unit) array for matrix multiply-add operations, plus other computing modules; the other computing modules comprise any one or more of an addition tree, an activation function operation unit and a nonlinear function operation unit.
Preferably, a high-speed interface is also included; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
Preferably, between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages.
Preferably, the off-chip NAND flash memory storage unit is further configured to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The low-power-consumption neural network accelerator storage architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection; implementing the deep learning algorithm in hardware improves its performance and accelerates the operation.
Drawings
FIG. 1 is a block diagram of the structure of the storage architecture of the NAND flash memory based low power consumption neural network accelerator of the present invention;
FIG. 2 is a schematic block diagram of a neural network model Transformer encoder applied to the storage architecture of the NAND flash based low-power neural network accelerator according to the second embodiment.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
In this embodiment, a storage architecture of a low-power neural network accelerator based on a NAND flash memory, as shown in fig. 1, includes an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache includes a weight cache, an input cache, an intermediate result cache, and an output cache.
The off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server.
The controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of data.
The neural network computing circuit is used for carrying out data computation; the neural network computing circuit comprises a PE (basic operation unit) array and other computing modules for matrix multiply-add operation; the other calculation modules include any one or more than two of an addition tree, an activation function operation unit and a nonlinear function operation unit.
The weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output buffer is used for buffering the output data of the neural network computing circuit.
The system also comprises a high-speed interface; the high-speed interface is a medium for data exchange between the off-chip NAND flash memory storage unit and the internal global cache, and is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
The off-chip NAND flash memory storage unit may also be used to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
During neural network computation, the controller reads the weight data of the off-chip NAND flash memory storage unit by controlling the high-speed interface and loads it into the weight cache; between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages. The input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output.
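As an illustration only (the patented design is a hardware circuit, not software), the following Python sketch mirrors the data flow just described; the class and method names (NandFlash, read_page, Accelerator and so on) are hypothetical and the page handling is simplified.

class NandFlash:
    """Off-chip NAND unit holding the weight pages downloaded from the cloud."""
    def __init__(self, pages):
        self.pages = pages                    # list of page-sized weight blocks

    def read_page(self, idx):
        return self.pages[idx]                # weights are read in units of pages

class Accelerator:
    def __init__(self, nand):
        self.nand = nand
        self.weight_cache = []                # partitions of the internal global cache
        self.input_cache = None
        self.intermediate_cache = None
        self.output_cache = None

    def load_weights(self, page_indices):
        # Controller: NAND -> high-speed interface -> weight cache, page by page.
        self.weight_cache = [self.nand.read_page(i) for i in page_indices]

    def run(self, input_data, kernel):
        # The compute circuit reads the weight and input caches and fills the result caches.
        self.input_cache = input_data
        self.output_cache = kernel(self.input_cache, self.weight_cache)
        return self.output_cache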
The following description takes as an example the application of the low-power-consumption neural network accelerator storage architecture to a neural network model with several module layers. The method for neural network computation with the storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache;
when a subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache; if it is the last module layer, the calculation result of the neural network computing circuit is cached in the output cache and then output as the final result.
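The layer-by-layer schedule above can be summarized in a short sketch; load_layer_weights() and pe_array_compute() are hypothetical stand-ins for the controller's NAND-to-weight-cache transfer and the PE-array computation, not functions defined by the invention.

def run_model(layers, input_data, load_layer_weights, pe_array_compute):
    intermediate = None                                    # intermediate result cache
    for idx, layer in enumerate(layers):
        weights = load_layer_weights(layer)                # NAND -> weight cache
        source = input_data if idx == 0 else intermediate  # input cache vs intermediate cache
        result = pe_array_compute(layer, weights, source)
        if idx < len(layers) - 1:
            intermediate = result                          # stays on chip for the next layer
        else:
            return result                                  # output cache -> final output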
The low-power-consumption neural network accelerator storage architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection; implementing the deep learning algorithm in hardware improves its performance and accelerates the operation.
The off-chip storage of the invention uses NAND flash memory, a non-volatile storage device whose basic storage cell is a floating-gate MOS transistor. Data is written and read by using an external electric field to drive electrons into and out of the floating gate, so the stored information is retained in the floating gate and is not lost when the power is cut off. By contrast, most neural network accelerators use DRAM as off-chip memory; DRAM stores one bit of data with a MOS transistor plus a capacitor, and because the capacitor holds the charge, the data must be refreshed periodically; once power is removed, the charge stored in the capacitor disappears and the data stored in the DRAM is lost. When an end-side device uses DRAM for off-chip storage, data lost on power failure must be read again from the cloud, wasting a large amount of power, bandwidth and time. Compared with current neural network accelerators based on DRAM off-chip storage, a neural network accelerator based on NAND flash off-chip storage therefore offers clearly better power-off protection.
The basic storage cell of the off-chip DRAM used by existing neural network accelerators is a MOS transistor plus a capacitor; data is stored by charging and discharging the capacitor, and because the charge gradually leaks over time, it must be refreshed periodically, which causes extra power consumption. NAND flash memory requires no refresh, which avoids this overhead.
Compared with existing neural network processor storage architectures based on DRAM off-chip storage, a storage architecture that uses NAND flash memory as off-chip storage has a basic storage cell consisting of only a floating-gate MOS transistor, one capacitor fewer than the DRAM cell (a capacitor plus a MOS transistor); its cost is therefore lower than that of DRAM, and this low cost is highly competitive in end-side devices. In addition, because the NAND flash cell omits the capacitor, NAND flash achieves a higher integration density for the same area, or lighter weight and a smaller volume for the same density; a neural network accelerator based on NAND flash off-chip storage is thus lighter and smaller and better suited to end-side devices.
Example two
In this embodiment, the low-power-consumption neural network accelerator storage architecture is applied to a Transformer encoder neural network model as an example. The principle of the Transformer encoder is shown in fig. 2. The method for neural network computation with the storage architecture comprises the following steps:
S1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position coding weight matrix, the query vector weight matrix W_Q, the key vector weight matrix W_K, the value vector weight matrix W_V, the query vector bias vector B_Q, the key vector bias vector B_K, the value vector bias vector B_V, the feedforward-layer first-layer weight matrix W_F1, the feedforward-layer second-layer weight matrix W_F2, the feedforward-layer first-layer bias vector B_F1, the feedforward-layer second-layer bias vector B_F2, the multi-head attention module output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
S2, the position coding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position coding weight matrix out of the weight cache, and computes the position-coded input matrix X;
the query vector weight matrix W_Q, key vector weight matrix W_K and value vector weight matrix W_V, together with the query vector bias vector B_Q, key vector bias vector B_K and value vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache; the weights and biases are taken out of the weight cache and, together with the position-coded input matrix X, used for the linear-layer computation in the neural network computing circuit to obtain the matrices Q, K and V, which are stored in the intermediate result cache; the calculation formulas are:
Q = W_Q X + B_Q
K = W_K X + B_K
V = W_V X + B_V
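For readers who prefer software notation, a small numpy sketch of the position-coding and linear-layer step is shown here. It uses a row-per-token layout (X @ W_Q rather than the W_Q X form written above, i.e. the transposed but equivalent computation), assumes the position coding weights are simply added to X, and all function names are illustrative.

import numpy as np

def position_encode(X, PE):
    # X, PE: (seq_len, d_model); the position coding weights are assumed to be additive.
    return X + PE

def qkv_projection(X, W_Q, W_K, W_V, B_Q, B_K, B_V):
    Q = X @ W_Q + B_Q      # query matrix, written to the intermediate result cache
    K = X @ W_K + B_K      # key matrix
    V = X @ W_V + B_V      # value matrix
    return Q, K, V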
The multi-head attention module output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache, and the matrices Q and K stored in the intermediate result cache are loaded into the neural network computing circuit. First, the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention score matrix S; the calculation formula is:
S = Softmax(Q K^T / √d_k)
where d_k is the number of columns of the matrices Q and K, corresponding to the dimensionality of the word vectors;
then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied by the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output bias B_O stored in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of all heads of the multi-head attention module are spliced and multiplied (multiply-add) by the multi-head attention output weight matrix W_O, the output bias vector B_O is added to the result to obtain the matrix Z, and the matrix Z is stored in the intermediate result cache; the calculation formulas are:
Z_i = S V
Z = W_O (Z_1 ... Z_i) + B_O
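A numpy sketch of this attention step for one head and of the multi-head recombination, again in the row-per-token layout; softmax() merely stands in for the hardware Softmax module, and the head concatenation order is an assumption.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    d_k = Q.shape[-1]                        # number of columns of Q and K
    S = softmax(Q @ K.T / np.sqrt(d_k))      # attention score matrix S
    return S @ V                             # per-head output Z_i

def multi_head_output(head_outputs, W_O, B_O):
    Z_cat = np.concatenate(head_outputs, axis=-1)  # splice the Z_i matrices
    return Z_cat @ W_O + B_O                       # Z = concat(Z_i) projected by W_O plus B_O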
The feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit and added in the PE array to obtain the matrix A_0, which is processed by the layer normalization module to obtain the matrix L_0;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the weight matrix W_F1 undergo a multiply-add operation, the first-layer bias vector B_F1 is added to the result, and the ReLU activation yields the matrix F_0; the matrix F_0 is stored in the intermediate result cache; the calculation formula is:
F_0 = ReLU(W_F1 L_0 + B_F1)
the feedforward-layer second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix F_0 stored in the intermediate result cache and the second-layer weight matrix W_F2 and bias vector B_F2 stored in the weight cache are loaded into the neural network computing circuit; the PE array performs the multiply-add operation of F_0 and W_F2, and the second-layer bias vector B_F2 is added to the result to obtain the matrix F_1:
F_1 = W_F2 F_0 + B_F2
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit; the matrices F_1 and A_0 are added to obtain the matrix A_1, which is processed by the layer normalization module to obtain the result L_1; the result L_1 is stored in the output cache as the output of the Transformer encoder of the neural network model;
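A numpy sketch of these two add-and-norm stages and the two feed-forward layers (A_0 = X + Z, L_0 = LayerNorm(A_0), F_0 = ReLU(W_F1 L_0 + B_F1), F_1 = W_F2 F_0 + B_F2, A_1 = F_1 + A_0, L_1 = LayerNorm(A_1)), once more in the row-per-token layout; the per-stage gain and bias arguments are assumed to be the layer-normalization parameters loaded in step S1.

import numpy as np

def layer_norm(A, gain, bias, eps=1e-6):
    mean = A.mean(axis=-1, keepdims=True)
    var = A.var(axis=-1, keepdims=True)
    return gain * (A - mean) / np.sqrt(var + eps) + bias

def feed_forward_block(X, Z, W_F1, B_F1, W_F2, B_F2, gain1, bias1, gain2, bias2):
    A0 = X + Z                                  # residual add in the PE array
    L0 = layer_norm(A0, gain1, bias1)           # first layer normalization
    F0 = np.maximum(0.0, L0 @ W_F1 + B_F1)      # ReLU(W_F1 L0 + B_F1)
    F1 = F0 @ W_F2 + B_F2                       # second feed-forward layer
    A1 = F1 + A0                                # second residual add
    return layer_norm(A1, gain2, bias2)         # encoder output L1 -> output cache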
S3, keeping the weight data of the neural network model already stored in the off-chip NAND flash memory storage unit, return to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
The Softmax module and the layer normalization module are both nonlinear function operation units of the neural network computing circuit.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A low-power-consumption neural network accelerator storage architecture based on NAND flash memory, characterized in that: it comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when the neural network computation is performed, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads it into the weight cache; the input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output.
2. The NAND-flash-based low-power neural network accelerator memory architecture of claim 1, wherein: the architecture is applied to a neural network model with a plurality of module layers; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache;
when a subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache; if it is the last module layer, the calculation result of the neural network computing circuit is cached in the output cache and then output as the final result.
3. The NAND-flash-based low-power neural network accelerator memory architecture of claim 2, wherein: the architecture is applied to a Transformer encoder neural network model; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps:
S1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position coding weight matrix, the query vector weight matrix W_Q, the key vector weight matrix W_K, the value vector weight matrix W_V, the query vector bias vector B_Q, the key vector bias vector B_K, the value vector bias vector B_V, the feedforward-layer first-layer weight matrix W_F1, the feedforward-layer second-layer weight matrix W_F2, the feedforward-layer first-layer bias vector B_F1, the feedforward-layer second-layer bias vector B_F2, the multi-head attention module output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
S2, the position coding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position coding weight matrix out of the weight cache, and computes the position-coded input matrix X;
the query vector weight matrix W_Q, key vector weight matrix W_K and value vector weight matrix W_V, together with the query vector bias vector B_Q, key vector bias vector B_K and value vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache; the weights and biases are taken out of the weight cache and, together with the position-coded input matrix X, used for the linear-layer computation in the neural network computing circuit to obtain the matrices Q, K and V, which are stored in the intermediate result cache;
the multi-head attention module output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrices Q and K stored in the intermediate result cache are loaded into the neural network computing circuit; the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention score matrix S;
then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied by the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output bias B_O stored in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of all heads of the multi-head attention module are spliced and multiplied (multiply-add) by the multi-head attention output weight matrix W_O, the output bias vector B_O is added to the result to obtain the matrix Z, and the matrix Z is stored in the intermediate result cache;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit and added in the PE array to obtain the matrix A_0, which is processed by the layer normalization module to obtain the matrix L_0;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the weight matrix W_F1 undergo a multiply-add operation, the first-layer bias vector B_F1 is added to the result, and the ReLU activation yields the matrix F_0; the matrix F_0 is stored in the intermediate result cache;
the feedforward-layer second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix F_0 stored in the intermediate result cache and the second-layer weight matrix W_F2 and bias vector B_F2 stored in the weight cache are loaded into the neural network computing circuit; the PE array performs the multiply-add operation of F_0 and W_F2, and the second-layer bias vector B_F2 is added to the result to obtain the matrix F_1;
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit; the matrices F_1 and A_0 are added to obtain the matrix A_1, which is processed by the layer normalization module to obtain the result L_1; the result L_1 is stored in the output cache as the output of the Transformer encoder of the neural network model;
S3, keeping the weight data of the neural network model already stored in the off-chip NAND flash memory storage unit, return to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
4. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: the system also comprises a high-speed interface; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
5. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages.
6. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: the off-chip NAND flash memory storage unit is also used for storing the final operation result of the neural network computing circuit.
CN202110349392.2A 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture Active CN113159309B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110349392.2A CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110349392.2A CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Publications (2)

Publication Number Publication Date
CN113159309A true CN113159309A (en) 2021-07-23
CN113159309B CN113159309B (en) 2023-03-21

Family

ID=76885744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110349392.2A Active CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Country Status (1)

Country Link
CN (1) CN113159309B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787366A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN111062471A (en) * 2019-11-23 2020-04-24 复旦大学 Deep learning accelerator for accelerating BERT neural network operations
CN111222626A (en) * 2019-11-07 2020-06-02 合肥恒烁半导体有限公司 Data segmentation operation method of neural network based on NOR Flash module
CN111241028A (en) * 2018-11-28 2020-06-05 北京知存科技有限公司 Digital-analog hybrid storage and calculation integrated chip and calculation device
US20200184335A1 (en) * 2018-12-06 2020-06-11 Western Digital Technologies, Inc. Non-volatile memory die with deep learning neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241028A (en) * 2018-11-28 2020-06-05 北京知存科技有限公司 Digital-analog hybrid storage and calculation integrated chip and calculation device
US20200184335A1 (en) * 2018-12-06 2020-06-11 Western Digital Technologies, Inc. Non-volatile memory die with deep learning neural network
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN111222626A (en) * 2019-11-07 2020-06-02 合肥恒烁半导体有限公司 Data segmentation operation method of neural network based on NOR Flash module
CN111062471A (en) * 2019-11-23 2020-04-24 复旦大学 Deep learning accelerator for accelerating BERT neural network operations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
仇越: "Research and Implementation of FPGA-based Convolutional Neural Network Acceleration Methods", China Master's Theses Full-text Database, Information Science and Technology Series *
陈建明 (ed.): "Embedded Systems and Applications", 28 February 2017, National Defense Industry Press *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787366A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof
CN117787366B (en) * 2024-02-28 2024-05-10 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof

Also Published As

Publication number Publication date
CN113159309B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
Nguyen et al. An approximate memory architecture for a reduction of refresh power consumption in deep learning applications
US11625584B2 (en) Reconfigurable memory compression techniques for deep neural networks
US20190188237A1 (en) Method and electronic device for convolution calculation in neutral network
CN112151091B (en) 8T SRAM unit and memory computing device
CN109902822B (en) Memory computing system and method based on Sgimenk track storage
CN109934336B (en) Neural network dynamic acceleration platform design method based on optimal structure search and neural network dynamic acceleration platform
CN108446764B (en) Novel neuromorphic chip architecture
CN113159309B (en) NAND flash memory-based low-power-consumption neural network accelerator storage architecture
CN114937470B (en) Fixed point full-precision memory computing circuit based on multi-bit SRAM unit
CN110322008A (en) Residual convolution neural network-based quantization processing method and device
CN110176264A (en) A kind of high-low-position consolidation circuit structure calculated interior based on memory
US20210216846A1 (en) Transpose memory unit for multi-bit convolutional neural network based computing-in-memory applications, transpose memory array structure for multi-bit convolutional neural network based computing-in-memory applications and computing method thereof
CN113296734A (en) Multi-position storage device
CN112233712B (en) 6T SRAM (static random Access memory) storage device, storage system and storage method
Jeong et al. A 28nm 1.644 tflops/w floating-point computation sram macro with variable precision for deep neural network inference and training
Sehgal et al. Trends in analog and digital intensive compute-in-SRAM designs
CN116401502B (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN113539318A (en) Memory computing circuit chip based on magnetic cache and computing device
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
US11615834B2 (en) Semiconductor storage device and information processor
CN114895869A (en) Multi-bit memory computing device with symbols
Lee et al. Robustness of differentiable neural computer using limited retention vector-based memory deallocation in language model
US20200257959A1 (en) Memory device having an address generator using a neural network algorithm and a memory system including the same
Kumar et al. Design and power analysis of 16× 16 SRAM Array Employing 7T I-LSVL
Yook et al. Refresh Methods and Accuracy Evaluation for 2T0C DRAM based Processing-in-memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant