CN113159309A - NAND flash memory-based low-power-consumption neural network accelerator storage architecture - Google Patents
- Publication number: CN113159309A
- Application number: CN202110349392.2A
- Authority: CN (China)
- Prior art keywords: matrix, cache, weight, neural network, layer
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06F12/0866 — Addressing of a memory level in which access requires associative addressing means (caches) for peripheral storage systems, e.g. disk cache
- G06F12/0868 — Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
- G06F12/0877 — Cache access modes
- G06F12/0882 — Page mode
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06F2212/202 — Main memory employing a specific memory technology: non-volatile memory
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a low-power-consumption neural network accelerator storage architecture based on NAND flash memory. The architecture comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate-result cache, and an output cache. During neural network computation, the controller reads weight data from the off-chip NAND flash memory storage unit and loads it into the weight cache, and loads input data into the input cache; the neural network computing circuit loads the weight data from the weight cache and the input data from the input cache and performs the operation, intermediate results are cached in the intermediate-result cache, and the final result is cached in the output cache and then output. The architecture meets the computational requirements of deep-learning inference tasks on end-side devices, consumes little power, and provides power-off protection.
Description
Technical Field
The invention relates to the technical field of integrated circuit design, in particular to a low-power-consumption neural network accelerator storage architecture based on a NAND flash memory.
Background
With the development of artificial intelligence and the Internet of Things, their combination is accelerating, and end-side artificial intelligence is in a stage of rapid development. At the same time, as the performance and complexity of deep-learning algorithms grow in step, the computational tasks of deep learning increasingly need to be integrated into hardware architectures to accelerate execution.
End-side devices also face a variety of complex application scenarios while being subject to power-consumption and cost constraints. In this context, low power consumption and low cost are basic requirements for an end-side neural network accelerator that is to run deep-learning inference tasks on the device.
Memories in current use fall into two categories, volatile and non-volatile. Volatile memories comprise static random-access memory (SRAM) and dynamic random-access memory (DRAM); among non-volatile memories, flash currently dominates the market. With the development of mobile devices and the Internet of Things, off-chip memory today is mainly DRAM and NAND flash. SRAM has the fastest read speed but the highest cost, so it is used only as a high-speed internal cache. In existing neural network accelerator architectures, the common memory organization is SRAM as on-chip cache and DRAM as off-chip memory. DRAM reads and writes faster than NAND flash but cannot retain data without power and costs more. When an end-side device uses DRAM for off-chip storage, any data loss forces the data to be re-read from the cloud, wasting considerable power, bandwidth, and time.
Disclosure of Invention
To overcome the defects and shortcomings of the prior art, the invention aims to provide a low-power-consumption neural network accelerator storage architecture based on NAND flash memory. The architecture meets the computational requirements of deep-learning inference tasks on end-side devices, consumes little power, and provides power-off protection.
To achieve this purpose, the invention is realized by the following technical scheme. A NAND-flash-based low-power-consumption neural network accelerator storage architecture comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate-result cache, and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller controls the movement and computation of data, and is responsible for triggering the computations of the neural network computing circuit and for the writing in and writing out of data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
during neural network computation, the controller reads weight data from the off-chip NAND flash memory storage unit and loads it into the weight cache, and loads input data into the input cache; the neural network computing circuit loads the weight data from the weight cache and the input data from the input cache and performs the operation; intermediate results are cached in the intermediate-result cache, the final result is cached in the output cache, and the final result is then output.
Preferably, the architecture is applied to a neural network model having several module layers. The method for computing the neural network with the low-power-consumption accelerator storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the model sequentially, layer by layer:
when the first module layer is computed, its corresponding weight data in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data in the weight cache and the input data in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate result is cached in the intermediate-result cache;

when a subsequent module layer is computed, the corresponding weight data of that layer is loaded from the off-chip NAND flash memory storage unit into the weight cache, and the neural network computing circuit loads the weight data from the weight cache and the intermediate result from the intermediate-result cache and performs the operation. If the layer is not the last module layer, its intermediate result is cached in the intermediate-result cache; if it is the last module layer, the result is cached in the output cache and the final result is then output.
Preferably, the architecture is applied to a neural network model Transformer encoder; the method for computing the neural network with the low-power-consumption accelerator storage architecture comprises the following steps:
s1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position-encoding weight matrix, the query-vector weight matrix W_Q, the key-vector weight matrix W_K, the value-vector weight matrix W_V, the query-vector bias vector B_Q, the key-vector bias vector B_K, the value-vector bias vector B_V, the feed-forward first-layer weight matrix W_F1, the feed-forward second-layer weight matrix W_F2, the feed-forward first-layer bias vector B_F1, the feed-forward second-layer bias vector B_F2, the multi-head attention output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
s2, the position-encoding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the input matrix X for the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X from the input cache and the position-encoding weight matrix from the weight cache, and computes the position-encoded input matrix X;
the query-vector weight matrix W_Q, key-vector weight matrix W_K, and value-vector weight matrix W_V, together with the query-vector bias vector B_Q, key-vector bias vector B_K, and value-vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache. The weights and biases are taken from the weight cache, and linear-layer computation with the position-encoded input matrix X is performed in the neural network computing circuit to obtain matrix Q, matrix K, and matrix V, which are stored in the intermediate-result cache. The calculation formulas are:

Q = W_Q·X + B_Q

K = W_K·X + B_K

V = W_V·X + B_V
the multi-head attention output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache, and matrix Q and matrix K are loaded from the intermediate-result cache into the neural network computing circuit. The matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention-score matrix S. The calculation formula is:

S = Softmax(Q·K^T / √d_k)

where d_k, the number of columns of matrix Q and matrix K, corresponds to the dimensionality of the word vectors;
then matrix V is loaded from the intermediate-result cache into the neural network computing circuit and multiplied with the attention-score matrix S to obtain the per-head output matrix Z_i. The multi-head attention output weight W_O and output bias B_O are loaded from the weight cache into the neural network computing circuit; the output matrices Z_i of the heads of the multi-head attention module are spliced and multiplied-and-accumulated with the output weight matrix W_O, the output bias vector B_O is added to the result to obtain matrix Z, and Z is stored in the intermediate-result cache. The calculation formulas are:

Z_i = S·V

Z = W_O·(Z_1 ... Z_i) + B_O
the feed-forward first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache. The input matrix X and the matrix Z stored in the intermediate-result cache are loaded into the neural network computing circuit; the addition of X and Z is performed in the PE array to obtain matrix A_0, and A_0 is passed through the layer-normalization module to obtain matrix L_0;
the feed-forward first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit. Matrix L_0 is multiplied-and-accumulated with W_F1, the first-layer bias vector B_F1 is added to the result, and matrix F_0 is obtained through the ReLU; F_0 is stored in the intermediate-result cache. The calculation formula is:

F_0 = ReLU(W_F1·L_0 + B_F1)
the feed-forward second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache. Matrix F_0 from the intermediate-result cache and W_F2 and B_F2 from the weight cache are loaded into the neural network computing circuit; F_0 is multiplied-and-accumulated with W_F2 in the PE array, and B_F2 is added to the result to obtain matrix F_1:

F_1 = W_F2·F_0 + B_F2
matrix A_0 is loaded from the intermediate-result cache into the neural network computing circuit; matrix F_1 and matrix A_0 are added to obtain matrix A_1, and A_1 is passed through the layer-normalization module to obtain the result L_1, which is stored in the output cache as the output of the neural network model Transformer encoder;
s3, using the weight data of the neural network model currently stored in the off-chip NAND flash memory storage unit, jump to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
Preferably, the neural network computing circuit includes a PE (basic operation unit) array for performing matrix multiply-add operations, and other computing modules; the other computing modules include any one or more of an adder tree, an activation-function operation unit, and a nonlinear-function operation unit.
Preferably, a high-speed interface is also included; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
Preferably, the weight data is loaded by reading in units of pages between the off-chip NAND flash memory cells and the weight cache.
Preferably, the off-chip NAND flash memory storage unit is further configured to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the low-power-consumption neural network accelerator storage framework can meet the calculation requirement of a deep learning algorithm for completing an inference task on end-side equipment, is low in power consumption and has a power-off protection function; the deep learning algorithm is realized by adopting hardware, so that the performance of the deep learning algorithm can be improved, and the operation speed is accelerated.
Drawings
FIG. 1 is a block diagram of the structure of the storage architecture of the NAND flash memory based low power consumption neural network accelerator of the present invention;
FIG. 2 is a schematic block diagram of a neural network model Transformer encoder applied to the storage architecture of the NAND flash based low-power neural network accelerator according to the second embodiment.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
In this embodiment, a storage architecture of a low-power neural network accelerator based on a NAND flash memory, as shown in fig. 1, includes an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache includes a weight cache, an input cache, an intermediate result cache, and an output cache.
The off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server.
The controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of data.
The neural network computing circuit is used for performing data computation; it comprises a PE (basic operation unit) array for matrix multiply-add operations, and other computing modules. The other computing modules include any one or more of an adder tree, an activation-function operation unit, and a nonlinear-function operation unit.
The weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output buffer is used for buffering the output data of the neural network computing circuit.
The system also comprises a high-speed interface, which serves as the medium for data exchange, and hence data transmission, between the off-chip NAND flash memory storage unit and the internal global cache.
The off-chip NAND flash memory storage unit may also be used to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
During neural network computation, the controller reads the weight data of the off-chip NAND flash memory storage unit through the high-speed interface and loads it into the weight cache; between the off-chip NAND flash memory storage unit and the weight cache, weight data is read and loaded in units of pages. Input data is loaded into the input cache; the neural network computing circuit loads the weight data from the weight cache and the input data from the input cache and performs the operation; intermediate results are cached in the intermediate-result cache, and the final result is cached in the output cache and then output.
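The page-granular transfer from NAND into the weight cache can be sketched as follows. This is a minimal illustration, not the patent's actual interface: `FakeFlash`, `read_page`, and the 4 KiB `NAND_PAGE_BYTES` value are all assumptions made for the example.

```python
NAND_PAGE_BYTES = 4096  # assumed page size; real NAND parts vary (2 KiB to 16 KiB)

class FakeFlash:
    """Stand-in for the off-chip NAND array: page p is filled with byte p % 256."""
    def read_page(self, p):
        return bytes([p % 256]) * NAND_PAGE_BYTES

def load_weights(flash, first_page, num_bytes):
    """Read weight data page by page, since NAND only supports page-granular
    reads, then trim the last page's padding before filling the weight cache."""
    pages_needed = -(-num_bytes // NAND_PAGE_BYTES)  # ceiling division
    data = bytearray()
    for p in range(first_page, first_page + pages_needed):
        data += flash.read_page(p)  # one NAND read command per page
    return bytes(data[:num_bytes])
```

A 5000-byte weight blob thus costs two page reads, and the excess bytes of the second page are discarded rather than cached.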
The following description will take the example that the storage architecture of the low-power neural network accelerator is applied to a neural network model with several module layers. The method for calculating the neural network by the storage framework of the low-power-consumption neural network accelerator comprises the following steps: loading the weight data of the neural network model into an off-chip NAND flash memory storage unit from a cloud or a server; the neural network computing circuit sequentially computes each module layer of the neural network model layer by layer:
When the first module layer is computed, its corresponding weight data in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data in the weight cache and the input data in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate result is cached in the intermediate-result cache.

When a subsequent module layer is computed, the corresponding weight data of that layer is loaded from the off-chip NAND flash memory storage unit into the weight cache, and the neural network computing circuit loads the weight data from the weight cache and the intermediate result from the intermediate-result cache and performs the operation. If the layer is not the last module layer, its intermediate result is cached in the intermediate-result cache; if it is the last module layer, the result is cached in the output cache and the final result is then output.
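The layer-by-layer schedule above amounts to a loop in which only the current layer's weights reside in the weight cache, and the activation plays the role of the intermediate-result cache. A minimal sketch under assumed names (`load_layer_weights` stands in for the NAND-to-weight-cache transfer; the toy layer is a multiply-add plus ReLU, not the patent's full circuit):

```python
import numpy as np

def run_network(num_layers, load_layer_weights, x):
    """Compute module layers one by one: load the current layer's weights,
    run the multiply-add, and keep the result as the intermediate activation
    until the last layer, whose result goes to the output cache."""
    act = x  # first layer reads from the input cache
    for i in range(num_layers):
        w = load_layer_weights(i)       # NAND -> weight cache, one layer at a time
        act = np.maximum(w @ act, 0.0)  # toy layer: PE-array multiply-add + ReLU
    return act  # final result -> output cache
```

The point of the schedule is that the weight cache never needs to hold more than one layer's weights, which is what makes a small on-chip cache sufficient.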
The low-power-consumption neural network accelerator storage framework can meet the calculation requirement of a deep learning algorithm for completing an inference task on end-side equipment, is low in power consumption and has a power-off protection function; the deep learning algorithm is realized by adopting hardware, so that the performance of the deep learning algorithm can be improved, and the operation speed is accelerated.
The off-chip storage of the invention uses NAND flash, a non-volatile storage device whose basic storage cell is a floating-gate MOS transistor. An external electric field controls electrons moving into and out of the floating gate to write and read data; because the stored information is retained in the floating gate, it is not lost when power is cut off. By contrast, most neural network accelerators use DRAM as off-chip memory. A DRAM cell stores one bit with a MOS transistor plus a capacitor that holds charge; the charge must be replenished periodically to refresh the data, and once power is lost the charge in the capacitor disappears and the stored data is lost. When an end-side device uses DRAM for off-chip storage, a power failure forces the data to be re-read from the cloud, wasting considerable power, bandwidth, and time. Compared with current DRAM-based accelerators, a neural network accelerator that uses NAND flash as off-chip storage therefore clearly offers better power-off protection.
The basic memory cell of the off-chip DRAM used by existing neural networks is a MOS transistor plus a capacitor; data is stored by charging and discharging the capacitor. Because the charge in the capacitor gradually leaks over time, the cell must be refreshed at regular intervals, causing extra power consumption.
Compared with existing neural network processor storage architectures that use DRAM as off-chip storage, a storage architecture that uses NAND flash as off-chip memory has a basic storage cell of only one floating-gate MOS transistor, one capacitor fewer than the DRAM cell's transistor plus capacitor, so its price is lower than DRAM's; this low cost is highly competitive for end-side devices. In addition, with one capacitor fewer per cell, NAND flash achieves a higher integration density in the same area, and at the same level of integration it is lighter and smaller in volume; a NAND-flash-based neural network accelerator is therefore lighter and more compact, and better suited to end-side devices.
Example two
In this embodiment, the low-power-consumption neural network accelerator storage architecture is applied to a neural network model Transformer encoder, whose principle is shown in FIG. 2. The method for computing the neural network with the storage architecture comprises the following steps:
s1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position-encoding weight matrix, the query-vector weight matrix W_Q, the key-vector weight matrix W_K, the value-vector weight matrix W_V, the query-vector bias vector B_Q, the key-vector bias vector B_K, the value-vector bias vector B_V, the feed-forward first-layer weight matrix W_F1, the feed-forward second-layer weight matrix W_F2, the feed-forward first-layer bias vector B_F1, the feed-forward second-layer bias vector B_F2, the multi-head attention output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
s2, the position-encoding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the input matrix X for the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X from the input cache and the position-encoding weight matrix from the weight cache, and computes the position-encoded input matrix X;
the query-vector weight matrix W_Q, key-vector weight matrix W_K, and value-vector weight matrix W_V, together with the query-vector bias vector B_Q, key-vector bias vector B_K, and value-vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache. The weights and biases are taken from the weight cache, and linear-layer computation with the position-encoded input matrix X is performed in the neural network computing circuit to obtain matrix Q, matrix K, and matrix V, which are stored in the intermediate-result cache. The calculation formulas are:

Q = W_Q·X + B_Q

K = W_K·X + B_K

V = W_V·X + B_V
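The three linear layers amount to one matrix multiply-add per projection. A NumPy sketch with toy sizes, using the column-per-token convention implied by the formulas Q = W_Q·X + B_Q (all names and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3                      # word-vector dimension and token count (toy sizes)
X = rng.standard_normal((d, n))  # position-encoded input, one column per token

W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
B_Q, B_K, B_V = (rng.standard_normal((d, 1)) for _ in range(3))  # broadcast over tokens

# Linear-layer step: the PE array performs one multiply-add per projection.
Q = W_Q @ X + B_Q
K = W_K @ X + B_K
V = W_V @ X + B_V
```

Since all three projections read the same X, the accelerator can keep X in the input cache and stream only the weight matrices through the weight cache.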
the multi-head attention output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache, and matrix Q and matrix K are loaded from the intermediate-result cache into the neural network computing circuit. First the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention-score matrix S. The calculation formula is:

S = Softmax(Q·K^T / √d_k)

where d_k, the number of columns of matrix Q and matrix K, corresponds to the dimensionality of the word vectors;
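The attention-score computation (scaled dot-product attention, S = Softmax(Q·K^T / √d_k), consistent with the d_k definition in the text) can be sketched as follows; a row-per-token convention is used here purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(Q, K):
    """S = Softmax(Q K^T / sqrt(d_k)), where d_k is the column count of Q and K."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k))
```

Each row of S is a probability distribution over tokens, which is what the Softmax module produces from the PE array's multiply-add result.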
then, loading the matrix V stored in the intermediate result cache to a neural network computing circuit, and multiplying the matrix V with the attention fractional matrix S to obtain an output matrix Zi(ii) a Multi-head attention module output weight W for caching weightOAnd an output offset BOLoading to a neural network computing circuit; output matrix Z of each head of multi-head attention moduleiOutput weight matrix W of multi-head attention module after splicingOPerforming multiplication and addition operation, adding output offset vector B to the result of multiplication and addition operationOObtaining a matrix Z; storing the matrix Z into an intermediate result cache; the calculation formula is as follows:
Z_i = S·V

Z = W_O·(Z_1 ... Z_i) + B_O
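The splicing-and-projection step can be sketched as follows. The head count, the sizes, and the features-in-rows/one-column-per-token layout are hypothetical choices for illustration:

```python
import numpy as np

h, d_v, d_model, n = 2, 3, 6, 4   # hypothetical: heads, head dim, model dim, tokens
rng = np.random.default_rng(1)

# Stand-ins for the per-head outputs Z_i = S·V, one column per token.
heads = [rng.standard_normal((d_v, n)) for _ in range(h)]

W_O = rng.standard_normal((d_model, h * d_v))  # multi-head output weight matrix
B_O = rng.standard_normal((d_model, 1))        # output offset vector, broadcast

Z_cat = np.concatenate(heads, axis=0)  # splice the head outputs together
Z = W_O @ Z_cat + B_O                  # Z = W_O·(Z_1 ... Z_h) + B_O
```

The concatenation axis (stacking head outputs along the feature dimension) is the conventional reading of the splicing step; the patent itself does not fix the layout.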
The feed-forward layer first-layer weight matrix W_F1 and feed-forward layer first-layer offset vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the final output matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit, the addition of the input matrix X and the final output matrix Z is performed in the PE array to obtain matrix A_0, and matrix A_0 is computed by the layer normalization module to obtain matrix L_0;
The feed-forward layer first-layer weight matrix W_F1 and feed-forward layer first-layer offset vector B_F1 held in the weight cache are loaded into the neural network computing circuit; matrix L_0 and the feed-forward layer first-layer weight matrix W_F1 undergo a multiply-add operation, the feed-forward layer first-layer offset vector B_F1 is added to the result of the multiply-add operation, and matrix F_0 is obtained through the ReLU activation; matrix F_0 is stored in the intermediate result cache; the calculation formula is as follows:
F_0 = ReLU(W_F1·L_0 + B_F1)
The feed-forward layer second-layer weight matrix W_F2 and feed-forward layer second-layer offset vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; matrix F_0 from the intermediate result cache, together with the feed-forward layer second-layer weight matrix W_F2 and feed-forward layer second-layer offset vector B_F2 from the weight cache, are loaded into the neural network computing circuit, the multiply-add operation of matrix F_0 and the feed-forward layer second-layer weight matrix W_F2 is performed in the PE array, and the feed-forward layer second-layer offset vector B_F2 is added to the result of the multiply-add operation to obtain matrix F_1:
F_1 = W_F2·F_0 + B_F2
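The two feed-forward layers above can be sketched together as one helper; the shapes follow the patent's weight-times-activation ordering, and all sizes are left to the caller:

```python
import numpy as np

def feed_forward(L0, W_F1, B_F1, W_F2, B_F2):
    """Feed-forward block: F0 = ReLU(W_F1·L0 + B_F1), then F1 = W_F2·F0 + B_F2."""
    F0 = np.maximum(W_F1 @ L0 + B_F1, 0.0)  # first layer with ReLU activation
    return W_F2 @ F0 + B_F2                 # second layer, no activation
```

In the accelerator, F_0 would round-trip through the intermediate result cache between the two matmuls while W_F2 and B_F2 are fetched from flash; the function collapses that staging into one call.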
Matrix A_0 from the intermediate result cache is loaded into the neural network computing circuit; matrix F_1 and matrix A_0 are added to obtain matrix A_1, matrix A_1 is computed by the layer normalization module to obtain the result L_1, and the result L_1 is stored in the output cache as the output of the neural network model Transformer encoder;
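The residual addition followed by layer normalization (A_1 to L_1 here, and likewise A_0 to L_0 earlier) can be sketched as below. Normalizing each token's feature column and the small eps constant are standard conventions assumed for this sketch, not details given in the patent:

```python
import numpy as np

def add_and_norm(residual, sublayer_out, gain=1.0, bias=0.0, eps=1e-6):
    """Add the residual, then layer-normalize each column (one token's features)."""
    A = residual + sublayer_out
    mu = A.mean(axis=0, keepdims=True)   # per-token mean over features
    var = A.var(axis=0, keepdims=True)   # per-token variance over features
    return gain * (A - mu) / np.sqrt(var + eps) + bias
```

The gain and bias parameters correspond to the layer normalization gain and bias that step S1 loads from flash alongside the weight matrices.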
S3. Using the neural network model weight data currently stored in the off-chip NAND flash memory storage unit, jump back to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
The Softmax module and the layer normalization module are both nonlinear function operation units of the neural network computing circuit.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent thereof and is intended to fall within the scope of the present invention.
Claims (6)
1. A low-power-consumption neural network accelerator storage architecture based on NAND flash memory, characterized in that it comprises: an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache, and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when the neural network calculation is carried out, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads the weight data into the weight cache; loading input data into an input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache, then the operation is carried out, the intermediate computing result of the neural network computing circuit is cached in the intermediate result cache, the final computing result is cached in the output cache, and then the final computing result is output.
2. The NAND-flash-based low power neural network accelerator memory architecture of claim 1, wherein: the method is applied to a neural network model with a plurality of module layers; the method for calculating the neural network by the low-power-consumption neural network accelerator storage framework comprises the following steps: loading the weight data of the neural network model into an off-chip NAND flash memory storage unit from a cloud or a server; the neural network computing circuit sequentially computes each module layer of the neural network model layer by layer:
when the first layer module layer is calculated, corresponding weight data of the first layer module layer in an off-chip NAND flash memory storage unit is loaded into a weight cache, the weight data stored in the weight cache and input data stored in an input cache are loaded into a neural network calculating circuit, then operation is carried out, and a middle calculation result of the neural network calculating circuit is cached into a middle result cache;
when the next module layer is calculated, the corresponding weight data of the currently calculated module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network calculating circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs operation; if the current calculated module layer is not the last module layer, caching the intermediate calculation result of the neural network calculation circuit into an intermediate result cache; if the current calculated module layer is the last module layer, the calculation result of the neural network calculation circuit is cached in an output cache, and then the final calculation result is output.
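The layer-by-layer flow described in claims 1 and 2 can be sketched in Python. Everything here (the dict standing in for the NAND flash, the toy compute function) is a hypothetical illustration of the caching order, not an interface defined by the patent:

```python
def compute(weights, data):
    # Stand-in for the neural network computing circuit (PE-array operation).
    return [w * d for w, d in zip(weights, data)]

def run_network(layer_names, input_data, flash):
    """Compute module layers in order: before each layer, its weights are
    loaded from off-chip NAND flash into the weight cache; results stay in
    the intermediate result cache between layers."""
    intermediate = input_data                # first layer reads the input cache
    for name in layer_names:
        weight_cache = flash[name]           # load this layer's weights from flash
        intermediate = compute(weight_cache, intermediate)
    return intermediate                      # last result goes to the output cache
```

The point of the scheme is that only one layer's weights ever occupy the small on-chip weight cache, so the full model can stay in dense, low-power NAND flash.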
3. The NAND-flash-based low power neural network accelerator memory architecture of claim 2, wherein: it is applied to a neural network model Transformer encoder; the method for the low-power-consumption neural network accelerator storage architecture to calculate the neural network comprises the following steps:
S1. The off-chip NAND flash memory storage unit loads the weight and offset data of the neural network model from the cloud or server: the position encoding weight matrix, query vector weight matrix W_Q, key vector weight matrix W_K, value vector weight matrix W_V, query vector offset vector B_Q, key vector offset vector B_K, value vector offset vector B_V, feed-forward layer first-layer weight matrix W_F1, feed-forward layer second-layer weight matrix W_F2, feed-forward layer first-layer offset vector B_F1, feed-forward layer second-layer offset vector B_F2, multi-head attention module output weight matrix W_O, multi-head attention output offset vector B_O, and the layer normalization gain and bias;
S2. The position encoding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position encoding weight matrix out of the weight cache, and computes the position-encoded input matrix X;
The query vector weight matrix W_Q, key vector weight matrix W_K, and value vector weight matrix W_V stored in the off-chip NAND flash memory storage unit, together with the query vector offset vector B_Q, key vector offset vector B_K, and value vector offset vector B_V, are loaded into the weight cache; the weight and offset data are taken out of the weight cache and, in the neural network computing circuit, linear-layer calculations are performed with the position-encoded input matrix X to obtain matrix Q, matrix K, and matrix V, which are stored in the intermediate result cache;
The multi-head attention module output weight W_O and output offset B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix Q and matrix K stored in the intermediate result cache are loaded into the neural network computing circuit; the matrix multiply-add operation is performed in the PE array, and the result of the multiply-add operation is computed by the Softmax module to obtain the attention score matrix S;
Then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied with the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output offset B_O held in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of the individual heads of the multi-head attention module are spliced together and a multiply-add operation is performed with the multi-head attention module output weight matrix W_O, the output offset vector B_O is added to the result of the multiply-add operation to obtain matrix Z, and matrix Z is stored in the intermediate result cache;
The feed-forward layer first-layer weight matrix W_F1 and feed-forward layer first-layer offset vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the final output matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit, the addition of the input matrix X and the final output matrix Z is performed in the PE array to obtain matrix A_0, and matrix A_0 is computed by the layer normalization module to obtain matrix L_0;
The feed-forward layer first-layer weight matrix W_F1 and feed-forward layer first-layer offset vector B_F1 held in the weight cache are loaded into the neural network computing circuit; matrix L_0 and the feed-forward layer first-layer weight matrix W_F1 undergo a multiply-add operation, the feed-forward layer first-layer offset vector B_F1 is added to the result of the multiply-add operation, and matrix F_0 is obtained through the ReLU activation; matrix F_0 is stored in the intermediate result cache;
The feed-forward layer second-layer weight matrix W_F2 and feed-forward layer second-layer offset vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; matrix F_0 from the intermediate result cache, together with the feed-forward layer second-layer weight matrix W_F2 and feed-forward layer second-layer offset vector B_F2 from the weight cache, are loaded into the neural network computing circuit, the multiply-add operation of matrix F_0 and the feed-forward layer second-layer weight matrix W_F2 is performed in the PE array, and the feed-forward layer second-layer offset vector B_F2 is added to the result of the multiply-add operation to obtain matrix F_1;
Matrix A_0 from the intermediate result cache is loaded into the neural network computing circuit; matrix F_1 and matrix A_0 are added to obtain matrix A_1, matrix A_1 is computed by the layer normalization module to obtain the result L_1, and the result L_1 is stored in the output cache as the output of the neural network model Transformer encoder;
S3. Using the neural network model weight data currently stored in the off-chip NAND flash memory storage unit, jump back to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
4. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: the system also comprises a high-speed interface; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
5. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages.
6. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: the off-chip NAND flash memory storage unit is also used for storing the final operation result of the neural network computing circuit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110349392.2A CN113159309B (en) | 2021-03-31 | 2021-03-31 | NAND flash memory-based low-power-consumption neural network accelerator storage architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113159309A true CN113159309A (en) | 2021-07-23 |
CN113159309B CN113159309B (en) | 2023-03-21 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117787366A (en) * | 2024-02-28 | 2024-03-29 | 苏州元脑智能科技有限公司 | Hardware accelerator and scheduling method thereof |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948774A (en) * | 2019-01-25 | 2019-06-28 | 中山大学 | Neural network accelerator and its implementation based on network layer binding operation |
CN110490311A (en) * | 2019-07-08 | 2019-11-22 | 华南理工大学 | Convolutional neural networks accelerator and its control method based on RISC-V framework |
CN111062471A (en) * | 2019-11-23 | 2020-04-24 | 复旦大学 | Deep learning accelerator for accelerating BERT neural network operations |
CN111222626A (en) * | 2019-11-07 | 2020-06-02 | 合肥恒烁半导体有限公司 | Data segmentation operation method of neural network based on NOR Flash module |
CN111241028A (en) * | 2018-11-28 | 2020-06-05 | 北京知存科技有限公司 | Digital-analog hybrid storage and calculation integrated chip and calculation device |
US20200184335A1 (en) * | 2018-12-06 | 2020-06-11 | Western Digital Technologies, Inc. | Non-volatile memory die with deep learning neural network |
Non-Patent Citations (2)

Title |
---|
Qiu Yue, "Research and Implementation of FPGA-Based Convolutional Neural Network Acceleration Methods", China Master's Theses Full-Text Database, Information Science and Technology Series * |
Chen Jianming (ed.), "Embedded Systems and Applications", National Defense Industry Press, 28 February 2017 * |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||