CN113159309A - NAND flash memory-based low-power-consumption neural network accelerator storage architecture - Google Patents

NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Info

Publication number
CN113159309A
Authority
CN
China
Prior art keywords
matrix
cache
weight
neural network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110349392.2A
Other languages
Chinese (zh)
Other versions
CN113159309B (en)
Inventor
姜小波
邓晗珂
莫志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110349392.2A priority Critical patent/CN113159309B/en
Publication of CN113159309A publication Critical patent/CN113159309A/en
Application granted granted Critical
Publication of CN113159309B publication Critical patent/CN113159309B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, for peripheral storage systems, e.g. disk cache
    • G06F 12/0868 - Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0877 - Cache access modes
    • G06F 12/0882 - Page mode
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/20 - Employing a main memory using a specific memory technology
    • G06F 2212/202 - Non-volatile memory
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Semiconductor Memories (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a low-power-consumption neural network accelerator storage architecture based on NAND flash memory. The architecture comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache. During neural network computation, the controller reads the weight data from the off-chip NAND flash memory storage unit and loads it into the weight cache, and loads the input data into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation results of the neural network computing circuit are cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output. The architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection.

Description

NAND flash memory-based low-power-consumption neural network accelerator storage architecture
Technical Field
The invention relates to the technical field of integrated circuit design, in particular to a low-power-consumption neural network accelerator storage architecture based on a NAND flash memory.
Background
With the development of artificial intelligence and the Internet of Things, their combination is accelerating and end-side artificial intelligence is developing rapidly. At the same time, as the performance and complexity of deep learning algorithms grow, their computation needs to be mapped onto dedicated hardware architectures to accelerate the operation speed.
End-side devices also face a variety of complex application scenarios while being subject to power-consumption and cost constraints. In this context, low power consumption and low cost are basic requirements for an end-side neural network accelerator intended to run deep learning inference tasks.
Memories currently in use fall into two categories, volatile and non-volatile. Volatile memories are divided into static random access memory (SRAM) and dynamic random access memory (DRAM); among non-volatile memories, flash memory currently dominates the market. With the development of mobile devices and the Internet of Things, off-chip memory today is mainly DRAM and NAND flash. SRAM has the fastest read speed but the highest cost, so it is used only as a high-speed internal cache. The common memory architecture of existing neural network accelerators is SRAM as the on-chip cache and DRAM as the off-chip memory. DRAM has a higher read/write speed than NAND flash, but it cannot retain data without power and costs more. When an end-side device uses DRAM for off-chip storage, lost data must be re-read from the cloud, wasting a large amount of power, bandwidth and time.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide a low-power-consumption neural network accelerator storage architecture based on NAND flash memory; the architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection.
In order to achieve this purpose, the invention is realized by the following technical scheme: a low-power-consumption neural network accelerator storage architecture based on NAND flash memory, characterized in that it comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when the neural network computation is performed, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads it into the weight cache; the input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output.
Preferably, the architecture is applied to a neural network model with several module layers; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache;
when a subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache; if it is the last module layer, the calculation result of the neural network computing circuit is cached in the output cache and then output as the final result.
Preferably, the architecture is applied to a Transformer encoder neural network model; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps:
S1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position coding weight matrix, the query vector weight matrix W_Q, the key vector weight matrix W_K, the value vector weight matrix W_V, the query vector bias vector B_Q, the key vector bias vector B_K, the value vector bias vector B_V, the feedforward-layer first-layer weight matrix W_F1, the feedforward-layer second-layer weight matrix W_F2, the feedforward-layer first-layer bias vector B_F1, the feedforward-layer second-layer bias vector B_F2, the multi-head attention module output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
S2, the position coding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position coding weight matrix out of the weight cache, and computes the position-coded input matrix X;
the query vector weight matrix W_Q, key vector weight matrix W_K and value vector weight matrix W_V, together with the query vector bias vector B_Q, key vector bias vector B_K and value vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache; the weights and biases are taken out of the weight cache and, together with the position-coded input matrix X, used for the linear-layer computation in the neural network computing circuit to obtain the matrices Q, K and V, which are stored in the intermediate result cache; the calculation formulas are:
Q = W_Q X + B_Q
K = W_K X + B_K
V = W_V X + B_V
the multi-head attention module output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrices Q and K stored in the intermediate result cache are loaded into the neural network computing circuit; the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention score matrix S; the calculation formula is:
S = Softmax(Q K^T / √d_k)
where d_k is the number of columns of the matrices Q and K, corresponding to the dimensionality of the word vectors;
then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied by the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output bias B_O stored in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of all heads of the multi-head attention module are spliced and multiplied (multiply-add) by the multi-head attention output weight matrix W_O, the output bias vector B_O is added to the result to obtain the matrix Z, and the matrix Z is stored in the intermediate result cache; the calculation formulas are:
Z_i = S V
Z = W_O (Z_1 ... Z_i) + B_O
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit and added in the PE array to obtain the matrix A_0, which is processed by the layer normalization module to obtain the matrix L_0;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the weight matrix W_F1 undergo a multiply-add operation, the first-layer bias vector B_F1 is added to the result, and the ReLU activation yields the matrix F_0; the matrix F_0 is stored in the intermediate result cache; the calculation formula is:
F_0 = ReLU(W_F1 L_0 + B_F1)
the feedforward-layer second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix F_0 stored in the intermediate result cache and the second-layer weight matrix W_F2 and bias vector B_F2 stored in the weight cache are loaded into the neural network computing circuit; the PE array performs the multiply-add operation of F_0 and W_F2, and the second-layer bias vector B_F2 is added to the result to obtain the matrix F_1:
F_1 = W_F2 F_0 + B_F2
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit; the matrices F_1 and A_0 are added to obtain the matrix A_1, which is processed by the layer normalization module to obtain the result L_1; the result L_1 is stored in the output cache as the output of the Transformer encoder of the neural network model;
S3, keeping the weight data of the neural network model already stored in the off-chip NAND flash memory storage unit, return to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
Preferably, the neural network computing circuit comprises a PE (basic operation unit) array for matrix multiply-add operations, plus other computing modules; the other computing modules comprise any one or more of an addition tree, an activation function operation unit and a nonlinear function operation unit.
Preferably, a high-speed interface is also included; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
Preferably, between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages.
Preferably, the off-chip NAND flash memory storage unit is further configured to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The low-power-consumption neural network accelerator storage architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection; implementing the deep learning algorithm in hardware improves its performance and accelerates the operation.
Drawings
FIG. 1 is a block diagram of the structure of the storage architecture of the NAND flash memory based low power consumption neural network accelerator of the present invention;
FIG. 2 is a schematic block diagram of a neural network model Transformer encoder applied to the storage architecture of the NAND flash based low-power neural network accelerator according to the second embodiment.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
In this embodiment, a storage architecture of a low-power neural network accelerator based on a NAND flash memory, as shown in fig. 1, includes an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache includes a weight cache, an input cache, an intermediate result cache, and an output cache.
The off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server.
The controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of data.
The neural network computing circuit is used for carrying out data computation; the neural network computing circuit comprises a PE (basic operation unit) array and other computing modules for matrix multiply-add operation; the other calculation modules include any one or more than two of an addition tree, an activation function operation unit and a nonlinear function operation unit.
The weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output buffer is used for buffering the output data of the neural network computing circuit.
The system also comprises a high-speed interface; the high-speed interface is a medium for data exchange between the off-chip NAND flash memory storage unit and the internal global cache, and is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
The off-chip NAND flash memory storage unit may also be used to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
During neural network computation, the controller reads the weight data of the off-chip NAND flash memory storage unit by controlling the high-speed interface and loads it into the weight cache; between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages. The input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output.
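As an illustration only (the patented design is a hardware circuit, not software), the following Python sketch mirrors the data flow just described; the class and method names (NandFlash, read_page, Accelerator and so on) are hypothetical and the page handling is simplified.

class NandFlash:
    """Off-chip NAND unit holding the weight pages downloaded from the cloud."""
    def __init__(self, pages):
        self.pages = pages                    # list of page-sized weight blocks

    def read_page(self, idx):
        return self.pages[idx]                # weights are read in units of pages

class Accelerator:
    def __init__(self, nand):
        self.nand = nand
        self.weight_cache = []                # partitions of the internal global cache
        self.input_cache = None
        self.intermediate_cache = None
        self.output_cache = None

    def load_weights(self, page_indices):
        # Controller: NAND -> high-speed interface -> weight cache, page by page.
        self.weight_cache = [self.nand.read_page(i) for i in page_indices]

    def run(self, input_data, kernel):
        # The compute circuit reads the weight and input caches and fills the result caches.
        self.input_cache = input_data
        self.output_cache = kernel(self.input_cache, self.weight_cache)
        return self.output_cache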
The following description takes as an example the application of the low-power-consumption neural network accelerator storage architecture to a neural network model with several module layers. The method for neural network computation with the storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache;
when a subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache; if it is the last module layer, the calculation result of the neural network computing circuit is cached in the output cache and then output as the final result.
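The layer-by-layer schedule above can be summarized in a short sketch; load_layer_weights() and pe_array_compute() are hypothetical stand-ins for the controller's NAND-to-weight-cache transfer and the PE-array computation, not functions defined by the invention.

def run_model(layers, input_data, load_layer_weights, pe_array_compute):
    intermediate = None                                    # intermediate result cache
    for idx, layer in enumerate(layers):
        weights = load_layer_weights(layer)                # NAND -> weight cache
        source = input_data if idx == 0 else intermediate  # input cache vs intermediate cache
        result = pe_array_compute(layer, weights, source)
        if idx < len(layers) - 1:
            intermediate = result                          # stays on chip for the next layer
        else:
            return result                                  # output cache -> final output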
The low-power-consumption neural network accelerator storage architecture meets the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection; implementing the deep learning algorithm in hardware improves its performance and accelerates the operation.
The off-chip storage of the invention uses NAND flash memory, a non-volatile storage device whose basic storage cell is a floating-gate MOS transistor. Data is written and read by using an external electric field to drive electrons into and out of the floating gate, so the stored information is retained in the floating gate and is not lost when the power is cut off. By contrast, most neural network accelerators use DRAM as off-chip memory; DRAM stores one bit of data with a MOS transistor plus a capacitor, and because the capacitor holds the charge, the data must be refreshed periodically; once power is removed, the charge stored in the capacitor disappears and the data stored in the DRAM is lost. When an end-side device uses DRAM for off-chip storage, data lost on power failure must be read again from the cloud, wasting a large amount of power, bandwidth and time. Compared with current neural network accelerators based on DRAM off-chip storage, a neural network accelerator based on NAND flash off-chip storage therefore offers clearly better power-off protection.
The basic storage cell of the off-chip DRAM used by existing neural network accelerators is a MOS transistor plus a capacitor; data is stored by charging and discharging the capacitor, and because the charge gradually leaks over time, it must be refreshed periodically, which causes extra power consumption. NAND flash memory requires no refresh, which avoids this overhead.
Compared with existing neural network processor storage architectures based on DRAM off-chip storage, a storage architecture that uses NAND flash memory as off-chip storage has a basic storage cell consisting of only a floating-gate MOS transistor, one capacitor fewer than the DRAM cell (a capacitor plus a MOS transistor); its cost is therefore lower than that of DRAM, and this low cost is highly competitive in end-side devices. In addition, because the NAND flash cell omits the capacitor, NAND flash achieves a higher integration density for the same area, or lighter weight and a smaller volume for the same density; a neural network accelerator based on NAND flash off-chip storage is thus lighter and smaller and better suited to end-side devices.
Example two
In this embodiment, the low-power-consumption neural network accelerator storage architecture is applied to a Transformer encoder neural network model as an example. The principle of the Transformer encoder is shown in fig. 2. The method for neural network computation with the storage architecture comprises the following steps:
S1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position coding weight matrix, the query vector weight matrix W_Q, the key vector weight matrix W_K, the value vector weight matrix W_V, the query vector bias vector B_Q, the key vector bias vector B_K, the value vector bias vector B_V, the feedforward-layer first-layer weight matrix W_F1, the feedforward-layer second-layer weight matrix W_F2, the feedforward-layer first-layer bias vector B_F1, the feedforward-layer second-layer bias vector B_F2, the multi-head attention module output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
S2, the position coding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position coding weight matrix out of the weight cache, and computes the position-coded input matrix X;
the query vector weight matrix W_Q, key vector weight matrix W_K and value vector weight matrix W_V, together with the query vector bias vector B_Q, key vector bias vector B_K and value vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache; the weights and biases are taken out of the weight cache and, together with the position-coded input matrix X, used for the linear-layer computation in the neural network computing circuit to obtain the matrices Q, K and V, which are stored in the intermediate result cache; the calculation formulas are:
Q = W_Q X + B_Q
K = W_K X + B_K
V = W_V X + B_V
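For readers who prefer software notation, a small numpy sketch of the position-coding and linear-layer step is shown here. It uses a row-per-token layout (X @ W_Q rather than the W_Q X form written above, i.e. the transposed but equivalent computation), assumes the position coding weights are simply added to X, and all function names are illustrative.

import numpy as np

def position_encode(X, PE):
    # X, PE: (seq_len, d_model); the position coding weights are assumed to be additive.
    return X + PE

def qkv_projection(X, W_Q, W_K, W_V, B_Q, B_K, B_V):
    Q = X @ W_Q + B_Q      # query matrix, written to the intermediate result cache
    K = X @ W_K + B_K      # key matrix
    V = X @ W_V + B_V      # value matrix
    return Q, K, V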
The multi-head attention module output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache, and the matrices Q and K stored in the intermediate result cache are loaded into the neural network computing circuit. First, the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention score matrix S; the calculation formula is:
S = Softmax(Q K^T / √d_k)
where d_k is the number of columns of the matrices Q and K, corresponding to the dimensionality of the word vectors;
then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied by the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output bias B_O stored in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of all heads of the multi-head attention module are spliced and multiplied (multiply-add) by the multi-head attention output weight matrix W_O, the output bias vector B_O is added to the result to obtain the matrix Z, and the matrix Z is stored in the intermediate result cache; the calculation formulas are:
Z_i = S V
Z = W_O (Z_1 ... Z_i) + B_O
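A numpy sketch of this attention step for one head and of the multi-head recombination, again in the row-per-token layout; softmax() merely stands in for the hardware Softmax module, and the head concatenation order is an assumption.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    d_k = Q.shape[-1]                        # number of columns of Q and K
    S = softmax(Q @ K.T / np.sqrt(d_k))      # attention score matrix S
    return S @ V                             # per-head output Z_i

def multi_head_output(head_outputs, W_O, B_O):
    Z_cat = np.concatenate(head_outputs, axis=-1)  # splice the Z_i matrices
    return Z_cat @ W_O + B_O                       # Z = concat(Z_i) projected by W_O plus B_O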
The feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit and added in the PE array to obtain the matrix A_0, which is processed by the layer normalization module to obtain the matrix L_0;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the weight matrix W_F1 undergo a multiply-add operation, the first-layer bias vector B_F1 is added to the result, and the ReLU activation yields the matrix F_0; the matrix F_0 is stored in the intermediate result cache; the calculation formula is:
F_0 = ReLU(W_F1 L_0 + B_F1)
the feedforward-layer second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix F_0 stored in the intermediate result cache and the second-layer weight matrix W_F2 and bias vector B_F2 stored in the weight cache are loaded into the neural network computing circuit; the PE array performs the multiply-add operation of F_0 and W_F2, and the second-layer bias vector B_F2 is added to the result to obtain the matrix F_1:
F_1 = W_F2 F_0 + B_F2
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit; the matrices F_1 and A_0 are added to obtain the matrix A_1, which is processed by the layer normalization module to obtain the result L_1; the result L_1 is stored in the output cache as the output of the Transformer encoder of the neural network model;
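A numpy sketch of these two add-and-norm stages and the two feed-forward layers (A_0 = X + Z, L_0 = LayerNorm(A_0), F_0 = ReLU(W_F1 L_0 + B_F1), F_1 = W_F2 F_0 + B_F2, A_1 = F_1 + A_0, L_1 = LayerNorm(A_1)), once more in the row-per-token layout; the per-stage gain and bias arguments are assumed to be the layer-normalization parameters loaded in step S1.

import numpy as np

def layer_norm(A, gain, bias, eps=1e-6):
    mean = A.mean(axis=-1, keepdims=True)
    var = A.var(axis=-1, keepdims=True)
    return gain * (A - mean) / np.sqrt(var + eps) + bias

def feed_forward_block(X, Z, W_F1, B_F1, W_F2, B_F2, gain1, bias1, gain2, bias2):
    A0 = X + Z                                  # residual add in the PE array
    L0 = layer_norm(A0, gain1, bias1)           # first layer normalization
    F0 = np.maximum(0.0, L0 @ W_F1 + B_F1)      # ReLU(W_F1 L0 + B_F1)
    F1 = F0 @ W_F2 + B_F2                       # second feed-forward layer
    A1 = F1 + A0                                # second residual add
    return layer_norm(A1, gain2, bias2)         # encoder output L1 -> output cache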
S3, keeping the weight data of the neural network model already stored in the off-chip NAND flash memory storage unit, return to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
The Softmax module and the layer normalization module are both nonlinear function operation units of the neural network computing circuit.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A low-power-consumption neural network accelerator storage architecture based on NAND flash memory, characterized in that: it comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when the neural network computation is performed, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads it into the weight cache; the input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache, and the final calculation result is cached in the output cache and then output.
2. The NAND-flash-based low-power neural network accelerator memory architecture of claim 1, wherein: the architecture is applied to a neural network model with a plurality of module layers; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises: loading the weight data of the neural network model from the cloud or a server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache; the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit, the operation is performed, and the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache;
when a subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate calculation result of the neural network computing circuit is cached in the intermediate result cache; if it is the last module layer, the calculation result of the neural network computing circuit is cached in the output cache and then output as the final result.
3. The NAND-flash-based low-power neural network accelerator memory architecture of claim 2, wherein: the architecture is applied to a Transformer encoder neural network model; the method for neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps:
S1, the off-chip NAND flash memory storage unit loads the weight and bias data of the neural network model from the cloud or a server: the position coding weight matrix, the query vector weight matrix W_Q, the key vector weight matrix W_K, the value vector weight matrix W_V, the query vector bias vector B_Q, the key vector bias vector B_K, the value vector bias vector B_V, the feedforward-layer first-layer weight matrix W_F1, the feedforward-layer second-layer weight matrix W_F2, the feedforward-layer first-layer bias vector B_F1, the feedforward-layer second-layer bias vector B_F2, the multi-head attention module output weight matrix W_O, the multi-head attention output bias vector B_O, and the layer-normalization gain and bias;
S2, the position coding weight matrix stored in the off-chip NAND flash memory storage unit is loaded into the weight cache; the input matrix X corresponding to the current word vectors is loaded into the input cache; the neural network computing circuit takes the input matrix X out of the input cache and the position coding weight matrix out of the weight cache, and computes the position-coded input matrix X;
the query vector weight matrix W_Q, key vector weight matrix W_K and value vector weight matrix W_V, together with the query vector bias vector B_Q, key vector bias vector B_K and value vector bias vector B_V stored in the off-chip NAND flash memory storage unit, are loaded into the weight cache; the weights and biases are taken out of the weight cache and, together with the position-coded input matrix X, used for the linear-layer computation in the neural network computing circuit to obtain the matrices Q, K and V, which are stored in the intermediate result cache;
the multi-head attention module output weight W_O and output bias B_O stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrices Q and K stored in the intermediate result cache are loaded into the neural network computing circuit; the matrices are multiplied and accumulated in the PE array, and the result is passed through the Softmax module to obtain the attention score matrix S;
then the matrix V stored in the intermediate result cache is loaded into the neural network computing circuit and multiplied by the attention score matrix S to obtain the output matrix Z_i; the multi-head attention module output weight W_O and output bias B_O stored in the weight cache are loaded into the neural network computing circuit; the output matrices Z_i of all heads of the multi-head attention module are spliced and multiplied (multiply-add) by the multi-head attention output weight matrix W_O, the output bias vector B_O is added to the result to obtain the matrix Z, and the matrix Z is stored in the intermediate result cache;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the input matrix X and the matrix Z stored in the intermediate result cache are loaded into the neural network computing circuit and added in the PE array to obtain the matrix A_0, which is processed by the layer normalization module to obtain the matrix L_0;
the feedforward-layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the weight matrix W_F1 undergo a multiply-add operation, the first-layer bias vector B_F1 is added to the result, and the ReLU activation yields the matrix F_0; the matrix F_0 is stored in the intermediate result cache;
the feedforward-layer second-layer weight matrix W_F2 and second-layer bias vector B_F2 stored in the off-chip NAND flash memory storage unit are loaded into the weight cache; the matrix F_0 stored in the intermediate result cache and the second-layer weight matrix W_F2 and bias vector B_F2 stored in the weight cache are loaded into the neural network computing circuit; the PE array performs the multiply-add operation of F_0 and W_F2, and the second-layer bias vector B_F2 is added to the result to obtain the matrix F_1;
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit; the matrices F_1 and A_0 are added to obtain the matrix A_1, which is processed by the layer normalization module to obtain the result L_1; the result L_1 is stored in the output cache as the output of the Transformer encoder of the neural network model;
S3, keeping the weight data of the neural network model already stored in the off-chip NAND flash memory storage unit, return to step S2 to encode the word vectors of the next sentence, until the word vectors of all input sentences have been encoded.
4. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: the system also comprises a high-speed interface; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
5. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages.
6. The NAND flash based low power neural network accelerator memory architecture of any one of claims 1 to 3, wherein: the off-chip NAND flash memory storage unit is also used for storing the final operation result of the neural network computing circuit.
CN202110349392.2A 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture Active CN113159309B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110349392.2A CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110349392.2A CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Publications (2)

Publication Number Publication Date
CN113159309A true CN113159309A (en) 2021-07-23
CN113159309B CN113159309B (en) 2023-03-21

Family

ID=76885744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110349392.2A Active CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Country Status (1)

Country Link
CN (1) CN113159309B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787366A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN111062471A (en) * 2019-11-23 2020-04-24 复旦大学 Deep learning accelerator for accelerating BERT neural network operations
CN111222626A (en) * 2019-11-07 2020-06-02 合肥恒烁半导体有限公司 Data segmentation operation method of neural network based on NOR Flash module
CN111241028A (en) * 2018-11-28 2020-06-05 北京知存科技有限公司 Digital-analog hybrid storage and calculation integrated chip and calculation device
US20200184335A1 (en) * 2018-12-06 2020-06-11 Western Digital Technologies, Inc. Non-volatile memory die with deep learning neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241028A (en) * 2018-11-28 2020-06-05 北京知存科技有限公司 Digital-analog hybrid storage and calculation integrated chip and calculation device
US20200184335A1 (en) * 2018-12-06 2020-06-11 Western Digital Technologies, Inc. Non-volatile memory die with deep learning neural network
CN109948774A (en) * 2019-01-25 2019-06-28 中山大学 Neural network accelerator and its implementation based on network layer binding operation
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN111222626A (en) * 2019-11-07 2020-06-02 合肥恒烁半导体有限公司 Data segmentation operation method of neural network based on NOR Flash module
CN111062471A (en) * 2019-11-23 2020-04-24 复旦大学 Deep learning accelerator for accelerating BERT neural network operations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
仇越: "Research and Implementation of FPGA-based Convolutional Neural Network Acceleration Methods", China Master's Theses Full-text Database, Information Science and Technology Series *
陈建明 (ed.): "Embedded Systems and Applications", 28 February 2017, National Defense Industry Press *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787366A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof
CN117787366B (en) * 2024-02-28 2024-05-10 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof

Also Published As

Publication number Publication date
CN113159309B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
Nguyen et al. An approximate memory architecture for a reduction of refresh power consumption in deep learning applications
US11625584B2 (en) Reconfigurable memory compression techniques for deep neural networks
US20190188237A1 (en) Method and electronic device for convolution calculation in neutral network
CN112151091B (en) 8T SRAM unit and memory computing device
CN109902822B (en) Memory computing system and method based on Sgimenk track storage
CN109934336B (en) Neural network dynamic acceleration platform design method based on optimal structure search and neural network dynamic acceleration platform
CN108446764B (en) Novel neuromorphic chip architecture
CN113159309B (en) NAND flash memory-based low-power-consumption neural network accelerator storage architecture
CN114937470B (en) Fixed point full-precision memory computing circuit based on multi-bit SRAM unit
CN110322008A (en) Residual convolution neural network-based quantization processing method and device
CN110176264A (en) A kind of high-low-position consolidation circuit structure calculated interior based on memory
US20210216846A1 (en) Transpose memory unit for multi-bit convolutional neural network based computing-in-memory applications, transpose memory array structure for multi-bit convolutional neural network based computing-in-memory applications and computing method thereof
CN113296734A (en) Multi-position storage device
CN112233712B (en) 6T SRAM (static random Access memory) storage device, storage system and storage method
Jeong et al. A 28nm 1.644 tflops/w floating-point computation sram macro with variable precision for deep neural network inference and training
Sehgal et al. Trends in analog and digital intensive compute-in-SRAM designs
CN116401502B (en) Method and device for optimizing Winograd convolution based on NUMA system characteristics
CN113539318A (en) Memory computing circuit chip based on magnetic cache and computing device
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
US11615834B2 (en) Semiconductor storage device and information processor
CN114895869A (en) Multi-bit memory computing device with symbols
Lee et al. Robustness of differentiable neural computer using limited retention vector-based memory deallocation in language model
US20200257959A1 (en) Memory device having an address generator using a neural network algorithm and a memory system including the same
Kumar et al. Design and power analysis of 16× 16 SRAM Array Employing 7T I-LSVL
Yook et al. Refresh Methods and Accuracy Evaluation for 2T0C DRAM based Processing-in-memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant