WO2021057085A1

WO2021057085A1 - Hybrid precision storage-based depth neural network accelerator

Info

Publication number: WO2021057085A1
Application number: PCT/CN2020/094551
Authority: WO
Inventors: 刘波; 龚宇; 蔡浩; 葛伟; 杨军; 时龙兴
Original assignee: 东南大学
Priority date: 2019-09-27
Filing date: 2020-06-05
Publication date: 2021-04-01
Also published as: CN110766155A

Abstract

Disclosed in the present invention is a hybrid precision storage-based deep neural network accelerator, belonging to the technical field of calculation, estimation and counting. The accelerator comprises an on-chip cache module, a control module, a multiplication and addition batch calculation module having a controllable bit width, a non-linear calculation module, a register array and a dual lookup table-based Huffman decoding module. Significant bit and sign bit parameters of weights are stored in the same memory, so that data storage and analysis of hybrid prediction are realized, and multiplication and addition operations of data and weights of hybrid precision are realized. By means of the hybrid precision-based data storage and analysis and the dual lookup table based Huffman decoding, the compression and storage of data and weights at different precisions are realized, reducing the data flow, and realizing deep neural network-based data scheduling of low power consumption.

Description

一种基于混合精度存储的深度神经网络加速器A deep neural network accelerator based on mixed precision storage

技术领域Technical field

本发明公开了一种基于混合精度存储的深度神经网络加速器，涉及人工智能神经网络的数模混合集成电路设计，属于计算、推算、计数的技术领域。The invention discloses a deep neural network accelerator based on mixed precision storage, relates to the design of a digital-analog hybrid integrated circuit of an artificial intelligence neural network, and belongs to the technical field of calculation, calculation and counting.

背景技术Background technique

深度神经网络以其优越的性能被广泛地研究和应用。当前主流的深度神经网络拥有数以亿计的连接，访存密集型和计算密集型的特点使得它们很难被映射到资源和功耗都极为有限的嵌入式***中。另外，当前深度神经网络朝着更加精确和功能更加强大方向发展的趋势使得深度神经网络的规模、需要的存储空间变得越来越大，计算开销以及复杂度也变得越来越大。Deep neural network has been widely studied and applied with its superior performance. The current mainstream deep neural networks have hundreds of millions of connections, and their memory-intensive and computationally intensive characteristics make it difficult to map them to embedded systems with extremely limited resources and power consumption. In addition, the current development trend of deep neural networks toward more precise and more powerful functions has made the scale and required storage space of deep neural networks become larger and larger, as well as computing overhead and complexity.

用于加速深度神经网络运算的传统定制硬件设计是从动态随机存取存储器中读取权重，其资源消耗为运算器执行操作的两个量级，这时候应用的主要功耗将由访存决定。因此，深度神经网络加速器的设计难点归于两点：1)深度神经网络的规模越来越大，访存问题成为神经网络运算中的最大瓶颈，特别是当权值矩阵的规模大于缓存容量的时候，神经网络的优势不能充分发挥；2)深度神经网络的结构决定了它的基础运算为大量的乘累加运算，而乘法一直是硬件资源消耗多、延时长和功耗大的算术运算，乘法运算的速度和功耗决定了深度神经网络加速器的性能。The traditional custom hardware design used to accelerate deep neural network operations is to read the weights from the dynamic random access memory, and its resource consumption is two orders of magnitude of the operation performed by the arithmetic unit. At this time, the main power consumption of the application will be determined by the memory access. Therefore, the design difficulties of deep neural network accelerators can be attributed to two points: 1) The scale of deep neural networks is getting larger and larger, and the memory access problem has become the biggest bottleneck in neural network operations, especially when the scale of the weight matrix is larger than the cache capacity. , The advantages of neural networks cannot be fully utilized; 2) The structure of deep neural networks determines that its basic operation is a large number of multiplication and accumulation operations, and multiplication has always been an arithmetic operation that consumes a lot of hardware resources, has long delays and high power consumption. The calculation speed and power consumption determine the performance of the deep neural network accelerator.

传统的深度神经网络加速器主要通过例化大量的乘加计算单元以及存储单元以一步提高***的可靠性和稳定性，大量的芯片面积以及大量的运转功耗限制了神经网络加速器在可便携交互设备中的部署。为了解决这些问题，目前最热门的技术就是对权重数据进行二值化处理，这种处理方式可极大地简化网络运算数据调度以及访存模式，但其网络精度损失较大，***稳定性有待考证。本申请旨在通过基于权重数据混合精度的分级压缩存储技术在保证原始网络识别精度的基础上实现一种低功耗、低延时、高效率的数据调度方式以及网络批处理操作。Traditional deep neural network accelerators improve the reliability and stability of the system by instantiating a large number of multiplication and addition calculation units and storage units. The large chip area and large operating power consumption limit the use of neural network accelerators in portable interactive devices. In the deployment. In order to solve these problems, the most popular technology at present is to binarize the weighted data. This processing method can greatly simplify the network operation data scheduling and memory access mode, but the network accuracy loss is large, and the system stability needs to be verified. . This application aims to implement a low-power, low-latency, and high-efficiency data scheduling method and network batch processing operation on the basis of ensuring the original network recognition accuracy through the hierarchical compression storage technology based on the weighted data mixing accuracy.

发明内容Summary of the invention

为了解决传统神经网络加速器高功耗、高计算量以及高延时的弊端，本发明提供了一种基于混合精度存储的深度神经网络加速器，采用线下软件的权值分级和线上硬件混合精度存储的工作方式，通过基于双查找表的霍夫曼编码实现混合精度数据的分级存储来解决深度神经网络的访存问题，通过引入位宽可控的批量乘加操作实现对应分级权重的数据计算匹配并节省网络因大量乘法计算产生的功耗，实现了低功耗、低延时、高效率的数据调度和网络批处理操作，解决了二值化权重的神经网络虽能简化网络运算数据调度以及访存模式但网络精度损失较大的技术问题。In order to solve the shortcomings of traditional neural network accelerators of high power consumption, high calculation volume and high latency, the present invention provides a deep neural network accelerator based on mixed precision storage, which adopts offline software weight classification and online hardware mixed precision storage. The working method is to realize the hierarchical storage of mixed-precision data through the Huffman coding based on the double lookup table to solve the memory access problem of the deep neural network, and realize the data calculation and matching of the corresponding hierarchical weight by introducing a batch multiplication and addition operation with a controllable bit width. Save the power consumption of the network due to a large number of multiplication calculations, realize low power consumption, low latency, and high efficiency data scheduling and network batch processing operations. The neural network that solves the binary weight can simplify network computing data scheduling and access There is a technical problem in which the model exists but the network accuracy loses a lot.

本发明为实现上述发明目的采用如下技术方案：The present invention adopts the following technical solutions to achieve the above-mentioned purpose of the invention:

一种基于混合精度存储的深度神经网络加速器，首先通过线下软件处理(包括基于Caffe平台、Tensorflow平台实现神经网络混合精度训练，如果达到预定压缩比例则对网络参数进行混合存储并对权重参数进行霍夫曼编码得到首1位置索引)对权值进行了有效的压缩，实现了精度可调，从而降低了运算的复杂度。数据从直接内存存取中读入，进入输入数据缓存模块，在缓存控制模块的调度下进入位宽可控的批乘加计算模块。权重及编码后的位置索引参数先进入索引缓存模块。在缓存控制模块的控制下，权重直接存入位宽可控的批乘加计算模块中的基于混合精度的权重存储器，编码后的位置索引参数通过基于双查找表的霍夫曼解码器模块解码后输出给位宽可控的批乘加计算模块中的位宽控制单元。当数据进入位宽可控的批乘加计算模块时，权重通过基于混合精度的数据存储解析模块完成权重解析，乘加单元根据位宽控制单元的控制信号选择数据权重位宽，然后完成输入数据与权重的对应乘加计算操作，结果直接存储到寄存器阵列中。存储到寄存器阵列中的中间值经过非线性计算模块完成计算后，在控制模块的调度下选择存储到输出数据缓存模块或者返回寄存器阵列，再次进行非线性操作。A deep neural network accelerator based on mixed-precision storage, firstly through offline software processing (including Caffe platform, Tensorflow platform based on neural network mixed-precision training, if it reaches a predetermined compression ratio, mixed storage of network parameters and weight parameters Huffman coding obtains the first position index) effectively compresses the weight value, realizes the adjustable precision, and reduces the complexity of the operation. The data is read in from the direct memory access, enters the input data buffer module, and enters the batch multiply-add calculation module with controllable bit width under the scheduling of the buffer control module. The weight and the encoded position index parameters first enter the index cache module. Under the control of the cache control module, the weights are directly stored in the mixed-precision weight memory in the bit-width controllable batch multiply-add calculation module, and the encoded position index parameters are decoded by the Huffman decoder module based on the double look-up table Then output to the bit width control unit in the batch multiply-add calculation module with controllable bit width. When the data enters the batch multiplication and addition calculation module with controllable bit width, the weight is analyzed by the data storage analysis module based on mixed precision. The multiplication and addition unit selects the data weight bit width according to the control signal of the bit width control unit, and then completes the input data The corresponding multiplication and addition calculation operation with the weight, the result is directly stored in the register array. After the intermediate value stored in the register array is calculated by the nonlinear calculation module, it is selected to be stored in the output data buffer module or returned to the register array under the scheduling of the control module, and the nonlinear operation is performed again.

本发明采用上述技术方案，具有以下有益效果：The present invention adopts the above technical scheme and has the following beneficial effects:

(1)本申请涉及的基于混合精度存储的深度神经网络加速器，对线下训练好的权重位置索引参数霍夫曼编码进行基于双查找表的霍夫曼解码，根据权重位置索引参数表征的权重访问次数选择乘法阵列的位宽控制信号实现了位宽可控的批乘加计算模块，对输入的数据和权重数据先进行位宽调整再进行混合精度数据的乘加计算，实现了加速器的精度可调，降低了运算的复杂度，在不降低神经网络精度的前提下，大大地减少了网络的计算量。(1) The deep neural network accelerator based on mixed-precision storage involved in this application performs Huffman decoding based on a double look-up table on the offline trained weight position index parameter Huffman coding, and the weight represented by the weight position index parameter The bit width control signal of the multiplication array for the number of accesses realizes the bit width controllable batch multiply and add calculation module. The bit width is adjusted for the input data and weight data first, and then the mixed-precision data is multiplied and added to achieve the accuracy of the accelerator. Adjustable, reducing the complexity of calculations, and greatly reducing the amount of network calculations without reducing the accuracy of the neural network.

(2)通过对权重数据解析将不同精度权值的有效比特位和符号位及位置索引参数存储在同一个存储器中，实现了混合精度的数据存储和解析，结合双查找表的霍夫曼解码将组合电路划分为两个组来降低功耗，实现了对不同精度下数据和权重的压缩和存储，减少了数据流，实现了深度神经网络的低功耗数据调度和高速度的乘加运算。(2) By analyzing the weight data, the effective bits, sign bits and position index parameters of different precision weights are stored in the same memory, which realizes the storage and analysis of mixed-precision data, and combines Huffman decoding with double look-up tables. Divide the combinational circuit into two groups to reduce power consumption, realize the compression and storage of data and weights at different precisions, reduce the data flow, and realize the low-power data scheduling of the deep neural network and the high-speed multiplication and addition operation .

附图说明Description of the drawings

图1为本发明的整体架构示意图。Figure 1 is a schematic diagram of the overall architecture of the present invention.

图2为本发明位宽可控的批乘加计算模块。Figure 2 is a batch multiply-add calculation module with a controllable bit width of the present invention.

图3为本发明基于混合精度的数据存储解析模块。Fig. 3 is a data storage analysis module based on mixed precision of the present invention.

图4为本发明基于双查找表的霍夫曼解码器模块。Fig. 4 is a Huffman decoder module based on a dual look-up table of the present invention.

具体实施方式detailed description

下面结合具体实施例进一步阐明本发明，应理解这些实施例仅用于说明本发明而不用于限制本发明的保护范围，在阅读了本发明之后，本领域技术人员所做出的各种等价形式的修改均落于本申请所附权利要求限定的范围。The present invention will be further clarified below in conjunction with specific examples. It should be understood that these examples are only used to illustrate the present invention and not to limit the scope of protection of the present invention. After reading the present invention, various equivalents made by those skilled in the art Modifications of the form all fall within the scope defined by the appended claims of this application.

本发明基于混合精度存储的深度神经网络加速器的整体架构如图1所示，工作时加速器通过接收线下训练和压缩的权值，在控制模块的控制和调度下完成不同精度权值的解码、全连接层以及激活层的运算。基于混合精度存储的深度神经网络加速器包括4个片上缓存模块、1个控制模块、16个混合精度近似乘加处理单元、1个非线性计算模块、1个寄存器阵列以及1个基于双查找表的参数霍夫曼解码模块。4个片上缓存模块包括：输入数据缓存模块、输出数据缓存模块、缓存控制模块以及索引缓存模块。The overall architecture of the deep neural network accelerator based on mixed-precision storage of the present invention is shown in Fig. 1. During operation, the accelerator receives offline training and compression weights, and completes the decoding, decoding and scheduling of weights of different precisions under the control and scheduling of the control module. Operation of the fully connected layer and the active layer. The deep neural network accelerator based on mixed-precision storage includes 4 on-chip cache modules, 1 control module, 16 mixed-precision approximate multiplication and addition processing units, 1 nonlinear calculation module, 1 register array, and 1 dual look-up table-based Parameter Huffman decoding module. The 4 on-chip cache modules include: input data cache module, output data cache module, cache control module, and index cache module.

如图2所示，本发明的位宽可控的位宽批乘加计算模块包括：一个内部静态随机存取存储器、一个数据解析模块，一个位宽控制单元、一个乘加单元和一个先入先出缓存单元，该模块与基于双查找表的参数霍夫曼解码模块协同配合，对于不同查找表对应的解码分级权重数据进行不同位宽的网络批量数据乘加处理，具体地，对于访问频繁的查找表1所解码的权重数据对应高位宽乘加操作，对于访问次数少的查找表2所解码的权重数据对应低位宽乘加操作。这种神经网络计算方式可减少网络中大量冗余的乘法操作。As shown in FIG. 2, the bit-width batch multiply-add calculation module with controllable bit width of the present invention includes: an internal static random access memory, a data analysis module, a bit-width control unit, a multiply-add unit and a first-in first Out of the buffer unit, this module cooperates with the parameter Huffman decoding module based on dual lookup tables to perform network batch data multiplication and addition processing of different bit widths for the decoding classification weight data corresponding to different lookup tables. Specifically, for frequently accessed The weight data decoded in the look-up table 1 corresponds to a high-width multiply-add operation, and the weight data decoded in the look-up table 2 with a few access times corresponds to a low-bit width multiply-add operation. This neural network calculation method can reduce a large number of redundant multiplication operations in the network.

本发明的基于混合精度的数据存储解析模块如图3所示，在线下将权重分为4级，给较大的权值分配较多的比特数，给较小的权值分配较少的比特数。对于分级后的权重而言，为了尽可能有效地表征其大小，对于每一个权值，有三项数据需要进行存储：1)权值中舍去末位保留下来的几个比特的数；2)权值的符号位；3)权值的位置索引参数。在神经网络参数存储过程中，将权值的有效比特位，符号位参数和权值位置索引参数存储在同一个存储器中。存储权值的静态随机存取存储器数据位宽为16比特，而权值由于采用了混合精度的方法，不同大小的权值位宽不一样，因此，存储权值时采用了混合存储的方案，即静态随机存取存储器中每一行16比特数据包含多个权值。存取权重时，通过数据解析模块进行权重的存储和解析。The data storage and analysis module based on mixed precision of the present invention is shown in Fig. 3. The weight is divided into 4 levels offline, and larger weights are allocated more bits, and smaller weights are allocated fewer bits. number. For the weights after grading, in order to characterize their size as effectively as possible, for each weight, there are three items of data that need to be stored: 1) the number of bits left behind by truncating the last bit of the weight; 2) The sign bit of the weight; 3) The position index parameter of the weight. In the process of neural network parameter storage, the effective bit position of the weight, the sign bit parameter and the weight position index parameter are stored in the same memory. The data bit width of the static random access memory for storing weights is 16 bits, and the weights adopt a mixed precision method, and the weights of different sizes have different widths. Therefore, a mixed storage scheme is used when storing the weights. That is, each row of 16-bit data in the SRAM contains multiple weights. When accessing the weights, the data analysis module is used to store and analyze the weights.

如图4所示，本发明的基于双查找表的霍夫曼解码器模块包括：两个查找表(查找表1，查找表2)，一个桶形移位器、一个选择器、一个通过多路复用器(MUX)实现的选择单元、一个组成以及相对应的数据存储器和寄存器。查找表1较小，包含最常用的权重位置索引编码，而查找表2包含所有剩余的权重位置索引编码。常用的权重位置索引编码了调用频率高且精度要求高的高比特权值，剩余的权重位置索引编码了调用频率低且精度要求较低的低比特权值。选择单元是一个预解码块，用于确定在解码码字时使用哪个查找表，并控制多路复用器(MUX)在每个解码周期选择正确的输出。工作时，输入有效数据锁存在边沿触发器中；其中，触发器的使能信号为输出码长数据经过累加器产生的进位信号，触发器采用乒乓结构实现流水输出，并且输出数据作为桶型移位器的输入；桶形移位寄存器的移位信号为输出码长数据经过累加器产生的累加信号，例如，第一次的输出码长为3，则桶形移位器右移3位输出，第二次的输出码长为4，则桶形移位器右移7位输出；桶形移位器的输出结果会输入到选择单元中，对于一个13比特的输入数据，如果高7位数据不全为1，则查找表1的使能信号有效，且查找表1的输入为选择单元的输出(输入数据的高7位)，如果前7位数据全为1，则查找表2的使能信号有效，并且查找表2的输入为选择单元的输出(输入数据的低6位)；选择单元会根据输入数据的高7位数据按位与的结果选择对应的查找表并控制选择器输出查表结果，即控制选择器输出对应编码的码长和标志状态。例如，输入一组编码，经过触发器之后得到一组32位的霍夫曼编码(32’b0011_1101_1111_1110_0110_0111_1110_0110)，初始化的累加器的累加和的结果为0，故对应的桶形移位器的输出为13’b0_0110_0111_1110_0110，可以算出移位后数据结果的高7位不全为1，即按位与后的结果为0，故查找表1的使能信号有效，查找表2不工作，可以得到最后的输出结果为码长4’b0100(即十进制4)，标志状态4’b0011(即S3)。码长通过一个累加器得到累加码长的和的结果为4，进位信号为0，故移位器左移4位得到13’b1_1110_0110_0111_1110，按照刚刚说明的解码流程会得到最后的输出结果，码长4’b1000(十进制8)，标志状态4’b0111(即S7)。这时，码长结果会继续输入到累加器中，得到累加和为12，进位信号为0，故移位器左移12位得到13’b1_1101_1111_1110，输出解码结果的码长为4’b1010(十进制10)，状态标志S9。码长结果会继续输入到累加器中，得到累加和为6，进位信号为1。此时进位信号有效，则FIFO读使能有效，新的16比特数据流(16’b0110_0110_0110_0110)输入到触发器中，移位器的输入更新为32’b0110_0110_0110_0110_0011_1101_1111_1110，继续进行移位操作。As shown in FIG. 4, the Huffman decoder module based on dual lookup tables of the present invention includes: two lookup tables (lookup table 1, lookup table 2), a barrel shifter, a selector, and a pass-through table. The multiplexer (MUX) realizes the selection unit, a composition, and the corresponding data memory and registers. Lookup table 1 is small and contains the most commonly used weight position index codes, while lookup table 2 contains all the remaining weight position index codes. The commonly used weight position index encodes high-bit weights with high calling frequency and high precision requirements, and the remaining weight position indexes encode low-bit weights with low calling frequency and low precision requirements. The selection unit is a pre-decoding block, which is used to determine which lookup table is used when decoding the codeword, and to control the multiplexer (MUX) to select the correct output in each decoding cycle. When working, the input valid data is latched in the edge flip-flop; among them, the enable signal of the flip-flop is the carry signal generated by the output code length data through the accumulator. The flip-flop uses a ping-pong structure to achieve pipeline output, and the output data is used as a barrel shift The input of the positioner; the shift signal of the barrel shift register is the accumulated signal generated by the output code length data through the accumulator, for example, the first output code length is 3, then the barrel shifter shifts right by 3 bits and outputs , The second output code length is 4, then the barrel shifter shifts 7 bits to the right; the output result of the barrel shifter will be input to the selection unit, for a 13-bit input data, if the upper 7 bits If the data is not all 1, the enable signal of the look-up table 1 is valid, and the input of the look-up table 1 is the output of the selection unit (the upper 7 bits of the input data). If the first 7 bits of data are all 1, the enable signal of the look-up table 2 The energy signal is valid, and the input of the look-up table 2 is the output of the selection unit (the lower 6 bits of the input data); the selection unit will select the corresponding look-up table according to the result of the high 7-bit data of the input data and control the selector output The result of the table look-up is to control the selector to output the code length and flag status of the corresponding code. For example, input a set of codes and get a set of 32-bit Huffman codes (32'b0011_1101_1111_1110_0110_0111_1110_0110) after the flip-flop. The result of the accumulated sum of the initialized accumulator is 0, so the output of the corresponding barrel shifter is 13'b0_0110_0111_1110_0110, it can be calculated that the upper 7 bits of the shifted data result are not all 1, that is, the result of bitwise and is 0, so the enable signal of lookup table 1 is valid, lookup table 2 does not work, and the final output can be obtained The result is a code length of 4'b0100 (that is, 4 in decimal) and a status of 4'b0011 (that is, S3). The code length is obtained by an accumulator and the result of the accumulated code length is 4, and the carry signal is 0, so the shifter shifts 4 bits to the left to get 13'b1_1110_0110_0111_1110. According to the decoding process just described, the final output result will be obtained. The code length 4'b1000 (decimal 8), marking status 4'b0111 (ie S7). At this time, the code length result will continue to be input to the accumulator, and the accumulated sum is 12, and the carry signal is 0. Therefore, the shifter shifts to the left by 12 bits to get 13'b1_1101_1111_1110, and the code length of the output decoding result is 4'b1010 (decimal 10), the status flag S9. The code length result will continue to be input into the accumulator, and the accumulated sum will be 6 and the carry signal will be 1. At this time, the carry signal is valid, the FIFO read enable is valid, the new 16-bit data stream (16’b0110_0110_0110_0110) is input to the flip-flop, the input of the shifter is updated to 32’b0110_0110_0110_0110_0011_1101_1111_1110, and the shift operation continues.

基于混合精度存储的深度神经网络加速器的实现流程包括如下四个步骤。The implementation process of a deep neural network accelerator based on mixed-precision storage includes the following four steps.

步骤一：神经网络加速器首先通过线下软件处理(包括基于Caffe平台、Tensorflow平台实现神经网络混合精度训练，如果达到预定压缩比例则对网络参数进行混合存储并对权重参数进行霍夫曼编码得到位置索引参数)对权值进行了有效的压缩，实现了精度可调，从而降低了运算的复杂度。Step 1: The neural network accelerator is first processed by offline software (including the mixed-precision training of neural network based on the Caffe platform and Tensorflow platform. If the predetermined compression ratio is reached, the network parameters are mixed and stored and the weight parameters are Huffman coded to get the position The index parameter) effectively compresses the weight, realizes the adjustable precision, and reduces the complexity of the operation.

步骤二：数据从直接内存存取中读入，进入输入数据缓存模块，在控制模块的调度下进入位宽可控的批乘加计算模块。权重及编码后的位置索引参数先进入索引缓存模块。在缓存控制模块的控制下，权重直接存入位宽可控的批乘加计算模块中的基于混合精度的权重存储器，编码后的位置索引参数通过基于双查找表的霍夫曼解码器模块解码后输出给位宽可控的批乘加计算模块中的位宽控制单元。Step 2: The data is read in from the direct memory access, enters the input data buffer module, and enters the batch multiply-add calculation module with controllable bit width under the scheduling of the control module. The weight and the encoded position index parameters first enter the index cache module. Under the control of the cache control module, the weights are directly stored in the mixed-precision weight memory in the bit-width controllable batch multiply-add calculation module, and the encoded position index parameters are decoded by the Huffman decoder module based on the double look-up table Then output to the bit width control unit in the batch multiply-add calculation module with controllable bit width.

步骤三：当数据进入位宽可控的批乘加计算模块时，权重通过基于混合精度的数据存储解析模块完成权重解析，乘加单元根据位宽控制单元的控制信号选择数据权重位宽，然后完成输入数据与权重的对应乘加计算操作，结果直接存储到寄存器阵列中。Step 3: When the data enters the batch multiplication and addition calculation module with controllable bit width, the weight is analyzed by the data storage analysis module based on mixed precision. The multiplication and addition unit selects the data weight bit width according to the control signal of the bit width control unit, and then The corresponding multiplication and addition calculation operation of the input data and the weight is completed, and the result is directly stored in the register array.

步骤四：存储到寄存器阵列中的中间值经过非线性计算模块完成计算后，在控制模块的调度下选择存储到输出数据缓存模块或者返回寄存器阵列，再次进行非线性操作。直接访存存取直接从输出数据缓存模块读取深度神经网络计算的数据。Step 4: After the intermediate value stored in the register array is calculated by the nonlinear calculation module, it is selected to be stored in the output data buffer module or returned to the register array under the scheduling of the control module, and the nonlinear operation is performed again. Direct memory access directly reads the data calculated by the deep neural network from the output data buffer module.

Claims

一种基于混合精度存储的深度神经网络加速器，其特征在于，包括：A deep neural network accelerator based on mixed-precision storage, which is characterized in that it includes:

索引缓存模块，用于存储训练好的权重、权重符号位以及权重位置索引参数霍夫曼编码，Index cache module, used to store the trained weight, weight sign bit and weight position index parameter Huffman coding,

输入数据缓存模块，用于存储输入数据，Input data buffer module, used to store input data,

缓存控制模块，用于生成索引缓存模块和输入数据缓存模块的读写地址，The cache control module is used to generate the read and write addresses of the index cache module and the input data cache module,

霍夫曼解码器，对权重位置索引参数的霍夫曼编码进行双查找表的霍夫曼解码完成位置索引操作，输出权重位置索引参数至索引缓存模块，The Huffman decoder performs double look-up table Huffman decoding on the Huffman encoding of the weight position index parameter to complete the position index operation, and outputs the weight position index parameter to the index buffer module,

位宽可控的批乘加计算模块，对从索引缓存模块读取的权重按照权值大小分配存储单元的数据位宽，不同位宽的存储单元存储有各权重的有效位、符号位、位置索引参数，根据位置索引参数对从输入数据缓存模块读取的输入数据进行位宽调整，对经位宽处理后的输入数据和混合存储的权重进行乘加计算，输出乘加计算结果，The bit-width controllable batch multiply-add calculation module allocates the data bit width of the storage unit according to the weight value for the weight read from the index cache module. The storage unit of different bit width stores the effective bit, sign bit, and position of each weight. Index parameter, adjust the bit width of the input data read from the input data cache module according to the position index parameter, multiply and add the input data after bit width processing and the weight of the mixed storage, and output the multiply and add calculation result,

寄存器阵列，用于缓存乘加计算结果，Register array, used to cache the multiplication and addition calculation results,

非线性计算模块，对读取的乘加计算结果进行非线性计算，The non-linear calculation module performs non-linear calculations on the read multiplication and addition calculation results,

输出数据缓存模块，用于缓存乘加计算结果或非线性计算结果，及，Output data buffer module, used to buffer multiplication and addition calculation results or non-linear calculation results, and,

控制模块，用于生成索引缓存模块的读写指令、输入数据缓存模块的读写指令、霍夫曼解码器的工作指令、位宽可控的批乘加计算模块的位宽控制指令、非线性计算结果存储的调度指令。The control module is used to generate the read and write instructions of the index cache module, the read and write instructions of the input data cache module, the work instructions of the Huffman decoder, the bit width control instructions of the batch multiply-add calculation module with controllable bit width, and the nonlinearity The scheduling instruction for the calculation result storage.
根据权利要求1所述一种基于混合精度存储的深度神经网络加速器，其特征在于，所述霍夫曼解码器包括：The deep neural network accelerator based on mixed-precision storage according to claim 1, wherein the Huffman decoder comprises:

触发器，在累加器输出的进位信号的使能下输出读取的权重位置索引参数的霍夫曼编码至桶形移位器，A flip-flop, which outputs the Huffman code of the read weight position index parameter to the barrel shifter under the enable of the carry signal output by the accumulator,

桶形移位器，在累加器输出的累加信号的使能下，对读取的权重位置索引参数霍夫曼编码进行移位操作后输出，Barrel shifter, when the accumulating signal output by the accumulator is enabled, the read weight position index parameter Huffman code is shifted and output,

选择单元，对桶形移位器输出的权重位置索引参数霍夫曼编码的高位数据进行检测，在权重位置索引参数霍夫曼编码的高位数据不全为1时输出第一查找表的使能信号及多路复用器输出第一查找表查表结果的选择信号，在权重位置索引参数霍夫曼编码的高位数据全部为1时输出第二查找表的使能信号及多路复用器输出第二查找表查表结果的选择信号，The selection unit detects the high-order data of the weight position index parameter Huffman code output from the barrel shifter, and outputs the enable signal of the first look-up table when the high-order data of the weight position index parameter Huffman code is not all 1 And the multiplexer outputs the selection signal of the first look-up table look-up table result, and outputs the enable signal of the second look-up table and the multiplexer output when the high-order data of the weight position index parameter Huffman code is all 1 The selection signal of the second look-up table look-up table result,

第一查找表，存储有常用的权重位置索引参数的霍夫曼编码，在选择单元的使能下输出权重位置索引参数霍夫曼编码高位数据的码长和标志状态，The first look-up table stores the Huffman codes of the commonly used weight position index parameters, and outputs the code length and flag status of the weight position index parameters of the Huffman code high-order data when the selection unit is enabled,

第二查找表，存储有剩余的权重位置索引参数的霍夫曼编码，在选择单元的使能下输出权重位置索引参数霍夫曼编码低位数据的码长和标志状态，The second look-up table stores the Huffman codes of the remaining weight position index parameters, and outputs the code length and flag status of the Huffman code lower data of the weight position index parameters when the selection unit is enabled,

多路复用器，在选择单元的使能下输出第一查找表的查表结果或第二查找表的查表结果，及，The multiplexer outputs the table lookup result of the first lookup table or the table lookup result of the second lookup table under the enablement of the selection unit, and,

累加器，对多路复用器输出的码长进行累加，输出进位信号至触发器，输出累加信号至桶形移位器。The accumulator accumulates the code length output by the multiplexer, outputs the carry signal to the flip-flop, and outputs the accumulated signal to the barrel shifter.
根据权利要求1所述一种基于混合精度存储的深度神经网络加速器，其特征在于，所述位宽可控的批乘加计算模块包括多个PE单元，每个PE单元包括：The deep neural network accelerator based on mixed-precision storage according to claim 1, wherein the batch multiply-add calculation module with a controllable bit width comprises a plurality of PE units, and each PE unit comprises:

FIFO，用于缓存从输入数据缓存模块读取的输入数据，FIFO, used to buffer the input data read from the input data buffer module,

存储器，读取索引缓存模块缓存的权重，按照权值大小为每个权重分配存储有效位、符号位、位置索引参数单元的数据位宽，The memory reads the weight cached by the index cache module, and allocates the data bit width of the effective bit, sign bit, and position index parameter unit to each weight according to the weight value.

数据解析模块，对存储器存储的数据进行解析获得权重，根据解析获得的位置索引参数生成位宽控制信号，及，The data analysis module analyzes the data stored in the memory to obtain the weight, and generates the bit width control signal according to the position index parameter obtained by the analysis, and,

乘加单元，在位宽控制信号的作用下对从FIFO读取的输入数据进行位宽调整，对位宽调整后的输入数据和数据解析模块输出的权重进行批乘加运算。The multiplication and addition unit adjusts the bit width of the input data read from the FIFO under the action of the bit width control signal, and performs batch multiplication and addition operations on the input data after the bit width adjustment and the weights output by the data analysis module.
根据权利要求1所述一种基于混合精度存储的深度神经网络加速器，其特征在于，根据位置索引参数对从输入数据缓存模块读取的输入数据进行位宽调整具体为：在位置索引参数表征权重为调用频率高且精度要求高的高比特权值时将输入数据调整为高位宽数据，在位置索引参数表征权重为调用频率低且精度要求低的低比特权值时将输入数据调整为低位宽数据。The deep neural network accelerator based on mixed-precision storage according to claim 1, wherein the bit width adjustment of the input data read from the input data cache module according to the position index parameter is specifically: the position index parameter represents the weight Adjust the input data to high bit width data when calling high bit weights with high frequency and high precision requirements, and adjust the input data to low bit widths when the position index parameter characterizing weight is low bit weights with low calling frequency and low precision requirements data.
根据权利要求4所述一种基于混合精度存储的深度神经网络加速器，其特征在于，所述乘法单元为对数乘法器。The deep neural network accelerator based on mixed-precision storage according to claim 4, wherein the multiplication unit is a logarithmic multiplier.
一种基于混合精度存储的深度神经网络加速方法，其特征在于，对权重位置索引参数的霍夫曼编码进行双查找表的霍夫曼解码完成位置索引操作，对权重按照权值大小分配存储单元的数据位宽，不同位宽的存储单元存储有各权重的有效位、符号位、位置索引参数，根据位置索引参数对输入数据进行位宽调整，对经位宽处理后的输入数据和混合存储的权重进行乘加计算。A deep neural network acceleration method based on mixed-precision storage, which is characterized in that the Huffman coding of the weight position index parameter is subjected to the Huffman decoding of the double lookup table to complete the position index operation, and the storage unit is allocated to the weight according to the weight value. Data bit width, different bit width storage units store the effective bit, sign bit, and position index parameters of each weight, adjust the bit width of the input data according to the position index parameter, and mix and store the input data after bit width processing The weight of is multiplied and added.
根据权利要求6所述一种基于混合精度存储的深度神经网络加速方法，其特征在于，所述权重位置索引参数通过Caffe平台或Tensorflow平台线下训练获取。The method for accelerating a deep neural network based on mixed-precision storage according to claim 6, wherein the weight position index parameter is obtained through offline training on the Caffe platform or the Tensorflow platform.
根据权利要求6所述一种基于混合精度存储的深度神经网络加速方法，其特征在于，对权重位置索引参数的霍夫曼编码进行双查找表的霍夫曼解码完成位置索引操作的方法为：对权重位置索引参数的霍夫曼编码进行移位操作，在权重位置索引参数霍夫曼编码的高位数据不全为1时查找常用权重位置索引参数的霍夫曼编码表获取权重位置索引参数霍夫曼编码高位数据的码长，在权重位置索引参数霍夫曼编码的高位数据全部为1时查找剩余权重位置索引参数的霍夫曼编码表获取权重位置索引参数霍夫曼编码低位数据的码长，对获取的码长进行累加，根据累加结果更新权重位置索引参数霍夫曼编码的移位操作。The method for accelerating a deep neural network based on mixed-precision storage according to claim 6, wherein the method for performing the Huffman decoding of the double look-up table on the Huffman coding of the weight position index parameter to complete the position index operation is: Perform a shift operation on the Huffman coding of the weight position index parameter. When the high data of the weight position index parameter Huffman code is not all 1, look up the Huffman coding table of the commonly used weight position index parameter to obtain the weight position index parameter Hough The code length of the high-order data of Mann encoding. When the high-order data of the weight position index parameter Huffman encoding is all 1, look up the Huffman code table of the remaining weight position index parameters to obtain the code length of the weight position index parameter Huffman encoding low data , Accumulate the acquired code length, and update the weight position index parameter Huffman coding shift operation according to the accumulation result.