WO2021057085A1 - Hybrid precision storage-based deep neural network accelerator - Google Patents
- Publication number
- WO2021057085A1 (PCT/CN2020/094551; CN2020094551W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- weight
- position index
- data
- huffman
- index parameter
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 43
- 238000004364 calculation method Methods 0.000 claims abstract description 40
- 238000007405 data analysis Methods 0.000 claims abstract description 7
- 238000000034 method Methods 0.000 claims description 11
- 238000003079 width control Methods 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 6
- 238000004458 analytical method Methods 0.000 claims description 5
- 238000012549 training Methods 0.000 claims description 4
- 238000009825 accumulation Methods 0.000 claims description 2
- 230000001133 acceleration Effects 0.000 claims 1
- 230000006835 compression Effects 0.000 abstract description 6
- 238000007906 compression Methods 0.000 abstract description 6
- 238000013500 data storage Methods 0.000 abstract description 6
- 230000009977 dual effect Effects 0.000 abstract description 6
- 238000013461 design Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- The invention discloses a deep neural network accelerator based on mixed-precision storage, relates to the design of digital-analog hybrid integrated circuits for artificial intelligence neural networks, and belongs to the technical field of computing, calculating and counting.
- Deep neural networks have been widely studied and applied owing to their superior performance.
- Current mainstream deep neural networks have hundreds of millions of connections, and their memory-intensive and computation-intensive characteristics make them difficult to map onto embedded systems with extremely limited resources and power budgets.
- The current trend toward more precise and more powerful deep neural networks keeps enlarging their scale and required storage space, along with their computational overhead and complexity.
- Traditional custom hardware designs for accelerating deep neural network operations read the weights from dynamic random access memory, which costs roughly two orders of magnitude more than the operations performed by the arithmetic unit; the application's power consumption is then dominated by memory access. The design difficulties of deep neural network accelerators can therefore be attributed to two points: 1) deep neural networks keep growing in scale, and memory access has become the biggest bottleneck in neural network operation, especially when the weight matrix is larger than the cache capacity, so the advantages of neural networks cannot be fully exploited; 2) the structure of deep neural networks dictates that their basic operation is a large number of multiply-accumulate operations, and multiplication has always been an arithmetic operation that consumes substantial hardware resources and suffers long delays and high power consumption.
- Calculation speed and power consumption therefore determine the performance of a deep neural network accelerator.
- The present invention provides a deep neural network accelerator based on mixed-precision storage, which adopts offline software weight classification and online hardware mixed-precision storage.
- The working method realizes hierarchical storage of mixed-precision data through dual-look-up-table-based Huffman coding to solve the memory access problem of deep neural networks, and realizes data calculation matched to the corresponding weight level by introducing a batch multiply-add operation with controllable bit width. This saves the power the network would otherwise spend on a large number of multiplications and achieves low-power, low-latency, high-efficiency data scheduling and batch network processing.
- It thereby solves the technical problem of binary-weight neural networks, whose model simplifies network computation, data scheduling, and memory access but loses a great deal of network accuracy.
- A deep neural network accelerator based on mixed-precision storage first compresses the weights effectively through offline software processing (mixed-precision training of the neural network on the Caffe or Tensorflow platform; once a predetermined compression ratio is reached, the network parameters are stored in mixed precision and the weight position index parameters are Huffman-coded to obtain the first position index), achieving adjustable precision and reduced computational complexity.
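- As a concrete illustration of this offline step, the following minimal Python sketch builds a Huffman code book over weight position index symbols so that frequent indexes get the short codewords a small hardware look-up table can resolve; the four-level index distribution and the `build_huffman_code` helper are illustrative assumptions, not the exact procedure of the invention.

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Build a Huffman code book for weight position index symbols.
    Frequent indexes (high-precision weights) receive short codewords."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tie_breaker, {symbol: code_suffix}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

# Assumed skewed usage of the four offline weight levels: level 0 is
# called most often, so it ends up with the shortest codeword.
indexes = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
codebook = build_huffman_code(indexes)
encoded = "".join(codebook[s] for s in indexes)
```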
- The data is read in via direct memory access, enters the input data buffer module, and enters the bit-width-controllable batch multiply-add calculation module under the scheduling of the buffer control module.
- The weights and the encoded position index parameters first enter the index cache module.
- The weights are stored directly in the mixed-precision weight memory inside the bit-width-controllable batch multiply-add calculation module, while the encoded position index parameters are decoded by the dual-look-up-table-based Huffman decoder module and then output to the bit-width control unit in the batch multiply-add calculation module.
- The weights are parsed by the mixed-precision-based data storage and analysis module.
- The multiply-add unit selects the data and weight bit width according to the control signal of the bit-width control unit, then completes the corresponding multiply-add operation between the input data and the weights, and the result is stored directly in the register array.
- After the intermediate value stored in the register array is processed by the nonlinear calculation module, it is either stored in the output data buffer module or, under the scheduling of the control module, returned to the register array for another nonlinear operation.
- The deep neural network accelerator based on mixed-precision storage described in this application performs dual-look-up-table Huffman decoding on the Huffman-coded, offline-trained weight position index parameters, and derives from the access frequency represented by each weight position index parameter the bit-width control signal of the multiplier array, thereby realizing the bit-width-controllable batch multiply-add calculation module.
- The bit widths of the input data and weight data are adjusted first, and the mixed-precision data is then multiplied and accumulated, making the accelerator's precision adjustable, reducing computational complexity, and greatly reducing the amount of network computation without reducing the accuracy of the neural network.
- The effective bits, sign bits, and position index parameters of weights with different precisions are stored in the same memory, realizing the storage and parsing of mixed-precision data. Combining Huffman decoding with dual look-up tables divides the combinational circuit into two groups to reduce power consumption, realizes the compression and storage of data and weights at different precisions, reduces the data flow, and achieves low-power data scheduling and high-speed multiply-add operation for the deep neural network.
- Figure 1 is a schematic diagram of the overall architecture of the present invention.
- Figure 2 is the bit-width-controllable batch multiply-add calculation module of the present invention.
- Figure 3 is the mixed-precision-based data storage and analysis module of the present invention.
- Figure 4 is the dual-look-up-table-based Huffman decoder module of the present invention.
- The overall architecture of the deep neural network accelerator based on mixed-precision storage of the present invention is shown in Figure 1.
- The accelerator receives offline-trained and compressed weights and, under the control and scheduling of the control module, completes the decoding and scheduling of weights of different precisions and the operations of the fully connected layer and the activation layer.
- The deep neural network accelerator based on mixed-precision storage includes 4 on-chip cache modules, 1 control module, 16 mixed-precision approximate multiply-add processing units, 1 nonlinear calculation module, 1 register array, and 1 dual-look-up-table-based parameter Huffman decoding module.
- The 4 on-chip cache modules are: the input data cache module, the output data cache module, the cache control module, and the index cache module.
- The bit-width-controllable batch multiply-add calculation module of the present invention includes an internal static random access memory, a data analysis module, a bit-width control unit, a multiply-add unit, and a first-in-first-out buffer unit. This module cooperates with the dual-look-up-table-based parameter Huffman decoding module to perform batch multiply-add processing of network data at different bit widths for the decoded weight classes corresponding to the two look-up tables. Specifically, frequently accessed weight data decoded through look-up table 1 is given high-bit-width multiply-add operations, while rarely accessed weight data decoded through look-up table 2 is given low-bit-width multiply-add operations.
- This neural network calculation method can reduce a large number of redundant multiplication operations in the network.
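- A hedged software sketch of that bit-width selection follows: weights decoded through look-up table 1 are multiplied at a high bit width and weights from look-up table 2 at a low one; the 8-bit/4-bit pair and the `from_lut1` flag vector are illustrative assumptions.

```python
def batch_multiply_add(inputs, weights, from_lut1):
    """Accumulate input*weight products, truncating each operand pair
    to the bit width chosen by the bit-width control signal."""
    acc = 0
    for x, w, frequent in zip(inputs, weights, from_lut1):
        width = 8 if frequent else 4     # high width for LUT1 weights
        mask = (1 << width) - 1
        acc += (x & mask) * (w & mask)   # low-width MACs cost far less
    return acc

# Example: the first weight came from look-up table 1, the second from
# look-up table 2, so they are multiplied at different precisions.
total = batch_multiply_add([5, 9], [45, 3], [True, False])
```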
- the data storage and analysis module based on mixed precision of the present invention is shown in Fig. 3.
- The weights are divided into four levels offline; larger weights are allocated more bits and smaller weights fewer bits.
- The effective bits of each weight, the sign bit parameter, and the weight position index parameter are stored in the same memory.
- The data bit width of the static random access memory that stores the weights is 16 bits. Because the weights use mixed precision, weights of different magnitudes have different bit widths, so a mixed storage scheme is used: each 16-bit row of the SRAM contains multiple weights.
- The data analysis module is used to store and parse the weights.
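- A minimal Python model of that mixed storage and its parsing follows, with an assumed per-weight field layout (2-bit position index, 1 sign bit, then a level-dependent number of magnitude bits) and assumed level widths; the accelerator's real field format may differ.

```python
LEVEL_BITS = [6, 4, 3, 2]  # assumed magnitude bits per level; level 0
                           # holds the largest, most-used weights

def pack_row(weights):
    """Pack mixed-precision weights into one 16-bit SRAM row, LSB first.
    Each weight is (level, sign, magnitude)."""
    row, used = 0, 0
    for level, sign, mag in weights:
        width = 2 + 1 + LEVEL_BITS[level]
        assert used + width <= 16, "16-bit row is full"
        field = level | (sign << 2) | (mag << 3)
        row |= field << used
        used += width
    return row

def parse_row(row, count):
    """Model of the data analysis module: the 2-bit index read first
    tells the parser how many magnitude bits follow it."""
    weights = []
    for _ in range(count):
        level = row & 0b11
        sign = (row >> 2) & 0b1
        mag = (row >> 3) & ((1 << LEVEL_BITS[level]) - 1)
        weights.append((level, sign, mag))
        row >>= 2 + 1 + LEVEL_BITS[level]
    return weights

# Round-trip check of the assumed format: a 6-bit and a 2-bit weight
# share one 16-bit row.
ws = [(0, 1, 0b101101), (3, 0, 0b10)]
assert parse_row(pack_row(ws), len(ws)) == ws
```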
- The dual-look-up-table-based Huffman decoder module of the present invention includes: two look-up tables (look-up table 1 and look-up table 2), a barrel shifter, a selection unit whose output selection is realized by a multiplexer (MUX), a flip-flop, an accumulator, and the corresponding data memory and registers.
- Look-up table 1 is small and contains the most commonly used weight position index codes, while look-up table 2 contains all the remaining weight position index codes.
- The commonly used weight position indexes encode high-bit weights with high calling frequency and high precision requirements, while the remaining weight position indexes encode low-bit weights with low calling frequency and low precision requirements.
- The selection unit is a pre-decoding block that determines which look-up table is used when decoding the codeword and controls the multiplexer (MUX) to select the correct output in each decoding cycle.
- The flip-flop uses a ping-pong structure to achieve pipelined output, and its output data serves as the input of the barrel shifter. The shift signal of the barrel shifter is the accumulated signal generated by passing the output code-length data through the accumulator; for example, if the first output code length is 3, the barrel shifter shifts right by 3 bits and outputs, and if the second output code length is 4, the barrel shifter then shifts by 7 bits. The output of the barrel shifter is fed to the selection unit. For a 13-bit input data word, if the upper 7 bits are not all 1, the enable signal of look-up table 1 is valid, and the input of look-up table 1 is the output of the selection unit (the upper 7 bits of the input data).
- Otherwise, the enable signal of look-up table 2 is valid, and the input of look-up table 2 is the output of the selection unit (the lower 6 bits of the input data). The selection unit selects the corresponding look-up table according to the upper 7 bits of the input data and controls the multiplexer, so that the table look-up result drives the multiplexer to output the code length and flag status of the corresponding codeword. For example, inputting a code sequence yields a 32-bit Huffman code word (32'b0011_1101_1111_1110_0110_0111_1110_0110) after the flip-flop.
- The accumulated sum of the freshly initialized accumulator is 0, so the output of the barrel shifter is 13'b0_0110_0111_1110_0110. The upper 7 bits of the shifted data are not all 1 (their bitwise AND is 0), so the enable signal of look-up table 1 is valid while look-up table 2 does not work, and the final output is a code length of 4'b0100 (decimal 4) and a status of 4'b0011 (i.e. S3).
- The code length is fed to the accumulator; the accumulated code length is 4 and the carry signal is 0, so the shifter shifts 4 bits to the left to get 13'b1_1110_0110_0111_1110. Following the decoding process just described, the output is a code length of 4'b1000 (decimal 8) and flag status 4'b0111 (i.e. S7).
- The code-length result continues into the accumulator; the accumulated sum is 12 and the carry signal is 0, so the shifter shifts 12 bits to the left to get 13'b1_1101_1111_1110, and the decoded output has a code length of 4'b1010 (decimal 10) and status flag S9.
- The code-length result again enters the accumulator; the accumulated sum is 6 and the carry signal is 1. The carry signal being valid makes the FIFO read enable valid, the new 16-bit data stream (16'b0110_0110_0110_0110) is input to the flip-flop, the input of the shifter is updated to 32'b0110_0110_0110_0110_0011_1101_1111_1110, and the shift operation continues.
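- The decode walk above can be modeled in software as follows; the window widths (a 13-bit window with a 7-bit all-ones test) follow the description, while the concrete codewords, statuses, and table construction are illustrative assumptions.

```python
def make_luts(short_codes, long_suffixes):
    """Expand prefix-free codewords into the two direct-indexed tables:
    LUT1 is keyed by the upper 7 window bits, LUT2 by the 6 bits that
    follow an all-ones 7-bit prefix."""
    lut1, lut2 = {}, {}
    for code, status in short_codes.items():
        for pad in range(1 << (7 - len(code))):
            lut1[code + format(pad, f"0{7 - len(code)}b")] = (len(code), status)
    for suf, status in long_suffixes.items():
        for pad in range(1 << (6 - len(suf))):
            lut2[suf + format(pad, f"0{6 - len(suf)}b")] = (7 + len(suf), status)
    return lut1, lut2

def huffman_decode_dual_lut(bits, lut1, lut2, n_symbols):
    """Decode n_symbols codewords: if the upper 7 bits of the 13-bit
    window are not all 1, LUT1 resolves the codeword, otherwise LUT2
    does; the accumulated code length stands in for the barrel shift."""
    bits += "0" * 13                     # padding so the last window exists
    pos, out = 0, []
    for _ in range(n_symbols):
        window = bits[pos:pos + 13]
        if window[:7] != "1111111":      # LUT1 enabled, LUT2 idle
            code_len, status = lut1[window[:7]]
        else:                            # LUT2 enabled on the low 6 bits
            code_len, status = lut2[window[7:13]]
        out.append(status)
        pos += code_len                  # accumulator drives the shift
    return out

# Illustrative tables: two frequent short codes and one rare long code
# whose first 7 bits are all ones.
lut1, lut2 = make_luts({"010": "S3", "0011": "S7"}, {"0101": "S12"})
assert huffman_decode_dual_lut("010" + "0011" + "1111111" + "0101",
                               lut1, lut2, 3) == ["S3", "S7", "S12"]
```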
- The implementation process of the deep neural network accelerator based on mixed-precision storage includes the following four steps.
- Step 1: The neural network accelerator first compresses the weights effectively through offline software processing (mixed-precision training of the neural network on the Caffe or Tensorflow platform; once the predetermined compression ratio is reached, the network parameters are stored in mixed precision and the weight parameters are Huffman-coded to obtain the position index parameters), achieving adjustable precision and reduced computational complexity.
- Step 2: The data is read in via direct memory access, enters the input data buffer module, and enters the bit-width-controllable batch multiply-add calculation module under the scheduling of the control module.
- The weights and the encoded position index parameters first enter the index cache module. Under the control of the cache control module, the weights are stored directly in the mixed-precision weight memory inside the bit-width-controllable batch multiply-add calculation module, while the encoded position index parameters are decoded by the dual-look-up-table-based Huffman decoder module and then output to the bit-width control unit in the batch multiply-add calculation module.
- Step 3: When the data enters the bit-width-controllable batch multiply-add calculation module, the weights are parsed by the mixed-precision-based data storage and analysis module.
- The multiply-add unit selects the data and weight bit width according to the control signal of the bit-width control unit, then completes the corresponding multiply-add operation between the input data and the weights, and the result is stored directly in the register array.
- Step 4: After the intermediate value stored in the register array is processed by the nonlinear calculation module, it is either stored in the output data buffer module or, under the scheduling of the control module, returned to the register array for another nonlinear operation.
- Direct memory access directly reads the data calculated by the deep neural network from the output data buffer module.
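- Tying the four steps together, the hedged sketches above compose as follows (all names are the illustrative assumptions introduced earlier, not the accelerator's actual interfaces):

```python
# Step 1 (offline): Huffman-code the position indexes of the weights.
codebook = build_huffman_code(indexes)
# Step 2: pack mixed-precision weights into a 16-bit memory row.
row = pack_row([(0, 1, 0b101101), (3, 0, 0b10)])
# Step 3: parse the row and run the bit-width-controlled multiply-add.
(level, sign, mag), _ = parse_row(row, 2)
result = batch_multiply_add([5, 9], [mag, 3], [level < 2, False])
# Step 4 would pass `result` through the nonlinear module and buffer it.
```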
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Compression Of Band Width Or Redundancy In Fax (AREA)
Claims (8)
- 1. A deep neural network accelerator based on mixed-precision storage, characterized by comprising: an index cache module for storing the trained weights, weight sign bits, and the Huffman coding of the weight position index parameters; an input data cache module for storing input data; a cache control module for generating the read/write addresses of the index cache module and the input data cache module; a Huffman decoder that performs dual-look-up-table Huffman decoding on the Huffman coding of the weight position index parameters to complete the position index operation and outputs the weight position index parameters to the index cache module; a bit-width-controllable batch multiply-add calculation module that allocates the data bit width of the storage units to the weights read from the index cache module according to weight magnitude, where storage units of different bit widths store the effective bits, sign bit, and position index parameter of each weight, adjusts the bit width of the input data read from the input data cache module according to the position index parameter, performs multiply-add calculation on the bit-width-processed input data and the mixed-stored weights, and outputs the multiply-add result; a register array for caching the multiply-add results; a nonlinear calculation module that performs nonlinear calculation on the read multiply-add results; an output data cache module for caching the multiply-add results or nonlinear calculation results; and a control module for generating read/write instructions for the index cache module, read/write instructions for the input data cache module, operating instructions for the Huffman decoder, bit-width control instructions for the bit-width-controllable batch multiply-add calculation module, and scheduling instructions for storing the nonlinear calculation results.
- 2. The deep neural network accelerator based on mixed-precision storage according to claim 1, characterized in that the Huffman decoder comprises: a flip-flop that, enabled by the carry signal output by the accumulator, outputs the read Huffman coding of the weight position index parameters to the barrel shifter; a barrel shifter that, enabled by the accumulated signal output by the accumulator, shifts the read Huffman coding of the weight position index parameters and outputs it; a selection unit that detects the high-order data of the Huffman coding output by the barrel shifter, outputs the enable signal of the first look-up table and the multiplexer's selection signal for the first look-up table's result when the high-order data is not all 1, and outputs the enable signal of the second look-up table and the multiplexer's selection signal for the second look-up table's result when the high-order data is all 1; a first look-up table storing the Huffman codes of the commonly used weight position index parameters, which, when enabled by the selection unit, outputs the code length and flag status of the high-order Huffman-coded data; a second look-up table storing the Huffman codes of the remaining weight position index parameters, which, when enabled by the selection unit, outputs the code length and flag status of the low-order Huffman-coded data; a multiplexer that, as enabled by the selection unit, outputs the look-up result of the first look-up table or of the second look-up table; and an accumulator that accumulates the code lengths output by the multiplexer, outputs a carry signal to the flip-flop, and outputs an accumulated signal to the barrel shifter.
- 3. The deep neural network accelerator based on mixed-precision storage according to claim 1, characterized in that the bit-width-controllable batch multiply-add calculation module comprises multiple PE units, each PE unit comprising: a FIFO for buffering the input data read from the input data cache module; a memory that reads the weights cached by the index cache module and allocates, according to weight magnitude, the data bit width of the unit storing each weight's effective bits, sign bit, and position index parameter; a data analysis module that parses the data stored in the memory to obtain the weights and generates a bit-width control signal from the parsed position index parameter; and a multiply-add unit that adjusts the bit width of the input data read from the FIFO under the bit-width control signal and performs batch multiply-add operations on the bit-width-adjusted input data and the weights output by the data analysis module.
- 4. The deep neural network accelerator based on mixed-precision storage according to claim 1, characterized in that adjusting the bit width of the input data read from the input data cache module according to the position index parameter specifically comprises: adjusting the input data to high-bit-width data when the position index parameter indicates a high-bit weight with high calling frequency and high precision requirements, and adjusting the input data to low-bit-width data when the position index parameter indicates a low-bit weight with low calling frequency and low precision requirements.
- 5. The deep neural network accelerator based on mixed-precision storage according to claim 4, characterized in that the multiplication unit is a logarithmic multiplier.
- 6. A deep neural network acceleration method based on mixed-precision storage, characterized in that dual-look-up-table Huffman decoding is performed on the Huffman coding of the weight position index parameters to complete the position index operation; the data bit width of the storage units is allocated to the weights according to weight magnitude, with storage units of different bit widths storing the effective bits, sign bit, and position index parameter of each weight; the bit width of the input data is adjusted according to the position index parameter; and multiply-add calculation is performed on the bit-width-processed input data and the mixed-stored weights.
- 7. The deep neural network acceleration method based on mixed-precision storage according to claim 6, characterized in that the weight position index parameters are obtained through offline training on the Caffe platform or the Tensorflow platform.
- 8. The deep neural network acceleration method based on mixed-precision storage according to claim 6, characterized in that the method for performing dual-look-up-table Huffman decoding on the Huffman coding of the weight position index parameters to complete the position index operation is: shift the Huffman coding of the weight position index parameters; when the high-order data of the Huffman coding is not all 1, look up the Huffman code table of the commonly used weight position index parameters to obtain the code length of the high-order coded data; when the high-order data of the Huffman coding is all 1, look up the Huffman code table of the remaining weight position index parameters to obtain the code length of the low-order coded data; accumulate the obtained code lengths; and update the shift operation of the Huffman coding of the weight position index parameters according to the accumulation result.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910922467.4A CN110766155A (en) | 2019-09-27 | 2019-09-27 | Deep neural network accelerator based on mixed precision storage |
CN201910922467.4 | 2019-09-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021057085A1 true WO2021057085A1 (en) | 2021-04-01 |
Family
ID=69330542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/094551 WO2021057085A1 (en) | 2019-09-27 | 2020-06-05 | Hybrid precision storage-based depth neural network accelerator |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110766155A (en) |
WO (1) | WO2021057085A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110766155A (en) * | 2019-09-27 | 2020-02-07 | 东南大学 | Deep neural network accelerator based on mixed precision storage |
CN111091190A (en) * | 2020-03-25 | 2020-05-01 | 光子算数(北京)科技有限责任公司 | Data processing method and device, photonic neural network chip and data processing circuit |
CN111783967B (en) * | 2020-05-27 | 2023-08-01 | 上海赛昉科技有限公司 | Data double-layer caching method suitable for special neural network accelerator |
CN112037118B (en) * | 2020-07-16 | 2024-02-02 | 新大陆数字技术股份有限公司 | Image scaling hardware acceleration method, device and system and readable storage medium |
CN112906863B (en) * | 2021-02-19 | 2023-04-07 | 山东英信计算机技术有限公司 | Neuron acceleration processing method, device, equipment and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451659A (en) * | 2017-07-27 | 2017-12-08 | 清华大学 | Neutral net accelerator and its implementation for bit wide subregion |
US20180046905A1 (en) * | 2016-08-12 | 2018-02-15 | Beijing Deephi Intelligence Technology Co., Ltd | Efficient Data Access Control Device for Neural Network Hardware Acceleration System |
US20190042939A1 (en) * | 2018-05-31 | 2019-02-07 | Intel Corporation | Circuitry for low-precision deep learning |
CN109726806A (en) * | 2017-10-30 | 2019-05-07 | 上海寒武纪信息科技有限公司 | Information processing method and terminal device |
WO2019177824A1 (en) * | 2018-03-14 | 2019-09-19 | Microsoft Technology Licensing, Llc | Hardware accelerated neural network subgraphs |
US20190294413A1 (en) * | 2018-03-23 | 2019-09-26 | Amazon Technologies, Inc. | Accelerated quantized multiply-and-add operations |
CN110766155A (en) * | 2019-09-27 | 2020-02-07 | 东南大学 | Deep neural network accelerator based on mixed precision storage |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3654172A1 (en) * | 2017-04-19 | 2020-05-20 | Shanghai Cambricon Information Technology Co., Ltd | Fused vector multiplier and method using the same |
- 2019-09-27: CN CN201910922467.4A patent/CN110766155A/en (active, Pending)
- 2020-06-05: WO PCT/CN2020/094551 patent/WO2021057085A1/en (active, Application Filing)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180046905A1 (en) * | 2016-08-12 | 2018-02-15 | Beijing Deephi Intelligence Technology Co., Ltd | Efficient Data Access Control Device for Neural Network Hardware Acceleration System |
CN107451659A (en) * | 2017-07-27 | 2017-12-08 | 清华大学 | Neutral net accelerator and its implementation for bit wide subregion |
CN109726806A (en) * | 2017-10-30 | 2019-05-07 | 上海寒武纪信息科技有限公司 | Information processing method and terminal device |
WO2019177824A1 (en) * | 2018-03-14 | 2019-09-19 | Microsoft Technology Licensing, Llc | Hardware accelerated neural network subgraphs |
US20190294413A1 (en) * | 2018-03-23 | 2019-09-26 | Amazon Technologies, Inc. | Accelerated quantized multiply-and-add operations |
US20190042939A1 (en) * | 2018-05-31 | 2019-02-07 | Intel Corporation | Circuitry for low-precision deep learning |
CN110766155A (en) * | 2019-09-27 | 2020-02-07 | 东南大学 | Deep neural network accelerator based on mixed precision storage |
Non-Patent Citations (1)
Title |
---|
WANG, ZHEN ET AL.: "EERA-DNN: An energy-efficient reconfigurable architecture for DNNs with hybrid bit-width and logarithmic multiplier", IEICE ELECTRONICS EXPRESS, vol. 15, no. 8, 6 April 2018 (2018-04-06), XP055795437 * |
Also Published As
Publication number | Publication date |
---|---|
CN110766155A (en) | 2020-02-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021057085A1 (en) | Hybrid precision storage-based deep neural network accelerator | |
CN110378468B (en) | Neural network accelerator based on structured pruning and low bit quantization | |
CN110070178B (en) | Convolutional neural network computing device and method | |
US20210357736A1 (en) | Deep neural network hardware accelerator based on power exponential quantization | |
CN108292222B (en) | Hardware apparatus and method for data decompression | |
CN111062472B (en) | Sparse neural network accelerator based on structured pruning and acceleration method thereof | |
CN109447241B (en) | Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things | |
CN108416422A (en) | A kind of convolutional neural networks implementation method and device based on FPGA | |
CN109901814A (en) | Customized floating number and its calculation method and hardware configuration | |
CN111581593A (en) | Configurable reuse sectional type lookup table activation function implementation device | |
CN112257844B (en) | Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof | |
CN111507465A (en) | Configurable convolutional neural network processor circuit | |
CN113361695A (en) | Convolutional neural network accelerator | |
Kim et al. | V-LSTM: An efficient LSTM accelerator using fixed nonzero-ratio viterbi-based pruning | |
CN113837365A (en) | Model for realizing sigmoid function approximation, FPGA circuit and working method | |
CN117574970A (en) | Inference acceleration method, system, terminal and medium for large-scale language model | |
CN109948787B (en) | Arithmetic device, chip and method for neural network convolution layer | |
CN115526131A (en) | Method and device for approximately calculating Tanh function by multi-level coding | |
EP4258135A1 (en) | Matrix calculation apparatus, method, system, circuit, chip, and device | |
WO2023284130A1 (en) | Chip and control method for convolution calculation, and electronic device | |
Wang et al. | EERA-DNN: An energy-efficient reconfigurable architecture for DNNs with hybrid bit-width and logarithmic multiplier | |
CN114996638A (en) | Configurable fast Fourier transform circuit with sequential architecture | |
Huang et al. | A low-bit quantized and hls-based neural network fpga accelerator for object detection | |
CN109117114B (en) | Low-complexity approximate multiplier based on lookup table | |
CN113392963A (en) | CNN hardware acceleration system design method based on FPGA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20867402 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20867402 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.10.2022) |
|