WO2019127362A1 - 神经网络模型分块压缩方法、训练方法、计算装置及系统 - Google Patents

神经网络模型分块压缩方法、训练方法、计算装置及系统

Info

Publication number
WO2019127362A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
matrix
block
blocks
network model
Prior art date
Application number
PCT/CN2017/119819
Other languages
English (en)
French (fr)
Inventor
张悠慧
季宇
张优扬
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 filed Critical 清华大学
Priority to CN201780042629.4A priority Critical patent/CN109791628B/zh
Priority to PCT/CN2017/119819 priority patent/WO2019127362A1/zh
Publication of WO2019127362A1 publication Critical patent/WO2019127362A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention generally relates to the field of neural network technologies, and more particularly to a network model block compression method, a training method, a computing device, and a hardware system for a neural network.
  • Figure 1 shows a chain-like neural network, in which each circle represents a neuron.
  • Each arrow represents a connection between neurons, each connection has a weight, and the structure of the actual neural network is not limited to a chain-like network structure.
  • the core computation of the neural network is a matrix vector multiplication operation.
  • the output produced by a layer Ln containing n neurons can be represented by a vector Vn of length n; this layer is fully connected to a layer Lm containing m neurons, and the connection weights can be expressed as a matrix Mn×m with n rows and m columns, where each matrix element represents the weight of one connection.
  • the weighted vector input to Lm is then Mn×m Vn, and this matrix-vector multiplication is the core computation of the neural network.
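As a toy illustration of this core operation (a sketch for reference, not part of the patent text; it assumes the convention that row i of M holds the weights of the connections leaving neuron i of Ln, so the patent's Mn×m Vn corresponds to M.T @ V_n here):

```python
import numpy as np

n, m = 4, 3                    # neurons in layer L_n and in layer L_m
V_n = np.random.rand(n)        # output vector of layer L_n (length n)
M = np.random.rand(n, m)       # M[i, j]: weight of the connection from neuron i of L_n to neuron j of L_m

# Weighted input to L_m: for each target neuron j, sum V_n[i] * M[i, j] over all i.
input_to_Lm = M.T @ V_n        # vector of length m
print(input_to_Lm.shape)       # (3,)
```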
  • since the amount of matrix-vector multiplication is very large, performing these operations on a conventional general-purpose processor takes a great deal of time; accelerating matrix multiplication is therefore the main design goal of neural network acceleration chips.
  • a memristor array is a hardware device capable of implementing the above matrix multiplication operation.
  • the resistance of each memristor can be changed by applying a specific input current, and this resistance can be used to store data.
  • compared with conventional DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory), memristors offer high storage density and do not lose their data when the power supply is lost.
  • Figure 2 shows a schematic diagram of a memristor based crossbar structure.
  • by arranging the wires as a crossbar and connecting a memristor at each crossing point, the conductance value G (the reciprocal of the resistance) of each memristor is set to the corresponding element of the weight matrix.
  • when a voltage V is applied at the input, it is multiplied by the memristor conductance G, the resulting currents are summed on each output line, and that output current multiplied by the grounding resistance Rs yields the output voltage V', completing the matrix-vector multiplication at the output end.
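The crossbar computation can be mimicked in a few lines (an idealized sketch under the conventions above, with an illustrative helper name and ignoring precision, disturbance and conversion costs):

```python
import numpy as np

def crossbar_mvm(G, V, Rs=1.0):
    """Idealized memristor crossbar: G[i, j] is the conductance at the crossing of
    input line i and output line j, V[i] is the input voltage on line i. Each output
    line collects the current sum_i V[i] * G[i, j]; multiplying that current by the
    grounding resistance Rs gives the output voltage V'."""
    I = G.T @ V       # Kirchhoff summation of the per-device currents on each output line
    return Rs * I     # output voltages V'

G = np.random.rand(4, 3)        # non-negative conductances encoding a 4x3 weight matrix
V = np.random.rand(4)           # input voltages
print(crossbar_mvm(G, V))
```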
  • however, computing with memristor-based chips also has drawbacks such as low precision, large disturbance, the high cost of digital-to-analog and analog-to-digital conversion, and limited matrix size.
  • TrueNorth is also a chip capable of matrix vector multiplication.
  • TrueNorth is IBM's neuromorphic chip; each chip integrates 4096 synaptic cores, and each synaptic core can handle 256×256 synaptic computations.
  • the neural network model needs to be compressed to reduce resource overhead and improve the computational efficiency of the neural network.
  • the existing Deep Compression is a common compression method for CNN networks.
  • the implementation of Deep Compression is mainly divided into three steps: weight pruning, weight sharing and Huffman coding.
  • Weight pruning: first, train the model normally to obtain the network weights; second, set all weights below a certain threshold to 0; third, retrain the remaining non-zero weights in the network. These three steps are repeated iteratively.
  • Weight sharing: the k-means algorithm is used to cluster the weights; within each cluster, all weights share the cluster centroid, so the final stored result is a codebook and an index table.
  • Huffman coding: mainly used to address the redundancy caused by codes of unequal length. Deep Compression uses 8-bit encoding for the convolutional layers and 5-bit encoding for the fully connected layers, so this entropy coding better balances the coding bits and reduces redundancy.
  • this method can compress the model to a compression rate of 90% while maintaining accuracy.
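A minimal sketch of the first two Deep Compression steps (magnitude pruning and k-means weight sharing) is given below for reference; the helper names, threshold and cluster count are illustrative assumptions, Huffman coding of the resulting indices is omitted, and scikit-learn's k-means is used in place of any particular implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def prune(weights, threshold):
    """Step 1 (weight pruning): zero every weight whose magnitude is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def share_weights(weights, mask, n_clusters=8):
    """Step 2 (weight sharing): cluster the surviving weights with k-means; each weight
    is replaced by its cluster centroid, so only a codebook and an index table remain."""
    surviving = weights[mask].reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(surviving)
    shared = weights.copy()
    shared[mask] = km.cluster_centers_.ravel()[km.labels_]
    return shared, km.cluster_centers_.ravel()

W = np.random.randn(64, 64)
W_pruned, mask = prune(W, threshold=0.5)
W_shared, codebook = share_weights(W_pruned, mask, n_clusters=8)
```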
  • however, although these existing techniques can greatly compress the model size, they cannot be adapted to the neural network models used on matrix-vector-multiplication chips such as memristor arrays or TrueNorth: the weights removed by pruning are scattered, so the number of required arrays is not reduced; weight sharing slows down the operation of a memristor array; and the weight encoding of a memristor array is fixed and cannot be compressed. A network model compression technique suited to such neural network computation is therefore needed.
  • the present invention has been made in view of the above circumstances.
  • according to one aspect of the present invention, a network model block compression method for a neural network is provided, comprising: a weight matrix obtaining step of obtaining the weight matrix of the network model of the trained neural network; a weight matrix blocking step of dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-pruned weight concentration step of gathering, by row and column exchanges and on the basis of the sum of the absolute values of the weights of the matrix elements in each sub-block, the matrix elements with smaller weights
  • into the sub-blocks to be pruned, such that the sum of the absolute values of the weights of the matrix elements in each sub-block to be pruned is smaller than that of the other sub-blocks that are not to be pruned;
  • and a sub-block pruning step of pruning away the weights of the matrix elements in the sub-blocks to be pruned to obtain the final weight matrix, thereby compressing the network model of the neural network.
  • the number of sub-blocks to be pruned may be set according to a compression rate or according to a threshold.
  • the to-be-pruned weight concentration step may include: a pre-pruned sub-block determination step of determining the pre-pruned sub-blocks that serve as pruning candidates; a row and column marking step of selecting and marking all rows and all columns in which the pre-pruned sub-blocks are located as transposition rows and transposition columns, where the number of pre-pruned sub-blocks is set according to the compression rate; and a row exchange step and a column exchange step of summing the absolute values of the weights of the matrix elements in each row and swapping the rows with the smallest sums, in order, into the marked transposition rows, and of summing the absolute values of the weights of the matrix elements in each column and swapping the columns with the smallest sums, in order, into
  • the marked transposition columns; the above steps are repeated until the exchanges can no longer change the total of the absolute values of the weights of the matrix elements in all pre-pruned sub-blocks, at which point the pre-pruned sub-blocks are taken as the sub-blocks to be pruned.
  • the pre-pruned sub-block determination step may include: calculating the sum of the absolute values of the weights of the matrix elements in each initial sub-block, and taking the sub-blocks with the smallest sums as the pre-pruned sub-blocks.
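To make the procedure concrete, here is a compact sketch of the idea under the author's reading (hypothetical helper names; a simplified variant that permutes whole rows and columns directly onto the candidate sub-block positions rather than performing the patent's sequential swaps, with a simple "candidate total stopped changing" stopping rule):

```python
import numpy as np

def block_prune(W, block=2, n_prune=4, max_iter=50):
    """Gather small-magnitude weights into n_prune sub-blocks of size block x block
    by row/column permutation, then zero those sub-blocks. Returns the permuted,
    pruned matrix together with the row and column orders used."""
    n = W.shape[0]                      # assumes a square matrix divisible by `block`
    A = np.abs(W)
    row_order, col_order = np.arange(n), np.arange(n)
    prev_total = None

    def candidates(P):
        """Sub-block sums of P and the n_prune sub-blocks with the smallest sums."""
        sums = P.reshape(n // block, block, n // block, block).sum(axis=(1, 3))
        idx = np.argsort(sums, axis=None)[:n_prune]
        rb, cb = np.unravel_index(idx, sums.shape)
        return rb, cb, sums[rb, cb].sum()

    def perm_for(axis_sums, targets):
        """Permutation that places the indices with the smallest sums at `targets`."""
        ranked = list(np.argsort(axis_sums))
        chosen = ranked[:len(targets)]
        rest = [i for i in range(n) if i not in chosen]
        perm = np.empty(n, dtype=int)
        perm[list(targets)] = chosen
        perm[[p for p in range(n) if p not in targets]] = rest
        return perm

    for _ in range(max_iter):
        P = A[np.ix_(row_order, col_order)]
        rb, cb, total = candidates(P)
        if prev_total is not None and np.isclose(total, prev_total):
            break                        # exchanges no longer change the candidate total
        prev_total = total
        t_rows = sorted({int(r) * block + i for r in rb for i in range(block)})
        t_cols = sorted({int(c) * block + j for c in cb for j in range(block)})
        row_order = row_order[perm_for(P.sum(axis=1), t_rows)]
        col_order = col_order[perm_for(P.sum(axis=0), t_cols)]

    Wp = W[np.ix_(row_order, col_order)].copy()
    rb, cb, _ = candidates(np.abs(Wp))
    for r, c in zip(rb, cb):             # prune the selected sub-blocks
        Wp[r * block:(r + 1) * block, c * block:(c + 1) * block] = 0.0
    return Wp, row_order, col_order

W = np.random.randn(6, 6)
Wp, rows, cols = block_prune(W, block=2, n_prune=4)   # 9 sub-blocks, 4 pruned (~50% rate)
```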
  • according to another aspect, a neural network training method is provided, comprising: training the neural network to obtain the weight matrix of the network model; compressing the weight matrix according to the above network model block compression method; and iterating these steps until a predetermined iteration stopping criterion is reached.
  • according to another aspect, a computing device for neural network computation is provided, comprising a memory and a processor, the memory storing computer-executable instructions that include network model compression instructions; when the processor executes the network model compression instructions, the following method is performed: a weight matrix obtaining step of obtaining the weight matrix of the network model of the trained neural network; a weight matrix blocking step of dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-pruned weight concentration step of gathering, by row and column exchanges and on the basis of the sum of the absolute values of the weights of the matrix elements in each sub-block, the matrix elements with smaller weights into the sub-blocks to be pruned, such that the sum of the absolute values of the weights in each sub-block to be pruned is smaller than that of the other sub-blocks that are not to be pruned; and a sub-block pruning step of pruning away the weights of the matrix elements in the sub-blocks to be pruned to obtain the final weight matrix, thereby compressing the network model of the neural network.
  • the number of sub-blocks to be pruned can be set according to a compression rate or according to a threshold.
  • the to-be-pruned weight concentration step may include: a pre-pruned sub-block determination step of determining the pre-pruned sub-blocks that serve as pruning candidates; a row and column marking step of selecting and marking all rows and all columns in which the pre-pruned sub-blocks are located as transposition rows and transposition columns, where the number of pre-pruned sub-blocks is set according to the compression rate; and a row exchange step and a column exchange step of summing the absolute values of the weights of the matrix elements in each row and swapping the rows with the smallest sums, in order, into the marked transposition rows, and of summing the absolute values of the weights of the matrix elements in each column and swapping the columns with the smallest sums, in order, into the marked transposition
  • columns; the above steps are repeated until the exchanges can no longer change the total of the absolute values of the weights of the matrix elements in all pre-pruned sub-blocks, at which point the pre-pruned sub-blocks are taken as the sub-blocks to be pruned.
  • the pre-pruned sub-block determination step may further comprise: calculating the sum of the absolute values of the weights of the matrix elements in each initial sub-block, and taking the sub-blocks with the smallest sums as the pre-pruned sub-blocks.
  • the computer-executable instructions may also include network model application instructions; when the processor executes the network model application instructions, the following method is performed: an input data processing step of exchanging the input data according to the row and column exchange order; a matrix multiplication step of multiplying the exchanged input data by the final weight matrix obtained by executing the network model compression instructions; and an output data processing step of reverse-exchanging the result of the matrix multiplication according to the row and column exchange order and outputting it as the output data.
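A sketch of these application steps (hypothetical helper name; row_order and col_order are the accumulated exchange orders, e.g. as returned by the block_prune sketch above, and the sanity check uses the row and column orders from the worked example further below):

```python
import numpy as np

def apply_compressed(x, Wp, row_order, col_order):
    """Apply the permuted, pruned weight matrix Wp to an input vector x given in the
    original order: permute the input, multiply, then undo the column permutation."""
    x_perm = x[row_order]            # input data processing: exchange the input
    y_perm = x_perm @ Wp             # matrix multiplication with the final weight matrix
    y = np.empty_like(y_perm)
    y[col_order] = y_perm            # output data processing: reverse the column exchange
    return y

# Sanity check with an un-pruned permuted matrix: the result matches the product with
# the original matrix exactly (with pruning it is approximate by design).
rows = np.array([5, 4, 3, 2, 1, 0])
cols = np.array([3, 5, 4, 1, 0, 2])
W = np.random.randn(6, 6)
x = np.random.rand(6)
assert np.allclose(apply_compressed(x, W[np.ix_(rows, cols)], rows, cols), x @ W)
```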
  • the computer-executable instructions may further include network model training instructions; when the processor executes the network model training instructions, the following method is performed: training the neural network to obtain the initial weight matrix of the network model; executing the network model compression instructions to obtain the compressed final weight matrix; executing the network model application instructions for training; and iterating the above compression and training steps until a predetermined iteration stopping criterion is reached.
  • according to another aspect, a hardware system is provided that performs network model compression, application and training using the above network model block compression method, the above neural network training method and the above computing device, and that includes a neural network hardware chip; the neural network hardware chip has basic modules that perform the matrix-vector multiplication in hardware through circuit devices, and no circuit device is provided at the positions corresponding to the matrix elements in the sub-blocks to be pruned.
  • the circuit device may be a memristor or a synapse of a TrueNorth chip.
  • a network model block compression method for a neural network is thus provided, saving resource overhead so that a large-scale neural network can be deployed under limited-resource conditions.
  • Figure 1 shows a schematic of a chained neural network.
  • Figure 2 shows a schematic diagram of a memristor based crossbar switch structure.
  • FIG. 3 is a diagram showing an application scenario of a network model block compression technique of a neural network in accordance with the present invention.
  • FIG. 4 shows a general flow diagram of a network model block compression method in accordance with the present invention.
  • Fig. 5 shows an exploded flowchart of the to-be-pruned weight concentration step of the above method.
  • Figures 6a-6c show the accuracy achieved at different compression rates using the compression method according to the invention, over a variety of data sets and different network sizes.
  • FIG. 3 shows a schematic diagram of an application context 1000 of a network model block compression technique for a neural network in accordance with the present invention.
  • the general inventive concept of the present disclosure is to perform preliminary neural network training for the neural network application 1100 to learn a network model 1200, to block-compress this network model 1200 at a predetermined compression rate using the network model block compression method 1300, to retrain, and then to iterate compression and retraining so as to fine-tune and learn to improve accuracy until a predetermined iteration stopping criterion is reached, thereby determining the final network model 1400; in this way the number of block-computation devices required by a neural network chip can be reduced without degrading the result, so that a large-scale neural network can be deployed under limited-resource conditions.
  • FIG. 4 and FIG. 5 are flowcharts of a network model block compression method 1300 according to an embodiment of the present invention: FIG. 4 shows the general flowchart of the method, and FIG. 5 shows an exploded flowchart of its to-be-pruned weight concentration step.
  • the network model block compression method includes the following steps:
  • weight matrix obtaining step S210: obtain the weight matrix of the network model of the trained neural network.
  • here, to better illustrate the method, the initial weight matrix is assumed to be of size 6*6 and is further illustrated by the matrix of Table 1 below.
  • weight matrix blocking step S220: divide the weight matrix into an array of initial sub-blocks according to a predetermined array size. Compressing the above weight matrix with sub-blocks of size 2*2, for example, divides it into a 3*3 = 9 sub-block array.
  • those skilled in the art will understand that the sub-block size can be set according to the size of the weight matrix and the required compression rate; sub-block sizes such as 4*4, 8*8, ..., 256*256 can also be used.
  • to-be-pruned weight concentration step S230: based on the sum of the absolute values of the weights of the matrix elements in each sub-block (hereinafter simply the sub-block sum), gather the matrix elements with smaller weights into the sub-blocks to be pruned by row and column exchanges,
  • so that the sub-block sum of each sub-block to be pruned is smaller than that of the other sub-blocks that are not to be pruned, where the number of sub-blocks to be pruned is set according to the compression rate;
  • more specifically, FIG. 5 shows an exploded flowchart of the to-be-pruned weight concentration step S230, which includes the following steps:
  • pre-pruned sub-block determination step S2301: determine the pre-pruned sub-blocks that serve as pruning candidates.
  • in this embodiment, the sum of each initial sub-block is calculated, and the sub-blocks with the smallest sums are taken as the pre-pruned sub-blocks.
  • specifically, taking the absolute values of the weight matrix of Table 1 first gives the matrix of Table 2; computing the sums of the 2*2 sub-blocks of Table 2 then gives Table 3.
  • finally, the sub-blocks with the smallest sums are selected as the pre-pruned sub-blocks and marked True, the other sub-blocks are marked False, giving Table 4, where sub-block labels begin with B.
  • the number of marked pre-pruned sub-blocks is determined by the compression rate: assuming a compression rate of 50%, it equals the total number of sub-blocks times the compression rate, 9*50% = 4.5, rounded to 4.
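These two steps can be reproduced numerically from the Table 1 matrix (a quick check for reference, not part of the patent text):

```python
import numpy as np

# The Table 1 weight matrix; its absolute values give Table 2.
W = np.array([
    [ 0.9373,  0.0419,  0.7959,  0.8278, -0.4288,  0.6854],
    [ 0.3311,  0.6683,  0.8686,  0.1087,  0.3058, -0.6641],
    [ 0.0879, -0.7366,  0.5453, -0.0170, -0.8295,  0.5781],
    [ 0.3964,  0.0769, -0.4809, -0.1507,  0.0296, -0.2923],
    [ 0.9786, -0.9656,  0.8449,  0.6284, -0.9309,  0.4138],
    [ 0.7540,  0.7859, -0.8424,  0.9000, -0.4225,  0.0847],
])
# 2*2 sub-block sums of |W|: this reproduces Table 3 (e.g. the top-left entry is 1.9786).
block_sums = np.abs(W).reshape(3, 2, 3, 2).sum(axis=(1, 3))
# The four sub-blocks with the smallest sums are the pre-pruned candidates of Table 4
# (B21, B22, B23 and B33 in the patent's numbering).
candidates = np.argsort(block_sums, axis=None)[:4]
```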
  • row and column marking step S2302: select all rows and all columns in which the pre-pruned sub-blocks are located as the transposition rows and transposition columns, and mark them.
  • according to Tables 3 and 4, the four sub-blocks B21, B22, B23 and B33 with the smallest sums are marked "True" as the pre-pruned sub-blocks.
  • taking the rows and columns in which the pre-pruned sub-blocks are located as the transposition rows and columns, the transposition rows are R2, R3, R4, R5 and the transposition columns are C0-C5; they are marked ER2, ER3, ER4, ER5 and EC0-EC5 respectively, where transposition rows begin with ER and transposition columns with EC, to distinguish them from the ordinary rows and columns beginning with R and C.
  • row exchange step S2303: sum the absolute values of the weights of the matrix elements in each row, and swap the rows with the smallest sums, in order, into the marked transposition rows.
  • the row sums of Table 2 are given in Table 5; from small to large the order is R3<R2<R1<R0<R5<R4, so the rows are moved in this order into the marked rows ER2, ER3, ER4, ER5:
  • R3 is transposed to ER2 → R[0 1 3 2 4 5] (at this point, since R3 and R2 have been swapped, R2 no longer needs to be transposed);
  • R1 is transposed to ER4 → R[0 4 3 2 1 5];
  • R0 is transposed to ER5 → R[5 4 3 2 1 0].
  • column exchange step S2304: sum the absolute values of the weights of the matrix elements in each column, and swap the columns with the smallest sums, in order, into the marked transposition columns. The column sums (Table 7), from small to large, are C3<C5<C4<C1<C0<C2, so the columns are moved in this order into the marked columns EC0-EC5:
  • C3 is transposed to EC0 → C[3 1 2 0 4 5];
  • C5 is transposed to EC1 → C[3 5 2 0 4 1];
  • C4 is transposed to EC2 → C[3 5 4 0 2 1];
  • C1 is transposed to EC3 → C[3 5 4 1 2 0];
  • C0 is transposed to EC4 → C[3 5 4 1 0 2];
  • C2 is transposed to EC5 → C[3 5 4 1 0 2].
  • that is, the first column of the exchanged matrix (Table 8) is column C3 of the original matrix, the second column is column C5 of the original matrix, and so on.
  • the result of the first row-column exchange is therefore: row order R[5,4,3,2,1,0], column order C[3,5,4,1,0,2].
  • exchange termination judgment step S2305: first, the stored sub-block total Sum1 is computed; Sum1 is the total of the sums of the pre-pruned sub-blocks before this row-column exchange. According to Table 3, for the four pre-pruned sub-blocks B21, B22, B23 and B33 this gives Sum1 = 6.0731, which is stored for use in the comparison that decides whether the exchange is finished.
  • next, the pre-pruned sub-block total Sum2 is computed; the sums of the four pre-pruned sub-blocks after the exchange are given in Table 9, and adding them gives Sum2 = 7.1541.
  • the pre-pruned sub-block total Sum2 is then compared with the stored sub-block total Sum1: here Sum1 < Sum2, i.e. they are not equal, so the stored total is updated to Sum1 = Sum2; that is, since 6.0731 < 7.1541, the stored sub-block total Sum1 is set to 7.154.
  • because Sum1 differed from Sum2, the exchange operation can still make progress, so steps S2301 to S2305 are repeated: the to-be-pruned weight elements are again concentrated onto the pre-pruned sub-block positions by row-column exchanges, and it is again checked whether the pre-pruned sub-block total equals the stored sub-block total used as the comparison value, as detailed below.
  • as can be seen from the above, the pre-pruned sub-block total is computed after an exchange from the sub-block sums at the pre-pruned sub-block positions determined before that exchange, whereas the stored sub-block total is set according to the judgment result after the final exchange of each loop pass. Specifically, at each judgment, as long as the stored sub-block total and the pre-pruned sub-block total differ, the pre-pruned sub-block total is stored as the new stored sub-block total for the next comparison; the initial value of the stored sub-block total is the pre-pruned sub-block total determined at the start of the loop.
  • Step S2301 Determining the pre-trimmed sub-block as a cropping candidate.
  • the sub-block with the small value is still used as the pre-trimmed sub-block, so according to Table 9, the pre-cropped sub-block is re-selected as shown in Table 10 below.
  • Table 10 marks the pre-cut sub-block
  • row and column marking step S2302: select and mark all rows and all columns in which the pre-pruned sub-blocks are located as the transposition rows and transposition columns.
  • according to Table 10, the four sub-blocks B21, B22, B23 and B32 with the smallest sums are marked "True" as the pre-pruned sub-blocks; taking the rows and columns in which they are located as the transposition rows and columns gives R2, R3, R4, R5 and C0-C5, marked ER2, ER3, ER4, ER5 and EC0-EC5.
  • row exchange step S2303: sum the absolute values of the weights of the matrix elements in each row, and swap the rows with the smallest sums, in order, into the marked transposition rows.
  • according to Table 11, the row sums from small to large are R2<R3<R4<R5<R0<R1, to be moved in this order into the marked rows ER2, ER3, ER4, ER5; since the rows with the smallest sums already correspond one-to-one with the transposition rows, no row transposition is performed and the matrix remains as in Table 8.
  • column exchange step S2304: sum the absolute values of the weights of the matrix elements in each column, and swap the columns with the smallest sums, in order, into the marked transposition columns.
  • according to Table 12, the column sums from small to large are C0<C1<C2<C3<C4<C5, to be moved in this order into the marked columns EC0-EC5; since the columns with the smallest sums already correspond one-to-one with the transposition columns, no column transposition is performed and the matrix remains as in Table 8.
  • exchange termination judgment step S2305: the stored sub-block total Sum1 was set to 7.154 during the first row-column exchange, i.e. it equals the pre-pruned sub-block total before this second exchange. The pre-pruned sub-block total of B21, B22, B23 and B32 after the (unchanged) second exchange is Sum2 = 5.666 (Table 13, identical to Table 9); since Sum1 = 7.154 and Sum2 = 5.666 differ, Sum1 is set to 5.666 and the exchange loop continues.
  • the result of the second row-column exchange is therefore unchanged: row order R[5,4,3,2,1,0], column order C[3,5,4,1,0,2].
  • the steps are then repeated a third time. Pre-pruned sub-block determination step S2301: determine the pre-pruned sub-blocks serving as pruning candidates.
  • the sub-blocks with the smallest sums are again taken as the pre-pruned sub-blocks; according to Table 13, the four pre-pruned sub-blocks B21, B22, B23 and B32 still have the smallest sums, so the pre-pruned sub-blocks are unchanged, and consequently the transposition rows and columns and the matrix itself remain unchanged as well.
  • the result of the third row-column exchange is therefore still: row order R[5,4,3,2,1,0], column order C[3,5,4,1,0,2].
  • in this third pass, however, the stored sub-block total Sum1 equals 5.666 and the pre-pruned sub-block total Sum2 also equals 5.666; since Sum1 = Sum2, the exchanges can no longer change the total, the row-column exchange loop ends, and the pre-pruned sub-blocks at this point are taken as the sub-blocks to be pruned.
  • sub-block pruning step S240: prune away the weights of the matrix elements in the sub-blocks to be pruned, to compress the network model of the neural network.
  • it should be noted that pruning here is not limited to setting the values of the matrix elements themselves to 0; for basic modules that perform the matrix-vector multiplication in hardware through circuit devices, the devices at the positions corresponding to the pruned matrix elements can simply be omitted. More specifically, when the corresponding hardware devices are laid out to implement the weight matrix, the block-computation devices at those positions are removed.
  • in this way, through the above steps, the to-be-pruned weight elements are concentrated into matrix sub-blocks, those sub-blocks are pruned away directly, and neural network training is then carried out with the result as the initial value; array usage is reduced while the network quality is preserved, which greatly reduces resource overhead.
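As a rough illustration of the saving (assumed sizes for illustration only, not figures from the patent):

```python
n, block, n_prune = 256, 16, 128        # assumed matrix size, block size and pruned-block count
total_tiles = (n // block) ** 2         # block-computation devices needed for the full matrix
placed_tiles = total_tiles - n_prune    # devices actually laid out after block pruning
print(f"{placed_tiles}/{total_tiles} tiles placed, {100 * n_prune / total_tiles:.0f}% of the array saved")
```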
  • the method proposed by the present invention is fully applicable to neural networks based on memristors or on the TrueNorth chip.
  • by contrast, traditional network compression methods are not suitable for neural networks based on memristors or TrueNorth chips, because even if the network model is compressed to a very small size, array usage is not reduced and resource consumption cannot be lowered.
  • it should also be noted that the row and column exchange procedure given above is only an example; this exchange scheme is not the only possible one.
  • specifically, in the first row exchange above, the row sums from small to large are R3<R2<R1<R0<R5<R4, and the rows are swapped in this order into the marked rows ER2, ER3, ER4, ER5.
  • that is to say, in the present invention the rows with the smallest sums are swapped directly into the rows occupied by the smallest-sum sub-blocks, which greatly speeds up the row-column exchange and quickly gathers the small-valued to-be-pruned weight elements into the matrix sub-blocks. Alternatively, the rows could simply be reordered by their sums, e.g. swapped in the order R3<R2<R1<R0<R5<R4 into rows R[0,1,2,3,4,5] before carrying out the remaining steps; this also works, but requires more swaps and is less efficient, so it is not the preferred scheme.
  • in addition, although in the present invention the number of sub-blocks to be pruned is determined by setting the compression rate, it may also be determined by setting a threshold, as long as the purpose of compression is satisfied.
  • in summary, the core of the inventive concept is to obtain prunable sub-blocks through row-column exchanges so that the method suits block-wise computation applications, without restricting the specific exchange scheme that may be used.
  • Table 14 is the weight matrix after row and column exchange according to the compression method of the present invention (corresponding to Table 8), in which the underlined Null entries identify the pre-pruned matrix elements.
  • Table 15 is the initial matrix obtained by restoring Table 14 to the original row and column order (i.e. row order R[5,4,3,2,1,0], column order C[3,5,4,1,0,2]), in which the underlined Null entries again identify the pre-pruned matrix elements.
  • the essential difference between Table 15 and Table 14 is that the pre-pruned elements in Table 15 are scattered, whereas those in Table 14 are gathered together as 2*2 sub-blocks. In an actual deployment, the hardware layout therefore follows Table 14 (i.e. the matrix after row and column exchange) to suit block-wise computation; this is the general concept of the invention, namely the key to making the compression method fit the corresponding block computation application.
  • a practical example shows that, with the same input and the same operation, the weight matrix obtained by the compression method of the present invention and the initial weight matrix produce the same result: an initial input vector (Table 16) is first dot-multiplied with the un-exchanged initial weight matrix of Table 15, i.e. the products of the vector elements and the corresponding matrix elements are summed, giving dot-product result 1 (Table 17); before multiplying with the exchanged matrix, the input vector is reordered according to the row order R[0,1,2,3,4,5] → R[5,4,3,2,1,0], giving the row-exchanged input vector of Table 18.
  • the row-exchanged input vector is then dot-multiplied with the row-and-column-exchanged weight matrix of Table 14, giving dot-product result 2 (Table 19).
  • dot-product result 2 is then reordered according to the column order C[3,5,4,1,0,2] → C[0,1,2,3,4,5], giving the exchanged dot-product result 2 of Table 20; comparing Tables 17 and 20 shows that, for the same input vector and with the appropriate row and column exchanges, the compressed weight matrix still yields the same dot-product result as the initial matrix, so the compression does not affect the computation the device is meant to perform, while the block pruning effectively reduces the number of devices and allows a larger neural network to be realized with limited resources.
  • this also means that, when the weight matrix obtained by the compression method of the present invention is applied to data, the input data must first be exchanged according to the row and column exchange order, the exchanged input data is then multiplied by the final weight matrix, and finally the result of the matrix multiplication is reverse-exchanged according to the row and column exchange order and output as the output data.
  • Figure 6a shows the accuracy on the CIFAR10 data set after compression.
  • the CIFAR10 data set has 60,000 32*32 pixel color pictures, each of which belongs to one of 10 categories.
  • Figure 6b shows the accuracy on the MNIST data set after compression under the LeNet network, where the MNIST data set has 60,000 28*28 pixel black-and-white handwritten digit pictures.
  • Figure 6c shows the accuracy on the MNIST data set after compression under the MLP network.
  • in these figures, the abscissa is the compression rate, the ordinate is the accuracy, and lines of different colors represent arrays of different sizes.
  • at compression rates of 0-80%, the accuracy on the CIFAR10 data set is between 84% and 85%, and on the MNIST data set, whether a 16*16 array or a 256*256 array is used, the accuracy
  • is basically 98%-99% or even higher, which demonstrates from several angles that the accuracy of the data compression method of the present invention is quite good. In other words, across various data sets and network scales, the compression method of the present invention can greatly compress the network scale and save resource overhead without affecting accuracy.
  • of course, Figure 6c also shows that for some of the results the achievable compression rate is not high; this is related to the array scale being too large. For example, the group of data with the largest fluctuation uses a 256*256 array, and such a large array causes too much valid data to be pruned, which affects the accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A network model block compression method for a neural network, comprising: a weight matrix obtaining step of obtaining the weight matrix of the network model of a trained neural network; a weight matrix blocking step of dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-pruned weight concentration step of gathering, by row and column exchanges and on the basis of the sums of the absolute values of the weights of the matrix elements in the sub-blocks, the matrix elements with smaller weights into the sub-blocks to be pruned, such that the sum of the absolute values of the weights of the matrix elements in each sub-block to be pruned is smaller than that of the other sub-blocks that are not to be pruned; and a sub-block pruning step of pruning away the weights of the matrix elements in the sub-blocks to be pruned to obtain the final weight matrix, thereby compressing the network model of the neural network. The method saves resource overhead and allows a very large neural network to be deployed under limited-resource conditions.

Description

神经网络模型分块压缩方法、训练方法、计算装置及系统
技术领域
本发明总体地涉及神经网络技术领域，更具体地涉及用于神经网络的网络模型分块压缩方法、训练方法、计算装置以及硬件系统。
背景技术
随着摩尔定律逐渐失效,传统芯片工艺进步放缓,人们不得不面向新应用和新器件。近年来,神经网络(Neural Network,NN)计算取得了突破性进展,在图像识别、语言识别、自然语言处理等诸多领域均取得了很高的准确率,但神经网络需要海量计算资源,传统的通用处理器已经很难满足深度学习的计算需求,设计专用芯片已经成为了一个重要的发展方向。
具体地,神经网络的建模通常以若干神经元为一层,层与层之间相互连接来构建,图1所示的是一种链状的神经网络,图中每一个圆表示一个神经元,每一个箭头表示神经元之间的连接,每个连接均有权重,***经网络的结构不限于链状的网络结构。
神经网络的核心计算是矩阵向量乘操作,包含n个神经元的层L n产生的输出可以用长度为n的向量V n表示,与包含m个神经元的层L m全相联,连接权重可以表示成矩阵M n×m,矩阵大小为n行m列,每个矩阵元素表示一个连接的权重。则加权之后输入到L m的向量为M n×mV n,这样的矩阵向量乘法运算是神经网络最核心的计算。
由于矩阵向量乘计算量非常大,在传统的通用处理器上进行大量的矩阵乘运算需要耗费大量的时间,因此神经网络加速芯片也都是以加速矩阵乘法运算为主要的设计目标。
忆阻器阵列是一种能够实现上述矩阵乘法运算的硬件器件。每个忆阻器的电阻阻值可以在特定的输入电流下改变,并且阻值可以用来存储数据。相比传统的DRAM(动态随机存储器)和SRAM(静态随机存储器),忆阻器具有存储密度高且在失去供电的情况下也不会丢失数据的特点。
图2示出了基于忆阻器的交叉开关(Crossbar)结构的示意图。
如图2所示,通过将线路排布成交叉开关(Crossbar),并在相交点用忆阻器相连,将忆阻器的电导值G(电阻的倒数)设置为权重矩阵的矩阵元数值,通过在输入端输入电压值V,电压V与忆阻器电导G相乘并叠加输出电流,输出电流与接地电阻Rs相乘得到输出电压V’,由此在输出端即可完成矩阵向量乘法运算。以此为基本单元,可以构建基于新型器件的神经形态芯片。
由于整个过程在模拟电路下实现,具有速度快,面积小的优点。
然而,使用基于忆阻器的芯片计算也存在精度低、扰动大,数模/模数转换开销大,矩阵规模受限等不足。
类似地,TrueNorth也是能够进行矩阵向量乘法运算的芯片。TrueNorth是IBM公司的神经形态芯片,每块芯片上集成了4096个神经突触核,每个神经突触核可以处理256×256的神经突触计算。
虽然忆阻器阵列和TrueNorth芯片均可以高效地进行矩阵向量乘法运算,但是由于神经网络规模巨大,需要数量惊人的阵列,这带来了海量的资源开销,使得基于这些芯片器件实现的神经网络,很难在有限资源的条件下布置规模巨大的初始神经网络。
因此,需要将神经网络模型进行压缩,以减小资源开销,提高神经网络计算效率。
现有的Deep Compression(深度压缩)是CNN网络常见的压缩方法。深度压缩的实现主要分为三步:权值裁剪、权值共享和霍夫曼编码。
(1)权值裁剪:一、正常训练模型得到网络权值;二、将所有低于一定阈值的权值设为0;三、重新训练网络中剩下的非零权值。将以上三步反复迭代。
(2)权值共享:采用kmeans算法来将权值进行聚类,在每一个类中,所有的权值共享该类的聚类质心,因此最终存储的结果就是一个码书和索引表。
(3)霍夫曼编码:主要用于解决编码长短不一带来的冗余问题。深度压缩针对卷积层统一采用8bit编码,而全连接层采用5bit,所以采用这种熵编码能够更好地使编码bit均衡,减少冗余。
该方法能在保持精度不变将模型压缩达到90%的压缩率。
这些现有技术虽然能极大地压缩模型规模,但是并不能适配于基于忆阻 器以及TrueNorth等能够进行矩阵向量乘法运算的芯片所应用的神经网络模型。例如,因为权值裁剪裁掉的权值不集中,不能减少所需阵列的数量;使用权值共享会降低忆阻器阵列的运行速度;忆阻器阵列的权值编码是固定的,无法压缩。
因此,需要一种用于神经网络计算的网络模型压缩技术,以解决上述问题。
发明内容
鉴于上述情况,做出了本发明。
根据本发明的一个方面,提供了一种用于神经网络的网络模型分块压缩方法,包括:权重矩阵获得步骤,获得经过训练得到的神经网络的网络模型的权重矩阵;权重矩阵分块步骤,按照预定阵列大小将权重矩阵划分成由若干初始子块组成的阵列;待裁剪权值元素集中步骤,根据子块中的矩阵元素的权值绝对值和值,通过行列交换,将权值较小的矩阵元素集中到待裁剪子块中,使得该待裁剪子块中的矩阵元素的权值绝对值和值相对于不是待裁剪子块的其他子块中的矩阵元素的权值绝对值和值更小;子块裁剪步骤,将上述待裁剪子块中的矩阵元素的权值裁剪掉,获得最终的权重矩阵,以实现对神经网络的网络模型的压缩。
根据上述网络模型分块压缩方法,可以根据压缩率或根据阈值来设定所述待裁剪子块的数量。
根据上述网络模型分块压缩方法,待裁剪权值元素集中步骤可以包括如下步骤:确定预裁剪子块步骤,确定作为裁剪候选的预裁剪子块;标记行列步骤,选择并标记预裁剪子块所在的所有行和所有列作为换位行和换位列,其中,根据压缩率设定所述预裁剪子块的数量;交换行步骤和交换列步骤,对每一行中的矩阵元素的权值绝对值求和,并且将和值小的行依次与所标记的换位行进行位置交换,以及,对每一列中的矩阵元素的权值绝对值求和,并且将和值小的列依次与所标记的换位列进行位置交换;重复上述步骤,直到交换也不能改变所有预裁剪子块中的矩阵元素的权值绝对值的总和,此时的预裁剪子块作为待裁剪子块。
根据上述网络模型分块压缩方法,确定预裁剪子块步骤可以包括:计算每一个初始子块中的矩阵元素的权值绝对值的总和,将和值小的子块作为预 裁剪子块。
根据本发明的另一个方面,提供一种神经网络训练方法,包括如下步骤:对神经网络进行训练,得到网络模型的权重矩阵;根据上述网络模型分块压缩方法对所述权重矩阵进行压缩;以及迭代进行上述步骤,直至达到预定迭代中止要求。
根据本发明的另一个方面,提供一种用于神经网络计算的计算装置,包括存储器和处理器,存储器中存储有计算机可执行指令,所述计算机可执行指令包括网络模型压缩指令,当处理器执行所述网络模型压缩指令时,执行下述方法:权重矩阵获得步骤,获得经过训练得到的神经网络的网络模型的权重矩阵;权重矩阵分块步骤,按照预定阵列大小将权重矩阵划分成由若干初始子块组成的阵列;待裁剪权值元素集中步骤,根据子块中的矩阵元素的权值绝对值和值,通过行列交换,将权值较小的矩阵元素集中到待裁剪子块中,使得该待裁剪子块中的矩阵元素的权值绝对值和值相对于不是待裁剪子块的其他子块中的矩阵元素的权值绝对值和值更小;子块裁剪步骤,将上述待裁剪子块中的矩阵元素的权值裁剪掉,获得最终的权重矩阵,以实现对神经网络的网络模型的压缩。
根据上述计算装置,可以根据压缩率或根据阈值来设定所述待裁剪子块的数量。
根据上述计算装置,待裁剪权值元素集中步骤可以包括如下步骤:确定预裁剪子块步骤,确定作为裁剪候选的预裁剪子块;标记行列步骤,选择并标记预裁剪子块所在的所有行和所有列作为换位行和换位列,其中,根据压缩率设定所述预裁剪子块的数量;交换行步骤和交换列步骤,对每一行中的矩阵元素的权值绝对值求和,并且将和值小的行依次与所标记的换位行进行位置交换,以及,对每一列中的矩阵元素的权值绝对值求和,并且将和值小的列依次与所标记的换位列进行位置交换;重复上述步骤,直到交换也不能改变所有预裁剪子块中的矩阵元素的权值绝对值的总和,此时的预裁剪子块作为待裁剪子块。
根据上述计算装置,确定预裁剪子块步骤可以还包括:计算每一个初始子块中的矩阵元素的权值绝对值的总和,将和值小的子块作为预裁剪子块。
根据上述计算装置,所述计算机可执行指令可以包括网络模型应用指令,当处理器执行所述网络模型应用指令时,执行下述方法:输入数据处理步骤, 根据行列交换顺序,对输入数据进行交换;矩阵乘法运算步骤,将交换后的输入数据与执行所述网络模型压缩指令后得到的最终的权重矩阵进行矩阵乘法运算;和输出数据处理步骤,根据行列交换顺序,对矩阵乘法运算的结果进行反向交换并且作为输出数据输出。
根据上述计算装置,所述计算机可执行指令可以还包括网络模型训练指令,当处理器执行所述网络模型训练指令时,执行下述方法:对神经网络进行训练,得到网络模型的初始权重矩阵;执行所述网络模型压缩指令得到压缩后的最终的权重矩阵;执行所述网络模型应用指令进行训练;和迭代进行上述压缩和训练步骤,直至达到预定的迭代中止要求。
根据本发明的另一方面，提供一种采用上述网络模型分块压缩方法、根据上述的神经网络训练方法和根据上述的计算装置进行网络模型压缩、应用和训练的硬件系统，包括：神经网络硬件芯片，神经网络硬件芯片具有通过电路器件以硬件形式执行矩阵向量乘的操作的基本模块，其中，与待裁剪子块中的矩阵元素所对应的位置未设置电路器件。
根据上述硬件系统，所述电路器件可以为忆阻器或TrueNorth芯片的神经突触。
根据本发明的一个方面,提供了一种用于神经网络的网络模型分块压缩方法,从而节省资源开销,以在有限资源的条件下布置规模巨大的神经网络。
附图说明
从下面结合附图对本发明实施例的详细描述中,本发明的这些和/或其它方面和优点将变得更加清楚并更容易理解,其中:
图1示出了链状的神经网络的示意图。
图2示出了基于忆阻器的交叉开关结构的示意图。
图3示出了根据本发明的神经网络的网络模型分块压缩技术的应用情境的示意图。
图4示出了根据本发明的网络模型分块压缩方法的总体流程图。
图5示出了根据上述方法的待裁剪权值元素集中步骤的分解流程图。
图6a-6c示出了在多种数据集和不同的网络规模下,采用根据本发明的压缩方法在不同的压缩率的情况下的正确率。
具体实施方式
为了使本领域技术人员更好地理解本发明,下面结合附图和具体实施方式对本发明作进一步详细说明。
图3示出了根据本发明的神经网络的网络模型分块压缩技术的应用情境1000的示意图。
如图3所示,本公开的总体发明构思在于:对神经网络应用1100进行初步神经网络训练,学习得到网络模型1200,对该网络模型1200通过网络模型分块压缩方法1300以预定的压缩率进行分块压缩,然后重新进行训练,然后再压缩-再训练-再压缩-再训练…,如此迭代,以便微调并且学习来提升准确率,直至达到预定的迭代中止要求,从而确定最终的网络模型1400,由此可以在不影响效果的情况下,减少神经网络芯片所需要的分块运算单元器件,进而在有限资源的条件下布置规模巨大的神经网络。
一、网络模型分块压缩方法
图4和图5示出了根据本发明一实施例的网络模型分块压缩方法1300的流程图,其中图4示出了根据本发明的网络模型分块压缩方法的总体流程图,图5示出了根据上述方法的待裁剪权值元素集中步骤的分解流程图。具体地,所述网络模型分块压缩方法包括如下步骤:
1.权重矩阵获得步骤S210,获得经过训练得到的神经网络的网络模型的权重矩阵。
这里,为了更好地说明本发明的方法,假设初始权重矩阵为6*6大小,并且以下面的表1的矩阵进一步说明。
表1初始权重矩阵
0.9373 0.0419 0.7959 0.8278 -0.4288 0.6854
0.3311 0.6683 0.8686 0.1087 0.3058 -0.6641
0.0879 -0.7366 0.5453 -0.017 -0.8295 0.5781
0.3964 0.0769 -0.4809 -0.1507 0.0296 -0.2923
0.9786 -0.9656 0.8449 0.6284 -0.9309 0.4138
0.754 0.7859 -0.8424 0.9 -0.4225 0.0847
2.权重矩阵分块步骤S220:按照预定阵列大小将权重矩阵划分成由若干初始子块组成的阵列。
对上述权重矩阵,按照例如矩阵大小为2*2的子块进行压缩,则上述矩阵被分为3*3=9的子块阵列。
这里,本领域技术人员可以理解,所划分的子块大小是可以根据权重矩阵的规模以及压缩率的需要设定的,例如也可以设置4*4,8*8…256*256这样的子块矩阵大小。
3.待裁剪权值元素集中步骤S230,根据子块中的矩阵元素的权值绝对值和值(下文中简称为子块和值),通过行列交换,将权值较小的矩阵元素集中到待裁剪子块中,使得该待裁剪子块的和值相对于不是待裁剪子块的其他子块的和值更小,其中,根据压缩率设定所述待裁剪子块的数量;
更具体地,图5示出了根据上述方法的待裁剪权值元素集中步骤S230的分解流程图,包括以下步骤:
a.确定预裁剪子块步骤S2301:确定作为裁剪候选的预裁剪子块。
在本实施方式中,计算每一个初始子块和值,将和值小的子块作为预裁剪子块。
具体地,首先,对表1的权重矩阵取绝对值,获得表2的矩阵。
表2取绝对值得到的矩阵
  C0 C1 C2 C3 C4 C5
R0 0.9373 0.0419 0.7959 0.8278 0.4288 0.6854
R1 0.3311 0.6683 0.8686 0.1087 0.3058 0.6641
R2 0.0879 0.7366 0.5453 0.017 0.8295 0.5781
R3 0.3964 0.0769 0.4809 0.1507 0.0296 0.2923
R4 0.9786 0.9656 0.8449 0.6284 0.9309 0.4138
R5 0.754 0.7859 0.8424 0.9 0.4225 0.0847
为了便于理解后续的行列交换,在表2中,顺序标记出换位前的权重矩阵的行列序号,其中行以R打头,列以C打头。
其次,对表2的矩阵,以2*2的子块为单位计算子块和值,即获得表3。
表3子块和值
1.9786 2.601 2.0841
1.2978 1.1939 1.7295
3.4841 3.2157 1.8519
最后,选取和值最小的子块作为预裁剪子块,并且标记为True,其他子块标记为False,获得表4,其中子块序号以B打头。
表4预裁剪子块
B11:False B12:False B13:False
B21:True B22:True B23:True
B31:False B32:False B33:True
至于所标记的预裁剪子块的数量,根据压缩率确定。具体地,假设压缩率为50%,那么设定预裁剪子块的数量应该为子块总数×压缩率,9*50%=4.5,然后取整为4。因此,结合表3和表4可知,总和数值最小的4个子块被标记为True。
b.标记行列步骤S2302:选择并标记预裁剪子块所在的所有行和所有列作为换位行和换位列,并且对换位行和换位列进行标记。
根据表3和表4可知,和值最小的四个子块B21、B22、B23和B33作为预裁剪子块被标记“True”。那么,以预裁剪子块所在的行列作为换位行和换位列,则换位行为R2、R3、R4、R5以及换位列为C0-C5,如此标记换位行为ER2、ER3、ER4、ER5以及换位列为EC0-EC5,其中换位行以ER打头,换位列以EC打头,区别于以R和C打头的一般的行列。
c.交换行步骤S2303:对每一行中的矩阵元素的权值绝对值求和,并且将和值小的行依次与所标记的换位行进行位置交换。
对表2的各行总和进行计算,得到下表5。
表5各行总和
R0 3.7171
R1 2.9466
R2 2.7944
R3 1.4268
R4 4.7622
R5 3.7895
根据表5可知,各行的总和,从小到大的顺序为R3<R2<R1<R0<R5<R4,那么按照此顺序依次换到ER2、ER3、ER4、ER5这些标记行,即:
R3换位到ER2→R【0 1 3 2 4 5】(此时由于R3和R2已经对换,所以不再对R2进行换位了);
R1换位到ER4→R【0 4 3 2 1 5】;
R0换位到ER5→R【5 4 3 2 1 0】。
此时,获得如下表6,
表6行交换后的矩阵
R5 0.754 0.7859 0.8424 0.9 0.4225 0.0847
R4 0.9786 0.9656 0.8449 0.6284 0.9309 0.4138
R3 0.3964 0.0769 0.4809 0.1507 0.0296 0.2923
R2 0.0879 0.7366 0.5453 0.017 0.8295 0.5781
R1 0.3311 0.6683 0.8686 0.1087 0.3058 0.6641
R0 0.9373 0.0419 0.7959 0.8278 0.4288 0.6854
也就是说,现在交换后获得的第1行为原矩阵的第R5行,第2行为原矩阵的第R4行,…,以此类推。
d.交换列步骤S2304:对每一列中的矩阵元素的权值绝对值求和,并且将和值小的列依次与所标记的换位列进行位置交换。
对表5的各列总和进行计算,得到下表7。
表7各列总和
C0 C1 C2 C3 C4 C5
3.4853 3.2752 4.378 2.6326 2.9471 2.7184
根据表7可知,各列的总和,从小到大的顺序为C3<C5<C4<C1<C0<C2,那么按照此顺序依次换到EC0、EC1、EC2、EC3、EC4、EC5这些标记列,即:
C3换位到EC0→C【3 1 2 0 4 5】
C5换位到EC1→C【3 5 2 0 4 1】;
C4换位到EC2→C【3 5 4 0 2 1】;
C1换位到EC3→C【3 5 4 1 2 0】;
C0换位到EC4→C【3 5 4 1 0 2】;
C2换位到EC5→C【3 5 4 1 0 2】。
此时,获得如下表8。
表8列交换后的矩阵
C3 C5 C4 C1 C0 C2
0.9 0.0847 0.4225 0.7859 0.754 0.8424
0.6284 0.4138 0.9309 0.9656 0.9786 0.8449
0.1507 0.2923 0.0296 0.0769 0.3964 0.4809
0.017 0.5781 0.8295 0.7366 0.0879 0.5453
0.1087 0.6641 0.3058 0.6683 0.3311 0.8686
0.8278 0.6854 0.4288 0.0419 0.9373 0.7959
也就是说,现在交换后获得的第1列为原矩阵的第C3列,第2列为原矩阵的第C5列…,以此类推。
本领域技术人员可以理解,上述行或列的标记以及交换操作没有顺序约束,先进行行交换,再进行列交换,或者反过来,都是可以的。
因此,第一次行列交换处理的结果是:
行顺序为:R【5,4,3,2,1,0】
列顺序为:C【3,5,4,1,0,2】
e.判断交换是否结束步骤S2305:
首先,计算存储子块总和Sum1。Sum1为在未进行此次行列交换之前的预裁剪子块的子块总和。具体地,根据表3,Sum1为未进行此次行列交换之前的四个预裁剪子块B21、B22、B23和B33的和值之总和,即Sum1=6.0731,将其作为存储子块总和存储起来,以供比较并判断行列交换是否完成来使用。
其次,计算预裁剪子块总和Sum2。四个预裁剪子块B21、B22、B23和B33的和值如下表9所示,将四个和值相加获得预裁剪子块总和Sum2=7.1541。
表9子块和
2.0269 3.1049 3.4199
1.0381 1.6726 1.5105
2.286 1.4448 2.9329
再次,将预裁剪子块总和Sum2与存储子块总和Sum1进行比较。此时,存储子块总和Sum1<预裁剪子块总和Sum2,两者不相等,则设定存储子块总和Sum1=预裁剪子块总和Sum2。即由于6.0731<7.1541,则将存储子块总和Sum1设定为7.154。
这里,由于Sum1小于Sum2,说明当前交换操作仍可以继续进行,因此重复步骤S2301~S2305,即再次通过行列交换操作将待裁剪权值元素集中到预裁剪子块位置,并且判断是否预裁剪子块总和等于作为比较值的存储子块总和,如下详述。
从上面的说明可以看出,预裁剪子块总和是在交换处理之后,根据交换处理之前确定的预裁剪子块的位置,进行子块和值计算得到的,而存储子块总和则是在每次循环最后的交换处理之后,根据判断结果设定的。具体地,每次判断时,只要存储子块总和与预裁剪子块总和不相同,就要将该预裁剪子块总和作为存储子块总和存储,以供下一次比较使用。在上面的过程中,存储子块总和的初值就按照循环初始确定的预裁剪子块总和来设定了。
f.重复上述步骤S2301-2305
√确定预裁剪子块步骤S2301:确定作为裁剪候选的预裁剪子块。
此时仍将和值小的子块作为预裁剪子块,因此根据表9,重新选取预裁剪子块,如下表10所示。
表10标记预裁剪子块
B11:False B12:False B13:False
B21:True B22:True B23:True
B31:False B32:True B33:False
√标记行列步骤S2302:选择并标记预裁剪子块所在的所有行和所有列
作为换位行和换位列,并且对换位行和换位列进行标记。
根据表10可知,和值最小的四个子块B21、B22、B23和B32作为预裁剪子块被标记“True”,以预裁剪子块所在的行列作为换位行和换位列,则换位行和换位列包括:R2、R3、R4、R5,以及C0-C5,如此标记换位行为ER2、ER3、ER4、ER5以及换位列为EC0-EC5。
√交换行步骤S2303:对每一行中的矩阵元素的权值绝对值求和,并且
将和值小的行依次与所标记的换位行进行位置交换。
表11各行总和
R0 3.7895
R1 4.7622
R2 1.4268
R3 2.7944
R4 2.9466
R5 3.7171
根据表11可知,各行的总和,从小到大的顺序为R2<R3<R4<R5<R0<R1,那么按照此顺序依次换到ER2、ER3、ER4、ER5这些标记行。由于此时权值小的行的顺序与换位行的顺序一一对应,所以不再进行行换位了,仍为表8。
√交换列步骤S2304:对每一列中的矩阵元素的权值绝对值求和,并且
将和值小的列依次与所标记的换位列进行位置交换。
对表8的各列总和进行计算,得到下表12。
表12各列总和
C0 C1 C2 C3 C4 C5
2.6326 2.7184 2.9471 3.2752 3.4853 4.378
根据表12可知,各列的总和,从小到大的顺序为C0<C1<C2<C3<C4<C5,那么按照此顺序依次换到EC0、EC1、EC2、EC3、EC4、EC5这些标记列。由于此时权值小的列的顺序与换位列的顺序一一对应,所以不再进行列换位了,仍为表8。
√判断交换是否结束步骤S2305:
此时,存储子块总和Sum1已经在第一次行列交换处理中设定为7.154,即进行第二次行列交换之前的预裁剪子块总和。
计算预裁剪子块总和,四个预裁剪子块B21、B22、B23和B32的和值如下表13(由于第二次行列交换,如上所述并未进行,所以表13与表9相同)所示,将四个和值相加获得预裁剪子块总和Sum2=5.666。将其与存储子块总和Sum1=7.154进行比较。此时,仍然依据Sum1和Sum2不同则将设定 存储子块总和Sum1=预裁剪子块总和Sum2的原则处理。Sum1=7.154>Sum2=5.666,将Sum1设定为5.666。
表13子块和
2.0269 3.1049 3.4199
1.0381 1.6726 1.5105
2.286 1.4448 2.9329
这里,当Sum1大于Sum2,说明当前交换操作仍可以继续进行,因此重复步骤S2301~S2305,即再次通过行列交换操作将待裁剪权值元素集中到预裁剪子块位置,并且判断是否预裁剪子块总和等于作为比较值的存储子块总和,如下详述。
因此,第二次行列交换处理的结果是:
行顺序为:R【5,4,3,2,1,0】
列顺序为:C【3,5,4,1,0,2】
h.重复上述步骤S2301-2305
√确定预裁剪子块步骤S2301:确定作为裁剪候选的预裁剪子块。
此时仍将和值小的子块作为预裁剪子块,根据表13可知,四个预裁剪子块B21、B22、B23和B32的和值仍为最小子块和值,所以预裁剪子块不变。
在预裁剪子块不变的情况下,则换位行和换位列也不变,同时由于在第二次处理中表8也未改变行列,因此,在第三次行列交换的处理中,实际的行列并未进行交换,均与第二次行列交换处理的各个步骤结果相同。
因此,第三次行列交换处理的结果仍然是:
行顺序为:R【5,4,3,2,1,0】
列顺序为:C【3,5,4,1,0,2】
然而,不同于第二次行列交换处理的是,在第三次行列交换的处理中,存储子块总和Sum1等于5.666,并且预裁剪子块总和Sum2也等于5.666,也就是说存储子块总和Sum1=预裁剪子块总和Sum2。当满足该条件时,交换行列的处理循环结束。
因此,根据上面的过程可以得知,当存储子块总和Sum1不等于预裁剪子块总和Sum2时,说明根据行列交换操作获得的结果仍没有稳定下来,仍有可能改变,因此继续进行行列交换。只有当存储子块总和Sum1等于预裁 剪子块总和Sum2时,才说明根据行列交换操作获得的结果已经稳定下来,使得该待裁剪子块中的矩阵元素的权值绝对值和值相对于不是待裁剪子块的其他子块中的矩阵元素的权值绝对值和值更小,也就是说已经确定地将小权值的矩阵元素集中到待裁剪子块中了,可以进行裁剪,如此才会从行列交换处理的循环中结束退出并进入下一步骤。
4.子块裁剪步骤S240:将上述待裁剪子块中的矩阵元素的权值裁剪掉,以实现对神经网络的网络模型的压缩。
需要说明的是,这里的裁剪,并不限于将矩阵本身元素的数值设置为0,对于通过电路器件以硬件形式执行矩阵向量乘的操作的基本模块,可以直接将矩阵元素所对应的位置的器件省略。更具体地,在布置相应的硬件器件以实现该权重矩阵时,该对应位置的进行分块计算的器件被去掉。
如此,通过上述步骤,将待裁剪权值元素集中到矩阵子块中,然后直接裁掉该矩阵子块,再以此为初始值进行神经网络训练,在保证网络效果的前提下,减少阵列的使用,从而极大地减少了资源开销。
本发明提出的方法完全适用于基于忆阻器以及TrueNorth芯片的神经网络。相比,传统的网络压缩方法并不适用于基于忆阻器或TrueNorth芯片的神经网络,因为即使将网络模型压缩得很小,也不能减少阵列的使用,无法减少资源消耗。
另外,需要说明的是,上述所给出的行列交换的步骤仅做示例,但这样的行列交换方式并不是唯一的可选方式。具体而言,例如在上面的一次行交换处理中,各行的总和,从小到大的顺序为R3<R2<R1<R0<R5<R4,那么按照此顺序依次换到ER2、ER3、ER4、ER5这些标记行。也就是说,在本发明中,采用选择和值最小的子块所在的行作为交换行来进行交换,这样可以大大加快行列交换的效率,以更快速地将和值小的待裁剪权值元素交换并集中到矩阵子块中。然而,显然,也可以直接按照和值的大小顺序将各个行交换到各个顺序行,例如,各行的总和,从小到大的顺序为R3<R2<R1<R0<R5<R4,那么就按照此顺序依次换到R【0,1,2,3,4,5】,然后再循序进行其他步骤,也是可以的。只是这样的交换次数会增多,而且效率较低,不是优选方案。
另外,虽然在本发明中通过设定压缩率来确定待裁剪子块的数量,但是也可以通过设定阈值来确定待裁剪子块的数量,只要能够满足压缩目的即可。
综上所述,本发明的发明构思的核心在于通过行列交换获得可裁剪的子块,以适用于子块运算应用,而不限制具体可采用的交换方式。
二、实际示例
下面给出实际的示例来说明以同样的输入以及同样的运算方法,采用本发明的压缩方法得到的权重矩阵与初始权重矩阵均会输出相同的计算结果。
表14为根据本发明的压缩方法进行行列交换后的权重矩阵(对应于表8),其中黑体下划线的Null标识出预裁剪矩阵元素。
表14进行行列交换后的权重矩阵
0.9 0.0847 -0.4225 0.7859 0.754 -0.8424
0.6284 0.4138 -0.9309 -0.9656 0.9786 0.8449
Null Null Null Null Null Null
Null Null Null Null Null Null
0.1087 -0.6641 Null Null 0.3311 0.8686
0.8278 0.6854 Null Null 0.9373 0.7959
表15为将表14按照初始的行列顺序(即下面的顺序)还原而成的初始矩阵,其中黑体下划线的Null标识出预裁剪矩阵元素。
行顺序为:R【5,4,3,2,1,0】
列顺序为:C【3,5,4,1,0,2】
表15未行列交换的权重矩阵
0.9373 Null 0.7959 0.8278 Null 0.6854
0.3311 Null 0.8686 0.1087 Null -0.6641
Null Null Null Null Null Null
Null Null Null Null Null Null
0.9786 -0.9656 0.8449 0.6284 -0.9309 0.4138
0.754 0.7859 -0.8424 0.9 -0.4225 0.0847
从上面可以看出,表15与表14的本质区别在于,表15中的预裁剪元素被分散,而表14的预裁剪元素以2*2子块的形式聚集在一起。因此在实际布置中,根据表14(即经过行列交换后的矩阵)来实现硬件布置,以适应分块 计算的需要,这也是本发明的发明总体构思之所在,也就是使压缩方法适用于相应的分块计算应用的关键。
下面,基于表14和15给出两者的输入和输出的比较。
1.对于未行列交换的初始权重矩阵(表15)
假设输入向量数据为:
表16初始的输入向量
0.3769 0.9087 0.6857 0.0513 0.6081 0.9523
与表15中的未行列交换的初始权重矩阵点乘,即对向量和矩阵的相应元素的乘积求和,输出点乘结果1为:
表17点乘结果1
1.9673 0.1612 0.8008 1.6500 -0.9684 -0.0128
2.对于经过行列交换的权重矩阵(表14)
在与表14中的行列交换后的权重矩阵点乘之前,需要首先按照行顺序R【0,1,2,3,4,5】→R【5,4,3,2,1,0】对表16给出的初始的输入向量进行调换,即
表18经过行交换的输入向量
0.9523 0.6081 0.0513 0.6857 0.9087 0.3769
将经过行交换的输入向量与表14中的经过行列交换的权重矩阵点乘,即对向量和矩阵的相应元素的乘积求和,输出点乘结果2为:
表19点乘结果2
1.6500 -0.0128 -0.9684 0.1612 1.9673 0.8008
再按照列顺序C【3,5,4,1,0,2】→C【0,1,2,3,4,5】对点乘结果2进行调换。
表20经过交换的点乘结果2
1.9673 0.1612 0.8008 1.6500 -0.9684 -0.0128
比较上面的表17和表20的数据结果,可以看出,在同一输入向量的情况下,只要经过合理的行列交换,根据本发明的压缩方法获得的经过数据压缩的权重矩阵,仍然能够获得与初始矩阵一致的点乘结果。也就是说,根据本发明的压缩方法,并不会影响器件所要实现的计算功能,而且由于能够实现分块压缩,还可以有效地减少器件数量,在有限的资源情况下,实现更大规模的神经网络布置。
当然,这也意味着,根据本发明的压缩方法得到的权重矩阵,在应用于数据时,需要在处理前根据行列交换顺序,对输入数据进行交换,然后将交换后的输入数据与最终的权重矩阵进行矩阵乘法运算,最后再根据行列交换顺序,对矩阵乘法运算的结果进行反向交换并且作为输出数据输出。
三、效果验证
为了验证本发明的算法的实际压缩效果,申请人做了一系列的实验。
图6a示出CIFAR10数据集在压缩后的正确率,其中CIFAR10数据集有60000个32*32像素的彩色图片,每张图片都属于10种分类之一。图6b示出MNIST数据集在LENET网络下压缩后的正确率,其中MNIST数据集有60000个28*28像素的黑白手写数字图片。图6c示出MNIST数据集在MLP网络下压缩后的正确率。
图中,横坐标为压缩率,纵坐标为正确率,不同颜色的线代表不同规模的阵列。在0-80%的压缩率的情况下,对CIFAR10数据集,正确率在84%-85%之间,对MNIST数据集,不管是采用16*16的阵列,还是256*256的阵列,正确率基本都在98%-99%,甚至更高,这也从多个角度证明了本发明的数据压缩方法的正确率是相当好的。换言之,在多种数据集和不同的网络规模下,本发明的压缩方法都可以在不影响正确率的前提下,极大的压缩网络规模,节约资源开销。
当然,从图6c中也可以看出,部分结果的压缩率不高,这样的结果是跟阵列规模太大有关的,例如波动最大的一组数据采用的是256*256的阵列规模,这样大的阵列规模导致有效数据被裁剪过多,从而影响了正确率。
另外,随着压缩率的上升,正确率会有所下降,这也是牺牲正确率换取更多的压缩所必然带来的结果。对于不同的应用情况,本领域技术人员可以根据实际需要选择合适的压缩率以保证足够的正确率。
需要说明的是,附图中按某顺序显示了各个步骤,并不表示这些步骤只能按照显示或者描述的顺序执行,只要不存在逻辑矛盾,步骤执行顺序可以不同于所显示的。
以上已经描述了本发明的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。因此,本发明的保护范围应该以权利要求的保护范围为准。

Claims (13)

  1. 一种用于神经网络的网络模型分块压缩方法,包括:
    权重矩阵获得步骤,获得经过训练得到的神经网络的网络模型的权重矩阵;
    权重矩阵分块步骤,按照预定阵列大小将权重矩阵划分成由若干初始子块组成的阵列;
    待裁剪权值元素集中步骤,根据子块中的矩阵元素的权值绝对值和值,通过行列交换,将权值较小的矩阵元素集中到待裁剪子块中,使得该待裁剪子块中的矩阵元素的权值绝对值和值相对于不是待裁剪子块的其他子块中的矩阵元素的权值绝对值和值更小;和
    子块裁剪步骤,将上述待裁剪子块中的矩阵元素的权值裁剪掉,获得最终的权重矩阵,以实现对神经网络的网络模型的压缩。
  2. 根据权利要求1所述的网络模型分块压缩方法,其中,根据压缩率或根据阈值来设定所述待裁剪子块的数量。
  3. 根据权利要求1所述的网络模型分块压缩方法,其中,待裁剪权值元素集中步骤包括如下步骤:
    确定预裁剪子块步骤,确定作为裁剪候选的预裁剪子块;
    标记行列步骤,选择并标记预裁剪子块所在的所有行和所有列作为换位行和换位列,其中,根据压缩率设定所述预裁剪子块的数量;
    交换行步骤和交换列步骤,对每一行中的矩阵元素的权值绝对值求和,并且将和值小的行依次与所标记的换位行进行位置交换,以及,对每一列中的矩阵元素的权值绝对值求和,并且将和值小的列依次与所标记的换位列进行位置交换;
    重复上述步骤,直到交换也不能改变所有预裁剪子块中的矩阵元素的权值绝对值的总和,此时的预裁剪子块作为待裁剪子块。
  4. 根据权利要求3所述的网络模型分块压缩方法,其中,确定预裁剪子块步骤还包括:计算每一个初始子块中的矩阵元素的权值绝对值的总和,将和值小的子块作为预裁剪子块。
  5. 一种神经网络训练方法,包括如下步骤:
    对神经网络进行训练,得到网络模型的权重矩阵;
    根据权利要求1-4所述的网络模型分块压缩方法对所述权重矩阵进行压缩;和
    迭代进行上述步骤,直至达到预定迭代中止要求。
  6. 一种用于神经网络计算的计算装置,包括存储器和处理器,存储器中存储有计算机可执行指令,所述计算机可执行指令包括网络模型压缩指令,当处理器执行所述网络模型压缩指令时,执行下述方法:
    权重矩阵获得步骤,获得经过训练得到的神经网络的网络模型的权重矩阵;
    权重矩阵分块步骤,按照预定阵列大小将权重矩阵划分成由若干初始子块组成的阵列;
    待裁剪权值元素集中步骤,根据子块中的矩阵元素的权值绝对值和值,通过行列交换,将权值较小的矩阵元素集中到待裁剪子块中,使得该待裁剪子块中的矩阵元素的权值绝对值和值相对于不是待裁剪子块的其他子块中的矩阵元素的权值绝对值和值更小;
    子块裁剪步骤,将上述待裁剪子块中的矩阵元素的权值裁剪掉,获得最终的权重矩阵,以实现对神经网络的网络模型的压缩。
  7. 根据权利要求6所述的计算装置，其中，根据压缩率或根据阈值来设定所述待裁剪子块的数量。
  8. 根据权利要求6所述的计算装置，其中，待裁剪权值元素集中步骤包括如下步骤：
    确定预裁剪子块步骤,确定作为裁剪候选的预裁剪子块;
    标记行列步骤,选择并标记预裁剪子块所在的所有行和所有列作为换位行和换位列,其中,根据压缩率设定所述预裁剪子块的数量;
    交换行步骤和交换列步骤,对每一行中的矩阵元素的权值绝对值求和,并且将和值小的行依次与所标记的换位行进行位置交换,以及,对每一列中的矩阵元素的权值绝对值求和,并且将和值小的列依次与所标记的换位列进行位置交换;
    重复上述步骤,直到交换也不能改变所有预裁剪子块中的矩阵元素的权值绝对值的总和,此时的预裁剪子块作为待裁剪子块。
  9. 根据权利要求8所述的计算装置，其中，确定预裁剪子块步骤还包括：计算每一个初始子块中的矩阵元素的权值绝对值的总和，将和值小的子块作为预裁剪子块。
  10. 根据权利要求6所述的计算装置，其中，所述计算机可执行指令还包括网络模型应用指令，当处理器执行所述网络模型应用指令时，执行下述方法：
    输入数据处理步骤,根据行列交换顺序,对输入数据进行交换;
    矩阵乘法运算步骤,将交换后的输入数据与执行所述网络模型压缩指令后得到的最终的权重矩阵进行矩阵乘法运算;和
    输出数据处理步骤,根据行列交换顺序,对矩阵乘法运算的结果进行反向交换并且作为输出数据输出。
  11. 根据权利要求10所述的计算装置，其中，所述计算机可执行指令还包括网络模型训练指令，当处理器执行所述网络模型训练指令时，执行下述方法：
    对神经网络进行训练,得到网络模型的初始权重矩阵;
    执行所述网络模型压缩指令得到压缩后的最终的权重矩阵;
    执行所述网络模型应用指令进行训练;和
    迭代进行上述压缩和训练步骤,直至达到预定的迭代中止要求。
  12. 一种采用如权利要求1-4所述的网络模型分块压缩方法、如权利要求5所述的神经网络训练方法和如权利要求6-11所述的计算装置进行网络模型压缩、应用和训练的硬件系统，包括：
    神经网络硬件芯片,神经网络硬件芯片具有通过电路器件以硬件形式执行矩阵向量乘的操作的基本模块,
    其中,与待裁剪子块中的矩阵元素所对应的位置未设置电路器件。
  13. 根据权利要求12所述的硬件系统，其中，所述电路器件为忆阻器或TrueNorth芯片的神经突触。
PCT/CN2017/119819 2017-12-29 2017-12-29 神经网络模型分块压缩方法、训练方法、计算装置及*** WO2019127362A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201780042629.4A CN109791628B (zh) 2017-12-29 2017-12-29 神经网络模型分块压缩方法、训练方法、计算装置及***
PCT/CN2017/119819 WO2019127362A1 (zh) 2017-12-29 2017-12-29 神经网络模型分块压缩方法、训练方法、计算装置及***

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/119819 WO2019127362A1 (zh) 2017-12-29 2017-12-29 神经网络模型分块压缩方法、训练方法、计算装置及***

Publications (1)

Publication Number Publication Date
WO2019127362A1 true WO2019127362A1 (zh) 2019-07-04

Family

ID=66495633

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/119819 WO2019127362A1 (zh) 2017-12-29 2017-12-29 神经网络模型分块压缩方法、训练方法、计算装置及***

Country Status (2)

Country Link
CN (1) CN109791628B (zh)
WO (1) WO2019127362A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115724A (zh) * 2020-07-23 2020-12-22 云知声智能科技股份有限公司 一种多领域神经网络在垂直领域微调的优化方法及***
CN113642710A (zh) * 2021-08-16 2021-11-12 北京百度网讯科技有限公司 一种网络模型的量化方法、装置、设备和存储介质
CN114781650A (zh) * 2022-04-28 2022-07-22 北京百度网讯科技有限公司 一种数据处理方法、装置、设备以及存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659731B (zh) * 2018-06-30 2022-05-17 华为技术有限公司 一种神经网络训练方法及装置
CN113052292B (zh) * 2019-12-27 2024-06-04 北京硅升科技有限公司 卷积神经网络技术方法、装置及计算机可读存储介质
CN111259396A (zh) * 2020-02-01 2020-06-09 贵州师范学院 一种基于深度学习卷积神经网络的计算机病毒检测方法及深度学习神经网络的压缩方法
CN112861549B (zh) * 2021-03-12 2023-10-20 云知声智能科技股份有限公司 一种训练翻译模型的方法和设备
CN113052307B (zh) * 2021-03-16 2022-09-06 上海交通大学 一种面向忆阻器加速器的神经网络模型压缩方法及***
CN113252984B (zh) * 2021-07-06 2021-11-09 国网湖北省电力有限公司检修公司 基于蓝牙绝缘子测量仪的测量数据处理方法及***

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650928A (zh) * 2016-10-11 2017-05-10 广州视源电子科技股份有限公司 一种神经网络的优化方法及装置
CN106779068A (zh) * 2016-12-05 2017-05-31 北京深鉴智能科技有限公司 调整人工神经网络的方法和装置
CN106919942A (zh) * 2017-01-18 2017-07-04 华南理工大学 用于手写汉字识别的深度卷积神经网络的加速压缩方法
CN107239825A (zh) * 2016-08-22 2017-10-10 北京深鉴智能科技有限公司 考虑负载均衡的深度神经网络压缩方法
CN107368885A (zh) * 2017-07-13 2017-11-21 北京智芯原动科技有限公司 基于多粒度剪枝的网络模型压缩方法及装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9400955B2 (en) * 2013-12-13 2016-07-26 Amazon Technologies, Inc. Reducing dynamic range of low-rank decomposition matrices
CN106297778A (zh) * 2015-05-21 2017-01-04 中国科学院声学研究所 数据驱动的基于奇异值分解的神经网络声学模型裁剪方法
CN106529670B (zh) * 2016-10-27 2019-01-25 中国科学院计算技术研究所 一种基于权重压缩的神经网络处理器、设计方法、芯片
CN106779051A (zh) * 2016-11-24 2017-05-31 厦门中控生物识别信息技术有限公司 一种卷积神经网络模型参数处理方法及***

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239825A (zh) * 2016-08-22 2017-10-10 北京深鉴智能科技有限公司 考虑负载均衡的深度神经网络压缩方法
CN106650928A (zh) * 2016-10-11 2017-05-10 广州视源电子科技股份有限公司 一种神经网络的优化方法及装置
CN106779068A (zh) * 2016-12-05 2017-05-31 北京深鉴智能科技有限公司 调整人工神经网络的方法和装置
CN106919942A (zh) * 2017-01-18 2017-07-04 华南理工大学 用于手写汉字识别的深度卷积神经网络的加速压缩方法
CN107368885A (zh) * 2017-07-13 2017-11-21 北京智芯原动科技有限公司 基于多粒度剪枝的网络模型压缩方法及装置

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115724A (zh) * 2020-07-23 2020-12-22 云知声智能科技股份有限公司 一种多领域神经网络在垂直领域微调的优化方法及***
CN112115724B (zh) * 2020-07-23 2023-10-20 云知声智能科技股份有限公司 一种多领域神经网络在垂直领域微调的优化方法及***
CN113642710A (zh) * 2021-08-16 2021-11-12 北京百度网讯科技有限公司 一种网络模型的量化方法、装置、设备和存储介质
CN113642710B (zh) * 2021-08-16 2023-10-31 北京百度网讯科技有限公司 一种网络模型的量化方法、装置、设备和存储介质
CN114781650A (zh) * 2022-04-28 2022-07-22 北京百度网讯科技有限公司 一种数据处理方法、装置、设备以及存储介质
CN114781650B (zh) * 2022-04-28 2024-02-27 北京百度网讯科技有限公司 一种数据处理方法、装置、设备以及存储介质

Also Published As

Publication number Publication date
CN109791628A (zh) 2019-05-21
CN109791628B (zh) 2022-12-27

Similar Documents

Publication Publication Date Title
WO2019127362A1 (zh) 神经网络模型分块压缩方法、训练方法、计算装置及***
WO2021004366A1 (zh) 基于结构化剪枝和低比特量化的神经网络加速器及方法
Chang et al. Hardware accelerators for recurrent neural networks on FPGA
WO2019127363A1 (zh) 神经网络权重编码方法、计算装置及硬件***
US20240185050A1 (en) Analog neuromorphic circuit implemented using resistive memories
WO2019091020A1 (zh) 权重数据存储方法和基于该方法的神经网络处理器
Ankit et al. TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design
CN112257844B (zh) 一种基于混合精度配置的卷积神经网络加速器及其实现方法
CN110084364B (zh) 一种深度神经网络压缩方法和装置
WO2022134465A1 (zh) 加速可重构处理器运行的稀疏化数据处理方法和装置
WO2019001323A1 (zh) 信号处理的***和方法
CN109993275A (zh) 一种信号处理方法及装置
KR20220101418A (ko) 저전력 고성능 인공 신경망 학습 가속기 및 가속 방법
Shen et al. PRAP-PIM: A weight pattern reusing aware pruning method for ReRAM-based PIM DNN accelerators
Yuan et al. A dnn compression framework for sot-mram-based processing-in-memory engine
CN113435581B (zh) 数据处理方法、量子计算机、装置及存储介质
Ascia et al. Improving inference latency and energy of network-on-chip based convolutional neural networks through weights compression
CN112132272B (zh) 神经网络的计算装置、处理器和电子设备
CN113705784A (zh) 一种基于矩阵共享的神经网络权重编码方法及硬件***
Guo et al. A multi-conductance states memristor-based cnn circuit using quantization method for digital recognition
Li et al. Memory saving method for enhanced convolution of deep neural network
Chang et al. UCP: Uniform channel pruning for deep convolutional neural networks compression and acceleration
Huang et al. BWA-NIMC: Budget-based Workload Allocation for Hybrid Near/In-Memory-Computing
US20200401926A1 (en) Multi-pass system for emulating a quantum computer and methods for use therewith
US20230169023A1 (en) Ai accelerator apparatus using in-memory compute chiplet devices for transformer workloads

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17936159

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17936159

Country of ref document: EP

Kind code of ref document: A1