WO2019127362A1 - Neural network model block compression method, training method, computing device and system - Google Patents

Neural network model block compression method, training method, computing device and system

Info

Publication number
WO2019127362A1
Authority
WO
WIPO (PCT)
Prior art keywords: sub, matrix, block, blocks, network model
Prior art date
Application number
PCT/CN2017/119819
Other languages
French (fr)
Chinese (zh)
Inventor
张悠慧
季宇
张优扬
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Priority to PCT/CN2017/119819 priority Critical patent/WO2019127362A1/en
Priority to CN201780042629.4A priority patent/CN109791628B/en
Publication of WO2019127362A1 publication Critical patent/WO2019127362A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present invention generally relates to the field of neural network technologies, and more particularly to a network model block compression method, a training method, a computing device, and a hardware system for a neural network.
  • Figure 1 shows a chain-like neural network, in which each circle represents a neuron.
  • Each arrow represents a connection between neurons, each connection has a weight, and the structure of the actual neural network is not limited to a chain-like network structure.
  • the core computation of the neural network is a matrix vector multiplication operation.
  • the output produced by a layer L n containing n neurons can be represented by a vector V n of length n; when L n is fully connected to a layer L m containing m neurons, the connection weights can be expressed as a matrix M n×m with n rows and m columns, each matrix element representing the weight of one connection.
  • the weighted vector input to L m is then M n×m V n, and such matrix-vector multiplication is the core calculation of the neural network.
  • since the amount of matrix-vector computation is very large, performing large numbers of matrix multiplications on a conventional general-purpose processor takes a great deal of time; neural network acceleration chips therefore take accelerating matrix multiplication as their main design goal.
  • a memristor array is a hardware device capable of implementing the above matrix multiplication operation.
  • the resistance of each memristor can be varied at a specific input current, and the resistance can be used to store data.
  • compared with conventional DRAM (dynamic random access memory) and SRAM (static random access memory), memristors feature high memory density and do not lose data when power is lost.
  • Figure 2 shows a schematic diagram of a memristor based crossbar structure.
  • the conductance value G (the reciprocal of the resistance) of the memristor is set as the matrix element value of the weight matrix.
  • by applying an input voltage V, the voltage is multiplied by the memristor conductance G and the resulting currents are summed; multiplying the output current by the grounding resistance Rs yields the output voltage V', completing the matrix-vector multiplication at the output end.
  • computing with memristor-based chips also has the disadvantages of low precision, large disturbance, high digital-to-analog/analog-to-digital conversion overhead, and limited matrix size.
  • TrueNorth is also a chip capable of matrix vector multiplication.
  • TrueNorth is IBM's neuromorphic chip; each chip integrates 4096 synaptic cores, and each core can handle a 256×256 synaptic computation.
  • the neural network model needs to be compressed to reduce resource overhead and improve the computational efficiency of the neural network.
  • the existing Deep Compression is a common compression method for CNN networks.
  • the implementation of deep compression is mainly divided into three steps: weight cropping, weight sharing and Huffman coding.
  • weight pruning: first, train the model normally to obtain the network weights; second, set all weights below a certain threshold to 0; third, retrain the remaining non-zero weights in the network; these three steps are iterated repeatedly.
  • weight sharing: the k-means algorithm is used to cluster the weights; within each cluster, all weights share the cluster centroid, so the final stored result is a codebook plus an index table.
  • Huffman coding: mainly used to remove the redundancy caused by unequal code lengths. Deep Compression uses a uniform 8-bit encoding for the convolutional layers and 5 bits for the fully connected layers, so this entropy coding balances the coding bits better and reduces redundancy.
  • this method can compress the model at a compression rate of up to 90% while maintaining accuracy.
  • the present invention has been made in view of the above circumstances.
  • a network model block compression method for a neural network comprising: a weight matrix obtaining step, obtaining the weight matrix of the network model of a trained neural network; a weight matrix blocking step, dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-cropped weight concentration step, in which, based on the sums of the absolute values of the weights of the matrix elements in the sub-blocks, row-column exchange gathers the matrix elements with smaller weights into sub-blocks to be cropped, so that the absolute-weight sums of the sub-blocks to be cropped are smaller than those of the other sub-blocks not to be cropped;
  • and a sub-block cropping step, cropping away the weights of the matrix elements in the sub-blocks to be cropped to obtain the final weight matrix, thereby compressing the network model of the neural network.
  • the number of the sub-blocks to be cropped may be set according to a compression ratio or according to a threshold.
  • the to-be-cropped weight concentration step may include the steps of: a pre-cropped sub-block determination step, determining pre-cropped sub-blocks as cropping candidates, the number of pre-cropped sub-blocks being set according to the compression rate; a row-column marking step, selecting and marking all rows and all columns in which the pre-cropped sub-blocks lie as transposition rows and transposition columns; and a row exchange step and a column exchange step, summing the absolute values of the weights of the matrix elements in each row and swapping the rows with the smallest sums, in order, into the marked transposition rows, then summing the absolute values of the weights of the matrix elements in each column and swapping the columns with the smallest sums, in order, into the marked transposition columns; the above steps are repeated until no exchange can change the sum of the absolute weights over all pre-cropped sub-blocks, at which point the pre-cropped sub-blocks become the sub-blocks to be cropped.
  • the pre-cropped sub-block determination step may include: computing the sum of the absolute values of the weights of the matrix elements in each initial sub-block and taking the sub-blocks with the smallest sums as the pre-cropped sub-blocks.
  • a neural network training method comprising the steps of: training the neural network to obtain the weight matrix of the network model; compressing the weight matrix according to the network model block compression method; and iterating the above steps until a predetermined iteration stop requirement is reached.
  • a computing device for neural network computation comprising a memory and a processor, the memory storing computer-executable instructions that include network model compression instructions; when the processor executes the network model compression instructions, the following method is performed: a weight matrix obtaining step, obtaining the weight matrix of the network model of a trained neural network; a weight matrix blocking step, dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-cropped weight concentration step, gathering the matrix elements with smaller weights into the sub-blocks to be cropped through row-column exchange based on the sub-block absolute-weight sums, so that the sums of the sub-blocks to be cropped are smaller than those of the other sub-blocks; and a sub-block cropping step, cropping away the weights of the matrix elements in the sub-blocks to be cropped to obtain the final weight matrix, thereby compressing the network model of the neural network.
  • the number of the sub-blocks to be cropped can be set according to a compression ratio or according to a threshold.
  • the to-be-cropped weight concentration step may include the steps of: a pre-cropped sub-block determination step, determining pre-cropped sub-blocks as cropping candidates, their number being set according to the compression rate; a row-column marking step, selecting and marking all rows and all columns in which the pre-cropped sub-blocks lie as transposition rows and transposition columns; and a row exchange step and a column exchange step, summing the absolute values of the weights of the matrix elements in each row and swapping the rows with the smallest sums, in order, into the marked transposition rows, then summing the absolute values of the weights of the matrix elements in each column and swapping the columns with the smallest sums, in order, into the marked transposition columns; the above steps are repeated until no exchange can change the sum of the absolute weights over all pre-cropped sub-blocks, at which point the pre-cropped sub-blocks become the sub-blocks to be cropped.
  • the pre-cropped sub-block determination step may further include: computing the sum of the absolute values of the weights of the matrix elements in each initial sub-block and taking the sub-blocks with the smallest sums as the pre-cropped sub-blocks.
  • the computer-executable instructions may include network model application instructions; when the processor executes the network model application instructions, the following method is performed: an input data processing step, exchanging the input data according to the row-column exchange order; a matrix multiplication step, multiplying the exchanged input data by the final weight matrix obtained by executing the network model compression instructions; and an output data processing step, reverse-exchanging the result of the matrix multiplication according to the row-column exchange order and outputting it as the output data.
  • the computer-executable instructions may further include network model training instructions; when the processor executes the network model training instructions, the following method is performed: training the neural network to obtain the initial weight matrix of the network model; executing the network model compression instructions to obtain the compressed final weight matrix; executing the network model application instructions for training; and iterating the above compression and training steps until a predetermined iteration stop requirement is reached.
  • a hardware system that performs network model compression, application, and training by employing the above network model block compression method, the above neural network training method, and the above computing device, including: a neural network hardware chip having basic modules that perform the matrix-vector multiplication operation in hardware form through circuit devices, wherein no circuit devices are provided at the positions corresponding to the matrix elements in the sub-blocks to be cropped.
  • the circuit devices may be memristors or the synapses of a TrueNorth chip.
  • a network model block compression method for a neural network is provided, thereby saving resource overhead to arrange a large-scale neural network under conditions of limited resources.
  • Figure 1 shows a schematic of a chained neural network.
  • Figure 2 shows a schematic diagram of a memristor based crossbar switch structure.
  • FIG. 3 is a diagram showing an application scenario of a network model block compression technique of a neural network in accordance with the present invention.
  • FIG. 4 shows a general flow diagram of a network model block compression method in accordance with the present invention.
  • Fig. 5 shows an exploded flowchart of the to-be-cropped weight concentration step of the above method.
  • Figures 6a-6c show the correct rates for different compression ratios using the compression method according to the invention over a variety of data sets and different network sizes.
  • FIG. 3 shows a schematic diagram of an application context 1000 of a network model block compression technique for a neural network in accordance with the present invention.
  • the general inventive concept of the present disclosure is to perform preliminary neural network training for the neural network application 1100, learn a network model 1200, block-compress it at a predetermined compression rate by the network model block compression method 1300, and iterate retraining and compression until a predetermined stop requirement is met, yielding the final network model 1400.
  • FIG. 4 and FIG. 5 are flowcharts of a network model block compression method 1300 according to an embodiment of the present invention, wherein FIG. 4 shows the overall flowchart of the method and FIG. 5 shows an exploded flowchart of its to-be-cropped weight concentration step.
  • the network model block compression method includes the following steps:
  • the weight matrix obtaining step S210 obtains the weight matrix of the network model of the trained neural network.
  • to better illustrate the method, the initial weight matrix is assumed to be of size 6×6, as given in the matrix of Table 1 below.
  • Weight Matrix Blocking Step S220 The weight matrix is divided into an array of a number of initial sub-blocks according to a predetermined array size.
  • the sub-block size can be set according to the scale of the weight matrix and the required compression rate; for example, sub-block matrix sizes such as 4×4, 8×8, ..., 256×256 can also be used.
  • to-be-cropped weight concentration step S230: based on the sub-block sums (the sum of the absolute values of the weights of the matrix elements in a sub-block), the matrix elements with smaller weights are gathered into the sub-blocks to be cropped through row-column exchange, so that the sums of the sub-blocks to be cropped are smaller than those of the other sub-blocks; the number of sub-blocks to be cropped is set according to the compression rate.
  • FIG. 5 shows an exploded flowchart of the to-be-cropped weight concentration step S230, which includes the following steps:
  • Step S2301 Determining a pre-trimmed sub-block as a cropping candidate.
  • in this embodiment, the sum of each initial sub-block is calculated, and the sub-blocks with the smallest sums are taken as the pre-cropped sub-blocks.
  • specifically, the absolute value of each entry of the weight matrix of Table 1 is first taken, yielding the matrix of Table 2.
  • Table 2: matrix of absolute values
  • next, the sub-blocks with the smallest sums are selected as pre-cropped sub-blocks and marked True, the other sub-blocks are marked False, and Table 4 is obtained, where sub-block labels begin with B.
  • Step S2302: select and mark all rows and all columns in which the pre-cropped sub-blocks lie as the transposition rows and transposition columns.
  • the four sub-blocks with the smallest sums, B21, B22, B23, and B33, are marked "True" as pre-cropped sub-blocks.
  • the transposition rows are therefore R2, R3, R4, R5 and the transposition columns are C0-C5; they are marked ER2, ER3, ER4, ER5 and EC0-EC5, where transposition rows begin with ER and transposition columns with EC, distinguishing them from the ordinary rows and columns beginning with R and C.
  • Step S2303 summing the absolute values of the weights of the matrix elements in each row, and sequentially swapping the rows with the smaller values with the marked transposition rows.
  • R3 is transposed to ER2 → R[0 1 3 2 4 5] (since R3 and R2 have now been swapped, R2 is not transposed again);
  • R1 is transposed to ER4 → R[0 4 3 2 1 5];
  • R0 is transposed to ER5 → R[5 4 3 2 1 0].
  • Exchange column step S2304 sum the absolute values of the weights of the matrix elements in each column, and sequentially exchange the columns with the smaller values in position with the marked transposition columns.
  • C3 is transposed to EC0 → C[3 1 2 0 4 5];
  • C5 is transposed to EC1 → C[3 5 2 0 4 1];
  • C4 is transposed to EC2 → C[3 5 4 0 2 1];
  • C1 is transposed to EC3 → C[3 5 4 1 2 0];
  • C0 is transposed to EC4 → C[3 5 4 1 0 2];
  • C2 is transposed to EC5 → C[3 5 4 1 0 2] (already in place, so the order is unchanged).
  • the first column obtained after the exchange is thus column C3 of the original matrix, the second column is column C5 of the original matrix, and so on;
  • the row order is R[5, 4, 3, 2, 1, 0].
  • the pre-cropped sub-block sum Sum2 is then compared with the stored sub-block sum Sum1.
  • the stored sum Sum1 and the pre-cropped sum Sum2 are not equal, so Sum1 is set to Sum2; that is, since 6.0731 ≠ 7.1541, the stored sub-block sum Sum1 is set to 7.1541.
  • because Sum1 is smaller than Sum2, the current exchange operation can still continue, so steps S2301 to S2305 are repeated: the to-be-cropped weight elements are again concentrated toward the pre-cropped sub-block positions by row-column exchange, and it is again judged whether the pre-cropped sub-block sum equals the stored sub-block sum serving as the comparison value, as detailed below.
  • here, the pre-cropped sub-block sum is calculated after the exchange processing, based on the pre-cropped sub-block positions determined before the exchange; the stored sub-block sum is set according to the judgment result after the previous round's exchange. Specifically, at each judgment, whenever the stored sum and the pre-cropped sum differ, the pre-cropped sum is stored as the new stored sum for use in the next comparison. The initial value of the stored sum is the pre-cropped sub-block sum determined at the start of the loop.
  • Step S2301 Determining the pre-trimmed sub-block as a cropping candidate.
  • the sub-blocks with the smallest sums are still taken as pre-cropped sub-blocks, so according to Table 9, the pre-cropped sub-blocks are re-selected as shown in Table 10 below.
  • Table 10: marked pre-cropped sub-blocks
  • row-column marking step S2302: select and mark all rows and all columns in which the pre-cropped sub-blocks lie as the transposition rows and transposition columns.
  • the four sub-blocks with the smallest sums, B21, B22, B23, and B32, are marked "True" as pre-cropped sub-blocks, and the rows and columns in which they lie are taken as the transposition rows and columns:
  • the transposition rows and columns include R2, R3, R4, R5 and C0-C5, so the transposition rows are ER2, ER3, ER4, ER5 and the transposition columns are EC0-EC5.
  • row exchange step S2303: sum the absolute values of the weights of the matrix elements in each row, and
  • swap the rows with the smallest sums, in order, into the marked transposition rows.
  • the row sums in ascending order are R2<R3<R4<R5<R0<R1, to be assigned in turn to ER2, ER3, ER4, ER5; since the order of the small-weight rows already corresponds one-to-one with the transposition rows, no row transposition is performed and the matrix remains as in Table 8.
  • column exchange step S2304: sum the absolute values of the weights of the matrix elements in each column, and
  • swap the columns with the smallest sums, in order, into the marked transposition columns.
  • the column sums in ascending order are C0<C1<C2<C3<C4<C5, to be assigned in turn to EC0-EC5; since the order of the small-weight columns already corresponds one-to-one with the transposition columns, no column transposition is performed and the matrix remains as in Table 8.
  • the stored sub-block sum Sum1 was set to 7.1541 during the first row-column exchange round, i.e. the pre-cropped sub-block sum before the second round.
  • the row order is R[5, 4, 3, 2, 1, 0].
  • Step S2301 Determining the pre-trimmed sub-block as a cropping candidate.
  • the sub-blocks with the smallest sums are still taken as the pre-cropped sub-blocks.
  • the sums of the four pre-cropped sub-blocks B21, B22, B23, and B32 are still the smallest sub-block sums, so the pre-cropped sub-blocks remain unchanged.
  • the row order is R[5, 4, 3, 2, 1, 0].
  • the stored sub-block sum Sum1 is equal to 5.666.
  • sub-block cropping step S240: crop away the weights of the matrix elements in the sub-blocks to be cropped, implementing the compression of the network model of the neural network.
  • the cropping here is not limited to setting the values of the matrix elements themselves to 0.
  • the device at the position corresponding to such a matrix element can simply be omitted; more specifically, when the corresponding hardware is arranged to implement the weight matrix, the devices that would perform the block computation at those positions are removed.
  • in this way, the to-be-cropped weight elements are concentrated into matrix sub-blocks, those sub-blocks are cut away outright, and the result is used as the initial value for further neural network training; array usage is reduced while the network's effectiveness is preserved, which greatly reduces resource overhead.
  • the method proposed by the present invention is therefore fully applicable to neural networks based on memristors or TrueNorth chips.
  • traditional network compression methods are not suitable for such networks, because even if the network model is compressed considerably, the number of arrays used cannot be reduced, so resource consumption cannot be reduced.
  • the steps of the row and column exchange given above are only examples, but such a row and column exchange manner is not the only alternative.
  • for example, the row sums in ascending order are R3<R2<R1<R0<R5<R4, and the rows are swapped in that order into the marked rows ER2, ER3, ER4, and ER5.
  • that is to say, the present invention selects as exchange rows the rows in which the smallest-sum sub-blocks lie, which greatly speeds up the row-column exchange and quickly gathers the small-valued to-be-cropped weight elements together.
  • although the number of sub-blocks to be cropped is determined in the present invention by setting the compression rate,
  • it may instead be determined by setting a threshold, as long as the purpose of compression is satisfied.
  • the core of the inventive concept is to obtain croppable sub-blocks through row-column exchange so as to suit block-computing applications, without limiting the specific exchange manner that may be used.
  • Table 14 is the weight matrix after row-column exchange according to the compression method of the present invention (corresponding to Table 8), in which the underlined Null entries mark the pre-cropped matrix elements.
  • Table 15 is the initial matrix obtained by restoring Table 14 to the original row-column order (i.e., the order below), in which the underlined Null entries mark the pre-cropped matrix elements.
  • the row order is R[5, 4, 3, 2, 1, 0].
  • the essential difference between Table 15 and Table 14 is that the pre-cropped elements in Table 15 are dispersed, while in Table 14 they are gathered into 2×2 sub-blocks. In an actual deployment, the hardware is therefore arranged according to Table 14 (the matrix after row-column exchange) to meet the needs of block computation; this is the overall concept of the invention, namely the key to applying the compression method to the corresponding block-computing application.
  • the row-exchanged input vector is multiplied by the row-column-exchanged weight matrix of Table 14, i.e., corresponding elements of the vector and matrix are multiplied and summed, giving dot-product result 2:
  • Table 20: dot-product result 2 after the exchange
  • thus, when the weight matrix obtained by the compression method of the present invention is applied to data, the input data must first be exchanged according to the row-column exchange order and then multiplied by the final weight matrix; finally, the result of the matrix multiplication is reverse-exchanged according to the row-column exchange order and output as the output data.
  • Figure 6a shows the correct rate of the CIFAR10 data set after compression.
  • the CIFAR10 data set has 60,000 32*32 pixel color pictures, each of which belongs to one of 10 categories.
  • Figure 6b shows the correct rate of compression of the MNIST data set under the LENET network, where the MNIST data set has 60,000 28*28 pixel black and white handwritten digital pictures.
  • Figure 6c shows the correct rate of compression of the MNIST data set under the MLP network.
  • in each figure, the abscissa is the compression rate, the ordinate is the correct rate, and lines of different colors represent arrays of different sizes.
  • the correct rate is between 84% and 85% for the CIFAR10 data set.
  • for the MNIST data set, the correct rate is basically 98%-99% or even higher, which also demonstrates from multiple angles that the compression method of the present invention preserves the correct rate quite well. In other words, across various data sets and different network scales, the compression method of the present invention can greatly compress the network scale and save resource overhead without affecting the correct rate.
  • the compression rate of some of the results is not high, which is related to the large scale of the array: the largest group uses a 256×256 array size, and such a large array causes too much valid data to be cropped, which affects the correct rate.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A network model block compression method for use with a neural network, comprising: a weight matrix obtaining step, obtaining the weight matrix of the network model of a trained neural network; a weight matrix blocking step, dividing the weight matrix according to a predetermined array size into an array composed of a plurality of initial sub-blocks; a to-be-cropped weight concentration step, in which, according to the sums of the absolute values of the weights of the matrix elements in the sub-blocks, matrix elements having smaller weights are concentrated by means of row-column exchange into sub-blocks to be cropped, such that the absolute-weight sums of the sub-blocks to be cropped are smaller than those of the other sub-blocks not to be cropped; and a sub-block cropping step, cropping out the weights of the matrix elements in the sub-blocks to be cropped to obtain a final weight matrix, so as to implement the compression of the network model of the neural network. Resources and overhead can thus be saved, and a large-scale neural network can be arranged with limited resources.

Description

Neural network model block compression method, training method, computing device and system

Technical field
The present invention relates generally to the field of neural network technology, and more particularly to a network model block compression method, a training method, a computing device, and a hardware system for a neural network.
Background art
As Moore's Law gradually fails and the progress of traditional chip technology slows, people must turn to new applications and new devices. In recent years, neural network (NN) computing has made breakthrough progress, achieving high accuracy in many fields such as image recognition, speech recognition, and natural language processing. However, neural networks require massive computing resources that traditional general-purpose processors can hardly satisfy, so designing dedicated chips has become an important development direction.
Specifically, a neural network is usually modeled as layers of several neurons each, with connections between the layers. Figure 1 shows a chain-like neural network in which each circle represents a neuron and each arrow represents a connection between neurons; every connection carries a weight. The structure of an actual neural network is not limited to a chain-like topology.
The core computation of a neural network is the matrix-vector multiplication. The output produced by a layer L n containing n neurons can be represented by a vector V n of length n. When L n is fully connected to a layer L m containing m neurons, the connection weights can be expressed as a matrix M n×m with n rows and m columns, each matrix element representing the weight of one connection. The weighted vector input to L m is then M n×m V n, and this matrix-vector multiplication is the most central computation of the neural network.
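A minimal NumPy sketch of this core computation (the layer sizes and the vector-times-matrix orientation are illustrative assumptions, not taken from the patent):

```python
import numpy as np

n, m = 6, 4                # hypothetical layer sizes
V_n = np.random.rand(n)    # output vector of layer L_n
M = np.random.rand(n, m)   # weight matrix M_{n x m}; entry (i, j) weights the connection
                           # from neuron i of L_n to neuron j of L_m

V_m_in = V_n @ M           # the weighted input to L_m, written M_{n x m} V_n in the text
print(V_m_in.shape)        # (4,)
```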
Because the amount of matrix-vector computation is very large, performing large numbers of matrix multiplications on a traditional general-purpose processor takes a great deal of time, so neural network acceleration chips all take accelerating matrix multiplication as their main design goal.
A memristor array is a hardware device capable of implementing the above matrix multiplication. The resistance of each memristor can be changed under a specific input current, and the resistance can be used to store data. Compared with traditional DRAM (dynamic random access memory) and SRAM (static random access memory), memristors feature high storage density and do not lose data when power is lost.
Figure 2 shows a schematic diagram of a memristor-based crossbar structure.
As shown in Figure 2, the wires are arranged as a crossbar and connected by a memristor at each intersection, and the conductance G of each memristor (the reciprocal of its resistance) is set to the corresponding element of the weight matrix. When an input voltage V is applied at the input end, the voltage is multiplied by the memristor conductance G and the resulting currents are summed; multiplying the output current by the grounding resistance Rs yields the output voltage V', so the matrix-vector multiplication is completed at the output end. With this as a basic unit, a neuromorphic chip based on such novel devices can be constructed.
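As a rough numerical illustration of this analog computation (an idealized model that ignores the non-idealities discussed below; the component values are arbitrary assumptions):

```python
import numpy as np

G = np.random.rand(6, 6) * 1e-4   # memristor conductances (siemens), programmed to the weights
V = np.random.rand(6)             # input voltages applied to the rows
Rs = 1.0e3                        # grounding (sense) resistance in ohms

I = V @ G        # current summed on each column: I_j = sum_i V_i * G_ij
V_out = Rs * I   # per-column output voltage, proportional to the matrix-vector product
```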
Since the whole process is realized in analog circuitry, it has the advantages of high speed and small area.
However, computing with memristor-based chips also suffers from low precision, large disturbance, high digital-to-analog/analog-to-digital conversion overhead, and limited matrix size.
Similarly, TrueNorth is also a chip capable of matrix-vector multiplication. TrueNorth is IBM's neuromorphic chip; each chip integrates 4096 synaptic cores, and each core can handle a 256×256 synaptic computation.
Although both memristor arrays and TrueNorth chips can perform matrix-vector multiplication efficiently, the enormous scale of neural networks demands an astonishing number of arrays, which brings massive resource overhead and makes it difficult to deploy a huge initial neural network on such chip devices under limited resources.
Therefore, the neural network model needs to be compressed to reduce resource overhead and improve the computational efficiency of the neural network.
The existing Deep Compression is a common compression method for CNN networks. Its implementation is divided into three main steps: weight pruning, weight sharing, and Huffman coding.
(1) Weight pruning: first, train the model normally to obtain the network weights; second, set all weights below a certain threshold to 0; third, retrain the remaining non-zero weights in the network. These three steps are iterated repeatedly.
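A minimal sketch of this pruning loop (the `retrain` callback is a placeholder for the application's own training routine, assumed to keep the zeroed weights frozen via the mask):

```python
import numpy as np

def prune_and_retrain(W, threshold, retrain, n_rounds=3):
    # Deep Compression style pruning: zero out small weights, then retrain.
    for _ in range(n_rounds):
        mask = np.abs(W) >= threshold   # step 2: weights below the threshold are set to 0
        W = np.where(mask, W, 0.0)
        W = retrain(W, mask)            # step 3: retrain the remaining non-zero weights
    return W
```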
(2) Weight sharing: the k-means algorithm is used to cluster the weights; within each cluster, all weights share the cluster centroid, so the final stored result is a codebook plus an index table.
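A minimal sketch of this weight-sharing step, assuming scikit-learn's KMeans for the clustering (the matrix size and cluster count are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

W = np.random.randn(64, 64)                  # example weight matrix
km = KMeans(n_clusters=16, n_init=10).fit(W.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()       # shared weight values (cluster centroids)
index_table = km.labels_.reshape(W.shape)    # per-weight index into the codebook
W_shared = codebook[index_table]             # every weight replaced by its centroid
```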
(3) Huffman coding: mainly used to remove the redundancy caused by unequal code lengths. Deep Compression uses a uniform 8-bit encoding for the convolutional layers and 5 bits for the fully connected layers, so this entropy coding balances the coding bits better and reduces redundancy.
This method can compress the model at a compression rate of up to 90% while keeping accuracy unchanged.
Although these prior techniques can greatly compress the model size, they cannot be adapted to the neural network models deployed on chips capable of matrix-vector multiplication, such as memristor arrays and TrueNorth. For example, the weights removed by pruning are not concentrated, so the number of arrays required cannot be reduced; weight sharing slows down the operation of a memristor array; and the weight encoding of a memristor array is fixed and cannot be compressed.
Therefore, a network model compression technique for neural network computation is needed to solve the above problems.
Summary of the invention
The present invention has been made in view of the above circumstances.
According to one aspect of the present invention, a network model block compression method for a neural network is provided, comprising: a weight matrix obtaining step, obtaining the weight matrix of the network model of a trained neural network; a weight matrix blocking step, dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-cropped weight concentration step, in which, based on the sums of the absolute values of the weights of the matrix elements in the sub-blocks, row-column exchange gathers the matrix elements with smaller weights into sub-blocks to be cropped, so that the absolute-weight sums of the sub-blocks to be cropped are smaller than those of the other sub-blocks not to be cropped; and a sub-block cropping step, cropping away the weights of the matrix elements in the sub-blocks to be cropped to obtain the final weight matrix, thereby compressing the network model of the neural network.
According to the above network model block compression method, the number of sub-blocks to be cropped may be set according to a compression rate or according to a threshold.
According to the above network model block compression method, the to-be-cropped weight concentration step may include: a pre-cropped sub-block determination step, determining pre-cropped sub-blocks as cropping candidates, the number of pre-cropped sub-blocks being set according to the compression rate; a row-column marking step, selecting and marking all rows and all columns in which the pre-cropped sub-blocks lie as transposition rows and transposition columns; and a row exchange step and a column exchange step, summing the absolute values of the weights of the matrix elements in each row and swapping the rows with the smallest sums, in order, into the marked transposition rows, and likewise summing the absolute values of the weights of the matrix elements in each column and swapping the columns with the smallest sums, in order, into the marked transposition columns; the above steps are repeated until no exchange can change the total absolute-weight sum over all pre-cropped sub-blocks, at which point the pre-cropped sub-blocks become the sub-blocks to be cropped.
According to the above network model block compression method, the pre-cropped sub-block determination step may include: computing the sum of the absolute values of the weights of the matrix elements in each initial sub-block and taking the sub-blocks with the smallest sums as the pre-cropped sub-blocks.
According to another aspect of the present invention, a neural network training method is provided, comprising: training the neural network to obtain the weight matrix of the network model; compressing the weight matrix according to the above network model block compression method; and iterating the above steps until a predetermined iteration stop requirement is reached.
According to another aspect of the present invention, a computing device for neural network computation is provided, comprising a memory and a processor, the memory storing computer-executable instructions that include network model compression instructions; when the processor executes the network model compression instructions, the following method is performed: a weight matrix obtaining step, obtaining the weight matrix of the network model of a trained neural network; a weight matrix blocking step, dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size; a to-be-cropped weight concentration step, gathering the matrix elements with smaller weights into sub-blocks to be cropped through row-column exchange based on the sub-block absolute-weight sums, so that the sums of the sub-blocks to be cropped are smaller than those of the other sub-blocks; and a sub-block cropping step, cropping away the weights of the matrix elements in the sub-blocks to be cropped to obtain the final weight matrix, thereby compressing the network model of the neural network.
According to the above computing device, the number of sub-blocks to be cropped may be set according to a compression rate or according to a threshold.
According to the above computing device, the to-be-cropped weight concentration step may include: a pre-cropped sub-block determination step, determining pre-cropped sub-blocks as cropping candidates, their number being set according to the compression rate; a row-column marking step, selecting and marking all rows and all columns in which the pre-cropped sub-blocks lie as transposition rows and transposition columns; and a row exchange step and a column exchange step, as described above; these steps are repeated until no exchange can change the total absolute-weight sum over all pre-cropped sub-blocks, at which point the pre-cropped sub-blocks become the sub-blocks to be cropped.
According to the above computing device, the pre-cropped sub-block determination step may further include: computing the sum of the absolute values of the weights of the matrix elements in each initial sub-block and taking the sub-blocks with the smallest sums as the pre-cropped sub-blocks.
According to the above computing device, the computer-executable instructions may include network model application instructions; when the processor executes the network model application instructions, the following method is performed: an input data processing step, exchanging the input data according to the row-column exchange order; a matrix multiplication step, multiplying the exchanged input data by the final weight matrix obtained by executing the network model compression instructions; and an output data processing step, reverse-exchanging the result of the matrix multiplication according to the row-column exchange order and outputting it as the output data.
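A minimal sketch of this apply-time data flow (NumPy assumed; the vector-times-matrix orientation is an illustrative assumption, and `row_perm`/`col_perm` stand for the exchange orders recorded during compression):

```python
import numpy as np

def apply_compressed(x, W_exchanged, row_perm, col_perm):
    x_in = x[row_perm]          # input data processing: exchange the input data
    y = x_in @ W_exchanged      # matrix multiplication with the final weight matrix
    y_out = np.empty_like(y)
    y_out[col_perm] = y         # output data processing: reverse the column exchange
    return y_out
```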
According to the above computing device, the computer-executable instructions may further include network model training instructions; when the processor executes the network model training instructions, the following method is performed: training the neural network to obtain the initial weight matrix of the network model; executing the network model compression instructions to obtain the compressed final weight matrix; executing the network model application instructions for training; and iterating the above compression and training steps until a predetermined iteration stop requirement is reached.
According to another aspect of the present invention, a hardware system is provided that performs network model compression, application, and training by employing the above network model block compression method, the above neural network training method, and the above computing device, comprising: a neural network hardware chip having basic modules that perform the matrix-vector multiplication operation in hardware form through circuit devices, wherein no circuit devices are provided at the positions corresponding to the matrix elements in the sub-blocks to be cropped.
According to the above hardware system, the circuit devices may be memristors or the synapses of a TrueNorth chip.
According to one aspect of the present invention, a network model block compression method for a neural network is provided, thereby saving resource overhead so that a large-scale neural network can be arranged under conditions of limited resources.
Brief description of the drawings
These and/or other aspects and advantages of the present invention will become clearer and more readily understood from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings, in which:
Figure 1 shows a schematic diagram of a chain-like neural network.
Figure 2 shows a schematic diagram of a memristor-based crossbar structure.
Figure 3 shows a schematic diagram of an application scenario of the network model block compression technique for a neural network according to the present invention.
Figure 4 shows the overall flowchart of the network model block compression method according to the present invention.
Figure 5 shows an exploded flowchart of the to-be-cropped weight concentration step of the above method.
Figures 6a-6c show the correct rates achieved by the compression method according to the present invention at different compression rates, over a variety of data sets and different network scales.
Detailed description
In order that those skilled in the art may better understand the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Figure 3 shows a schematic diagram of an application scenario 1000 of the network model block compression technique for a neural network according to the present invention.
As shown in Figure 3, the overall inventive concept of the present disclosure is as follows: preliminary neural network training is performed for the neural network application 1100 to learn a network model 1200; the network model 1200 is block-compressed at a predetermined compression rate by the network model block compression method 1300 and then retrained, and this compress-retrain cycle is iterated so as to fine-tune and learn to restore accuracy, until a predetermined iteration stop requirement is reached, thereby determining the final network model 1400. In this way, the number of block operation unit devices required by the neural network chip can be reduced without affecting the results, so that a huge neural network can be arranged under the condition of limited resources.
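A minimal skeleton of this compress-retrain loop (`train`, `block_compress`, and `stop_requirement_met` are placeholders for the application's own routines, not APIs defined by the patent):

```python
def compress_and_retrain(W, train, block_compress, stop_requirement_met, max_rounds=20):
    # W: weight matrix obtained from the preliminary training of the network model.
    for _ in range(max_rounds):
        W = block_compress(W)         # crop whole sub-blocks via row-column exchange
        W = train(W)                  # retrain (fine-tune), keeping cropped blocks at zero
        if stop_requirement_met(W):   # the predetermined iteration stop requirement
            break
    return W
```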
I. Network model block compression method
Figures 4 and 5 show flowcharts of a network model block compression method 1300 according to an embodiment of the present invention: Figure 4 shows the overall flowchart of the method, and Figure 5 shows an exploded flowchart of its to-be-cropped weight concentration step. Specifically, the network model block compression method includes the following steps:
1. Weight matrix obtaining step S210: obtain the weight matrix of the network model of the trained neural network.
Here, to better illustrate the method of the present invention, the initial weight matrix is assumed to be of size 6×6, as given in Table 1 below.
Table 1: initial weight matrix
0.9373   0.0419   0.7959   0.8278  -0.4288   0.6854
0.3311   0.6683   0.8686   0.1087   0.3058  -0.6641
0.0879  -0.7366   0.5453  -0.0170  -0.8295   0.5781
0.3964   0.0769  -0.4809  -0.1507   0.0296  -0.2923
0.9786  -0.9656   0.8449   0.6284  -0.9309   0.4138
0.7540   0.7859  -0.8424   0.9000  -0.4225   0.0847
2. Weight matrix blocking step S220: divide the weight matrix into an array of initial sub-blocks according to a predetermined array size.
Compressing the above weight matrix in units of, for example, 2×2 sub-blocks divides the matrix into a 3×3 = 9 sub-block array.
Here, those skilled in the art will understand that the sub-block size can be set according to the scale of the weight matrix and the required compression rate; for example, sub-block matrix sizes of 4×4, 8×8, ..., 256×256 can also be used.
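For concreteness, a small NumPy sketch of this blocking together with the sub-block sums used in step S230 below; it reproduces the values of Table 3 from the Table 1 matrix:

```python
import numpy as np

W = np.array([
    [ 0.9373,  0.0419,  0.7959,  0.8278, -0.4288,  0.6854],
    [ 0.3311,  0.6683,  0.8686,  0.1087,  0.3058, -0.6641],
    [ 0.0879, -0.7366,  0.5453, -0.0170, -0.8295,  0.5781],
    [ 0.3964,  0.0769, -0.4809, -0.1507,  0.0296, -0.2923],
    [ 0.9786, -0.9656,  0.8449,  0.6284, -0.9309,  0.4138],
    [ 0.7540,  0.7859, -0.8424,  0.9000, -0.4225,  0.0847],
])  # Table 1

def block_abs_sums(W, bs):
    # Sum of the absolute weights inside each bs x bs sub-block.
    n, m = W.shape
    return np.abs(W).reshape(n // bs, bs, m // bs, bs).sum(axis=(1, 3))

print(block_abs_sums(W, 2))
# [[1.9786 2.601  2.0841]
#  [1.2978 1.1939 1.7295]
#  [3.4841 3.2157 1.8519]]   <- matches Table 3
```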
3. To-be-cropped weight concentration step S230: based on the sum of the absolute values of the weights of the matrix elements in each sub-block (hereafter simply the sub-block sum), gather the matrix elements with smaller weights into the sub-blocks to be cropped through row-column exchange, so that the sums of the sub-blocks to be cropped are smaller than the sums of the other sub-blocks; the number of sub-blocks to be cropped is set according to the compression rate.
More specifically, Figure 5 shows an exploded flowchart of the to-be-cropped weight concentration step S230, which includes the following steps:
a. Pre-cropped sub-block determination step S2301: determine the pre-cropped sub-blocks that serve as cropping candidates.
In this embodiment, the sum of each initial sub-block is computed, and the sub-blocks with the smallest sums are taken as the pre-cropped sub-blocks.
Specifically, first, the absolute value of each entry of the weight matrix of Table 1 is taken, yielding the matrix of Table 2.
Table 2: matrix of absolute values
      C0       C1       C2       C3       C4       C5
R0  0.9373   0.0419   0.7959   0.8278   0.4288   0.6854
R1  0.3311   0.6683   0.8686   0.1087   0.3058   0.6641
R2  0.0879   0.7366   0.5453   0.0170   0.8295   0.5781
R3  0.3964   0.0769   0.4809   0.1507   0.0296   0.2923
R4  0.9786   0.9656   0.8449   0.6284   0.9309   0.4138
R5  0.7540   0.7859   0.8424   0.9000   0.4225   0.0847
To make the subsequent row-column exchanges easier to follow, Table 2 labels the rows and columns of the weight matrix in their order before transposition, rows beginning with R and columns with C.
Next, for the matrix of Table 2, the sub-block sums are computed in units of 2×2 sub-blocks, yielding Table 3.
Table 3: sub-block sums
1.9786   2.6010   2.0841
1.2978   1.1939   1.7295
3.4841   3.2157   1.8519
最后,选取和值最小的子块作为预裁剪子块,并且标记为True,其他子块标记为False,获得表4,其中子块序号以B打头。Finally, the sub-block with the smallest value is selected as the pre-trimmed sub-block, and is marked as True, and the other sub-blocks are marked as False, and Table 4 is obtained, where the sub-block number starts with B.
Table 4: Pre-pruned sub-blocks

B11: False  B12: False  B13: False
B21: True   B22: True   B23: True
B31: False  B32: False  B33: True
The number of sub-blocks marked as pre-pruned is determined by the compression ratio. Specifically, assuming a compression ratio of 50%, the number of pre-pruned sub-blocks is the total number of sub-blocks × the compression ratio, 9 * 50% = 4.5, rounded down to 4. Combining Tables 3 and 4, the four sub-blocks with the smallest sums are therefore marked True.
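The selection of the pre-pruned sub-blocks can likewise be sketched (a hypothetical continuation of the sketch above; truncation toward zero mirrors the 9 * 50% = 4.5, rounded to 4, of this example):

```python
import numpy as np

def mark_pruned_blocks(sums, ratio):
    """Mark the int(total * ratio) sub-blocks with the smallest sums.

    Returns a boolean grid shaped like `sums` (True = pre-pruned),
    mirroring Table 4; e.g. int(9 * 0.5) == 4 blocks for a 50% ratio.
    """
    k = int(sums.size * ratio)
    smallest = np.argsort(sums, axis=None)[:k]  # k smallest sub-block sums
    mask = np.zeros(sums.shape, dtype=bool)
    mask[np.unravel_index(smallest, sums.shape)] = True
    return mask
```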
b. Row-column marking step S2302: select all rows and all columns in which the pre-pruned sub-blocks lie as transposition rows and transposition columns, and mark them.
From Tables 3 and 4, the four sub-blocks with the smallest sums, B21, B22, B23 and B33, are marked True as pre-pruned sub-blocks. Taking the rows and columns in which they lie as the transposition rows and columns, the transposition rows are R2, R3, R4 and R5, and the transposition columns are C0-C5; they are marked ER2, ER3, ER4, ER5 and EC0-EC5, where the prefix ER denotes a transposition row and EC a transposition column, distinguishing them from ordinary rows and columns prefixed with R and C.
c. Row exchange step S2303: sum the absolute values of the weights of the matrix elements in each row, and swap the rows with small sums, in order, into the marked transposition rows.
Computing the sum of each row of Table 2 gives Table 5 below.
Table 5: Row sums

R0  3.7171
R1  2.9466
R2  2.7944
R3  1.4268
R4  4.7622
R5  3.7895
From Table 5, the row sums in ascending order are R3 < R2 < R1 < R0 < R5 < R4, so the rows are moved in this order into the marked rows ER2, ER3, ER4 and ER5, namely:
R3 is swapped into ER2 → R[0 1 3 2 4 5] (since R3 and R2 have now been swapped, R2 no longer needs to be moved);
R1 is swapped into ER4 → R[0 4 3 2 1 5];
R0 is swapped into ER5 → R[5 4 3 2 1 0].
This yields Table 6 below.
Table 6: Matrix after row exchange

R5  0.754   0.7859  0.8424  0.9     0.4225  0.0847
R4  0.9786  0.9656  0.8449  0.6284  0.9309  0.4138
R3  0.3964  0.0769  0.4809  0.1507  0.0296  0.2923
R2  0.0879  0.7366  0.5453  0.017   0.8295  0.5781
R1  0.3311  0.6683  0.8686  0.1087  0.3058  0.6641
R0  0.9373  0.0419  0.7959  0.8278  0.4288  0.6854
That is, the first row obtained after the exchange is row R5 of the original matrix, the second row is row R4 of the original matrix, and so on.
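The row exchange can be sketched as follows (hypothetical Python; the patent fixes the behaviour only through this worked example, so pairing the i-th smallest-sum row with the i-th marked row, and skipping rows that already sit in place, is our reading of it):

```python
import numpy as np

def exchange_rows(W, marked_slots):
    """Step S2303 sketch: swap the smallest-sum rows into the marked rows.

    Pairs the i-th smallest |weight|-sum row with the i-th marked slot and
    swaps it into place, skipping rows already there (as with R2 above).
    Returns the exchanged matrix and the resulting row permutation.
    """
    sums = np.abs(W).sum(axis=1)
    order = np.argsort(sums)                    # row indices, ascending sum
    perm = np.arange(W.shape[0])                # perm[pos] = original row
    for slot, row in zip(sorted(marked_slots), order):
        cur = int(np.where(perm == row)[0][0])  # current position of `row`
        if cur != slot:
            perm[[cur, slot]] = perm[[slot, cur]]
    return W[perm], perm
```

With rows R2-R5 marked, this reproduces the order R[5, 4, 3, 2, 1, 0] derived above; the column exchange of step S2304 can reuse the same helper on the transpose.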
d. Column exchange step S2304: sum the absolute values of the weights of the matrix elements in each column, and swap the columns with small sums, in order, into the marked transposition columns.
Computing the sum of each column of Table 6 (these equal the column sums of Table 2, since the row exchange does not affect them) gives Table 7 below.
Table 7: Column sums

C0      C1      C2      C3      C4      C5
3.4853  3.2752  4.378   2.6326  2.9471  2.7184
From Table 7, the column sums in ascending order are C3 < C5 < C4 < C1 < C0 < C2, so the columns are moved in this order into the marked columns EC0, EC1, EC2, EC3, EC4 and EC5, namely:
C3 is swapped into EC0 → C[3 1 2 0 4 5];
C5 is swapped into EC1 → C[3 5 2 0 4 1];
C4 is swapped into EC2 → C[3 5 4 0 2 1];
C1 is swapped into EC3 → C[3 5 4 1 2 0];
C0 is swapped into EC4 → C[3 5 4 1 0 2];
C2 is swapped into EC5 → C[3 5 4 1 0 2] (C2 already occupies EC5, so the order is unchanged).
This yields Table 8 below.
Table 8: Matrix after column exchange

C3      C5      C4      C1      C0      C2
0.9     0.0847  0.4225  0.7859  0.754   0.8424
0.6284  0.4138  0.9309  0.9656  0.9786  0.8449
0.1507  0.2923  0.0296  0.0769  0.3964  0.4809
0.017   0.5781  0.8295  0.7366  0.0879  0.5453
0.1087  0.6641  0.3058  0.6683  0.3311  0.8686
0.8278  0.6854  0.4288  0.0419  0.9373  0.7959
That is, the first column obtained after the exchange is column C3 of the original matrix, the second column is column C5 of the original matrix, and so on.
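Since columns are rows of the transposed matrix, the column pass can indeed reuse the row helper from the sketch above; a hypothetical check against this example:

```python
import numpy as np

W6 = np.array([  # Table 6: the matrix after the row exchange
    [0.754,  0.7859, 0.8424, 0.9,    0.4225, 0.0847],
    [0.9786, 0.9656, 0.8449, 0.6284, 0.9309, 0.4138],
    [0.3964, 0.0769, 0.4809, 0.1507, 0.0296, 0.2923],
    [0.0879, 0.7366, 0.5453, 0.017,  0.8295, 0.5781],
    [0.3311, 0.6683, 0.8686, 0.1087, 0.3058, 0.6641],
    [0.9373, 0.0419, 0.7959, 0.8278, 0.4288, 0.6854],
])
# All six columns are marked (EC0-EC5), so every column may move.
_, col_perm = exchange_rows(W6.T, marked_slots=range(6))
print(col_perm)  # [3 5 4 1 0 2]: the column order C[3,5,4,1,0,2] of Table 8
```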
Those skilled in the art will appreciate that there is no ordering constraint between the row and column marking and exchange operations: the rows may be exchanged first and then the columns, or vice versa.
The result of the first row-column exchange pass is therefore:
Row order: R[5, 4, 3, 2, 1, 0]
Column order: C[3, 5, 4, 1, 0, 2]
e. Exchange termination test step S2305:
First, the stored sub-block sum Sum1 is computed. Sum1 is the total of the pre-pruned sub-block sums before this exchange pass. Specifically, from Table 3, Sum1 is the total of the sums of the four pre-pruned sub-blocks B21, B22, B23 and B33 before the exchange, i.e. Sum1 = 6.0731. It is stored as the stored sub-block sum, to be used in the comparison that decides whether the row-column exchange is complete.
Second, the pre-pruned sub-block sum Sum2 is computed. The sums of the four pre-pruned sub-blocks B21, B22, B23 and B33 after the exchange are shown in Table 9 below; adding the four sums gives Sum2 = 7.1541.
Table 9: Sub-block sums

2.0269  3.1049  3.4199
1.0381  1.6726  1.5105
2.286   1.4448  2.9329
Third, the pre-pruned sub-block sum Sum2 is compared with the stored sub-block sum Sum1. Here Sum1 < Sum2; since the two are not equal, Sum1 is set equal to Sum2. That is, because 6.0731 < 7.1541, the stored sub-block sum Sum1 is set to 7.1541.
Since Sum1 is smaller than Sum2, the exchange can still make progress, so steps S2301-S2305 are repeated: the weight elements to be pruned are again concentrated into the pre-pruned sub-block positions by row-column exchange, and the pre-pruned sub-block sum is tested for equality with the stored sub-block sum, as detailed below.
As can be seen from the description above, the pre-pruned sub-block sum is computed after each exchange pass, at the sub-block positions determined before that pass, whereas the stored sub-block sum is set according to the comparison at the end of each pass: whenever the two differ, the pre-pruned sub-block sum is stored as the new stored sub-block sum for the next comparison. In the process above, the initial value of the stored sub-block sum is set from the pre-pruned sub-block sum determined at the start of the loop.
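Putting the pieces together, the loop of steps S2301-S2305 might be sketched as follows (a hypothetical sketch built on the helpers above; the stopping rule is the Sum1 = Sum2 test just described):

```python
import numpy as np

def concentrate(W, bs, ratio, max_iter=100):
    """Iterate steps S2301-S2305 until the pre-pruned block sum stabilizes.

    Returns the row/column permutations and the final prune mask; relies
    on the hypothetical helpers block_sums, mark_pruned_blocks and
    exchange_rows from the earlier sketches.
    """
    row_perm = np.arange(W.shape[0])
    col_perm = np.arange(W.shape[1])
    sum1 = None                                          # stored sub-block sum
    for _ in range(max_iter):
        Wp = np.abs(W[row_perm][:, col_perm])
        mask = mark_pruned_blocks(block_sums(Wp, bs), ratio)      # S2301
        blocks = np.argwhere(mask)
        rows = sorted({i for bi, _ in blocks for i in range(bi * bs, bi * bs + bs)})
        cols = sorted({j for _, bj in blocks for j in range(bj * bs, bj * bs + bs)})
        _, rp = exchange_rows(Wp, rows)                  # S2303: rows
        row_perm = row_perm[rp]
        Wp = np.abs(W[row_perm][:, col_perm])
        _, cp = exchange_rows(Wp.T, cols)                # S2304: columns
        col_perm = col_perm[cp]
        Wp = np.abs(W[row_perm][:, col_perm])
        sum2 = block_sums(Wp, bs)[mask].sum()            # S2305: Sum2
        if sum1 is not None and np.isclose(sum1, sum2):
            break                                        # stable: Sum1 == Sum2
        sum1 = sum2                                      # store for next pass
    return row_perm, col_perm, mask
```

On the 6*6 example above this stops after the third pass, with the row order R[5, 4, 3, 2, 1, 0] and the column order C[3, 5, 4, 1, 0, 2].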
f. Repeat steps S2301-S2305:
√ Pre-pruned sub-block determination step S2301: determine the pre-pruned sub-blocks that serve as pruning candidates.
The sub-blocks with small sums are again taken as the pre-pruned sub-blocks, so based on Table 9 the pre-pruned sub-blocks are re-selected, as shown in Table 10 below.
Table 10: Marked pre-pruned sub-blocks

B11: False  B12: False  B13: False
B21: True   B22: True   B23: True
B31: False  B32: True   B33: False
√ Row-column marking step S2302: select all rows and all columns in which the pre-pruned sub-blocks lie as transposition rows and transposition columns, and mark them.
From Table 10, the four sub-blocks with the smallest sums, B21, B22, B23 and B32, are marked True as pre-pruned sub-blocks. Taking the rows and columns in which they lie as the transposition rows and columns, the transposition rows and columns comprise R2, R3, R4, R5 and C0-C5, marked ER2, ER3, ER4, ER5 and EC0-EC5.
√ Row exchange step S2303: sum the absolute values of the weights of the matrix elements in each row, and swap the rows with small sums, in order, into the marked transposition rows.
Table 11: Row sums

R0  3.7895
R1  4.7622
R2  1.4268
R3  2.7944
R4  2.9466
R5  3.7171
From Table 11, the row sums in ascending order are R2 < R3 < R4 < R5 < R0 < R1, so the rows would be moved in this order into the marked rows ER2, ER3, ER4 and ER5. Since the order of the smallest-sum rows already coincides one-to-one with the order of the transposition rows, no row exchange is performed, and the matrix remains that of Table 8.
√ Column exchange step S2304: sum the absolute values of the weights of the matrix elements in each column, and swap the columns with small sums, in order, into the marked transposition columns.
Computing the sum of each column of Table 8 gives Table 12 below.
Table 12: Column sums

C0      C1      C2      C3      C4      C5
2.6326  2.7184  2.9471  3.2752  3.4853  4.378
From Table 12, the column sums in ascending order are C0 < C1 < C2 < C3 < C4 < C5, so the columns would be moved in this order into the marked columns EC0, EC1, EC2, EC3, EC4 and EC5. Since the order of the smallest-sum columns already coincides one-to-one with the order of the transposition columns, no column exchange is performed, and the matrix remains that of Table 8.
√ Exchange termination test step S2305:
At this point, the stored sub-block sum Sum1 was already set to 7.1541 in the first exchange pass, i.e. the pre-pruned sub-block sum before this second pass.
The pre-pruned sub-block sum is computed: the sums of the four pre-pruned sub-blocks B21, B22, B23 and B32 are shown in Table 13 below (since no actual exchange took place in this second pass, as described above, Table 13 is identical to Table 9); adding the four sums gives Sum2 = 5.666. This is compared with the stored sub-block sum Sum1 = 7.1541. The two differ, so by the same rule Sum1 is set equal to Sum2: since Sum1 = 7.1541 > Sum2 = 5.666, Sum1 is set to 5.666.
Table 13: Sub-block sums

2.0269  3.1049  3.4199
1.0381  1.6726  1.5105
2.286   1.4448  2.9329
Here, since Sum1 is greater than Sum2, the exchange can still make progress, so steps S2301-S2305 are repeated: the weight elements to be pruned are again concentrated into the pre-pruned sub-block positions by row-column exchange, and the pre-pruned sub-block sum is tested for equality with the stored sub-block sum, as detailed below.
The result of the second row-column exchange pass is therefore:
Row order: R[5, 4, 3, 2, 1, 0]
Column order: C[3, 5, 4, 1, 0, 2]
g. Repeat steps S2301-S2305:
√ Pre-pruned sub-block determination step S2301: determine the pre-pruned sub-blocks that serve as pruning candidates.
The sub-blocks with small sums are again taken as the pre-pruned sub-blocks; from Table 13, the four sub-blocks B21, B22, B23 and B32 still have the smallest sums, so the pre-pruned sub-blocks are unchanged.
With the pre-pruned sub-blocks unchanged, the transposition rows and columns are also unchanged, and since the rows and columns of Table 8 were not altered in the second pass, no actual exchange takes place in this third pass either; every step produces the same result as in the second pass.
The result of the third row-column exchange pass is therefore still:
Row order: R[5, 4, 3, 2, 1, 0]
Column order: C[3, 5, 4, 1, 0, 2]
Unlike the second pass, however, in this third pass the stored sub-block sum Sum1 equals 5.666 and the pre-pruned sub-block sum Sum2 also equals 5.666, i.e. Sum1 = Sum2. When this condition is met, the exchange loop ends.
Thus, from the process above: as long as the stored sub-block sum Sum1 is not equal to the pre-pruned sub-block sum Sum2, the result of the row-column exchange has not yet stabilized and may still change, so the exchange continues. Only when Sum1 equals Sum2 has the result stabilized, so that the sums of the absolute weights of the matrix elements in the sub-blocks to be pruned are smaller than those in the other sub-blocks; the small-weight matrix elements have then been definitively concentrated into the sub-blocks to be pruned and pruning can proceed, at which point the exchange loop is exited and the next step begins.
4. Sub-block pruning step S240: prune away the weights of the matrix elements in the sub-blocks to be pruned, thereby compressing the network model of the neural network.
It should be noted that pruning here is not limited to setting the matrix elements themselves to 0. For a basic module that performs the matrix-vector multiplication in hardware through circuit devices, the devices at the positions corresponding to the pruned matrix elements can simply be omitted. More specifically, when the hardware devices realizing the weight matrix are laid out, the devices performing the blockwise computation at those positions are removed.
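In software, the pruning of step S240 amounts to zeroing the selected sub-blocks; a minimal sketch, assuming the mask and block size of the earlier sketches (in hardware, as just noted, the corresponding device arrays are simply not placed):

```python
import numpy as np

def prune_blocks(W, mask, bs):
    """Step S240 sketch: zero every bs x bs sub-block marked True in mask."""
    W = W.copy()
    for bi, bj in np.argwhere(mask):
        W[bi * bs:(bi + 1) * bs, bj * bs:(bj + 1) * bs] = 0.0
    return W
```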
Thus, through the steps above, the weight elements to be pruned are concentrated into matrix sub-blocks, those sub-blocks are pruned away directly, and the result is used as the initial value for further neural network training; while preserving the network's quality, the number of arrays used is reduced, greatly lowering the resource overhead.
The method proposed by the present invention is fully applicable to neural networks based on memristors or on the TrueNorth chip. By contrast, traditional network compression methods are not suitable for such networks, because even if the network model is compressed to a very small size, the number of arrays used cannot be reduced and resource consumption is not lowered.
It should also be noted that the row-column exchange steps given above are only an example; this exchange scheme is not the only option. Specifically, in the row exchange pass above, the row sums in ascending order are R3 < R2 < R1 < R0 < R5 < R4, and the rows are moved in this order into the marked rows ER2, ER3, ER4 and ER5. That is, the present invention exchanges the rows containing the smallest sums into the transposition rows, which greatly speeds up the exchange, concentrating the small-weight elements into the matrix sub-blocks more quickly. Clearly, however, it is also possible to exchange the rows directly into sequential positions in ascending order of their sums: with the order R3 < R2 < R1 < R0 < R5 < R4, the rows would be moved in this order into positions R[0, 1, 2, 3, 4, 5], and the other steps would then follow as before. This merely requires more exchanges and is less efficient, so it is not the preferred scheme.
In addition, although in the present invention the number of sub-blocks to be pruned is determined by setting a compression ratio, it may equally be determined by setting a threshold, as long as the compression objective is met.
In summary, the core of the inventive concept of the present invention is to obtain prunable sub-blocks through row-column exchange, suited to blockwise computing applications, without restricting the particular exchange scheme that may be adopted.
II. Practical Example
A practical example is given below to show that, with the same input and the same operation, the weight matrix obtained by the compression method of the present invention and the initial weight matrix output identical results.
Table 14 shows the weight matrix after row-column exchange according to the compression method of the present invention (corresponding to Table 8), where Null marks the pre-pruned matrix elements.
Table 14: Weight matrix after row-column exchange

 0.9      0.0847  -0.4225   0.7859   0.754   -0.8424
 0.6284   0.4138  -0.9309  -0.9656   0.9786   0.8449
 Null     Null     Null     Null     Null     Null
 Null     Null     Null     Null     Null     Null
 0.1087  -0.6641   Null     Null     0.3311   0.8686
 0.8278   0.6854   Null     Null     0.9373   0.7959
Table 15 is the initial matrix obtained by restoring Table 14 to the initial row and column order (given below), where Null marks the pre-pruned matrix elements.
Row order: R[5, 4, 3, 2, 1, 0]
Column order: C[3, 5, 4, 1, 0, 2]
Table 15: Weight matrix without row-column exchange

 0.9373   Null     0.7959   0.8278   Null     0.6854
 0.3311   Null     0.8686   0.1087   Null    -0.6641
 Null     Null     Null     Null     Null     Null
 Null     Null     Null     Null     Null     Null
 0.9786  -0.9656   0.8449   0.6284  -0.9309   0.4138
 0.754    0.7859  -0.8424   0.9     -0.4225   0.0847
As can be seen, the essential difference between Table 15 and Table 14 is that the pre-pruned elements of Table 15 are scattered, whereas those of Table 14 are gathered together in the form of 2*2 sub-blocks. In an actual layout, the hardware is therefore arranged according to Table 14 (the matrix after row-column exchange) to suit the needs of blockwise computation. This is the general inventive concept of the present invention: making the compression method fit the corresponding blockwise computing application.
A comparison of the inputs and outputs of the two, based on Tables 14 and 15, is given below.
1. For the initial weight matrix without row-column exchange (Table 15):
Assume the input vector is:
Table 16: Initial input vector

0.3769  0.9087  0.6857  0.0513  0.6081  0.9523
Taking the dot product with the unexchanged initial weight matrix of Table 15, i.e. summing the products of the vector entries with the corresponding matrix entries, yields dot product result 1:
Table 17: Dot product result 1

1.9673  0.1612  0.8008  1.6500  -0.9684  -0.0128
2. For the weight matrix after row-column exchange (Table 14):
Before taking the dot product with the exchanged weight matrix of Table 14, the initial input vector of Table 16 must first be permuted according to the row order R[0,1,2,3,4,5] → R[5,4,3,2,1,0], giving:
Table 18: Row-exchanged input vector

0.9523  0.6081  0.0513  0.6857  0.9087  0.3769
Taking the dot product of the row-exchanged input vector with the exchanged weight matrix of Table 14, again summing the products of the corresponding entries, yields dot product result 2:
Table 19: Dot product result 2

1.6500  -0.0128  -0.9684  0.1612  1.9673  0.8008
Dot product result 2 is then permuted back according to the column order C[3,5,4,1,0,2] → C[0,1,2,3,4,5]:
Table 20: Dot product result 2 after permutation

1.9673  0.1612  0.8008  1.6500  -0.9684  -0.0128
Comparing the data of Tables 17 and 20 above, it can be seen that for the same input vector, provided the rows and columns are exchanged appropriately, the data-compressed weight matrix obtained by the compression method of the present invention still produces a dot product identical to that of the initial matrix. In other words, the compression method of the present invention does not affect the computation the device is meant to realize, and since it enables blockwise compression, it can effectively reduce the number of devices, allowing larger neural networks to be laid out with limited resources.
Of course, this also means that when the weight matrix obtained by the compression method of the present invention is applied to data, the input data must first be permuted according to the row-column exchange order, the permuted input data is then matrix-multiplied with the final weight matrix, and finally the result of the matrix multiplication is inversely permuted according to the exchange order and output as the output data.
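This bookkeeping can be checked end-to-end against Tables 14-20 (a hypothetical sketch; the pruned Null entries are written as 0):

```python
import numpy as np

W = np.array([  # Table 15: pruned matrix in the original order, Null -> 0
    [0.9373,  0.0,     0.7959,  0.8278,  0.0,     0.6854],
    [0.3311,  0.0,     0.8686,  0.1087,  0.0,    -0.6641],
    [0.0,     0.0,     0.0,     0.0,     0.0,     0.0   ],
    [0.0,     0.0,     0.0,     0.0,     0.0,     0.0   ],
    [0.9786, -0.9656,  0.8449,  0.6284, -0.9309,  0.4138],
    [0.754,   0.7859, -0.8424,  0.9,    -0.4225,  0.0847],
])
v = np.array([0.3769, 0.9087, 0.6857, 0.0513, 0.6081, 0.9523])  # Table 16

row_order = [5, 4, 3, 2, 1, 0]           # R[5,4,3,2,1,0]
col_order = [3, 5, 4, 1, 0, 2]           # C[3,5,4,1,0,2]
W_exch = W[row_order][:, col_order]      # Table 14

out_exch = v[row_order] @ W_exch         # permute input, multiply: Table 19
out = np.empty_like(out_exch)
out[col_order] = out_exch                # undo the column order: Table 20
assert np.allclose(out, v @ W)           # identical to Table 17
print(np.round(out, 4))  # [ 1.9673  0.1612  0.8008  1.65   -0.9684 -0.0128]
```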
III. Effect Verification
To verify the practical compression effect of the algorithm of the present invention, the applicant carried out a series of experiments.
Fig. 6a shows the accuracy after compression on the CIFAR10 dataset, which contains 60000 color images of 32*32 pixels, each belonging to one of 10 classes. Fig. 6b shows the accuracy after compression on the MNIST dataset under the LENET network, where MNIST contains 60000 black-and-white handwritten digit images of 28*28 pixels. Fig. 6c shows the accuracy after compression on the MNIST dataset under an MLP network.
In the figures, the abscissa is the compression ratio, the ordinate is the accuracy, and lines of different colors represent arrays of different sizes. At compression ratios of 0-80%, the accuracy on CIFAR10 stays between 84% and 85%, and on MNIST, whether with a 16*16 array or a 256*256 array, it stays around 98%-99% or even higher, which demonstrates from several angles that the accuracy of the data compression method of the present invention is quite good. In other words, across multiple datasets and network scales, the compression method of the present invention can greatly compress the network and save resource overhead without affecting accuracy.
Admittedly, it can also be seen from Fig. 6c that some results do not compress well; this is tied to overly large array sizes. For example, the group of curves with the largest fluctuations uses a 256*256 array; at such a large array size too much valid data is pruned away, which hurts the accuracy.
In addition, as the compression ratio rises, the accuracy drops somewhat; this is the inevitable price of trading accuracy for more compression. For different applications, those skilled in the art can choose a suitable compression ratio according to actual needs so as to guarantee sufficient accuracy.
It should be noted that although the drawings show the steps in a certain order, this does not mean the steps can only be executed in the order shown or described; provided there is no logical contradiction, the steps may be executed in an order different from that shown.
The embodiments of the present invention have been described above. The description is illustrative rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The scope of protection of the present invention is therefore defined by the claims.

Claims (13)

1. A network model block compression method for a neural network, comprising:
    a weight matrix obtaining step of obtaining a weight matrix of the network model of a trained neural network;
    a weight matrix blocking step of dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size;
    a weight-element concentration step of concentrating, by row-column exchange based on the sums of the absolute values of the weights of the matrix elements in the sub-blocks, the matrix elements with small weights into sub-blocks to be pruned, such that the sums of the absolute values of the weights of the matrix elements in the sub-blocks to be pruned are smaller than those in the other sub-blocks, which are not to be pruned; and
    a sub-block pruning step of pruning away the weights of the matrix elements in the sub-blocks to be pruned to obtain a final weight matrix, thereby compressing the network model of the neural network.
2. The network model block compression method according to claim 1, wherein the number of sub-blocks to be pruned is set according to a compression ratio or according to a threshold.
3. The network model block compression method according to claim 1, wherein the weight-element concentration step comprises the following steps:
    a pre-pruned sub-block determination step of determining pre-pruned sub-blocks as pruning candidates;
    a row-column marking step of selecting and marking all rows and all columns in which the pre-pruned sub-blocks lie as transposition rows and transposition columns, wherein the number of pre-pruned sub-blocks is set according to a compression ratio;
    a row exchange step and a column exchange step of summing the absolute values of the weights of the matrix elements in each row and swapping the rows with small sums, in order, into the marked transposition rows, and summing the absolute values of the weights of the matrix elements in each column and swapping the columns with small sums, in order, into the marked transposition columns; and
    repeating the above steps until further exchange no longer changes the total of the absolute values of the weights of the matrix elements in all pre-pruned sub-blocks, the pre-pruned sub-blocks at that point serving as the sub-blocks to be pruned.
4. The network model block compression method according to claim 3, wherein the pre-pruned sub-block determination step further comprises: computing the sum of the absolute values of the weights of the matrix elements in each initial sub-block, and taking the sub-blocks with small sums as the pre-pruned sub-blocks.
5. A neural network training method, comprising the following steps:
    training a neural network to obtain a weight matrix of the network model;
    compressing the weight matrix by the network model block compression method according to claims 1-4; and
    iterating the above steps until a predetermined iteration termination criterion is met.
6. A computing device for neural network computation, comprising a memory and a processor, the memory storing computer-executable instructions which include network model compression instructions, and the processor, when executing the network model compression instructions, performing the following method:
    a weight matrix obtaining step of obtaining a weight matrix of the network model of a trained neural network;
    a weight matrix blocking step of dividing the weight matrix into an array of initial sub-blocks according to a predetermined array size;
    a weight-element concentration step of concentrating, by row-column exchange based on the sums of the absolute values of the weights of the matrix elements in the sub-blocks, the matrix elements with small weights into sub-blocks to be pruned, such that the sums of the absolute values of the weights of the matrix elements in the sub-blocks to be pruned are smaller than those in the other sub-blocks, which are not to be pruned; and
    a sub-block pruning step of pruning away the weights of the matrix elements in the sub-blocks to be pruned to obtain a final weight matrix, thereby compressing the network model of the neural network.
7. The computing device according to claim 6, wherein the number of sub-blocks to be pruned is set according to a compression ratio or according to a threshold.
8. The computing device according to claim 6, wherein the weight-element concentration step comprises the following steps:
    a pre-pruned sub-block determination step of determining pre-pruned sub-blocks as pruning candidates;
    a row-column marking step of selecting and marking all rows and all columns in which the pre-pruned sub-blocks lie as transposition rows and transposition columns, wherein the number of pre-pruned sub-blocks is set according to a compression ratio;
    a row exchange step and a column exchange step of summing the absolute values of the weights of the matrix elements in each row and swapping the rows with small sums, in order, into the marked transposition rows, and summing the absolute values of the weights of the matrix elements in each column and swapping the columns with small sums, in order, into the marked transposition columns; and
    repeating the above steps until further exchange no longer changes the total of the absolute values of the weights of the matrix elements in all pre-pruned sub-blocks, the pre-pruned sub-blocks at that point serving as the sub-blocks to be pruned.
9. The computing device according to claim 8, wherein the pre-pruned sub-block determination step further comprises: computing the sum of the absolute values of the weights of the matrix elements in each initial sub-block, and taking the sub-blocks with small sums as the pre-pruned sub-blocks.
10. The computing device according to claim 6, wherein the computer-executable instructions further include network model application instructions, and the processor, when executing the network model application instructions, performs the following method:
    an input data processing step of permuting the input data according to the row-column exchange order;
    a matrix multiplication step of matrix-multiplying the permuted input data with the final weight matrix obtained by executing the network model compression instructions; and
    an output data processing step of inversely permuting the result of the matrix multiplication according to the row-column exchange order and outputting it as output data.
11. The computing device according to claim 10, wherein the computer-executable instructions further include network model training instructions, and the processor, when executing the network model training instructions, performs the following method:
    training a neural network to obtain an initial weight matrix of the network model;
    executing the network model compression instructions to obtain a compressed final weight matrix;
    executing the network model application instructions to perform training; and
    iterating the above compression and training steps until a predetermined iteration termination criterion is met.
12. A hardware system for network model compression, application and training, employing the network model block compression method according to claims 1-4, the neural network training method according to claim 5, and the computing device according to claims 6-11, comprising:
    a neural network hardware chip having basic modules that perform matrix-vector multiplication in hardware through circuit devices,
    wherein no circuit devices are arranged at the positions corresponding to the matrix elements in the sub-blocks to be pruned.
  12. 根据权利要求12所述的硬件***,其中,所述电路器件为忆阻器或TrueNorth芯片的神经突触。The hardware system of claim 12 wherein said circuit device is a synapse or a synapse of a TrueNorth chip.
PCT/CN2017/119819 2017-12-29 2017-12-29 Neural network model block compression method, training method, computing device and system WO2019127362A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2017/119819 WO2019127362A1 (en) 2017-12-29 2017-12-29 Neural network model block compression method, training method, computing device and system
CN201780042629.4A CN109791628B (en) 2017-12-29 2017-12-29 Neural network model block compression method, training method, computing device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/119819 WO2019127362A1 (en) 2017-12-29 2017-12-29 Neural network model block compression method, training method, computing device and system

Publications (1)

Publication Number Publication Date
WO2019127362A1

Family

ID=66495633

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/119819 WO2019127362A1 (en) 2017-12-29 2017-12-29 Neural network model block compression method, training method, computing device and system

Country Status (2)

Country Link
CN (1) CN109791628B (en)
WO (1) WO2019127362A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659731B (en) * 2018-06-30 2022-05-17 华为技术有限公司 Neural network training method and device
CN113052292B (en) * 2019-12-27 2024-06-04 北京硅升科技有限公司 Convolutional neural network technique method, device and computer readable storage medium
CN111259396A (en) * 2020-02-01 2020-06-09 贵州师范学院 Computer virus detection method based on deep learning convolutional neural network and compression method of deep learning neural network
CN112861549B (en) * 2021-03-12 2023-10-20 云知声智能科技股份有限公司 Method and equipment for training translation model
CN113052307B (en) * 2021-03-16 2022-09-06 上海交通大学 Memristor accelerator-oriented neural network model compression method and system
CN113252984B (en) * 2021-07-06 2021-11-09 国网湖北省电力有限公司检修公司 Measurement data processing method and system based on Bluetooth insulator measuring instrument

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9400955B2 (en) * 2013-12-13 2016-07-26 Amazon Technologies, Inc. Reducing dynamic range of low-rank decomposition matrices
CN106297778A (en) * 2015-05-21 2017-01-04 中国科学院声学研究所 The neutral net acoustic model method of cutting out based on singular value decomposition of data-driven
CN106529670B (en) * 2016-10-27 2019-01-25 中国科学院计算技术研究所 It is a kind of based on weight compression neural network processor, design method, chip
CN106779051A (en) * 2016-11-24 2017-05-31 厦门中控生物识别信息技术有限公司 A kind of convolutional neural networks model parameter processing method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239825A (en) * 2016-08-22 2017-10-10 北京深鉴智能科技有限公司 Consider the deep neural network compression method of load balancing
CN106650928A (en) * 2016-10-11 2017-05-10 广州视源电子科技股份有限公司 Method and device for optimizing neural network
CN106779068A (en) * 2016-12-05 2017-05-31 北京深鉴智能科技有限公司 The method and apparatus for adjusting artificial neural network
CN106919942A (en) * 2017-01-18 2017-07-04 华南理工大学 For the acceleration compression method of the depth convolutional neural networks of handwritten Kanji recognition
CN107368885A (en) * 2017-07-13 2017-11-21 北京智芯原动科技有限公司 Network model compression method and device based on more granularity beta prunings

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115724A (en) * 2020-07-23 2020-12-22 云知声智能科技股份有限公司 Optimization method and system for fine adjustment of multi-domain neural network in vertical domain
CN112115724B (en) * 2020-07-23 2023-10-20 云知声智能科技股份有限公司 Optimization method and system for fine adjustment of multi-domain neural network in vertical domain
CN113642710A (en) * 2021-08-16 2021-11-12 北京百度网讯科技有限公司 Network model quantification method, device, equipment and storage medium
CN113642710B (en) * 2021-08-16 2023-10-31 北京百度网讯科技有限公司 Quantification method, device, equipment and storage medium of network model
CN114781650A (en) * 2022-04-28 2022-07-22 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN114781650B (en) * 2022-04-28 2024-02-27 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109791628B (en) 2022-12-27
CN109791628A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
WO2019127362A1 (en) Neural network model block compression method, training method, computing device and system
WO2021004366A1 (en) Neural network accelerator based on structured pruning and low-bit quantization, and method
Chang et al. Hardware accelerators for recurrent neural networks on FPGA
WO2019127363A1 (en) Weight coding method for neural network, computing apparatus, and hardware system
US20240185050A1 (en) Analog neuromorphic circuit implemented using resistive memories
WO2019091020A1 (en) Weight data storage method, and neural network processor based on method
Ankit et al. TraNNsformer: Neural network transformation for memristive crossbar based neuromorphic system design
CN112257844B (en) Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof
CN110084364B (en) Deep neural network compression method and device
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
Cai et al. Training low bitwidth convolutional neural network on RRAM
WO2019001323A1 (en) Signal processing system and method
CN109993275A (en) A kind of signal processing method and device
KR20220101418A (en) Low power high performance deep-neural-network learning accelerator and acceleration method
Shen et al. PRAP-PIM: A weight pattern reusing aware pruning method for ReRAM-based PIM DNN accelerators
Yuan et al. A dnn compression framework for sot-mram-based processing-in-memory engine
CN113435581B (en) Data processing method, quantum computer, device and storage medium
Ascia et al. Improving inference latency and energy of network-on-chip based convolutional neural networks through weights compression
CN112132272B (en) Computing device, processor and electronic equipment of neural network
CN113705784A (en) Neural network weight coding method based on matrix sharing and hardware system
Guo et al. A multi-conductance states memristor-based cnn circuit using quantization method for digital recognition
Peng et al. Network Pruning Towards Highly Efficient RRAM Accelerator
Chang et al. UCP: Uniform channel pruning for deep convolutional neural networks compression and acceleration
Huang et al. BWA-NIMC: Budget-based Workload Allocation for Hybrid Near/In-Memory-Computing
US11113623B2 (en) Multi-sample system for emulating a quantum computer and methods for use therewith

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17936159

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17936159

Country of ref document: EP

Kind code of ref document: A1