CN112016672A - Method and medium for neural network compression based on sensitivity pruning and quantization - Google Patents


Info

Publication number
CN112016672A
Authority
CN
China
Prior art keywords
neural network
sensitivity
pruning
weight
quantization
Prior art date
Legal status
Pending
Application number
CN202010684270.4A
Other languages
Chinese (zh)
Inventor
颜军
许怡冰
龚永红
赵宁波
陈绍波
黄腾杰
Current Assignee
Zhuhai Orbita Aerospace Technology Co ltd
Original Assignee
Zhuhai Orbita Aerospace Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Orbita Aerospace Technology Co ltd filed Critical Zhuhai Orbita Aerospace Technology Co ltd
Priority to CN202010684270.4A priority Critical patent/CN112016672A/en
Publication of CN112016672A publication Critical patent/CN112016672A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering


Abstract

The invention provides a method for compressing a neural network based on sensitivity pruning and quantization, which comprises the following steps: pruning the neural network based on sensitivity; calculating the clustering centers of each layer of the neural network with the K-Means++ algorithm and representing each layer's weights by the clustering centers; and quantizing the network. Through the three steps of network pruning, K-Means++-based weight sharing and weight quantization, the method achieves considerable compression of deep neural networks. Although there is still room to improve the compression rate, the compressed model is already small enough to make deployment of deep neural networks on the mobile side possible.

Description

Method and medium for neural network compression based on sensitivity pruning and quantization
Technical Field
The invention belongs to the technical field of artificial-intelligence network model compression, and mainly relates to a neural network compression method and medium based on sensitivity pruning and quantization.
Background
Deep neural networks are becoming ubiquitous in applications ranging from computer vision to speech recognition and natural language processing. These large networks are very powerful, but their size consumes considerable storage, memory bandwidth and computational resources, demands that are increasingly hard to meet on embedded mobile devices.
Deep network compression methods can be roughly divided into two types. The first modifies a trained model, fine-tuning it while removing redundant parameters, so that the storage of the network is reduced without losing the original accuracy. The second proposes new convolution computation schemes that reduce the parameter count and thereby compress the model. The main methods fall broadly into the following categories:
(1) network pruning: the complexity and the over-fitting phenomenon of network parameters are reduced;
(2) network quantization: reducing the redundancy problem of the deep learning model;
(3) matrix decomposition: the redundancy of parameters in the model is reduced, and the waste of calculation and storage is avoided.
Based on this research landscape, pruning, quantization and matrix decomposition are the three main approaches to deep neural network compression. Current research mainly applies a series of compressions to the fully-connected layers of deep convolutional neural networks, but the compression rate is not high and the convolutional layers are not pruned. Moreover, two problems remain in pruning convolutional networks: the pruning process may delete the wrong weights to varying degrees, and the learning process is slow while the memory cost remains large.
Citation of documents:
[1] Han S, Mao H, Dally W J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding[J]. International Conference on Learning Representations, 2016, 56(4): 3-7.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a neural network compression method based on sensitivity pruning and quantization.
The invention also provides a computer readable storage medium for implementing the method.
According to an embodiment of the first aspect of the invention, a method for compressing a neural network based on sensitivity pruning and quantization comprises the following steps:
S100, inputting a neural network, pruning the neural network based on sensitivity, and storing the pruned neural network; S200, training the pruned neural network, and then returning to S100 until the accuracy change of the pruned neural network is within a set range; S300, calculating the clustering centers of each layer's weights of the pruned neural network based on the K-means++ algorithm, and representing each layer's weights by the obtained clustering centers; S400, quantizing the pruned neural network and outputting the quantized neural network.
According to some embodiments of the invention, the S100 comprises: S110, calculating the sensitivity of each input node to the whole neural network; S120, calculating the average value of the corresponding components of the sensitivities of the input nodes as the average sensitivity; and S130, deleting the input nodes whose component values fall below the average sensitivity.
According to some embodiments of the invention, the S100 further comprises: S140, storing the weights of the pruned neural network in the row-wise compressed sparse row (CSR) format.
According to some embodiments of the invention, the S200 comprises: adopting a global training mode, taking the pruned parameters as the initial parameters of the next training round, and updating and fine-tuning all the parameters.
According to some embodiments of the invention, the S300 comprises: S310, randomly selecting a point from the input weight data as a clustering center; S320, calculating the distance D(x) between each point in the weight data and the nearest clustering center; S330, selecting the next clustering center based on the probability p, wherein the probability p is calculated as:
p = D(x)² / Σ_{x∈X} D(x)²
where the sum runs over all points x in the weight data X.
s340, returning to the step S320 until the number of the clustering centers reaches a threshold value; and S350, executing a K-means algorithm based on the clustering center.
According to some embodiments of the invention, the S330 comprises: randomly selecting a plurality of points from the weight matrix as seed points, calculating the distance D(x) between each point and the nearest seed point, storing the distances in an array, and summing them to obtain Sum(D(x)); a random value C_i within Sum(D(x)) is then computed as C_i = Sum(D(x)) × λ, where λ is a random number between 0 and 1.
According to some embodiments of the invention, the S400 comprises: quantizing each 32-bit floating-point weight, sacrificing some precision to reduce storage.
According to some embodiments of the invention, the S400 comprises: representing the original weights of each layer by the obtained clustering centers, so that a plurality of connections in each layer share the same weight; each weight is stored as an index into the weight-sharing table.
According to some embodiments of the invention, the S400 comprises: the cluster centers are stored in a codebook.
A computer-readable storage medium according to an embodiment of the second aspect of the present invention has stored thereon program instructions for executing the method for compressing a neural network based on sensitivity pruning and quantization according to any one of the embodiments of the first aspect.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a method for compressing a neural network based on sensitivity pruning and quantization, aiming at the serious redundancy existing in a deep neural network, a network pruning strategy is used for changing an original deep neural network into a sparse network. This procedure greatly reduces the complexity of the network and reduces overfitting. The network pruning strategy is simple and effective, and the preliminary compression is realized on the deep neural network.
The invention clusters the weights of each layer of the network with the K-Means++ method to realize weight sharing. Plain K-Means selects its initial points randomly, so the final clustering may deviate greatly from the actual data distribution; the K-Means++ algorithm avoids this problem and selects the initial points effectively.
The invention prunes the original network into a sparse network, realizing a preliminary compression; clusters the weights of each layer with the K-Means++ method to realize weight sharing; and quantizes the weights after sharing to realize the final compression. Through the three steps of network pruning, K-Means++-based clustering and weight quantization, an effective compression of 30 to 40 times is achieved on the whole deep neural network with essentially no loss of precision. Although there is still room to improve the compression rate, the compressed model is already small enough to make deployment of deep neural networks on the mobile side possible.
Compared with existing network compression methods, the method compresses not only the fully-connected layers of the deep neural network but also the convolutional layers. Notably, after compression by the method of the present invention, accuracy is essentially not lost and is even improved to a certain extent, outperforming the prior art, which benefits from the improved clustering method. Furthermore, the method quantizes the convolutional and fully-connected layers to the same code length, avoiding the redundancy caused by inconsistent coding lengths; no Huffman coding stage is needed, making the compression method simpler and more effective.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a comparison of the network before (dense) and after (sparse) pruning according to an embodiment of the present invention;
FIG. 3 is a CSR storage format example of an embodiment of the invention;
FIG. 4 is a schematic diagram of k-means++ cluster center selection according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an example of k-means++ initial cluster center selection in accordance with an embodiment of the present invention;
FIG. 6 is a second cluster center selection calculation table according to an embodiment of the present invention;
FIG. 7 is a table summarizing different network compression effects according to embodiments of the present invention;
FIG. 8 is a bar chart of Top-1error for a deep network under different compression schemes according to an embodiment of the present invention.
Detailed Description
The conception, specific structure and technical effects of the present invention are described clearly and completely below in conjunction with the embodiments and the accompanying drawings, so that the objects, schemes and effects of the present invention can be fully understood.
In the development of deep neural network compression, network pruning reduces the complexity of the network by reducing the number of parameters and is a very effective technique for preventing overfitting. Pruning, quantization, SVD decomposition of weight matrices and modifications of the network structure all contribute to deep compression, but their compression ratios remain low and far from satisfactory. The embodiment of the invention uses a pruning method based on sensitivity analysis to change the structure of the deep network from dense to sparse, and then quantizes the weights to realize further compression.
The embodiment of the invention provides a method for compressing a neural network based on sensitivity pruning and quantization, which comprises the following steps: S100, inputting a neural network, pruning the neural network based on sensitivity, and storing the pruned neural network; S200, training the pruned neural network, and then returning to S100 until the accuracy change of the pruned neural network is within a set range; S300, calculating the clustering centers of each layer's weights of the pruned neural network with the K-means++ algorithm, and representing each layer's weights by the obtained clustering centers; S400, quantizing the pruned neural network and outputting the quantized neural network. Here, the set range for the accuracy change means that accuracy remains unchanged or is slightly improved.
In this embodiment, to address the severe redundancy in the deep neural network, the original network is turned into a sparse network by a network pruning strategy. This greatly reduces the complexity of the network and reduces overfitting; the strategy is simple and effective and achieves a preliminary compression. The invention then clusters the weights of each layer with the K-Means++ method to realize weight sharing: plain K-Means selects its initial points randomly, so the final clustering may deviate greatly from the actual data distribution, whereas K-Means++ avoids this problem and selects initial points effectively. Weight quantization after weight sharing realizes the final compression. Through the three steps of network pruning, K-Means++-based clustering and weight quantization, an effective compression of 30 to 40 times is achieved on the whole deep neural network.
Referring to FIG. 1, in some embodiments, sensitivity-based pruning is implemented as in part a of FIG. 1. First, normal network training learns which connections are important. Second, the input nodes whose component values fall below the average sensitivity are deleted. Third, the network is retrained to fine-tune the remaining connections. Retraining has two modes, sparse training and global training; this embodiment adopts the global mode, i.e., the pruned parameters serve as the initial parameters of the next round of training and all parameters are updated and fine-tuned (see the sketch after this paragraph). The network is then sensitivity-pruned again, and the process repeats until the accuracy of the pruned network remains unchanged or slightly improves.
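To make the global training mode concrete, here is a minimal numpy sketch: a binary pruning mask keeps deleted connections at zero while gradient descent fine-tunes all surviving parameters. The toy least-squares layer, the random data, and the magnitude-based mask (a simple stand-in for the sensitivity criterion of S110-S130) are illustrative assumptions, not the patented procedure itself.

```python
import numpy as np

# Toy sketch of "global training": pruned weights stay at zero via a mask,
# while ALL surviving parameters are updated by gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                          # 64 samples, 8 inputs
y = X @ rng.normal(size=(8, 1))                       # synthetic targets

W = rng.normal(size=(8, 1))                           # dense weights before pruning
mask = (np.abs(W) > np.abs(W).mean()).astype(float)   # illustrative pruning criterion
W *= mask                                             # pruning: zero deleted connections

lr = 0.01
for _ in range(200):                                  # global retraining / fine-tuning
    grad = X.T @ (X @ W - y) / len(X)                 # least-squares gradient
    W -= lr * grad                                    # update all parameters...
    W *= mask                                         # ...but keep pruned ones at zero

print("surviving weights:", int(mask.sum()), "of", mask.size)
```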
Referring to FIG. 1, in some embodiments, as in part b of FIG. 1, on the basis of the completed sensitivity pruning, a k-means++ clustering algorithm obtains the clustering centers of each layer's weights of the deep neural network, the obtained centers represent each layer's weights, and weight sharing limits the number of effective weights that need to be stored. The weights are then quantized on the basis of weight sharing, and finally the network is retrained and the weights are fine-tuned and updated, so that the accuracy of the network is not lost or is even slightly improved.
In some embodiments, sensitivity pruning is achieved in the following manner.
Sensitivity pruning
The pruning of this embodiment is divided into 3 steps, as shown in part a of FIG. 1. The sensitivity-based method rests on the following theory. Assume the network implements a nonlinear differentiable mapping Φ(x): R^I → R^K, where o (K×1) is the output vector o = Φ(x) = (o_1, o_2, ..., o_K) and x (I×1) is the input vector (x_1, x_2, ..., x_I). Let x^(n) ∈ Ω, where Ω is an open set. Since o is differentiable at x^(n), we have:
o(x^(n) + Δx) = o(x^(n)) + J(x^(n))Δx + g(Δx)    (1)
where J(x^(n)) is the Jacobian matrix and the remainder g(Δx) satisfies g(Δx)/‖Δx‖ → 0 as ‖Δx‖ → 0, so that equation (1) becomes:
o(x^(n) + Δx) - o(x^(n)) ≈ J(x^(n))Δx    (2)
The sensitivity of output o_k to input x_i is:
s_k,i = ∂o_k/∂x_i    (3)
To account for the sensitivity of all outputs to a single input i, the components of expression (3) are aggregated over the K outputs:
s_i = (Σ_{k=1..K} s_k,i²)^(1/2)    (4)
The sensitivity given in equation (4) is calculated using the back-propagation algorithm; for a network with one hidden layer of J nodes it can be expressed as:
s_k,i = o_k' · Σ_{j=1..J} w_kj · y_j' · v_ji    (5)
where y_j is the output of the jth hidden node, y_j' its derivative with respect to the node input, and o_k' is the partial derivative of the activation function o = f(net) at the kth output node.
Thus, the matrix S (K×I) composed from equation (5) can be expressed as:
S = O' × W × Y' × V    (6)
where W (K×J) is the weight matrix of the output layer, V (J×I) is the weight matrix of the input layer, and O' and Y' are the diagonal matrices of activation derivatives:
O' = diag(o_1', ..., o_K'),  Y' = diag(y_1', ..., y_J')    (7)
From the computed S, the sensitivity of the I inputs to the entire network can be expressed as the vector:
S = (s_1, s_2, ..., s_I)    (8)
After the sensitivity of each input has been calculated over the P training patterns, the average of the corresponding components over all P patterns is taken as the pruning criterion. The average sensitivity used as the criterion is:
s̄_i = (1/P) Σ_{p=1..P} s_i^(p)    (9)
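As a numerical illustration of equations (5) to (9), the following numpy sketch builds the per-pattern sensitivity matrix S = O' × W × Y' × V for a small one-hidden-layer sigmoid network and averages over P random patterns. The layer sizes, the sigmoid activation, the random data, and the root-sum-square aggregation of equation (4) are assumptions made for the example only.

```python
import numpy as np

# Sketch of equations (5)-(9): sensitivity matrix S = O' x W x Y' x V per
# pattern, averaged over P patterns as the pruning criterion.
rng = np.random.default_rng(1)
I, J, K, P = 6, 10, 3, 100                    # inputs, hidden nodes, outputs, patterns

V = rng.normal(size=(J, I))                   # input-layer weight matrix V (J x I)
W = rng.normal(size=(K, J))                   # output-layer weight matrix W (K x J)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

sens = np.zeros((P, I))
for p in range(P):
    x = rng.normal(size=I)                    # one input pattern
    y = sigmoid(V @ x)                        # hidden outputs y_j
    o = sigmoid(W @ y)                        # network outputs o_k
    Yd = np.diag(y * (1.0 - y))               # Y' = diag(y_j'), sigmoid derivative
    Od = np.diag(o * (1.0 - o))               # O' = diag(o_k')
    S = Od @ W @ Yd @ V                       # equation (6): K x I sensitivity matrix
    sens[p] = np.sqrt((S ** 2).sum(axis=0))   # equation (4): aggregate over outputs

s_bar = sens.mean(axis=0)                     # equation (9): average over P patterns
prune = s_bar < s_bar.mean()                  # S130: inputs below average sensitivity
print("input nodes to prune:", np.where(prune)[0])
```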
after pruning based on sensitivity is completed, the network is changed from a dense network to a sparsely connected network. As shown in fig. 2. After pruning, the weights of the remaining connections after pruning are retained, and then retraining is performed, each iteration being a greedy search process. Through multiple iterations, we can find the minimum number of connections, and make the network keep the original precision or slightly improve.
After pruning the network changes from dense to sparse, and its remaining weights form a typical sparse matrix. To save storage, usually only the non-zero elements and their positions need to be stored. Two storage formats are common: Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC).
In some embodiments, referring to FIG. 3, the sparse network structure obtained after pruning is stored in the row-wise CSR format.
CSR compresses the row information: for each row, only the position of its first non-zero element is recorded. Three arrays are required: values, column indices, and row offsets. The values array stores the non-zero elements of the sparse matrix, visited by row traversal from top to bottom and left to right. The column-index array stores the column of each non-zero element; its length equals the number of non-zero elements. The row-offset array stores, for each row, the index in the values array of that row's first non-zero element, with the total number of non-zero elements appended at the end. CSC is the column-wise counterpart and likewise needs values, row indices and column offsets. Either row or column storage of a sparse matrix requires 2a + n + 1 values in total, where a is the number of non-zero elements and n is the number of rows (or columns). Taking the CSR format of FIG. 3 as an example, 2 × 8 + 4 + 1 = 21 values are stored, as in the sketch below.
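A minimal sketch of this CSR layout follows. The 4×4 matrix is illustrative, chosen so that, as in the FIG. 3 example, it has n = 4 rows and a = 8 non-zero entries and therefore stores 2a + n + 1 = 21 values.

```python
import numpy as np

# Row-wise CSR storage of a pruned (sparse) weight matrix: values, column
# indices and row offsets; 2a + n + 1 values in total.
M = np.array([[1, 0, 0, 2],
              [0, 3, 4, 0],
              [5, 0, 6, 0],
              [0, 7, 0, 8]], dtype=np.float32)

values, col_idx, row_ptr = [], [], [0]
for row in M:                                  # top-to-bottom, left-to-right traversal
    for j, v in enumerate(row):
        if v != 0:
            values.append(float(v))            # non-zero element
            col_idx.append(j)                  # its column index
    row_ptr.append(len(values))                # offset of next row's first non-zero

print(values)                                  # 8 non-zero values
print(col_idx)                                 # 8 column indices
print(row_ptr)                                 # n + 1 = 5 offsets, ending in the total
print("stored values:", len(values) + len(col_idx) + len(row_ptr))   # 2*8 + 4 + 1 = 21
```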
In some embodiments, network quantization is implemented in the following manner.
Network quantization
The main idea of weight sharing is that multiple connections share one and the same weight, limiting the number of valid weights that need to be stored. For clustering data objects, Han et al. distinguish five major families of clustering algorithms: partitioning-based, hierarchical, density-based, grid-based and model-based.
The k-means clustering algorithm is a bottom-up clustering algorithm. It is simple, but its shortcomings are equally evident. First, the number of clusters k must be set in advance, and in practical applications it is hard to choose a k consistent with the actual number of clusters, which can bias the final result. Second, the initial cluster centers of k-means must be chosen manually, which introduces great uncertainty, since different initial centers may fail to reach the expected clustering effect. Finally, a few extreme outliers can strongly influence the final clustering result.
Arthur et al. proposed the k-means++ clustering algorithm, which remedies the need of k-means to determine the initial cluster centers manually, attains a result closer to the global optimum, and improves clustering accuracy over k-means. A k-means++ algorithm identifies the weights of each layer of the trained network, and identical weights fall into the same cluster, realizing weight sharing. The selection principle of k-means++ here is that the initial cluster centers should be as far apart from each other as possible; the centers are selected as follows:
(1) randomly selecting a point from the input weight data as a clustering center;
(2) calculating the distance D (x) between each point x in the weight data and the nearest cluster center (the cluster center selected in the previous step);
(3) selecting the next clustering center based on the probability p, where a point with larger D(x) is more likely to be chosen as the new cluster center; p is calculated as:
p = D(x)² / Σ_{x∈X} D(x)²    (10)
(4) repeating the steps 2 and 3 until k clustering centers are selected;
(5) the standard k-means algorithm is performed with the k centers selected.
The key of step 3 is as follows: the distance D(x) between each point and the nearest already-selected center (seed point) is calculated and stored in an array, and these distances are summed to obtain Sum(D(x)). A random value C_i within Sum(D(x)) is then computed as:
C_i = Sum(D(x)) × λ    (11)
where λ is a random number between 0 and 1; the point whose D(x) interval contains C_i is selected as the new cluster center.
As shown in FIG. 4, the random value C_i falls with high probability within the interval belonging to D(x) = 15, so the corresponding point has a high probability of being selected as the new cluster center. FIG. 5 shows a concrete example of initial-point selection. As the left panel of FIG. 5 shows, the sample points form 3 clusters in total. If point No. 6 with coordinates (1,2) is selected as the first initial cluster center in the first step, then the D(x)² of each sample in the second step, together with the probability of being selected as the second cluster center, is shown in FIG. 6.
From the second-cluster-center selection table in FIG. 6, the probability interval of the next center falling on points 1 to 4 is [0, 0.4738] (on point 1 it is [0, 0.1053], on point 2 it is [0.1053, 0.2764]), on points 5 to 8 it is [0.4738, 0.5265], and on points 9 to 12 it is [0.5265, 1]. That is, the combined probability of the first 4 and the last 4 points is almost 1, because the cluster {5, 6, 7, 8} already contains the first center, point No. 6. A number λ between 0 and 1 is then randomly generated and the next cluster center is determined, as in the sketch below.
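The following numpy sketch implements the center selection of steps (1) to (5) using the roulette rule of equation (11). The one-dimensional weight data, the absolute-difference distance, and the helper name kmeans_pp_centers are illustrative assumptions for this example.

```python
import numpy as np

# k-means++ center selection with the roulette rule C_i = Sum(D(x)) * lambda:
# a point is chosen when the running sum of D(x) first reaches C_i, so points
# with larger D(x) own larger intervals and are more likely to be selected.
def kmeans_pp_centers(weights, k, rng):
    centers = [rng.choice(weights)]            # step (1): one random point
    while len(centers) < k:                    # step (4): repeat until k centers
        d = np.min(np.abs(weights[:, None] - np.array(centers)[None, :]),
                   axis=1)                     # step (2): D(x) to nearest center
        c = d.sum() * rng.random()             # eq. (11): C_i = Sum(D(x)) * lambda
        idx = np.searchsorted(np.cumsum(d), c) # interval containing C_i
        centers.append(weights[idx])           # step (3): new cluster center
    return np.array(centers)

rng = np.random.default_rng(2)
w = rng.normal(size=1000)                      # flattened layer weights (illustrative)
print(kmeans_pp_centers(w, k=4, rng=rng))      # step (5) would run k-means from here
```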
After the weights of each layer of the neural network are clustered by k-means++, the original weights are represented by the obtained cluster centers, realizing weight sharing: the many connections of a layer share only the same few weights. When k is far smaller than the number of weights, the number of distinct weights shrinks and compression is achieved.
After k-means++ clustering of the convolutional and fully-connected layers, the space occupied by each weight is further reduced by quantizing each 32-bit floating-point weight at some cost in precision, compressing the network again. The number of valid weights to store is limited because multiple connections share the same weight. After quantization, the cluster centers are stored in a codebook, and each weight is represented not by its previous 32 bits but only by a small index into the weight-sharing table, as in the sketch below.
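As a sketch of the codebook idea, the following numpy code clusters one layer's weights into k shared values, stores the k centers once as a float32 codebook, and keeps only a small integer index per weight. The choice k = 4, the random 256-weight layer, and the plain Lloyd refinement (seeding would use k-means++ as above) are assumptions for the example.

```python
import numpy as np

# Weight sharing + quantization for one layer: k shared weights in a codebook,
# one log2(k)-bit index per connection instead of a 32-bit float.
rng = np.random.default_rng(3)
w = rng.normal(size=256).astype(np.float32)    # one layer's surviving weights

k = 4
centers = np.sort(rng.choice(w, size=k, replace=False))
for _ in range(20):                            # plain Lloyd (k-means) refinement
    assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
    centers = np.array([w[assign == c].mean() if np.any(assign == c)
                        else centers[c] for c in range(k)])

codebook = centers.astype(np.float32)          # k shared weights, stored once
indices = np.argmin(np.abs(w[:, None] - codebook[None, :]),
                    axis=1).astype(np.uint8)   # log2(4) = 2 bits of information each
w_quantized = codebook[indices]                # layer reconstruction at inference time
print("codebook:", codebook)
print("max abs quantization error:", float(np.abs(w - w_quantized).max()))
```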
Assuming the clustering algorithm produces k clusters, log2(k) bits are required to encode each index. In general, for a network with N connections, each originally represented by b bits, sharing only k weights in total, the compression ratio r is:
r = N × b / (N × log2(k) + k × b)    (12)
Assume there are 16 weights of 32 bits each, clustered into 4 classes by the clustering algorithm; the weights are then represented by 16 two-bit indexes. Storing the 4 effective 32-bit weights and the 16 two-bit indexes gives the compression ratio:
r = (16 × 32) / (16 × 2 + 4 × 32) = 512 / 160 = 3.2
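A one-line check of formula (12) against the worked example above (the helper name compression_ratio is illustrative):

```python
import math

# r = N*b / (N*log2(k) + k*b): N connections of b bits, k shared weights.
def compression_ratio(N, b, k):
    return (N * b) / (N * math.log2(k) + k * b)

print(compression_ratio(N=16, b=32, k=4))      # 512 / 160 = 3.2, as computed above
```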
the embodiment of the invention summarizes the compression effect of each deep neural network from the aspects of a compression method, a model size, an error rate and the like, and specific parameters and performances of each network before and after compression are shown in fig. 7.
As can be seen from the summary of compression effects on different networks shown in FIG. 7, the deep neural networks achieve a compression ratio of 30 to 40 times without substantial precision loss using the proposed compression method. The considerable compression comes from the three steps of network pruning, K-Means++-based weight sharing and weight quantization. Although there is still room to improve the compression rate, the compressed model is already small enough to make deployment of deep neural networks on the mobile side possible.
Referring to FIG. 8, compared with existing network compression methods, the method of the embodiment compresses not only the fully-connected layers of the deep neural network but also the convolutional layers. Notably, after compression by the method of the embodiment, accuracy is essentially not lost and is even improved to a certain extent, outperforming the prior art (as shown in FIG. 8), which benefits from the improved clustering method. Furthermore, the method quantizes the convolutional and fully-connected layers to the same code length, avoiding the redundancy caused by inconsistent coding lengths; no Huffman coding stage is needed, making the compression method of the embodiment simpler and more effective.
It should be recognized that the method steps in embodiments of the present invention may be embodied or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The method may use standard programming techniques. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of Android computing platform operatively connected to suitable hardware. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, an optically readable and/or writable storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer; when the storage medium or device is read by the computer, the computer is configured and operated to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention may also include the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A method for compressing a neural network based on sensitivity pruning and quantization is characterized by comprising the following steps:
s100, inputting a neural network, pruning the neural network based on sensitivity, and storing the pruned neural network;
s200, training the pruned neural network, and then returning to S100 until the precision change of the pruned neural network is within a set range;
s300, calculating a clustering center of each layer of weight of the pruned neural network based on a K-means + + algorithm, and representing each layer of weight of the pruned neural network by using the obtained clustering center;
S400, quantizing the pruned neural network and outputting the quantized neural network.
2. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S100 comprises:
s110, calculating the sensitivity of each input node to the whole neural network;
s120, calculating the average value of each corresponding component of the sensitivity of each input node as the average sensitivity;
and S130, deleting the input nodes whose component values fall below the average sensitivity.
3. The method for sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S100 further comprises:
and S140, storing the weights of the pruned neural network in the row-wise compressed sparse row (CSR) format.
4. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S200 comprises:
and adopting a global training mode: taking the pruned parameters as the initial parameters of the next training round, and updating and fine-tuning all the parameters.
5. The method for sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S300 comprises:
s310, randomly selecting a point from the input weight data as a clustering center;
s320, calculating the distance D (x) between each point in the weight data and the nearest clustering center;
S330, selecting the next clustering center based on the probability p, wherein the probability p is calculated as:
p = D(x)² / Σ_{x∈X} D(x)²
s340, returning to the step S320 until the number of the clustering centers reaches a threshold value;
and S350, executing a K-means algorithm based on the clustering center.
6. The method of claim 5, wherein step S330 comprises:
randomly selecting a plurality of points from the weight matrix as seed points, calculating the distance D(x) between each point and the nearest seed point, storing the distances in an array, and summing them to obtain Sum(D(x)); a random value C_i within Sum(D(x)) is computed as:
C_i = Sum(D(x)) × λ
where λ takes a random number from 0 to 1.
7. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S400 comprises:
quantizing each 32-bit floating-point weight, sacrificing some precision.
8. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S400 comprises:
representing the original weight of each layer by using the obtained clustering center, wherein a plurality of connection weights of each layer share the same weight;
each weight is stored as an index into the weight-sharing table.
9. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S400 comprises:
the cluster centers are stored in a codebook.
10. A computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
CN202010684270.4A 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization Pending CN112016672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010684270.4A CN112016672A (en) 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010684270.4A CN112016672A (en) 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization

Publications (1)

Publication Number Publication Date
CN112016672A true CN112016672A (en) 2020-12-01

Family

ID=73499445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010684270.4A Pending CN112016672A (en) 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization

Country Status (1)

Country Link
CN (1) CN112016672A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304928A (en) * 2018-01-26 2018-07-20 西安理工大学 Compression method based on the deep neural network for improving cluster
CN110210618A (en) * 2019-05-22 2019-09-06 东南大学 The compression method that dynamic trimming deep neural network weight and weight are shared
CN110443359A (en) * 2019-07-03 2019-11-12 中国石油大学(华东) Neural network compression algorithm based on adaptive combined beta pruning-quantization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304928A (en) * 2018-01-26 2018-07-20 西安理工大学 Compression method based on the deep neural network for improving cluster
CN110210618A (en) * 2019-05-22 2019-09-06 东南大学 The compression method that dynamic trimming deep neural network weight and weight are shared
CN110443359A (en) * 2019-07-03 2019-11-12 中国石油大学(华东) Neural network compression algorithm based on adaptive combined beta pruning-quantization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Yu: "Research on the Implementation of Deep Neural Network Compression Based on the Sensitivity Pruning Method", China Master's Theses Full-text Database, Information Science and Technology, no. 08, pages 3 *

Similar Documents

Publication Publication Date Title
CN111488986B (en) Model compression method, image processing method and device
WO2020048389A1 (en) Method for compressing neural network model, device, and computer apparatus
CN109635935B (en) Model adaptive quantization method of deep convolutional neural network based on modular length clustering
CN110728361B (en) Deep neural network compression method based on reinforcement learning
CN111079899A (en) Neural network model compression method, system, device and medium
CN108304928A (en) Compression method based on the deep neural network for improving cluster
CN111105029A (en) Neural network generation method and device and electronic equipment
CN110263917B (en) Neural network compression method and device
CN115861767A (en) Neural network joint quantization method for image classification
US11922018B2 (en) Storage system and storage control method including dimension setting information representing attribute for each of data dimensions of multidimensional dataset
KR102454420B1 (en) Method and apparatus processing weight of artificial neural network for super resolution
CN115544029A (en) Data processing method and related device
CN109299780A (en) Neural network model compression method, device and computer equipment
CN112016672A (en) Method and medium for neural network compression based on sensitivity pruning and quantization
US20220076122A1 (en) Arithmetic apparatus and arithmetic method
CN112001495B (en) Neural network optimization method, system, device and readable storage medium
CN113962295A (en) Weapon equipment system efficiency evaluation method, system and device
CN111722594B (en) Industrial process monitoring method, device, equipment and readable storage medium
CN112307230B (en) Data storage method, data acquisition method and device
CN109212960B (en) Weight sensitivity-based binary neural network hardware compression method
CN113177627A (en) Optimization system, retraining system, and method thereof, and processor and readable medium
CN115577618B (en) Construction method and prediction method of high-pressure converter valve hall environmental factor prediction model
CN112396178B (en) Method for improving compression efficiency of CNN (compressed network)
Salehifar et al. On optimal coding of hidden markov sources
CN115934661B (en) Method and device for compressing graphic neural network, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination