CN112016672A - Method and medium for neural network compression based on sensitivity pruning and quantization - Google Patents


Info

Publication number
CN112016672A
Authority
CN
China
Prior art keywords
neural network
sensitivity
pruning
weight
quantization
Prior art date
Legal status
Pending
Application number
CN202010684270.4A
Other languages
Chinese (zh)
Inventor
颜军
许怡冰
龚永红
赵宁波
陈绍波
黄腾杰
Current Assignee
Zhuhai Orbita Aerospace Technology Co ltd
Original Assignee
Zhuhai Orbita Aerospace Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Orbita Aerospace Technology Co ltd filed Critical Zhuhai Orbita Aerospace Technology Co ltd
Priority to CN202010684270.4A priority Critical patent/CN112016672A/en
Publication of CN112016672A publication Critical patent/CN112016672A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering


Abstract

The invention provides a method for compressing a neural network based on sensitivity pruning and quantization, which comprises the following steps: pruning the neural network based on sensitivity; calculating the clustering centers of each layer of the neural network with the K-Means++ algorithm and representing each layer's weights by the clustering centers; and quantizing the network. Through the three steps of network pruning, K-Means++-based weight sharing and weight quantization, the method achieves considerable compression of deep neural networks. Although there is still room to improve the compression rate, the compressed model is already small enough to make deployment of deep neural networks on the mobile side possible.

Description

Method and medium for neural network compression based on sensitivity pruning and quantization
Technical Field
The invention belongs to the technical field of artificial-intelligence network model compression, and mainly relates to a neural network compression method and medium based on sensitivity pruning and quantization.
Background
Deep neural networks are becoming ubiquitous in applications ranging from computer vision to speech recognition and natural language processing. These large networks are very powerful, but their size consumes considerable storage, memory bandwidth and computational resources, demands that are increasingly hard to meet on embedded mobile devices.
Deep network compression methods can be roughly divided into two types. The first modifies a trained model, fine-tuning it while removing redundant parameters, so that the storage of the network is reduced without losing the original accuracy. The second proposes new convolution computation schemes that reduce the parameter count and thereby compress the model. The main methods fall broadly into the following categories:
(1) network pruning: the complexity and the over-fitting phenomenon of network parameters are reduced;
(2) network quantization: reducing the redundancy problem of the deep learning model;
(3) matrix decomposition: the redundancy of parameters in the model is reduced, and the waste of calculation and storage is avoided.
Based on this research landscape, pruning, quantization and matrix decomposition are the three main approaches to deep neural network compression. Current research mainly applies a series of compressions to the fully-connected layers of deep convolutional neural networks, but the compression rate is not high and the convolutional layers are not pruned. Moreover, two problems remain in pruning convolutional networks: the pruning process may delete the wrong weights to varying degrees, and the learning process is slow while the memory cost remains large.
Citation of documents:
[1] Han S, Mao H, Dally W J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding[J]. International Conference on Learning Representations, 2016, 56(4): 3-7.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a neural network compression method based on sensitivity pruning and quantization.
The invention also provides a computer readable storage medium for implementing the method.
According to an embodiment of the first aspect of the invention, a method for compressing a neural network based on sensitivity pruning and quantization comprises the following steps:
S100, inputting a neural network, pruning the neural network based on sensitivity, and storing the pruned neural network; S200, training the pruned neural network, and then returning to S100 until the accuracy change of the pruned neural network is within a set range; S300, calculating the clustering centers of each layer's weights of the pruned neural network based on the K-means++ algorithm, and representing each layer's weights by the obtained clustering centers; S400, quantizing the pruned neural network and outputting the quantized neural network.
According to some embodiments of the invention, the S100 comprises: S110, calculating the sensitivity of each input node to the whole neural network; S120, calculating the average value of the corresponding components of the sensitivities of the input nodes as the average sensitivity; and S130, deleting the input nodes whose component values fall below the average sensitivity.
According to some embodiments of the invention, the S100 further comprises: S140, storing the weights of the pruned neural network in the row-wise compressed sparse row (CSR) format.
According to some embodiments of the invention, the S200 comprises: adopting a global training mode, taking the pruned parameters as the initial parameters of the next training round, and updating and fine-tuning all the parameters.
According to some embodiments of the invention, the S300 comprises: S310, randomly selecting a point from the input weight data as a clustering center; S320, calculating the distance D(x) between each point in the weight data and the nearest clustering center; S330, selecting the next clustering center based on the probability p, wherein the probability p is calculated as:
p = D(x)² / Σ_{x∈X} D(x)²
where the sum runs over all points x in the weight data X.
s340, returning to the step S320 until the number of the clustering centers reaches a threshold value; and S350, executing a K-means algorithm based on the clustering center.
According to some embodiments of the invention, the S330 comprises: randomly selecting a plurality of points from the weight matrix as seed points, calculating the distance D(x) between each point and the nearest seed point, storing the distances in an array, and summing them to obtain Sum(D(x)); a random value C_i within Sum(D(x)) is then computed as C_i = Sum(D(x)) × λ, where λ is a random number between 0 and 1.
According to some embodiments of the invention, the S400 comprises: quantizing each 32-bit floating-point weight, sacrificing some precision to reduce storage.
According to some embodiments of the invention, the S400 comprises: representing the original weights of each layer by the obtained clustering centers, so that a plurality of connections in each layer share the same weight; each weight is stored as an index into the weight-sharing table.
According to some embodiments of the invention, the S400 comprises: the cluster centers are stored in a codebook.
A computer-readable storage medium according to an embodiment of the second aspect of the present invention has stored thereon program instructions for executing the method for compressing a neural network based on sensitivity pruning and quantization according to any one of the embodiments of the first aspect.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a method for compressing a neural network based on sensitivity pruning and quantization, aiming at the serious redundancy existing in a deep neural network, a network pruning strategy is used for changing an original deep neural network into a sparse network. This procedure greatly reduces the complexity of the network and reduces overfitting. The network pruning strategy is simple and effective, and the preliminary compression is realized on the deep neural network.
The invention clusters the weights of each layer of the network with the K-Means++ method to realize weight sharing. Plain K-Means selects its initial points randomly, so the final clustering may deviate greatly from the actual data distribution; the K-Means++ algorithm avoids this problem and selects the initial points effectively.
The invention prunes the original network into a sparse network, realizing a preliminary compression; clusters the weights of each layer with the K-Means++ method to realize weight sharing; and quantizes the weights after sharing to realize the final compression. Through the three steps of network pruning, K-Means++-based clustering and weight quantization, an effective compression of 30 to 40 times is achieved on the whole deep neural network with essentially no loss of precision. Although there is still room to improve the compression rate, the compressed model is already small enough to make deployment of deep neural networks on the mobile side possible.
Compared with existing network compression methods, the method compresses not only the fully-connected layers of the deep neural network but also the convolutional layers. Notably, after compression by the method of the present invention, accuracy is essentially not lost and is even improved to a certain extent, outperforming the prior art, which benefits from the improved clustering method. Furthermore, the method quantizes the convolutional and fully-connected layers to the same code length, avoiding the redundancy caused by inconsistent coding lengths; no Huffman coding stage is needed, making the compression method simpler and more effective.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a comparison of the network before (dense) and after (sparse) pruning according to an embodiment of the present invention;
FIG. 3 is a CSR storage format example of an embodiment of the invention;
FIG. 4 is a schematic diagram of k-means++ cluster center selection according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an example of k-means++ initial cluster center selection in accordance with an embodiment of the present invention;
FIG. 6 is a second cluster center selection calculation table according to an embodiment of the present invention;
FIG. 7 is a table summarizing different network compression effects according to embodiments of the present invention;
FIG. 8 is a bar chart of Top-1error for a deep network under different compression schemes according to an embodiment of the present invention.
Detailed Description
The conception, specific structure and technical effects of the present invention are described clearly and completely below in conjunction with the embodiments and the accompanying drawings, so that the objects, schemes and effects of the present invention can be fully understood.
In the development of deep neural network compression, network pruning reduces the complexity of the network by reducing the number of parameters and is a very effective technique for preventing overfitting. Pruning, quantization, SVD decomposition of weight matrices and modifications of the network structure all contribute to deep compression, but their compression ratios remain low and far from satisfactory. The embodiment of the invention uses a pruning method based on sensitivity analysis to change the structure of the deep network from dense to sparse, and then quantizes the weights to realize further compression.
The embodiment of the invention provides a method for compressing a neural network based on sensitivity pruning and quantization, which comprises the following steps: S100, inputting a neural network, pruning the neural network based on sensitivity, and storing the pruned neural network; S200, training the pruned neural network, and then returning to S100 until the accuracy change of the pruned neural network is within a set range; S300, calculating the clustering centers of each layer's weights of the pruned neural network with the K-means++ algorithm, and representing each layer's weights by the obtained clustering centers; S400, quantizing the pruned neural network and outputting the quantized neural network. Here, the set range for the accuracy change means that accuracy remains unchanged or is slightly improved.
In this embodiment, to address the severe redundancy in the deep neural network, the original network is turned into a sparse network by a network pruning strategy. This greatly reduces the complexity of the network and reduces overfitting; the strategy is simple and effective and achieves a preliminary compression. The invention then clusters the weights of each layer with the K-Means++ method to realize weight sharing: plain K-Means selects its initial points randomly, so the final clustering may deviate greatly from the actual data distribution, whereas K-Means++ avoids this problem and selects initial points effectively. Weight quantization after weight sharing realizes the final compression. Through the three steps of network pruning, K-Means++-based clustering and weight quantization, an effective compression of 30 to 40 times is achieved on the whole deep neural network.
Referring to FIG. 1, in some embodiments, sensitivity-based pruning is implemented as in part a of FIG. 1. First, normal network training learns which connections are important. Second, the input nodes whose component values fall below the average sensitivity are deleted. Third, the network is retrained to fine-tune the remaining connections. Retraining has two modes, sparse training and global training; this embodiment adopts the global mode, i.e., the pruned parameters serve as the initial parameters of the next round of training and all parameters are updated and fine-tuned (see the sketch after this paragraph). The network is then sensitivity-pruned again, and the process repeats until the accuracy of the pruned network remains unchanged or slightly improves.
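To make the global training mode concrete, here is a minimal numpy sketch: a binary pruning mask keeps deleted connections at zero while gradient descent fine-tunes all surviving parameters. The toy least-squares layer, the random data, and the magnitude-based mask (a simple stand-in for the sensitivity criterion of S110-S130) are illustrative assumptions, not the patented procedure itself.

```python
import numpy as np

# Toy sketch of "global training": pruned weights stay at zero via a mask,
# while ALL surviving parameters are updated by gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                          # 64 samples, 8 inputs
y = X @ rng.normal(size=(8, 1))                       # synthetic targets

W = rng.normal(size=(8, 1))                           # dense weights before pruning
mask = (np.abs(W) > np.abs(W).mean()).astype(float)   # illustrative pruning criterion
W *= mask                                             # pruning: zero deleted connections

lr = 0.01
for _ in range(200):                                  # global retraining / fine-tuning
    grad = X.T @ (X @ W - y) / len(X)                 # least-squares gradient
    W -= lr * grad                                    # update all parameters...
    W *= mask                                         # ...but keep pruned ones at zero

print("surviving weights:", int(mask.sum()), "of", mask.size)
```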
Referring to FIG. 1, in some embodiments, as in part b of FIG. 1, on the basis of the completed sensitivity pruning, a k-means++ clustering algorithm obtains the clustering centers of each layer's weights of the deep neural network, the obtained centers represent each layer's weights, and weight sharing limits the number of effective weights that need to be stored. The weights are then quantized on the basis of weight sharing, and finally the network is retrained and the weights are fine-tuned and updated, so that the accuracy of the network is not lost or is even slightly improved.
In some embodiments, sensitivity pruning is achieved in the following manner.
Sensitivity pruning
The pruning of this embodiment is divided into 3 steps, as shown in part a of FIG. 1. The sensitivity-based method rests on the following theory. Assume the network implements a nonlinear differentiable mapping Φ(x): R^I → R^K, where o (K×1) is the output vector o = Φ(x) = (o_1, o_2, ..., o_K) and x (I×1) is the input vector (x_1, x_2, ..., x_I). Let x^(n) ∈ Ω, where Ω is an open set. Since o is differentiable at x^(n), we have:
o(x^(n) + Δx) = o(x^(n)) + J(x^(n))Δx + g(Δx)    (1)
where J(x^(n)) is the Jacobian matrix and the remainder g(Δx) satisfies g(Δx)/‖Δx‖ → 0 as ‖Δx‖ → 0, so that equation (1) becomes:
o(x^(n) + Δx) - o(x^(n)) ≈ J(x^(n))Δx    (2)
The sensitivity of output o_k to input x_i is:
s_k,i = ∂o_k/∂x_i    (3)
To account for the sensitivity of all outputs to a single input i, the components of expression (3) are aggregated over the K outputs:
s_i = (Σ_{k=1..K} s_k,i²)^(1/2)    (4)
The sensitivity given in equation (4) is calculated using the back-propagation algorithm; for a network with one hidden layer of J nodes it can be expressed as:
s_k,i = o_k' · Σ_{j=1..J} w_kj · y_j' · v_ji    (5)
where y_j is the output of the jth hidden node, y_j' its derivative with respect to the node input, and o_k' is the partial derivative of the activation function o = f(net) at the kth output node.
Thus, the matrix S (K×I) composed from equation (5) can be expressed as:
S = O' × W × Y' × V    (6)
where W (K×J) is the weight matrix of the output layer, V (J×I) is the weight matrix of the input layer, and O' and Y' are the diagonal matrices of activation derivatives:
O' = diag(o_1', ..., o_K'),  Y' = diag(y_1', ..., y_J')    (7)
From the computed S, the sensitivity of the I inputs to the entire network can be expressed as the vector:
S = (s_1, s_2, ..., s_I)    (8)
After the sensitivity of each input has been calculated over the P training patterns, the average of the corresponding components over all P patterns is taken as the pruning criterion. The average sensitivity used as the criterion is:
s̄_i = (1/P) Σ_{p=1..P} s_i^(p)    (9)
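As a numerical illustration of equations (5) to (9), the following numpy sketch builds the per-pattern sensitivity matrix S = O' × W × Y' × V for a small one-hidden-layer sigmoid network and averages over P random patterns. The layer sizes, the sigmoid activation, the random data, and the root-sum-square aggregation of equation (4) are assumptions made for the example only.

```python
import numpy as np

# Sketch of equations (5)-(9): sensitivity matrix S = O' x W x Y' x V per
# pattern, averaged over P patterns as the pruning criterion.
rng = np.random.default_rng(1)
I, J, K, P = 6, 10, 3, 100                    # inputs, hidden nodes, outputs, patterns

V = rng.normal(size=(J, I))                   # input-layer weight matrix V (J x I)
W = rng.normal(size=(K, J))                   # output-layer weight matrix W (K x J)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

sens = np.zeros((P, I))
for p in range(P):
    x = rng.normal(size=I)                    # one input pattern
    y = sigmoid(V @ x)                        # hidden outputs y_j
    o = sigmoid(W @ y)                        # network outputs o_k
    Yd = np.diag(y * (1.0 - y))               # Y' = diag(y_j'), sigmoid derivative
    Od = np.diag(o * (1.0 - o))               # O' = diag(o_k')
    S = Od @ W @ Yd @ V                       # equation (6): K x I sensitivity matrix
    sens[p] = np.sqrt((S ** 2).sum(axis=0))   # equation (4): aggregate over outputs

s_bar = sens.mean(axis=0)                     # equation (9): average over P patterns
prune = s_bar < s_bar.mean()                  # S130: inputs below average sensitivity
print("input nodes to prune:", np.where(prune)[0])
```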
after pruning based on sensitivity is completed, the network is changed from a dense network to a sparsely connected network. As shown in fig. 2. After pruning, the weights of the remaining connections after pruning are retained, and then retraining is performed, each iteration being a greedy search process. Through multiple iterations, we can find the minimum number of connections, and make the network keep the original precision or slightly improve.
After pruning the network changes from dense to sparse, and its remaining weights form a typical sparse matrix. To save storage, usually only the non-zero elements and their positions need to be stored. Two storage formats are common: Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC).
In some embodiments, referring to FIG. 3, the sparse network structure obtained after pruning is stored in the row-wise CSR format.
CSR compresses the row information: for each row, only the position of its first non-zero element is recorded. Three arrays are required: values, column indices, and row offsets. The values array stores the non-zero elements of the sparse matrix, visited by row traversal from top to bottom and left to right. The column-index array stores the column of each non-zero element; its length equals the number of non-zero elements. The row-offset array stores, for each row, the index in the values array of that row's first non-zero element, with the total number of non-zero elements appended at the end. CSC is the column-wise counterpart and likewise needs values, row indices and column offsets. Either row or column storage of a sparse matrix requires 2a + n + 1 values in total, where a is the number of non-zero elements and n is the number of rows (or columns). Taking the CSR format of FIG. 3 as an example, 2 × 8 + 4 + 1 = 21 values are stored, as in the sketch below.
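A minimal sketch of this CSR layout follows. The 4×4 matrix is illustrative, chosen so that, as in the FIG. 3 example, it has n = 4 rows and a = 8 non-zero entries and therefore stores 2a + n + 1 = 21 values.

```python
import numpy as np

# Row-wise CSR storage of a pruned (sparse) weight matrix: values, column
# indices and row offsets; 2a + n + 1 values in total.
M = np.array([[1, 0, 0, 2],
              [0, 3, 4, 0],
              [5, 0, 6, 0],
              [0, 7, 0, 8]], dtype=np.float32)

values, col_idx, row_ptr = [], [], [0]
for row in M:                                  # top-to-bottom, left-to-right traversal
    for j, v in enumerate(row):
        if v != 0:
            values.append(float(v))            # non-zero element
            col_idx.append(j)                  # its column index
    row_ptr.append(len(values))                # offset of next row's first non-zero

print(values)                                  # 8 non-zero values
print(col_idx)                                 # 8 column indices
print(row_ptr)                                 # n + 1 = 5 offsets, ending in the total
print("stored values:", len(values) + len(col_idx) + len(row_ptr))   # 2*8 + 4 + 1 = 21
```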
In some embodiments, network quantization is implemented in the following manner.
Network quantization
The main idea of weight sharing is that multiple connections share one and the same weight, limiting the number of valid weights that need to be stored. For clustering data objects, Han et al. distinguish five major families of clustering algorithms: partitioning-based, hierarchical, density-based, grid-based and model-based.
The k-means clustering algorithm is a bottom-up clustering algorithm. It is simple, but its shortcomings are equally evident. First, the number of clusters k must be set in advance, and in practical applications it is hard to choose a k consistent with the actual number of clusters, which can bias the final result. Second, the initial cluster centers of k-means must be chosen manually, which introduces great uncertainty, since different initial centers may fail to reach the expected clustering effect. Finally, a few extreme outliers can strongly influence the final clustering result.
Arthur et al. proposed the k-means++ clustering algorithm, which remedies the need of k-means to determine the initial cluster centers manually, attains a result closer to the global optimum, and improves clustering accuracy over k-means. A k-means++ algorithm identifies the weights of each layer of the trained network, and identical weights fall into the same cluster, realizing weight sharing. The selection principle of k-means++ here is that the initial cluster centers should be as far apart from each other as possible; the centers are selected as follows:
(1) randomly selecting a point from the input weight data as a clustering center;
(2) calculating the distance D (x) between each point x in the weight data and the nearest cluster center (the cluster center selected in the previous step);
(3) selecting the next clustering center based on the probability p, where a point with larger D(x) is more likely to be chosen as the new cluster center; p is calculated as:
p = D(x)² / Σ_{x∈X} D(x)²    (10)
(4) repeating the steps 2 and 3 until k clustering centers are selected;
(5) the standard k-means algorithm is performed with the k centers selected.
The key of step 3 is as follows: the distance D(x) between each point and the nearest already-selected center (seed point) is calculated and stored in an array, and these distances are summed to obtain Sum(D(x)). A random value C_i within Sum(D(x)) is then computed as:
C_i = Sum(D(x)) × λ    (11)
where λ is a random number between 0 and 1; the point whose D(x) interval contains C_i is selected as the new cluster center.
As shown in FIG. 4, the random value C_i falls with high probability within the interval belonging to D(x) = 15, so the corresponding point has a high probability of being selected as the new cluster center. FIG. 5 shows a concrete example of initial-point selection. As the left panel of FIG. 5 shows, the sample points form 3 clusters in total. If point No. 6 with coordinates (1,2) is selected as the first initial cluster center in the first step, then the D(x)² of each sample in the second step, together with the probability of being selected as the second cluster center, is shown in FIG. 6.
From the second-cluster-center selection table in FIG. 6, the probability interval of the next center falling on points 1 to 4 is [0, 0.4738] (on point 1 it is [0, 0.1053], on point 2 it is [0.1053, 0.2764]), on points 5 to 8 it is [0.4738, 0.5265], and on points 9 to 12 it is [0.5265, 1]. That is, the combined probability of the first 4 and the last 4 points is almost 1, because the cluster {5, 6, 7, 8} already contains the first center, point No. 6. A number λ between 0 and 1 is then randomly generated and the next cluster center is determined, as in the sketch below.
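The following numpy sketch implements the center selection of steps (1) to (5) using the roulette rule of equation (11). The one-dimensional weight data, the absolute-difference distance, and the helper name kmeans_pp_centers are illustrative assumptions for this example.

```python
import numpy as np

# k-means++ center selection with the roulette rule C_i = Sum(D(x)) * lambda:
# a point is chosen when the running sum of D(x) first reaches C_i, so points
# with larger D(x) own larger intervals and are more likely to be selected.
def kmeans_pp_centers(weights, k, rng):
    centers = [rng.choice(weights)]            # step (1): one random point
    while len(centers) < k:                    # step (4): repeat until k centers
        d = np.min(np.abs(weights[:, None] - np.array(centers)[None, :]),
                   axis=1)                     # step (2): D(x) to nearest center
        c = d.sum() * rng.random()             # eq. (11): C_i = Sum(D(x)) * lambda
        idx = np.searchsorted(np.cumsum(d), c) # interval containing C_i
        centers.append(weights[idx])           # step (3): new cluster center
    return np.array(centers)

rng = np.random.default_rng(2)
w = rng.normal(size=1000)                      # flattened layer weights (illustrative)
print(kmeans_pp_centers(w, k=4, rng=rng))      # step (5) would run k-means from here
```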
After the weights of each layer of the neural network are clustered by k-means++, the original weights are represented by the obtained cluster centers, realizing weight sharing: the many connections of a layer share only the same few weights. When k is far smaller than the number of weights, the number of distinct weights shrinks and compression is achieved.
After k-means++ clustering of the convolutional and fully-connected layers, the space occupied by each weight is further reduced by quantizing each 32-bit floating-point weight at some cost in precision, compressing the network again. The number of valid weights to store is limited because multiple connections share the same weight. After quantization, the cluster centers are stored in a codebook, and each weight is represented not by its previous 32 bits but only by a small index into the weight-sharing table, as in the sketch below.
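As a sketch of the codebook idea, the following numpy code clusters one layer's weights into k shared values, stores the k centers once as a float32 codebook, and keeps only a small integer index per weight. The choice k = 4, the random 256-weight layer, and the plain Lloyd refinement (seeding would use k-means++ as above) are assumptions for the example.

```python
import numpy as np

# Weight sharing + quantization for one layer: k shared weights in a codebook,
# one log2(k)-bit index per connection instead of a 32-bit float.
rng = np.random.default_rng(3)
w = rng.normal(size=256).astype(np.float32)    # one layer's surviving weights

k = 4
centers = np.sort(rng.choice(w, size=k, replace=False))
for _ in range(20):                            # plain Lloyd (k-means) refinement
    assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
    centers = np.array([w[assign == c].mean() if np.any(assign == c)
                        else centers[c] for c in range(k)])

codebook = centers.astype(np.float32)          # k shared weights, stored once
indices = np.argmin(np.abs(w[:, None] - codebook[None, :]),
                    axis=1).astype(np.uint8)   # log2(4) = 2 bits of information each
w_quantized = codebook[indices]                # layer reconstruction at inference time
print("codebook:", codebook)
print("max abs quantization error:", float(np.abs(w - w_quantized).max()))
```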
Assuming the clustering algorithm produces k clusters, log2(k) bits are required to encode each index. In general, for a network with N connections, each originally represented by b bits, sharing only k weights in total, the compression ratio r is:
r = N × b / (N × log2(k) + k × b)    (12)
Assume there are 16 weights of 32 bits each, clustered into 4 classes by the clustering algorithm; the weights are then represented by 16 two-bit indexes. Storing the 4 effective 32-bit weights and the 16 two-bit indexes gives the compression ratio:
r = (16 × 32) / (16 × 2 + 4 × 32) = 512 / 160 = 3.2
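A one-line check of formula (12) against the worked example above (the helper name compression_ratio is illustrative):

```python
import math

# r = N*b / (N*log2(k) + k*b): N connections of b bits, k shared weights.
def compression_ratio(N, b, k):
    return (N * b) / (N * math.log2(k) + k * b)

print(compression_ratio(N=16, b=32, k=4))      # 512 / 160 = 3.2, as computed above
```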
the embodiment of the invention summarizes the compression effect of each deep neural network from the aspects of a compression method, a model size, an error rate and the like, and specific parameters and performances of each network before and after compression are shown in fig. 7.
As can be seen from the summary of compression effects on different networks shown in FIG. 7, the deep neural networks achieve a compression ratio of 30 to 40 times without substantial precision loss using the proposed compression method. The considerable compression comes from the three steps of network pruning, K-Means++-based weight sharing and weight quantization. Although there is still room to improve the compression rate, the compressed model is already small enough to make deployment of deep neural networks on the mobile side possible.
Referring to FIG. 8, compared with existing network compression methods, the method of the embodiment compresses not only the fully-connected layers of the deep neural network but also the convolutional layers. Notably, after compression by the method of the embodiment, accuracy is essentially not lost and is even improved to a certain extent, outperforming the prior art (as shown in FIG. 8), which benefits from the improved clustering method. Furthermore, the method quantizes the convolutional and fully-connected layers to the same code length, avoiding the redundancy caused by inconsistent coding lengths; no Huffman coding stage is needed, making the compression method of the embodiment simpler and more effective.
It should be recognized that the method steps in embodiments of the present invention may be embodied or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The method may use standard programming techniques. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of Android computing platform operatively connected to suitable hardware. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, an optically readable and/or writable storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer; when the storage medium or device is read by the computer, the computer is configured and operated to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention may also include the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A method for compressing a neural network based on sensitivity pruning and quantization is characterized by comprising the following steps:
s100, inputting a neural network, pruning the neural network based on sensitivity, and storing the pruned neural network;
s200, training the pruned neural network, and then returning to S100 until the precision change of the pruned neural network is within a set range;
s300, calculating a clustering center of each layer of weight of the pruned neural network based on a K-means + + algorithm, and representing each layer of weight of the pruned neural network by using the obtained clustering center;
S400, quantizing the pruned neural network and outputting the quantized neural network.
2. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S100 comprises:
s110, calculating the sensitivity of each input node to the whole neural network;
s120, calculating the average value of each corresponding component of the sensitivity of each input node as the average sensitivity;
and S130, deleting the input nodes whose component values fall below the average sensitivity.
3. The method for sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S100 further comprises:
and S140, storing the weights of the pruned neural network in the row-wise compressed sparse row (CSR) format.
4. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S200 comprises:
and adopting a global training mode: taking the pruned parameters as the initial parameters of the next training round, and updating and fine-tuning all the parameters.
5. The method for sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S300 comprises:
s310, randomly selecting a point from the input weight data as a clustering center;
s320, calculating the distance D (x) between each point in the weight data and the nearest clustering center;
S330, selecting the next clustering center based on the probability p, wherein the probability p is calculated as:
p = D(x)² / Σ_{x∈X} D(x)²
s340, returning to the step S320 until the number of the clustering centers reaches a threshold value;
and S350, executing a K-means algorithm based on the clustering center.
6. The method of claim 5, wherein step S330 comprises:
randomly selecting a plurality of points from the weight matrix as seed points, calculating the distance D(x) between each point and the nearest seed point, storing the distances in an array, and summing them to obtain Sum(D(x)); a random value C_i within Sum(D(x)) is computed as:
C_i = Sum(D(x)) × λ
where λ takes a random number from 0 to 1.
7. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S400 comprises:
quantizing each 32-bit floating-point weight, sacrificing some precision.
8. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S400 comprises:
representing the original weight of each layer by using the obtained clustering center, wherein a plurality of connection weights of each layer share the same weight;
each weight is stored as an index into the weight-sharing table.
9. The method of sensitivity-based pruning and quantization-based compression of neural networks according to claim 1, wherein the S400 comprises:
the cluster centers are stored in a codebook.
10. A computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
CN202010684270.4A 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization Pending CN112016672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010684270.4A CN112016672A (en) 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010684270.4A CN112016672A (en) 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization

Publications (1)

Publication Number Publication Date
CN112016672A true CN112016672A (en) 2020-12-01

Family

ID=73499445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010684270.4A Pending CN112016672A (en) 2020-07-16 2020-07-16 Method and medium for neural network compression based on sensitivity pruning and quantization

Country Status (1)

Country Link
CN (1) CN112016672A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304928A (en) * 2018-01-26 2018-07-20 西安理工大学 Compression method based on the deep neural network for improving cluster
CN110210618A (en) * 2019-05-22 2019-09-06 东南大学 The compression method that dynamic trimming deep neural network weight and weight are shared
CN110443359A (en) * 2019-07-03 2019-11-12 中国石油大学(华东) Neural network compression algorithm based on adaptive combined beta pruning-quantization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304928A (en) * 2018-01-26 2018-07-20 西安理工大学 Compression method based on the deep neural network for improving cluster
CN110210618A (en) * 2019-05-22 2019-09-06 东南大学 The compression method that dynamic trimming deep neural network weight and weight are shared
CN110443359A (en) * 2019-07-03 2019-11-12 中国石油大学(华东) Neural network compression algorithm based on adaptive combined beta pruning-quantization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Yu: "Research on the Implementation of Deep Neural Network Compression Based on the Sensitivity Pruning Method", China Master's Theses Full-text Database, Information Science and Technology, no. 08, pages 3 *

Similar Documents

Publication Publication Date Title
CN111488986B (en) Model compression method, image processing method and device
WO2020048389A1 (en) Method for compressing neural network model, device, and computer apparatus
CN109635935B (en) Model adaptive quantization method of deep convolutional neural network based on modular length clustering
CN110728361B (en) Deep neural network compression method based on reinforcement learning
CN111079899A (en) Neural network model compression method, system, device and medium
CN108304928A (en) Compression method based on the deep neural network for improving cluster
CN111105029A (en) Neural network generation method and device and electronic equipment
CN110263917B (en) Neural network compression method and device
CN115861767A (en) Neural network joint quantization method for image classification
US11922018B2 (en) Storage system and storage control method including dimension setting information representing attribute for each of data dimensions of multidimensional dataset
KR102454420B1 (en) Method and apparatus processing weight of artificial neural network for super resolution
CN115544029A (en) Data processing method and related device
CN109299780A (en) Neural network model compression method, device and computer equipment
CN112016672A (en) Method and medium for neural network compression based on sensitivity pruning and quantization
US20220076122A1 (en) Arithmetic apparatus and arithmetic method
CN112001495B (en) Neural network optimization method, system, device and readable storage medium
CN113962295A (en) Weapon equipment system efficiency evaluation method, system and device
CN111722594B (en) Industrial process monitoring method, device, equipment and readable storage medium
CN112307230B (en) Data storage method, data acquisition method and device
CN109212960B (en) Weight sensitivity-based binary neural network hardware compression method
CN113177627A (en) Optimization system, retraining system, and method thereof, and processor and readable medium
CN115577618B (en) Construction method and prediction method of high-pressure converter valve hall environmental factor prediction model
CN112396178B (en) Method for improving compression efficiency of CNN (compressed network)
Salehifar et al. On optimal coding of hidden markov sources
CN115934661B (en) Method and device for compressing graphic neural network, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination