CN117639792A - Deep learning model compression method based on code table clustering

Deep learning model compression method based on code table clustering

Info

Publication number
CN117639792A
Authority
CN
China
Prior art keywords
weight
code table
index
deep learning
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311590503.4A
Other languages
Chinese (zh)
Inventor
黄科杰
邓军灿
沈海斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311590503.4A priority Critical patent/CN117639792A/en
Publication of CN117639792A publication Critical patent/CN117639792A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a deep learning model compression method based on code table clustering, belonging to the field of model compression in deep learning. The method obtains a code table and indices for the model weights using a code table clustering algorithm and reconstructs the compressed weights from them. By exploiting the repetitiveness of the weight vectors of deep learning models, the method derives a code table and indices with low memory occupation, achieves an extremely high model compression rate, reduces the memory required to store the model, and maintains good model performance.

Description

Deep learning model compression method based on code table clustering
Technical Field
The invention relates to the field of model compression in deep learning, in particular to a deep learning model compression method based on code table clustering.
Background
Deep learning has made remarkable progress in the past few years and has become a core technology in fields such as computer vision, natural language processing, and speech recognition. However, deep learning models typically have a large number of parameters and complex structures, which leads to huge computational resource consumption and high memory usage. As deep learning applications continue to extend to resource-constrained mobile devices and edge computing devices, model compression techniques become particularly important. With the popularity of Internet of Things devices and intelligent mobile terminals and the rise of edge computing, the need to run complex deep learning models on low-power hardware with limited computing capability has increased dramatically. In these scenarios, a model must greatly reduce its computational and memory requirements while maintaining high performance. In addition, in data centers and cloud services, model compression can significantly reduce storage and transmission costs, reduce energy consumption, and improve the scalability and cost effectiveness of the system.
The current technical paths for deep learning model compression mainly comprise the following two types. Weight pruning: the storage requirement of the model is reduced by identifying and removing redundant neurons or connections in the neural network; weight pruning may be unstructured (weights removed at the level of individual parameters) or structured (weights removed at the level of layers or channels). Model quantization: by reducing the numerical precision of network weights and activations (e.g., from 32-bit floating-point numbers to fixed-point numbers with a lower bit width), model size and computational complexity can be significantly reduced. Both pruning and quantization are lossy compression methods, so at higher compression rates the prediction performance may degrade due to excessive information loss.
Disclosure of Invention
To address the severe loss of model performance that existing compression methods suffer at higher compression rates, the invention provides a deep learning model compression method based on code table clustering, in which the model weights are reconstructed from a code table and indices with low memory occupation so as to achieve equivalent or even better model inference performance.
The technical solution adopted to solve the above technical problem is as follows:
The invention first provides a deep learning model compression method based on code table clustering, which comprises the following steps:
step S1: extracting weights of a linear layer and a convolution layer in a deep learning model, and splitting the weights according to the direction of an input channel so as to obtain a series of weight vectors; the length of the weight vector obtained by segmentation is defined as V;
step S2: setting a code table for each weight respectively, wherein the size of the code table is K×V and the number of codewords in the code table is K; a code table clustering algorithm is used to cluster the weight vectors and obtain the finally updated code table, and in the clustering process each weight vector is assigned an index, namely the position in the code table of the codeword closest to that weight vector;
step S3: saving the code table and index which are clustered and other uncompressed data in the original deep learning model as a compressed model; when running the compressed model, for each weight, the index corresponding to its weight vector is used to retrieve the corresponding codeword, and the compressed weight with the same size as the original weight is reconstructed by using these codewords.
As a preferred scheme of the present invention, the weight size of the linear layer in step S1 is [output channel number, input channel number], and splitting along the input channel yields (output channel number × input channel number / V) weight vectors; the weight size of the convolution layer is [output channel number, input channel number, convolution kernel height, convolution kernel width], the convolution layer weight is first reshaped to the size [output channel number, input channel number × convolution kernel height × convolution kernel width], and splitting along the input channel then yields (output channel number × input channel number × convolution kernel height × convolution kernel width / V) weight vectors.
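For illustration, a minimal PyTorch sketch of this splitting step is given below; the function name and shapes are illustrative assumptions, and it assumes V divides the flattened input-channel dimension.

```python
import torch

def split_into_vectors(weight: torch.Tensor, V: int) -> torch.Tensor:
    """Split a linear or convolution weight into length-V vectors along
    the input-channel direction (step S1)."""
    if weight.dim() == 2:
        # linear layer: [out_channels, in_channels]
        out_ch, in_ch = weight.shape
        assert in_ch % V == 0, "V must divide the number of input channels"
        return weight.reshape(out_ch * in_ch // V, V)
    if weight.dim() == 4:
        # conv layer: [out_channels, in_channels, kernel_h, kernel_w]
        out_ch = weight.shape[0]
        flat = weight.reshape(out_ch, -1)          # [out_ch, in_ch * kh * kw]
        assert flat.shape[1] % V == 0, "V must divide in_ch * kh * kw"
        return flat.reshape(out_ch * flat.shape[1] // V, V)
    raise ValueError("expected a 2-D linear or 4-D convolution weight")

# e.g. a hypothetical 4096 x 4096 linear weight with V = 4
vectors = split_into_vectors(torch.randn(4096, 4096), V=4)   # -> shape [4194304, 4]
```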
As a preferred scheme of the invention, the weight vector clustering in the step S2 is performed by using a code table clustering algorithm, and the specific process is as follows:
(2.1) randomly selecting K weight vectors as initial code words of the code table;
(2.2) calculating the Euclidean distance from each weight vector to each codeword; finding, for each weight vector, the codeword with the shortest distance to it, and assigning the index of that codeword to the weight vector, wherein the index of a codeword is its position in the code table;
the calculated Euclidean distance formula is:
wherein W is m For the mth weight direction in the weightsAmount of C k For the kth codeword, d (W) m ,C k ) Is W m And C k Is used for the distance between euclidean distance(s),and->Respectively W m And C k Is the i-th value of (2);
(2.3) averaging all weight vectors assigned the index of the same codeword, and using the average as the updated value of the codeword corresponding to that index; the averaging formula is:

\hat{C}_k = \frac{1}{|W \in C_k|} \sum_{W_m \in C_k} W_m

wherein W \in C_k denotes the weight vectors assigned the index of the same codeword, |W \in C_k| is the number of such weight vectors, and \hat{C}_k is the updated value of the codeword corresponding to the index;
and (2.4) repeating the steps (2.2) - (2.3) until the code table and the index are not updated any more, and completing the code table clustering algorithm to obtain the finally updated code table.
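For illustration, the following is a minimal PyTorch sketch of steps (2.1)-(2.4), assuming the length-V weight vectors of one weight have already been stacked into a single tensor; the function name, chunk size, and iteration cap are illustrative choices rather than part of the invention.

```python
import torch

def codebook_cluster(vectors: torch.Tensor, K: int, max_iters: int = 100):
    """Code table clustering: k-means over length-V weight vectors with
    Euclidean distance. Returns the code table [K, V] and one index per vector."""
    N = vectors.shape[0]
    # (2.1) randomly pick K weight vectors as the initial codewords
    codebook = vectors[torch.randperm(N)[:K]].clone()
    indices = None

    for _ in range(max_iters):
        # (2.2) nearest codeword for every weight vector (chunked to bound memory)
        new_indices = torch.empty(N, dtype=torch.long, device=vectors.device)
        for s in range(0, N, 65536):
            chunk = vectors[s:s + 65536]
            new_indices[s:s + 65536] = torch.cdist(chunk, codebook).argmin(dim=1)

        # (2.4) stop once the assignment (and hence the code table) no longer changes
        if indices is not None and torch.equal(new_indices, indices):
            break
        indices = new_indices

        # (2.3) each codeword becomes the mean of the vectors assigned to it
        sums = torch.zeros_like(codebook)
        counts = torch.zeros(K, device=vectors.device, dtype=vectors.dtype)
        sums.index_add_(0, indices, vectors)
        counts.index_add_(0, indices,
                          torch.ones(N, device=vectors.device, dtype=vectors.dtype))
        nonempty = counts > 0
        codebook[nonempty] = sums[nonempty] / counts[nonempty].unsqueeze(1)

    return codebook, indices

# e.g. codebook, indices = codebook_cluster(vectors, K=32768)
```

With K = 32768 each index fits in 15 bits, which is where the saving over storing every length-V vector in full precision comes from.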
As a preferred embodiment of the present invention, one code table is used for each weight in step S2.
As a preferred solution of the present invention, in step S3, the codeword corresponding to the weight vector index is used to reconstruct the compressed weight with the same size as the original weight, where the formula is:
W' = C[I]

wherein W' is the compressed weight, C is the code table, and I is the index matrix corresponding to all the weight vectors.
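A short PyTorch sketch of this reconstruction (the function name and shape argument are illustrative):

```python
import torch

def reconstruct_weight(codebook: torch.Tensor, indices: torch.Tensor,
                       original_shape) -> torch.Tensor:
    """Step S3, W' = C[I]: look up the codeword for every weight-vector index
    and reshape the result back to the original weight size."""
    return codebook[indices].reshape(original_shape)

# e.g. w_reconstructed = reconstruct_weight(codebook, indices, (4096, 4096))
```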
As a preferable scheme of the invention, the deep learning model to be compressed is the large language model LLaMA-7B; the linear layer weight size of the model is [output channel number, input channel number], and splitting along the input channel yields (output channel number × input channel number / V) weight vectors; the convolution layer weight size of the model is [output channel number, input channel number, convolution kernel height, convolution kernel width]; the convolution layer weight is first reshaped to the size [output channel number, input channel number × convolution kernel height × convolution kernel width], and splitting along the input channel then yields (output channel number × input channel number × convolution kernel height × convolution kernel width / V) weight vectors.
The beneficial effects of the invention are as follows:
1) The deep learning model compression method based on code table clustering provided by the invention reconstructs the model weights from a code table and indices, thereby reducing the memory occupied by the original weights of the model. Because the code table size is flexible, the invention allows a user to adjust the balance between compression rate and performance as desired, meaning that the degree of compression can be tailored to the specific application scenario and performance requirements. The method is applicable not only to the weights of linear layers but can also be extended to convolution layers, so that various deep learning models can be compressed; this is useful for compressing complex models composed of multiple types of layers.
2) In the deep learning model compression method based on code table clustering provided by the invention, all steps from weight extraction and clustering to weight reconstruction are performed automatically, greatly reducing the need for manual intervention and the possibility of human error.
3) The deep learning model compression method based on code table clustering provided by the invention reconstructs the model weights from a code table and indices with lower memory occupation. The weight vectors of current deep learning models exhibit a certain degree of repetitiveness: repeated vectors can be represented by the same shared vector, each occurrence pointing to it through the same index, and a number of such shared vectors form the code table. Because the code table and the indices occupy less memory than the weights themselves, the model compression method based on code table clustering can achieve a higher compression rate. By exploiting this repetitiveness of the weight vectors, a model whose weights are reconstructed from the code table and indices can avoid severe performance loss even under substantial compression.
Drawings
FIG. 1 is a flow chart of a deep learning model compression method based on code table clustering.
FIG. 2 is a schematic diagram of weights split into weight vectors according to input channels.
FIG. 3 is a schematic diagram of a code table clustering algorithm.
FIG. 4 is a schematic diagram of reconstructing model weights using a code table and an index.
Fig. 5 is a text generation result of an original model and a model compressed by the method of the present invention.
FIG. 6 is a comparison of the performance of the different methods.
Detailed Description
The invention is further illustrated and described below in connection with specific embodiments. The described embodiments are merely exemplary of the present disclosure and do not limit the scope. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.
The flow chart of the deep learning model compression method based on the code table clustering shown in fig. 1 comprises the following steps:
step S1: extracting weights of a linear layer and a convolution layer in a deep learning model, and splitting the weights according to the direction of an input channel so as to obtain a series of weight vectors; the length of the weight vector obtained by segmentation is defined as V;
step S2: setting a code table for each weight respectively, wherein the size of the code table is K×V, and using a code table clustering algorithm to cluster the weight vectors; the specific process of the weight vector clustering is as follows:
(2.1) randomly selecting K weight vectors as initial code words of the code table;
(2.2) calculating the Euclidean distance from each weight vector to each codeword; finding, for each weight vector, the codeword with the shortest distance to it, and assigning the index of that codeword to the weight vector, wherein the index of a codeword is its position in the code table;
(2.3) averaging all weight vectors assigned with indexes of the same codeword as updated values of the corresponding codeword of the index;
(2.4) repeating the steps (2.2) - (2.3) until the code table and the index are not updated any more, and completing the code table clustering algorithm;
step S3: saving the code table and index which are clustered and other uncompressed data in the original deep learning model as a compressed model; when the compressed model is run, for each weight, the index corresponding to its weight vector is used to retrieve the corresponding codeword, and the compressed weight with the same size as the original weight is reconstructed by using these codewords.
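As an illustration of how the compressed artifacts might be packaged and later expanded at run time, the following sketch reuses the split_into_vectors, codebook_cluster, and reconstruct_weight functions sketched earlier in this description; the dictionary layout, file name, and dtype choices are assumptions for illustration, not part of the invention, and the int32 index tensor would need a packed 15-bit format to realize the full storage saving.

```python
import torch

def compress_model_weights(model, K=32768, V=4):
    """Replace every linear/conv weight by a (code table, indices) pair (steps S1-S2)."""
    compressed = {}
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            vectors = split_into_vectors(module.weight.data, V)   # step S1 sketch above
            codebook, indices = codebook_cluster(vectors, K)      # step S2 sketch above
            compressed[name] = {
                "codebook": codebook.half(),          # K x V codewords
                "indices": indices.to(torch.int32),   # one index per weight vector
                "shape": tuple(module.weight.shape),
            }
    return compressed

def load_compressed_weights(model, compressed):
    """Step S3: rebuild each weight from its code table and indices before running the model."""
    for name, module in model.named_modules():
        if name in compressed:
            entry = compressed[name]
            w = reconstruct_weight(entry["codebook"].float(),
                                   entry["indices"].long(), entry["shape"])
            module.weight.data.copy_(w)

# e.g.
# compressed = compress_model_weights(model)
# torch.save(compressed, "compressed_model.pt")   # the saved code tables and indices
```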
A schematic diagram of step S1 is shown in fig. 2, and is described in detail as follows: in step S1, the pre-trained large language model LLaMA-7B is used as the model to be compressed, and its memory occupation is 12.38 GB. The wikitext2 and ptb data sets are used as test sets, the batch size during verification is 4, and the single text length of the verification set is 128. The weight size of a linear layer of the model is generally [output channel number, input channel number]; splitting along the input channel into vectors of length V yields (output channel number × input channel number / V) weight vectors. The weight size of a convolution layer of the model is generally [output channel number, input channel number, convolution kernel height, convolution kernel width]; the convolution layer weight is first reshaped to the size [output channel number, input channel number × convolution kernel height × convolution kernel width], and splitting along the input channel then yields (output channel number × input channel number × convolution kernel height × convolution kernel width / V) weight vectors. The length V of the weight vectors obtained by splitting is set to 4.
A schematic diagram of step S2 is shown in fig. 3, and is described in detail as follows: each weight uses a code table with the size of K x V, wherein K and V are the number of code words and the code word length in the code table respectively, the code word length is the same as the weight vector length, and the number of code words K is set to 32768. The Euclidean distance formula from each weight vector to each codeword is:
d(W_m, C_k) = \sqrt{ \sum_{i=1}^{V} (W_m^i - C_k^i)^2 }

wherein W_m is the m-th weight vector in the weight, C_k is the k-th codeword, d(W_m, C_k) is the Euclidean distance between W_m and C_k, and W_m^i and C_k^i are the i-th values of W_m and C_k respectively.
The formula for averaging all weight vectors with the same index is:
\hat{C}_k = \frac{1}{|W \in C_k|} \sum_{W_m \in C_k} W_m

wherein W \in C_k denotes the weight vectors assigned the index of the same codeword, |W \in C_k| is the number of such weight vectors, and \hat{C}_k is the updated value of the corresponding codeword.
A schematic diagram of step S3 is shown in fig. 4, and is described in detail as follows: using the code word corresponding to the weight vector index to replace the value of the original weight vector to obtain the compressed weight, wherein the memory occupation of the compressed weight is as follows:
M_{W'} = \frac{(C_{out} \cdot C_{in} / V) \cdot \log_2 K + K \cdot V \cdot T}{8}

wherein, for the linear layer, C_in is the number of input channels, and for the convolution layer, C_in is the number of input channels × convolution kernel height × convolution kernel width; C_out is the number of output channels; K and V are the number of codewords and the codeword length in the code table respectively; T is the bit width of the data storage format of the model's uncompressed weight; and M_{W'} is the memory occupation of the compressed weight, in bytes (B).
And finally, performing performance test on the compressed model, calculating the compression rate, and verifying the compression effect of the model. The compression rate formula before and after weight compression is:
r = 1 - \frac{M_{W'}}{M_W} = 1 - \frac{(C_{out} \cdot C_{in} / V) \cdot \log_2 K + K \cdot V \cdot T}{C_{out} \cdot C_{in} \cdot T}

wherein M_W is the memory occupation of the weight before compression; for the linear layer, C_in is the number of input channels, and for the convolution layer, C_in is the number of input channels × convolution kernel height × convolution kernel width; C_out is the number of output channels; and T is the bit width of the data storage format of the model's uncompressed weight.
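As a hedged worked example of this accounting (one log2(K)-bit index per length-V vector plus the K×V code table itself), the arithmetic for a single, hypothetical 4096×4096 half-precision linear weight with K = 32768 and V = 4 is:

```python
import math

# Hypothetical single linear weight at LLaMA-7B scale: 4096 x 4096, 16-bit values
C_out, C_in, T = 4096, 4096, 16        # T: bit width of the uncompressed format
K, V = 32768, 4                        # code table: 32768 codewords of length 4

original_bits   = C_out * C_in * T                      # the raw weight
index_bits      = (C_out * C_in // V) * math.log2(K)    # one 15-bit index per vector
codebook_bits   = K * V * T                             # the code table itself
compressed_bits = index_bits + codebook_bits

print(f"original:         {original_bits / 8 / 2**20:.2f} MiB")        # ~32.00 MiB
print(f"compressed:       {compressed_bits / 8 / 2**20:.2f} MiB")      # ~ 7.75 MiB
print(f"compression rate: {1 - compressed_bits / original_bits:.2%}")  # ~75.8 %
```

This per-layer figure is of the same order as the 75.91% overall compression rate reported for the whole model below.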
The embodiment of the invention evaluates the technical effect of the invention both qualitatively and quantitatively. The qualitative evaluation mainly uses intuitive visual inspection to judge the quality of the text generated by the model. This involves scrutinizing the generated text for logical discontinuities or unreasonable content, such as whether an argument is self-contradictory or a statement contains an obvious logical flaw, and checking whether the text merely provides surface information without going into the various aspects of the problem in depth. The qualitative assessment also checks the fluency and naturalness of the generated text, for example whether the language can be easily understood and accepted by humans, whether the sentence structure is reasonable, and whether the text is clear and unambiguous. It further checks whether the text is creative and distinctive, for example whether it can offer a novel perspective or solution rather than merely repeating common or stale viewpoints.
To evaluate performance quantitatively, perplexity (PPL) is used. PPL quantifies the difference between the generated text and real text; a smaller PPL value indicates better language-model performance and hence better compression performance. PPL is the exponential form of the cross-entropy loss: a large language model assigns probabilities to a sequence of words w_1, w_2, ..., w_N, and the perplexity is calculated as:

PPL(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, w_2, \ldots, w_{i-1}) \right)

wherein W is the whole word sequence, w_i is the i-th word in the sequence, p(w_i | w_1, w_2, ..., w_{i-1}) is the conditional probability the model assigns to the i-th word given all preceding words, N is the total number of words in the sequence, log is the natural logarithm, and exp is the exponential function with base e.
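For illustration, a minimal PyTorch sketch of this PPL computation (the function name and tensor shapes are illustrative; it assumes one next-token prediction per position):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """PPL = exp(-1/N * sum_i log p(w_i | w_1..w_{i-1})): the exponential of the
    mean cross-entropy, so lower is better.
    logits: [N, vocab_size], one prediction per position; targets: [N]."""
    mean_nll = F.cross_entropy(logits, targets, reduction="mean")  # -1/N * sum log p
    return torch.exp(mean_nll).item()

# e.g., for a causal LM, score position i+1 with the logits produced at position i:
# ppl = perplexity(all_logits[:-1], token_ids[1:])
```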
The qualitative evaluation results of the invention are as follows: in the embodiment, a code table of size 32768×4 is used to compress each linear layer weight and each convolution layer weight of the pre-trained large language model LLaMA-7B; the memory occupation of the original model is 12.38 GB, the memory occupation of the compressed model is 2.98 GB, and the compression rate is 75.91%. The text generated by the original model is shown in the upper half of fig. 5, and the text generated by the compressed model is shown in the lower half of fig. 5. As can be seen from fig. 5, the compressed model of the embodiment can still generate reasonable and substantive text at a high compression rate, which shows that the method provided by the invention can effectively compress the model while maintaining its performance.
The quantitative evaluation results of the invention are as follows: in the embodiment, the linear layer weights and convolution layer weights of the pre-trained large language model LLaMA-7B are each compressed using a code table of size 32768×4; the memory occupation of the original model is 12.38 GB, the memory occupation of the compressed model is 2.98 GB, and the compression rate is 75.91%. The perplexity (PPL) of the compressed model is 14.4 on the wikitext2 data set and 59.0 on the ptb data set. As shown in the comparison with other methods in fig. 6, the method obtains a lower PPL at the same or even higher compression rate, i.e., it produces text closer to real-world text, which indicates that the compression performance of the method is better.
The foregoing describes specific embodiments of the present invention with reference to specific examples; these contents are explanations of the present invention, and all technical solutions falling within the concept of the present invention belong to its protection scope.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.

Claims (8)

1. A deep learning model compression method based on code table clustering is characterized by comprising the following steps:
step S1: extracting weights of a linear layer and a convolution layer in a deep learning model, and splitting the weights according to the direction of an input channel so as to obtain a series of weight vectors; the length of the weight vector obtained by segmentation is defined as V;
step S2: setting a code table for each weight respectively, wherein the size of the code table is K×V and the number of codewords in the code table is K; a code table clustering algorithm is used to cluster the weight vectors and obtain the finally updated code table, and in the clustering process each weight vector is assigned an index, namely the position in the code table of the codeword closest to that weight vector;
step S3: saving the code table and index which are clustered and other uncompressed data in the original deep learning model as a compressed model; when running the compressed model, for each weight, the index corresponding to its weight vector is used to retrieve the corresponding codeword, and the compressed weight with the same size as the original weight is reconstructed by using these codewords.
2. The deep learning model compression method based on code table clustering according to claim 1, wherein the weight size of the linear layer in step S1 is [number of output channels, number of input channels], and splitting along the input channel yields (number of output channels × number of input channels / V) weight vectors; the weight size of the convolution layer is [number of output channels, number of input channels, convolution kernel height, convolution kernel width], the convolution layer weight is first reshaped to the size [number of output channels, number of input channels × convolution kernel height × convolution kernel width], and splitting along the input channel then yields (number of output channels × number of input channels × convolution kernel height × convolution kernel width / V) weight vectors.
3. The deep learning model compression method based on code table clustering according to claim 1, wherein the weight vector clustering in step S2 is performed by using a code table clustering algorithm, and the specific process is as follows:
(2.1) randomly selecting K weight vectors as initial code words of the code table;
(2.2) calculating the Euclidean distance from each weight vector to each codeword; finding, for each weight vector, the codeword with the shortest distance to it, and assigning the index of that codeword to the weight vector, wherein the index of a codeword is its position in the code table;
(2.3) averaging all weight vectors assigned with indexes of the same codeword as an updated value of the codeword corresponding to the index;
and (2.4) repeating the steps (2.2) - (2.3) until the code table and the index are not updated any more, and completing the code table clustering algorithm to obtain the finally updated code table.
4. The deep learning model compression method based on code table clustering according to claim 1, wherein each weight in step S2 uses one code table.
5. The deep learning model compression method based on code table clustering according to claim 3, wherein the Euclidean distance formula calculated in step S2 is:

d(W_m, C_k) = \sqrt{ \sum_{i=1}^{V} (W_m^i - C_k^i)^2 }

wherein W_m is the m-th weight vector in the weight, C_k is the k-th codeword, d(W_m, C_k) is the Euclidean distance between W_m and C_k, and W_m^i and C_k^i are the i-th values of W_m and C_k respectively.
6. The deep learning model compression method based on code table clustering according to claim 3, wherein in step S2, a formula for averaging all weight vectors assigned to indexes of the same codeword is:
\hat{C}_k = \frac{1}{|W \in C_k|} \sum_{W_m \in C_k} W_m

wherein W \in C_k denotes the weight vectors assigned the index of the same codeword, |W \in C_k| is the number of such weight vectors, and \hat{C}_k is the updated value of the codeword corresponding to the index.
7. The deep learning model compression method based on code table clustering according to claim 1, wherein in step S3, the compressed weight having the same size as the original weight is reconstructed by using the codeword corresponding to the weight vector index, where the formula is:
W' = C[I]

wherein W' is the compressed weight, C is the code table, and I is the index matrix corresponding to all the weight vectors.
8. The deep learning model compression method based on code table clustering according to claim 1, wherein the deep learning model to be compressed is the large language model LLaMA-7B.
CN202311590503.4A 2023-11-27 2023-11-27 Deep learning model compression method based on code table clustering Pending CN117639792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311590503.4A CN117639792A (en) 2023-11-27 2023-11-27 Deep learning model compression method based on code table clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311590503.4A CN117639792A (en) 2023-11-27 2023-11-27 Deep learning model compression method based on code table clustering

Publications (1)

Publication Number Publication Date
CN117639792A true CN117639792A (en) 2024-03-01

Family

ID=90022756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311590503.4A Pending CN117639792A (en) 2023-11-27 2023-11-27 Deep learning model compression method based on code table clustering

Country Status (1)

Country Link
CN (1) CN117639792A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01232829A (en) * 1988-03-12 1989-09-18 Graphics Commun Technol:Kk Method and apparatus for learning type multi-stage vector quantization
JPH0895599A (en) * 1994-05-06 1996-04-12 Nippon Telegr & Teleph Corp <Ntt> Encoding method and decoding method of signal and encoder and decoder using the same
US20040221192A1 (en) * 2003-04-30 2004-11-04 Giovanni Motta Method and system for minimizing the length of a defect list for a storage device
WO2016199330A1 (en) * 2015-06-12 2016-12-15 パナソニックIpマネジメント株式会社 Image coding method, image decoding method, image coding device and image decoding device
US20220164652A1 (en) * 2019-02-15 2022-05-26 Nokia Technologies Oy Apparatus and a method for neural network compression
CN113748605A (en) * 2019-03-18 2021-12-03 弗劳恩霍夫应用研究促进协会 Method and apparatus for compressing parameters of neural network
CN113595993A (en) * 2021-07-12 2021-11-02 广东工业大学 Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN115514374A (en) * 2022-09-20 2022-12-23 国网浙江省电力有限公司嘉兴供电公司 Compression method for PMU (phasor measurement Unit) measurement data of universal microgrid

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHU LI: "Digital Information Feature Compression Method Based on Weighted Trust Vector", 2020 13th International Conference on Intelligent Computation Technology and Automation (ICICTA), 31 December 2021 (2021-12-31), pages 1-4 *
LI Jiaying: "Hyperspectral image compression algorithm based on reducing mapping prediction residuals", Computer Engineering & Science, 31 December 2020 (2020-12-31), pages 825-834 *

Similar Documents

Publication Publication Date Title
CN112994701B (en) Data compression method, device, electronic equipment and computer readable medium
CN109886406A (en) A kind of complex convolution neural network compression method based on depth-compression
CN113434683B (en) Text classification method, device, medium and electronic equipment
EP3740912A1 (en) Data compression by local entropy encoding
CN113204674B (en) Video-paragraph retrieval method and system based on local-overall graph inference network
CN113539273B (en) Voice recognition method and device, computer equipment and storage medium
CN111008517A (en) Tensor decomposition technology-based neural language model compression method
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN113971735A (en) Depth image clustering method, system, device, medium and terminal
CN110020721A (en) A kind of target detection deep learning network optimized approach based on compression of parameters
Huang et al. An automatic and efficient BERT pruning for edge AI systems
CN115563314A (en) Knowledge graph representation learning method for multi-source information fusion enhancement
CN114861907A (en) Data calculation method, device, storage medium and equipment
CN114880538A (en) Attribute graph community detection method based on self-supervision
CN113449849B (en) Learning type text hash method based on self-encoder
CN110929532A (en) Data processing method, device, equipment and storage medium
Yan et al. Micronet for efficient language modeling
CN116018589A (en) Method and system for product quantization based matrix compression
CN113392868A (en) Model training method, related device, equipment and storage medium
CN117639792A (en) Deep learning model compression method based on code table clustering
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN112116062B (en) Nonlinear compression method of multi-layer perceptron based on tensor string decomposition
CN115982634A (en) Application program classification method and device, electronic equipment and computer program product
Horváth et al. Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
CN111797984A (en) Quantification and hardware acceleration method and device for multitask neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination