CN116384470A - Convolutional neural network model compression method and device combining quantization and pruning

Convolutional neural network model compression method and device combining quantization and pruning

Info

Publication number
CN116384470A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
filter
pruning
quantization
Prior art date
Legal status
Pending
Application number
CN202310205929.7A
Other languages
Chinese (zh)
Inventor
李莉
杨森
李存瑞
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202310205929.7A priority Critical patent/CN116384470A/en
Publication of CN116384470A publication Critical patent/CN116384470A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The application provides a convolutional neural network model compression method and device combining quantization and pruning. The method comprises the following steps: determining, based on a preset importance factor, a filter to be pruned in a trained original convolutional neural network for image processing, and pruning the filter to be pruned to obtain a pruned convolutional neural network; dividing the remaining filters in the pruned convolutional neural network into filters to be quantized and a central filter used for gradient back propagation; and quantizing the filters to be quantized according to preset image calibration data to obtain a compression model corresponding to the original convolutional neural network. The method and the device can effectively improve the compression efficiency and reliability of the convolutional neural network, preserve model precision and application effectiveness after compression, allow model compression to be performed on client devices, and thereby effectively improve the effectiveness and accuracy of image processing with the compressed model.

Description

Convolutional neural network model compression method and device combining quantization and pruning
Technical Field
The application relates to the technical field of image processing, and in particular to a convolutional neural network model compression method and device combining quantization and pruning.
Background
Convolutional neural networks are among the most representative deep learning algorithms; they have been widely used in fields such as computer vision and have achieved a number of breakthroughs. To improve the intelligence and effectiveness of image processing, convolutional neural networks can be used for tasks such as image feature recognition. In view of the huge number of model parameters, the large amount of computation, and the high power consumption of convolutional neural networks, researchers have proposed neural network model compression techniques, which are of great research significance for applying convolutional neural networks on client devices with limited resources.
At present, existing convolutional neural network compression approaches usually apply either network pruning or quantization (which reduces the number of bits required by the weights), or simply stack the two. However, using either approach alone cannot guarantee both compression efficiency and post-compression model precision, while directly stacking the two greatly reduces model precision, sometimes to a level worse than using quantization or pruning alone. In addition, existing methods that directly stack the two basically add a pruning operation during quantization-aware training, which requires a large amount of training data; owing to data privacy, such methods are difficult to implement in practice and hinder model deployment on client devices.
Disclosure of Invention
In view of this, embodiments of the present application provide a novel convolutional neural network model compression method and apparatus that combines quantization and pruning to obviate or ameliorate one or more of the disadvantages of the prior art.
One aspect of the present application provides a convolutional neural network model compression method combining quantization and pruning, including:
determining, based on a preset importance factor, a filter to be pruned in a trained original convolutional neural network for image processing, and pruning the filter to be pruned to obtain a pruned convolutional neural network;
dividing the remaining filters in the pruned convolutional neural network into a filter to be quantized and a central filter for gradient back propagation;
and carrying out quantization processing on the filter to be quantized according to preset image calibration data so as to obtain a compression model corresponding to the original convolutional neural network.
In some embodiments of the present application, further comprising:
and adopting a distillation learning method, and performing fine adjustment processing on gradient direction propagation on the compression model by using a label corresponding to the image calibration data.
In some embodiments of the present application, the importance factors include: geometric median;
Correspondingly, the determining, based on the preset importance factor, of the filter to be pruned in the trained original convolutional neural network for image processing comprises the following steps:
a to-be-pruned evaluation step: based on a preset geometric median evaluation mode, calculating the filter in the current target layer of the original convolutional neural network that minimizes the sum of Euclidean distances to all filters, and taking this filter as a filter to be pruned of the target layer;
judging whether the number of current filters to be pruned in each layer of the original convolutional neural network reaches the pruning threshold corresponding to that layer; if a layer has not reached its pruning threshold, taking that layer as a new target layer and returning to execute the to-be-pruned evaluation step for the target layer, until the number of filters to be pruned in each layer of the original convolutional neural network reaches the pruning threshold corresponding to that layer.
In some embodiments of the present application, before the dividing the remaining filters in the pruned convolutional neural network into the filter to be quantized and the center filter for gradient back propagation, the method further includes:
and performing quantization preprocessing on the remaining filters in the pruned convolutional neural network based on a preset cross-layer equalization method.
In some embodiments of the present application, the dividing the remaining filters in the pruned convolutional neural network into a filter to be quantized and a center filter for gradient back propagation includes:
based on a preset geometric median evaluation mode, calculating, for each layer of the original convolutional neural network, the filter that minimizes the sum of Euclidean distances to all filters in that layer, and taking it as the central filter for gradient back propagation of the layer in which it is located;
and excluding the central filter from the remaining filters in the pruned convolutional neural network to obtain the corresponding filters to be quantized.
In some embodiments of the present application, the quantizing the filter to be quantized according to preset image calibration data to obtain a compression model corresponding to the original convolutional neural network includes:
determining a quantization range for weight quantization of the filter to be quantized by adopting a preset MSE error method;
and carrying out weight rounding treatment on the filter to be quantized by adopting the image calibration data based on a preset AdaRound quantization algorithm so as to optimize the filter to be quantized in the original convolutional neural network layer by layer, and obtaining a compression model corresponding to the original convolutional neural network.
In some embodiments of the present application, the performing, by using a distillation learning method and labels corresponding to the image calibration data, of gradient back propagation fine-tuning on the compression model comprises:
taking the original convolutional neural network as a teacher model and taking the compression model as a student model;
and taking the labels corresponding to the image calibration data as the supervision information of the teacher model, and performing gradient back propagation fine-tuning on the compression model based on a distillation learning method.
Another aspect of the present application provides a convolutional neural network model compression device combining quantization and pruning, including:
the pruning module is used for determining a filter to be pruned in the original convolutional neural network which is obtained through training and used for image processing based on a preset importance factor, and pruning the filter to be pruned to obtain a pruned convolutional neural network;
the division module is used for dividing the remaining filters in the pruned convolutional neural network into filters to be quantized and a central filter for gradient back propagation;
and the quantization module is used for carrying out quantization processing on the filter to be quantized according to preset image calibration data so as to obtain a compression model corresponding to the original convolutional neural network.
In some embodiments of the present application, the convolutional neural network model compression device combining quantization and pruning further comprises:
and the fine tuning module is used for carrying out fine tuning processing on the gradient direction propagation of the compression model by adopting a distillation learning method and applying a label corresponding to the image calibration data.
In some embodiments of the present application, the convolutional neural network model compression device combining quantization and pruning further comprises:
the quantization preprocessing module is used for carrying out quantization preprocessing on the residual filter in the pruned convolutional neural network based on a preset cross-layer equalization method.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the convolutional neural network model compression method of joint quantization and pruning when the computer program is executed.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of convolutional neural network model compression of joint quantization and pruning.
The convolution neural network model compression method combining quantization and pruning can realize the combination processing of quantization and pruning on the original convolution neural network for predicting image characteristics obtained through training, does not need to use a large amount of training data, can effectively save time cost and the like of model compression, and further can effectively improve the compression efficiency of the convolution neural network; by pruning the original convolutional neural network by adopting the importance factors, the situation that pruning errors and quantization errors are mixed due to quantization and other operations after pruning operation can be avoided, the effect of combining quantization and pruning can be effectively improved, the loss of model precision after compression can be reduced, the compression ratio can be improved, further, the recognition precision can be further ensured on the basis of image processing such as image feature recognition or prediction by adopting the compressed convolutional neural network, model compression can be realized in client equipment, the efficiency of image feature recognition and the like of the client equipment can be improved, and the recognition effectiveness and reliability can be ensured; the remaining filters in the pruned convolutional neural network are divided into a central filter and a filter to be quantized, the filter to be quantized is quantized based on preset image calibration data, and the central filter with full precision which is not quantized is reserved, so that the central filter can adapt to the loss caused by other quantization weights in a normal derivation mode in the subsequent gradient back propagation process, and the model precision after compression is further ensured.
Additional advantages, objects, and features of the application will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present application are not limited to the above-detailed description, and that the above and other objects that can be achieved with the present application will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings are included to provide a further understanding of the application, and are incorporated in and constitute a part of this application. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the application. Corresponding parts in the drawings may be exaggerated, i.e. made larger relative to other parts in an exemplary device actually manufactured according to the present application, for convenience in showing and describing some parts of the present application. In the drawings:
fig. 1 is a schematic flow chart of a convolutional neural network model compression method combining quantization and pruning according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a second embodiment of a convolutional neural network model compression method combining quantization and pruning in an embodiment of the present application.
Fig. 3 is a schematic flow chart of a third embodiment of a convolutional neural network model compression method combining quantization and pruning according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a convolutional neural network model compression device combining quantization and pruning according to another embodiment of the present application.
Fig. 5 is a schematic diagram of another structure of a convolutional neural network model compression device combining quantization and pruning according to another embodiment of the present application.
Fig. 6 is a schematic diagram of an implementation architecture of a convolutional neural network model compression method for joint quantization and pruning provided in an application example of the present application.
Fig. 7 is an exemplary schematic diagram of a filter classification based pruning and quantization process provided in an application example of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the present application and their descriptions are used herein to explain the present application, but are not intended to be limiting of the present application.
It should be noted here that, in order to avoid obscuring the present application due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present application are shown in the drawings, while other details not greatly related to the present application are omitted.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled" may refer to not only a direct connection, but also an indirect connection in which an intermediate is present, unless otherwise specified.
Hereinafter, embodiments of the present application will be described with reference to the drawings. In the drawings, the same reference numerals represent the same or similar components, or the same or similar steps.
Deep learning, which is a branch of the machine learning field, has recently been a hot spot field in machine learning because it exhibits very excellent performance in various fields such as image recognition and retrieval, natural language processing, and speech recognition. Deep learning is successful, on one hand, because the network model has a deeper layer number and more parameters, which enables the model to have strong nonlinear fitting capability; on the other hand, because the continual updating of hardware devices provides the possibility for rapid training of deep learning models. To achieve higher precision and accuracy, researchers have long been working on developing deeper and larger models, but this obviously leads to increased model parameters and computation, making the models difficult to deploy in practical applications, and convolutional neural network compression techniques are also emerging. How to simplify the model under the condition of ensuring the high precision of the model is an important subject in the field of deep learning and is a research focus of the model compression technology.
The convolutional neural network compression technology is of great research significance for applying convolutional neural networks on terminal devices with limited resources. Deep network models expose the following disadvantages when applied on mobile terminal devices: 1) Huge numbers of model parameters: for example, VGG16 has more than one hundred million parameters and its model size exceeds 500 MB; as networks continue to deepen, many models reach 1 GB, 2 GB, or even more, and such large models are difficult to deploy to mobile terminals. 2) Large amount of model computation: deep convolutional neural networks contain a large number of convolution operations, so even a single forward pass requires a huge amount of computation; for example, ResNet50 requires roughly 3.8 billion floating-point operations per forward pass. 3) High power consumption: continual memory access and extensive use of computing resources during network operation result in significant power consumption. Model compression can effectively reduce model parameters, compress the memory occupied by the model, reduce the amount of model computation, and shorten the time taken by model training and prediction. A compressed network model can be deployed to resource-limited embedded devices, expanding the wide application of edge intelligence.
The convolutional neural network compression mode mainly comprises the following steps:
(1) Network pruning is a widely used method in deep neural network compression. It reduces model parameters by deleting redundant parameters in the neural network and reduces the floating-point operations needed to train and test the model, thereby compressing the model and shortening training and testing time. By pruning granularity, network pruning can be divided into single-weight pruning, intra-kernel weight pruning, convolution kernel (filter) pruning, and channel pruning. Network pruning involves aspects such as the choice of pruning granularity, the importance evaluation for pruning, and the design of the pruning procedure.
(2) The main idea of quantization is to reduce the number of bits required for the weights in order to compress the original network; it mainly comprises low-precision and recoding methods. For convolutional neural networks, the network model weights are normally single-precision 32-bit floating-point numbers. The low-precision method uses floating-point or integer numbers with fewer bits for training, testing, or storage; the recoding method re-encodes the original data and represents it with fewer bits, thereby compressing the model. The quantization process involves the selection of the rounding mechanism, the selection of quantization parameters, and the optimization of the quantization target distribution, and the strategies adopted affect the final effect of quantization.
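To make the low-precision idea above concrete, the following minimal Python sketch (not part of the patent; the function name and the affine 8-bit scheme are illustrative assumptions) quantizes a float32 weight tensor to 8-bit integers and dequantizes it back:

```python
import numpy as np

def affine_quantize(w, num_bits=8):
    """Uniformly quantize a float32 tensor to signed integers and dequantize it back.

    Generic illustration of weight quantization, not the exact scheme of the patent.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)           # step size of the uniform grid
    zero_point = np.round(qmin - w.min() / scale)         # integer offset mapping w.min() to qmin
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    w_hat = (q.astype(np.float32) - zero_point) * scale   # dequantized approximation of w
    return q, w_hat

w = np.random.randn(64, 3, 3, 3).astype(np.float32)       # a 64-filter 3x3 conv weight
q, w_hat = affine_quantize(w)
print(q.dtype, np.abs(w - w_hat).max())                   # int8 storage and worst-case rounding error
```

The gap between w and w_hat is the quantization error that the range-setting and rounding choices discussed later aim to minimize.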
(3) Distillation learning may be referred to as a "teacher-student learning algorithm". Its main idea is to use a teacher network that has already been trained to convergence to provide additional guidance information for the training of a student network. Normally the teacher network is a complex model with a larger number of parameters, such as the VGG series or ResNet101, while the student network is a lightweight model with fewer parameters, such as the MobileNet series. Borrowing the idea of transfer learning, knowledge such as the probability distribution and output features of the teacher model is distilled into the student model through training, so that the precision of the student model is improved; only the student model is used for inference at deployment time, achieving the purpose of model compression.
However, although studies on quantization or pruning alone have produced abundant research results and the related techniques are applied in practice, there are few studies on effectively combining multiple model compression methods. On the one hand, when quantization or pruning is used alone for model compression, quantization inevitably introduces a quantization function whose non-differentiability blocks gradient back propagation, so researchers have to sacrifice precision by approximating the quantization function with a differentiable function to enable gradient back propagation, which further increases the quantization error and reduces the precision of the quantized model; for pruning, the biggest problem is time cost, as the lengthy train-prune-retrain fine-tuning procedure greatly reduces the efficiency of model deployment. On the other hand, quantization focuses on accelerating model inference while pruning focuses on compressing the model structure, and both are needed simultaneously to obtain a better compression effect; however, naively stacking model quantization and pruning greatly reduces model accuracy, sometimes to a level worse than quantization or pruning alone. Existing methods that combine quantization and pruning add a pruning operation during quantization-aware training, which requires a large amount of training data; owing to data privacy, such methods are difficult to implement in practice and hinder model deployment on end-side devices.
Based on this, the research of the present application is mainly focused on the innovation of model quantization and pruning coupling technology without training data, and mainly solves the following two problems:
(1) Aiming at the poor effect of existing combined quantization-and-pruning methods, a novel efficient compression algorithm coupling quantization and pruning is proposed, based on post-training quantization (PTQ) and structured pruning, which keeps the precision loss relative to the original model as small as possible in a data-free scenario while making the compression ratio as high as possible.
(2) Aiming at the problem that it is difficult to back-propagate error information through a quantized model, the inference result of the full-precision, un-pruned model is taken as supervision, and a more effective and more robust back-propagation algorithm is adopted to achieve more accurate back propagation of error information.
The following examples are provided to illustrate the invention in more detail.
The embodiment of the application provides a method for compressing a convolutional neural network model by joint quantization and pruning, which can be executed by a device for compressing a convolutional neural network model by joint quantization and pruning, referring to fig. 1, the method for compressing a convolutional neural network model by joint quantization and pruning specifically comprises the following contents:
step 100: determining a filter to be pruned in the original convolutional neural network for image processing, which is obtained through training, based on a preset importance factor, and pruning the filter to be pruned to obtain the pruned convolutional neural network.
In step 100, pruning the original convolutional neural network with the importance factor avoids mixing pruning errors with quantization errors caused by quantization and other operations performed after pruning. This effectively improves the effect of combining quantization and pruning, reduces the precision loss of the compressed model, and raises the compression ratio; it further ensures recognition precision when the compressed convolutional neural network is used for image processing such as image feature recognition or prediction, allows model compression to be performed on the client device, and thereby improves the efficiency of image feature recognition on the client device while ensuring recognition effectiveness and reliability.
Step 200: dividing the remaining filters in the pruned convolutional neural network into filters to be quantized and a central filter for gradient back propagation.
In step 200, by keeping a full-precision central filter that is not quantized, the central filter can compensate, through normal differentiation during subsequent gradient back propagation, for the loss caused by the other quantized weights, further ensuring the precision of the compressed model.
Step 300: and carrying out quantization processing on the filter to be quantized according to preset image calibration data so as to obtain a compression model corresponding to the original convolutional neural network.
It will be appreciated that, after step 300, the client device or the like that stores the compression model locally may receive the image data to be subjected to feature recognition, and input the image data into the compression model, so that the compression model outputs the feature recognition result corresponding to the image data.
As can be seen from the above description, the convolutional neural network model compression method combining quantization and pruning provided by the embodiment of the present application can implement the combination processing of quantization and pruning on the original convolutional neural network for predicting image features obtained by training, without using a large amount of training data, so that the time cost of model compression can be effectively saved, and the compression efficiency of the convolutional neural network can be effectively improved; by pruning the original convolutional neural network by adopting the importance factors, the situation that pruning errors and quantization errors are mixed due to quantization and other operations after pruning operation can be avoided, the effect of combining quantization and pruning can be effectively improved, the loss of model precision after compression can be reduced, the compression ratio can be improved, further, the recognition precision can be further ensured on the basis of image processing such as image feature recognition or prediction by adopting the compressed convolutional neural network, model compression can be realized in client equipment, the efficiency of image feature recognition and the like of the client equipment can be improved, and the recognition effectiveness and reliability can be ensured; the remaining filters in the pruned convolutional neural network are divided into a central filter and a filter to be quantized, the filter to be quantized is quantized based on preset image calibration data, and the central filter with full precision which is not quantized is reserved, so that the central filter can adapt to the loss caused by other quantization weights in a normal derivation mode in the subsequent gradient back propagation process, and the model precision after compression is further ensured.
In order to further improve the applicability and application reliability of the compressed convolutional neural network model in the client device, in the method for compressing a convolutional neural network model by combining quantization and pruning provided in the embodiment of the present application, referring to fig. 2, after step 300 in the method for compressing a convolutional neural network model by combining quantization and pruning, the method specifically further includes the following contents:
step 400: and adopting a distillation learning method, and performing fine adjustment processing on gradient direction propagation on the compression model by using a label corresponding to the image calibration data.
In step 400, since the compression model includes the center filter that is not quantized, the center filter is used to adapt to the loss caused by other quantization weights in a normal derivation manner during the gradient back propagation process, so as to further ensure the model accuracy after compression.
In order to further improve the reliability and efficiency of the pruning process, in the convolutional neural network model compression method combining quantization and pruning provided in the embodiment of the application, the importance factor includes the geometric median; referring to fig. 3, step 100 in the convolutional neural network model compression method combining quantization and pruning specifically includes the following:
Step 110: a to-be-pruned evaluation step: based on a preset geometric median evaluation mode, calculating the filter in the current target layer of the original convolutional neural network that minimizes the sum of Euclidean distances to all filters, and taking this filter as the filter to be pruned of the target layer.
Step 120: judging whether the number of current filters to be pruned in each layer of the original convolutional neural network reaches the pruning threshold corresponding to that layer; if a layer has not reached its pruning threshold, taking that layer as a new target layer and returning to execute the to-be-pruned evaluation step for the target layer, until the number of filters to be pruned in each layer of the original convolutional neural network reaches the pruning threshold corresponding to that layer.
Specifically, considering that the cross-layer equalization and quantization operations performed after pruning would mix pruning errors with quantization errors and thereby affect the importance assessment of the filters, the present application adopts an importance-factor-based method to assess model importance. The geometric-median-based filter evaluation index (FPGM, Filter Pruning via Geometric Median) is highly practical and is an effective method for evaluating filter importance factors. The geometric median is an estimate of the center of a set of points in Euclidean space; the filters can be regarded as points in Euclidean space, so the "center" of these filters can be computed from the definition of the geometric median, i.e., by finding the point that minimizes the sum of Euclidean distances to all points. If a filter is close to this center, its information can be considered to coincide with that of the other filters, or even be redundant, and removing it can be expected not to have a large impact on the network.
When filter pruning is performed on the i-th convolutional layer, the pruning procedure is as follows: according to the initial pruning threshold, the geometric center of the layer's filters is computed m times; the filter to be pruned is removed after each computation and the center is then recomputed. In this way m filters to be pruned and their corresponding feature maps are removed, and the remaining parameters are finally saved.
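The per-layer pruning flow described above can be sketched as follows. This is an illustrative reading of the m-iteration geometric-median procedure; the function names and the use of NumPy are assumptions, not the patent's implementation:

```python
import numpy as np

def geometric_median_index(filters):
    """Return the index of the filter minimizing the sum of Euclidean distances to all filters."""
    flat = filters.reshape(len(filters), -1)                                    # (N, c*k*k)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)        # pairwise L2 distances
    return int(dists.sum(axis=1).argmin())

def prune_layer_by_geometric_median(filters, m):
    """Iteratively remove the m filters closest to the geometric median of one conv layer.

    `filters` is the (N, c, k, k) weight tensor of the layer; `m` is the per-layer
    pruning threshold. Returns the kept filters and the original indices of the pruned ones.
    """
    kept_indices = list(range(len(filters)))
    pruned_indices = []
    current = filters
    for _ in range(m):                                   # recompute the center after each removal
        j = geometric_median_index(current)
        pruned_indices.append(kept_indices.pop(j))
        current = np.delete(current, j, axis=0)
    return current, pruned_indices

layer = np.random.randn(16, 8, 3, 3).astype(np.float32)
kept, pruned = prune_layer_by_geometric_median(layer, m=4)
print(kept.shape, pruned)                                # (12, 8, 3, 3) and 4 pruned indices
```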
In order to further improve the reliability and effectiveness of the quantization process, in the method for compressing a convolutional neural network model by combining quantization and pruning provided in the embodiment of the present application, referring to fig. 2 and fig. 3, the following is specifically included between step 100 and step 200 of the method:
step 010: performing quantization preprocessing on the remaining filters in the pruned convolutional neural network based on a preset cross-layer equalization method.
Specifically, the cross-layer equalization preprocessing changes the dynamic range of the parameters to make the model easier to quantize. Because the weight ranges of different channels differ widely, using the same scaling and offset for different layers is unreasonable: the quantized values of channels with small weight ranges would collapse to 0. Cross-layer equalization exploits mathematical properties of activation functions such as ReLU to reduce the difference in weight ranges between channels, and replaces the original weights with new, more balanced FP32 weights. Cross-layer equalization can effectively improve the effect of weight quantization.
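A minimal sketch of cross-layer equalization for two consecutive conv layers joined by a ReLU is given below, assuming the commonly used per-channel scaling s_i = sqrt(r1_i * r2_i) / r2_i; the patent only states that cross-layer equalization is used, so the exact formula here is an assumption:

```python
import numpy as np

def cross_layer_equalize(w1, b1, w2):
    """Equalize per-channel weight ranges of two consecutive conv layers joined by ReLU.

    w1: (C_out, C_in, k, k) weights of the first layer, b1: (C_out,) its bias,
    w2: (C_next, C_out, k, k) weights of the second layer. The scaling formula below is
    one standard formulation of cross-layer equalization, assumed here for illustration.
    """
    r1 = np.abs(w1).reshape(w1.shape[0], -1).max(axis=1)                              # output-channel ranges of layer 1
    r2 = np.abs(w2).transpose(1, 0, 2, 3).reshape(w2.shape[1], -1).max(axis=1)        # input-channel ranges of layer 2
    s = np.sqrt(r1 * r2) / r2                                                          # per-channel equalization factors
    w1_eq = w1 / s[:, None, None, None]                                                # shrink wide channels of layer 1
    b1_eq = b1 / s
    w2_eq = w2 * s[None, :, None, None]                                                # compensate in layer 2 (ReLU passes the scaling through)
    return w1_eq, b1_eq, w2_eq
```

After this rescaling, each channel pair shares the same weight range sqrt(r1_i * r2_i), so a single per-tensor quantization grid fits all channels much better.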
In order to further improve the effectiveness and reliability of the central filter screening, in the method for compressing a convolutional neural network model by combining quantization and pruning provided in the embodiment of the present application, referring to fig. 3, step 200 in the method for compressing a convolutional neural network model by combining quantization and pruning specifically includes the following contents:
step 210: based on a preset geometric median evaluation mode, calculating, for each layer of the original convolutional neural network, the filter that minimizes the sum of Euclidean distances to all filters in that layer, and taking it as the central filter for gradient back propagation of the layer in which it is located;
step 220: and excluding the central filter from the remaining filters in the pruned convolutional neural network to obtain the corresponding filters to be quantized.
Specifically, after the filters to be pruned are pruned, preprocessing is performed before model quantization, and a cross-layer equalization method is adopted in the present application. Because the weight ranges of different channels differ widely, using the same scaling and offset for different layers is unreasonable, as channels with small weight ranges would become 0 after quantization. Cross-layer equalization exploits mathematical properties of activation functions such as ReLU to reduce the difference in weight ranges between channels and replaces the original weights with new, more balanced FP32 weights, which can effectively improve the effect of weight quantization.
In order to further improve the effectiveness and reliability of the quantization process, in the method for compressing a convolutional neural network model by combining quantization and pruning provided in the embodiment of the present application, referring to fig. 3, step 300 in the method for compressing a convolutional neural network model by combining quantization and pruning further specifically includes the following contents:
step 310: and determining a quantization range for weight quantization of the filter to be quantized by adopting a preset MSE error method.
Step 320: and carrying out weight rounding treatment on the filter to be quantized by adopting the image calibration data based on a preset AdaRound quantization algorithm so as to optimize the filter to be quantized in the original convolutional neural network layer by layer, and obtaining a compression model corresponding to the original convolutional neural network.
In particular, two keys in weight quantization are the setting of the quantization range and the rounding mechanism, which are the sources of quantization error. The present application adopts MSE as the method for setting the quantization range. The mean-square-error (MSE) method determines the quantization range (q_min, q_max) by minimizing the MSE error between the original tensor and the quantized tensor.
The adaptive-rounding AdaRound post-training quantization algorithm is used for rounding. It addresses the problem that directly rounding floating-point numbers to the nearest value does not give optimal quantization precision; with only a small amount of calibration data and without quantization-aware training or fine-tuning, it achieves good precision, and it maintains acceptable precision even for lower-bit quantization. Its core idea is as follows: when each weight in the network is quantized, rounding to the nearest value is not used; instead, the method adaptively decides, for each weight, whether the floating-point value is rounded up to the nearest grid value on the right or down to the nearest grid value on the left.
In order to further improve the effectiveness and reliability of the quantization process, in the method for compressing a convolutional neural network model by combining quantization and pruning provided in the embodiment of the present application, referring to fig. 3, step 400 in the method for compressing a convolutional neural network model by combining quantization and pruning further specifically includes the following contents:
step 410: taking the original convolutional neural network as a teacher model and taking the compression model as a student model;
step 420: taking the labels corresponding to the image calibration data as the supervision information of the teacher model, and performing gradient back propagation fine-tuning on the compression model based on a distillation learning method.
Specifically, for the model fine-tuning part, the present application creatively proposes a distillation-learning-based parameter tuning method to address the problem that a pruned model cannot be fine-tuned in a scenario without training data. In the classical model pruning procedure, the parameters of the model must be adjusted after pruning to restore the performance of the pruned model, which requires continued training on the training set and therefore cannot be satisfied in a training-data-free scenario. In terms of the data being quantized, model quantization can be divided into weight quantization and activation quantization: the former quantizes the weight values and the latter quantizes the activation values. In a training-data-free scenario, a small amount of unlabeled image calibration data is still needed for the activation quantization process. The present application uses this small amount of unlabeled image calibration data to help solve the problem of model pruning in a data-free scenario: the original convolutional neural network serves as the teacher model, the quantized and pruned model serves as the student model, and the labels generated by feeding the small amount of unlabeled calibration data into the original convolutional neural network serve as the supervision information provided by the teacher model.
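A possible PyTorch sketch of this distillation-based fine-tuning is shown below; the KL-divergence loss, temperature, optimizer, and step count are illustrative assumptions rather than values fixed by the patent:

```python
import torch
import torch.nn.functional as F

def distill_finetune(teacher, student, calib_images, steps=100, lr=1e-5, temp=4.0):
    """Fine-tune the compressed (student) model against the original (teacher) model
    using only a small batch of unlabeled calibration images.

    Minimal sketch: loss, temperature, optimizer and step count are assumptions.
    """
    teacher.eval()
    student.train()
    opt = torch.optim.Adam([p for p in student.parameters() if p.requires_grad], lr=lr)
    with torch.no_grad():
        soft_labels = F.softmax(teacher(calib_images) / temp, dim=1)   # "labels" produced by the teacher
    for _ in range(steps):
        opt.zero_grad()
        log_probs = F.log_softmax(student(calib_images) / temp, dim=1)
        loss = F.kl_div(log_probs, soft_labels, reduction="batchmean") * (temp * temp)
        loss.backward()            # gradients flow through the full-precision center filters
        opt.step()
    return student
```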
From the software aspect, the present application further provides a convolutional neural network model compression device for performing joint quantization and pruning in all or part of the method for performing joint quantization and pruning, referring to fig. 4, where the joint quantization and pruning convolutional neural network model compression device specifically includes the following contents:
the pruning module 10 is configured to determine a filter to be pruned in the training-obtained original convolutional neural network for image processing based on a preset importance factor, and perform pruning processing on the filter to be pruned to obtain a pruned convolutional neural network;
the dividing module 20 is configured to divide the remaining filters in the pruned convolutional neural network into a filter to be quantized and a central filter for gradient back propagation;
and the quantization module 30 is configured to perform quantization processing on the filter to be quantized according to preset image calibration data, so as to obtain a compression model corresponding to the original convolutional neural network.
In order to further improve applicability and application reliability of the compressed convolutional neural network model in the client device, in the convolutional neural network model compression device combining quantization and pruning provided in the embodiment of the present application, referring to fig. 5, the convolutional neural network model compression device further specifically includes the following contents:
and the fine-tuning module 40 is used for performing gradient back propagation fine-tuning on the compression model by adopting a distillation learning method and applying labels corresponding to the image calibration data.
In order to further improve the reliability and effectiveness of the quantization process, in the convolutional neural network model compression device combining quantization and pruning provided in the embodiment of the present application, referring to fig. 5, the convolutional neural network model compression device further specifically includes the following contents:
the quantization preprocessing module 10 is configured to perform quantization preprocessing on the remaining filters in the pruned convolutional neural network based on a preset cross-layer equalization method.
The embodiment of the combined quantization and pruning convolutional neural network model compression device provided in the application may be specifically used for executing the processing flow of the embodiment of the combined quantization and pruning convolutional neural network model compression method in the above embodiment, and the functions thereof are not described herein in detail, and may refer to the detailed description of the embodiment of the combined quantization and pruning convolutional neural network model compression method.
The part of the convolutional neural network model compression device for jointly quantizing and pruning can be executed in a server, such as an edge server, and in another practical application case, all the operations can be completed in the client device. Specifically, the selection may be made according to the processing capability of the client device, and restrictions of the use scenario of the user. The present application is not limited in this regard. If all operations are performed in the client device, the client device may further include a processor for performing specific processing of the convolutional neural network model compression for joint quantization and pruning.
The client device may have a communication module (i.e. a communication unit) and may be connected to a remote server in a communication manner, so as to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementations may include a server of an intermediate platform, such as a server of a third party server platform having a communication link with the task scheduling center server. The server may include a single computer device, a server cluster formed by a plurality of servers, or a server structure of a distributed device.
Any suitable network protocol may be used for communication between the server and the client device, including those not yet developed at the filing date of this application. The network protocols may include, for example, TCP/IP protocol, UDP/IP protocol, HTTP protocol, HTTPS protocol, etc. Of course, the network protocol may also include, for example, RPC protocol (Remote Procedure Call Protocol ), REST protocol (Representational State Transfer, representational state transfer protocol), etc. used above the above-described protocol.
As can be seen from the above description, the convolutional neural network model compression device for combined quantization and pruning provided in the embodiments of the present application can implement the combined quantization and pruning processing on the original convolutional neural network for predicting image features obtained by training, without using a large amount of training data, so that the time cost of model compression can be effectively saved, and the compression efficiency of the convolutional neural network can be effectively improved; by pruning the original convolutional neural network by adopting the importance factors, the situation that pruning errors and quantization errors are mixed due to quantization and other operations after pruning operation can be avoided, the effect of combining quantization and pruning can be effectively improved, the loss of model precision after compression can be reduced, the compression ratio can be improved, further, the recognition precision can be further ensured on the basis of image processing such as image feature recognition or prediction by adopting the compressed convolutional neural network, model compression can be realized in client equipment, the efficiency of image feature recognition and the like of the client equipment can be improved, and the recognition effectiveness and reliability can be ensured; the remaining filters in the pruned convolutional neural network are divided into a central filter and a filter to be quantized, the filter to be quantized is quantized based on preset image calibration data, and the central filter with full precision which is not quantized is reserved, so that the central filter can adapt to the loss caused by other quantization weights in a normal derivation mode in the subsequent gradient back propagation process, and the model precision after compression is further ensured.
In recent years, with the rapid development of deep learning, the deep learning is applied to various artificial intelligence tasks and has made breakthrough progress, researchers are actively exploring the use of model compression technology to solve the problem that network models are difficult to deploy on terminal devices, various model compression methods are proposed, and widely used model quantization, network pruning and the like show good performances. The differences among the network model compression methods are mainly represented by a rounding mechanism, pruning granularity, a pruning importance assessment method, a model gradient back-propagation algorithm and the like. The compression strategy employed will also vary for different compression indicators and for different network architectures.
In one or more embodiments of the present application, the convolutional neural network model may also be replaced with another pretrained model such as a different deep neural network; that is, the convolutional neural network model compression method combining quantization and pruning provided in the present application is not only applicable to a convolutional neural network model for image processing, but also applicable to convolutional neural networks that process other types of data (such as audio data), and further applicable to other types of pretrained models.
Referring to fig. 6, the convolutional neural network model compression method of joint quantization and pruning provided by the application of the present application, the overall architecture is divided into two parts: pruning and quantization of convolutional neural network models, and model fine-tuning.
Referring to fig. 7, the filters in the original convolutional neural network can be classified into three categories: filters to be quantized, filters to be pruned, and the central filter. For a given layer of the neural network, the filters to be pruned are first computed iteratively according to the pruning threshold setting, and the pruning operation is executed. After pruning according to the threshold, the center of the remaining filters is computed and that filter is classified as the central filter; it keeps FP32 precision and is not quantized, in order to enable gradient back propagation and parameter updating. Finally, the remaining filters are classified as filters to be quantized, and the subsequent quantization operation is executed on them to obtain the INT8 quantization result. The pruning and quantization part of the convolutional neural network model involves the importance evaluation for pruning, the design of the pruning procedure, the division into the central filter and the filters to be quantized, preprocessing before model quantization, the selection of the quantization algorithm, and so on.
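The three-way filter classification of fig. 7 can be summarized in one hypothetical per-layer routine; the helper below and its symmetric INT8 scheme are illustrative assumptions that merely tie together the pruning, center-selection, and quantization steps described above:

```python
import numpy as np

def gm_index(filters):
    """Index of the filter with the smallest summed Euclidean distance to all others."""
    flat = filters.reshape(len(filters), -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return int(dists.sum(axis=1).argmin())

def compress_layer(filters, prune_m, num_bits=8):
    """Classify one layer's filters into pruned / center / quantized, as in fig. 7 (illustrative)."""
    for _ in range(prune_m):                       # (1) remove the filters to be pruned
        filters = np.delete(filters, gm_index(filters), axis=0)
    center = gm_index(filters)                     # (2) center filter, kept at FP32
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(filters).max() / qmax
    q = np.clip(np.round(filters / scale), -qmax - 1, qmax).astype(np.int8)
    return {"center_fp32": filters[center],        # (3) all other filters quantized to INT8
            "others_int8": np.delete(q, center, axis=0),
            "scale": float(scale)}

layer = np.random.randn(16, 8, 3, 3).astype(np.float32)
result = compress_layer(layer, prune_m=4)
print(result["center_fp32"].shape, result["others_int8"].shape)
```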
The specific content of the convolutional neural network model compression method combining quantization and pruning provided by the application is as follows:
(I) Pruning
The present application performs pruning on the model at the granularity of filters, removing redundant filters of low importance. Considering that the cross-layer equalization and quantization operations performed after pruning would mix pruning errors with quantization errors and thereby affect the importance assessment of the filters, the importance assessment of the model is achieved by an importance-factor-based method. The geometric-median-based filter evaluation index (FPGM) is highly practical and is an effective method for evaluating filter importance factors. The geometric median is an estimate of the center of a set of points in Euclidean space; the filters can be regarded as points in Euclidean space, so the "center" of these filters can be computed from the definition of the geometric median, i.e., by finding the point that minimizes the sum of Euclidean distances to all points. If a filter is close to this center, its information can be considered to coincide with that of the other filters, or even be redundant, and removing it can be expected not to have a large impact on the network. Let i denote a layer of the neural network with N filters; the center of the Euclidean space formed by the filters of this layer is expressed as:

F_{i,j^*} = \mathop{\arg\min}_{x \in \{F_{i,j'},\; j' \in [1, N]\}} \sum_{j' \in [1, N]} \left\| x - F_{i,j'} \right\|_2

where F_{i,j^*} denotes the filter to be pruned in the i-th layer, i.e., the one that minimizes the sum of Euclidean distances to all filters, j' denotes the index of a filter in the i-th layer, F_{i,j'} denotes a filter in the i-th layer, and x denotes the tensor used to compute the geometric median point. Since the information of the filter to be pruned can be replaced by the information of the other filters, the network can easily recover its original performance after parameter adjustment, and the pruned filter's information can be represented by the remaining filters; therefore F_{i,j^*} can be pruned with negligible impact on the final result of the neural network.
When filter pruning is performed on the i-th convolutional layer, the pruning procedure is as follows: according to the initial pruning threshold, the geometric center of the layer's filters is computed m times; the filter to be pruned is removed after each computation and the center is then recomputed. In this way m filters to be pruned and their corresponding feature maps are removed, and the remaining parameters are finally saved.
(II) Quantization preprocessing
After the filters to be pruned are pruned, preprocessing is performed before model quantization, and a cross-layer equalization method is adopted in the present application. Because the weight ranges of different channels differ widely, using the same scaling and offset for different layers is unreasonable, as channels with small weight ranges would become 0 after quantization. Cross-layer equalization exploits mathematical properties of activation functions such as ReLU to reduce the difference in weight ranges between channels and replaces the original weights with new, more balanced FP32 weights. Cross-layer equalization can effectively improve the effect of weight quantization.
(III) Dividing the center filter and the filter to be quantized
Drawing on the idea of Quant-Noise to classify the filters, the present application creatively proposes a new filter-based model gradient back-propagation algorithm. During the training of a convolutional neural network, in order to select the optimal model, back propagation is used to obtain, layer by layer, the partial derivatives of the objective function with respect to the weights of all neurons, which together form the gradient of the objective function with respect to the weight vector. Although post-training quantization itself involves neither model training nor gradient back propagation, the pruning process requires the model parameters to be re-adjusted, so gradient back propagation is unavoidable. However, the round and clip functions of the quantization operation are not differentiable, so the quantized model cannot back-propagate gradients directly. The idea of the model gradient back-propagation algorithm proposed in this application is as follows: during quantization, the activation values are quantized in the normal way; for weight quantization and pruning, the criterion used in the pruning stage to judge the filter to be pruned (the geometric center) is applied once more to the pruned filters of each layer to determine the center filter, and the other filters are then quantized while the center filter is kept at full precision. In this way, the full-precision parameters can, through ordinary differentiation during back propagation, adapt to the loss caused by the other quantized weights.
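A minimal sketch of this division (Python/numpy, reusing the geometric_median_filter_index helper from the pruning sketch above; the uniform fake-quantization routine and all names are illustrative assumptions): the center filter of each layer stays at full precision while the remaining filters are quantized, so that ordinary gradients can flow through the center filter during parameter re-adjustment.

```python
import numpy as np

def uniform_quantize(w, n_bits=8):
    """Illustrative symmetric uniform fake-quantization of a single filter."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1) + 1e-12
    q = np.clip(np.round(w / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale

def split_and_quantize_layer(filters, n_bits=8):
    """Keep the geometric-median ('center') filter of the layer in full precision and
    fake-quantize every other filter of the layer."""
    center = geometric_median_filter_index(filters)       # helper defined in the pruning sketch
    out = filters.copy()
    for j in range(filters.shape[0]):
        if j != center:                                    # the center filter stays FP32
            out[j] = uniform_quantize(filters[j], n_bits)
    return out, center
```

During the subsequent parameter re-adjustment, only the full-precision center filter lies on a path free of the non-differentiable round and clip operations, which is what allows it to absorb, via ordinary gradients, the loss introduced by the quantized filters.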
(IV) Quantization
Two key aspects of weight quantization are the setting of the quantization range and the rounding mechanism, both of which are sources of quantization error. The present application adopts MSE as the method for setting the quantization range. The MSE error method determines the quantization range $(q_{\min}, q_{\max})$ by minimizing the MSE error between the original tensor and the quantized tensor, which is specifically realized as:

$$\underset{q_{\min},\, q_{\max}}{\arg\min}\; \bigl\lVert V - \hat{V}(q_{\min}, q_{\max}) \bigr\rVert_{F}^{2}$$

wherein $\hat{V}(q_{\min}, q_{\max})$ denotes the value of V after quantization with the range $(q_{\min}, q_{\max})$, and $\lVert \cdot \rVert_{F}$ denotes the Frobenius norm. This optimization problem can typically be solved using a grid search, the golden-section method, or an analytical closed-form approximation.
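For illustration, a minimal grid-search variant of this range setting might look as follows (Python/numpy; the symmetric candidate range and the helper names are assumptions, and the golden-section and closed-form variants mentioned above are not shown):

```python
import numpy as np

def quantize_with_range(v, q_min, q_max, n_bits=8):
    """Uniform quantization of tensor v clipped to [q_min, q_max] (illustrative)."""
    scale = (q_max - q_min) / (2 ** n_bits - 1) + 1e-12
    q = np.round((np.clip(v, q_min, q_max) - q_min) / scale)
    return q * scale + q_min

def mse_quant_range(v, n_bits=8, steps=100):
    """Grid search for the clipping range minimizing the squared Frobenius-norm error
    between the original tensor and its quantized version."""
    max_abs = np.abs(v).max()
    best_range, best_err = None, np.inf
    for k in range(1, steps + 1):
        r = max_abs * k / steps                            # candidate symmetric range
        err = np.linalg.norm(v - quantize_with_range(v, -r, r, n_bits)) ** 2
        if err < best_err:
            best_range, best_err = (-r, r), err
    return best_range

# Example usage on a random weight tensor
w = np.random.randn(64, 3, 3, 3)
q_min, q_max = mse_quant_range(w)
```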
For rounding, the adaptive-rounding AdaRound post-training quantization algorithm is used. It addresses the problem that directly rounding floating-point values does not yield optimal quantization accuracy; with only a small amount of calibration data and without quantization training or fine tuning, it achieves better accuracy, and it maintains good accuracy even at lower bit widths. Its core idea is as follows: when each weight in the network is quantized, round-to-nearest is not used; instead, the method adaptively decides whether a floating-point value is mapped to the fixed-point value immediately above it or the one immediately below it. Across the whole network, AdaRound is optimized layer by layer in sequence, with the following flow: first the (l-1)-th layer is optimized and its parameters are quantized using the rounding parameters obtained by AdaRound; one forward propagation is then carried out on this basis up to the l-th layer, which is optimized next with the AdaRound algorithm, and so on.
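The sketch below only illustrates this layer-by-layer flow; for simplicity it replaces AdaRound's learned per-weight soft rounding variable with a crude whole-tensor floor-versus-ceil comparison on calibration data, and layer_forward as well as all other names are assumptions:

```python
import numpy as np

def simplified_adaround_layer(w_fp, x_calib, layer_forward, n_bits=8):
    """Greatly simplified stand-in for the per-layer AdaRound step: instead of learning
    a soft rounding variable per weight, pick floor or ceil for the whole tensor by
    comparing the layer's reconstruction error on the calibration data."""
    scale = np.abs(w_fp).max() / (2 ** (n_bits - 1) - 1) + 1e-12
    y_ref = layer_forward(w_fp, x_calib)                              # full-precision layer output
    candidates = [np.floor(w_fp / scale) * scale, np.ceil(w_fp / scale) * scale]
    errors = [np.linalg.norm(layer_forward(w_q, x_calib) - y_ref) for w_q in candidates]
    return candidates[int(np.argmin(errors))]

def layerwise_adaround(layer_weights, x_calib, layer_forward):
    """Layer-by-layer flow: optimize one layer, quantize it with the obtained rounding,
    forward-propagate the calibration data through the quantized layer, then continue."""
    x, quantized = x_calib, []
    for w in layer_weights:
        w_q = simplified_adaround_layer(w, x, layer_forward)
        quantized.append(w_q)
        x = layer_forward(w_q, x)          # the next layer sees the quantized layer's outputs
    return quantized
```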
(V) Fine tuning
For model fine tuning, the present application creatively proposes a parameter tuning method based on distillation learning, addressing the problem that a pruned model cannot be tuned when no training data are available. In a classical model pruning flow, the model parameters must be adjusted after pruning to improve the performance of the pruned model, which requires continued training on the training set; a scenario without training data cannot satisfy this requirement. The present application solves the problem of model pruning in the data-free scenario by reusing the small amount of unlabeled image calibration data already used for activation quantization: the pre-trained FP32 model serves as the teacher model, the quantized and pruned model serves as the student model, and the labels generated by feeding the small amount of unlabeled calibration data into the convolutional neural network serve as the supervision information provided by the teacher model.
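A minimal sketch of this distillation-based fine tuning in the data-free setting is given below (Python/PyTorch; the KL-divergence loss with temperature, the optimizer choice, and the loader and model names are assumptions, as the application only specifies that the teacher's outputs on the unlabeled calibration data supervise the student):

```python
import torch
import torch.nn.functional as F

def distillation_finetune(student, teacher, calib_loader, epochs=1, lr=1e-4, T=1.0):
    """Fine-tune the pruned-and-quantized student using the frozen FP32 teacher's soft
    outputs on unlabeled calibration images as supervision. Which parameters remain
    trainable (e.g. only the full-precision center filters) is decided outside."""
    teacher.eval()
    optimizer = torch.optim.Adam(
        (p for p in student.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for images in calib_loader:                        # unlabeled calibration images only
            with torch.no_grad():
                t_logits = teacher(images)
            s_logits = student(images)
            loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                            F.softmax(t_logits / T, dim=1),
                            reduction="batchmean") * (T * T)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```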
In summary, the convolutional neural network model compression method combining quantization and pruning provided by this application example effectively compresses the model in the data-free scenario through the effective coupling of quantization and pruning techniques, while maintaining high model accuracy. It addresses the problems that existing combined quantization-and-pruning methods perform poorly and that training data are hard to obtain, by providing a complete quantization-pruning coupling algorithm, a new model gradient back-propagation algorithm, and, drawing on the idea of distillation learning, a solution to the problem that the parameters of a pruned model cannot be adjusted in the data-free scenario. The technical scheme achieves very good compression of network models in the specific scenario where no training data are available.
The embodiment of the present application further provides an electronic device, such as a central server. The electronic device may include a processor, a memory, a receiver and a transmitter, where the processor is configured to perform the convolutional neural network model compression method of joint quantization and pruning mentioned in the foregoing embodiment. The processor and the memory may be connected by a bus or in other manners; connection by a bus is taken as an example here. The receiver may be connected to the processor and the memory in a wired or wireless manner.
The processor may be a central processing unit (Central Processing Unit, CPU). The processor may also be any other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof.
The memory, as a non-transitory computer-readable storage medium, can be used to store a non-transitory software program, a non-transitory computer-executable program and modules, such as the program instructions/modules corresponding to the convolutional neural network model compression method of joint quantization and pruning in the embodiment of the present application. The processor executes the various functional applications and data processing of the processor by running the non-transitory software programs, instructions and modules stored in the memory, thereby implementing the convolutional neural network model compression method of joint quantization and pruning in the above method embodiments.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system and at least one application program required for a function, and the storage data area may store data created by the processor, etc. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and the remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory and, when executed by the processor, perform the convolutional neural network model compression method of joint quantization and pruning in the embodiments.
In some embodiments of the present application, the user equipment may include a processor, a memory, and a transceiver unit, where the transceiver unit may include a receiver and a transmitter, and the processor, the memory, the receiver, and the transmitter may be connected by a bus system, the memory storing computer instructions, and the processor executing the computer instructions stored in the memory to control the transceiver unit to transmit and receive signals.
As an implementation manner, the functions of the receiver and the transmitter in the present application may be considered to be implemented by a transceiver circuit or a dedicated transceiver chip, and the processor may be considered to be implemented by a dedicated processing chip, a processing circuit or a general-purpose chip.
As another implementation manner, the server provided in the embodiments of the present application may be implemented using a general-purpose computer: program code implementing the functions of the processor, the receiver and the transmitter is stored in the memory, and a general-purpose processor implements these functions by executing the code in the memory.
Embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the aforementioned convolutional neural network model compression method of joint quantization and pruning. The computer-readable storage medium may be a tangible storage medium such as random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a floppy disk, a hard disk, a removable memory disk, a CD-ROM, or any other form of storage medium known in the art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. Whether a particular implementation is in hardware or in software depends on the specific application of the technical solution and on the design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. When implemented in hardware, it may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave.
It should be clear that the present application is not limited to the particular arrangements and processes described above and illustrated in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present application are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions, or change the order between steps, after appreciating the spirit of the present application.
The features described and/or illustrated in this application for one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
The foregoing description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the embodiment of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A convolutional neural network model compression method combining quantization and pruning, comprising:
determining a filter to be pruned in an original convolutional neural network for image processing, which is obtained through training, based on a preset importance factor, and pruning the filter to be pruned to obtain a pruned convolutional neural network;
dividing the residual filter in the pruned convolutional neural network into a filter to be quantized and a central filter for gradient back propagation;
and carrying out quantization processing on the filter to be quantized according to preset image calibration data so as to obtain a compression model corresponding to the original convolutional neural network.
2. The method for compressing a convolutional neural network model combining quantization and pruning according to claim 1, further comprising:
and adopting a distillation learning method, and performing fine adjustment processing of gradient back propagation on the compression model by using a label corresponding to the image calibration data.
3. The method of convolutional neural network model compression for joint quantization and pruning according to claim 1, wherein the importance factors comprise: geometric median;
correspondingly, the determining the training based on the preset importance factor to obtain the filter to be pruned in the original convolutional neural network for image processing comprises the following steps:
and (3) carrying out pruning evaluation: based on a preset geometric median evaluation mode, calculating a filter which is in a current target layer in the original convolutional neural network and enables the sum of Euclidean distances among all filters to be minimum, and taking the filter as a filter to be pruned of the target layer;
judging whether the number of current filters to be pruned in each layer of the original convolutional neural network reaches the pruning threshold corresponding to that layer; if there is a layer in which the number of current filters to be pruned has not reached the pruning threshold, taking that layer as a new target layer, and returning to execute the pruning evaluation step for the target layer, until the number of current filters to be pruned in each layer of the original convolutional neural network reaches the pruning threshold corresponding to that layer.
4. The method for compressing a convolutional neural network model combining quantization and pruning according to claim 1, further comprising, before the dividing the remaining filters in the pruned convolutional neural network into a filter to be quantized and a center filter for gradient back propagation:
and carrying out quantization pretreatment on the residual filter in the pruned convolutional neural network based on a preset cross-layer equalization method.
5. The method for compressing a convolutional neural network model combining quantization and pruning according to claim 1, wherein the dividing the remaining filters in the pruned convolutional neural network into a filter to be quantized and a center filter for gradient back propagation comprises:
based on a preset geometric median evaluation mode, respectively calculating a filter which is in each layer of the original convolutional neural network and enables the sum of Euclidean distances among all filters to be minimum, and taking the filter as a central filter for gradient back propagation of the layer in which the filter is positioned;
and screening the central filter from the remaining filters in the pruned convolutional neural network to obtain the corresponding filter to be quantized.
6. The method for compressing a convolutional neural network model by combining quantization and pruning according to claim 1, wherein the quantizing the filter to be quantized according to preset image calibration data to obtain a compression model corresponding to the original convolutional neural network comprises:
determining a quantization range for weight quantization of the filter to be quantized by adopting a preset MSE error method;
and carrying out weight rounding treatment on the filter to be quantized by adopting the image calibration data based on a preset AdaRound quantization algorithm so as to optimize the filter to be quantized in the original convolutional neural network layer by layer, and obtaining a compression model corresponding to the original convolutional neural network.
7. The method for compressing a convolutional neural network model by combining quantization and pruning according to claim 2, wherein the adopting a distillation learning method and performing fine adjustment processing of gradient back propagation on the compression model by using a label corresponding to the image calibration data comprises:
taking the original convolutional neural network as a teacher model and taking the compression model as a student model;
and taking the label corresponding to the image calibration data as the supervision information of the teacher model, and performing fine adjustment processing of gradient back propagation on the compression model based on a distillation learning method.
8. A convolutional neural network model compression device combining quantization and pruning, comprising:
the pruning module is used for determining a filter to be pruned in the original convolutional neural network which is obtained through training and used for image processing based on a preset importance factor, and pruning the filter to be pruned to obtain a pruned convolutional neural network;
the division module is used for dividing the residual filter in the pruned convolutional neural network into a filter to be quantized and a central filter for gradient back propagation;
and the quantization module is used for carrying out quantization processing on the filter to be quantized according to preset image calibration data so as to obtain a compression model corresponding to the original convolutional neural network.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of convolutional neural network model compression of joint quantization and pruning as claimed in any one of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a method of convolutional neural network model compression of joint quantization and pruning as claimed in any one of claims 1 to 7.
CN202310205929.7A 2023-03-06 2023-03-06 Convolutional neural network model compression method and device combining quantization and pruning Pending CN116384470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310205929.7A CN116384470A (en) 2023-03-06 2023-03-06 Convolutional neural network model compression method and device combining quantization and pruning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310205929.7A CN116384470A (en) 2023-03-06 2023-03-06 Convolutional neural network model compression method and device combining quantization and pruning

Publications (1)

Publication Number Publication Date
CN116384470A true CN116384470A (en) 2023-07-04

Family

ID=86979719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310205929.7A Pending CN116384470A (en) 2023-03-06 2023-03-06 Convolutional neural network model compression method and device combining quantization and pruning

Country Status (1)

Country Link
CN (1) CN116384470A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540786A (en) * 2023-11-23 2024-02-09 北京新数科技有限公司 Model compression method and system based on importance of convolution layer and pruning of redundant filter

Similar Documents

Publication Publication Date Title
Tu et al. Deep neural network compression technique towards efficient digital signal modulation recognition in edge device
CN110866190B (en) Method and device for training neural network model for representing knowledge graph
CN113326930B (en) Data processing method, neural network training method, related device and equipment
WO2022006919A1 (en) Activation fixed-point fitting-based method and system for post-training quantization of convolutional neural network
CN110674939A (en) Deep neural network model compression method based on pruning threshold automatic search
CN116384470A (en) Convolutional neural network model compression method and device combining quantization and pruning
CN113469283A (en) Image classification method, and training method and device of image classification model
CN111224905B (en) Multi-user detection method based on convolution residual error network in large-scale Internet of things
CN116360883A (en) Combined optimization method for unloading of Internet of vehicles computing tasks
CN111091102B (en) Video analysis device, server, system and method for protecting identity privacy
CN116663644A (en) Multi-compression version Yun Bianduan DNN collaborative reasoning acceleration method
CN116257751A (en) Distillation method and device based on online cooperation and feature fusion
CN117667227A (en) Parameter adjustment method and related equipment
CN115730646A (en) Hybrid expert network optimization method based on partial quantization
CN113033653B (en) Edge-cloud cooperative deep neural network model training method
Furtuanpey et al. FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing
CN116150612A (en) Model training method and communication device
Cai et al. ACF: An Adaptive Compression Framework for Multimodal Network in Embedded Devices
EP3683733A1 (en) A method, an apparatus and a computer program product for neural networks
FI20205565A1 (en) Apparatus, method and computer program for accelerating grid-of-beams optimization with transfer learning
Guo et al. Digital-SC: Digital Semantic Communication with Adaptive Network Split and Learned Non-Linear Quantization
CN114448478B (en) Signal transmission method, signal transmission device, electronic equipment and storage medium
CN116957067B (en) Reinforced federal learning method and device for public safety event prediction model
CN114915321A (en) Method and device for dynamically detecting signals in MIMO (multiple input multiple output) system
CN115660951A (en) Image processing apparatus and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination