CN112733964A - Convolutional neural network quantification method for reinforcement learning automatic perception weight distribution - Google Patents


Info

Publication number
CN112733964A
Authority
CN
China
Prior art keywords
layer
weight
data
floating point
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110134308.5A
Other languages
Chinese (zh)
Other versions
CN112733964B (en)
Inventor
任鹏举
涂志俊
马建
夏天
赵文哲
陈飞
郑南宁
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202110134308.5A priority Critical patent/CN112733964B/en
Publication of CN112733964A publication Critical patent/CN112733964A/en
Application granted granted Critical
Publication of CN112733964B publication Critical patent/CN112733964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/084: Learning methods; Backpropagation, e.g. using gradient descent
    • G06N 5/04: Computing arrangements using knowledge-based models; Inference or reasoning models


Abstract

The method fuses the parameters of each layer's batch-normalization operation with the weights of the convolution operation to obtain fused weights and biases, and acquires the distribution information of each layer's fused weights in the floating-point convolutional neural network model. Guided by this per-layer weight distribution information, reinforcement learning automatically searches for the optimal per-layer weight scaling factor, and the floating-point weights are quantized into INT8 data based on these factors. A calibration data set is then input; for each group of input data, every layer's output feature map is recorded, and the mode is selected as the scaling factor of each layer's output feature map. The scaling factor of each layer's bias is calculated from the per-layer weight scaling factor and the per-layer output-feature-map scaling factor, so that the floating-point bias is quantized into an INT32 bias. Finally, a forward-inference process is constructed on the basis of the INT8 data, the INT32 bias and the total scaling factor, completing the quantization.

Description

Convolutional neural network quantification method for reinforcement learning automatic perception weight distribution
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a convolutional neural network quantization method in which reinforcement learning automatically perceives the weight distribution.
Background
In recent years, with the development of artificial intelligence technology led by convolutional neural networks, more and more computer vision tasks, such as image classification, object detection and semantic segmentation, are being solved well. A current trend is to deploy high-performance neural network models on end-side platforms, such as mobile or embedded devices, and to run them in real time (more than 30 frames per second) in real scenes. These platforms are characterized by scarce memory, low processor performance and limited power budgets, so today's most accurate models cannot be deployed on them while meeting real-time requirements, because their memory and computing demands are excessive.
To resolve this conflict, model compression techniques have been developed. They mainly reduce the number of parameters in the original model, or the number of bits used to represent them, in order to lower memory and computation requirements and thereby further reduce energy consumption. The most stable of the current model quantization techniques is INT8 quantization: compared with the FP32 computation of the original model, INT8 quantization reduces the model size by 4x and the memory bandwidth requirement by 4x, and hardware that supports INT8 computation is generally 2 to 4 times faster. Model quantization generally consists of three steps: first quantize the trained weights, then quantize the intermediate feature maps using a calibration data set, and finally, if the accuracy loss is large, perform quantization-aware training with additional training data to recover accuracy. The currently popular quantization schemes are mainly Google's TensorFlow Lite quantization tool and NVIDIA's TensorRT INT8 forward-inference tool. These techniques assume that all layers in a convolutional neural network are independent and quantize the weights directly according to the maximum and minimum of each layer's weight parameters, without considering the correlations and dependencies between layers. As a result, some inappropriate scaling factors are computed, leaving large truncation and round-to-zero errors after the weights are quantized and causing an obvious accuracy loss, so extra data are often needed for quantization-aware training to recover accuracy. Moreover, the feature-map calibration modes these techniques adopt are based on an exponential moving average and on Kullback-Leibler divergence, and both require more than a certain amount of calibration data to work well.
However, in some fields closely tied to data privacy, such as medicine and biology, it is difficult for developers to obtain large calibration and training data sets to guarantee good post-quantization accuracy.
Reinforcement learning is a collective term for a family of algorithms for solving Markov decision problems. It lets an agent learn by "trial and error", guiding its behavior with the rewards obtained from continuous interaction with the environment, with the goal of maximizing the agent's reward. The feedback signal provided by the environment is only an evaluation, usually a scalar, of how good the agent's action is; it does not tell the agent how to produce the correct action. The agent must therefore gain knowledge from this action-evaluation loop and improve its action policy to suit the environment. Song Han's team at MIT has used reinforcement learning to automatically search for a mixed-precision quantization strategy, but mixed-precision quantization is not supported by most current hardware, and reinforcement learning had not been used to automatically search for the most suitable per-layer scaling factors of a convolutional neural network, which is the accuracy-loss problem of traditional quantization that remained unsolved.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not form the prior art that is already known in this country to a person of ordinary skill in the art.
Disclosure of Invention
Aiming at the problems in the prior art of the large accuracy loss caused by traditional model quantization and of the dependence on calibration and training data sets during quantization, the invention provides a convolutional neural network quantization method in which reinforcement learning automatically perceives the weight distribution.
The object of the invention is achieved by the following technical solution: a convolutional neural network quantization method with reinforcement-learning automatic perception of the weight distribution, comprising the following steps:
providing a trained floating point convolution neural network model;
normalizing input data for fusion with parameters of a first layer of convolution of the floating point convolutional neural network model;
fusing the parameters of each layer's batch-normalization operation with the weights of the convolution operation to obtain fused weights and biases, and acquiring the distribution information of each layer's fused weights in the floating-point convolutional neural network model;
automatically searching, via reinforcement learning guided by the distribution information of each layer's weights, for the optimal per-layer weight scaling factor; fixing the per-layer weight scaling factors; and quantizing the floating-point weights into INT8 data based on these factors;
inputting a calibration data set and, for each group of input data, recording every layer's output feature map; selecting the mode as the scaling factor of each layer's output feature map; calculating the scaling factor of each layer's bias from the per-layer weight scaling factor and the per-layer output-feature-map scaling factor so as to quantize the floating-point bias into an INT32 bias; and
completing the quantization by constructing a forward-inference process based on the INT8 data, the INT32 bias and the total scaling factor, where the total scaling factor = (scaling factor of the input data) × (scaling factor of the weights) ÷ (scaling factor of the output feature map).
In the method, constructing the forward-inference process comprises: inputting INT8 image data; convolving the first layer's INT8 input with its INT8 weights to obtain an INT32 result; adding the INT32 result to the INT32 bias; dividing the sum by the total scaling factor to obtain the INT8 output data; and feeding that output to the next layer, which performs the same operations.
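The single-layer integer computation described above can be sketched in a few lines (a minimal NumPy illustration that uses a matrix product as a stand-in for convolution; function and variable names are ours, not the patent's, and the shift-based division assumes the power-of-2 scaling factors described later):

```python
import numpy as np

def int8_layer_forward(x_int8, w_int8, bias_int32, shift):
    """One quantized layer: INT8 'convolution' (matmul stand-in) with INT32
    accumulation, INT32 bias add, then division by the power-of-2 total
    scaling factor implemented as an arithmetic right shift."""
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32)  # INT32 accumulator
    acc += bias_int32                                        # add INT32 bias
    out = acc >> shift              # divide by 2**shift (total scaling factor)
    return np.clip(out, -128, 127).astype(np.int8)           # saturate to INT8
```

The output is again INT8, so it can be fed directly into the next quantized layer.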
In the method, the floating point convolutional neural network model is a pure floating point convolutional neural network model.
In the method, the distribution information of the weights comprises the maximum, minimum, mean, variance, skewness, kurtosis, data volume and calculation type of the floating-point weight data, wherein the maximum, minimum, mean, variance and data volume are obtained from the raw data, the skewness and kurtosis are obtained by building a histogram from the weight data, and the calculation type is one of standard convolution, depthwise convolution and fully connected.
In the method, the scaling coefficient of the weight and the scaling coefficient of the output characteristic graph are both powers of 2.
In the method, the parameters of the batch-normalization operation comprise the mean, variance, scaling factor and offset, and the parameters of the convolution operation comprise the weights.
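The fusion of these batch-normalization parameters (mean, variance, scaling factor, offset) into the convolution weights can be sketched as follows (a hedged illustration using the standard BN-folding algebra; the `eps` term and all names are our assumptions, not taken from the patent):

```python
import numpy as np

def fuse_bn_into_conv(w, b, mean, var, gamma, beta, eps=1e-5):
    """Fold batch-norm parameters into conv weight w (out_ch, ...) and
    bias b (out_ch,):
        y = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
          = conv(x, w_fused) + b_fused
    """
    scale = gamma / np.sqrt(var + eps)                       # per-channel factor
    w_fused = w * scale.reshape(-1, *([1] * (w.ndim - 1)))   # scale each filter
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused
```

After this fold, only the fused weights and biases need to be quantized, which is why the method collects distribution statistics on the fused weights.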
In the method, the automatic search for the optimal scaling factors by reinforcement learning comprises: constructing a reinforcement-learning agent; inputting the weight distribution information and outputting each layer's scaling factor; quantizing the floating-point convolutional neural network model according to the scaling factors; obtaining the post-quantization accuracy on a test set; calculating the difference between the original accuracy of the floating-point model and the post-quantization accuracy; and feeding this difference back to the agent for iterative parameter updates until the agent converges and yields the optimal weight scaling factors.
Compared with the prior art, the beneficial effects of this disclosure are as follows. The method automatically perceives the weight distribution information and treats the whole convolutional neural network as one entity during quantization, with the goal of minimizing the post-quantization accuracy loss; the optimal scaling factors are found with reinforcement learning according to the distribution information of each layer's weight data. Because the method considers the correlations and dependencies between layers, its quantization loss is smaller than that of other methods. The feature-map calibration mode is simpler and demands little calibration data: during calibration only the mode of each layer's feature-map scaling factors needs to be recorded as the final fixed factor, and a single relevant picture can complete calibration with good results. The method is also more hardware-friendly: it is a computation scheme in which only integers participate, its scaling factors are all powers of 2 so division can be replaced by shifting, and compared with other methods it requires less memory and fewer multiply-accumulate operations.
The above description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly apparent, and to make the implementation of the content of the description possible for those skilled in the art, and to make the above and other objects, features and advantages of the present invention more obvious, the following description is given by way of example of the specific embodiments of the present invention.
Drawings
Various other advantages and benefits of the present invention will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. Also, like parts are designated by like reference numerals throughout the drawings.
In the drawings:
FIG. 1 is a schematic diagram of reinforcement learning automatic search of a convolutional neural network quantization method of reinforcement learning automatic perception weight distribution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a single-layer forward inference calculation process of a convolutional neural network quantization method for reinforcement learning automatic perception weight distribution according to an embodiment of the present invention;
FIG. 3 is a calibration process stability comparison diagram of a convolutional neural network quantization method for reinforcement learning auto-perception weight distribution according to an embodiment of the present invention.
The invention is further explained below with reference to the figures and examples.
Detailed Description
Specific embodiments of the present invention will be described in more detail below with reference to fig. 1 to 3. While specific embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. As one skilled in the art will appreciate, various names may be used to refer to a component. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion and should thus be interpreted to mean "including, but not limited to". The description which follows is a preferred embodiment of the invention, but is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.
For the purpose of facilitating understanding of the embodiments of the present invention, the following description will be made by taking specific embodiments as examples with reference to the accompanying drawings, and the drawings are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic diagram illustrating steps of a convolutional neural network quantization method for reinforcement learning automatic perception weight distribution, and as shown in fig. 1, the convolutional neural network quantization method for reinforcement learning automatic perception weight distribution includes the following steps:
providing a trained floating point convolution neural network model;
normalizing input data for fusion with parameters of a first layer of convolution of the floating point convolutional neural network model;
fusing the parameters of each layer's batch-normalization operation with the weights of the convolution operation to obtain fused weights and biases, and acquiring the distribution information of each layer's fused weights in the floating-point convolutional neural network model;
automatically searching, via reinforcement learning guided by the distribution information of each layer's weights, for the optimal per-layer weight scaling factor; fixing the per-layer weight scaling factors; and quantizing the floating-point weights into INT8 data based on these factors;
inputting a calibration data set and, for each group of input data, recording every layer's output feature map; selecting the mode as the scaling factor of each layer's output feature map; calculating the scaling factor of each layer's bias from the per-layer weight scaling factor and the per-layer output-feature-map scaling factor so as to quantize the floating-point bias into an INT32 bias; and
completing the quantization by constructing a forward-inference process based on the INT8 data, the INT32 bias and the total scaling factor, where the total scaling factor = (scaling factor of the input data) × (scaling factor of the weights) ÷ (scaling factor of the output feature map).
In a preferred embodiment of the method, constructing the forward-inference process includes: inputting INT8 image data; convolving the first layer's INT8 input with its INT8 weights to obtain an INT32 result; adding the INT32 result to the INT32 bias; dividing the sum by the total scaling factor to obtain the INT8 output data; and feeding that output to the next layer, which performs the same operations.
In a preferred embodiment of the method, the floating point convolutional neural network model is a pure floating point convolutional neural network model.
In a preferred embodiment of the method, the distribution information of the weights includes the maximum, minimum, mean, variance, skewness, kurtosis, data volume and calculation type of the floating-point weight data, wherein the maximum, minimum, mean, variance and data volume are obtained from the raw data, the skewness and kurtosis are obtained by building a histogram from the weight data, and the calculation type is one of standard convolution, depthwise convolution and fully connected.
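The per-layer statistics that make up this state can be computed as in the sketch below (our own formulation: "sharpness" in the machine translation is read as skewness, and the direct standardized-moment formulas are used instead of the patent's histogram construction; all names are ours):

```python
import numpy as np

def weight_state(w_flat, layer_type_id):
    """Build the state vector describing one layer's fused weights:
    max, min, mean, variance, skewness, excess kurtosis, data volume,
    and an integer encoding of the calculation type."""
    mu, sigma = w_flat.mean(), w_flat.std()
    z = (w_flat - mu) / sigma
    return np.array([
        w_flat.max(), w_flat.min(), mu, sigma ** 2,
        (z ** 3).mean(),          # skewness
        (z ** 4).mean() - 3.0,    # excess kurtosis
        float(w_flat.size),       # data volume
        float(layer_type_id),     # e.g. 0: standard conv, 1: depthwise, 2: FC
    ])
```

This vector is what the reinforcement-learning agent would receive as its observation for one layer.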
In a preferred embodiment of the method, the scaling factor of the weight and the scaling factor of the output feature map are both powers of 2.
In a preferred embodiment of the method, the parameters of the batch-normalization operation include the mean, variance, scaling factor and offset, and the parameters of the convolution operation include the weights.
In a preferred embodiment of the method, the automatic search for the optimal scaling factors by reinforcement learning comprises: constructing a reinforcement-learning agent; inputting the weight distribution information and outputting each layer's scaling factor; quantizing the floating-point convolutional neural network model according to the scaling factors; obtaining the post-quantization accuracy on a test set; calculating the difference between the original accuracy of the floating-point model and the post-quantization accuracy; and feeding this difference back to the agent for iterative parameter updates until the agent converges and yields the optimal weight scaling factors.
In a preferred embodiment of the method, the distribution information is obtained as follows: the maximum, minimum, mean, variance and data volume are computed from the raw data, while the skewness and kurtosis are obtained by building a histogram from the weight data.
In a preferred embodiment of the method, the step of reinforcement learning automatic search comprises:
preparing a trained floating-point model, and fusing the parameters of each layer's batch-normalization operation (mean, variance, scaling factor and offset) with the parameters of the convolution operation (weights);
To ensure that the input data remain of INT8 type, the weight parameters of the first layer must be fused with the input preprocessing parameters (mean, variance) as well as with that layer's batch-normalization parameters; the other layers only need to fuse their own layer's batch-normalization parameters.
Acquiring distribution information of the fused weight of each layer in the floating point model;
constructing a reinforcement-learning agent (i.e., a neural network) that takes the weight distribution information as input and outputs the scaling factor corresponding to each layer; then quantizing the floating-point model according to these scaling factors, obtaining the post-quantization accuracy on a test set, calculating the difference between the original floating-point accuracy and the post-quantization accuracy, and feeding this difference back to the agent for iterative parameter updates until the agent converges and yields the optimal scaling factors;
fixing the scaling factors and inputting a calibration data set; for each group of input data, recording the power of 2 nearest the maximum absolute value of each layer's output feature map (for example, the power of 2 nearest 17 is 16, which is 2 to the 4th); because these powers of 2 are discrete there is inevitably repetition, and finally only the value that occurs most often (the mode) is selected as the final fixed scaling factor;
calculating each layer's bias scaling factor from the determined per-layer weight scaling factor and the per-layer output scaling factor;
quantizing the floating-point weights into INT8 data using the weight scaling factor, quantizing the floating-point bias into INT32 data using the bias scaling factor, and obtaining each layer's uniform scaling factor from the weight scaling factor and the output-feature-map scaling factor for use in the integer forward-inference process;
constructing the integer-only inference computation from the INT8 weights, the INT32 bias and the uniform scaling factor.
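The quantize/dequantize step with a power-of-2 scaling factor, as used for the weights and biases above, might look like this sketch (rounding mode, saturation bounds and names are our choices, not specified by the patent):

```python
import numpy as np

def quantize_pow2(x, exp, lo=-128, hi=127):
    """Quantize floats with scaling factor 2**exp:
    q = round(x * 2**exp), saturated to the target integer range
    (defaults give INT8; pass wider bounds for the INT32 bias)."""
    q = np.round(x * (2.0 ** exp))
    return np.clip(q, lo, hi).astype(np.int64)

def dequantize_pow2(q, exp):
    """Recover an approximation of the original floats."""
    return q.astype(np.float64) / (2.0 ** exp)
```

Because the factor is a power of 2, multiplying or dividing by it in integer arithmetic reduces to a bit shift, which is the hardware-friendliness the patent emphasizes.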
The model input is data from the data set. The calibration data set needs to be provided by the model provider, but fields involving data privacy, such as medicine and biology, may provide no calibration data or only a small amount.
In a preferred embodiment of the method, the mode is the number that occurs most often in a group of numbers; for example, in [2, 3, 3, 1, 1, 3] the mode is 3, since it occurs three times, more than any other value.
Selecting a fixed value: the calibration data set is input into the model, and for each group of input data the power of 2 nearest the maximum absolute value of each layer's output feature map is recorded (for example, the power of 2 nearest 17 is 16, which is 2 to the 4th); because these powers of 2 are discrete there is inevitably repetition, and finally only the value that occurs most often (the mode) needs to be selected as the final fixed scaling factor.
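This calibration step, recording the nearest power-of-2 exponent per batch and keeping the mode, can be sketched as follows (names are ours; "nearest" is taken on a log scale here, which matches the 17 → 16 example but is an assumption about the exact rounding rule):

```python
from collections import Counter
import math

def nearest_pow2_exponent(max_abs):
    """Exponent of the power of 2 nearest max_abs on a log scale
    (e.g. 17 -> 4, since 2**4 = 16 is the nearest power of 2)."""
    return round(math.log2(max_abs))

def calibrate_layer_scale(featmap_max_abs_per_batch):
    """Given the max |activation| of one layer for each calibration batch,
    return the modal power-of-2 exponent as the fixed scaling factor."""
    exps = [nearest_pow2_exponent(m) for m in featmap_max_abs_per_batch]
    return Counter(exps).most_common(1)[0][0]
```

Because the recorded exponents repeat across batches, even a single relevant calibration picture can already pin down the mode, consistent with the stability claim made for the method.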
In the search process of the reinforcement-learning agent (DDPG) shown in Fig. 1, the data of one layer's weights is first obtained and its distribution information is analysed: the maximum, minimum, mean, variance, skewness, kurtosis, data volume and calculation type are input to the agent as the state. An actor (one group of neural networks) then takes the state as input and outputs the corresponding scaling factor, while a critic (another group of neural networks) takes the state and the actor's scaling factor as input and outputs an evaluation score assessing the quality of the actor's prediction. The actor's action output (a floating-point number between 0 and 1) is passed through an action decoder:
[The action-decoder formula is shown as an image in the original.]
to resolve the corresponding scaling factor, where N = 2 and M is the exponent of the power of 2 nearest the maximum absolute value of the current layer's weights. The scaling factor is then fed back to the model and used to quantize the current layer; once all layers have been quantized, the quantized accuracy is obtained on the test set, and the reward, equal to (quantized accuracy − floating-point accuracy), is fed back to the agent to update the parameters of the actor and critic networks.
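The decoder formula itself appears only as an image in the source, so it cannot be reproduced exactly. One plausible reading consistent with the surrounding text (the action a in [0, 1] mapped to a power-of-2 exponent between the bounds N and M) is sketched below; this linear mapping is purely our assumption:

```python
def decode_action(a, n=2, m=8):
    """Hypothetical action decoder: map the actor's output a in [0, 1] to a
    power-of-2 scaling factor with exponent between n and m. The patent gives
    the formula only as an image; this linear mapping is our assumption.
    Per the text, N = 2 and M is the exponent of the power of 2 nearest the
    layer's maximum absolute weight."""
    exp = n + round(a * (m - n))
    return 2 ** exp   # the layer's scaling factor, a power of 2
```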
Fig. 2 shows the integer forward-inference process of a single layer in our quantization scheme: the INT8 input is convolved with the INT8 weights to obtain INT32 intermediate data, which is added to the INT32 bias; a right shift is then performed according to the layer's shift value (the exponent of the scaling factor), and the INT8 output is obtained through the ReLU activation function. The whole process involves only integer arithmetic (more efficient than TensorRT) and, unlike TensorFlow Lite, needs no INT32 multiplication, so the scheme is simpler and more efficient.
Fig. 3 compares the calibration stability of WDAQ (the present invention) and TensorRT. NVIDIA's TensorRT scheme requires more calibration data to reach a stable and reliable accuracy, whereas the present invention ensures good accuracy without relying on a large calibration data set; our experiments show that a single group of relevant data suffices to complete the calibration process with good accuracy. See Table 1 below for the quantization accuracy.
Table 1: quantization accuracy
[Table 1 is shown as an image in the original.]
In the table, the invention is tested on different models and compared with the currently popular TensorFlow Lite and TensorRT schemes; the conclusions are as follows:
1. On classical large models, which are less sensitive to quantization and do not suffer large accuracy drops from direct quantization, such as ResNet-50, ResNet-101, ResNet-152 and Inception_V3, the scheme of the invention (WDAQ) obtains better accuracy than TensorRT without needing quantization-aware training (a smaller accuracy drop relative to the baseline), and still keeps an accuracy advantage over TensorFlow Lite with quantization training;
2. On classical lightweight models such as MobileNet_w1 and MobileNetv2_w2, under the same conditions and without quantization-aware training, the invention obtains better accuracy and does not suffer the near-total accuracy collapse (accuracy dropping almost to 0) seen with TensorRT and TensorFlow Lite.
The reason there are two baselines in the table is as follows. The first baseline is used for the comparison between WDAQ and TensorRT: since no public TensorRT figures were found for these models, the tests were run with trained floating-point models obtained by the authors. The second baseline is used for the comparison with TensorFlow Lite, for which both the baseline and the INT8 figures are official; rather than rerunning those experiments, the official figures are used directly, which is fairer. The effect of quantization is then mainly judged by how much the quantized accuracy drops from the original floating-point accuracy.
Based on the MobileNetv2_w1 model, the parameter count and computation amount (i.e., the number of multiply-accumulate operations) required by each of the three schemes were calculated. These reflect the hardware performance of the different schemes: the smaller the parameter count and the computation amount, the better the hardware performance, as shown in Table 2 below.
TABLE 2
[Table 2 appears as an image in the original publication.]
The scheme of the invention is no higher than TensorRT or TensorFlow Lite in either parameter count or computation amount, so it is more hardware-friendly; it involves only integer operations, which is more favorable for the design of a dedicated accelerator.
In a system design contest held by a design automation conference, the present scheme was used to compress and quantize a self-designed ShuffleDet object detection model, and the algorithm was deployed on a Xilinx Ultra96 V1 FPGA development board. The officially measured accuracy (mIoU) was 0.615, the frame rate was 50.9 frames per second, and the energy consumed to process a single image was 0.183 joules; this result ranked second worldwide overall, with the algorithm accuracy also ranked second. The scheme was likewise used to compress and quantize an EfficientDet object detection model deployed on a Xilinx Ultra96 V2 FPGA development board: the officially measured accuracy (mIoU) was 74.4%, the frame rate was 42 frames per second, and the energy consumed per image was 0.175 joules; this result ranked fifth worldwide overall, with the algorithm accuracy ranked first. In the low-power image recognition challenge of an international computer vision conference, the scheme was used to compress and quantize Google's EfficientNet-B4 model; the floating-point model's classification accuracy is 80.2%, the accuracy after compression by the scheme was 79.32%, and this result ranked third worldwide, with the algorithm accuracy tied in the ranking. The invention therefore achieves high accuracy with low power consumption.
Although embodiments of the present invention have been described above with reference to the accompanying drawings, the invention is not limited to the above-described embodiments and application fields; the above embodiments are illustrative and instructive, not restrictive. Those skilled in the art, in light of this disclosure, may make numerous modifications without departing from the scope of the invention as defined by the appended claims.

Claims (7)

1. A convolutional neural network quantization method for reinforcement learning auto-perception weight distribution, the method comprising the steps of:
providing a trained floating point convolution neural network model;
normalizing input data for fusion with parameters of a first layer of convolution of the floating point convolutional neural network model;
fusing the parameters of each layer of batch processing operation with the weights of the convolution operation to obtain fused weights and bias, and acquiring the distribution information of each layer of fused weights in the floating point convolution neural network model;
using reinforcement learning, automatically searching, according to the distribution information of each layer's weights, for the optimal weight scaling coefficient of each layer; fixing the per-layer weight scaling coefficients, and quantizing the floating point weights into INT8 type data based on the per-layer weight scaling coefficients;
inputting a calibration data set, recording each layer's output feature map for each group of input data, selecting the mode as the scaling coefficient of each layer's output feature map, and calculating the scaling coefficient of each layer's bias from the per-layer weight scaling coefficient and the per-layer output feature map scaling coefficient, so as to quantize the floating point bias into an INT32 type bias,
the quantization being completed by constructing a forward inference process based on the INT8 type data, the INT32 type bias, and the total scaling coefficient, where the total scaling coefficient = scaling coefficient of the input data × scaling coefficient of the weights ÷ scaling coefficient of the output feature map.
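The batch-normalization fusion and per-layer weight quantization steps recited in claim 1 can be illustrated as follows. This is a minimal NumPy sketch, not part of the claims: the function names `fuse_bn` and `quantize_weight_int8` are hypothetical, and a power-of-2 scaling coefficient `2**shift` is assumed (per claim 5).

```python
import numpy as np

def fuse_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters (mean, variance, scaling factor gamma,
    offset beta) into the preceding convolution's weight and bias,
    per output channel, as in the fusion step of claim 1."""
    scale = gamma / np.sqrt(var + eps)
    fused_w = weight * scale.reshape(-1, 1, 1, 1)  # weight: (out_ch, in_ch, kh, kw)
    fused_b = (bias - mean) * scale + beta
    return fused_w, fused_b

def quantize_weight_int8(fused_w, shift):
    """Quantize floating-point weights to INT8 using a power-of-2
    scaling coefficient 2**shift (the per-layer coefficient that the
    reinforcement-learning agent searches for)."""
    q = np.round(fused_w * (2.0 ** shift))
    return np.clip(q, -128, 127).astype(np.int8)
```

For example, a channel with gamma = 2, zero mean and unit variance doubles the fused weight; with shift = 3, a fused weight of 2.0 is stored as the INT8 value 16.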
2. The method of claim 1, wherein constructing the forward inference process comprises: inputting image data of INT8 type; performing a convolution calculation of the INT8 input with the INT8 weights of the first layer to obtain an INT32 type result; adding the INT32 type result to the INT32 type bias; dividing the sum by the total scaling coefficient to obtain INT8 output data; and inputting the output data to the next layer, where the same operations are performed.
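The integer-only layer computation of claim 2 can be sketched as below. This is an illustrative NumPy stand-in, not part of the claims: a 1×1 "convolution" is shown as a matrix product for brevity, the function name `int8_layer_forward` is hypothetical, and the division by the total scaling coefficient is assumed to be a right shift by `total_shift` bits (valid when the coefficient is a power of 2, per claim 5).

```python
import numpy as np

def int8_layer_forward(x_q, w_q, b_q, total_shift):
    """One layer of the INT8 forward inference of claim 2:
    INT8 x INT8 multiply with INT32 accumulation, add the INT32 bias,
    divide by the total scaling coefficient 2**total_shift (applied as
    an arithmetic right shift), and saturate back to the INT8 range."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)  # INT32 accumulator
    acc += b_q                                         # INT32 bias add
    out = acc >> total_shift                           # ÷ 2**total_shift
    return np.clip(out, -128, 127).astype(np.int8)     # saturate to INT8
```

The clipping step matters: without it, an INT32 accumulator value outside [-128, 127] would wrap around when cast to INT8 instead of saturating.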
3. The method of claim 1, wherein the floating point convolutional neural network model is a pure floating point convolutional neural network model.
4. The method according to claim 1, wherein the distribution information of the weights comprises a maximum value, a minimum value, a mean, a variance, a skewness, a kurtosis, a data amount and a calculation type of the floating-point weight data, wherein the maximum value, the minimum value, the mean, the variance and the data amount are obtained from the raw data, the skewness and the kurtosis are obtained by constructing a histogram of the weight data, and the calculation type comprises standard convolution, depthwise convolution and fully-connected calculation types.
5. The method of claim 1, wherein the scaling factor of the weight and the scaling factor of the output feature map are both powers of 2.
6. The method of claim 1, wherein the parameters of the batch processing operation include a mean, a variance, a scaling factor and an offset, and the parameters of the convolution operation include a weight.
7. The method of claim 1, wherein the reinforcement learning automatically searching for the optimal scaling coefficients comprises: constructing a reinforcement learning agent; inputting the weight distribution information and outputting a scaling coefficient for each layer; quantizing the floating point convolutional neural network model according to the scaling coefficients and obtaining the quantized accuracy on a test set; calculating the difference between the original accuracy of the floating point convolutional neural network model and the quantized accuracy; and feeding the difference back to the reinforcement learning agent for iterative parameter updates, until the reinforcement learning agent converges and the optimal weight scaling coefficients are obtained.
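The search loop of claim 7 can be illustrated with the toy stand-in below. This is not part of the claims and not the claimed agent: a random-search proposer is substituted for the reinforcement-learning policy purely to show the reward structure (reward = negative accuracy drop versus the floating-point baseline). The names `search_shifts` and `evaluate` are hypothetical; `evaluate(shifts)` is assumed to return the quantized model's accuracy on the test set, and a real agent would update its policy from the reward instead of sampling at random.

```python
import numpy as np

def search_shifts(n_layers, evaluate, baseline_acc, n_iter=100, seed=0):
    """Toy stand-in for the per-layer scaling-coefficient search:
    propose power-of-2 exponents for each layer, quantize and evaluate,
    and keep the proposal with the smallest accuracy drop from the
    floating-point baseline."""
    rng = np.random.default_rng(seed)
    best_shifts, best_drop = None, float("inf")
    for _ in range(n_iter):
        shifts = rng.integers(0, 8, size=n_layers)  # candidate exponents 0..7
        drop = baseline_acc - evaluate(shifts)      # reward would be -drop
        if drop < best_drop:                        # keep the best proposal
            best_shifts, best_drop = shifts, drop
    return best_shifts, best_drop
```

In the claimed method, the accuracy difference fed back as the reward drives iterative updates of the agent's parameters until convergence; the loop above only keeps the best sample seen so far.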
CN202110134308.5A 2021-02-01 2021-02-01 Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution Active CN112733964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110134308.5A CN112733964B (en) 2021-02-01 2021-02-01 Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110134308.5A CN112733964B (en) 2021-02-01 2021-02-01 Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution

Publications (2)

Publication Number Publication Date
CN112733964A true CN112733964A (en) 2021-04-30
CN112733964B CN112733964B (en) 2024-01-19

Family

ID=75595094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110134308.5A Active CN112733964B (en) 2021-02-01 2021-02-01 Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution

Country Status (1)

Country Link
CN (1) CN112733964B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255901A (en) * 2021-07-06 2021-08-13 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN113627593A (en) * 2021-08-04 2021-11-09 西北工业大学 Automatic quantification method of target detection model fast R-CNN
CN116611495A (en) * 2023-06-19 2023-08-18 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116662593A (en) * 2023-07-21 2023-08-29 湖南大学 FPGA-based full-pipeline medical hyperspectral image neural network classification method
CN117095271A (en) * 2023-10-20 2023-11-21 第六镜视觉科技(西安)有限公司 Target identification method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111612147A (en) * 2020-06-30 2020-09-01 上海富瀚微电子股份有限公司 Quantization method of deep convolutional network
CN111882040A (en) * 2020-07-30 2020-11-03 中原工学院 Convolutional neural network compression method based on channel number search
WO2020223856A1 (en) * 2019-05-05 2020-11-12 深圳市大疆创新科技有限公司 Data processing method and device based on convolutional neural network architecture

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020223856A1 (en) * 2019-05-05 2020-11-12 深圳市大疆创新科技有限公司 Data processing method and device based on convolutional neural network architecture
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 Method for fixed-point quantization of complete INT8 of convolutional neural network
CN111612147A (en) * 2020-06-30 2020-09-01 上海富瀚微电子股份有限公司 Quantization method of deep convolutional network
CN111882040A (en) * 2020-07-30 2020-11-03 中原工学院 Convolutional neural network compression method based on channel number search

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈俊保; 方向忠: "卷积神经网络的定点化研究" ("Research on Fixed-Point Quantization of Convolutional Neural Networks"), 信息技术 (Information Technology), no. 07

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255901A (en) * 2021-07-06 2021-08-13 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN113627593A (en) * 2021-08-04 2021-11-09 西北工业大学 Automatic quantification method of target detection model fast R-CNN
CN113627593B (en) * 2021-08-04 2024-06-04 西北工业大学 Automatic quantization method for target detection model Faster R-CNN
CN116611495A (en) * 2023-06-19 2023-08-18 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116611495B (en) * 2023-06-19 2024-03-01 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116662593A (en) * 2023-07-21 2023-08-29 湖南大学 FPGA-based full-pipeline medical hyperspectral image neural network classification method
CN116662593B (en) * 2023-07-21 2023-10-27 湖南大学 FPGA-based full-pipeline medical hyperspectral image neural network classification method
CN117095271A (en) * 2023-10-20 2023-11-21 第六镜视觉科技(西安)有限公司 Target identification method, device, electronic equipment and storage medium
CN117095271B (en) * 2023-10-20 2023-12-29 第六镜视觉科技(西安)有限公司 Target identification method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112733964B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN112733964B (en) Convolutional neural network quantization method for reinforcement learning automatic perception weight distribution
US11270187B2 (en) Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
CN113011570B (en) Facial expression recognition method adopting neural network compression system
CN110458084B (en) Face age estimation method based on inverted residual error network
US20230259739A1 (en) Image detection method and apparatus, computer-readable storage medium, and computer device
CN113469283A (en) Image classification method, and training method and device of image classification model
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN111860779A (en) Rapid automatic compression method for deep convolutional neural network
CN111950715A (en) 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN113569881A (en) Self-adaptive semantic segmentation method based on chain residual error and attention mechanism
Pietron et al. Retrain or not retrain?-efficient pruning methods of deep cnn networks
CN112906853A (en) Method, device, equipment and storage medium for automatic model optimization
CN113313250B (en) Neural network training method and system adopting mixed precision quantization and knowledge distillation
CN115631393A (en) Image processing method based on characteristic pyramid and knowledge guided knowledge distillation
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN114943335A (en) Layer-by-layer optimization method of ternary neural network
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN108665455B (en) Method and device for evaluating image significance prediction result
CN115170902B (en) Training method of image processing model
CN116090543A (en) Model compression method and device, computer readable medium and electronic equipment
CN114118357A (en) Retraining method and system for replacing activation function in computer visual neural network
Wang et al. Exploring quantization in few-shot learning
Furtuanpey et al. FrankenSplit: Efficient Neural Feature Compression with Shallow Variational Bottleneck Injection for Mobile Edge Computing
CN115719086B (en) Method for automatically obtaining hybrid precision quantized global optimization strategy
CN117807235B (en) Text classification method based on model internal feature distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant