CN112183726A - Neural network full-quantization method and system - Google Patents

Neural network full-quantization method and system

Info

Publication number
CN112183726A
CN112183726A (application CN202011043841.2A)
Authority
CN
China
Prior art keywords
scaling factor
neural network
data
output
input
Prior art date
Legal status
Pending
Application number
CN202011043841.2A
Other languages
Chinese (zh)
Inventor
崔鑫 (Cui Xin)
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202011043841.2A priority Critical patent/CN112183726A/en
Publication of CN112183726A publication Critical patent/CN112183726A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural network full-quantization method and system. The method comprises the following steps: in response to a trained neural network completing processing of a data set, statistically analyzing the distribution of each path of data in the neural network and determining a scaling factor for each path of data; and determining a quantization parameter for each layer of the neural network according to the scaling factor of each path of data. With this quantization scheme, the output of each convolution module in the neural network is integer data, which reduces the bandwidth requirements of the modules other than the convolution modules and increases the running speed of the network.

Description

Neural network full-quantization method and system
Technical Field
The invention relates to the field of neural network algorithms, in particular to a neural network full-quantization method and system.
Background
In existing neural networks, the output of a convolution module is floating-point data, so the other modules in the network require high bandwidth. Conversions between fixed-point and floating-point representations add time overhead to the convolution calculation, floating-point calculation is slow relative to fixed-point calculation, and the memory overhead and hardware overhead of a floating-point arithmetic unit are both larger than those of a fully quantized scheme. These overheads exist for both general-purpose processors and dedicated processors, albeit in different proportions, and they ultimately have a direct impact on cost and user experience.
Disclosure of Invention
In order to solve the technical problems, the invention provides a full-quantization method and system for a neural network.
The technical scheme for solving the technical problems is as follows:
a neural network full quantization method, comprising:
in response to a trained neural network completing processing of a data set, statistically analyzing the distribution of each path of data in the neural network and determining a scaling factor for each path of data;
and determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data specifically includes:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
Further, the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Further, the scaling factors include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in a convolution layer of the neural network;
for convolutional layers in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
Further, the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for the non-convolutional layer in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
In order to achieve the above object, the present invention further provides a full quantization system of neural network, including:
the statistical module is used for statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data in response to the trained neural network completing processing of the data set;
and the determining module is used for determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
Further, the statistical module is specifically configured to:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
Further, the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, the determining module is specifically configured to:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Further, the scaling factors include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in a convolution layer of the neural network;
for convolutional layers in the neural network, the determining module is specifically configured to:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
Further, the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for a non-convolutional layer in the neural network, the determining module is specifically configured to:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
The invention has the beneficial effects that: with this quantization scheme, the output of each convolution module in the neural network is integer data, which reduces the bandwidth requirements of the modules other than the convolution modules and increases the running speed of the network.
Drawings
Fig. 1 is a flowchart of a full quantization method of a neural network according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a full quantization method of a neural network according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s11, responding to the trained neural network to complete the processing of the data set, statistically analyzing the distribution of each path of data in the neural network, and determining the scaling factor of each path of data;
specifically, step S11 specifically includes:
For a neuron in each convolutional layer of the neural network, the scaling factor corresponding to the input data path with the largest absolute value among all data paths input to the neuron is used as the scaling factor for every data path input to that neuron. This satisfies the precision requirement of the quantized data on the path with the largest absolute value while also satisfying the precision requirements of the other data paths, thereby reducing the loss of accuracy.
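As an illustration only (not part of the original disclosure), the following Python sketch shows one way this per-neuron selection could be implemented; the symmetric scale formula and all names are assumptions introduced here.

```python
import numpy as np

def path_scale(data, bits=8):
    # Symmetric scale mapping float data onto a signed `bits`-bit integer range.
    # This particular formula is an assumption; the patent only requires some per-path scale.
    max_abs = np.max(np.abs(data))
    return (2 ** (bits - 1) - 1) / max_abs if max_abs > 0 else 1.0

def neuron_input_scale(input_paths, bits=8):
    # Use the scale of the input path whose data has the largest absolute value
    # as the shared scale for every path feeding this neuron.
    max_abs_per_path = [np.max(np.abs(p)) for p in input_paths]
    dominant = int(np.argmax(max_abs_per_path))
    return path_scale(input_paths[dominant], bits)

# Example: two input paths; the shared scale is set by the larger-magnitude path.
paths = [np.array([0.2, -0.7, 0.4]), np.array([3.1, -2.5, 0.9])]
shared_scale = neuron_input_scale(paths)
quantized_paths = [np.round(p * shared_scale).astype(np.int8) for p in paths]
```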
And S12, determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
In one embodiment of the present invention, the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network; for the first-layer neural network, step S12 specifically includes:
and S121, according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Specifically, the preprocessing of the original data is folded into the form: preprocessing output = A * original data + B. The preprocessing output is quantized to an integer, the preprocessing scaling factor is determined as quantization_scale, and the quantization result is then preprocessing output * quantization_scale = A * quantization_scale * original data + B * quantization_scale.
A * quantization_scale is then replaced by the integer data A_int together with a shift bit number, and B * quantization_scale is replaced by the integer data B_int together with a shift bit number.
In this embodiment, the preprocessing output for the data input to the convolutional neural network is quantized to an integer, which reduces the computational overhead of the system.
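A minimal sketch of this replacement follows; it is illustrative only, and the helper name, the 16-bit width, and the example values of A, B and the scale are assumptions rather than values from the patent.

```python
import numpy as np

def int_shift_approx(value, bitn=16):
    # Approximate a float multiplier as value ≈ v_int / 2**shift using BitN-bit integers.
    # The rounding choices here are assumptions made for this sketch.
    sign = -1 if value < 0 else 1
    mag = abs(value)
    bint = int(np.ceil(np.log2(mag)))
    shift = bitn - bint                      # right-shift amount applied after the multiply
    v_int = sign * int(round(mag * (2.0 ** shift)))
    return v_int, shift

# Preprocessing folded into pre_out = A * x + B and quantized with quant_scale.
A, B, quant_scale = 0.017, -1.0, 31.75       # illustrative values only
a_int, a_shift = int_shift_approx(A * quant_scale)
b_int, b_shift = int_shift_approx(B * quant_scale)

def quantized_preprocess(x):
    # Integer-only form of (A * x + B) * quant_scale using the two (int, shift) pairs.
    return ((int(x) * a_int) >> a_shift) + (b_int >> b_shift)

# Agrees with the float reference up to the rounding introduced by the shifts.
x = 200
print(quantized_preprocess(x), round((A * x + B) * quant_scale))
```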
As an embodiment of the present invention, the scaling factors further include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in the convolution layer of the neural network;
for the convolutional layer in the neural network, step S12 specifically includes:
s122, determining a floating-point convolution module scaling factor according to the first input scaling factor, the weight scaling factor and the first output scaling factor;
and S123, replacing the scaling factor of the floating-point convolution module with integer data and corresponding shift digits.
Specifically, quantization of convolution modules in a convolutional layer involves the following techniques and operations:
1. convolution/fully connected layer input and weight relationships for float version and quantized version of neural network:
in_float*in_scale=in_int
ker_float*ker_scale=ker_int
the convolution calculation result is:
multiply-accumulate(in_float, ker_float) * in_scale * ker_scale = multiply-accumulate(in_int, ker_int)
The present application requires that the data quantization format of the output result of the previous layer of convolution satisfies the quantization format of the input of the next layer of convolution.
In float mode, the output of the previous layer is the input of the next layer, and there are:
last_out_float=next_in_float
multiply-accumulate(last_in_float, last_ker_float) = next_in_float
multiply-accumulate(last_in_int, last_ker_int) / (last_in_scale * last_ker_scale) = next_in_int / next_in_scale
next_in_int = multiply-accumulate(last_in_int, last_ker_int) * (next_in_scale / last_in_scale / last_ker_scale)
Let S1 = next_in_scale / last_in_scale / last_ker_scale
Then S1 can be approximated as S1_int >> n. Note: n is the number of shift bits; a negative n denotes a left shift and a positive n a right shift.
2. Following item 1, if a bias is present, the float convolution output is: multiply-accumulate(in_float, ker_float) + bias_float
3. The approach for approximating S1 with S1_int and n is as follows (see the sketch after item 5 below). Suppose S1 is to be mapped to an integer of BitN bits:
Bint = ceil(log2(S1));    (Equation 3.1)
Bfrac = BitN - Bint;
n = -Bfrac;
accuracy = 2.0^n;
S1_int = ceil(S1 / accuracy)
4. Following item 1, if the convolution input is quantized with a single scale while the weights use a different scale for each output channel (one scale per channel), then S1 becomes a one-dimensional array S1[Cout], and in Equation 3.1 of item 3 the logarithm should be taken of max(abs(S1[Cout])).
5. If one convolution output ultimately feeds several other convolutions (ignoring operations between convolutional layers that do not change the data scale, such as pooling, ReLU activation, etc.), relationships such as the following hold:
last_float_out * scale1 = next_int_1
last_float_out * scale2 = next_int_2
last_float_out * scale3 = next_int_3
...
Assuming scale1 > scale2 > scale3, the convolution output is first quantized to next_int_1 with scale1, and next_int_1 is then converted to next_int_2 and next_int_3. With int16/int8 quantization, memory can be reduced by 50%/75% and the speed increased accordingly.
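The arithmetic of items 3 to 5 can be illustrated with the sketch below. It is an illustration only: BitN, the example scale values, and the float rescaling of next_int_1 to the smaller scales are assumptions made here for clarity.

```python
import numpy as np

def approx_scale(s1, bitn=16):
    # Items 3-4: approximate a (possibly per-output-channel) scale S1 by an integer array
    # s1_int and a right-shift amount r such that S1 ≈ s1_int / 2**r. The patent encodes
    # the shift with a sign convention; returning the right-shift amount is an assumption.
    s1 = np.atleast_1d(np.asarray(s1, dtype=np.float64))
    bint = int(np.ceil(np.log2(np.max(np.abs(s1)))))   # item 4: log of max(abs(S1[Cout]))
    r = bitn - bint                                     # Bfrac: number of fractional bits
    s1_int = np.ceil(s1 * (2.0 ** r)).astype(np.int64)
    return s1_int, r

# Item 1: S1 = next_in_scale / (last_in_scale * last_ker_scale), illustrative values only.
next_in_scale, last_in_scale, last_ker_scale = 50.0, 2.0, 4.0
s1_int, r = approx_scale(next_in_scale / (last_in_scale * last_ker_scale))

acc = 12345                                   # integer multiply-accumulate result of the layer
next_int_1 = (acc * int(s1_int[0])) >> r      # quantized output at the largest consumer scale

# Item 5: the same output feeds consumers with smaller scales, so next_int_1 is rescaled.
scale1, scale2, scale3 = 50.0, 25.0, 12.5
next_int_2 = int(round(next_int_1 * scale2 / scale1))
next_int_3 = int(round(next_int_1 * scale3 / scale1))
```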
The quantization of the convolution module includes quantization of the weights, input data, offsets and output data of the convolution module, and then a convolution module scaling factor needs to be determined according to the scaling factors of the weights, input data and output data.
Specifically, the weights are quantized per convolution-module kernel: the scale value of the weight scaling factor is determined for each kernel, and the weights are quantized to integers. The quantization formula is: quantized value = original float value * scale. The bit width of the quantization can be set by the user and defaults to 8 bits (i.e., int8 data).
For input data quantization, the range of the input data is analyzed and an input quantization scale value is determined such that int = float * scale; all paths of input data share the same scale value.
The bias scaling factor (scale) is obtained by multiplying the input scaling factor by the weight scaling factor, which allows the quantized bias to be added directly to the multiply-accumulate result of the quantized input and the quantized weights. The bias is quantized to a bit width set by the user, with a default of 32 bits (i.e., int32 data).
For the quantization of the output data, the output scale must be obtained (that is, the input scale of the connected downstream convolution module, looking through any intervening non-convolution modules). If this convolution module feeds several convolution modules, add modules, concat modules, etc. at the same time, the maximum value among the input scales of those convolutions is selected as the output scale of this convolution. If this is the last convolution layer of the network, the output scale is set to 1.0.
Finally, based on the input scaling factor, the weight scaling factor and the output scaling factor, the scaling factor of the convolution module, i.e. the scale by which the multiply-accumulate result (plus bias) is multiplied, is determined as output scaling factor / (input scaling factor * weight scaling factor). The floating-point convolution-module scaling factor is then approximated by an integer convolution-module scaling factor scale_int and a shift bit number (positive denotes right shift, negative denotes left shift).
The range of the integer scale_int is set by the user: a large range uses more bits in the calculation and gives higher precision but requires a wider multiplier, whereas a small range gives lower precision and requires a narrower multiplier. For the last convolution layer of the network, the convolution-module scaling factor remains a floating-point number in the convolution calculation and is not approximated by scale_int and a shift bit number.
In this embodiment, the scaling factors of the input, weights and output of each convolution module in a convolution layer of the neural network are combined into a single convolution-module scaling factor with a corresponding shift value, completing the integer quantization of the convolution layer and further reducing the computational overhead of the system.
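Putting the above together, the following sketch shows how the scaling factors of a convolution module could be combined into a single integer multiplier and shift. It is an illustration only; the matrix-multiply stand-in for convolution, the layer sizes, the random data and the symmetric-scale helper are all assumptions introduced here.

```python
import numpy as np

BITS_W, BITN = 8, 16                            # weight bit width and scale_int bit width (user-settable)

def sym_scale(x, bits):
    # Symmetric scale mapping float data onto a signed `bits`-bit range (an assumption).
    m = np.max(np.abs(x))
    return (2 ** (bits - 1) - 1) / m if m > 0 else 1.0

rng = np.random.default_rng(0)
x_f = rng.standard_normal((1, 64)).astype(np.float32)           # input data
w_f = (rng.standard_normal((64, 32)) * 0.1).astype(np.float32)  # weights, one "kernel" per column
b_f = (rng.standard_normal(32) * 0.01).astype(np.float32)       # bias

in_scale  = sym_scale(x_f, 8)                                            # shared input scale
w_scale   = np.array([sym_scale(w_f[:, c], BITS_W) for c in range(32)])  # per-kernel weight scale
b_scale   = in_scale * w_scale                                           # bias scale = input scale * weight scale
out_scale = 127.0 / np.max(np.abs(x_f @ w_f + b_f))                      # stand-in for the downstream input scale

x_q = np.round(x_f * in_scale).astype(np.int64)
w_q = np.round(w_f * w_scale).astype(np.int64)
b_q = np.round(b_f * b_scale).astype(np.int64)

# Convolution-module scale = out_scale / (in_scale * w_scale), approximated by an integer and a shift.
module_scale = out_scale / (in_scale * w_scale)
shift     = BITN - int(np.ceil(np.log2(np.max(module_scale))))
scale_int = np.ceil(module_scale * (2.0 ** shift)).astype(np.int64)

acc   = x_q @ w_q + b_q                          # integer multiply-accumulate plus bias
out_q = (acc * scale_int) >> shift               # integer output, already at out_scale
ref   = np.round((x_f @ w_f + b_f) * out_scale)  # float reference for comparison
```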
As an embodiment of the present invention, the scaling factor further comprises a second input scaling factor and a second output scaling factor for quantizing the input data and the output data of the non-convolutional layer of the neural network;
for the non-convolutional layer in the neural network, step S12 specifically includes:
s124, if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor from the multiple outputs of the non-convolution module in the non-convolution layer to the nearest convolution module in the lower module as the output scaling factor of the non-convolution module; if the lower module of the non-convolution module is not provided with the convolution module, setting the output scaling factor of the non-convolution module to be 1.0, namely the output data is in a floating point type;
and S125, if the input scaling factor and the output scaling factor of the non-convolution module are different, converting the input scaling factor of the non-convolution module into the output scaling factor.
Specifically, if the non-convolution module has only one input (for example a ReLU activation module or a pooling module), the input integer data are used directly in the corresponding calculation and the output scaling factor is the same as the input scaling factor. If the non-convolution module has multiple inputs, the inputs are first converted to the output scale required by the module and the corresponding calculation is then carried out. If the scale of a particular input of an add module or a concat (splicing) module is already the same as the output scale, that path does not need to be converted.
If the module has multiple output paths (or even a single path), the maximum of the input scales of the convolution modules fed by those paths is selected as the module's output scale. If a downstream module is not a convolution module, the maximum of that downstream module's output scale is used as this module's output scale, so that the precision requirement of the next convolution module is met; if that downstream module in turn has no convolution module after it, the search continues recursively toward later modules. If no convolution module is found by the time the recursion reaches the last layer of the network, the output scale is set to 1.0, i.e. the output data remain floating-point.
If the scale of the input data is inconsistent with the output scale, the input data are converted to the output scale with the formula: converted input data / output scale = original input data / input scale, i.e. converted input data = original input data * output scale / input scale. With this conversion, the input scale of a non-convolution module can be converted to its output scale in the manner described above, which handles cases in more complex models where the input scale and the output scale differ.
In this embodiment, the accuracy requirement of the model can be met, and the method is applicable to models of different complexity.
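As an illustration of this conversion (all names and example values are assumptions, and the float ratio used here would itself be approximated by an integer and a shift in a real deployment), an add module with two differently scaled inputs could be handled as follows:

```python
import numpy as np

def rescale(x_int, in_scale, out_scale):
    # converted / out_scale = original / in_scale  =>  converted = original * out_scale / in_scale
    return np.round(x_int.astype(np.float64) * out_scale / in_scale).astype(np.int32)

# Two inputs of an add module arrive with different scales; out_scale is the module's
# output scale, chosen from the downstream convolution modules as described above.
a_int, a_scale = np.array([100, -50, 25], dtype=np.int32), 64.0
b_int, b_scale = np.array([10, 20, -5], dtype=np.int32), 16.0
out_scale = 32.0

a_conv = rescale(a_int, a_scale, out_scale)   # a path already at out_scale would be left as-is
b_conv = rescale(b_int, b_scale, out_scale)
sum_int = a_conv + b_conv                     # integer addition performed directly at out_scale
```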
Optionally, in this embodiment, the method further includes:
and S13, continuing training the completely quantized neural network, and performing fine adjustment, so that the error of the quantized neural network meets the requirement.
The embodiment of the present invention provides a neural network full-quantization system; the function and principle of each module in the system have been explained above and are not repeated in detail below. The system includes:
the statistical module, used for statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data in response to the trained neural network completing processing of the data set;
and the determining module is used for determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
Optionally, in this embodiment, the statistical module is specifically configured to:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
Optionally, in this embodiment, the scaling factor includes a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, the determining module is specifically configured to:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Optionally, in this embodiment, the scaling factors include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in the convolution layer of the neural network;
for convolutional layers in the neural network, the determining module is specifically configured to:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
Optionally, in this embodiment, the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of the non-convolutional layer of the neural network;
for a non-convolutional layer in the neural network, the determining module is specifically configured to:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules and units in the above described system embodiment may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A neural network full-quantization method is characterized by comprising the following steps:
in response to a trained neural network completing processing of a data set, statistically analyzing the distribution of each path of data in the neural network and determining a scaling factor for each path of data;
and determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
2. The method according to claim 1, wherein statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data specifically comprises:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
3. The neural network full-quantization method according to claim 1, wherein the scaling factor comprises a preprocessing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
4. The neural network full-quantization method according to claim 1, wherein the scaling factors include a first input scaling factor, a weight scaling factor, an offset scaling factor and a first output scaling factor for quantizing the input data, the weights, the offsets and the output data of each convolution module in the convolution layer of the neural network;
for convolutional layers in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
5. The neural network full-quantization method according to any one of claims 1 to 4, wherein the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for the non-convolutional layer in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
6. A neural network full quantization system, comprising:
the statistical module is used for statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data in response to the trained neural network completing processing of the data set;
and the determining module is used for determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
7. The system according to claim 6, wherein the statistical module is specifically configured to:
for a neuron in each convolutional layer of the neural network, take the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
8. The neural network full-quantization system of claim 6, wherein the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, the determining module is specifically configured to:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
9. The neural network full quantization system of claim 6, wherein said scaling factors comprise a first input scaling factor, a weight scaling factor, an offset scaling factor and a first output scaling factor for quantizing the input data, weights, offsets and output data of each convolution module in a convolution layer of the neural network;
for convolutional layers in the neural network, the determining module is specifically configured to:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
10. The neural network full-quantization system according to any one of claims 6 to 9, wherein the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for a non-convolutional layer in the neural network, the determining module is specifically configured to:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
CN202011043841.2A 2020-09-28 2020-09-28 Neural network full-quantization method and system Pending CN112183726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011043841.2A CN112183726A (en) 2020-09-28 2020-09-28 Neural network full-quantization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011043841.2A CN112183726A (en) 2020-09-28 2020-09-28 Neural network full-quantization method and system

Publications (1)

Publication Number Publication Date
CN112183726A true CN112183726A (en) 2021-01-05

Family

ID=73947243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011043841.2A Pending CN112183726A (en) 2020-09-28 2020-09-28 Neural network full-quantization method and system

Country Status (1)

Country Link
CN (1) CN112183726A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121796A1 (en) * 2016-11-03 2018-05-03 Intel Corporation Flexible neural network accelerator and methods therefor
US20190138882A1 (en) * 2017-11-07 2019-05-09 Samsung Electronics Co., Ltd. Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
US20190042935A1 (en) * 2017-12-28 2019-02-07 Intel Corporation Dynamic quantization of neural networks
CN110929837A (en) * 2018-09-19 2020-03-27 北京搜狗科技发展有限公司 Neural network model compression method and device
WO2020160787A1 (en) * 2019-02-08 2020-08-13 Huawei Technologies Co., Ltd. Neural network quantization method using multiple refined quantized kernels for constrained hardware deployment
CN111695671A (en) * 2019-03-12 2020-09-22 北京地平线机器人技术研发有限公司 Method and device for training neural network and electronic equipment
US20200302299A1 (en) * 2019-03-22 2020-09-24 Qualcomm Incorporated Systems and Methods of Cross Layer Rescaling for Improved Quantization Performance
CN111105017A (en) * 2019-12-24 2020-05-05 北京旷视科技有限公司 Neural network quantization method and device and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255901A (en) * 2021-07-06 2021-08-13 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN116644796A (en) * 2023-07-27 2023-08-25 美智纵横科技有限责任公司 Network model quantization method, voice data processing method, device and chip
CN116644796B (en) * 2023-07-27 2023-10-03 美智纵横科技有限责任公司 Network model quantization method, voice data processing method, device and chip

Similar Documents

Publication Publication Date Title
Köster et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks
CN110880038B (en) System for accelerating convolution calculation based on FPGA and convolution neural network
CN110378468A (en) A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110852416B (en) CNN hardware acceleration computing method and system based on low-precision floating point data representation form
CN110969251B (en) Neural network model quantification method and device based on label-free data
CN110413255B (en) Artificial neural network adjusting method and device
US10491239B1 (en) Large-scale computations using an adaptive numerical format
CN112508125A (en) Efficient full-integer quantization method of image detection model
CN112183726A (en) Neural network full-quantization method and system
CN112990438B (en) Full-fixed-point convolution calculation method, system and equipment based on shift quantization operation
CN111696149A (en) Quantization method for stereo matching algorithm based on CNN
US20210170962A1 (en) Computing system, server and on-vehicle device
CN114418057A (en) Operation method of convolutional neural network and related equipment
CN110647974A (en) Network layer operation method and device in deep neural network
CN113449854A (en) Method and device for quantifying mixing precision of network model and computer storage medium
CN114139683A (en) Neural network accelerator model quantization method
Wu et al. Efficient dynamic fixed-point quantization of CNN inference accelerators for edge devices
CN112085175A (en) Data processing method and device based on neural network calculation
CN111383157A (en) Image processing method and device, vehicle-mounted operation platform, electronic equipment and system
CN112561050B (en) Neural network model training method and device
Kummer et al. Adaptive Precision Training (AdaPT): A dynamic quantized training approach for DNNs
Kalali et al. A power-efficient parameter quantization technique for CNN accelerators
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN116227563A (en) Convolutional neural network compression and acceleration method based on data quantization
US20220164665A1 (en) Method and device for compressing neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination