CN112183726A - Neural network full-quantization method and system - Google Patents

Neural network full-quantization method and system

Info

Publication number
CN112183726A
CN112183726A (application CN202011043841.2A)
Authority
CN
China
Prior art keywords
scaling factor
neural network
data
output
input
Prior art date
Legal status
Pending
Application number
CN202011043841.2A
Other languages
Chinese (zh)
Inventor
崔鑫 (Cui Xin)
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202011043841.2A priority Critical patent/CN112183726A/en
Publication of CN112183726A publication Critical patent/CN112183726A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural network full-quantization method and system. The method comprises the following steps: in response to a trained neural network completing processing of a data set, statistically analyzing the distribution of each path of data in the neural network and determining a scaling factor for each path of data; and determining a quantization parameter for each layer of the neural network according to the scaling factor of each path of data. With this quantization scheme, the output of each convolution module in the neural network is integer data, which reduces the bandwidth requirements of the modules other than the convolution modules and increases the running speed of the network.

Description

Neural network full-quantization method and system
Technical Field
The invention relates to the field of neural network algorithms, in particular to a neural network full-quantization method and system.
Background
In existing neural networks, the output of a convolution module is floating-point data, so the other modules in the network require high bandwidth. Conversions between fixed-point and floating-point representations add time overhead to the convolution calculation, floating-point calculation is slow relative to fixed-point calculation, and the memory overhead and hardware overhead of a floating-point arithmetic unit are both larger than those of a fully quantized scheme. These overheads exist for both general-purpose processors and dedicated processors, albeit in different proportions, and they ultimately have a direct impact on cost and user experience.
Disclosure of Invention
In order to solve the technical problems, the invention provides a full-quantization method and system for a neural network.
The technical scheme for solving the technical problems is as follows:
a neural network full quantization method, comprising:
in response to a trained neural network completing processing of a data set, statistically analyzing the distribution of each path of data in the neural network and determining a scaling factor for each path of data;
and determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data specifically includes:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
Further, the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Further, the scaling factors include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in a convolution layer of the neural network;
for convolutional layers in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
Further, the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for the non-convolutional layer in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
In order to achieve the above object, the present invention further provides a full quantization system of neural network, including:
the statistical module is used for statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data in response to the trained neural network completing processing of the data set;
and the determining module is used for determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
Further, the statistical module is specifically configured to:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
Further, the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, the determining module is specifically configured to:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Further, the scaling factors include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in a convolution layer of the neural network;
for convolutional layers in the neural network, the determining module is specifically configured to:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
Further, the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for a non-convolutional layer in the neural network, the determining module is specifically configured to:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
The invention has the beneficial effects that: with this quantization scheme, the output of each convolution module in the neural network is integer data, which reduces the bandwidth requirements of the modules other than the convolution modules and increases the running speed of the network.
Drawings
Fig. 1 is a flowchart of a full quantization method of a neural network according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a full quantization method of a neural network according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s11, responding to the trained neural network to complete the processing of the data set, statistically analyzing the distribution of each path of data in the neural network, and determining the scaling factor of each path of data;
specifically, step S11 specifically includes:
For a neuron in each convolutional layer of the neural network, the scaling factor corresponding to the input data path with the largest absolute value among all data paths input to the neuron is used as the scaling factor for every data path input to that neuron. This satisfies the precision requirement of the quantized data on the path with the largest absolute value while also satisfying the precision requirements of the other data paths, thereby reducing the loss of accuracy.
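As an illustration only (not part of the original disclosure), the following Python sketch shows one way this per-neuron selection could be implemented; the symmetric scale formula and all names are assumptions introduced here.

```python
import numpy as np

def path_scale(data, bits=8):
    # Symmetric scale mapping float data onto a signed `bits`-bit integer range.
    # This particular formula is an assumption; the patent only requires some per-path scale.
    max_abs = np.max(np.abs(data))
    return (2 ** (bits - 1) - 1) / max_abs if max_abs > 0 else 1.0

def neuron_input_scale(input_paths, bits=8):
    # Use the scale of the input path whose data has the largest absolute value
    # as the shared scale for every path feeding this neuron.
    max_abs_per_path = [np.max(np.abs(p)) for p in input_paths]
    dominant = int(np.argmax(max_abs_per_path))
    return path_scale(input_paths[dominant], bits)

# Example: two input paths; the shared scale is set by the larger-magnitude path.
paths = [np.array([0.2, -0.7, 0.4]), np.array([3.1, -2.5, 0.9])]
shared_scale = neuron_input_scale(paths)
quantized_paths = [np.round(p * shared_scale).astype(np.int8) for p in paths]
```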
And S12, determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
In one embodiment of the present invention, the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network; for the first-layer neural network, step S12 specifically includes:
and S121, according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Specifically, the preprocessing of the original data is folded into the form: preprocessing output = A * original data + B. The preprocessing output is quantized to an integer, the preprocessing scaling factor is determined as quantization_scale, and the quantization result is then preprocessing output * quantization_scale = A * quantization_scale * original data + B * quantization_scale.
A * quantization_scale is then replaced by the integer data A_int together with a shift bit number, and B * quantization_scale is replaced by the integer data B_int together with a shift bit number.
In this embodiment, the preprocessing output for the data input to the convolutional neural network is quantized to an integer, which reduces the computational overhead of the system.
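A minimal sketch of this replacement follows; it is illustrative only, and the helper name, the 16-bit width, and the example values of A, B and the scale are assumptions rather than values from the patent.

```python
import numpy as np

def int_shift_approx(value, bitn=16):
    # Approximate a float multiplier as value ≈ v_int / 2**shift using BitN-bit integers.
    # The rounding choices here are assumptions made for this sketch.
    sign = -1 if value < 0 else 1
    mag = abs(value)
    bint = int(np.ceil(np.log2(mag)))
    shift = bitn - bint                      # right-shift amount applied after the multiply
    v_int = sign * int(round(mag * (2.0 ** shift)))
    return v_int, shift

# Preprocessing folded into pre_out = A * x + B and quantized with quant_scale.
A, B, quant_scale = 0.017, -1.0, 31.75       # illustrative values only
a_int, a_shift = int_shift_approx(A * quant_scale)
b_int, b_shift = int_shift_approx(B * quant_scale)

def quantized_preprocess(x):
    # Integer-only form of (A * x + B) * quant_scale using the two (int, shift) pairs.
    return ((int(x) * a_int) >> a_shift) + (b_int >> b_shift)

# Agrees with the float reference up to the rounding introduced by the shifts.
x = 200
print(quantized_preprocess(x), round((A * x + B) * quant_scale))
```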
As an embodiment of the present invention, the scaling factors further include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in the convolution layer of the neural network;
for the convolutional layer in the neural network, step S12 specifically includes:
s122, determining a floating-point convolution module scaling factor according to the first input scaling factor, the weight scaling factor and the first output scaling factor;
and S123, replacing the scaling factor of the floating-point convolution module with integer data and corresponding shift digits.
Specifically, quantization of convolution modules in a convolutional layer involves the following techniques and operations:
1. convolution/fully connected layer input and weight relationships for float version and quantized version of neural network:
in_float*in_scale=in_int
ker_float*ker_scale=ker_int
the convolution calculation result is:
multiply-accumulate(in_float, ker_float) * in_scale * ker_scale = multiply-accumulate(in_int, ker_int)
The present application requires that the data quantization format of the output result of the previous layer of convolution satisfies the quantization format of the input of the next layer of convolution.
In float mode, the output of the previous layer is the input of the next layer, and there are:
last_out_float=next_in_float
multiply-accumulate(last_in_float, last_ker_float) = next_in_float
multiply-accumulate(last_in_int, last_ker_int) / (last_in_scale * last_ker_scale) = next_in_int / next_in_scale
next_in_int = multiply-accumulate(last_in_int, last_ker_int) * (next_in_scale / last_in_scale / last_ker_scale)
Let S1 = next_in_scale / last_in_scale / last_ker_scale
Then S1 can be approximated as S1_int >> n. Note: n is the number of shift bits; a negative n denotes a left shift and a positive n a right shift.
2. Following item 1, if a bias is present, the float convolution output is: multiply-accumulate(in_float, ker_float) + bias_float
3. The approach for approximating S1 with S1_int and n is as follows (see the sketch after item 5 below). Suppose S1 is to be mapped to an integer of BitN bits:
Bint = ceil(log2(S1));    (Equation 3.1)
Bfrac = BitN - Bint;
n = -Bfrac;
accuracy = 2.0^n;
S1_int = ceil(S1 / accuracy)
4. Following item 1, if the convolution input is quantized with a single scale while the weights use a different scale for each output channel (one scale per channel), then S1 becomes a one-dimensional array S1[Cout], and in Equation 3.1 of item 3 the logarithm should be taken of max(abs(S1[Cout])).
5. If one convolution output ultimately feeds several other convolutions (ignoring operations between convolutional layers that do not change the data scale, such as pooling, ReLU activation, etc.), relationships such as the following hold:
last_float_out * scale1 = next_int_1
last_float_out * scale2 = next_int_2
last_float_out * scale3 = next_int_3
...
Assuming scale1 > scale2 > scale3, the convolution output is first quantized to next_int_1 with scale1, and next_int_1 is then converted to next_int_2 and next_int_3. With int16/int8 quantization, memory can be reduced by 50%/75% and the speed increased accordingly.
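The arithmetic of items 3 to 5 can be illustrated with the sketch below. It is an illustration only: BitN, the example scale values, and the float rescaling of next_int_1 to the smaller scales are assumptions made here for clarity.

```python
import numpy as np

def approx_scale(s1, bitn=16):
    # Items 3-4: approximate a (possibly per-output-channel) scale S1 by an integer array
    # s1_int and a right-shift amount r such that S1 ≈ s1_int / 2**r. The patent encodes
    # the shift with a sign convention; returning the right-shift amount is an assumption.
    s1 = np.atleast_1d(np.asarray(s1, dtype=np.float64))
    bint = int(np.ceil(np.log2(np.max(np.abs(s1)))))   # item 4: log of max(abs(S1[Cout]))
    r = bitn - bint                                     # Bfrac: number of fractional bits
    s1_int = np.ceil(s1 * (2.0 ** r)).astype(np.int64)
    return s1_int, r

# Item 1: S1 = next_in_scale / (last_in_scale * last_ker_scale), illustrative values only.
next_in_scale, last_in_scale, last_ker_scale = 50.0, 2.0, 4.0
s1_int, r = approx_scale(next_in_scale / (last_in_scale * last_ker_scale))

acc = 12345                                   # integer multiply-accumulate result of the layer
next_int_1 = (acc * int(s1_int[0])) >> r      # quantized output at the largest consumer scale

# Item 5: the same output feeds consumers with smaller scales, so next_int_1 is rescaled.
scale1, scale2, scale3 = 50.0, 25.0, 12.5
next_int_2 = int(round(next_int_1 * scale2 / scale1))
next_int_3 = int(round(next_int_1 * scale3 / scale1))
```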
The quantization of the convolution module includes quantization of the weights, input data, offsets and output data of the convolution module, and then a convolution module scaling factor needs to be determined according to the scaling factors of the weights, input data and output data.
Specifically, the weights are quantized per convolution-module kernel: the scale value of the weight scaling factor is determined for each kernel, and the weights are quantized to integers. The quantization formula is: quantized value = original float value * scale. The bit width of the quantization can be set by the user and defaults to 8 bits (i.e., int8 data).
For input data quantization, the range of the input data is analyzed and an input quantization scale value is determined such that int = float * scale; all paths of input data share the same scale value.
The bias scaling factor (scale) is obtained by multiplying the input scaling factor by the weight scaling factor, which allows the quantized bias to be added directly to the multiply-accumulate result of the quantized input and the quantized weights. The bias is quantized to a bit width set by the user, with a default of 32 bits (i.e., int32 data).
For the quantization of the output data, the output scale must be obtained (that is, the input scale of the connected downstream convolution module, looking through any intervening non-convolution modules). If this convolution module feeds several convolution modules, add modules, concat modules, etc. at the same time, the maximum value among the input scales of those convolutions is selected as the output scale of this convolution. If this is the last convolution layer of the network, the output scale is set to 1.0.
Finally, based on the input scaling factor, the weight scaling factor and the output scaling factor, the scaling factor of the convolution module, i.e. the scale by which the multiply-accumulate result (plus bias) is multiplied, is determined as output scaling factor / (input scaling factor * weight scaling factor). The floating-point convolution-module scaling factor is then approximated by an integer convolution-module scaling factor scale_int and a shift bit number (positive denotes right shift, negative denotes left shift).
The range of the integer scale_int is set by the user: a large range uses more bits in the calculation and gives higher precision but requires a wider multiplier, whereas a small range gives lower precision and requires a narrower multiplier. For the last convolution layer of the network, the convolution-module scaling factor remains a floating-point number in the convolution calculation and is not approximated by scale_int and a shift bit number.
In this embodiment, the scaling factors of the input, weights and output of each convolution module in a convolution layer of the neural network are combined into a single convolution-module scaling factor with a corresponding shift value, completing the integer quantization of the convolution layer and further reducing the computational overhead of the system.
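Putting the above together, the following sketch shows how the scaling factors of a convolution module could be combined into a single integer multiplier and shift. It is an illustration only; the matrix-multiply stand-in for convolution, the layer sizes, the random data and the symmetric-scale helper are all assumptions introduced here.

```python
import numpy as np

BITS_W, BITN = 8, 16                            # weight bit width and scale_int bit width (user-settable)

def sym_scale(x, bits):
    # Symmetric scale mapping float data onto a signed `bits`-bit range (an assumption).
    m = np.max(np.abs(x))
    return (2 ** (bits - 1) - 1) / m if m > 0 else 1.0

rng = np.random.default_rng(0)
x_f = rng.standard_normal((1, 64)).astype(np.float32)           # input data
w_f = (rng.standard_normal((64, 32)) * 0.1).astype(np.float32)  # weights, one "kernel" per column
b_f = (rng.standard_normal(32) * 0.01).astype(np.float32)       # bias

in_scale  = sym_scale(x_f, 8)                                            # shared input scale
w_scale   = np.array([sym_scale(w_f[:, c], BITS_W) for c in range(32)])  # per-kernel weight scale
b_scale   = in_scale * w_scale                                           # bias scale = input scale * weight scale
out_scale = 127.0 / np.max(np.abs(x_f @ w_f + b_f))                      # stand-in for the downstream input scale

x_q = np.round(x_f * in_scale).astype(np.int64)
w_q = np.round(w_f * w_scale).astype(np.int64)
b_q = np.round(b_f * b_scale).astype(np.int64)

# Convolution-module scale = out_scale / (in_scale * w_scale), approximated by an integer and a shift.
module_scale = out_scale / (in_scale * w_scale)
shift     = BITN - int(np.ceil(np.log2(np.max(module_scale))))
scale_int = np.ceil(module_scale * (2.0 ** shift)).astype(np.int64)

acc   = x_q @ w_q + b_q                          # integer multiply-accumulate plus bias
out_q = (acc * scale_int) >> shift               # integer output, already at out_scale
ref   = np.round((x_f @ w_f + b_f) * out_scale)  # float reference for comparison
```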
As an embodiment of the present invention, the scaling factor further comprises a second input scaling factor and a second output scaling factor for quantizing the input data and the output data of the non-convolutional layer of the neural network;
for the non-convolutional layer in the neural network, step S12 specifically includes:
s124, if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor from the multiple outputs of the non-convolution module in the non-convolution layer to the nearest convolution module in the lower module as the output scaling factor of the non-convolution module; if the lower module of the non-convolution module is not provided with the convolution module, setting the output scaling factor of the non-convolution module to be 1.0, namely the output data is in a floating point type;
and S125, if the input scaling factor and the output scaling factor of the non-convolution module are different, converting the input scaling factor of the non-convolution module into the output scaling factor.
Specifically, if the non-convolution module has only one input (for example a ReLU activation module or a pooling module), the input integer data are used directly in the corresponding calculation and the output scaling factor is the same as the input scaling factor. If the non-convolution module has multiple inputs, the inputs are first converted to the output scale required by the module and the corresponding calculation is then carried out. If the scale of a particular input of an add module or a concat (splicing) module is already the same as the output scale, that path does not need to be converted.
If the module has multiple output paths (or even a single path), the maximum of the input scales of the convolution modules fed by those paths is selected as the module's output scale. If a downstream module is not a convolution module, the maximum of that downstream module's output scale is used as this module's output scale, so that the precision requirement of the next convolution module is met; if that downstream module in turn has no convolution module after it, the search continues recursively toward later modules. If no convolution module is found by the time the recursion reaches the last layer of the network, the output scale is set to 1.0, i.e. the output data remain floating-point.
If the scale of the input data is inconsistent with the output scale, the input data are converted to the output scale with the formula: converted input data / output scale = original input data / input scale, i.e. converted input data = original input data * output scale / input scale. With this conversion, the input scale of a non-convolution module can be converted to its output scale in the manner described above, which handles cases in more complex models where the input scale and the output scale differ.
In this embodiment, the accuracy requirement of the model can be met, and the method is applicable to models of different complexity.
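As an illustration of this conversion (all names and example values are assumptions, and the float ratio used here would itself be approximated by an integer and a shift in a real deployment), an add module with two differently scaled inputs could be handled as follows:

```python
import numpy as np

def rescale(x_int, in_scale, out_scale):
    # converted / out_scale = original / in_scale  =>  converted = original * out_scale / in_scale
    return np.round(x_int.astype(np.float64) * out_scale / in_scale).astype(np.int32)

# Two inputs of an add module arrive with different scales; out_scale is the module's
# output scale, chosen from the downstream convolution modules as described above.
a_int, a_scale = np.array([100, -50, 25], dtype=np.int32), 64.0
b_int, b_scale = np.array([10, 20, -5], dtype=np.int32), 16.0
out_scale = 32.0

a_conv = rescale(a_int, a_scale, out_scale)   # a path already at out_scale would be left as-is
b_conv = rescale(b_int, b_scale, out_scale)
sum_int = a_conv + b_conv                     # integer addition performed directly at out_scale
```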
Optionally, in this embodiment, the method further includes:
and S13, continuing training the completely quantized neural network, and performing fine adjustment, so that the error of the quantized neural network meets the requirement.
The embodiment of the present invention provides a neural network full-quantization system; the function and principle of each module in the system have been explained above and are not repeated in detail below. The system includes:
the statistical module, used for statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data in response to the trained neural network completing processing of the data set;
and the determining module is used for determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
Optionally, in this embodiment, the statistical module is specifically configured to:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
Optionally, in this embodiment, the scaling factor includes a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, the determining module is specifically configured to:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
Optionally, in this embodiment, the scaling factors include a first input scaling factor, a weight scaling factor, a bias scaling factor, and a first output scaling factor for quantizing the input data, weights, biases, and output data of each convolution module in the convolution layer of the neural network;
for convolutional layers in the neural network, the determining module is specifically configured to:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
Optionally, in this embodiment, the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of the non-convolutional layer of the neural network;
for a non-convolutional layer in the neural network, the determining module is specifically configured to:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules and units in the above described system embodiment may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A neural network full-quantization method is characterized by comprising the following steps:
in response to a trained neural network completing processing of a data set, statistically analyzing the distribution of each path of data in the neural network and determining a scaling factor for each path of data;
and determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
2. The method according to claim 1, wherein statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data specifically comprises:
for a neuron in each convolutional layer of the neural network, taking the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
3. The neural network full-quantization method according to claim 1, wherein the scaling factor comprises a preprocessing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
4. The neural network full-quantization method according to claim 1, wherein the scaling factors include a first input scaling factor, a weight scaling factor, an offset scaling factor and a first output scaling factor for quantizing the input data, the weights, the offsets and the output data of each convolution module in the convolution layer of the neural network;
for convolutional layers in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
5. The neural network full-quantization method according to any one of claims 1 to 4, wherein the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for the non-convolutional layer in the neural network, determining a quantization parameter of each layer in the neural network according to the scaling factor of each path of data specifically includes:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
6. A neural network full quantization system, comprising:
the statistical module is used for statistically analyzing the distribution of each path of data in the neural network and determining the scaling factor of each path of data in response to the trained neural network completing processing of the data set;
and the determining module is used for determining the quantization parameter of each layer in the neural network according to the scaling factor of each path of data.
7. The system according to claim 6, wherein the statistical module is specifically configured to:
for a neuron in each convolutional layer of the neural network, take the scaling factor corresponding to the input data path with the largest absolute value among the data paths input to the neuron as the scaling factor for every data path input to the neuron.
8. The neural network full-quantization system of claim 6, wherein the scaling factor comprises a pre-processing scaling factor for quantizing raw data input to the neural network;
for the first layer neural network, the determining module is specifically configured to:
and according to the preprocessing scaling factor, determining integer data and a corresponding shift bit number for replacing the preprocessing scaling factor.
9. The neural network full quantization system of claim 6, wherein said scaling factors comprise a first input scaling factor, a weight scaling factor, an offset scaling factor and a first output scaling factor for quantizing the input data, weights, offsets and output data of each convolution module in a convolution layer of the neural network;
for convolutional layers in the neural network, the determining module is specifically configured to:
determining a floating-point convolution module scaling factor based on the first input scaling factor, the weight scaling factor, and the first output scaling factor;
and replacing the scaling factor of the convolution module of the floating point type by integer data and corresponding shift bit number.
10. The neural network full-quantization system according to any one of claims 6 to 9, wherein the scaling factor comprises a second input scaling factor and a second output scaling factor for quantizing input data and output data of non-convolutional layers of the neural network;
for a non-convolutional layer in the neural network, the determining module is specifically configured to:
if the input scaling factor and the output scaling factor of the non-convolution module are the same, selecting the largest second output scaling factor among the multiple output paths from the non-convolution module in the non-convolution layer to the nearest convolution module among its downstream modules as the output scaling factor of the non-convolution module; if no convolution module exists among the downstream modules of the non-convolution module, setting the output scaling factor of the non-convolution module to 1.0, i.e. the output data are floating-point;
if the input scaling factor and the output scaling factor of the non-convolution module differ, converting the input scaling factor of the non-convolution module into the output scaling factor.
CN202011043841.2A 2020-09-28 2020-09-28 Neural network full-quantization method and system Pending CN112183726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011043841.2A CN112183726A (en) 2020-09-28 2020-09-28 Neural network full-quantization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011043841.2A CN112183726A (en) 2020-09-28 2020-09-28 Neural network full-quantization method and system

Publications (1)

Publication Number Publication Date
CN112183726A true CN112183726A (en) 2021-01-05

Family

ID=73947243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011043841.2A Pending CN112183726A (en) 2020-09-28 2020-09-28 Neural network full-quantization method and system

Country Status (1)

Country Link
CN (1) CN112183726A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121796A1 (en) * 2016-11-03 2018-05-03 Intel Corporation Flexible neural network accelerator and methods therefor
US20190138882A1 (en) * 2017-11-07 2019-05-09 Samsung Electronics Co., Ltd. Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
US20190042935A1 (en) * 2017-12-28 2019-02-07 Intel Corporation Dynamic quantization of neural networks
CN110929837A (en) * 2018-09-19 2020-03-27 北京搜狗科技发展有限公司 Neural network model compression method and device
WO2020160787A1 (en) * 2019-02-08 2020-08-13 Huawei Technologies Co., Ltd. Neural network quantization method using multiple refined quantized kernels for constrained hardware deployment
CN111695671A (en) * 2019-03-12 2020-09-22 北京地平线机器人技术研发有限公司 Method and device for training neural network and electronic equipment
US20200302299A1 (en) * 2019-03-22 2020-09-24 Qualcomm Incorporated Systems and Methods of Cross Layer Rescaling for Improved Quantization Performance
CN111105017A (en) * 2019-12-24 2020-05-05 北京旷视科技有限公司 Neural network quantization method and device and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255901A (en) * 2021-07-06 2021-08-13 上海齐感电子信息科技有限公司 Real-time quantization method and real-time quantization system
CN116644796A (en) * 2023-07-27 2023-08-25 美智纵横科技有限责任公司 Network model quantization method, voice data processing method, device and chip
CN116644796B (en) * 2023-07-27 2023-10-03 美智纵横科技有限责任公司 Network model quantization method, voice data processing method, device and chip

Similar Documents

Publication Publication Date Title
Köster et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks
CN110880038B (en) System for accelerating convolution calculation based on FPGA and convolution neural network
CN110378468A (en) A kind of neural network accelerator quantified based on structuring beta pruning and low bit
CN110852416B (en) CNN hardware acceleration computing method and system based on low-precision floating point data representation form
CN110969251B (en) Neural network model quantification method and device based on label-free data
CN110413255B (en) Artificial neural network adjusting method and device
US10491239B1 (en) Large-scale computations using an adaptive numerical format
CN112508125A (en) Efficient full-integer quantization method of image detection model
CN112183726A (en) Neural network full-quantization method and system
CN112990438B (en) Full-fixed-point convolution calculation method, system and equipment based on shift quantization operation
CN111696149A (en) Quantization method for stereo matching algorithm based on CNN
US20210170962A1 (en) Computing system, server and on-vehicle device
CN114418057A (en) Operation method of convolutional neural network and related equipment
CN110647974A (en) Network layer operation method and device in deep neural network
CN113449854A (en) Method and device for quantifying mixing precision of network model and computer storage medium
CN114139683A (en) Neural network accelerator model quantization method
Wu et al. Efficient dynamic fixed-point quantization of CNN inference accelerators for edge devices
CN112085175A (en) Data processing method and device based on neural network calculation
CN111383157A (en) Image processing method and device, vehicle-mounted operation platform, electronic equipment and system
CN112561050B (en) Neural network model training method and device
Kummer et al. Adaptive Precision Training (AdaPT): A dynamic quantized training approach for DNNs
Kalali et al. A power-efficient parameter quantization technique for CNN accelerators
CN113283591B (en) Efficient convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN116227563A (en) Convolutional neural network compression and acceleration method based on data quantization
US20220164665A1 (en) Method and device for compressing neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination