CN114444688A - Neural network quantization method, apparatus, device, storage medium, and program product - Google Patents

Neural network quantization method, apparatus, device, storage medium, and program product

Info

Publication number
CN114444688A
Authority
CN
China
Prior art keywords
functional layer
point format
fixed point
quantization parameter
format
Prior art date
Legal status
Pending
Application number
CN202210044661.9A
Other languages
Chinese (zh)
Inventor
易松松
吴腾
Current Assignee
Bigo Technology Pte Ltd
Original Assignee
Bigo Technology Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Pte Ltd filed Critical Bigo Technology Pte Ltd
Priority to CN202210044661.9A
Publication of CN114444688A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a neural network quantization method, apparatus, device, storage medium, and program product, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring first quantization parameters of a plurality of functional layers of a neural network, wherein the first quantization parameters are used for quantizing data in a floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by a first processing unit; for a first functional layer in the plurality of functional layers, converting a first quantization parameter of the first functional layer into a second quantization parameter, the second quantization parameter being used to quantize data in the floating point format into data in a second fixed point format, the second fixed point format being a format adopted by a second processing unit, and the first processing unit and the second processing unit being two different processing units; and, in the case that the second processing unit is adopted to perform inference on the first functional layer, quantizing the related data involved in the inference process of the first functional layer based on the second quantization parameter.

Description

Neural network quantization method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a quantization method, apparatus, device, storage medium, and program product for a neural network.
Background
Quantization of a neural network is a form of model compression that helps reduce the size of the model and increase its inference speed.
A general-purpose CPU (Central Processing Unit) supports fixed-point calculation with a minimum of 8 bits, so when the CPU serves as the computation back end of a neural network, the parameters of the neural network are generally quantized to an 8-bit representation.
However, this quantization approach only supports inference of the neural network on the CPU and cannot meet the requirements of more computing scenarios.
Disclosure of Invention
The embodiment of the application provides a neural network quantization method, apparatus, device, storage medium, and program product. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a quantization method of a neural network, the method including:
acquiring first quantization parameters of a plurality of functional layers of a neural network, wherein the first quantization parameters are used for quantizing data in a floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by a first processing unit;
for a first functional layer of the plurality of functional layers, converting a first quantization parameter of the first functional layer into a second quantization parameter, the second quantization parameter being used to quantize the data in the floating point format into data in a second fixed point format, the second fixed point format being a format employed by a second processing unit, and the first processing unit and the second processing unit being two different processing units;
and, in the case that the second processing unit is adopted to perform inference on the first functional layer, quantizing the related data involved in the inference process of the first functional layer based on the second quantization parameter.
According to an aspect of an embodiment of the present application, there is provided an apparatus for quantizing a neural network, the apparatus including:
the parameter acquisition module is used for acquiring first quantization parameters of a plurality of functional layers of the neural network, wherein the first quantization parameters are used for quantizing data in a floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by the first processing unit;
a parameter conversion module, configured to, for a first functional layer of the multiple functional layers, convert a first quantization parameter of the first functional layer into a second quantization parameter, where the second quantization parameter is used to quantize data in the floating-point format into data in a second fixed-point format, where the second fixed-point format is a format used by a second processing unit, and the first processing unit and the second processing unit are two different processing units;
and the data quantization module is used for quantizing the related data involved in the inference process of the first functional layer based on the second quantization parameter under the condition that the second processing unit is adopted to infer the first functional layer.
According to an aspect of embodiments of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein a computer program, the computer program being loaded and executed by the processor to implement the above-mentioned neural network quantization method.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program, which is loaded and executed by a processor to implement the above-mentioned quantization method of a neural network.
According to an aspect of embodiments of the present application, there is provided a computer program product, which includes computer instructions stored in a computer-readable storage medium, and a processor reads and executes the computer instructions from the computer-readable storage medium to implement the quantization method of the neural network.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the quantization parameters of the neural network are converted from quantization parameters suitable for calculation by a first processing unit (such as a CPU) into quantization parameters suitable for calculation by a second processing unit (such as a DSP), so that the quantization parameters of the neural network can be adapted across different types of processing units, for example between a CPU and a DSP, thereby meeting the requirement of being compatible with a plurality of different types of processing units at the same time.
Drawings
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a neural network quantization method provided by one embodiment of the present application;
FIG. 3 is a flowchart of an inference process of a neural network provided by one embodiment of the present application;
FIG. 4 is a schematic diagram of conversion from the uint8 format to the int8 format according to one embodiment of the present application;
FIG. 5 is a schematic diagram of conversion from the uint8 format to the int8 format according to another embodiment of the present application;
FIG. 6 is a block diagram of a quantization apparatus of a neural network provided in an embodiment of the present application;
fig. 7 is a block diagram of a quantization apparatus of a neural network according to another embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before describing the embodiments of the present application, some technical terms referred to in the present application are defined and explained.
1. Neural network (neural network): a computational mathematical model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Depending on the complexity of the system, a neural network processes information by adjusting the interconnections among a large number of internal nodes. The neural network may include a plurality of functional layers. Optionally, the functional layers include, but are not limited to, convolution (convolution) layers, convolution-like layers (e.g., deconvolution layers, fully-connected layers, etc.), pooling (pooling) layers, concatenation (concat) layers, element-wise operation (eltwise) layers, binary (binary) layers, scale (scale) layers, activation (relu) layers, and the like.
2. Quantization (quantization): quantization refers to the process of approximating a continuous value (or a large number of possible discrete values) by a finite number of (or fewer) discrete values. Quantization of a neural network is a form of model compression, that is, a weight value (weight) or an activation value (activation) represented with a high bit width (for example, float32, a 32-bit floating point number) is approximately represented with a lower bit width (for example, int8, an 8-bit signed integer), which numerically discretizes a continuous value. Inverse quantization is the inverse process of quantization, i.e., the process of converting fixed point numbers back into floating point numbers.
3. Processing unit (processing unit): also called a processor, refers to a hardware unit in a computer device responsible for data computation. Alternatively, the Processing Unit includes, but is not limited to, a CPU, a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an NPU (Neural-network Processing Unit), and the like.
Refer to fig. 1, which illustrates a schematic diagram of an environment for implementing an embodiment of the present application. The embodiment implementation environment may include: an offline computing device 10 and an online computing device 20.
The quantization of a neural network can be divided into an offline stage and an online stage. In the offline stage, after training of the neural network is completed, some network parameters obtained by training (such as weight values) are quantized. This process can be performed offline, i.e., without taking into account the data actually used in the online inference stage. The online stage is the stage in which the trained neural network performs online inference on input data. In the online inference stage, input data and output data, which may also be referred to as activation values, of the various functional layers of the neural network are generated, and these activation values need to be quantized.
The offline computing device 10 is mainly responsible for some calculation processes of the offline stage, and the online computing device 20 is mainly responsible for some calculation processes of the online stage. The offline computing device 10 and the online computing device 20 may each be a computer device with data storage and calculation capabilities, such as a computer, a mobile phone, a tablet computer, a wearable device, a smart home device, a vehicle-mounted terminal, or a server, which is not limited in the present application.
The offline computing device 10 and the online computing device 20 may be two independent devices or may be the same device, which is not limited in the present application.
For convenience of explanation, in the method embodiments below, only the execution subject of each step is explained as a computer device. It is to be understood that the computer device may be the online computing device 20 in the implementation environment shown in FIG. 1.
Referring to fig. 2, a flowchart of a quantization method of a neural network according to an embodiment of the present application is shown. The method can comprise at least one of the following steps (210-230):
step 210, obtaining first quantization parameters of a plurality of functional layers of the neural network, where the first quantization parameters are used to quantize the data in the floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by the first processing unit.
The neural network may comprise a plurality of functional layers. For example, a neural network may comprise an input layer, at least one hidden layer and an output layer, wherein the at least one hidden layer may in turn comprise at least one of: at least one convolutional layer, at least one convolution-like layer (e.g., a deconvolution layer, a fully-connected layer, etc.), at least one pooling layer, at least one concatenation layer, at least one element-wise operation layer, at least one binary layer, at least one scaling layer, at least one activation layer, etc.
A first quantization parameter for a plurality of functional layers of the neural network may be determined at an offline stage. In the online phase, the computer device obtains first quantization parameters of a plurality of functional layers of the neural network determined in the offline phase. Optionally, the neural network is quantized in the whole process, that is, all functional layers of the neural network have corresponding first quantization parameters.
Optionally, the first processing unit is a CPU, and the first fixed point format may be the int8 (8-bit signed integer) format. A CPU generally stores and calculates data in the int8 format, and the CPU is typically the main processing unit of the online computing device, so the first quantization parameters in the int8 format of each functional layer of the neural network can be determined in the offline stage.
Illustratively, the quantization formula from floating point format to fixed point format is as follows: Q = R/S + Z; the inverse quantization formula from fixed point format to floating point format is as follows: R = (Q - Z) * S; wherein R denotes the real floating point value, Q denotes the quantized fixed point value, Z denotes the quantized fixed point value corresponding to the floating point value 0, and S denotes the smallest scale that can be represented after fixed point quantization, also called the scale factor (scale). S = (Rmax - Rmin)/(Qmax - Qmin), Z = Qmax - Rmax/S; wherein Rmax represents the maximum floating point value, Rmin represents the minimum floating point value, Qmax represents the maximum fixed point value, and Qmin represents the minimum fixed point value.
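For illustration only, the following Python sketch applies the quantization and inverse quantization formulas above; the function names are hypothetical and do not come from the patent text:

    def compute_quant_params(r_min, r_max, q_min, q_max):
        s = (r_max - r_min) / (q_max - q_min)   # S = (Rmax - Rmin) / (Qmax - Qmin)
        z = q_max - r_max / s                   # Z = Qmax - Rmax / S
        return s, z

    def quantize(r, s, z, q_min, q_max):
        q = round(r / s + z)                    # Q = R / S + Z
        return max(q_min, min(q_max, q))        # clamp to the fixed point range

    def dequantize(q, s, z):
        return (q - z) * s                      # R = (Q - Z) * S

For example, for a floating point range of [-1.0, 1.0] and the int8 range [-128, 127], this gives S = 2/255 (about 0.0078) and Z = -0.5, which would be rounded to an integer zero point in practice.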
In an embodiment of the present application, the quantization parameter includes a scaling factor (scale). Each functional layer of the neural network may have a corresponding quantization parameter. Moreover, since the value ranges of the parameters of the functional layers are different, the corresponding quantization parameters may be different.
For the first class of functional layers, i.e., functional layers that have only activation values and no weight values, such as pooling layers, concatenation layers, element-wise operation layers, binary layers, scaling layers, and activation layers, the quantization parameters include an input quantization parameter and an output quantization parameter. The input quantization parameter refers to the quantization parameter corresponding to the input data (or input activation values) of the functional layer, and the output quantization parameter refers to the quantization parameter corresponding to the output data (or output activation values) of the functional layer.
For the second class of functional layers, i.e., functional layers that have weight values in addition to activation values, such as convolutional layers or convolution-like layers, the quantization parameters may include: an input quantization parameter, an output quantization parameter, and a weight value quantization parameter. The input quantization parameter refers to the quantization parameter corresponding to the input data (or input activation values) of the functional layer, the output quantization parameter refers to the quantization parameter corresponding to the output data (or output activation values) of the functional layer, and the weight value quantization parameter refers to the quantization parameter corresponding to the weight values of the functional layer.
Optionally, the quantization parameters corresponding to a functional layer further include a fused uniform quantization parameter. For the first class of functional layers, the corresponding fused uniform quantization parameter is calculated from the input quantization parameter and the output quantization parameter, for example: fused scale = output scale / input scale; wherein fused scale represents the fused uniform quantization parameter, input scale represents the input quantization parameter, and output scale represents the output quantization parameter. For the second class of functional layers, the corresponding fused uniform quantization parameter is calculated from the input quantization parameter, the output quantization parameter, and the weight value quantization parameter, for example: fused scale = output scale / (input scale * weight scale); wherein fused scale represents the fused uniform quantization parameter, input scale represents the input quantization parameter, output scale represents the output quantization parameter, and weight scale represents the weight value quantization parameter. By calculating the fused uniform quantization parameter in advance in the offline stage, it can be used directly in the inference process of the subsequent online stage, without the input quantization parameter, the output quantization parameter, the weight value quantization parameter, and the like participating in the calculation separately, which reduces the calculation amount of the online stage and shortens the time consumed by online inference.
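For illustration only, a minimal Python sketch of how the fused uniform quantization parameter might be pre-computed offline; the function names are hypothetical and not part of the patent text:

    def fused_scale_activation_only(input_scale, output_scale):
        # first class of functional layers (activation values only, no weight values)
        return output_scale / input_scale

    def fused_scale_with_weights(input_scale, output_scale, weight_scale):
        # second class of functional layers (weight values in addition to activation values)
        return output_scale / (input_scale * weight_scale)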
In the embodiment of the present application, a manner of determining the quantization parameter of each functional layer is not limited, and for example, the quantization parameter may be obtained statistically in a KL divergence (Kullback-Leibler divergence) manner or obtained in another manner.
In step 220, for a first functional layer in the multiple functional layers, the first quantization parameter of the first functional layer is converted into a second quantization parameter, where the second quantization parameter is used to quantize data in a floating point format into data in a second fixed-point format, the second fixed-point format is a format used by the second processing unit, and the first processing unit and the second processing unit are two different processing units.
The first functional layer may be any one of the plurality of functional layers of the neural network. If the first functional layer needs to be inferred by a second processing unit, where the second processing unit is a processing unit different from the first processing unit, for example the first processing unit is a CPU and the second processing unit is a DSP, the first quantization parameter of the first functional layer needs to be converted into a second quantization parameter.
Optionally, the first processing unit is a CPU, and the first fixed point format may be the int8 (8-bit signed integer) format; the second processing unit is a DSP, and the second fixed point format may be the uint8 (8-bit unsigned integer) format. A DSP usually stores and calculates data in the uint8 format, and since the first quantization parameter suitable for calculation on data in the int8 format is calculated in advance in the offline stage, the first quantization parameter needs to be converted into a second quantization parameter suitable for calculation on data in the uint8 format.
In addition, the int8 format has a quantization range of [-128, 127] with a center point of 0, that is, quantization in the int8 format is symmetric quantization. The uint8 format has a quantization range of [0, 255] with a center point of 128, that is, quantization in the uint8 format is asymmetric quantization.
Optionally, if the first functional layer belongs to the first class of functional layers, the first quantization parameter of the first functional layer includes a first activation value quantization parameter, and the first activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the first fixed point format. Step 220 may include: converting the first activation value quantization parameter of the first functional layer into a second activation value quantization parameter based on the conversion relation between the first fixed point format and the second fixed point format, where the second activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the second fixed point format. That is, for a functional layer that does not contain weight values, only the activation value quantization parameter needs to be converted. Illustratively, the value range of the activation values in the floating point format is determined according to the first activation value quantization parameter and the value range of the first fixed point format, and then the second activation value quantization parameter is determined according to the value range of the activation values in the floating point format and the value range of the second fixed point format.
Optionally, if the first functional layer belongs to the second class of functional layers, the first quantization parameter of the first functional layer includes a first activation value quantization parameter and a first weight value quantization parameter, the first activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the first fixed point format, and the first weight value quantization parameter is used to quantize weight values in the floating point format into weight values in the first fixed point format. Step 220 may include: converting the first activation value quantization parameter of the first functional layer into a second activation value quantization parameter based on the conversion relation between the first fixed point format and the second fixed point format, where the second activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the second fixed point format; and converting the first weight value quantization parameter of the first functional layer into a second weight value quantization parameter based on the conversion relation between the first fixed point format and the second fixed point format, where the second weight value quantization parameter is used to quantize weight values in the floating point format into weight values in the second fixed point format. That is, for a functional layer containing weight values, the weight value quantization parameter needs to be converted in addition to the activation value quantization parameter. Illustratively, the value range of the activation values (or weight values) in the floating point format is determined according to the first activation value quantization parameter (or first weight value quantization parameter) and the value range of the first fixed point format, and then the second activation value quantization parameter (or second weight value quantization parameter) is determined according to that floating point value range and the value range of the second fixed point format.
Optionally, if the first functional layer belongs to the second type of functional layer, the first weight value of the first functional layer needs to be converted into the second weight value based on a conversion relationship between the first fixed point format and the second fixed point format; the first weight value is expressed in a first fixed point format, and the second weight value is expressed in a second fixed point format. That is to say, in the case that the weight value of the first functional layer has been calculated in the offline stage, and the weight value of the first functional layer is a first weight value expressed in the first fixed point format and suitable for being calculated by the first processing unit, if the first functional layer needs to adopt the second processing unit for reasoning, the first weight value also needs to be converted into a second weight value expressed in the second fixed point format and suitable for being calculated by the second processing unit.
Illustratively, the formula for converting the first activation value quantization parameter (or the first weight value quantization parameter) into the second activation value quantization parameter (or the second weight value quantization parameter) is as follows:
real=scale1*(quantize-zero_point)
scale2=(fmax-fmin)/(qmax-qmin)
wherein scale1 represents the first activation value quantization parameter (or first weight value quantization parameter), scale2 represents the second activation value quantization parameter (or second weight value quantization parameter), quantize represents the quantized fixed point value, zero_point represents the quantized fixed point value corresponding to the floating point value 0, real represents the real floating point value, fmax represents the maximum floating point value, fmin represents the minimum floating point value, qmax represents the maximum fixed point value, and qmin represents the minimum fixed point value. Illustratively, if the first fixed point format is the int8 format and the second fixed point format is the uint8 format, then zero_point = 0, quantize takes values in [-128, 127], qmax = 255, and qmin = 0.
Illustratively, the calculation formula for converting the weight value in int8 format (i.e., the first weight value described above) into the weight value in uint8 format (i.e., the second weight value described above) is as follows:
Wuint8=Wint8^128
wherein Wuint8 represents the weight value in the uint8 format, and Wint8 represents the weight value in the int8 format.
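As an illustrative Python sketch only (with hypothetical helper names that are not part of the patent text), the scale conversion and weight conversion formulas above can be combined as follows, assuming the first fixed point format is int8, the second is uint8, and the ^ above denotes a bitwise exclusive-or:

    def int8_scale_to_uint8_scale(scale1):
        # floating point range representable by symmetric int8 with zero_point = 0
        f_max = 127 * scale1
        f_min = -128 * scale1
        q_max, q_min = 255, 0                        # uint8 fixed point range
        return (f_max - f_min) / (q_max - q_min)     # scale2 = (fmax - fmin) / (qmax - qmin)

    def int8_weight_to_uint8_weight(w_int8):
        # XOR with 128 flips the sign bit, mapping [-128, 127] onto [0, 255]
        return (w_int8 & 0xFF) ^ 128                 # & 0xFF takes the two's complement byte of a Python int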
And step 230, under the condition that the second processing unit is adopted to carry out reasoning on the first functional layer, quantizing the related data involved in the reasoning process of the first functional layer based on the second quantization parameter.
After the second quantization parameter corresponding to the first functional layer is obtained, in the process of reasoning the first functional layer by using the second processing unit, related data involved in the reasoning process is quantized or inversely quantized based on the second quantization parameter, so that the accuracy of reasoning the functional layer by using the second processing unit is ensured.
According to the technical scheme provided by the embodiment of the application, the quantization parameters of the neural network are converted from quantization parameters suitable for calculation by a first processing unit (such as a CPU) into quantization parameters suitable for calculation by a second processing unit (such as a DSP), so that the quantization parameters of the neural network can be adapted across different types of processing units, for example between a CPU and a DSP, thereby meeting the requirement of being compatible with a plurality of different types of processing units at the same time.
In addition, the method considers not only functional layers such as convolutional layers or convolution-like layers, but also functional layers such as pooling layers, concatenation layers, element-wise operation layers, binary layers, scaling layers, and activation layers, provides a solution for symmetric quantization of the whole network, and realizes compatibility with a plurality of different types of processing units through the conversion of quantization parameters.
Next, the inference process of the neural network will be described. As shown in FIG. 3, the inference process may include at least one of the following steps (310-320):
in step 310, for a second functional layer of the plurality of functional layers, an operation type of the second functional layer is determined, where the operation type is used to indicate a correlation characteristic of input data and output data of the second functional layer.
The second functional layer may be any one of a plurality of functional layers of a neural network.
Optionally, the operation types include: single-input single-output; multi-input single-output with an addition operation rule; and multi-input single-output with a multiplication operation rule. The operation type of the second functional layer may be any one of the above operation types.
If the operation type of the second functional layer is single-input single-output, the input data of the second functional layer has one and only one set of activation values, and the output data of the second functional layer also has one and only one set of activation values. In other words, the input data of the second functional layer is one feature map, and the output data is also one feature map. Illustratively, the operation type of a pooling layer is single-input single-output.
If the operation type of the second functional layer is multi-input single-output with an addition operation rule, the input data of the second functional layer includes multiple sets of activation values, the output data of the second functional layer has only one set of activation values, and the output data is obtained by performing an addition operation on the multiple sets of activation values in the input data. In other words, the input data of the second functional layer includes a plurality of feature maps, and the output data is one feature map obtained by adding the plurality of feature maps in the input data. Illustratively, an element-wise operation layer that performs addition is of the multi-input single-output type with an addition operation rule.
If the operation type of the second functional layer is multi-input single-output with a multiplication operation rule, the input data of the second functional layer includes multiple sets of activation values, the output data of the second functional layer has only one set of activation values, and the output data is obtained by performing a multiplication operation on the multiple sets of activation values in the input data. In other words, the input data of the second functional layer includes a plurality of feature maps, and the output data is one feature map obtained by multiplying the plurality of feature maps in the input data. Illustratively, an element-wise operation layer that performs multiplication is of the multi-input single-output type with a multiplication operation rule.
And 320, reasoning the second functional layer according to the input data of the second functional layer based on the operation type of the second functional layer to obtain output data of the second functional layer.
For different operation types, different processing modes can be adopted to carry out reasoning on the functional layer.
In some embodiments, when the operation type of the second functional layer is single input and single output, the second functional layer is inferred according to the input data of the second functional layer in the target fixed point format, so as to obtain the output data of the second functional layer in the target fixed point format.
The target fixed point format may be the first fixed point format or the second fixed point format described above. Taking the example of using the first processing unit to perform inference on the second functional layer, the target fixed point format is the first fixed point format. For example, the first processing unit is a CPU, and when the CPU performs inference calculation on the second functional layer, the target fixed point format is int8 format. Firstly, input data of the second functional layer in an int8 format is acquired, and then inference calculation is directly performed on the second functional layer by the data in the int8 format to obtain output data of the second functional layer in an int8 format.
In some embodiments, when the operation type of the second functional layer is multi-input single-output and the operation rule is addition operation, performing inverse quantization on multiple groups of input data according to quantization parameters corresponding to the multiple groups of input data of the second functional layer in the target fixed-point format, respectively, to obtain multiple groups of input data in the floating-point format; performing addition operation on a plurality of groups of input data in a floating point format to obtain an addition operation result in the floating point format; and quantizing the addition operation result in the floating point format to obtain the output data of the second functional layer in the target fixed point format.
The target fixed point format may be the first fixed point format or the second fixed point format described above. Taking inference on the second functional layer by the first processing unit as an example, the target fixed point format is the first fixed point format. For example, the first processing unit is a CPU, and when the CPU performs inference calculation on the second functional layer, the target fixed point format is the int8 format. First, multiple sets of input data of the second functional layer in the int8 format are obtained. Because the quantization parameters corresponding to the multiple sets of input data may be different, the sets are expressed on different fixed point scales, cannot directly participate in the addition operation, and need to be unified to the same scale before calculation. Therefore, the multiple sets of input data are respectively dequantized according to their corresponding quantization parameters to obtain multiple sets of input data in the floating point format, the multiple sets of input data in the floating point format are then added to obtain an addition result in the floating point format, and finally the addition result in the floating point format is quantized to obtain the output data of the second functional layer in the int8 format.
Exemplarily, taking an example that the second functional layer includes 2 sets of input data, the calculation formula of the output data of the second functional layer is as follows:
C=Clamp((reScaleA*A+reScaleB*B)*scaleC)
wherein C represents the output data of the second functional layer, A represents one set of input data of the second functional layer, B represents another set of input data of the second functional layer, reScaleA represents the dequantization parameter corresponding to input data A, reScaleB represents the dequantization parameter corresponding to input data B, scaleC represents the quantization parameter corresponding to output data C, and Clamp represents truncating a floating point number into an integer.
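As an illustrative sketch only (with hypothetical names, assuming NumPy int8 arrays), the addition-type element-wise layer can follow the above formula as:

    import numpy as np

    def eltwise_add_int8(a_int8, b_int8, re_scale_a, re_scale_b, scale_c):
        a_float = re_scale_a * a_int8.astype(np.float32)               # dequantize input A
        b_float = re_scale_b * b_int8.astype(np.float32)               # dequantize input B
        c_float = (a_float + b_float) * scale_c                        # add in floating point, then requantize
        return np.clip(np.rint(c_float), -128, 127).astype(np.int8)    # Clamp to the int8 range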
In some embodiments, when the operation type of the second functional layer is multi-input single-output and the operation rule is multiplication, performing multiplication on multiple sets of input data of the second functional layer in the target fixed-point format to obtain a multiplication result in the target fixed-point format; and generating output data of the second functional layer in the target fixed point format according to the multiplication result in the target fixed point format.
The target fixed point format may be the first fixed point format or the second fixed point format described above. Taking the example of using the first processing unit to perform inference on the second functional layer, the target fixed point format is the first fixed point format. For example, the first processing unit is a CPU, and when the CPU performs inference calculation on the second functional layer, the target fixed point format is int8 format. Firstly, a plurality of groups of input data of the second functional layer in the int8 format are obtained, although quantization parameters corresponding to the plurality of groups of input data may be different, the plurality of groups of input data can directly participate in multiplication, so that the plurality of groups of input data in the int8 format are directly multiplied to obtain a multiplication result in the int8 format, and then output data of the second functional layer in the int8 format is generated according to the multiplication result in the int8 format.
Exemplarily, taking an example that the second functional layer includes 2 sets of input data, the calculation formula of the output data of the second functional layer is as follows:
C=Clamp((A*B)*fusedScale)
fusedScale=reScaleA*reScaleB*scaleC
wherein C represents the output data of the second functional layer, A represents one set of input data of the second functional layer, B represents another set of input data of the second functional layer, reScaleA represents the dequantization parameter corresponding to input data A, reScaleB represents the dequantization parameter corresponding to input data B, scaleC represents the quantization parameter corresponding to output data C, and Clamp represents truncating a floating point number into an integer.
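As an illustrative sketch only (hypothetical names, NumPy assumed), the multiplication-type element-wise layer can follow the above formulas as:

    import numpy as np

    def eltwise_mul_int8(a_int8, b_int8, re_scale_a, re_scale_b, scale_c):
        fused_scale = re_scale_a * re_scale_b * scale_c                        # fusedScale
        prod = a_int8.astype(np.int32) * b_int8.astype(np.int32)              # multiply directly in fixed point
        return np.clip(np.rint(prod * fused_scale), -128, 127).astype(np.int8)  # Clamp to the int8 range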
In this embodiment, for the inference process of the neural network, the functional layers are divided into a plurality of different operation types according to their operation characteristics, and inference is then performed with different calculation manners for different operation types, so that the inference speed is improved and the calculation amount is reduced as much as possible while the inference accuracy is ensured.
In an exemplary embodiment, to improve quantization accuracy, the present application proposes a re-quantization control strategy. Quantized inference is prone to accuracy loss, which mainly comes from two aspects: one is the quantization characterization manner, i.e., whether the calculation of the quantization parameter (i.e., scale) is reasonable, and the other is the truncation error from floating point to fixed point. The quantization parameter is obtained by the KL divergence (Kullback-Leibler divergence) method, which has been verified by the industry; if the truncation error can be reduced, the accuracy is also improved. For a functional layer of the single-input single-output type with relatively simple calculation, such as a pooling layer, the application finds that the input quantization parameter and the output quantization parameter are relatively close. When the difference between the input quantization parameter and the output quantization parameter is smaller than a certain threshold, the functional layer does not need to be re-quantized and can directly calculate and output the result in fixed point form, so that a re-quantization truncation step is omitted, which can improve the calculation accuracy and also reduce some unnecessary calculation. Therefore, in the offline stage, a re-quantization parameter can be added to the neural network to indicate whether a functional layer needs to be re-quantized. For example, for any functional layer, whether the functional layer needs to be re-quantized may be determined according to the difference between its input quantization parameter and output quantization parameter, so as to obtain the corresponding re-quantization parameter, which is written into the model structure. Illustratively, if the difference between the input quantization parameter and the output quantization parameter of the functional layer is less than or equal to a threshold, it is determined that the functional layer does not require re-quantization; otherwise, re-quantization is required. The threshold can be determined by practical experience or experiment; for example, with a threshold of 0.01, if |input scale - output scale| <= 0.01, no re-quantization is required, otherwise re-quantization is required, where input scale represents the input quantization parameter and output scale represents the output quantization parameter.
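For illustration only, a minimal sketch of the offline re-quantization decision described above, assuming the example threshold of 0.01; the names are hypothetical:

    REQUANT_THRESHOLD = 0.01   # example threshold from the text; determined by experience or experiment

    def needs_requantization(input_scale, output_scale, threshold=REQUANT_THRESHOLD):
        # re-quantization can be skipped when the input and output scales are close enough
        return abs(input_scale - output_scale) > threshold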
Next, taking a third functional layer in the neural network as an example, where the third functional layer may be any one functional layer, when the re-quantization parameter corresponding to the third functional layer indicates that re-quantization is required, the online stage may perform inference on the third functional layer in the following manners:
in a possible implementation manner, inverse quantization is performed on input data of the third functional layer in the target fixed point format to obtain input data of the third functional layer in the floating point format, then the input data of the third functional layer in the floating point format is adopted to perform inference on the third functional layer to obtain output data of the third functional layer in the floating point format, and finally the output data of the third functional layer in the floating point format is quantized to obtain output data of the third functional layer in the target fixed point format. Taking the target fixed point format as the int8 format as an example, according to the input quantization parameter of the third functional layer, performing inverse quantization on the input data of the third functional layer in the int8 format to obtain the input data of the third functional layer in the floating point format, then performing inference calculation on the input data of the third functional layer in the floating point format to obtain the output data of the third functional layer in the floating point format, and finally performing quantization on the output data of the third functional layer in the floating point format by using the output quantization parameter corresponding to the third functional layer to obtain the output data of the third functional layer in the int8 format.
In another possible implementation, the input data of the third functional layer in the target fixed point format is used to perform inference on the third functional layer to obtain the initial output data of the third functional layer in the target fixed point format, and then the initial output data of the third functional layer in the target fixed point format is converted into the output data of the third functional layer in the target fixed point format based on the fused uniform quantization parameter corresponding to the third functional layer. Taking the target fixed point format as int8 format as an example, performing inference calculation by using input data of the third functional layer in int8 format to obtain initial output data of the third functional layer in int8 format, and then calculating the initial output data of the third functional layer in int8 format based on the fused uniform quantization parameters corresponding to the third functional layer to obtain output data of the third functional layer in int8 format. The fused unified quantization parameter corresponding to the third functional layer may be obtained by calculation according to the input quantization parameter and the output quantization parameter corresponding to the third functional layer.
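A minimal sketch of the implementation described in the preceding paragraph, assuming the initial output is a NumPy int8 array and the fused uniform quantization parameter has been pre-computed offline; the names are hypothetical:

    import numpy as np

    def requantize_with_fused_scale(initial_out_int8, fused_scale):
        # rescale the fixed point initial output onto the output quantization scale
        rescaled = initial_out_int8.astype(np.float32) * fused_scale
        return np.clip(np.rint(rescaled), -128, 127).astype(np.int8)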
In addition, in the case that the re-quantization parameter corresponding to the third functional layer indicates that re-quantization is not needed, the online stage may perform inference on the third functional layer as follows: the third functional layer is inferred using the input data of the third functional layer in the target fixed point format to obtain the output data of the third functional layer in the target fixed point format. That is, inference calculation is performed directly in the target fixed point format, without format conversion or post-processing.
The re-quantization control strategy provided by the embodiment of the application can not only improve the calculation accuracy but also reduce some unnecessary calculation.
In an exemplary embodiment, in the case where the input data of the neural network is in the uint8 format, for example where the input data of the neural network is image data, its values lie in the range [0, 255] in the uint8 format. In order to be compatible with input data in the uint8 format, the quantization parameters of the input layer of the neural network are fused.
The input layer generally needs to be preprocessed; the mean (mean) and variance (norm) are provided by the user, and the conventional operation steps are as shown in fig. 4. First, a preprocessing step is executed: the input data in the uint8 format is converted into input data in the float format according to the mean and the variance. Then, a quantization step is performed: the input data in the float format is converted into input data in the int8 format according to the quantization parameter.
As shown in fig. 4, the conversion process is formulated as follows:
mid(float)=(input(uint8)–mean)*norm
output(int8)=Clamp(mid(float)*scale)
wherein input(uint8) represents the input data in the uint8 format, mid(float) represents the input data in the float format obtained through the preprocessing step, output(int8) represents the finally converted input data in the int8 format, mean represents the mean of the input data, norm represents the variance of the input data, scale represents the quantization parameter, and Clamp represents truncating a floating point number into an integer.
The application proposes that the mean value and the quantization parameter may be fused; after fusion, the preprocessing step can be omitted, and the conversion from input data in the uint8 format to input data in the int8 format can be completed by directly performing the quantization step. This process is shown in FIG. 5 and is formulated as follows:
output(int8)=Clamp((input(uint8)–zeroPoint)*scale’)
wherein input(uint8) represents the input data in the uint8 format, output(int8) represents the finally converted input data in the int8 format, scale' represents the fused quantization parameter, with scale' = norm * scale and zeroPoint = mean, mean represents the mean of the input data, norm represents the variance of the input data, scale represents the quantization parameter, and Clamp represents truncating a floating point number into an integer.
Based on the above formula corresponding to fig. 5, it can be derived that: in the case that the input data of the neural network is in the second fixed point format, a translation factor corresponding to the second fixed point format is determined according to the mean of the input data of the neural network; the fused quantization parameter is determined according to the variance of the input data of the neural network and the quantization parameter corresponding to the second fixed point format; and the input data of the neural network is converted from the second fixed point format into the first fixed point format according to the translation factor corresponding to the second fixed point format and the fused quantization parameter. Illustratively, the first fixed point format is the int8 format and the second fixed point format is the uint8 format.
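For illustration only, a Python sketch of the fused input layer conversion, following the derivation above (scale' = norm * scale, zeroPoint = mean); the names are hypothetical and a NumPy uint8 image is assumed:

    import numpy as np

    def quantize_input_fused(input_uint8, mean, norm, scale):
        zero_point = mean              # translation factor derived from the mean
        fused_scale = norm * scale     # scale' = norm * scale
        out = (input_uint8.astype(np.float32) - zero_point) * fused_scale
        return np.clip(np.rint(out), -128, 127).astype(np.int8)   # Clamp to the int8 range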
In the embodiment of the application, under the condition that the neural network adopts the first fixed point format for reasoning and calculation, the input data in the second fixed point format is compatible, so that the input data in the second fixed point format can be input into the neural network to participate in reasoning and calculation, and the applicability of the neural network is improved.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 6, a block diagram of a quantization apparatus of a neural network provided in an embodiment of the present application is shown. The device has the function of realizing the quantization method of the neural network, and the function can be realized by hardware or by hardware executing corresponding software. The device can be a computer device and can also be arranged in the computer device. The apparatus 600 may include: a parameter acquisition module 610, a parameter conversion module 620 and a data quantization module 630.
The parameter obtaining module 610 is configured to obtain first quantization parameters of a plurality of functional layers of the neural network, where the first quantization parameters are used to quantize data in a floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by the first processing unit.
A parameter converting module 620, configured to, for a first functional layer in the plurality of functional layers, convert a first quantization parameter of the first functional layer into a second quantization parameter, where the second quantization parameter is used to quantize the data in the floating point format into data in a second fixed point format, where the second fixed point format is a format adopted by a second processing unit, and the first processing unit and the second processing unit are two different processing units.
A data quantization module 630, configured to quantize, when the second processing unit is used to reason about the first functional layer, related data involved in the inference process of the first functional layer based on the second quantization parameter.
In an exemplary embodiment, if the first functional layer belongs to the first class of functional layers, the first quantization parameter of the first functional layer includes a first activation value quantization parameter, and the first activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the first fixed point format; the parameter conversion module 620 is configured to convert the first activation value quantization parameter of the first functional layer into a second activation value quantization parameter based on the conversion relation between the first fixed point format and the second fixed point format, wherein the second activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the second fixed point format.
In an exemplary embodiment, if the first functional layer belongs to the second class of functional layers, the first quantization parameter of the first functional layer includes a first activation value quantization parameter and a first weight value quantization parameter, the first activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the first fixed point format, and the first weight value quantization parameter is used to quantize weight values in the floating point format into weight values in the first fixed point format; the parameter conversion module 620 is configured to convert the first activation value quantization parameter of the first functional layer into a second activation value quantization parameter based on the conversion relation between the first fixed point format and the second fixed point format, wherein the second activation value quantization parameter is used to quantize activation values in the floating point format into activation values in the second fixed point format; and to convert the first weight value quantization parameter of the first functional layer into a second weight value quantization parameter based on the conversion relation between the first fixed point format and the second fixed point format, wherein the second weight value quantization parameter is used to quantize weight values in the floating point format into weight values in the second fixed point format.
Optionally, the parameter conversion module 620 is further configured to convert the first weight value of the first functional layer into a second weight value based on a conversion relationship between the first fixed point format and the second fixed point format; wherein the first weight value is represented in the first fixed point format, and the second weight value is represented in the second fixed point format.
In an exemplary embodiment, as shown in fig. 7, the apparatus 600 further includes: a type determination module 640 and an inferential computation module 650.
A type determining module 640, configured to determine, for a second functional layer in the plurality of functional layers, an operation type of the second functional layer, where the operation type is used to indicate a relevant characteristic of input data and output data of the second functional layer.
An inference calculation module 650, configured to perform inference on the second functional layer according to the input data of the second functional layer based on the operation type, to obtain output data of the second functional layer.
Optionally, the inference calculation module 650 is configured to, when the operation type is single input and single output, perform inference on the second functional layer according to the input data of the second functional layer in the target fixed point format to obtain the output data of the second functional layer in the target fixed point format.
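A single-input single-output layer of this kind can often be evaluated directly on fixed-point data, because the output reuses the input's quantization parameter. The sketch below uses ReLU as an assumed example; the choice of layer and the zero-point handling are illustrative only.

```python
import numpy as np

def relu_fixed_point(x_q: np.ndarray, zero_point: int = 0) -> np.ndarray:
    """ReLU on quantized data: clamp at the zero point and stay in the target fixed point format."""
    return np.maximum(x_q, zero_point)
```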
Optionally, the inference calculation module 650 is configured to, when the operation type is multi-input single-output and the operation rule is addition operation, perform inverse quantization on multiple groups of input data according to quantization parameters corresponding to the multiple groups of input data of the second functional layer in the target fixed-point format, respectively, to obtain multiple groups of input data in the floating-point format; performing addition operation on the multiple groups of input data in the floating point format to obtain an addition operation result in the floating point format; and quantizing the addition operation result in the floating point format to obtain the output data of the second functional layer in the target fixed point format.
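A hedged sketch of this addition path, assuming per-tensor scales for the inputs and the output (not necessarily the patent's exact parameterization) and int16 as the target fixed point format:

```python
import numpy as np

def quantized_add(inputs_q, scales, out_scale):
    """Dequantize each fixed-point input, add in floating point, then requantize the sum."""
    acc = np.zeros_like(inputs_q[0], dtype=np.float32)
    for x_q, scale in zip(inputs_q, scales):
        acc += x_q.astype(np.float32) * scale     # inverse quantization of each group of input data
    out_q = np.round(acc / out_scale)             # quantize the floating-point sum
    return np.clip(out_q, -32768, 32767).astype(np.int16)
```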
Optionally, the inference calculation module 650 is configured to, when the operation type is multiple-input single-output and the operation rule is multiplication, perform multiplication on multiple groups of input data of the second functional layer in a target fixed-point format to obtain a multiplication result in the target fixed-point format; and generating output data of the second functional layer in the target fixed point format according to the multiplication result in the target fixed point format.
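For the multiplication path, the integer values can be multiplied directly and the scales folded into a single requantization factor. The sketch below makes the same per-tensor-scale and int16 assumptions as above:

```python
import numpy as np

def quantized_mul(a_q, b_q, scale_a, scale_b, out_scale):
    """Multiply two fixed-point tensors and requantize the product to the target format."""
    prod = a_q.astype(np.int64) * b_q.astype(np.int64)   # exact integer product
    requant = (scale_a * scale_b) / out_scale            # fold the input scales into one factor
    out_q = np.round(prod * requant)
    return np.clip(out_q, -32768, 32767).astype(np.int16)
```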
In an exemplary embodiment, as shown in fig. 7, the apparatus 600 further includes: inferential computation module 650.
The inference calculation module 650 is configured to, for a third functional layer in the plurality of functional layers, in the case that the weighting parameter corresponding to the third functional layer indicates that weighting is required,
perform inverse quantization on the input data of the third functional layer in the target fixed point format to obtain the input data of the third functional layer in the floating point format; perform inference on the third functional layer by using the input data of the third functional layer in the floating point format to obtain the output data of the third functional layer in the floating point format; and quantize the output data of the third functional layer in the floating point format to obtain the output data of the third functional layer in the target fixed point format;
or, perform inference on the third functional layer by using the input data of the third functional layer in the target fixed point format to obtain initial output data of the third functional layer in the target fixed point format; and convert the initial output data of the third functional layer in the target fixed point format into the output data of the third functional layer in the target fixed point format based on the fused uniform quantization parameter corresponding to the third functional layer.
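An illustrative sketch of the first alternative (dequantize, compute in floating point, requantize). Softmax is only an assumed example of a layer that needs floating-point precision, and the function name is hypothetical:

```python
import numpy as np

def infer_float_fallback(x_q: np.ndarray, in_scale: float, out_scale: float) -> np.ndarray:
    x = x_q.astype(np.float32) * in_scale   # inverse quantization of the input
    e = np.exp(x - x.max())
    y = e / e.sum()                         # layer computation in floating point (softmax here)
    return np.clip(np.round(y / out_scale), -32768, 32767).astype(np.int16)  # back to the target format
```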
In an exemplary embodiment, as shown in fig. 7, the apparatus 600 further includes: a data conversion module 660.
The data conversion module 660 is configured to determine a translation factor corresponding to the second fixed point format according to a mean value of the input data of the neural network when the input data of the neural network is in the second fixed point format; determine a fused quantization parameter according to the variance of the input data of the neural network and the quantization parameter corresponding to the second fixed point format; and convert the input data of the neural network from the second fixed point format to the first fixed point format according to the translation factor corresponding to the second fixed point format and the fused quantization parameter.
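A possible reading of this conversion, sketched under the assumption that the network input is normalized as (x - mean) / std, that both fixed point formats use per-tensor scales, and that the first format is int8; the exact formulas are illustrative, not quoted from the patent:

```python
import numpy as np

def convert_network_input(x_q2: np.ndarray, scale2: float, mean: float, std: float,
                          scale1: float) -> np.ndarray:
    shift = mean / scale2                      # translation factor in the second fixed point format
    fused_scale = scale2 / (std * scale1)      # fused quantization parameter
    x_q1 = np.round((x_q2.astype(np.float32) - shift) * fused_scale)
    return np.clip(x_q1, -128, 127).astype(np.int8)   # first fixed point format assumed to be int8
```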
According to the embodiments of the present application, the quantization parameters of the neural network are converted from quantization parameters suited to computation on the first processing unit (such as a CPU) into quantization parameters suited to computation on the second processing unit (such as a DSP). In this way, the quantization parameters of the neural network can be adapted between different types of processing units, for example between a CPU and a DSP, so as to meet the requirement of being compatible with different types of processing units at the same time.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the above functional modules is merely illustrative; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of their specific implementation processes, reference may be made to the method embodiments, which are not repeated here.
In an exemplary embodiment, there is also provided a computer device comprising a processor and a memory, the memory having stored therein a computer program, the computer program being loaded and executed by the processor to implement the above-mentioned neural network quantization method.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which a computer program is stored, the computer program being loaded and executed by a processor to implement the above-mentioned neural network quantization method. Optionally, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided a computer program product including computer instructions stored in a computer-readable storage medium, from which a processor reads and executes the computer instructions to implement the quantization method of the neural network described above.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution order among the steps; in some other embodiments, the steps may also be executed out of the numbered order, for example, two steps with different numbers may be executed simultaneously, or two steps with different numbers may be executed in an order reverse to that shown in the figure, which is not limited in the embodiments of the present application.
The above description is only exemplary of the present application and is not intended to limit the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method for quantizing a neural network, the method comprising:
acquiring first quantization parameters of a plurality of functional layers of a neural network, wherein the first quantization parameters are used for quantizing data in a floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by a first processing unit;
for a first functional layer of the plurality of functional layers, converting a first quantization parameter of the first functional layer into a second quantization parameter, the second quantization parameter being used to quantize the data in the floating point format into data in a second fixed point format, the second fixed point format being a format employed by a second processing unit, and the first processing unit and the second processing unit being two different processing units;
and under the condition that the second processing unit is adopted to carry out inference on the first functional layer, quantizing related data involved in the inference process of the first functional layer based on the second quantization parameter.
2. The method of claim 1, wherein if the first functional layer belongs to a first class of functional layers, the first quantization parameter of the first functional layer comprises a first activation value quantization parameter, the first activation value quantization parameter being configured to quantize the activation value in the floating point format into the activation value in the first fixed point format;
the converting the first quantization parameter of the first functional layer into the second quantization parameter comprises:
converting a first activation value quantization parameter of the first functional layer into a second activation value quantization parameter based on a conversion relationship between the first fixed point format and the second fixed point format; wherein the second activation value quantization parameter is used to quantize the activation value in the floating point format into the activation value in the second fixed point format.
3. The method of claim 1, wherein if the first functional layer belongs to a second class of functional layers, the first quantization parameter of the first functional layer comprises a first activation value quantization parameter and a first weight value quantization parameter, the first activation value quantization parameter is used for quantizing the activation value in the floating point format into the activation value in the first fixed point format, and the first weight value quantization parameter is used for quantizing the weight value in the floating point format into the weight value in the first fixed point format;
the converting the first quantization parameter of the first functional layer into the second quantization parameter includes:
converting a first activation value quantization parameter of the first functional layer into a second activation value quantization parameter based on a conversion relationship between the first fixed point format and the second fixed point format; wherein the second activation value quantization parameter is used to quantize the activation value in the floating point format into the activation value in the second fixed point format;
converting a first weight value quantization parameter of the first functional layer into a second weight value quantization parameter based on a conversion relationship between the first fixed point format and the second fixed point format; wherein the second weight value quantization parameter is configured to quantize the weight value in the floating point format into the weight value in the second fixed point format.
4. The method of claim 3, further comprising:
converting a first weight value of the first functional layer into a second weight value based on a conversion relation between the first fixed point format and the second fixed point format; wherein the first weight value is represented in the first fixed point format, and the second weight value is represented in the second fixed point format.
5. The method of claim 1, further comprising:
for a second functional layer of the plurality of functional layers, determining an operation type of the second functional layer, wherein the operation type is used for indicating relevant characteristics of input data and output data of the second functional layer;
and performing inference on the second functional layer according to the input data of the second functional layer based on the operation type to obtain the output data of the second functional layer.
6. The method of claim 5, wherein the performing inference on the second functional layer according to the input data of the second functional layer based on the operation type to obtain the output data of the second functional layer comprises:
and under the condition that the operation type is single input and single output, performing inference on the second functional layer according to the input data of the second functional layer in the target fixed point format to obtain the output data of the second functional layer in the target fixed point format.
7. The method of claim 5, wherein the performing inference on the second functional layer according to the input data of the second functional layer based on the operation type to obtain the output data of the second functional layer comprises:
under the condition that the operation type is multi-input single-output and the operation rule is addition, performing inverse quantization on multiple groups of input data of the second functional layer in the target fixed point format according to quantization parameters respectively corresponding to the multiple groups of input data, to obtain multiple groups of input data in the floating point format;
performing addition operation on the multiple groups of input data in the floating point format to obtain an addition operation result in the floating point format;
and quantizing the addition operation result in the floating point format to obtain the output data of the second functional layer in the target fixed point format.
8. The method of claim 5, wherein the performing inference on the second functional layer according to the input data of the second functional layer based on the operation type to obtain the output data of the second functional layer comprises:
under the condition that the operation type is multi-input single-output and the operation rule is multiplication, performing multiplication on multiple groups of input data of the second functional layer in a target fixed point format to obtain a multiplication result in the target fixed point format;
and generating output data of the second functional layer in the target fixed point format according to the multiplication result in the target fixed point format.
9. The method of claim 1, further comprising:
for a third functional layer of the plurality of functional layers, in the case that the weighting parameter corresponding to the third functional layer indicates that weighting is required,
performing inverse quantization on the input data of the third functional layer in the target fixed point format to obtain the input data of the third functional layer in the floating point format; performing inference on the third functional layer by using the input data of the third functional layer in the floating point format to obtain the output data of the third functional layer in the floating point format; and quantizing the output data of the third functional layer in the floating point format to obtain the output data of the third functional layer in the target fixed point format;
or,
performing inference on the third functional layer by using the input data of the third functional layer in the target fixed point format to obtain initial output data of the third functional layer in the target fixed point format; and converting the initial output data of the third functional layer in the target fixed point format into the output data of the third functional layer in the target fixed point format based on the fused uniform quantization parameter corresponding to the third functional layer.
10. The method of claim 1, further comprising:
under the condition that the input data of the neural network is in the second fixed point format, determining a translation factor corresponding to the second fixed point format according to the mean value of the input data of the neural network;
determining a fused quantization parameter according to the variance of the input data of the neural network and the quantization parameter corresponding to the second fixed point format;
and converting the input data of the neural network from the second fixed point format to the first fixed point format according to the translation factor corresponding to the second fixed point format and the fused quantization parameter.
11. An apparatus for quantization of a neural network, the apparatus comprising:
the parameter acquisition module is used for acquiring first quantization parameters of a plurality of functional layers of the neural network, wherein the first quantization parameters are used for quantizing data in a floating point format into data in a first fixed point format, and the first fixed point format is a format adopted by the first processing unit;
a parameter conversion module, configured to, for a first functional layer of the multiple functional layers, convert a first quantization parameter of the first functional layer into a second quantization parameter, where the second quantization parameter is used to quantize data in the floating-point format into data in a second fixed-point format, where the second fixed-point format is a format used by a second processing unit, and the first processing unit and the second processing unit are two different processing units;
and the data quantization module is used for quantizing the related data involved in the inference process of the first functional layer based on the second quantization parameter under the condition that the second processing unit is adopted to infer the first functional layer.
12. A computer device, characterized in that the computer device comprises a processor and a memory, in which a computer program is stored, which computer program is loaded and executed by the processor to implement the method according to any of claims 1 to 10.
13. A computer-readable storage medium, in which a computer program is stored which is loaded and executed by a processor to implement the method according to any one of claims 1 to 10.
14. A computer program product comprising computer instructions stored in a computer readable storage medium, from which a processor reads and executes the computer instructions to implement the method of any one of claims 1 to 10.
CN202210044661.9A 2022-01-14 2022-01-14 Neural network quantization method, apparatus, device, storage medium, and program product Pending CN114444688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210044661.9A CN114444688A (en) 2022-01-14 2022-01-14 Neural network quantization method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN114444688A true CN114444688A (en) 2022-05-06

Family

ID=81368434

Country Status (1)

Country Link
CN (1) CN114444688A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997388A (en) * 2022-06-30 2022-09-02 北京知存科技有限公司 Linear programming-based neural network bias processing method for memory and computation integrated chip
CN114997388B (en) * 2022-06-30 2024-05-07 杭州知存算力科技有限公司 Neural network bias processing method based on linear programming for memory and calculation integrated chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination