WO2021083154A1 - Method and apparatus for quantization of neural networks post training - Google Patents

Method and apparatus for quantization of neural networks post training

Info

Publication number
WO2021083154A1
WO2021083154A1 (PCT/CN2020/124064, CN2020124064W)
Authority
WO
WIPO (PCT)
Prior art keywords
input
exponent
bits
range
bit
Prior art date
Application number
PCT/CN2020/124064
Other languages
French (fr)
Inventor
Amit Srivastava
Pariksheet PINJARI
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202080048236.6A (CN114222997A)
Publication of WO2021083154A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 - Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 - Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/483 - Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions

Definitions

  • the present subject matter described herein in general, relates to deep learning models and more particularly, it relates to a method and an apparatus for quantization of neural networks, such as weights and parameters of a stored neural network model.
  • Machine learning is an application of artificial intelligence (AI) that employs statistical techniques to provide computer systems the ability to "learn” with data, e.g. progressively improve performance on a specific task, without being explicitly programmed to do so.
  • Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input.
  • Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.
  • Deep neural networks employ computational models composed of multiple processing layers that learn representations of data, usually extremely large volumes of data, with multiple levels of abstraction, resulting in the terminology “deep learning” or “deep networks”.
  • Deep learning, also known as “deep structured learning” or “hierarchical learning”, is part of machine learning methods that are based on learning data representations, as opposed to task-specific algorithms.
  • the “learning” can be “supervised”, “semi-supervised” or “unsupervised”.
  • Deep learning architectures such as “deep neural networks”, “deep belief networks” and “recurrent neural networks” have been applied to a multitude of fields including but not limited to computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, board game programs, and the like, where they have produced results comparable to and in some cases superior to human experts.
  • neural network is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information.
  • the key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems. Neural networks, like people, learn by example.
  • a neural network may be configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones and the same is true of neural networks as well.
  • a problem associated with the existing neural network models is that they take up a tremendous amount of space on the disk, with the size of the pre-trained models being huge; thereby resulting in the problem of placing such models on edge devices or user devices such as but not limited to mobile phones, tablets, Raspberry Pi boards, wearable devices, and the like. Typically such devices have limited storage space and processing capability.
  • these models are stored using 32-bit floating point data structure.
  • An example of the sizes of the typical pre-trained models is tabulated hereinbelow in Table 1. It can readily be inferred that almost all the size is taken up by the weights of the neural network connections, because these weights are different floating point numbers. For the same reason, simple compression techniques such as zip compression do not adequately compress them.
  • Table 1 Tabulates the space occupied by some pre-trained neural network models
  • One approach to overcome these drawbacks associated with the neural networks is to increase the computational power of the hardware used to deploy these networks. This is a costly solution and is not feasible for large scale deployment on user devices due to size constraints.
  • An alternative approach for large scale deployment is to implement these neural network models in fixed point, which may offer advantages in reducing memory, bandwidth, lowering power consumption and computation time as well as the storage requirements.
  • the performance is measured in terms of the inference time required to classify a single image. It may be inferred from the said graph that the quantized model of TensorFlow™ is much slower than the original models, thereby defeating its very purpose. From the data tabulated in Tables 2 and 3, it may be inferred that performance on the MNIST dataset has also taken a hit.
  • Table 2 Tabulates performance of individual operations of original network on MNIST dataset
  • Table 3 Tabulates the performance of the quantized operations of TensorFlow™ on MNIST dataset
  • the existing fixed point quantization solution does not address the problems associated with the neural networks, thereby leaving such models too large to be used on user devices.
  • the present disclosure discloses a method, an apparatus and a system for dynamic point quantization of neural networks for greater accuracy with lower storage requirements.
  • the time penalty associated with quantization and re-quantization is reduced.
  • An objective of the present disclosure is to provide quantization of deep learning models.
  • Another objective of the present disclosure is to reduce the storage size when compared to original model size without compromising much on accuracy and performance of the neural networks.
  • Yet another objective of the present disclosure is to provide a method for quantization of neural networks.
  • Yet another objective of the present disclosure is to provide an apparatus for quantization of neural networks.
  • the present disclosure involves quantizing existing models so that their storage requirements can be reduced without compromising much on the accuracy, while also achieving better performance when compared to the existing models.
  • the inputs to the method are the weights and biases of each layer of a pre-trained neural network.
  • these inputs may be, at least initially, related to and/or characteristic of real world data, such as e.g. image information.
  • the method comprises the steps of determining if an input is positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit; determining an exponential range of the input; determining a maximum range of layer parameters of the input to represent said input in a quantized form; determining offset of a layer of the input; performing exponent adjustment by converting the input into a corresponding binary form; and determining exponent representation of the input by adding the offset to the exponent adjustment.
  • the signed bit is zero if the input is positive number while the signed bit is 1 if the input is negative number.
  • the exponential range for an n-bit width is calculated as 2^(n-1) - 1, i.e. by raising 2 to the power n-1 and subtracting 1 since the range starts from zero; 1 is subtracted from n in the exponent to avoid an overflow condition, where n is the bit width given to the exponent part.
  • the offset is determined by subtracting the levels required to represent the maximum range from the exponential range and adding half the number of mantissa bits.
  • converting the input into a corresponding binary form comprises the step of determining whether the binary form of the input is 10 bits long; if it is more than 10 bits long, the bits in the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded; if it is fewer than 10 bits long, zeros are appended to the end of the binary form of the input and the exponent of the input is adjusted by subtracting the number of zeros added.
  • a seventh possible implementation of the method according to the first aspect further comprises the step of performing addition operation on any two quantized inputs.
  • an apparatus for quantization of neural networks comprises a sign determining module, an exponent range determining module, a maximum range determining module, an offset determining module, an exponent adjustment module, and an exponent representation module.
  • the sign determining module configured to determine if an input is positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit.
  • the exponent range determining module configured to determine exponential range of the input.
  • the maximum range determining module configured to determine the maximum range of layer parameters of the input.
  • the offset determining module configured to determine offset of a layer of the input.
  • the exponent adjustment module configured to perform exponent adjustment by converting the input into a corresponding binary form.
  • the exponent representation module configured to determine exponent representation of the input by adding the offset to the exponent adjustment.
  • the signed bit is zero if the input is positive number while the signed bit is 1 if the input is negative number.
  • the exponential range for an n-bit width is calculated as 2^(n-1) - 1, i.e. by raising 2 to the power n-1 and subtracting 1 since the range starts from zero; 1 is subtracted from n in the exponent to avoid an overflow condition, where n is the bit width given to the exponent part.
  • the offset is determined by subtracting the levels required to represent the maximum range from the exponential range and adding half the number of mantissa bits.
  • converting the input into a corresponding binary form involves determining whether the binary form of the input is 10 bits long; if it is more than 10 bits long, the bits in the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded; if it is fewer than 10 bits long, zeros are appended to the end of the binary form of the input and the exponent of the input is adjusted by subtracting the number of zeros added.
  • the apparatus is further configured to perform multiplication operation on any two quantized inputs.
  • the apparatus is further configured to perform addition operation on any two quantized inputs.
  • Figure 1 illustrates a schematic diagram of a RESNET architecture with 34 layers, as a prior art to the present disclosure.
  • Figure 2 illustrates a block diagram of the flowgraph of Tensorflow’s quantization and dequantization operation, as prior art of the present disclosure.
  • Figure 3 illustrates a graphical representation of performance of the Tensorflow quantized model vs the original model.
  • Figure 4 illustrates a block diagram of the process flow of the weight analysis, in accordance with an embodiment of the present disclosure.
  • Figure 5 illustrates a block diagram of the process flow of the float representation, in accordance with the present disclosure.
  • Figure 6 illustrates a flow diagram of how the memory footprint is distributed, in accordance with the present disclosure.
  • Figure 7 illustrates a graphical representation of the integer arithmetic multiplication speed vs the floating point multiplication speed.
  • Figure 8 illustrates a block diagram of the mapping of float to mantissa-exponent representation, in accordance with the present disclosure.
  • Figure 9 illustrates a flow-chart of the method of quantization of neural networks, in accordance with an embodiment of the present disclosure.
  • Figure 10 illustrates a block diagram of the apparatus for quantization of neural networks, in accordance with another embodiment of the present disclosure.
  • the present invention can be implemented in numerous ways, such as a process, an apparatus, a system, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are stored in (non-transitory) memory or sent over optical or electronic communication links.
  • these implementations, or any other form that the invention may take, may be referred to as techniques.
  • the order of the steps of disclosed processes may be altered within the scope of the invention.
  • the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more” .
  • the terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like.
  • the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
  • quantization refers to representing a number with less number of bits compared to floating point bits.
  • deep learning models refers to artificial neural networks with depth of layers.
  • neural network refers to computational model based on the structure and functions of biological neural network.
  • de-quantization refers to converting a quantized number to floating point representation.
  • pre trained models refers to artificial neural network whose weights are trained to perform a specific task.
  • weight parameter refers to multiplication factor for the edge connecting one neuron to another neuron in artificial neural network.
  • bias parameter refers to addition factor of a neuron in artificial neural network.
  • CNN layer refers to a layer which applies a convolutional operation on the input.
  • RESNET (Residual Network) refers to a type of deep learning model.
  • MNIST (Modified National Institute of Standards and Technology) refers to a large dataset of handwritten digits used for training various image processing systems.
  • Rectified Linear Unit (ReLU) refers to an activation function analogous to half wave rectification in electrical engineering.
  • TensorFlow™ refers to an open source machine learning framework.
  • fixed point quantization refers to quantization method where signal value falling between two levels is assigned to one of the two levels where total number of levels is fixed.
  • weight analysis refers to finding the range and standard deviation of layer weights.
  • a method and an apparatus for quantization of neural networks are disclosed. While aspects are described in the context of quantization of deep learning models, so as to reduce their storage requirements without compromising much on accuracy while also achieving better performance than conventional models, the present invention may be implemented in any number of different computing systems, environments, and/or configurations; the embodiments are described in the context of the following exemplary systems, devices/nodes/apparatus, and methods.
  • the problems associated with the neural network models are that they take up a tremendous amount of space on the disk, with the size of the pre-trained models being huge, thereby resulting in the problem of placing such models on user devices, and that the computation cost associated with these models is very high, thereby demanding a substantial amount of performance and power from the user devices.
  • These models require 32-bit floating point operations along with multiplication operations. Thus, there is a high cost associated with them in terms of power and performance requirements.
  • the existing fixed point quantization solution does not address the problems associated with the neural networks, thereby leaving such models too large to be used on user devices. Accordingly, there is a need to reduce the storage size compared to the original model size without compromising much on the accuracy and performance of the neural networks.
  • the existing technology involves repeated quantization and de-quantization that severely affects the performance of the neural models by making them slower than the original model performance.
  • the present disclosure discloses a method and an apparatus for dynamic point quantization of neural networks for greater accuracy with lower storage requirements.
  • the present disclosure specifically focuses on quantization of deep learning models in order to reduce their storage requirements without compromising much on the accuracy and also achieving better performance when compared to the conventional models.
  • Weight analysis: unlike the conventional fixed point quantization approach, a variable bit width is employed depending upon the range of weights in each layer; the quantized weights are represented with an 8-bit or 16-bit solution, as illustrated in Figure 4.
  • Float representation: unlike the conventional fixed point quantization approach of checking the range and representing the numbers falling in a particular range by a unique number in the quantized domain, an alternative method of representation is employed. Significantly, since it is observed that all of the weights in the pre-trained model fall in a much smaller range, the weights are represented in the manner illustrated in Figure 5. The three fully connected layers contain most of the weights as compared to the five convolutional layers, and the last 1000-way softmax layer helps to classify an image into one of the 1000 categories. This completely does away with the requirement of floating point multiplication associated with the prior art. The representation of the equivalent is provided hereinbelow.
  • weight and bias parameters are converted to a lower-bit representation, i.e., 8 bits or 16 bits, depending upon the distribution of the weight and bias parameters for a particular layer.
  • 16 bits are chosen to represent the quantized weights in one case and 8 bits in the other, depending on the range. This results in the memory footprint of the model being reduced by at least 50%.
  • Figure 6 illustrates that, in the present disclosure, the weights and biases of each layer A, B, C, etc. are considered and, depending upon the range, the dynamic bit width is chosen, i.e., 8-bit or 16-bit.
  • each layer can have a different bit width depending on the range of parameters in the respective layer.
  • in the multiplication operation, only the mantissa parts are multiplied, and in integer arithmetic rather than floating point, which is a heavy operation compared to integer.
  • Figure 7 graphically illustrates the experimental result that the speed of a mathematical operation varies depending on the data size (16 bits or 32 bits) and the data type (integer or float).
  • 8-bit and 16-bit integer instructions are available and are much faster than floating point operations, as shown in the chart.
  • the floating point multiplication of the same data size is much slower than the integer multiplication.
  • for the multiplication operation, the exponential parts are added and the sign bits are XOR-ed.
  • since the mantissa part is comparatively smaller in size, the cost of multiplication is correspondingly lower.
  • the computational performance is improved.
  • the relative error in this representation is smaller compared to other techniques.
  • no de-quantization is required for this kind of representation as all the weights are quantized in the same scale or can be easily converted to higher bit representation.
  • for addition, the exponential parts are made exactly the same using a right or left shift operator on the mantissa part, after which the mantissa parts are added to obtain the final result. Therefore, the addition operation does not have much overhead compared to the existing quantization algorithms.
  • Step 1: for each weight value of a layer, and for all layers of the pre-trained neural network model, the first bit depends upon the sign of the weight value to be quantized: it is zero if the value to be quantized is positive and 1 if it is negative.
  • the exponential range for an n-bit width is calculated as 2^(n-1) - 1, i.e. by raising 2 to the power n-1 and subtracting 1 since the range starts from zero; 1 is subtracted from n in the exponent to avoid an overflow condition. The formula, where n is the bit width given to the exponent part, is: Exponential range = 2^(n-1) - 1.
  • Step 4, Offset determination: the offset value is defined for each layer of the model. It is the value taken out of the exponent part so that the parameters of the layer can fit into an 8-bit or 16-bit representation.
  • Offset = (2^(n-1) - 1) - levels + (mantissa bits)/2, where n is the bit width given to the exponent part and levels is the number of bits required to represent the entire range of layer parameters.
  • Step 5, Finding the mantissa part and exponent adjustment: each number from the layer parameters is now converted to binary form and made to fit the 10-bit format of the mantissa (in the case of a 16-bit width). If the binary form of a number from the layer parameters has more than 10 bits, the bits in the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded. If the binary form has fewer than 10 bits, zeros are appended at the end and the exponent part is adjusted by subtracting the number of zeros added.
  • For example, 150 is represented in binary as 1001 0110, so it is represented in 10 bits as 1001 0110 00; two more bits are appended at the end so that the full 10 bits are used, and the exponent part becomes 2^(-2). The exponent adjustment value is therefore -2.
  • Step 6, Finding the exponent part: the exponent representation is obtained by adding the offset and the exponent adjustment from step 5: Exponent = Offset + Exponent adjustment.
  • In the case of multiplication of any two numbers which use this quantization method, steps 1-6 remain the same; these steps represent the numbers in the floating point-like format. After step 6, the following steps are performed for multiplication.
  • Step 7, Multiplication representation: the mantissa parts of the two quantized numbers are integer multiplied and the result is stored in a 32-bit accumulator to avoid any overflow, and then normalized to 10 bits. For example, 150*150 will result, as per the above mantissa part, in 1010 1111 1100 1000 000; this is normalized to 1010 1111 11 (703 in decimal) and creates an adjustment in the exponent, since the last 9 bits are discarded. So the value of the mantissa exponent adjustment is the number of bits discarded from the mantissa.
  • the formula is indicated hereinbelow: Mantissa exponent adjustment = number of mantissa bits discarded.
  • Step 8, Final representation of the exponent: the exponent of the product is the addition of the exponents of the two quantized numbers and the number of bits discarded in step 7: Exponent of product = Exponent 1 + Exponent 2 + bits discarded.
  • Step 9, Offset calculation: the offset of the product is the addition of the offsets of the two quantized numbers.
  • Step 10, Verifying the product value: to verify that the product value is correct, the number is converted back to real number format by multiplying the mantissa by 2 raised to the power of (exponent minus offset), and the sign is applied at the end from the sign bit: Value = (-1)^sign × Mantissa × 2^(Exponent - Offset).
  • In the case of addition of any two numbers which use this quantization method, steps 1-6 remain the same; these steps represent the numbers in the floating point-like format. After step 6, the following steps are performed for addition.
  • Step 7, Normalization: to add two quantized numbers, the final exponential parts should be the same. The smaller number is normalized to the same offset and exponent as the bigger number by adjusting its mantissa part, as discussed hereinabove.
  • Step 9, Exponent determination: the exponent of the sum is the sum of the exponent of one number and the number of bits discarded in step 8.
  • Step 10, Offset determination: the offset value remains the same as that of the numbers.
  • Verification of the addition algorithm is done in the same way as for the multiplication algorithm; the equation from step 10 of the multiplication algorithm is used to verify the addition result.
  • the quantization error of the solution of the present disclosure is lower than the quantization error of the method used by TensorFlow™.
  • Table 4 shows the quantization error on quantizing 159.654 in both schemes; the improvement in the error is greater than 10 times.
  • Figure 10 illustrates a block diagram of the apparatus (100) for quantization of neural networks of the present disclosure.
  • the apparatus comprises a sign determining module (101) , an exponent range determining module (102) , a maximum range determining module (103) , an offset determining module (104) , an exponent adjustment module (105) and an exponent representation module (106) .
  • the sign determining module (101) configured to determine if an input is positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit.
  • the exponent range determining module (102) configured to determine exponential range of the input.
  • the maximum range determining module (103) configured to determine the maximum range of layer parameters of the input.
  • the offset determining module (104) configured to determine offset of a layer of the input.
  • the exponent adjustment module (105) configured to perform exponent adjustment by converting the input into a corresponding binary form.
  • the exponent representation module (106) configured to determine exponent representation of the input by adding the offset to the exponent adjustment.
  • the apparatus is further configured to perform multiplication operation and/or addition operation on any two quantized inputs.
  • the connections between the modules are illustrative only, and it will be understood that the modules are interconnected according to which output should be provided to which input to achieve the corresponding function.
  • the dynamic point quantization technique of the present disclosure ensures greater accuracy when compared to the existing fixed point quantization technique, which is evident from the example discussed hereinabove wherein a number was quantized and de-quantized back to the original number. Apart from accuracy, the dynamic point quantization technique ensures that the performance of the operation is much faster than with the fixed point quantization technique, since it does not involve transformation or representation of the number in another form, thereby doing away with all the problems associated with the fixed point quantization technique, such as the true zero problem, addition across different layers, concatenation, etc. The dynamic point quantization technique does away with the need for de-quantization, thereby enhancing the performance many-fold.
  • the disclosed apparatus, method or system may be implemented in other manners.
  • the described apparatus embodiment is merely exemplary.
  • the unit division is merely logical function division and may be other division in actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or a part of the technical solutions, may be implemented in the form of a software product.
  • the computer software product is stored in a storage medium, and includes several instructions for instructing a computer node (which may be a personal computer, a server, or a network node) to perform all or a part of the steps of the methods described in the embodiment of the present disclosure.
  • the foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.
  • Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise.
  • devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Nonlinear Science (AREA)
  • Neurology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method and an apparatus (100) are provided for dynamic point quantization of neural networks for greater accuracy and with lower storage requirements. Conventional neural network models require huge disk space, which increases the computation cost associated with these models, thereby demanding a substantial amount of performance and power from user devices. The present invention focuses on quantization of deep learning models to reduce storage requirements without compromising on accuracy while also achieving better performance when compared to conventional models. The neural network is quantized by determining if an input is a positive or negative number (901); determining the exponential range of the input (902); determining the maximum range of layer parameters of the input (903); determining the offset of a layer of the input (904); performing exponent adjustment by converting the input into its corresponding binary form (905); and determining the exponent representation of the input by adding the offset to the exponent adjustment (906).

Description

METHOD AND APPARATUS FOR QUANTIZATION OF NEURAL NETWORKS POST TRAINING

TECHNICAL FIELD
This application claims priority to Indian Patent Application No. IN201931043851, filed with the Indian Patent Office on October 30, 2019 and entitled "METHOD AND APPARATUS FOR QUANTIZATION OF NEURAL NETWORKS POST TRAINING", which is incorporated herein by reference in its entirety.
The present subject matter described herein, in general, relates to deep learning models and more particularly, it relates to a method and an apparatus for quantization of neural networks, such as weights and parameters of a stored neural network model.
BACKGROUND
Machine learning is an application of artificial intelligence (AI) that employs statistical techniques to provide computer systems the ability to "learn" with data, e.g. progressively improve performance on a specific task, without being explicitly programmed to do so. Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.
The state of the art technology in the field of machine learning includes deep neural networks, which employ computational models composed of multiple processing layers which learn representations of data, usually, extremely large volumes of data, with multiple levels of abstraction, thereby resulting in the terminology “deep learning” or “deep networks” . Deep learning, also known as “deep structured learning” or “hierarchical learning” is part of machine learning methods that are based on learning data representations, as opposed to task-specific algorithms. The “learning” can be “supervised” , “semi-supervised” or “unsupervised” .
The deep learning architectures such as “deep neural networks” , “deep belief networks” and “recurrent neural networks” have been applied to a multitude of fields including but not limited to computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation,  bioinformatics, drug design, board game programs, and the like where they have produced results comparable to and in some cases superior to human experts.
An artificial neural network (herein, “neural network” ) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems. Neural networks like people learn by example. A neural network may be configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones and the same is true of neural networks as well.
A problem associated with the existing neural network models is that they take up a tremendous amount of space on the disk, with the size of the pre-trained models being huge, thereby resulting in the problem of placing such models on edge devices or user devices such as, but not limited to, mobile phones, tablets, Raspberry Pi boards, wearable devices, and the like. Typically such devices have limited storage space and processing capability. Currently these models are stored using a 32-bit floating point data structure. An example of the sizes of typical pre-trained models is tabulated hereinbelow in Table 1. It can readily be inferred that almost all the size is taken up by the weights of the neural network connections, because these weights are different floating point numbers. For the same reason, simple compression techniques such as zip compression do not adequately compress them.
Sl. No. Model Weights
1 AlexNet 285 MB
2 VGG-16 528 MB
3 Resnet 152 220 MB
4 Extraction 90 MB
Table 1: Tabulates the space occupied by some pre-trained neural network models
Another problem associated with these neural network models is that the computation cost of these models is very high, thereby draining a substantial amount of performance and power from the user devices. Since these models require 32-bit floating point operations along with multiplication operations most of the time, there is a high cost associated with them in terms of power and performance requirements.
One approach to overcome these drawbacks associated with the neural networks is to increase the computational power of the hardware used to deploy these networks. This is a costly solution and is not feasible for large scale deployment on user devices due to size constraints. An alternative approach for large scale deployment is to implement these neural network models in fixed point, which may offer advantages in reducing memory, bandwidth, power consumption and computation time as well as the storage requirements.
For some state of the art technology, reference is made to TensorFlow™, which introduces fixed point quantization as a solution for quantizing the pre-trained models to reduce memory at the inference stage. It looks at the range of the weights, i.e. the min and max, and tries to map every weight value to an equivalent integer representative tensor of 8 bits. For example, if the minimum = -10.0 and the maximum = 30.0 with an eight-bit array, the quantized values are represented hereinbelow. While this approach saves space by 75% and improves performance due to its operation involving 8 bits, there are some real-time problems associated with this approach that are discussed in detail hereinbelow.
Quantized Float
0 -10.0
255 30.0
128 10.0
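For illustration, the min/max mapping tabulated above can be reproduced with a short Python sketch. This is not TensorFlow code; the function names and rounding behaviour are assumptions used only to make the mapping concrete.

```python
# Illustrative sketch of the affine (min/max) 8-bit quantization described
# above. Not TensorFlow code; names and rounding behaviour are assumptions.

def quantize_affine(x, min_val=-10.0, max_val=30.0, bits=8):
    """Map a float in [min_val, max_val] to an integer in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1                 # 255 for 8 bits
    scale = (max_val - min_val) / levels     # float value of one integer step
    q = round((x - min_val) / scale)
    return max(0, min(levels, q))            # clamp to the representable range

def dequantize_affine(q, min_val=-10.0, max_val=30.0, bits=8):
    """Map the integer code back to an approximate float value."""
    scale = (max_val - min_val) / ((1 << bits) - 1)
    return min_val + q * scale

print(quantize_affine(-10.0))   # 0   (the float minimum maps to code 0)
print(quantize_affine(30.0))    # 255 (the float maximum maps to code 255)
print(quantize_affine(10.0))    # 128
print(dequantize_affine(128))   # ~10.078: the round trip is lossy
print(quantize_affine(0.0))     # 64: float zero has no exact integer code
```

The last line already hints at the "true zero" drawback discussed below: with this mapping, the float value 0.0 falls between two integer codes and cannot be represented exactly.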
The drawbacks associated with TensorFlow™ are listed hereinbelow:
a) Addition between different layers: Some neural networks use a plain elementwise Addition layer type that simply adds two activation layer outputs from different layers (RESNET-32 is one such example), as illustrated in Figure 1. In RESNET-32, 32 convolutional layers of filter size 3x3 are stacked upon each other and a shortcut connection is present which adds the inputs from two activation layers. The dotted shortcut connection arrows represent inputs that are upsampled to match the size of the second addition input, whereas the solid shortcut connection arrows have addition inputs of the same dimensions. The addition operation can be done on two arrays only if their quantization scale is the same. Since this method involves different quantization ranges for different layers, two activation layers on different quantization scales cannot be matched up for the addition operation. Hence de-quantization is required to bring the values of the layers back to the original scale, perform the addition operation, and then perform quantization again. Such repeated quantization and de-quantization renders the model slow and consumes resources, thereby making the performance advantage unviable. (A short numerical sketch of this scale mismatch is given after Table 3 below.)
b) Concatenation: Fully general support for concatenation layers poses the same de-quantization and re-quantization problem as Addition layers. Rescaling values of data type uint8 would be a lossy operation, while concatenation ought to be a lossless operation, so this implementation suffers from the same problem as well.
c) True Zero Problem: In this implementation, since quantization is based on the range and increments it step-wise, the zero value is shifted, for example:
Quantized Float
0 -10.0
255 30.0
128 10.0
The float value -10.0 is now represented by zero, and the zero value is represented by some other number. This is a problem since zero is important and appears a lot in neural network operations; for example, convolutions can be padded with zeroes at the edges where filters overlap, and the relu activation function replaces any negative number with zero. For this reason the quantized operations are again de-quantized and re-quantized, as illustrated in Figure 2. The dark shaded boxes represent the de-quantization and quantization of the input data. This repeated quantization and de-quantization makes the performance of the model slower than the original model performance. Some performance data for the comparison between the quantized TensorFlow™ model and the normal model for three variants of the cifarnet image classification model is illustrated in Figure 3. The performance is measured in terms of the inference time required to classify a single image. It may be inferred from the said graph that the quantized model of TensorFlow™ is much slower than the original models, thereby defeating its very purpose. From the data tabulated in Tables 2 and 3, it may be inferred that performance on the MNIST dataset has also taken a hit.
Node Type Avg ms Avg %
Conv2D 62.098 73.109%
MatMul 16.089 18.942%
Add 4.248 5.001%
Maxpool 1.453 1.711%
Relu 1.006 1.184%
Const 0.022 0.026%
Reshape 0.008 0.009%
Retval 0.004 0.005%
Arg 0.004 0.005%
NoOp 0.004 0.005%
Identity 0.003 0.004%
Table 2: Tabulates performance of individual operations of original network on MNIST dataset
Node Type Avg ms Avg %
QuantizedConv2D 637.514 74.337%
QuantizedMatMul 188.589 21.990%
QuantizeV2 7.491 0.873%
RequantizationRange 5.128 0.598%
Dequantized 4.476 0.522%
Add 4.296 0.501%
Requantize 3.801 0.443%
QuantizedMaxPool 2.203 0.257%
QuantizedRelu 1.398 0.163%
Min 1.241 0.145%
Max 1.224 0.143%
NoOp 0.147 0.017%
Const 0.042 0.005%
Reshape 0.020 0.002%
QuantizedReshape 0.014 0.002%
Arg 0.007 0.001%
Retval 0.005 0.001%
Table 3: Tabulates the performance of the quantized operations of TensorFlow™ on the MNIST dataset
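To make the scale mismatch behind drawback a) concrete, the following sketch uses hypothetical quantization ranges (assumed values, not taken from the patent or from TensorFlow) to show why two tensors quantized over different (min, max) ranges cannot be added as raw 8-bit codes and must first be de-quantized.

```python
# Hypothetical illustration of the scale-mismatch problem in drawback a):
# two tensors quantized against different (min, max) ranges cannot be added
# as raw 8-bit codes, so de-quantization is needed first.

def dequant(q, min_val, max_val, bits=8):
    scale = (max_val - min_val) / ((1 << bits) - 1)
    return min_val + q * scale

qa, qb = 128, 128                 # both activations stored as the code 128
a = dequant(qa, -10.0, 30.0)      # layer A quantized over [-10, 30] -> ~10.08
b = dequant(qb, 0.0, 6.0)         # layer B quantized over [0, 6]    -> ~3.01

correct_sum = a + b               # ~13.09, computed after de-quantization
naive_sum = qa + qb               # 256: meaningless on either scale, overflows uint8
print(correct_sum, naive_sum)
```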
SUMMARY
This summary is provided to introduce concepts related to a method and an apparatus for quantization of neural networks, which are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining or limiting the scope of the claimed subject matter.
From the above discussion, it can be inferred that the existing fixed point quantization solution does not address the problems associated with the neural networks, thereby leaving such models too large to be used on user devices.
Accordingly, there is a need to reduce the storage size compared to the original model size without compromising much on the accuracy and performance of the neural networks. Thus, the present disclosure discloses a method, an apparatus and a system for dynamic point quantization of neural networks for greater accuracy with lower storage requirements. In addition, the time penalty associated with quantization and re-quantization is reduced.
The above-described need for dynamic point quantization of neural networks is merely intended to provide an overview of some of the shortcomings of conventional systems and techniques, and is not intended to be exhaustive. Other shortcomings with conventional systems and techniques and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description.
An objective of the present disclosure is to provide quantization of deep learning models.
Another objective of the present disclosure is to reduce the storage size when compared to original model size without compromising much on accuracy and performance of the neural networks.
Yet another objective of the present disclosure is to provide a method for quantization of neural networks.
Yet another objective of the present disclosure is to provide an apparatus for quantization of neural networks.
In particular, the present disclosure involves quantizing existing models so that their storage requirements can be reduced without compromising much on the accuracy, while also achieving better performance when compared to the existing models.
According to a first aspect of the present disclosure, there is provided a method for quantization of neural networks. The inputs to the method are the weights and biases of each layer of a pre-trained neural network. As a pre-trained neural network, these inputs may be, at least initially, related to and/or characteristic of real world data, such as e.g. image information. The method comprises the steps of determining if an input is a positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit; determining an exponential range of the input; determining a maximum range of layer parameters of the input to represent said input in a quantized form; determining an offset of a layer of the input; performing exponent adjustment by converting the input into a corresponding binary form; and determining an exponent representation of the input by adding the offset to the exponent adjustment.
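A minimal Python sketch of these six steps, for a 16-bit representation (1 sign bit, 5 exponent bits, 10 mantissa bits), is given below. It follows the description literally; the function names, the handling of zero, and the normalisation to exactly 10 mantissa bits are assumptions for illustration, not the patent's reference implementation.

```python
import math

# Minimal sketch of the six quantization steps of the first aspect, for a
# 16-bit representation: 1 sign bit, 5 exponent bits, 10 mantissa bits.
# Function names, zero handling and the 10-bit normalisation are assumptions.

EXP_BITS = 5
MANTISSA_BITS = 10

def quantize(value, layer_max_range):
    # Step 1: sign bit is 0 for a positive input, 1 for a negative input.
    sign = 0 if value >= 0 else 1
    magnitude = abs(value)

    # Step 2: exponential range for an n-bit exponent field: 2**(n-1) - 1.
    exp_range = 2 ** (EXP_BITS - 1) - 1            # 15 for 5 exponent bits

    # Step 3: levels = number of bits so that the layer's max range <= 2**levels.
    levels = math.ceil(math.log2(layer_max_range))

    # Step 4: per-layer offset = exponential range - levels + half the mantissa bits.
    offset = exp_range - levels + MANTISSA_BITS // 2

    # Step 5: mantissa and exponent adjustment: normalise the magnitude so that
    # its binary form occupies exactly 10 bits (pad with zeros or drop low bits).
    if magnitude == 0:
        mantissa, exp_adjust = 0, 0                # zero handled separately (assumption)
    else:
        exp_adjust = math.floor(math.log2(magnitude)) - (MANTISSA_BITS - 1)
        mantissa = round(magnitude / 2 ** exp_adjust)

    # Step 6: stored exponent = offset + exponent adjustment.
    exponent = offset + exp_adjust
    return sign, exponent, mantissa, offset

# Worked example from the description: 150 -> mantissa 1001011000 (600) and
# exponent adjustment -2, since 600 * 2**-2 == 150.
print(quantize(150.0, layer_max_range=256.0))      # (0, 10, 600, 12)
```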
In a first possible implementation of the method according to the first aspect, the signed bit is zero if the input is positive number while the signed bit is 1 if the input is negative number.
In a second possible implementation of the method according to the first aspect, the exponential range for an n-bit width is calculated as 2^(n-1) - 1, i.e. by raising 2 to the power n-1 and subtracting 1 since the range starts from zero; 1 is subtracted from n in the exponent to avoid an overflow condition, where n is the bit width given to the exponent part.
In a third possible implementation of the method according to the first aspect, the maximum range is determined by Range ≤ 2^Levels, where Levels is the number of bits required to represent the entire range of the layer parameters (for example, a layer whose parameters span a range of 200 requires Levels = 8, since 200 ≤ 2^8).
In a fourth possible implementation of the method according to the first aspect, the offset is determined by subtracting the levels required to represent the maximum range from the exponential range and adding half the number of mantissa bits.
In a fifth possible implementation of the method according to the first aspect, converting the input into a corresponding binary form comprises the step of determining whether the binary form of the input is 10 bits long; wherein if the binary form of the input is more than 10 bits long, the bits in the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded; and wherein if the binary form of the input is fewer than 10 bits long, zeros are appended to the end of the binary form of the input and the exponent of the input is adjusted by subtracting the number of zeros added.
In a sixth possible implementation of the method according to the first aspect, the method further comprises the step of performing a multiplication operation on any two quantized inputs.
In a seventh possible implementation of the method according to the first aspect, the method further comprises the step of performing an addition operation on any two quantized inputs.
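The multiplication and addition of two quantized inputs can be sketched as follows. The (sign, exponent, mantissa, offset) tuple used here matches the output of the quantize() sketch above for the worked example of 150; the overflow handling and the alignment of exponents are assumptions for illustration.

```python
# Sketch of the multiplication and addition of two quantized inputs, using the
# (sign, exponent, mantissa, offset) tuple produced by the quantize() sketch
# above. Overflow handling and exponent alignment details are assumptions.

MANTISSA_BITS = 10

def renormalise(acc):
    """Drop low-order bits until the value fits in 10 mantissa bits."""
    discarded = max(0, acc.bit_length() - MANTISSA_BITS)
    return acc >> discarded, discarded

def multiply(q1, q2):
    s1, e1, m1, o1 = q1
    s2, e2, m2, o2 = q2
    sign = s1 ^ s2                        # sign bits are XOR-ed
    acc = m1 * m2                         # integer multiply; a 32-bit accumulator suffices
    mantissa, discarded = renormalise(acc)
    exponent = e1 + e2 + discarded        # exponents add, plus the bits discarded
    offset = o1 + o2                      # offsets of the two numbers add
    return sign, exponent, mantissa, offset

def add(q1, q2):
    # Assumes both operands come from the same layer (same offset) and sign.
    s1, e1, m1, o = q1
    s2, e2, m2, _ = q2
    if e1 < e2:                           # align the smaller exponent to the larger
        m1, e1 = m1 >> (e2 - e1), e2
    elif e2 < e1:
        m2, e2 = m2 >> (e1 - e2), e1
    mantissa, discarded = renormalise(m1 + m2)
    return s1, e1 + discarded, mantissa, o

def to_float(q):
    """Verification: value = (-1)**sign * mantissa * 2**(exponent - offset)."""
    sign, exponent, mantissa, offset = q
    return (-1) ** sign * mantissa * 2.0 ** (exponent - offset)

q150 = (0, 10, 600, 12)                   # quantize(150.0, 256.0) from the sketch above
print(to_float(multiply(q150, q150)))     # 22496.0, close to 150 * 150 = 22500
print(to_float(add(q150, q150)))          # 300.0
```

The final print statements mirror the verification step: converting back with mantissa × 2^(exponent - offset) recovers a value close to the true product, the small difference coming from the mantissa bits discarded during re-normalisation.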
According to a second aspect of the disclosure, there is provided an apparatus for quantization of neural networks. The apparatus comprises a sign determining module, an exponent range determining module, a maximum range determining module, an offset determining module, an exponent adjustment module, and an exponent representation module. The sign determining module is configured to determine if an input is a positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit. The exponent range determining module is configured to determine the exponential range of the input. The maximum range determining module is configured to determine the maximum range of layer parameters of the input. The offset determining module is configured to determine the offset of a layer of the input. The exponent adjustment module is configured to perform exponent adjustment by converting the input into a corresponding binary form. The exponent representation module is configured to determine the exponent representation of the input by adding the offset to the exponent adjustment.
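One possible way to organise these modules in software is sketched below; the class and method names are illustrative only and simply wire together the formulas of the first aspect, rather than reproducing the patent's reference design.

```python
import math

# Illustrative composition of the apparatus modules as a single pipeline.
# The module names follow the description; the wiring and method signatures
# are assumptions, not the patent's reference design.

class QuantizationApparatus:
    def __init__(self, exp_bits=5, mantissa_bits=10):
        self.exp_bits = exp_bits
        self.mantissa_bits = mantissa_bits

    def determine_sign(self, x):                    # sign determining module
        return 0 if x >= 0 else 1

    def exponent_range(self):                       # exponent range determining module
        return 2 ** (self.exp_bits - 1) - 1

    def max_range_levels(self, layer_params):       # maximum range determining module
        max_range = max(abs(p) for p in layer_params)
        return math.ceil(math.log2(max_range))

    def layer_offset(self, layer_params):           # offset determining module
        return (self.exponent_range()
                - self.max_range_levels(layer_params)
                + self.mantissa_bits // 2)

    def exponent_adjustment(self, x):               # exponent adjustment module
        return math.floor(math.log2(abs(x))) - (self.mantissa_bits - 1)

    def exponent_representation(self, x, layer_params):  # exponent representation module
        return self.layer_offset(layer_params) + self.exponent_adjustment(x)
```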
In a first possible implementation of the apparatus according to the second aspect, the signed bit is zero if the input is positive number while the signed bit is 1 if the input is negative number.
In a second possible implementation of the apparatus according to the second aspect, the exponential range for an n-bit width is calculated as 2^(n-1) - 1, i.e. by raising 2 to the power n-1 and subtracting 1 since the range starts from zero; 1 is subtracted from n in the exponent to avoid an overflow condition, where n is the bit width given to the exponent part.
In a third possible implementation of the apparatus according to the second aspect, the maximum range is determined by Range ≤ 2^Levels, where Levels is the number of bits required to represent the entire range of the layer parameters.
In a fourth possible implementation of the apparatus according to the second aspect, the offset is determined by subtracting the levels required to represent the maximum range from the exponential range and adding half the number of mantissa bits.
In a fifth possible implementation of the apparatus according to the second aspect, converting the input into a corresponding binary form involves determining whether the binary form of the input is 10 bits long; wherein if the binary form of the input is more than 10 bits long, the bits in the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded; and wherein if the binary form of the input is fewer than 10 bits long, zeros are appended to the end of the binary form of the input and the exponent of the input is adjusted by subtracting the number of zeros added.
In a sixth possible implementation of the apparatus according to the second aspect, the apparatus is further configured to perform a multiplication operation on any two quantized inputs.
In a seventh possible implementation of the apparatus according to the second aspect, the apparatus is further configured to perform an addition operation on any two quantized inputs.
Other aspects, advantages, and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the disclosure.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The detailed description is described with reference to the accompanying figures. In the figures, the digit (s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer like features and components.
Figure 1 illustrates a schematic diagram of a RESNET architecture with 34 layers, as a prior art to the present disclosure.
Figure 2 illustrates a block diagram of the flowgraph of Tensorflow’s quantization and dequantization operation, as prior art of the present disclosure.
Figure 3 illustrates a graphical representation of performance of the Tensorflow quantized model vs the original model.
Figure 4 illustrates a block diagram of the process flow of the weight analysis, in accordance with an embodiment of the present disclosure.
Figure 5 illustrates a block diagram of the process flow of the float representation, in accordance with the present disclosure.
Figure 6 illustrates a flow diagram of how the memory footprint is distributed, in accordance with the present disclosure.
Figure 7 illustrates a graphical representation of the integer arithmetic multiplication speed vs the floating point multiplication speed.
Figure 8 illustrates a block diagram of the mapping of float to mantissa-exponent representation, in accordance with the present disclosure.
Figure 9 illustrates a flow-chart of the method of quantization of neural networks, in accordance with an embodiment of the present disclosure.
Figure 10 illustrates a block diagram of the apparatus for quantization of neural networks, in accordance with another embodiment of the present disclosure.
It is to be understood that the attached drawings are for purposes of illustrating the concepts of the disclosure and may not be to scale.
DETAILED DESCRIPTION
The following clearly describes some technical solutions in the embodiments of the present disclosure with reference to the accompanying drawings in the embodiments of the present disclosure. It will be understood by the skilled person that the described embodiments are merely a part rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts shall fall within the protection scope of the present disclosure.
The present invention can be implemented in numerous ways, such as a process, an apparatus, a system, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are stored in (non-transitory) memory or sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.
A detailed description of one or more embodiments of the disclosure is provided below along with accompanying figures that illustrate the principles of the disclosure. The disclosure is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and  equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the invention.
Although embodiments of the invention are not limited in this regard, discussions utilizing terms such as, for example, “processing, ” “computing, ” “calculating, ” “determining, ” “establishing” , “analyzing” , “checking” , or the like, may refer to operation (s) and/or process (es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that may store instructions to perform operations and/or processes.
Although embodiments of the invention are not limited in this regard, the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more” . The terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
In the present disclosure, “quantization” refers to representing a number with fewer bits than its floating point representation.
In the present disclosure, “deep learning models” refers to artificial neural networks having a depth of multiple layers.
In the present disclosure, “neural network” refers to a computational model based on the structure and functions of biological neural networks.
In the present disclosure, “de-quantization” refers to converting a quantized number to floating point representation.
In the present disclosure, “pre-trained models” refers to artificial neural networks whose weights have been trained to perform a specific task.
In the present disclosure, “weight parameter” refers to multiplication factor for the edge connecting one neuron to another neuron in artificial neural network.
In the present disclosure, “bias parameter” refers to addition factor of a neuron in artificial neural network.
In the present disclosure, “convolutional layer” refers to a layer which applies a convolutional operation on the input.
In the present disclosure, “Residual Network (RESNET) ” refers to a type of deep learning model.
In the present disclosure, “Modified National Institute of Standards and Technology (MNIST)” refers to a large dataset of handwritten digits used for training various image processing systems.
In the present disclosure, “Rectified Linear Unit (ReLU)” refers to an activation function analogous to half-wave rectification in electrical engineering.
In the present disclosure, “TensorFlow TM” refers to an open source machine learning framework.
In the present disclosure, “fixed point quantization” refers to a quantization method in which a signal value falling between two levels is assigned to one of those two levels, the total number of levels being fixed.
In the present disclosure, “weight analysis” refers to finding the range and standard deviation of layer weights.
A method and an apparatus for quantization of neural networks are disclosed. While aspects are described for quantization of deep learning models so as to reduce their storage requirements without compromising much on accuracy and to achieve better performance compared to conventional models, the present invention may be implemented in any number of different computing systems, environments, and/or configurations; the embodiments are described in the context of the following exemplary systems, devices/nodes/apparatus, and methods.
The problem associated with neural network models is that they take up a tremendous amount of space on disk, the size of the pre-trained models being huge, which makes it difficult to place such models on user devices. Furthermore, the computation cost associated with these models is very high, demanding a substantial amount of performance and power from the user devices. These models require 32-bit floating point operations along with multiplication operations, so there is a high cost associated with them in terms of power and performance requirements.
Moreover, the existing fixed point quantization solutions do not address these problems, leaving such models too large to be used on user devices. Accordingly, there is a need to reduce the storage size compared to the original model size without compromising much on the accuracy and performance of the neural networks. The existing technology involves repeated quantization and de-quantization, which severely affects the performance of the neural models by making them slower than the original models.
Thus, the present disclosure discloses a method and an apparatus for dynamic point quantization of neural networks that achieves greater accuracy with lower storage requirements. The present disclosure specifically focuses on quantization of deep learning models in order to reduce their storage requirements without compromising much on the accuracy, while also achieving better performance when compared to conventional models.
Significantly, the solution offered by the present disclosure focuses on quantizing the weights of pre-trained models. An embodiment of the solution is implemented in the steps briefly described hereinbelow:
1. Weight Analysis: Unlike the conventional fixed point quantization approach, a variable bit width is employed depending upon the range of weights in each layer; the quantized weights are represented in an 8-bit or 16-bit format, as illustrated in Figure 4.
2. Float Representation: Unlike the conventional fixed point quantization method of checking the range and representing the numbers falling in a particular range by a unique number in the quantized domain, an alternative method of representation is employed. Significantly, since it is observed that all of the weights in the pre-trained model fall within a much smaller range, the weights are represented in the manner illustrated in Figure 5. The three fully connected layers contain most of the weights as compared to the five convolutional layers, and the last 1000-way softmax layer helps to classify an image into one of the 1,000 categories. This completely does away with the requirement of floating point multiplication associated with the prior art. The equivalent representation is provided hereinbelow.
uint32 = uint16 * uint16
The result of the operation on two uint16 operands is stored in a uint32 to avoid any overflow that could be caused by the operation, and the result is then normalized back to the 8-bit or 16-bit format. A detailed description of the technical solution, with different use cases and accuracy data, is provided hereinbelow.
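By way of illustration only, a minimal Python sketch of this multiply-and-renormalize idea is given below; the function name and the 10-bit normalization width are assumptions drawn from the 16-bit format described later in this disclosure, not a definitive implementation.

```python
MANTISSA_BITS = 10  # assumed mantissa width of the 16-bit format described below


def multiply_mantissas(m1, m2):
    """Multiply two 16-bit mantissas in a 32-bit accumulator and renormalize.

    Returns (normalized mantissa, number of low-order bits discarded).
    """
    acc = (m1 & 0xFFFF) * (m2 & 0xFFFF)   # product fits in an unsigned 32-bit accumulator
    discarded = max(acc.bit_length() - MANTISSA_BITS, 0)
    return acc >> discarded, discarded


if __name__ == "__main__":
    # 150 is stored with mantissa 0b1001011000 = 600 (see step 5 below)
    print(multiply_mantissas(600, 600))   # (703, 9), matching the 150*150 example in step 7
```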
As discussed earlier, the pre-trained models of popular convolutional neural networks are too large to fit on user devices or edge devices, such as but not limited to mobile phones, tablets, Raspberry Pi boards, wearable devices, and the like, and are currently stored using a 32-bit float data structure. In the present disclosure, these weight and bias parameters are converted to a representation with fewer bits, i.e., 8 bits or 16 bits, depending upon the distribution of weights and bias parameters for a particular layer. For a layer with a large standard deviation of weights and biases, 16 bits are chosen to represent the quantized weights, and 8 bits otherwise. This results in the memory footprint of the model being reduced by at least 50%.
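A short sketch of how such a per-layer choice might be made is shown below; the threshold value and the function name are purely illustrative assumptions, since the disclosure ties the threshold to the precision required at the output.

```python
import statistics


def choose_bit_width(layer_params, std_threshold=1.0):
    """Return 16 for layers with a wide parameter distribution, else 8.

    std_threshold is an assumed value; in practice it would be set from the
    precision required at the output, as described in step 2 below.
    """
    spread = statistics.pstdev(layer_params)
    return 16 if spread > std_threshold else 8


if __name__ == "__main__":
    narrow_layer = [0.01, -0.02, 0.03, 0.00]
    wide_layer = [-2.5, 4.0, -3.1, 2.2]
    print(choose_bit_width(narrow_layer))  # 8
    print(choose_bit_width(wide_layer))    # 16
```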
Figure 6 illustrates that in the present disclosure, the weights and biases of each layer A, B, C, etc. are considered and, depending upon the range, a dynamic bit width of 8 bits or 16 bits is chosen. Thus, each layer can have a different bit width depending on the range of the parameters in the respective layer. In case of a multiplication operation, only the mantissa part is multiplied, and that in integer arithmetic rather than in float, which is a heavy operation compared to integer.
Figure 7 graphically illustrates experimental results showing that the cost of a mathematical operation varies depending on the data size (16 bits or 32 bits) and the data type (integer or float). For integer arithmetic, 8-bit and 16-bit instructions are available and are much faster compared to floating point operations, as shown in the chart. It can also be readily inferred from the graph that float multiplication of the same data size is much slower than integer multiplication.
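The kind of comparison shown in Figure 7 can be approximated on any machine with a rough timing sketch such as the one below, using NumPy and timeit; the absolute timings depend entirely on the hardware and are not the figures reported in the disclosure.

```python
import timeit

import numpy as np

# One million random operands; values are small enough that int16 products do not overflow.
a16 = np.random.randint(1, 100, size=1_000_000, dtype=np.int16)
b16 = np.random.randint(1, 100, size=1_000_000, dtype=np.int16)
a32f = a16.astype(np.float32)
b32f = b16.astype(np.float32)

int_time = timeit.timeit(lambda: a16 * b16, number=100)
float_time = timeit.timeit(lambda: a32f * b32f, number=100)
print(f"int16 multiply:   {int_time:.3f} s")
print(f"float32 multiply: {float_time:.3f} s")
```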
In the present disclosure, the exponent parts are added and the sign bits are XOR-ed. As the mantissa part is comparatively smaller in size, the cost of multiplication is lower in this case, and the computational performance is improved. The relative error of this representation is also smaller compared to other techniques. Moreover, no de-quantization is required for this kind of representation, as all the weights are quantized on the same scale or can easily be converted to a higher bit representation. In case of an addition operation, the exponent parts are made identical using right or left shifts of the mantissa part; the mantissa parts are then added to obtain the final result. Therefore, the addition operation does not carry much overhead compared to the existing quantization algorithms.
The dynamic point quantization of the present disclosure involves the following steps, explained for the 16-bit representation; an illustrative sketch of steps 1 to 6 is provided after step 6.
Step 1: For each weight value of a layer, and for all layers of the pre-trained neural network model, the first bit is set depending upon the sign of the weight value to be quantized: the sign bit is zero if the weight value is positive and 1 if the weight value is negative.
Step 2: Exponential Range determination: Find the standard deviation of the distribution of parameters in the layer. According to the precision required at the output, set a threshold for the standard deviation above which the layer parameters will be quantized to 16 bits and below which they will be quantized to 8 bits. If the layer parameters fall in the 16-bit representation, the exponent width is 5 bits, so the range is 2^4 − 1 = 15 (2^4 is considered rather than 2^5 to avoid an overflow condition). The exponential range for an n-bit exponent width is calculated by raising 2 to the power n−1 and subtracting 1, as the range starts from zero; 1 is subtracted from n in the exponent to avoid the overflow condition. The formula is provided hereinbelow, where n is the bit width given to the exponent part.
Exp Range = 2^(n−1) − 1
Step 3: Max Range determination: To represent a number in its quantized form, the range of the layer parameters should be found. The maximum between the modulus of the minimum layer parameter value and the maximum layer parameter value is the max range for a layer. The max range is used to calculate the number of bits required to represent the layer parameters. For example, if the minimum parameter value is −20 and the maximum parameter value is 150, then the maximum of 20 and 150, i.e., 150, is the max range, and 2^8 = 256 levels are needed to represent the number 150, as 2^7 represents only 128 levels. The formula is provided hereinbelow, where Levels is the number of bits required to represent the entire range of layer parameters of one layer.
Range = Number ≤ 2^Levels
Step 4: Offset determination: An offset value is defined for each layer of the model. It is the value taken out of the exponent part so that the parameters of the layer can fit into the 8-bit or 16-bit representation. The offset of a particular layer is calculated by subtracting the Levels needed to represent the max range of the layer parameters from the Exponential Range. To keep the weight values at the centre of the range of the quantization domain, half the number of mantissa bits of the chosen format (8-bit or 16-bit) is added to the offset. So for the above example, the offset is 15 (exponential range) − 8 (levels) + 5 (half of the mantissa bits) = 12. The formula is provided hereinbelow, where n is the bit width given to the exponent part and Levels is the number of bits required to represent the entire range of layer parameters.
offset = 2^(n−1) − 1 − Levels + (mantissa bits)/2
Step 5: Finding the mantissa part and exponent adjustment: Each layer parameter is now converted to binary form and made to fit the 10-bit mantissa format (in the case of the 16-bit width). If the binary form of a layer parameter has more than 10 bits, the bits at the lower binary places are discarded and the exponent part is adjusted by adding the number of bits discarded. If the binary form has fewer than 10 bits, zeros are appended at the end and the exponent part is adjusted by subtracting the number of zeros added. For example, 150 can be represented as 1001 0110, so it is represented in 10 bits as 1001 0110 00; two more bits are appended at the end so that the full 10 bits are used, and the exponent part becomes 2^(−2). So the exponent adjustment value is −2.
Step 6: Finding the exponent part: The exponent representation is obtained by adding the offset to the exponent adjustment from step 5. The formula is indicated hereinbelow:
Exp1 = offset + exponent adjustment from step 5
The exponent part is calculated for all the numbers in the operation. In the case of the example value 150, the exponent part becomes 12 + (−2) = 10.
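Putting steps 1 to 6 together, a minimal Python sketch is given below. The helper name, the use of math.frexp for the binary conversion and the rounding behaviour are assumptions made for illustration; it uses the 16-bit format (5 exponent bits, 10 mantissa bits) and the example layer range of −20 to 150 from the disclosure.

```python
import math

EXP_BITS = 5        # 16-bit format: 1 sign bit, 5 exponent bits, 10 mantissa bits
MANTISSA_BITS = 10


def quantize_value(value, layer_min, layer_max):
    """Sketch of steps 1-6 for one layer parameter (rounding edge cases ignored)."""
    sign = 0 if value >= 0 else 1                        # step 1: sign bit
    exp_range = 2 ** (EXP_BITS - 1) - 1                  # step 2: 2^(n-1) - 1 = 15
    max_range = max(abs(layer_min), abs(layer_max))      # step 3: max range of the layer
    levels = math.ceil(math.log2(max_range))             # bits covering the range (assumes max_range > 1)
    offset = exp_range - levels + MANTISSA_BITS // 2     # step 4: offset of the layer
    frac, exp = math.frexp(abs(value))                   # |value| = frac * 2^exp, 0.5 <= frac < 1
    mantissa = round(frac * (1 << MANTISSA_BITS))        # step 5: 10-bit mantissa
    exp_adjust = exp - MANTISSA_BITS                     # step 5: exponent adjustment
    exponent = offset + exp_adjust                       # step 6: exponent representation
    return {"sign": sign, "mantissa": mantissa, "exponent": exponent, "offset": offset}


if __name__ == "__main__":
    q = quantize_value(150.0, layer_min=-20.0, layer_max=150.0)
    print(q)   # {'sign': 0, 'mantissa': 600, 'exponent': 10, 'offset': 12}
```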
In case of multiplication of any two numbers that use the quantization method of the present disclosure, steps 1-6 remain the same; these steps represent the numbers in the floating point format. After step 6, the following steps are performed for multiplication.
Step 7: Multiplication representation: The mantissa parts of the two quantized numbers are integer multiplied and the result is stored in a 32-bit accumulator to avoid any overflow; the result is then normalized to 10 bits. For example, 150*150 yields, as per the above mantissa parts, 1010 1111 1100 1000 000, which is normalized to 1010 1111 11 (703 in decimal) and creates an adjustment in the exponent, since the last 9 bits are discarded. So the mantissa exponent adjustment value is the number of bits discarded from the mantissa. The formula is indicated hereinbelow:
Mantissa Exp Adjustment = number of bits discarded
Step 8: Final representation of the exponent: The exponent of the product is the sum of the exponents of the two quantized numbers and the number of bits discarded in step 7. The formula is indicated hereinbelow:
Exp of product = Exp1 + Exp2 + number of bits discarded in step 7
So in case of the 150*150 multiplication, the exponent of the product becomes 10 + 10 + 9 = 29.
Step 9: Offset calculation: The offset of the product is the sum of the offsets of the two quantized numbers.
Offset of product = offset1 + offset2
In case of 150*150, the offset of the product becomes 12 + 12 = 24.
Step 10: Verifying the product value: To verify that the product value is correct, the number is converted back to the real number format. This is done by multiplying the mantissa by 2 raised to the power of the exponent value minus the offset; the sign is applied at the end from the sign bit. The formula is indicated hereinbelow:
real value = (−1)^sign × mantissa × 2^(exp − offset)
In case of 150*150, the real value becomes 22496, which is very close to the exact product value of 22500. The quantized multiplication has therefore produced a relative error of about 0.017%.
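A corresponding sketch of the multiplication path (steps 7 to 10), reusing the representation produced by the sketch after step 6, is given below; it reproduces the 150 × 150 example, and the dictionary-based representation is an illustrative assumption rather than the claimed hardware format.

```python
MANTISSA_BITS = 10


def quantized_multiply(q1, q2):
    """Sketch of steps 7-10 for multiplication of two quantized values."""
    acc = (q1["mantissa"] * q2["mantissa"]) & 0xFFFFFFFF    # step 7: 32-bit accumulator
    discarded = max(acc.bit_length() - MANTISSA_BITS, 0)
    mantissa = acc >> discarded                             # normalize back to 10 bits
    exponent = q1["exponent"] + q2["exponent"] + discarded  # step 8
    offset = q1["offset"] + q2["offset"]                    # step 9
    sign = q1["sign"] ^ q2["sign"]                          # sign bits are XOR-ed
    return {"sign": sign, "mantissa": mantissa, "exponent": exponent, "offset": offset}


def to_real(q):
    # step 10: real value = (-1)^sign * mantissa * 2^(exp - offset)
    return (-1) ** q["sign"] * q["mantissa"] * 2.0 ** (q["exponent"] - q["offset"])


if __name__ == "__main__":
    q150 = {"sign": 0, "mantissa": 600, "exponent": 10, "offset": 12}
    prod = quantized_multiply(q150, q150)
    print(prod)            # mantissa 703, exponent 29, offset 24
    print(to_real(prod))   # 22496.0, vs the exact 22500 (about 0.017% error)
```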
In case of addition of any two numbers that use the quantization method of the present disclosure, steps 1-6 remain the same; these steps represent the numbers in the floating point format. After step 6, the following steps are performed for addition.
Step 7: Normalization: To add two quantized numbers, their final exponent parts should be the same. The smaller number is normalized to the same offset and exponent as the bigger number by adjusting its mantissa part, as discussed hereinabove.
Step 8: Addition representation: The mantissa parts of the two quantized numbers are added. The result is stored in a 32-bit accumulator and is then normalized to a 10-bit value, with a corresponding mantissa exponent adjustment value. For example, for 150+150, the mantissa addition produces 1001 0110 00 + 1001 0110 00 = 1001 0110 000. The normalization process discards the last bit, so the mantissa exponent adjustment value becomes 1.
Step 9: Exponent determination: The exponent of the sum is the sum of the exponent of one number and the number of bits discarded in step 8.
Exp of sum = Exp1 + number of bits discarded in step 8
Step 10: Offset determination: The offset value remains the same as that of the numbers.
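For completeness, a sketch of the addition path under the same assumptions is given below; it assumes both operands come from the same layer (equal offsets) and have the same sign, as in the 150 + 150 example, so sign handling is deliberately simplified.

```python
MANTISSA_BITS = 10


def quantized_add(q1, q2):
    """Sketch of the addition steps; same-layer, same-sign operands assumed."""
    assert q1["offset"] == q2["offset"], "operands from the same layer assumed"
    # step 7: align exponents by shifting the smaller operand's mantissa
    big, small = (q1, q2) if q1["exponent"] >= q2["exponent"] else (q2, q1)
    aligned = small["mantissa"] >> (big["exponent"] - small["exponent"])
    acc = (big["mantissa"] + aligned) & 0xFFFFFFFF            # step 8: 32-bit accumulator
    discarded = max(acc.bit_length() - MANTISSA_BITS, 0)
    mantissa = acc >> discarded                               # normalize back to 10 bits
    exponent = big["exponent"] + discarded                    # step 9
    return {"sign": big["sign"], "mantissa": mantissa,
            "exponent": exponent, "offset": big["offset"]}    # step 10: offset unchanged


if __name__ == "__main__":
    q150 = {"sign": 0, "mantissa": 600, "exponent": 10, "offset": 12}
    s = quantized_add(q150, q150)
    print(s)   # mantissa 600, exponent 11, offset 12
    print((-1) ** s["sign"] * s["mantissa"] * 2.0 ** (s["exponent"] - s["offset"]))  # 300.0
```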
Verification of the addition algorithm can be done in the same way as for the multiplication algorithm; the equation from step 10 of the multiplication algorithm is used to verify the addition result. The quantization error of the solution of the present disclosure is better than the quantization error of the method used by TensorFlow TM. Table 4 shows the quantization error when quantizing 159.654 in both schemes; the error is improved by more than a factor of 10.
[Table 4 — Quantization error from experiment for both schemes (table image not reproduced)]
With this scheme, a maximum number of 2^10 is represented with an accuracy of ±0.5; for any number larger than this, the distance between representable floating point numbers is greater than 0.5. The largest number that can be represented using this scheme is 2^171 − 1.
Figure 10 illustrates a block diagram of the apparatus (100) for quantization of neural networks of the present disclosure. The apparatus comprises a sign determining module (101), an exponent range determining module (102), a maximum range determining module (103), an offset determining module (104), an exponent adjustment module (105) and an exponent representation module (106). The sign determining module (101) is configured to determine if an input is a positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit. The exponent range determining module (102) is configured to determine the exponential range of the input. The maximum range determining module (103) is configured to determine the maximum range of layer parameters of the input. The offset determining module (104) is configured to determine the offset of a layer of the input. The exponent adjustment module (105) is configured to perform exponent adjustment by converting the input into a corresponding binary form. The exponent representation module (106) is configured to determine the exponent representation of the input by adding the offset to the exponent adjustment. The apparatus is further configured to perform a multiplication operation and/or an addition operation on any two quantized inputs. The connections between the modules are illustrative only, and it will be understood that the modules are interconnected according to which output should be provided to which input to achieve the corresponding function.
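A schematic software analogue of this module structure is sketched below; the class and method names are illustrative assumptions that simply chain the quantization steps described above, not the claimed hardware arrangement.

```python
import math


class QuantizationApparatus:
    """Illustrative chaining of modules 101-106 for the 16-bit format."""

    def __init__(self, exp_bits=5, mantissa_bits=10):
        self.exp_bits = exp_bits
        self.mantissa_bits = mantissa_bits

    def sign(self, value):                              # sign determining module (101)
        return 0 if value >= 0 else 1

    def exponent_range(self):                           # exponent range determining module (102)
        return 2 ** (self.exp_bits - 1) - 1

    def max_range_levels(self, layer_min, layer_max):   # maximum range determining module (103)
        return math.ceil(math.log2(max(abs(layer_min), abs(layer_max))))

    def offset(self, levels):                           # offset determining module (104)
        return self.exponent_range() - levels + self.mantissa_bits // 2

    def exponent_adjustment(self, value):                # exponent adjustment module (105)
        frac, exp = math.frexp(abs(value))               # binary form via frexp (assumption)
        mantissa = round(frac * (1 << self.mantissa_bits))
        return mantissa, exp - self.mantissa_bits

    def exponent(self, offset, adjustment):              # exponent representation module (106)
        return offset + adjustment


if __name__ == "__main__":
    app = QuantizationApparatus()
    levels = app.max_range_levels(-20.0, 150.0)
    off = app.offset(levels)
    mant, adj = app.exponent_adjustment(150.0)
    print(app.sign(150.0), mant, app.exponent(off, adj), off)   # 0 600 10 12
```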
The dynamic point quantization technique of the present disclosure ensures greater accuracy when compared to the existing fixed point quantization technique, which is evident from the example discussed hereinabove, wherein a number was quantized and de-quantized back to the original number. Apart from accuracy, the dynamic point quantization technique ensures that the operations run much faster than with the fixed point quantization technique, since it does not involve transformation or representation of the number in another form, thereby doing away with the problems associated with the fixed point quantization technique, such as the true zero problem, addition across different layers, concatenation, etc. The dynamic point quantization technique also does away with the need for de-quantization, thereby enhancing performance manifold.
Some of the non-limiting advantages of the present disclosure are mentioned hereinbelow:
● It reduces the storage requirements associated with conventional neural networks without compromising too much on accuracy.
● It provides better and faster performance when compared to the existing models.
● It makes it possible to implement neural networks on user devices.
● It saves disk space by up to 75% when compared to existing models.
● It results in the memory footprint of the model being reduced by at least 50%.
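These savings follow directly from the bit widths involved; the back-of-envelope check below is illustrative arithmetic only, not data from the disclosure's experiments.

```python
# Converting 32-bit weights to 16-bit or 8-bit representations.
for target_bits in (16, 8):
    saving = 1 - target_bits / 32
    print(f"32-bit -> {target_bits}-bit: {saving:.0%} smaller")
# 32-bit -> 16-bit: 50% smaller
# 32-bit -> 8-bit: 75% smaller
```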
A person skilled in the art may understand that any known or new algorithms be used for the implementation of the present disclosure. However, it is to be noted that, the present disclosure provides a method and an apparatus for quantization of neural networks to achieve the above mentioned benefits and technical advancement irrespective of using any known or new algorithms.
A person of ordinary skill in the art may be aware that in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular applications and design constraint conditions of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present disclosure.
It may be clearly understood by a person skilled in the art that for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, method or system may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented  through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer node (which may be a personal computer, a server, or a network node) to perform all or a part of the steps of the methods described in the embodiment of the present disclosure. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.
Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.
When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate) , it will be readily apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the disclosure need not include the device itself.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the disclosure be limited not by this detailed description, but rather by any claims  that issue on an application based here on. Accordingly, the disclosure of the embodiments of the disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.
With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
Although implementations for a method and an apparatus for quantization of neural networks to reduce their storage requirements without compromising too much on accuracy and achieving better performance than the existing models have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of quantization of deep learning models.

Claims (18)

  1. A method for quantization of neural networks, said method comprising the steps of:
    determining (S901) , by an apparatus (100) , if an input is positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit;
    determining (S902) , by the apparatus (100) , an exponential range of the input;
    determining (S903) , by the apparatus (100) , a maximum range of layer parameters of the input to represent said input in a quantized form;
    determining (S904) , by the apparatus (100) , offset of a layer of the input;
    performing (S905) , by the apparatus (100) , exponent adjustment by converting the input into a corresponding binary form; and
    determining (S906) , by the apparatus (100) , exponent representation of the input by adding the offset to the exponent adjustment value.
  2. The method as claimed in claim 1, wherein the signed bit is zero if the input is positive number; and wherein the signed bit is 1 if the input is negative number.
  3. The method as claimed in any of claims 1-2, wherein exponential range is determined by
    Exp Range = 2^(n−1) − 1
    where n is the bit width.
  4. The method as claimed in any of claims 1-3, wherein the maximum range is determined by
    Range = Number ≤ 2^Levels
    where Levels is the number of bits required to represent the entire range of the layer parameters.
  5. The method as claimed in any of claims 1-4, wherein the offset is determined by subtracting the levels required to represent the max range from the exponential range and adding half the mantissa bits to the offset.
  6. The method as claimed in any of claims 1-5, wherein converting the input into a corresponding binary form comprises the step of determining if the binary form of the input is 10 bits long; wherein if the bits of the binary form of the input are more than 10 bits long, the bits from the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded; and wherein if the bits of the binary form of the input are fewer than 10 bits long, zeros are appended towards the end of the binary form of the input and the exponent of the input is adjusted by subtracting the number of zeros added.
  7. The method as claimed in any of claims 1-6, further comprising a step of performing multiplication operation on any two quantized inputs.
  8. The method as claimed in any of claims 1-7, further comprising the step of performing addition operation on any two quantized inputs.
  9. An apparatus (100) for quantization of neural networks, said apparatus comprising:
    a sign determining module (101) configured to determine if an input is positive or negative number, wherein the input is a floating point number comprising a signed bit, at least one exponent bit and at least one mantissa bit;
    an exponent range determining module (102) configured to determine exponential range of the input;
    a maximum range determining module (103) configured to determine maximum range of layer parameters of the input;
    an offset determining module (104) configured to determine offset of a layer of the input;
    an exponent adjustment module (105) configured to perform exponent adjustment by converting the input into a corresponding binary form; and
    an exponent representation module (106) configured to determine exponent representation of the input by adding the offset to the exponent adjustment.
  10. The apparatus as claimed in claim 9, wherein the signed bit is zero if the input is positive number; and wherein the signed bit is 1 if the input is negative number.
  11. The apparatus as claimed in any of claims 9-10, wherein exponential range is determined by
    Exp Range = 2^(n−1) − 1
    where n is the bit width.
  12. The apparatus as claimed in any of claims 9-11, wherein the maximum range is determined by
    Range = Number ≤ 2^Levels
    where Levels is the number of bits required to represent the entire range of the layer parameters.
  13. The apparatus as claimed in any of claims 9-12, wherein the offset is determined by subtracting the levels required to represent the max range from the exponential range and adding half the mantissa bits to the offset.
  14. The apparatus as claimed in any of claims 9-13, wherein the exponent adjustment module (105) is configured to determine if the binary form of the input is 10 bits long; wherein if the bits of the binary form of the input are more than 10 bits long, the bits from the lower binary places are discarded and the value is adjusted in the exponent part by adding the number of bits discarded; and wherein if the bits of the binary form of the input are fewer than 10 bits long, zeros are appended towards the end of the binary form of the input and the exponent of the input is adjusted by subtracting the number of zeros added.
  15. The apparatus as claimed in any of claims 9-14, further configured to perform multiplication operation on any two quantized inputs.
  16. The apparatus as claimed in any of claims 9-15, further configured to perform addition operation on any two quantized inputs.
  17. A neural network model in which weights of a layer are stored in 8 or 16 bit floating point representation obtained by the method of any of claims 1-8.
  18. A method for quantizing a neural network, comprising a graph file and a parameter file, by using the method of any of claims 1-8.