WO2023059215A1 - Apparatus and method for winograd convolution - Google Patents

Apparatus and method for Winograd convolution

Info

Publication number
WO2023059215A1
Authority
WO
WIPO (PCT)
Prior art keywords
winograd
tensor
balancing
floating point
filter
Prior art date
Application number
PCT/RU2021/000416
Other languages
French (fr)
Other versions
WO2023059215A8 (en)
Inventor
Vladimir Maximovich CHIKIN
Vladimir Mikhailovich KRYZHANOVSKIY
Alexandr Alexandrovich ZURUEV
Yury Alexandrovich PARFENOV
Original Assignee
Huawei Technologies Co., Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd filed Critical Huawei Technologies Co., Ltd
Priority to PCT/RU2021/000416 priority Critical patent/WO2023059215A1/en
Publication of WO2023059215A1 publication Critical patent/WO2023059215A1/en
Publication of WO2023059215A8 publication Critical patent/WO2023059215A8/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0495 - Quantised networks; Sparse networks; Compressed networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to an apparatus and a method for processing a matrix in the field of artificial intelligence.
  • the disclosure relates to an apparatus and a method for performing a convolution in an artificial neural network.
  • Artificial neural networks (ANNs) are often used for performing various tasks, such as image processing, speech recognition, robotics, and big data analysis.
  • An ANN usually involves massive data processing, such as matrix convolution.
  • a matrix convolution may be seen as a process of adding each element of an input matrix to its local neighbours, weighted by a kernel matrix (or filter). Therefore, the matrix convolution normally includes matrix addition and multiplication.
  • the matrix convolution is often used in a convolutional neural network (CNN).
  • in 1990, Don Coppersmith and Shmuel Winograd developed an algorithm for performing matrix multiplication, which is often referred to as the “Coppersmith-Winograd algorithm”, or simply, the “Winograd algorithm”.
  • the matrix convolution performed based on the Winograd algorithm may be referred to as “Winograd-based convolution”, or simply “Winograd convolution”.
  • Winograd convolution is widely used in ANNs to reduce the computational complexity, e.g., by reducing the number of multiplications.
  • two operands of a convolution, e.g. an input and a filter, are transformed into a so-called “Winograd domain”, and a transformed output is obtained in the Winograd domain.
  • this transformed output is transformed back to a normal domain (sometimes also referred to as a spatial domain).
  • Transformation matrices B, G and A depend on specific configurations of the Winograd convolution and have different values commonly known in the art for different configurations of the Winograd convolution. For efficient implementation and execution, it is often desired to perform Winograd convolution with integer data.
  • a floating point neural network is often quantized to an integer neural network.
  • the floating point neural network is a neural network comprising floating point parameters, such as inputs and filters.
  • the integer neural network is a neural network comprising only integer parameters.
  • a quantized 8-bit integer (INT8) neural network can achieve accuracy comparable to that of a 32-bit floating point (FP32) neural network.
  • by quantizing an FP32 model to an INT8 model, model sizes can be reduced by a factor of four compared to the FP32 model.
  • calculations can be accelerated for quantized integer neural networks on processors compared to their floating point counterparts.
  • the speedup can be further improved. Overall, quantization may bring improvements including model compression and latency reduction.
  • Quantization of a neural network with a Winograd algorithm may lead to a significant drop in the performance of the neural network in some scenarios.
  • an actual neural network can have thousands of parameters.
  • if an INT8 neural network is desired, then only $2^8 = 256$ integer values (from -128 to 127) can be used. This is often much less than the range of a floating point neural network.
  • if the floating point neural network has a range of 0.0001 to 0.1 (i.e., [0.0001, 0.1]), it may be scaled into 10000 × [0.0001, 0.1] = [1, 1000].
  • however, 256 is much less than 1000. Therefore, rounding may be performed and a rounding error may be introduced. This may lead to information loss and may jeopardize the accuracy of the quantized neural network.
  • further, since a Winograd convolution involves a preliminary calculation of parameters in order to obtain transformed inputs and transformed filters in the Winograd domain, a neural network with Winograd convolution is much more vulnerable to errors introduced by quantization.
  • Apparatus and methods according to this disclosure facilitate performing a Winograd convolution of a neural network in a robust and efficient manner. This and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the drawings.
  • a balancing tensor is used to balance channel ranges of inputs and weights at each layer of a neural network before quantization.
  • a direct algorithm for calculating the balancing tensor is disclosed, which exploits the distributions of inputs and weights at each layer of a neural network.
  • a first aspect of the present disclosure provides an apparatus for performing Winograd-based convolution of a floating point neural network.
  • the apparatus is configured to generate a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels.
  • the apparatus is configured to generate a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels.
  • a tensor may be a data structure used in neural networks in the field of artificial intelligence to carry a specific amount of information.
  • the tensor may be: a 0-dimensional (0-D) array, such as a single number; a 1-dimensional (1-D) array, such as a vector; a 2-dimensional (2-D) array, such as a matrix; a 3-dimensional (3-D) array, such as data representing an RGB image; or an array with a higher (larger than three) dimensional structure.
  • the apparatus may be configured to transform a tensor with a specific dimension into another dimension.
  • a vector may be transformed into a matrix, while a matrix may also be transformed into a vector. This may be useful for satisfying different requirements of inputs required by different configurations of the Winograd convolution.
  • a tensor (i.e., the input or the filter tensor) may comprise one or more channels.
  • a channel may be used to transmit information from a certain aspect. That is, a channel may have a certain capacity for transmitting information.
  • the number of the one or more channels is the depth of the tensor involved in the convolution.
  • all channels of a tensor may share a same size.
  • an N×M pixel RGB image may also be represented by a 2D (N×M) tensor with three channels: red, green and blue.
  • the Winograd convolution may be performed at one or more hidden layers of the floating point neural network.
  • the original input tensor may be or may be part of an input applied to a hidden layer.
  • the input applied to a hidden layer may be an output of a previous layer.
  • the apparatus may be configured to apply Winograd transformation on the original input tensor.
  • the original input tensor may be split into several tiles that are suitable for performing the Winograd convolution. Then, the apparatus may be configured to transform each tile into the Winograd input tensor.
  • the apparatus may be configured to apply Winograd transformation to the original filter tensor in order to obtain the Winograd filter tensor.
  • the filter tensor of the floating point neural network may be weights of neurons at each hidden layer of the floating point neural network. Therefore, the filter tensor may also be referred to as a weight tensor.
  • a floating point channel may be a channel comprising at least one element that is a floating point value.
  • the apparatus is configured to determine a balancing tensor based on the Winograd input tensor and the Winograd filter tensor.
  • the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels.
  • the balancing tensor may comprise one or more balancing coefficients.
  • the one or more balancing coefficients, the one or more first channels, and the one or more second channels may be in a one-to-one correspondence.
  • the apparatus may be configured to balance each first channel and each second channel based on a corresponding balancing coefficient.
  • the apparatus may be configured to divide the Winograd input tensor by the balancing tensor and multiply the Winograd filter tensor by the balancing tensor.
  • the apparatus may be configured to multiply the Winograd input tensor by the balancing tensor and divide the Winograd filter tensor by the balancing tensor. In this way, the balancing tensor may be canceled afterwards when Winograd multiplication of the Winograd input tensor and the Winograd filter tensor is performed. Therefore, no additional operation is introduced.
  • the apparatus is configured to determine a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor.
  • the first and the second scale factors are adapted to quantize the one or more first balanced floating point channels and the one or more second balanced floating point channels into one or more first integer channels and one or more second integer channels, respectively.
  • the quantization errors may be reduced to a minimum.
  • the apparatus is configured to perform the Winograd convolution based on the balancing tensor, the first scale factor, and the second scale factor.
  • the apparatus may be configured to obtain a balanced and quantized Winograd input tensor and a balanced and quantized Winograd filter tensor based on the balancing tensor, the first scale factor, and the second scale factor. Then, the apparatus may be configured to perform Winograd multiplication based on the balanced and quantized Winograd input tensor and the balanced and quantized Winograd filter tensor.
  • channel ranges of the Winograd filter and input tensors can be balanced while the number of operations of a Winograd convolution based thereon is equivalent to that of the conventional Winograd convolution. Further, quantization errors can be reduced because of the balanced channel ranges. In this way, the precision of the Winograd convolution according to the present disclosure can be increased.
  • the balancing of the Winograd filter and input tensors can be compatible with various quantization and training techniques in the art, such as post-training quantization and quantization aware training.
  • the balancing of the Winograd filter and input tensors can be universal, because the balancing does not depend on any specific type of the Winograd convolution, such as bit width, quantization scheme, scale type and so on. Therefore, the balancing can be applied to a Winograd algorithm of any type.
  • the floating point neural network may be a trained neural network.
  • the apparatus may be configured to use the trained neural network for image processing such as image classification and image feature extraction.
  • the apparatus may be further configured to obtain an image or a feature map of the image as the original input.
  • the apparatus may be configured to determine the balancing tensor by minimizing quantization loss generated during the determining of the first scale factor and the second scale factor. Then, the apparatus may be configured to process the image or the feature map of the image by performing the Winograd convolution.
  • the apparatus may be configured to split the image or the feature map of the image into multiple tiles.
  • Each tile may still be considered as an original input, because values comprised therein are not altered. Therefore, each tile may still carry original information.
  • the feature map of the image may be obtained by the apparatus as an output of a hidden layer comprised in the floating point neural network.
  • the first floating point channel and the second floating point channel may be in a one-to-one correspondence.
  • the balancing tensor may comprise one or more balancing coefficients.
  • the apparatus may be configured to determine each balancing coefficient based on a quantization range of each first floating point channel and a quantization range of each corresponding second floating point channel.
  • a quantization range of a channel may be understood as a range between the maximum value and the minimum value of the channel.
  • the apparatus may be configured to determine each balancing coefficient based on the following equation: $b_k = \sqrt{r_k^V / r_k^U}$ (1), wherein $b_k$ is a balancing coefficient for channel k, $r_k^V$ is a quantization range of channel k of the one or more first floating point channels, $r_k^U$ is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
  • alternatively, the apparatus may be configured to determine each balancing coefficient based on the following equation: $b_k = \sqrt{r_k^U / r_k^V}$ (2).
  • the apparatus can be configured to obtain each balancing coefficient according to equation (1) or (2) in a simple and direct manner.
  • the apparatus may be configured to obtain a set of sample inputs.
  • the set of sample inputs may be a part of a complete set of inputs that are to be applied to the trained floating point neural network in the inference phase.
  • the set of sample inputs may be similar to a set of inputs that are to be applied to the trained floating point neural network in the inference phase.
  • the apparatus may be configured to determine each balancing coefficient for each channel based on the sample inputs, and apply each determined balancing coefficient to each corresponding channel for later input(s).
  • for performing the Winograd convolution, the apparatus may be configured to: divide the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor; and multiply the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor.
  • alternatively, the apparatus may be configured to: multiply the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor; and divide the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor.
  • the apparatus may be further configured to: combine the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor.
  • a second aspect of the present disclosure provides a computer-implemented method for performing Winograd convolution of a floating point neural network.
  • the method comprises the following steps:
- generating a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels;
- generating a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels;
- determining a balancing tensor based on the Winograd input tensor and the Winograd filter tensor, wherein the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels;
- determining a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor, wherein the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels into one or more first integer channels and one or more second integer channels, respectively; and
- performing the Winograd convolution based on the balancing tensor, the first scale factor, and the second scale factor.
  • the floating point neural network may be a trained neural network.
  • the trained neural network may be used for image processing, such as image classification and image feature extraction.
  • the method may further comprise: obtaining an image or a feature map of the image as the original input; determining the balancing tensor by minimizing quantization loss generated during the determining of the first scale factor and the second scale factor; and processing the image or the feature map of the image by performing the Winograd convolution.
  • the first floating point channel and the second floating point channel may be in a one-to-one correspondence, and the balancing tensor may comprise one or more balancing coefficients.
  • the determining of the balancing tensor may comprise: determining each balancing coefficient based on a quantization range of each first floating point channel and a quantization range of each corresponding second floating point channel.
  • each balancing coefficient may be determined based on the following equation: $b_k = \sqrt{r_k^V / r_k^U}$ (1), wherein $b_k$ is a balancing coefficient for channel k, $r_k^V$ is a quantization range of channel k of the one or more first floating point channels, $r_k^U$ is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
  • alternatively, the determining of each balancing coefficient may be based on the following equation: $b_k = \sqrt{r_k^U / r_k^V}$ (2).
  • the method may further comprise obtaining a set of sample inputs.
  • the set of sample inputs may be a part of a complete set of inputs that are to be applied to the trained floating point neural network in the inference phase.
  • the set of sample inputs may be similar to a set of inputs that are to be applied to the trained floating point neural network in the inference phase.
  • the method may further comprise determining each balancing coefficient for each channel based on the sample inputs, and applying each determined balancing coefficient to each corresponding channel for later input(s).
  • the performing of the Winograd-based convolution may comprise the following steps: dividing the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor; and multiplying the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor.
  • alternatively, the performing of the Winograd-based convolution may comprise the following steps: multiplying the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor; and dividing the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor.
  • in either case, the quantized Winograd input tensor comprises the one or more first integer channels, and the quantized Winograd filter tensor comprises the one or more second integer channels.
  • the method may further comprise: combining the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor.
  • a third aspect of the present disclosure provides a computer program product comprising a program code for performing the method according to the second aspect or any implementation form thereof, when executed on a computer.
  • a fourth aspect of the present disclosure provides a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to any one of the second aspect or any implementation form thereof.
  • a fifth aspect of the present disclosure provides a chipset comprising instructions which, when executed by the chipset, cause the chipset to carry out the method according to any one of the second aspect or any implementation form thereof.
  • FIG. 1 shows an example of a Winograd convolution performed by an apparatus
  • FIG. 2 shows an example of an apparatus for performing a Winograd convolution
  • FIG. 3 shows a method for performing a Winograd convolution
  • FIG. 4 shows an application scenario
  • FIG. 5A-5C show results based on different methods for performing convolution
  • FIG. 6 shows an illustrative example of a conventional Winograd convolution.
  • a framework for performing a Winograd convolution in a neural network is provided.
  • improvements are introduced based on the conventional Winograd convolution illustrated in FIG. 6.
  • a solution for reducing errors generated by quantization of the neural network is provided.
  • quantization may be referred to as a process of mapping a set of input values to a smaller set of values.
  • the conversion of floating point numbers to fixed point (e.g., integer) numbers may be a process of quantization.
  • embodiments disclosed herein may be applied to a neural network that is adapted for image processing, such as image classification, image enhancement, and image recognition.
  • image processing such as image classification, image enhancement, and image recognition.
  • information loss of the image processing based on the neural network according to embodiments disclosed herein may be reduced.
  • a neural network may be referred to as a neural network model or simply as a model.
  • a Winograd parameter shall be understood as a parameter in the Winograd domain.
  • FIG. 1 shows an example of a Winograd convolution performed by an apparatus for a floating point neural network.
  • the Winograd convolution comprises four phases: a transformation phase 101, a balancing phase 102, a quantization phase 103, and an output phase 104.
  • an original input tensor and a filter tensor of the neural network are transformed into a Winograd input tensor, denoted as V, and a Winograd filter tensor, denoted as U, respectively.
  • the input tensor is denoted as X in the present disclosure and is exemplarily shown as a 4x4 input tile in FIG. 1.
  • the filter tensor of the neural network (also referred to as an original filter tensor) is denoted as W in the present disclosure and is exemplarily shown as a 3x3 filter tile in FIG. 1.
  • the apparatus may be configured to apply the following equations: $V = B^T X B$ (3), and $U = G W G^T$ (4).
  • the Winograd input tensor comprises one or more first floating point channels
  • the Winograd filter tensor comprises one or more second floating point channels.
  • In FIG. 1, it is exemplarily shown that the original input tensor X and the filter tensor W both comprise three channels, and operations according to equations (3) and (4) may be performed for each channel comprised therein. Then, each channel may be carried over respectively into the Winograd domain during the transformation according to equations (3) and (4).
  • the neural network may be a trained neural network.
  • the trained neural network may be configured to perform image processing such as image classification and image feature extraction.
  • the original input tensor may be or may be part of an image.
  • the original input tensor may be or may be part of a feature map of an image.
  • the feature map may be obtained at a hidden layer of the neural network.
  • the feature map may be an output of a previous layer at a certain hidden layer of the neural network.
  • the original filter tensor may be dependent upon the hidden layer. That is, the neural network may comprise different original filter tensors at different hidden layers.
  • the apparatus may perform one or more Winograd convolutions at any layer comprised in the neural network. Therefore, the original input tensor and the original filter tensor may be associated with each layer comprised in the neural network.
  • the original input tensor and the original filter tensor are transformed into the Winograd domain, where transformation matrices B and G may be used to facilitate the transformation.
  • Any input tensor not transformed into the Winograd domain may be referred to as an original input. That is, elements of the original input tensor are not altered and may still reflect original true data. Details regarding transformation matrices of B and G may depend on a specific Winograd algorithm applied to the Winograd convolution.
  • the transformation matrices B and G for an F(2×2, 3×3) Winograd convolution with a 4x4 input tile and a 3x3 filter tile may be as follows: $B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}$, $G = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 1/2 \\ 1/2 & -1/2 & 1/2 \\ 0 & 0 & 1 \end{bmatrix}$. It is noted that an F(m×n, r×s) Winograd convolution may denote a 2D Winograd convolution used to compute an m×n feature map with r×s filters. Other types of Winograd convolutions, such as an F(4×4, 3×3) and an F(6×6, 3×3) Winograd convolution, may also be commonly used in the art, in particular for image processing.
  • F(2×2, 3×3), F(4×4, 3×3) and F(6×6, 3×3) Winograd convolutions may simply be referred to as F(2,3), F(4,3) and F(6,3) Winograd convolutions, especially for image processing where a 2D convolution may be considered as default.
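For illustration, the following Python sketch (a minimal, hedged example, not part of the publication) checks that the F(2,3) transforms with the standard matrices above reproduce a direct 2x2 convolution output; the values of A are likewise the ones commonly used in the art and are assumed here.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd matrices as commonly known in the art
# (assumed values; the publication only names B, G and A).
Bt = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))   # 4x4 input tile
W = rng.standard_normal((3, 3))   # 3x3 filter tile

V = Bt @ X @ Bt.T                 # equation (3): V = B^T X B
U = G @ W @ G.T                   # equation (4): U = G W G^T
Y = At @ (U * V) @ At.T           # inverse transform of the elementwise product

# Reference: direct "valid" 2x2 convolution (CNN-style correlation)
Y_ref = np.array([[np.sum(X[i:i + 3, j:j + 3] * W) for j in range(2)]
                  for i in range(2)])
assert np.allclose(Y, Y_ref)
```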
  • the Winograd filter tensor U may have a dimension (C,K,a,a), and the Winograd input tensor V may have a dimension (P,C,a,a), where C is the number of input channels, K is the number of output channels, P is the number of tiles of a complete input, and (a, a) determines the dimension of the Winograd domain.
  • the apparatus may be configured to split a complete original input tensor into a plurality of tiles.
  • Each tile may comprise a part of the original input tensor.
  • the size of each tile may be based on the size of the Winograd input tensor of the specific Winograd algorithm applied to the Winograd convolution.
  • each tile may be equivalent to the original input, because it also carries original data that is not transformed into the Winograd domain.
  • the 4x4 input tile in FIG. 1 may be an image segment that is split from a complete image with a dimension of 80x80 (pixels).
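As a sketch of this tiling step (illustrative names and shapes; border padding is omitted), a (C, H, W) input can be split into overlapping 4x4 tiles with stride 2, since each tile of an F(2,3) Winograd convolution yields a 2x2 output patch:

```python
import numpy as np

def split_into_tiles(x, tile=4, stride=2):
    """Split a (C, H, W) input into (P, C, tile, tile) overlapping tiles."""
    C, H, W = x.shape
    tiles = [x[:, i:i + tile, j:j + tile]
             for i in range(0, H - tile + 1, stride)
             for j in range(0, W - tile + 1, stride)]
    return np.stack(tiles)

image = np.random.rand(3, 80, 80)   # e.g., an 80x80 RGB image
tiles = split_into_tiles(image)
print(tiles.shape)                  # (1521, 3, 4, 4): P = 39 * 39 tiles
```

Transforming each tile then yields a Winograd input tensor V of dimension (P, C, a, a), matching the shapes stated above.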
  • the apparatus is configured to determine a balancing tensor, denoted as b in the present disclosure.
  • the balancing tensor b may have a dimension (C, a, a).
  • the balancing tensor b may be used to balance channel ranges of the Winograd input and filter tensors.
  • the precision of channel k in Winograd domain (i, j) of the Winograd filter tensor U may be denoted as $p_{ij}^k = r_{ij}^k / R_{ij}$, where $r_{ij}^k$ is the quantization range of channel k in Winograd domain (i, j) of U, and $R_{ij}$ is the quantization range of all channels in the Winograd domain of U.
  • a quantization range may be referred to as a range between a minimum value and a maximum value of a domain before quantization.
  • the precision of channel k in Winograd domain (i, j) of the Winograd input tensor V may be denoted as $s_{ij}^k = t_{ij}^k / T_{ij}$, where $t_{ij}^k$ is the quantization range of channel k in Winograd domain (i, j) of V, and $T_{ij}$ is the quantization range of all channels in the Winograd domain of V.
  • the apparatus may be configured to obtain parameters such as $s_{ij}^k$ by obtaining a set of sample data without training.
  • An optimal balancing tensor b may be used to achieve a maximized total precision of all channels. That is, the apparatus may be configured to determine the balancing tensor b such that the total precision of all channels, $\sum_{i,j,k} p_{ij}^k \cdot s_{ij}^k$, is maximized.
  • the apparatus may be configured to determine the balancing tensor b as: $b_{ij}^k = \sqrt{t_{ij}^k / r_{ij}^k}$ (8), wherein $b_{ij}^k$ denotes each balancing coefficient for channel k in Winograd domain (i, j) comprised in the balancing tensor b.
  • each balancing coefficient in a same Winograd domain may also be denoted as: $b^k = \sqrt{t^k / r^k}$ (9), where $t^k$ and $r^k$ are the quantization ranges of channel k of V and U, respectively, over the whole Winograd domain.
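Under the square-root-of-range-ratio reconstruction of equations (8), (10) and (11) used here, the balancing computation can be sketched as follows; the shapes follow the (C, K, a, a) and (P, C, a, a) conventions stated above, and the final assertion checks that each channel's input and filter ranges coincide after balancing:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((8, 16, 4, 4)) * 0.01   # (C, K, a, a) Winograd filters
V = rng.standard_normal((64, 8, 4, 4)) * 10.0   # (P, C, a, a) Winograd inputs

t = np.ptp(V, axis=0)    # (C, a, a): quantization range of each input channel
r = np.ptp(U, axis=1)    # (C, a, a): quantization range of each filter channel
b = np.sqrt(t / r)       # balancing tensor of dimension (C, a, a)

V_b = V / b              # equation (10)
U_b = U * b[:, None]     # equation (11), broadcast over the K axis

# After balancing, the per-channel ranges of input and filter are equal:
assert np.allclose(np.ptp(V_b, axis=0), np.ptp(U_b, axis=1))
```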
  • the apparatus may be configured to obtain a set of sample inputs.
  • the set of sample inputs may be a part (e.g., 5.0-20.0 %) of a complete set of inputs that are to be applied to the trained floating point neural network in the inference phase.
  • the set of sample inputs may be similar to a set of inputs to be applied to the trained floating point neural network in the inference phase.
  • the apparatus may be configured to determine each balancing coefficient for each channel based on the sample inputs, and apply each determined balancing coefficient to each corresponding channel for later input(s).
  • a complete set of inputs to be applied to the trained floating point neural network may comprise 100 images, each image comprising three channels (RGB channels).
  • the apparatus may be configured to obtain 5 to 20 images from the 100 images as the sample inputs, and determine the balancing tensor b for the three channels based on the 5 to 20 images. Then, the apparatus may be configured to apply the determined balancing tensor b for other images of the 100 images.
  • a complete set of inputs to be applied to the trained floating point neural network may comprise 100 tiles split from a complete image. Each tile is a part of the complete image and comprises three channels (RGB channels).
  • the apparatus may be configured to obtain 5 to 20 tiles from the 100 tiles as the sample inputs, and determine the balancing tensor b for the three channels based on the 5 to 20 tiles. Then, the apparatus may be configured to apply the determined balancing tensor b for other tiles of the 100 tiles.
  • the apparatus may be configured to determine the balancing tensor b based on part or all of training samples given during the training phase of the floating point neural network.
  • the floating point neural network may be referred to as the trained floating point neural network.
  • the apparatus may be configured to determine the balancing tensor b in a learnable way. That is, the balancing tensor b may be seen as one of trainable parameters in a so-called “quantization aware training”. That is, the balancing tensor b, as well as other neural network parameters such as weights and/or biases, may be fine-tuned in the training phase of the neural network before the inference phase.
  • the apparatus may be configured to divide the Winograd input tensor by the balancing tensor b to obtain a balanced input tensor $V_b$, and multiply the Winograd filter tensor by the balancing tensor b to obtain a balanced filter tensor $U_b$.
  • the apparatus may be configured to apply the following equations: $V_b = V / b$ (10), and $U_b = U \cdot b$ (11).
  • channel ranges of the one or more first floating point channels and the one or more second floating point channels may be balanced. This may facilitate the quantization that is to be performed in a later stage, for example, where quantization errors may be reduced.
  • alternatively, the apparatus may be configured to multiply the Winograd input tensor by the balancing tensor b to obtain the balanced input tensor $V_b$, and divide the Winograd filter tensor by the balancing tensor b to obtain the balanced filter tensor $U_b$.
  • in this case, the apparatus may be configured to determine each balancing coefficient in a same Winograd domain as: $b_{ij}^k = \sqrt{r_{ij}^k / t_{ij}^k}$ (12).
  • the apparatus may be configured to apply the following equations based on equation (12) in the balancing phase 102: $V_b = V \cdot b$ (13), and $U_b = U / b$ (14).
  • the apparatus is configured to determine a first scale factor (denoted as “scaleV”) for the Winograd input tensor that is balanced and a second scale factor (denoted as “scaleU”) for the Winograd filter tensor that is balanced.
  • the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively. It is noted that determining a scale factor for a quantized Winograd convolution is commonly known in the field. Therefore, it is not described in detail herein.
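As one commonly used option (an assumed example; the disclosure does not prescribe a particular scheme), a symmetric per-tensor INT8 scale factor can be computed as follows:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: scale = 127 / max|x|."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, scale

x = np.random.randn(256).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = q / scale                    # dequantized values
print(np.max(np.abs(x - x_hat)))     # rounding error, bounded by ~0.5 / scale
```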
  • the operation of $U_{quant} \cdot V_{quant}$ may also be referred to as a Winograd multiplication, which may be understood as a multiplication in the Winograd domain.
  • the transformation matrix A is used to transform the output in the Winograd domain back to the normal domain.
  • similar to the transformation matrices B and G, details regarding the transformation matrix A may depend on the specific Winograd algorithm applied to the Winograd convolution.
  • the apparatus may be configured to combine the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balancing scale tensor and a second balancing scale tensor. That is, the first scale factor and the balancing tensor are combined, and the second scale factor and the balancing tensor are also combined. In this way, the apparatus may only need to store these combined parameters, and a simplified balancing and quantization of the Winograd convolution may be achieved. In this case, the apparatus may be configured to perform the same number of operations in the inference phase as in the case of a conventional Winograd convolution.
  • by applying the balancing tensor to both the Winograd input tensor and the Winograd filter tensor of a Winograd convolution, information loss caused by the quantization may be minimized. This may, for example, increase the precision and accuracy of the result obtained by the neural network. For example, the quality of the image processing may be increased when a neural network using a Winograd convolution based on the embodiments of the present disclosure is applied.
  • the apparatus since the apparatus applies reverse operations of multiplication and division to the Winograd input tensor and the Winograd filter tensor based on the same balancing tensor, the balancing tensor itself may be canceled during the operation of the Winograd multiplication in the output phase 104. Therefore, the apparatus may not need any additional operations to reverse the balancing. Hence, efficiency may be introduced to the Winograd convolution based on the embodiments of the present disclosure.
  • FIG. 2 shows an example of an apparatus 200 for performing a Winograd convolution.
  • the apparatus 200 may comprise four units, which are shown exemplarily in FIG. 2 as a transformation unit 201, a balancing unit 202, a quantization unit 203, and an output unit 204.
  • the transformation unit 201 may be configured to obtain the input tensor and the filter tensor.
  • the transformation unit 201 may be further configured to transform the input tensor and the filter tensor from the normal domain into Winograd domain.
  • the balancing unit 202 may be configured to determine a balancing tensor.
  • the balancing unit 202 may further be configured to perform channel balancing on the Winograd input tensor and the Winograd filter tensor based on the determined balancing tensor.
  • the quantization unit 203 may be configured to quantize the (balanced) Winograd input tensor and the (balanced) Winograd filter tensor.
  • the output unit 204 may be configured to compute the final output. It is noted that the units 201-204 in FIG. 2 may correspond to phases 101-104 in FIG. 1, respectively.
  • the apparatus 200 may be or may be part of a neural processing unit (NPU) or an AI processor.
  • the apparatus 200 may be a matrix/vector/scalar computation unit comprised in an AI core.
  • the AI core, optionally along with a number of other identical AI cores, may be comprised in a chipset or system-on-chip (SoC).
  • the chipset or SoC may be configured to perform neural network related operations, such as training and/or inferencing.
  • the chipset or SoC may be configured to perform image processing, speech recognition, text recognition and the like based on artificial intelligence by using the Winograd convolution according to the present disclosure.
  • FIG. 3 shows a method 300 for performing a Winograd convolution.
  • the method 300 is performed by an apparatus for performing the Winograd convolution of a floating point neural network.
  • the method 300 comprises the following steps: step 301: generating a Winograd input tensor based on an original input tensor.
  • the Winograd input tensor comprises one or more first floating point channels;
  • step 302: generating a Winograd filter tensor based on a filter tensor of the floating point neural network.
  • the Winograd filter tensor comprises one or more second floating point channels;
  • step 303: determining a balancing tensor based on the Winograd input tensor and the Winograd filter tensor, wherein the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels;
  • step 304: determining a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor, wherein the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively;
  • step 305: performing the Winograd convolution based on the balancing tensor, the first scale factor, and the second scale factor.
  • steps 301 and 302 may correspond to phase 101 and may be performed by the transformation unit 201.
  • step 303 may correspond to phase 102 and may be performed by the balancing unit 202.
  • Step 304 may correspond to phase 103 and may be performed by the quantization unit 203.
  • Step 305 may correspond to phase 104 and may be performed by the output unit 204.
  • the corresponding method implementations are not described in detail again at this point.
  • An application scenario of the present disclosure is that the method for performing a Winograd convolution may be applied to a neural network model that involves convolution operations. Applying the neural network model may often involve two phases: a training phase and an inference phase. For preparing the model in the training phase, the following steps may be executed, e.g., by the apparatus according to the present disclosure:
  • Step 401. Obtaining the neural network model that comprises conventional direct convolution operations.
  • Step 402. Replacing the conventional direct convolutions with the Winograd convolutions according to the present disclosure.
  • Step 403. Passing several data samples, which may be referred to as sample inputs, without training to the model.
  • Step 404. Collecting statistics about minimum and maximum values of the sample inputs in the Winograd domain (i.e., Winograd inputs).
  • Step 405. Calculating balancing coefficients based on the collected statistics according to the present disclosure.
  • Step 406. Applying the calculated balancing coefficients to filters in the Winograd domain (i.e., Winograd filters).
  • Step 407. Calculating a scaling factor of the balanced Winograd filters and quantizing the Winograd filters.
  • the apparatus may be configured to perform the following step 408a and the optional step 409a.
  • Step 408a. Calculating a scaling factor of the balanced Winograd sample inputs.
  • Step 409a. Fusing the scaling factor of the balanced Winograd sample inputs with the balancing coefficients.
  • the apparatus may be configured to perform the following step 408b.
  • Step 408b. Storing the balancing coefficients for further usage in the inference phase (a sketch of these training-phase steps follows below).
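The following condensed sketch covers steps 403-408b, under the assumptions made above (the square-root balancing formula and a symmetric INT8 scheme); all function and variable names are illustrative.

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)

def winograd_transform_inputs(x_tiles, Bt):
    # (P, C, 4, 4) tiles -> (P, C, 4, 4) Winograd inputs, V = B^T X B per tile
    return np.einsum('ab,pcbd,ed->pcae', Bt, x_tiles, Bt)

def calibrate(sample_tiles, U, Bt):
    V = winograd_transform_inputs(sample_tiles, Bt)   # steps 403-404
    t = V.max(axis=0) - V.min(axis=0)                 # input ranges, (C, 4, 4)
    r = U.max(axis=1) - U.min(axis=1)                 # filter ranges, (C, 4, 4)
    b = np.sqrt(t / r)                                # step 405
    U_balanced = U * b[:, None]                       # step 406
    scale_U = 127.0 / np.abs(U_balanced).max()        # step 407 (symmetric)
    U_quant = np.round(U_balanced * scale_U).astype(np.int8)
    return b, scale_U, U_quant                        # step 408b: store b

U = np.random.randn(3, 8, 4, 4) * 0.05     # (C, K, 4, 4) Winograd filters
samples = np.random.rand(20, 3, 4, 4)      # a small sample subset (step 403)
b, scale_U, U_quant = calibrate(samples, U, Bt)
```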
  • In the inference phase, the neural network model has been trained and fine-tuned for actual application.
  • an apparatus using the trained neural network model may be configured to perform the following steps 501 and 502. It is noted that the apparatus using the trained neural network model may also be the apparatus according to the present disclosure.
  • Step 501. Obtaining balanced and quantized Winograd filters.
  • Step 502. Transforming actual inputs into Winograd inputs.
  • the apparatus may be configured to perform the following step 502a.
  • Step 502a. Balancing and quantizing the Winograd inputs by using the balancing coefficients and the scaling factor determined based on the Winograd sample inputs before the inference phase, or by applying a fused scaling factor and the balancing coefficients if available.
  • the apparatus may be configured to perform the following steps 502b, 503 and 504.
  • Step 502b. Balancing the Winograd input by using the balancing coefficients.
  • Step 503. Determining a current scale factor based on the balanced Winograd input.
  • Step 504. Quantizing the balanced Winograd input based on the current scale factor.
  • the apparatus may be configured to perform the following steps 505 and 506.
  • Step 505. Calculating Winograd multiplication based on the balanced and quantized Winograd filters and the balanced and quantized Winograd inputs.
  • Step 506. Transforming the Winograd product back to the normal (or spatial) domain as the final output (a sketch of these inference-phase steps follows below).
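The sketch below runs steps 501-506 for a single-channel F(2,3) tile, using the statically calibrated path of step 502a; the matrices are the standard values assumed earlier, and the symmetric INT8 scheme is again only an assumed example.

```python
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

W = np.random.randn(3, 3)
X = np.random.randn(4, 4)
U = G @ W @ G.T                     # Winograd filter
V = Bt @ X @ Bt.T                   # step 502: Winograd input
b = np.sqrt(np.ptp(V) / np.ptp(U))  # scalar balancing coefficient (C = 1)

# Step 501: balanced and quantized Winograd filter (prepared before inference)
scale_U = 127.0 / np.abs(U * b).max()
U_q = np.round(U * b * scale_U)
# Step 502a: balance and quantize the Winograd input
scale_V = 127.0 / np.abs(V / b).max()
V_q = np.round(V / b * scale_V)
# Step 505: Winograd multiplication on integer data, then rescaling
Y_wino = (U_q * V_q) / (scale_U * scale_V)
# Step 506: transform the Winograd product back to the spatial domain
Y = At @ Y_wino @ At.T

Y_ref = np.array([[np.sum(X[i:i + 3, j:j + 3] * W) for j in range(2)]
                  for i in range(2)])
print(np.abs(Y - Y_ref).max())      # small residual quantization error
```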
  • FIG. 4 shows an application scenario of the present disclosure.
  • In FIG. 4, an example of a quantized Efficient Sub-Pixel CNN (ESPCNN) for 3x image super-resolution is illustrated.
  • direct 2D convolutions of the four 32x32 convolutions and the 32x27 convolution may be performed based on the Winograd convolution according to the present disclosure.
  • a symmetric quantization scheme, an F(4,3) Winograd algorithm and a quantized INT8 model may be used for the Winograd convolution.
  • FIG. 5A-5C show results based on different methods for performing convolution.
  • FIG. 5A shows a result of image super-resolution by using the ESPCNN based on a conventional Winograd convolution that is without balancing.
  • FIG. 5B shows a result of image super-resolution by using the ESPCNN based on the Winograd convolution according to the present disclosure.
  • FIG. 5C shows a result of image super-resolution by using a full precision model. It can be seen that the perceptual quality of FIG. 5B obtained based on the Winograd convolution according to the present disclosure significantly exceeds the perceptual quality of FIG. 5A obtained based on the conventional Winograd convolution and is close to the perceptual quality of FIG. 5C obtained based on the full precision model.
  • the apparatus in the present disclosure may comprise processing circuitry configured to perform, conduct or initiate the various operations of the device described herein, respectively.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device to perform, conduct or initiate the operations or methods described herein, respectively.
  • the apparatus in the present disclosure may be a single electronic device capable of computing, or may comprise a set of connected electronic components or modules capable of computing with shared system memory. It is well known in the art that such computing capabilities may be incorporated into many different devices, and therefore the term “apparatus” may comprise a chip, chipset, artificial intelligence accelerator, neural processing unit, computer, mobile terminal, tablet, wearable device, game console, graphic processing unit, graphic card, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure relates to a method and an apparatus for performing a Winograd convolution of a neural network. In particular, before the input and filter tensors of the Winograd convolution in a Winograd domain are quantized into integer data, the apparatus is configured to determine a balancing tensor and balance channel ranges of the input and filter tensors based on the balancing tensor. After the channel ranges of the input and filter tensors are balanced, the apparatus is then configured to perform quantization. In this way, information loss caused by the quantization may be reduced. Hence, the precision and accuracy of the neural network may be enhanced.

Description

APPARATUS AND METHOD FOR WINOGRAD CONVOLUTION
TECHNICAL FIELD
The present disclosure relates to an apparatus and a method for processing a matrix in the field of artificial intelligence. For example, the disclosure relates to an apparatus and a method for performing a convolution in an artificial neural network.
BACKGROUND
Artificial neural networks (ANNs) are often used for performing various tasks, such as image processing, speech recognition, robotics, and big data analysis. An ANN usually involves massive data processing, such as matrix convolution.
A matrix convolution may be seen as a process of adding each element of an input matrix to its local neighbours, weighted by a kernel matrix (or filter). Therefore, the matrix convolution normally includes matrix addition and multiplication. The matrix convolution is often used in a convolutional neural network (CNN).
Several conventional algorithms have been developed to speed up the matrix convolution. In particular, Don Coppersmith and Shmuel Winograd in 1990 developed an algorithm for performing matrix multiplication, which is often referred to as “Coppersmith-Winograd algorithm”, or simply, “Winograd algorithm”. The matrix convolution performed based on the Winograd algorithm may be referred to as “Winograd-based convolution”, or simply “Winograd convolution”.
Winograd convolution is widely used in ANNs to reduce the computational complexity, e.g., by reducing the number of multiplications. As shown exemplarily in FIG. 6, two operands of a convolution, e.g. an input and a filter, are transformed into a so-called “Winograd domain” and a transformed output is obtained in the Winograd domain. Then, this transformed output is transformed back to a normal domain (sometimes also referred to as a spatial domain). Transformation matrices B, G and A depend on specific configurations of the Winograd convolution and have different values commonly known in the art for different configurations of the Winograd convolution. For efficient implementation and execution, it is often desired to perform Winograd convolution with integer data. Therefore, a floating point neural network is often quantized to an integer neural network. The floating point neural network is a neural network comprising floating point parameters, such as inputs and filters. The integer neural network is a neural network comprising only integer parameters. For inference on mobile and embedded devices, a quantized 8-bit integer (INT8) neural network can achieve accuracy comparable to that of a 32-bit floating point (FP32) neural network. Further, by quantizing an FP32 model to an INT8 model, model sizes can be reduced by a factor of four compared to the FP32 model. Moreover, calculations can be accelerated for quantized integer neural networks on processors compared to their floating point counterparts. Furthermore, on hardware where optimized fixed-point capabilities are available, the speedup can be further improved. Overall, quantization may bring improvements including model compression and latency reduction.
SUMMARY
Quantization of a neural network with a Winograd algorithm may lead to a significant drop in the performance of the neural network in some scenarios. For example, an actual neural network can have thousands of parameters. In general, it is not easy to obtain integer parameters by scaling, because hardware is restricted by the number of bits. For example, if an INT8 neural network is desired, then only $2^8 = 256$ integer values (from -128 to 127) can be used. This is often much less than the range of a floating point neural network. For example, if the floating point neural network has a range of 0.0001 to 0.1 (i.e., [0.0001, 0.1]), then it may be scaled or quantized into 10000 × [0.0001, 0.1] = [1, 1000]. However, 256 is much less than 1000. Therefore, rounding may be performed and a rounding error may be introduced. This may lead to information loss and may jeopardize the accuracy of the quantized neural network.
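A small numeric illustration of this rounding loss (assuming a symmetric INT8 scheme, which the disclosure does not prescribe): roughly a thousand distinct floating point values collapse onto at most 128 integer levels.

```python
import numpy as np

x = np.arange(0.0001, 0.1, 0.0001)   # ~1000 distinct floating point values
scale = 127.0 / x.max()              # symmetric INT8 scale factor
q = np.round(x * scale).astype(np.int8)
print(len(np.unique(x)), '->', len(np.unique(q)))   # ~1000 -> at most 128 levels
```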
Further, since a Winograd convolution involves a preliminary calculation of parameters in order to obtain transformed inputs and transformed filters in the Winograd domain, a neural network with Winograd convolution is much more vulnerable to errors introduced by quantization.
In view of the above, there is a need to address the aforementioned technical drawbacks in existing devices to improve Winograd convolution(s) of a neural network.
Apparatus and methods according to this disclosure facilitate performing a Winograd convolution of a neural network in a robust and efficient manner. This and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the drawings.
According to the present disclosure, a balancing tensor is used to balance channel ranges of inputs and weights at each layer of a neural network before quantization. Optionally, a direct algorithm for calculating the balancing tensor is disclosed, which exploits the distributions of inputs and weights at each layer of a neural network.
A first aspect of the present disclosure provides an apparatus for performing Winograd-based convolution of a floating point neural network. The apparatus is configured to generate a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels. Further, the apparatus is configured to generate a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels.
Optionally, a tensor may be a data structure used in neural networks in the field of artificial intelligence to carry a specific amount of information. For example, the tensor may be:
- a 0-dimensional (0-D) array, such as a single number;
- a 1-dimensional (1-D) array, such as a vector;
- a 2-dimensional (2-D) array, such as a matrix;
- a 3-dimensional (3-D) array, such as data representing an RGB image; or
- an array with a higher (larger than three) dimensional structure.
Optionally, the apparatus may be configured to transform a tensor with a specific dimension into another dimension. For example, a vector may be transformed into a matrix, while a matrix may also be transformed into a vector. This may be useful for satisfying different requirements of inputs required by different configurations of the Winograd convolution.
Optionally, a tensor (i.e., the input or the filter tensor) may comprise one or more channels. A channel may be used to transmit information from a certain aspect. That is, a channel may have a certain capacity for transmitting information. Optionally, the number of the one or more channels is the depth of the tensor involved in the convolution. Optionally, all channels of a tensor may share a same size. For example, an N×M pixel RGB image may also be represented by a 2D (N×M) tensor with three channels: red, green and blue.
The Winograd convolution may be performed at one or more hidden layers of the floating point neural network. Optionally, the original input tensor may be or may be part of an input applied to a hidden layer. The input applied to a hidden layer may be an output of a previous layer. For generating the Winograd input tensor, the apparatus may be configured to apply Winograd transformation on the original input tensor. Optionally, the original input tensor may be split into several tiles that are suitable for performing the Winograd convolution. Then, the apparatus may be configured to transform each tile into the Winograd input tensor. Similarly, the apparatus may be configured to apply Winograd transformation to the original filter tensor in order to obtain the Winograd filter tensor.
Optionally, the filter tensor of the floating point neural network may be weights of neurons at each hidden layer of the floating point neural network. Therefore, the filter tensor may also be referred to as a weight tensor.
Optionally, a floating point channel may be a channel comprising at least one element that is a floating point value.
Then, the apparatus is configured to determine a balancing tensor based on the Winograd input tensor and the Winograd filter tensor. The balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels.
Optionally, the balancing tensor may comprise one or more balancing coefficients. The one or more balancing coefficients, the one or more first channels, and the one or more second channels may be in a one-to-one correspondence. The apparatus may be configured to balance each first channel and each second channel based on a corresponding balancing coefficient.
Optionally, the apparatus may be configured to divide the Winograd input tensor by the balancing tensor and multiply the Winograd filter tensor by the balancing tensor. Alternatively, the apparatus may be configured to multiply the Winograd input tensor by the balancing tensor and divide the Winograd filter tensor by the balancing tensor. In this way, the balancing tensor may be canceled afterwards when Winograd multiplication of the Winograd input tensor and the Winograd filter tensor is performed. Therefore, no additional operation is introduced.
After the one or more first and second floating point channels are balanced, the apparatus is configured to determine a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor. The first and the second scale factors are adapted to quantize the one or more first balanced floating point channels and the one or more second balanced floating point channels into one or more first integer channels and one or more second integer channels, respectively.
In this way, by applying the balancing tensor before quantization, the quantization errors may be reduced to a minimum.
Then, the apparatus is configured to perform the Winograd convolution based on the balancing tensor, the first scale factor, and the second scale factor.
Optionally, the apparatus may be configured to obtain a balanced and quantized Winograd input tensor and a balanced and quantized Winograd filter tensor based on the balancing tensor, the first scale factor, and the second scale factor. Then, the apparatus may be configured to perform Winograd multiplication based on the balanced and quantized Winograd input tensor and the balanced and quantized Winograd filter tensor.
By balancing the Winograd filter and input tensors based on the same balancing tensor, channel ranges of the Winograd filter and input tensors can be balanced while the number of operations of a Winograd convolution based thereon is equivalent to that of the conventional Winograd convolution. Further, quantization errors can be reduced because of the balanced channel ranges. In this way, the precision of the Winograd convolution according to the present disclosure can be increased.
The balancing of the Winograd filter and input tensors can be compatible with various quantization and training techniques in the art, such as post-training quantization and quantization aware training. The balancing of the Winograd filter and input tensors can also be universal, because the balancing does not depend on any specific type of the Winograd convolution, such as bit width, quantization scheme, scale type and so on. Therefore, the balancing can be applied to a Winograd algorithm of any type.
In an implementation form of the first aspect, the floating point neural network may be a trained neural network. The apparatus may be configured to use the trained neural network for image processing such as image classification and image feature extraction. The apparatus may be further configured to obtain an image or a feature map of the image as the original input. The apparatus may be configured to determine the balancing tensor by minimizing quantization loss generated during the determining of the first scale factor and the second scale factor. Then, the apparatus may be configured to process the image or the feature map of the image by performing the Winograd convolution.
Optionally, after obtaining the image or the feature map of the image, the apparatus may be configured to split the image or the feature map of the image into multiple tiles. Each tile may still be considered as an original input, because values comprised therein are not altered. Therefore, each tile may still carry original information.
Optionally, the feature map of the image may be obtained by the apparatus as an output of a hidden layer comprised in the floating point neural network.
In an implementation form of the first aspect, the first floating point channel and the second floating point channel may be in a one-to-one correspondence. The balancing tensor may comprise one or more balancing coefficients. For determining the balancing tensor, the apparatus may be configured to determine each balancing coefficient based on a quantization range of each first floating point channel and a quantization range of each corresponding second floating point channel. A quantization range of a channel may be understood as a range between the maximum value and the minimum value of the channel. In an implementation form of the first aspect, the apparatus may be configured to determine each balancing coefficient based on the following equation:
bk = √(tk / rk),   (1)

wherein bk is a balancing coefficient for channel k, tk is a quantization range of channel k of the one or more first floating point channels, rk is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
In an alternative implementation form of the first aspect, the apparatus may be configured to determine each balancing coefficient based on the following equation:

bk = √(rk / tk).   (2)
In this way, the apparatus can be configured to obtain each balancing coefficient according to equation (1) or (2) in a simple and direct manner.
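For illustration only, the following is a minimal NumPy sketch of equations (1) and (2) as reconstructed above (the square-root form equalizes the channel ranges); the function name and example ranges are hypothetical:

```python
import numpy as np

def balancing_coefficients(t, r, alternative=False):
    """Equation (1): bk = sqrt(tk / rk); equation (2) is its reciprocal."""
    b = np.sqrt(np.asarray(t, dtype=np.float64) / np.asarray(r, dtype=np.float64))
    return 1.0 / b if alternative else b

# Channels with very different ranges become balanced: dividing the input
# ranges by b and multiplying the filter ranges by b yields sqrt(tk * rk).
t = np.array([8.0, 0.5, 2.0])   # ranges of the first floating point channels
r = np.array([0.5, 8.0, 2.0])   # ranges of the second floating point channels
b = balancing_coefficients(t, r)
print(t / b)   # [2. 2. 2.]
print(r * b)   # [2. 2. 2.]
```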
Optionally, prior to an inference phase of the trained floating point neural network, or during a training phase of a floating point neural network for obtaining the trained floating point neural network, the apparatus may be configured to obtain a set of sample inputs. The set of sample inputs may be a part of a complete set of inputs that are to be applied to the trained floating point neural network in the inference phase. Alternatively, the set of sample inputs may be similar to a set of inputs that are to be applied to the trained floating point neural network in the inference phase. Then, the apparatus may be configured to determine each balancing coefficient for each channel based on the sample inputs, and apply each determined balancing coefficient to each corresponding channel for later input(s).
In an implementation form of the first aspect, for performing the Winograd convolution, the apparatus may be configured to:
- divide the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- multiply the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiply the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiply the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- perform Winograd multiplication of the Winograd convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.
In an alternative implementation form of the first aspect, for performing the Winograd convolution, the apparatus may be configured to:
- multiply the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- divide the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiply the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiply the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- perform Winograd multiplication of the Winograd convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.
In an implementation form of the first aspect, the apparatus may be further configured to:
- combine the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor; and
- perform the Winograd convolution based further on the first balanced scale tensor and the second balanced scale tensor.
A second aspect of the present disclosure provides a computer-implemented method for performing Winograd convolution of a floating point neural network. The method comprises the following steps:
- generating a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels;
- generating a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels;
- determining a balancing tensor based on the Winograd input tensor and the Winograd filter tensor, wherein the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels;
- determining a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor, wherein the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively; and
- performing the Winograd convolution based on the balancing tensor, the first scale factor, and the second scale factor.
In an implementation form of the second aspect, the floating point neural network may be a trained neural network. The trained neural network may be used for image processing, such as image classification and image feature extraction. The method may further comprise:
- obtaining an image or a feature map of the image as the original input;
- determining the balancing tensor by minimizing quantization loss generated during the determining of the first scale factor and the second scale factor; and
- processing the image or the feature map of the image by performing the Winograd convolution.
In an implementation form of the second aspect, the first floating point channel and the second floating point channel may be in a one-to-one correspondence, and the balancing tensor may comprise one or more balancing coefficients. The determining of the balancing tensor may comprise:
- determining each balancing coefficient based on a quantization range of each first floating point channel and a quantization range of each corresponding second floating point channel.

In an implementation form of the second aspect, the determining of each balancing coefficient may be based on the following equation:
bk = √(tk / rk),

wherein bk is a balancing coefficient for channel k, tk is a quantization range of channel k of the one or more first floating point channels, rk is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
In an alternative implementation form of the second aspect, the determining of each balancing coefficient may be based on the following equation:

bk = √(rk / tk).
Optionally, prior to an inference phase of the trained floating point neural network, or during a training phase of a floating point neural network for obtaining the trained floating point neural network, the method may further comprise obtaining a set of sample inputs. The set of sample inputs may be a part of a complete set of inputs that are to be applied to the trained floating point neural network in the inference phase. Alternatively, the set of sample inputs may be similar to a set of inputs that are to be applied to the trained floating point neural network in the inference phase. Then, the method may further comprise determining each balancing coefficient for each channel based on the sample inputs, and applying each determined balancing coefficient to each corresponding channel for later input(s).
In an implementation form of the second aspect, the performing of the Winograd-based convolution may comprise the following steps:
- dividing the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- multiplying the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiplying the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiplying the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- performing Winograd multiplication of the Winograd convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.
In an alternative implementation form of the second aspect, the performing of the Winograd-based convolution may comprise the following steps:
- multiplying the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- dividing the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiplying the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiplying the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- performing Winograd multiplication of the Winograd convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.
In an implementation form of the second aspect, the method may further comprise the following steps:
- combining the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor; and
- performing the Winograd convolution based further on the first balanced scale tensor and the second balanced scale tensor.
A third aspect of the present disclosure provides a computer program product comprising a program code for performing the method according to the second aspect or any implementation form thereof, when executed on a computer.

A fourth aspect of the present disclosure provides a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to the second aspect or any implementation form thereof.
A fifth aspect of the present disclosure provides a chipset comprising instructions which, when executed by the chipset, cause the chipset to carry out the method according to the second aspect or any implementation form thereof.
It has to be noted that all apparatus, devices, elements, units, and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity, which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 shows an example of a Winograd convolution performed by an apparatus;
FIG. 2 shows an example of an apparatus for performing a Winograd convolution;
FIG. 3 shows a method for performing a Winograd convolution;
FIG. 4 shows an application scenario;
FIGs. 5A-5C show results based on different methods for performing convolution; and
FIG. 6 shows an illustrative example of a conventional Winograd convolution.
DETAILED DESCRIPTION OF THE EMBODIMENTS
According to the present disclosure, a framework for performing a Winograd convolution in a neural network is provided. Optionally, improvements are introduced based on the conventional Winograd convolution illustrated in FIG. 6. A solution for reducing errors generated by quantization of the neural network is provided.
In the present disclosure, quantization may be referred to as a process of mapping a set of input values to a smaller set of values. Optionally, the conversion of floating point numbers to fixed point (e.g., integer) numbers may be a process of quantization.
Optionally, embodiments disclosed herein may be applied to a neural network that is adapted for image processing, such as image classification, image enhancement, and image recognition. Advantageously, information loss of the image processing based on the neural network according to embodiments disclosed herein may be reduced.
It is noted that in the present disclosure, a neural network may be referred to as a neural network model or simply as a model. A Winograd parameter shall be understood as a parameter in the Winograd domain.
Reference is now made to the drawings, wherein similar elements may share the same features and may function likewise.
FIG. 1 shows an example of a Winograd convolution performed by an apparatus for a floating point neural network. In this example, the Winograd convolution comprises four phases: a transformation phase 101, a balancing phase 102, a quantization phase 103, and an output phase 104.
In the transformation phase 101, an original input tensor and a filter tensor of the neural network are transformed into a Winograd input tensor, denoted as V, and a Winograd filter tensor, denoted as U, respectively. It is noted that the input tensor is denoted as X in the present disclosure and is exemplarily shown as a 4×4 input tile in FIG. 1. The filter tensor of the neural network (also referred to as an original filter tensor) is denoted as W in the present disclosure and is exemplarily shown as a 3×3 filter tile in FIG. 1. In the transformation phase 101, the apparatus may be configured to apply the following equations:
V = Bᵀ · X · B,   (3)
U = G · W · Gᵀ.   (4)

Moreover, the Winograd input tensor comprises one or more first floating point channels, and the Winograd filter tensor comprises one or more second floating point channels. In FIG. 1, it is exemplarily shown that the original input tensor X and the filter tensor W both comprise three channels, and operations according to equations (3) and (4) may be performed for each channel comprised therein. Then, each channel may be carried over respectively into the Winograd domain during the transformation according to equations (3) and (4).
In some embodiments, the neural network may be a trained neural network. The trained neural network may be configured to perform image processing such as image classification and image feature extraction. In this case, the original input tensor may be or may be part of an image. Alternatively, the original input tensor may be or may be part of a feature map of an image. The feature map may be obtained at a hidden layer of the neural network. For example, the feature map may be an output of a previous layer at a certain hidden layer of the neural network. Optionally, the original filter tensor may be dependent upon the hidden layer. That is, the neural network may comprise different original filter tensors at different hidden layers. It is noted that in the present disclosure, it shall be understood that the apparatus may perform one or more Winograd convolutions at any layer comprised in the neural network. Therefore, the original input tensor and the original filter tensor may be associated with each layer comprised in the neural network.
For performing the Winograd convolution, the original input tensor and the original filter tensor are transformed into the Winograd domain, where transformation matrices B and G may be used to facilitate the transformation. Any input tensor not transformed into the Winograd domain may be referred to as an original input. That is, elements of the original input tensor are not altered and may still reflect original true data. Details regarding the transformation matrices B and G may depend on the specific Winograd algorithm applied to the Winograd convolution. For example, the transformation matrices B and G for an F(2×2, 3×3) Winograd convolution with a 4×4 input tile and a 3×3 filter tile may be as follows:
B = [[ 1,  0,  0,  0],
     [ 0,  1, -1,  1],
     [-1,  1,  1,  0],
     [ 0,  0,  0, -1]],

G = [[  1,    0,    0],
     [1/2,  1/2,  1/2],
     [1/2, -1/2,  1/2],
     [  0,    0,    1]].
It is noted that an F(m×n, r×s) Winograd convolution may denote a 2D Winograd convolution used to compute an m×n feature map with r×s filters. Other types of Winograd convolutions, such as the F(4×4, 3×3) and the F(6×6, 3×3) Winograd convolutions, may also be commonly used in the art, in particular for image processing. It is noted that sometimes, the F(2×2, 3×3), F(4×4, 3×3) and F(6×6, 3×3) Winograd convolutions may simply be referred to as F(2,3), F(4,3) and F(6,3) Winograd convolutions, especially for image processing where a 2D convolution may be considered as default.
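As a sketch, the transformations of equations (3) and (4) for a single channel of the F(2×2, 3×3) case may be written as follows, using the matrices B and G given above (variable names are illustrative):

```python
import numpy as np

B = np.array([[ 1, 0,  0,  0],
              [ 0, 1, -1,  1],
              [-1, 1,  1,  0],
              [ 0, 0,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

X = np.random.randn(4, 4).astype(np.float32)  # one channel of a 4x4 input tile
W = np.random.randn(3, 3).astype(np.float32)  # one channel of a 3x3 filter tile

V = B.T @ X @ B    # equation (3): one channel of the Winograd input tensor
U = G @ W @ G.T    # equation (4): one channel of the Winograd filter tensor
```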
In the Winograd domain, the Winograd filter tensor U may have a dimension (C,K,a,a), and the Winograd input tensor V may have a dimension (P,C,a,a), where C is the number of input channels, K is the number of output channels, P is the number of tiles of a complete input, and (a, a) determines the dimension of the Winograd domain.
Optionally, the apparatus may be configured to split a complete original input tensor into a plurality of tiles. Each tile may comprise a part of the original input tensor. The size of each tile may be based on the size of the Winograd input tensor of the specific Winograd algorithm applied to the Winograd convolution. In the present disclosure, each tile may be equivalent to the original input, because it also carries original data that is not transformed into the Winograd domain. For example, the 4×4 input tile in FIG. 1 may be an image segment that is split from a complete image of a dimension of 80×80 (pixels). In this case, the apparatus may be configured to split the image into 20×20 = 400 tiles and perform the Winograd convolution according to the present disclosure for each tile. Then, the apparatus may be configured to aggregate results obtained based on all the split tiles as a final result corresponding to the complete original input tensor.
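A minimal sketch of such a tile split, assuming the non-overlapping 4×4 split of the 80×80 example above (the helper name is illustrative):

```python
import numpy as np

def split_into_tiles(image, tile=4):
    """Split an (H, W) input into non-overlapping tile x tile blocks."""
    H, W = image.shape
    return (image.reshape(H // tile, tile, W // tile, tile)
                 .swapaxes(1, 2)
                 .reshape(-1, tile, tile))

tiles = split_into_tiles(np.zeros((80, 80), dtype=np.float32))
print(tiles.shape)   # (400, 4, 4), i.e., 20x20 = 400 tiles
```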
Before the Winograd filter tensor U and the Winograd input tensor V are quantized, balancing is performed for both U and V in the Winograd domain. The apparatus is configured to determine a balancing tensor, denoted as b in the present disclosure. Optionally, the balancing tensor b may have a dimension (C, a, a). The balancing tensor b may be used to balance channel ranges of the Winograd input and filter tensors.
For this purpose, a notion of precision of a channel is proposed in the present disclosure. For example, the precision of channel k in Winograd domain (i, j) of the Winograd filter tensor U may be denoted as:

p_{i,j}^k = r_{i,j}^k / R_{i,j},   (5)

where r_{i,j}^k is the quantization range of channel k in Winograd domain (i, j) of U, and R_{i,j} is the quantization range of all channels in Winograd domain (i, j) of U. It is noted that i, j, k are all positive integers. A quantization range may be referred to as a range between a minimum value and a maximum value of a domain before quantization.
Similarly, the precision of channel k in Winograd domain (i, j) of the Winograd input tensor V may be denoted as:

s_{i,j}^k = t_{i,j}^k / T_{i,j},   (6)

where t_{i,j}^k is the quantization range of channel k in Winograd domain (i, j) of V, and T_{i,j} is the quantization range of all channels in Winograd domain (i, j) of V. In some embodiments, the apparatus may be configured to obtain parameters such as s_{i,j}^k by obtaining a set of sample data without training.
An optimal balancing tensor b may be used to achieve a maximized total precision of all channels. That is, the apparatus may be configured to determine the balancing tensor b such that the total precision of all channels,

Σ_{i,j,k} p_{i,j}^k · s_{i,j}^k,   (7)

is maximized.
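For illustration, the total precision objective of expression (7) may be evaluated as follows; this sketch assumes that the quantization range of all channels equals the maximum of the per-channel ranges:

```python
import numpy as np

def total_precision(t, r):
    """Sum over i, j, k of p * s for per-channel ranges t, r of shape (C, a, a)."""
    s = t / t.max(axis=0)   # precision of each channel of V, equation (6)
    p = r / r.max(axis=0)   # precision of each channel of U, equation (5)
    return (p * s).sum()    # objective (7), maximized by the optimal b
```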
In some embodiments, the apparatus may be configured to determine the balancing tensor b as:
b_{i,j}^k = √(t_{i,j}^k / r_{i,j}^k),   (8)

wherein b_{i,j}^k denotes each balancing coefficient for channel k in Winograd domain (i, j) comprised in the balancing tensor b. In some embodiments, each balancing coefficient in a same Winograd domain may also be denoted as:

bk = √(tk / rk).   (9)
Optionally, prior to an inference phase of the trained floating point neural network, or during a training phase of a floating point neural network for obtaining the trained floating point neural network, the apparatus may be configured to obtain a set of sample inputs. The set of sample inputs may be a part (e.g., 5-20%) of a complete set of inputs that are to be applied to the trained floating point neural network in the inference phase. Alternatively, the set of sample inputs may be similar to a set of inputs to be applied to the trained floating point neural network in the inference phase. Then, the apparatus may be configured to determine each balancing coefficient for each channel based on the sample inputs, and apply each determined balancing coefficient to each corresponding channel for later input(s).
For example, a complete set of inputs to be applied to the trained floating point neural network may comprise 100 images, each image comprising three channels (RGB channels). The apparatus may be configured to obtain 5 to 20 images from the 100 images as the sample inputs, and determine the balancing tensor b for the three channels based on the 5 to 20 images. Then, the apparatus may be configured to apply the determined balancing tensor b for other images of the 100 images.
As another example, a complete set of inputs to be applied to the trained floating point neural network may comprise 100 tiles split from a complete image. Each tile is a part of the complete image and comprises three channels (RGB channels). The apparatus may be configured to obtain 5 to 20 tiles from the 100 tiles as the sample inputs, and determine the balancing tensor b for the three channels based on the 5 to 20 tiles. Then, the apparatus may be configured to apply the determined balancing tensor b for other tiles of the 100 tiles.
As another example, the apparatus may be configured to determine the balancing tensor b based on part or all of training samples given during the training phase of the floating point neural network. After the training phase, the floating point neural network may be referred to as the trained floating point neural network.
In some embodiments, as an alternative to the determining of the balancing tensor b according to equation (8) or (9), the apparatus may be configured to determine the balancing tensor b in a learnable way. That is, the balancing tensor b may be seen as one of the trainable parameters in a so-called "quantization aware training". In other words, the balancing tensor b, as well as other neural network parameters such as weights and/or biases, may be fine-tuned in the training phase of the neural network before the inference phase. After obtaining the balancing tensor b, the apparatus may be configured to divide the Winograd input tensor by the balancing tensor b to obtain a balanced input tensor V_b, and multiply the Winograd filter tensor by the balancing tensor b to obtain a balanced filter tensor U_b. In the balancing phase 102, the apparatus may be configured to apply the following equations:
V_b = V / b,   (10)
U_b = U · b.   (11)
In this way, channel ranges of the one or more first floating point channels and the one or more second floating point channels may be balanced. This may facilitate the quantization that is to be performed in a later stage, for example, where quantization errors may be reduced.
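The following sketch illustrates equations (10) and (11) for multiple channels and verifies that the balancing cancels in the Winograd multiplication; the shapes follow the (P, C, a, a) and (C, K, a, a) convention above, and the random data is illustrative only:

```python
import numpy as np
rng = np.random.default_rng(0)

P, C, K, a = 16, 3, 2, 4
V = rng.normal(size=(P, C, a, a)) * np.array([0.1, 1.0, 10.0]).reshape(1, C, 1, 1)
U = rng.normal(size=(C, K, a, a))

t = np.abs(V).max(axis=0)        # per-channel input ranges, shape (C, a, a)
r = np.abs(U).max(axis=1)        # per-channel filter ranges, shape (C, a, a)
b = np.sqrt(t / r)               # equation (8): balancing tensor of dimension (C, a, a)

Vb = V / b                       # equation (10)
Ub = U * b[:, None]              # equation (11)

# Channel ranges are now equal for input and filter at every (i, j)...
assert np.allclose(np.abs(Vb).max(axis=0), np.abs(Ub).max(axis=1))
# ...and the balancing cancels in the Winograd multiplication.
Y_ref = np.einsum('ckij,pcij->pkij', U, V)
Y_bal = np.einsum('ckij,pcij->pkij', Ub, Vb)
assert np.allclose(Y_ref, Y_bal)
```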
Alternatively, in some embodiments, the apparatus may be configured to multiply the Winograd input tensor by the balancing tensor b to obtain the balanced input tensor V_b, and divide the Winograd filter tensor by the balancing tensor b to obtain the balanced filter tensor U_b. In this case, the apparatus may be configured to determine each balancing coefficient in a same Winograd domain as:
bk = √(rk / tk).   (12)
In this case, the apparatus may be configured to apply the following equations based on equation (12) in the balancing phase 102:

V_b = V · b,   (13)
U_b = U / b.   (14)
Then, the apparatus is configured to determine a first scale factor (denoted as scale_V) for the Winograd input tensor that is balanced and a second scale factor (denoted as scale_U) for the Winograd filter tensor that is balanced. The first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively. It is noted that determining a scale factor for a quantized Winograd convolution is commonly known in the field. Therefore, it is not described in detail herein. In the quantization phase 103, the apparatus may be configured to apply the first and second scale factors to the balanced Winograd input tensor and the balanced Winograd filter tensor as follows:

V_quant = scale_V · V_b,   (15)
U_quant = scale_U · U_b,   (16)

where V_quant is a quantized input tensor, and U_quant is a quantized filter tensor.
In the output phase 104, after obtaining the quantized Winograd input tensor and the quantized Winograd filter tensor, the apparatus may be configured to perform Winograd convolution as follows:
Y = Aᵀ · ((U_quant ⊙ V_quant) / (scale_U · scale_V)) · A + bias (optional),   (17)

where Y is the final output, ⊙ denotes the element-wise multiplication in the Winograd domain, and the transformation matrix A is used to transform the output in the Winograd domain back to the normal domain. The operation U_quant ⊙ V_quant may also be referred to as a Winograd multiplication, which may be understood as a multiplication in the Winograd domain. Similar to the transformation matrices B and G, details regarding the transformation matrix A may depend on the specific Winograd algorithm applied to the Winograd convolution. An example of the transformation matrix A, based on the F(2×2, 3×3) Winograd convolution with the 4×4 input tile and the 3×3 filter tile as previously mentioned, may be as follows:

A = [[1,  0],
     [1,  1],
     [1, -1],
     [0, -1]].
In some embodiments, optionally, the apparatus may be configured to combine the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor. That is, the first scale factor and the balancing tensor are combined, and the second scale factor and the balancing tensor are also combined. In this way, the apparatus may only need to store these combined parameters, and a simplified balancing and quantization of the Winograd convolution may be achieved. In this case, the apparatus may be configured to perform the same number of operations in the inference phase as in the case of a conventional Winograd convolution.
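A sketch of this combining, reusing the variables from the sketches above; the two stored tensors replace the separate balancing and scaling steps:

```python
scale_v_fused = scale_v / b              # first balanced scale tensor, (C, a, a)
scale_u_fused = scale_u * b              # second balanced scale tensor

# A single element-wise multiplication now balances and scales in one step:
V_quant2 = np.round(scale_v_fused * V)           # == round(scale_v * (V / b))
U_quant2 = np.round(scale_u_fused[:, None] * U)  # == round(scale_u * (U * b))
```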
According to the embodiments of the present disclosure, by determining and applying the balancing tensor for both the Winograd input tensor and the Winograd filter tensor of a Winograd convolution, information loss caused by the quantization may be minimized. This may, for example, increase the precision and accuracy of the result obtained by the neural network. For example, the quality of the image processing may be increased when a neural network using a Winograd convolution according to the embodiments of the present disclosure is applied. Moreover, since the apparatus applies reverse operations of multiplication and division to the Winograd input tensor and the Winograd filter tensor based on the same balancing tensor, the balancing tensor itself may be canceled during the operation of the Winograd multiplication in the output phase 104. Therefore, the apparatus may not need any additional operations to reverse the balancing. Hence, efficiency may be introduced to the Winograd convolution based on the embodiments of the present disclosure.
FIG. 2 shows an example of an apparatus 200 for performing a Winograd convolution.
In some embodiments, the apparatus 200 may comprise four units, which are shown exemplarily in FIG. 2 as a transformation unit 201, a balancing unit 202, a quantization unit 203, and an output unit 204. The transformation unit 201 may be configured to obtain the input tensor and the filter tensor. The transformation unit 201 may be further configured to transform the input tensor and the filter tensor from the normal domain into the Winograd domain. Then, the balancing unit 202 may be configured to determine a balancing tensor. The balancing unit 202 may further be configured to perform channel balancing on the Winograd input tensor and the Winograd filter tensor based on the determined balancing tensor. Then, the quantization unit 203 may be configured to quantize the (balanced) Winograd input tensor and the (balanced) Winograd filter tensor. Then, the output unit 204 may be configured to compute the final output. It is noted that the units 201-204 in FIG. 2 may correspond to the phases 101-104 in FIG. 1, respectively.
In some embodiments, optionally, the apparatus 200 may be or may be part of a neural processing unit (NPU) or an AI processor. For example, the apparatus 200 may be a matrix/vector/scalar computation unit comprised in an AI core. The AI core, optionally along with a number of other identical AI cores, may be comprised in a chipset or system-on-chip (SoC). The chipset or SoC may be configured to perform neural network related operations, such as training and/or inferencing. When the chipset or SoC is integrated into an electronic device such as a computer or a mobile phone, the chipset or SoC may be configured to perform image processing, speech recognition, text recognition and the like based on artificial intelligence by using the Winograd convolution according to the present disclosure.
FIG. 3 shows a method 300 for performing a Winograd convolution.
The method 300 is performed by an apparatus for performing the Winograd convolution of a floating point neural network. The method 300 comprises the following steps:
- step 301: generating a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels;
- step 302: generating a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels;
- step 303: determining a balancing tensor based on the Winograd input tensor and the Winograd filter tensor, wherein the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels;
- step 304: determining a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor, wherein the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively; and
- step 305: performing the Winograd convolution based on the balancing tensor, the first scale factor, and the second scale factor.
It is noted that the steps of the method 300 may share the same functions and details as described above with reference to FIGs. 1 and 2. In particular, steps 301 and 302 may correspond to phase 101 and may be performed by the transformation unit 201. Step 303 may correspond to phase 102 and may be performed by the balancing unit 202. Step 304 may correspond to phase 103 and may be performed by the quantization unit 203. Step 305 may correspond to phase 104 and may be performed by the output unit 204. The corresponding method implementations are therefore not described in detail again at this point.
An application scenario of the present disclosure is that the method for performing a Winograd convolution may be applied to a neural network model that involves convolution operations. Applying the neural network model may often involve two phases: a training phase and an inference phase. For preparing the model in the training phase, the following steps may be executed, e.g., by the apparatus according to the present disclosure:
Step 401. Obtaining the neural network model that comprises conventional direct convolution operations.
Step 402. Replacing the conventional direct convolutions with the Winograd convolutions according to the present disclosure.
Step 403. Passing several data samples, which may be referred to as sample inputs, to the model without training.
Step 404. Collecting statistics about minimum and maximum values of the sample inputs in the Winograd domain (i.e., Winograd inputs).
Step 405. Calculating balancing coefficients based on the collected statistics according to the present disclosure.
Step 406. Applying the calculated balancing coefficients to filters in the Winograd domain (i.e., Winograd filters).
Step 407. Calculating a scaling factor of the balanced Winograd filters and quantizing the Winograd filters.
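Steps 403 to 407 may be sketched as one calibration routine; this reuses the matrices B and G from the sketch above, and the function name and the choice of max-absolute-value statistics are illustrative assumptions:

```python
def calibrate_and_quantize_filters(sample_tiles, W, qmax=127):
    """sample_tiles: (P, C, 4, 4) sample inputs; W: (C, K, 3, 3) filters."""
    V = np.einsum('ri,pcrs,sj->pcij', B, sample_tiles, B)  # Winograd sample inputs
    U = np.einsum('gr,ckrs,hs->ckgh', G, W, G)             # Winograd filters
    t = np.abs(V).max(axis=0)           # step 404: statistics over the samples
    r = np.abs(U).max(axis=1)
    b = np.sqrt(t / r)                  # step 405: balancing coefficients
    Ub = U * b[:, None]                 # step 406: balanced Winograd filters
    scale_u = qmax / np.abs(Ub).max()   # step 407: filter scale factor
    return np.round(scale_u * Ub), b, scale_u
```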
In the following, two different ways of applying the balancing coefficients and the scale factors to actual inputs are described, i.e., a static approach and a dynamic approach.
For the static approach, the apparatus may be configured to perform the following step 408a and the optional step 409a.
Step 408a. Calculating a scaling factor of the balanced Winograd sample inputs.
Step 409a (optional). Fusing the scaling factor of the balanced Winograd sample inputs and the balancing coefficients.

Alternatively, for the dynamic approach, the apparatus may be configured to perform the following step 408b.
Step 408b. Storing the balancing coefficients for further usage in the inference phase.
In the inference phase, the neural network model has been trained and fine-tuned for actual application. In this case, an apparatus using the trained neural network model may be configured to perform the following steps 501 and 502. It is noted that the apparatus using the trained neural network model may also be the apparatus according to the present disclosure.
Step 501. Obtaining the balanced and quantized Winograd filters.

Step 502. Transforming actual inputs into Winograd inputs.
For the static approach, the apparatus may be configured to perform the following step 502a.
Step 502a. Balancing and quantizing the Winograd inputs by using the balancing coefficients and the scaling factor determined based on the Winograd sample inputs before the inference phase, or by applying the fused scaling factor and balancing coefficients if available.
Alternatively, for the dynamic approach, the apparatus may be configured to perform the following steps 502b, 503 and 504.
Step 502b. Balancing the Winograd input by using the balancing coefficients.

Step 503. Determining a current scale factor based on the balanced Winograd input.

Step 504. Quantizing the balanced Winograd input based on the current scale factor.
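A sketch of steps 502b to 504 of the dynamic approach, with the scale factor recomputed per input at inference time (names are illustrative):

```python
import numpy as np

def quantize_input_dynamic(V, b, qmax=127):
    Vb = V / b                              # step 502b: balance the Winograd input
    scale_v = qmax / np.abs(Vb).max()       # step 503: current scale factor
    return np.round(scale_v * Vb), scale_v  # step 504: quantized Winograd input
```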
Then, for both the static and dynamic approaches, the apparatus may be configured to perform the following steps 505 and 506.
Step 505. Calculating the Winograd multiplication based on the balanced and quantized Winograd filters and the balanced and quantized Winograd inputs.

Step 506. Transforming the Winograd product back to the normal (or spatial) domain as the final output.
FIG. 4 shows an application scenario of the present disclosure.
In FIG. 4, an example of a quantized Efficient Sub-Pixel CNN (ESPCNN) for 3x image super-resolution is illustrated. In the ESPCNN, the direct 2D convolutions of the four Convolution 32x32 layers and the Convolution 32x27 layer may be performed based on the Winograd convolution according to the present disclosure. For example, a symmetric quantization scheme, an F(4,3) Winograd algorithm, and a quantized INT8 model may be used for the Winograd convolution.
FIGs. 5A-5C show results based on different methods for performing convolution.
FIG. 5A shows a result of image super-resolution by using the ESPCNN based on a conventional Winograd convolution that is without balancing. FIG. 5B shows a result of image super-resolution by using the ESPCNN based on the Winograd convolution according to the present disclosure. FIG. 5C shows a result of image super-resolution by using a full precision model. It can be seen that the perceptual quality of FIG. 5B obtained based on the Winograd convolution according to the present disclosure significantly exceeds the perceptual quality of FIG. 5A obtained based on the conventional Winograd convolution and is close to the perceptual quality of FIG. 5C obtained based on the full precision model.
It is noted that the apparatus in the present disclosure may comprise processing circuitry configured to perform, conduct or initiate the various operations of the device described herein, respectively. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device to perform, conduct or initiate the operations or methods described herein, respectively. It is further noted that the apparatus in the present disclosure may be a single electronic device capable of computing, or may comprise a set of connected electronic components or modules capable of computing with shared system memory. It is well known in the art that such computing capabilities may be incorporated into many different devices, and therefore the term "apparatus" may comprise a chip, chipset, artificial intelligence accelerator, neural processing unit, computer, mobile terminal, tablet, wearable device, game console, graphic processing unit, graphic card, and the like.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed subject matter, from studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or another unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

CLAIMS

1. An apparatus (200) for performing Winograd-based convolution of a floating point neural network, wherein the apparatus (200) is configured to:
- generate a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels;
- generate a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels;
- determine a balancing tensor based on the Winograd input tensor and the Winograd filter tensor, wherein the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels;
- determine a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor, wherein the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively; and
- perform the Winograd-based convolution based on the balancing tensor, the first scale factor, and the second scale factor.

2. The apparatus (200) according to claim 1, wherein the floating point neural network is a trained neural network for image processing, and the apparatus (200) is configured to:
- obtain an image or a feature map of the image as the original input;
- determine the balancing tensor by minimizing quantization loss generated during the determining of the first scale factor and the second scale factor; and
- process the image or the feature map of the image by performing the Winograd convolution.

3. The apparatus (200) according to claim 1 or 2, wherein the first floating point channel and the second floating point channel are in a one-to-one correspondence, and the balancing tensor comprises one or more balancing coefficients, wherein for determining the balancing tensor, the apparatus (200) is configured to:
- determine each balancing coefficient based on a quantization range of each first floating point channel and a quantization range of each corresponding second floating point channel.
4. The apparatus (200) according to claim 3, wherein the apparatus (200) is configured to determine each balancing coefficient based on the following equation:

bk = √(tk / rk),

wherein bk is a balancing coefficient for channel k, tk is a quantization range of channel k of the one or more first floating point channels, rk is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
5. The apparatus (200) according to claim 3, wherein the apparatus (200) is configured to determine each balancing coefficient based on the following equation:

bk = √(rk / tk),

wherein bk is a balancing coefficient for channel k, tk is a quantization range of channel k of the one or more first floating point channels, rk is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
6. The apparatus (200) according to any one of claims 1 to 4, wherein for performing the Winograd-based convolution, the apparatus (200) is configured to:
- divide the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- multiply the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiply the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiply the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- perform Winograd multiplication of the Winograd-based convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.

7. The apparatus (200) according to any one of claims 1, 2, 3 and 5, wherein for performing the Winograd-based convolution, the apparatus (200) is configured to:
- multiply the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- divide the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiply the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiply the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- perform Winograd multiplication of the Winograd-based convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.

8. The apparatus (200) according to any one of claims 1 to 5, further configured to:
- combine the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor; and
- perform the Winograd-based convolution based on the first balanced scale tensor and the second balanced scale tensor.
9. A computer-implemented method (300) for performing Winograd-based convolution of a floating point neural network, wherein the method (300) is executed by an apparatus and comprises:
- generating (301) a Winograd input tensor based on an original input tensor, wherein the Winograd input tensor comprises one or more first floating point channels;
- generating (302) a Winograd filter tensor based on a filter tensor of the floating point neural network, wherein the Winograd filter tensor comprises one or more second floating point channels;
- determining (303) a balancing tensor based on the Winograd input tensor and the Winograd filter tensor, wherein the balancing tensor is adapted to balance the one or more first floating point channels and the one or more second floating point channels;
- determining (304) a first scale factor for the Winograd input tensor and a second scale factor for the Winograd filter tensor, wherein the first and the second scale factors are adapted to quantize the one or more first floating point channels and the one or more second floating point channels to one or more first integer channels and one or more second integer channels, respectively; and
- performing (305) the Winograd-based convolution based on the balancing tensor, the first scale factor, and the second scale factor.

10. The method (300) according to claim 9, wherein the floating point neural network is a trained neural network for image processing, and the method further comprises:
- obtaining an image or a feature map of the image as the original input;
- determining the balancing tensor by minimizing quantization loss generated during the determining of the first scale factor and the second scale factor; and
- processing the image or the feature map of the image by performing the Winograd convolution.

11. The method (300) according to claim 9 or 10, wherein the first floating point channel and the second floating point channel are in a one-to-one correspondence, and the balancing tensor comprises one or more balancing coefficients, wherein the determining of the balancing tensor comprises:
- determining each balancing coefficient based on a quantization range of each first floating point channel and a quantization range of each corresponding second floating point channel.

12. The method (300) according to claim 11, wherein the determining of each balancing coefficient is based on the following equation:

bk = √(tk / rk),

wherein bk is a balancing coefficient for channel k, tk is a quantization range of channel k of the one or more first floating point channels, rk is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.
13. The method (300) according to claim 11, wherein the determining of each balancing coefficient is based on the following equation:

bk = √(rk / tk),

wherein bk is a balancing coefficient for channel k, tk is a quantization range of channel k of the one or more first floating point channels, rk is a quantization range of channel k of the one or more second floating point channels, and k is a positive integer.

14. The method (300) according to any one of claims 9 to 12, wherein the performing (305) of the Winograd-based convolution comprises:
- dividing the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- multiplying the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiplying the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiplying the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- performing Winograd multiplication of the Winograd-based convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.

15. The method (300) according to any one of claims 9, 10, 11 and 13, wherein the performing (305) of the Winograd-based convolution comprises:
- multiplying the Winograd input tensor by the balancing tensor to obtain a balanced Winograd input tensor;
- dividing the Winograd filter tensor by the balancing tensor to obtain a balanced Winograd filter tensor;
- multiplying the balanced Winograd input tensor by the first scale factor to obtain a quantized Winograd input tensor, wherein the quantized Winograd input tensor comprises the one or more first integer channels;
- multiplying the balanced Winograd filter tensor by the second scale factor to obtain a quantized Winograd filter tensor, wherein the quantized Winograd filter tensor comprises the one or more second integer channels; and
- performing Winograd multiplication of the Winograd-based convolution based on the quantized Winograd input tensor and the quantized Winograd filter tensor.

16. The method (300) according to any one of claims 9 to 13, further comprising:
- combining the balancing tensor with the first scale factor and the second scale factor, respectively, to obtain a first balanced scale tensor and a second balanced scale tensor; and
- performing the Winograd-based convolution based further on the first balanced scale tensor and the second balanced scale tensor.
17. A computer program comprising a program code for performing the method according to any one of claims 9 to 16, when executed on a computer.