WO2021228611A1 - Device and method for training a neural network - Google Patents

Device and method for training a neural network

Info

Publication number
WO2021228611A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient
function
gradients
modification function
modified
Prior art date
Application number
PCT/EP2021/061615
Other languages
French (fr)
Inventor
Erik Reinhard
Philippe Guillotel
Original Assignee
Interdigital Ce Patent Holdings
Priority date
Filing date
Publication date
Application filed by Interdigital Ce Patent Holdings filed Critical Interdigital Ce Patent Holdings
Publication of WO2021228611A1 publication Critical patent/WO2021228611A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A device trains a neural network by performing a gradient descent method including modifying a gradient of an optimization function using a gradient modification function to obtain a modified gradient and using the modified gradient in a gradient descent algorithm. The modifying and the using can be iterated until a minimum is found and a result corresponding to the minimum can be output.

Description

DEVICE AND METHOD FOR TRAINING A NEURAL NETWORK
TECHNICAL FIELD
The present disclosure relates generally to neural networks and in particular to training of neural networks.

BACKGROUND
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Generally speaking, training an artificial neural network amounts to solving the following non-convex optimization problem:

$$\min_{x \in \mathbb{R}^d} f(x)$$
As this problem is typically intractable, instead of a global optimum an ε-stationary point x is sought, such that $\|\nabla f(x)\| \le \epsilon$; see Jingzhao Zhang et al., "Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition," arXiv preprint arXiv:1905.11881 (2019).
Gradient descent is a scheme whereby the solution x is iteratively refined. Step k + 1 is computed from step k as follows:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$
In this equation, η is the step size, more commonly known as the learning rate, which is normally set to a small fixed value. This value is often modified to either stabilize or accelerate learning; see for example Zhang (2019). For example, in clipped gradient descent, the above update scheme is modified as follows:

$$x_{k+1} = x_k - \eta_c \, \nabla f(x_k)$$

where

$$\eta_c = \min\left\{\eta,\; \frac{\gamma \eta}{\|\nabla f(x_k)\|}\right\}$$

with γ a clipping threshold. Likewise, another variant, normalized gradient descent, manipulates the learning rate η, replacing it with η_n as follows:

$$x_{k+1} = x_k - \eta_n \, \nabla f(x_k), \qquad \eta_n = \frac{\eta}{\|\nabla f(x_k)\|}$$
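The clipped and normalized variants above can be sketched in a few lines. This is an illustrative sketch, not the patented method: the toy objective, function names, and parameter values are all assumptions, and `numpy` is assumed to be available.

```python
import numpy as np

def clipped_gd_step(x, grad_f, eta=0.1, gamma=1.0):
    """One step of clipped gradient descent: the effective learning rate
    is min(eta, gamma * eta / ||grad||), so steps driven by large
    gradients are clipped while small gradients use the plain rate."""
    g = grad_f(x)
    norm = np.linalg.norm(g)
    eta_c = min(eta, gamma * eta / norm) if norm > 0 else eta
    return x - eta_c * g

def normalized_gd_step(x, grad_f, eta=0.1):
    """One step of normalized gradient descent: the gradient is divided
    by its norm, so every step has length eta."""
    g = grad_f(x)
    norm = np.linalg.norm(g)
    return x - eta * g / norm if norm > 0 else x

# Toy objective f(x) = ||x||^2, whose gradient is 2x.
grad = lambda x: 2.0 * x
x = np.array([3.0, 4.0])
x_clip = clipped_gd_step(x, grad)   # gradient norm is 10, so the rate is clipped
x_norm = normalized_gd_step(x, grad)
```

Note that both variants only rescale the whole gradient vector by a scalar; they never reshape individual gradient components, which is the distinction the present principles build on.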
The well-known and often-used Adam optimizer adjusts the learning rate as a function of the training iteration k, as follows (see also Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," arXiv preprint arXiv:1412.6980, 2014):

$$\eta_k = \eta \, \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k}$$

where β1 and β2 are the exponential decay rates of Adam's first- and second-moment estimates.
In this case, the gradients are scaled by values that depend on the iteration k. However, at each iteration, all gradients are scaled by the same amount. This is also true of variants of the Adam optimizer; see Ange Tato and Roger Nkambou, "Improving Adam Optimizer" (2018); Zijun Zhang, "Improved Adam Optimizer for Deep Neural Networks," in 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pp. 1-2, IEEE, 2018; and Ilya Loshchilov and Frank Hutter, "Fixing Weight Decay Regularization in Adam," arXiv preprint arXiv:1711.05101 (2017).
As is known, the learning rate is an important, and often manually chosen, hyper-parameter. If it is too large, then instability may occur, which affects the ability to converge. If it is too small, the network may converge unnecessarily slowly, and the solution arrived at may be sub-optimal if it lies at a local rather than a global optimum.
It will thus be appreciated that there is a desire for a solution that addresses at least some of the shortcomings of training neural networks. The present principles provide such a solution.

SUMMARY OF DISCLOSURE
In a first aspect, the present principles are directed to a method including modifying a gradient of an optimization function using a gradient modification function to obtain a modified gradient and using the modified gradient in a gradient descent algorithm.
In a second aspect, the present principles are directed to a device including memory configured to store program code instructions and at least one hardware processor configured to execute the program code instructions to modify a gradient of an optimization function using a gradient modification function to obtain a modified gradient and use the modified gradient in a gradient descent algorithm.
In a third aspect, the present principles are directed to a computer program product which is stored on a non-transitory computer readable medium and includes program code instructions executable by a processor for implementing the steps of a method according to any embodiment of the first aspect.

BRIEF DESCRIPTION OF DRAWINGS
Features of the present principles will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 illustrates a device according to an embodiment of the present principles;
Figure 2 illustrates different functions based on the hyperbolic tangent;
Figure 3 illustrates a flowchart for a method according to a first embodiment of the present principles; and
Figure 4 illustrates a weighted quality assessment of experimental methods.

DESCRIPTION OF EMBODIMENTS
Figure 1 illustrates a device 100 according to an embodiment of the present principles. The device 100 typically includes a user input interface 110, at least one hardware processor (“processor”) 120, memory 130, and a network interface 140. The device 100 can further include a display interface or a display 150. A non-transitory storage medium 170 stores computer-readable instructions that, when executed by a processor, perform the method described with reference to Figure 3. The user input interface 110, which for example can be implemented as a keyboard, a mouse or a touch screen, is configured to receive input from a user. The processor 120 is configured to execute program code instructions to perform a method of gradient descent according to at least one method of the present principles. The memory 130, which can be at least partly non-transitory, is configured to store the program code instructions to be executed by the processor 120, parameters, image data, intermediate results and so on. The network interface 140 is configured for communication with external devices (not shown) over any suitable connection 180, wired or wireless.
The processor 120 is configured to reshape gradients prior to performing gradient descent. This reshaping may occur in combination with learning rate adjustments, including, for example, the Adam optimizer. The reshaping can take the form of a function s which is applied to the gradients, as follows:

$$\nabla' f(x_k) = s\!\left(\nabla f(x_k)\right)$$
Thus, the function s takes as input a number of input gradients, and outputs a number of reshaped gradients. It is noted that a reshaped gradient may, in certain cases, be the same as the corresponding input gradient; in other words, the reshaping can in some cases leave the input gradient unchanged.
The reshaped gradients can then be used to process data in any suitable gradient descent method, as well as in any other gradient-based optimization algorithm. This notably includes gradient descent applied in the training of neural networks, including deep neural networks, and the training of generative adversarial networks.
The function s may in principle have any shape, but in an embodiment, the function maps gradients with a larger magnitude to gradients with a smaller magnitude. Smaller magnitude gradients can be left unaltered by the function s. An example of such a function is the hyperbolic tangent:

$$s(x) = \tanh(x)$$
A function s based on the hyperbolic tangent can be adapted to have a sharper knee, in the limit becoming a clamping function:

$$s(x) = \begin{cases} x, & |x| \le g \\ \operatorname{sign}(x)\left(g + (1-g)\tanh\!\left(\dfrac{|x| - g}{1 - g}\right)\right), & |x| > g \end{cases}$$
As can be seen, gradients with a large magnitude, i.e. greater than g, are reduced, while gradients with smaller magnitudes, i.e. equal to or smaller than g, are unchanged.
Figure 2 illustrates the function s based on the hyperbolic tangent for the following values of g: 0, 0.3, 0.7 and 1.
As can be seen, the middle section, i.e. for -g ≤ x ≤ g, forms a straight line with slope 1, meaning that the output has the same value as the input. Thus, as already mentioned, small gradients are unaffected while larger values are increasingly reduced, with asymptotes at 1 and -1 for large positive and negative values respectively. The parameterization through the constant g allows the shape of the hyperbolic tangent to be adjusted according to need. For a value of g = 0.0, the function is a standard hyperbolic tangent. For a value of g = 1.0, the function is a clamping function. For values of g between 0 and 1, the curvature of the function s is more or less sharp.
An alternative gradient modification function may be based on the Naka-Rushton equation, here modified to also admit negative values as input:

$$s(x) = \frac{x}{1 + |x|}$$
This function also has a slope of 1 at 0 and asymptotes at +1 and -1, but lacks the adaptability of the hyperbolic-tangent-based function described above. A possible advantage of the modified Naka-Rushton equation, however, can be lower computational complexity.
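A minimal sketch of such a Naka-Rushton-style reshaping, under the assumption that the modified form is x / (1 + |x|); the function name is illustrative.

```python
def s_nr(x):
    """Naka-Rushton-style reshaping modified to admit negative inputs:
    slope 1 at the origin, asymptotes at +1 and -1, and cheaper to
    evaluate than tanh since it needs no exponentials."""
    return x / (1.0 + abs(x))
```

Unlike the tanh-based function above, there is no knee parameter: the amount of compression applied to a given gradient magnitude is fixed.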
Another related function, the Michaelis-Menten equation, may also be appropriate to modify gradients:

$$s(x) = \operatorname{sign}(x)\,\frac{|x|^n}{|x|^n + 1}$$
where the exponent n is a constant usually taken between 0 and 1.
Figure 3 illustrates a flowchart for a method 30 according to a first embodiment of the present principles.
In step S32, the processor 120 calculates a modified gradient or modified gradients as the output of a gradient modification function s applied to a gradient or, depending on the number of dimensions, gradients. As already mentioned, the gradient modification function s can be the same for all dimensions, but can also be different for at least two dimensions.
In step S34, the processor 120 performs gradient descent using the modified gradient(s).
For example, the standard gradient descent function may thus be modified as follows for any desired gradient modification function s:

$$x_{k+1} = x_k - \eta\, s\!\left(\nabla f(x_k)\right)$$
It may be combined with any method that, possibly adaptively, modifies the learning rate parameter η. This includes modifications to the Adam optimizer, as follows:

$$x_{k+1} = x_k - \eta_k\, s\!\left(\nabla f(x_k)\right)$$

where η_k is the iteration-dependent learning rate of the Adam optimizer.
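Steps S32 (reshape the gradient) and S34 (descend with the reshaped gradient) can be sketched as a loop. The function names, stopping rule, and toy objective below are illustrative assumptions, and `numpy` is assumed to be available.

```python
import numpy as np

def reshaped_gradient_descent(grad_f, x0, s, eta=0.1, steps=200, tol=1e-6):
    """Sketch of the method of Figure 3: at each iteration the raw
    gradient is passed through the modification function s (step S32)
    and the reshaped gradient drives the descent update (step S34)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        g_mod = s(g)                  # step S32: reshape the gradient
        x = x - eta * g_mod           # step S34: descent update
        if np.linalg.norm(g_mod) < tol:
            break                     # converged (illustrative criterion)
    return x

# Toy quadratic f(x) = ||x||^2; reshape elementwise with numpy's tanh,
# which bounds each gradient component to (-1, 1).
x_min = reshaped_gradient_descent(lambda x: 2.0 * x, [4.0, -3.0], np.tanh)
```

On this toy problem the reshaping acts like per-component clipping far from the optimum and leaves the small gradients near the optimum essentially unchanged, so the iteration still converges to the minimum at the origin.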
Steps S32 and S34 are typically iterated until convergence is obtained and the result can be stored or output, for example through the network interface 140 or the display interface 150. In an embodiment, step S32 is not performed in every iteration.

EVALUATION
The solutions of the present principles were tested in the context of a dataset of images of human faces (the CelebA dataset: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). The dataset was split into one set for training (90% of the images) and another set for validation (10% of the images). A Deep Convolutional GAN (DCGAN) [see for example Alec Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," arXiv preprint arXiv:1511.06434 (2015)] was modified according to the present principles. The learning rate was set to 0.002. Unless stated otherwise, the latent space is 100-dimensional (N = 100). The batch size is 512, images are 64x64 pixels, and for all results training was stopped after only 15 epochs. In all cases, the Adam optimizer was used.
Results labeled ‘Uniform’ used a latent space of independent and identically distributed (i.i.d.) zero-mean Gaussian random variables with a variance of 1. Results labeled ‘Weighted’ used zero-mean Gaussian random variables. The adjustment of individual gradients using a hyperbolic tangent is indicated by ‘Tanh’, followed by the value of the parameter g, as already described with reference to Figure 2. A direct clamping of the values to be within the range [-1, 1] is indicated with ‘Clamp’, whereas no gradient adjustment is applied to results marked ‘Adam’ (where the standard Adam optimizer is applied). In all other cases, the Adam optimizer is used in conjunction with the solutions of the first embodiment.
Figure 4 illustrates a weighted quality assessment of the experimental methods. As can be seen, the baseline method, ‘Adam’, i.e. without any of the proposed modifications, provides the lowest weighted quality of all. This means that all variants tested improve upon the baseline result by some amount.
This seems to indicate that a non-linear adjustment of gradients with a hyperbolic tangent with a slightly sharpened knee (g = 0.3 or g = 0.7) is a better function for gradient adjustment than hard clamping, not adjusting at all, or using a hyperbolic tangent with an unmodified knee (g = 0.0). As will be appreciated, the present embodiments can help reduce the training requirements of neural networks. The present principles are broadly applicable to all neural networks that require gradient descent optimization (which includes virtually all neural networks).
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. The device can be implemented on a plurality of physical devices working together.
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.

Claims

1. A method comprising: modifying a gradient of an optimization function using a gradient modification function to obtain a modified gradient; and using the modified gradient in a gradient descent algorithm.
2. The method of claim 1, wherein the modifying and the using are iterated until a minimum is found.
3. The method of claim 2, further comprising outputting a result corresponding to the minimum.
4. The method of claim 1, wherein the gradient modification function reduces relatively larger gradient magnitudes more than or as much as relatively smaller gradient magnitudes.
5. The method of claim 1, wherein the gradient modification function leaves magnitudes smaller than a given value unchanged.
6. The method of claim 1, wherein a plurality of gradients are used and the same gradient modification function is applied to gradients of more than one dimension.
7. A device comprising: memory configured to store program code instructions; and at least one hardware processor configured to execute the program code instructions to: modify a gradient of an optimization function using a gradient modification function to obtain a modified gradient; and use the modified gradient in a gradient descent algorithm.
8. The device of claim 7, wherein the program code instructions further cause the at least one hardware processor to iterate the modify and use until a minimum is found.
9. The device of claim 8, wherein the program code instructions further cause the at least one hardware processor to output a result corresponding to the minimum.
10. The device of claim 7, wherein the gradient modification function reduces relatively larger gradient magnitudes more than or as much as relatively smaller gradient magnitudes.
11. The device of claim 7, wherein the gradient modification function leaves magnitudes smaller than a given value unchanged.
12. The device of claim 7, wherein a plurality of gradients are used and the same gradient modification function is applied to gradients of more than one dimension.
13. A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least one hardware processor to perform the method of any one of claims 1-6.
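The claimed method can be illustrated with a short sketch. This is not the patented implementation: the logarithmic soft-clipping function `modify_gradient`, its `threshold` parameter, and the toy quadratic objective below are assumptions chosen only to satisfy the claimed properties — magnitudes below a given value pass through unchanged (claims 5 and 11), larger magnitudes are reduced more (claims 4 and 10), and the same modification function is applied to every dimension (claims 6 and 12).

```python
import math


def modify_gradient(g, threshold=1.0):
    """Gradient modification function (one possible choice).

    Magnitudes at or below `threshold` are returned unchanged;
    larger magnitudes are compressed logarithmically, so bigger
    gradients are reduced more than smaller ones.
    """
    mag = abs(g)
    if mag <= threshold:
        return g
    return math.copysign(threshold + math.log1p(mag - threshold), g)


def gradient_descent(grad_fn, theta, lr=0.04, steps=200, tol=1e-8):
    """Gradient descent using modified gradients.

    The same modification function is applied component-wise to the
    gradient vector; iteration stops when the modified gradient is
    (numerically) zero, i.e. a minimum has been reached.
    """
    for _ in range(steps):
        grads = grad_fn(theta)
        modified = [modify_gradient(g) for g in grads]
        theta = [t - lr * m for t, m in zip(theta, modified)]
        if max(abs(m) for m in modified) < tol:
            break
    return theta
```

As a usage example, minimizing the ill-conditioned quadratic f(x, y) = x^2 + 10y^2 (gradient (2x, 20y)) from the starting point (5, 5) converges to the origin: the large initial gradient in the y direction is compressed rather than taken at full magnitude, which avoids the overshoot that an unmodified step of the same learning rate would produce.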
PCT/EP2021/061615 2020-05-11 2021-05-04 Device and method for training a neural network WO2021228611A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP20305471 2020-05-11
EP20305471.3 2020-05-11

Publications (1)

Publication Number Publication Date
WO2021228611A1 true WO2021228611A1 (en) 2021-11-18

Family

ID=71103320

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/061615 WO2021228611A1 (en) 2020-05-11 2021-05-04 Device and method for training a neural network

Country Status (1)

Country Link
WO (1) WO2021228611A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180469A1 (en) * 2017-12-08 2019-06-13 Nvidia Corporation Systems and methods for dynamic facial analysis using a recurrent neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180469A1 (en) * 2017-12-08 2019-06-13 Nvidia Corporation Systems and methods for dynamic facial analysis using a recurrent neural network

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
"Deep Learning", 1 January 2016, MIT PRESS, article GOODFELLOW IAN ET AL: "Chapter 10: Sequence Modeling: Recurrent and Recursive Nets", pages: 367 - 415, XP055828572 *
ALEC RADFORD ET AL.: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ARXIV PREPRINT ARXIV:1511.06434, 2015
ANGE TATO, ROGER NKAMBOU: "Improving Adam Optimizer", 2018
DAMIR VODENICAREVIC ET AL: "Nano-oscillator-based classification with a machine learning-compatible architecture", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 August 2018 (2018-08-25), XP081103944, DOI: 10.1063/1.5042359 *
ILYA LOSHCHILOV, FRANK HUTTER: "Fixing Weight Decay Regularization in Adam", ARXIV PREPRINT ARXIV:1711.05101, 2017
JINGZHAO ZHANG ET AL.: "Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition", ARXIV PREPRINT ARXIV:1905.11881, 2019
PASCANU RAZVAN ET AL: "On the difficulty of training Recurrent Neural Networks", PROCEEDINGS OF THE 30TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, ATLANTA, GEORGIA, USA, PMLR 28(3):1310-1318, 2013., vol. 28, 17 June 2013 (2013-06-17), pages 1310 - 1318, XP055828669, Retrieved from the Internet <URL:http://proceedings.mlr.press/v28/pascanu13.pdf> *
WILSON ASHIA ET AL: "Accelerating Rescaled Gradient Descent: Fast Optimization of Smooth Functions", ARXIV:1902.08825V3, 4 January 2020 (2020-01-04), XP055828371, Retrieved from the Internet <URL:https://arxiv.org/pdf/1902.08825.pdf> [retrieved on 20210728] *
ZHANG JINGZHAO ET AL: "Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition", ARXIV:1905.11881V1, 28 March 2019 (2019-03-28), pages 1 - 18, XP055828327, Retrieved from the Internet <URL:https://arxiv.org/pdf/1905.11881v1.pdf> [retrieved on 20210728] *
ZIJUN ZHANG: "2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS)", 2018, IEEE, article "Improved Adam Optimizer for Deep Neural Networks", pages: 1 - 2

Similar Documents

Publication Publication Date Title
JPWO2019004350A1 (en) Training method, training apparatus, program, and non-transitory computer-readable medium
CN109165735B (en) Method for generating sample picture based on generation of confrontation network and adaptive proportion
EP3989534A1 (en) Image collection method and apparatus, and device and storage medium
EP3779891A1 (en) Method and device for training neural network model, and method and device for generating time-lapse photography video
CN105118067A (en) Image segmentation method based on Gaussian smoothing filter
CN112614072B (en) Image restoration method and device, image restoration equipment and storage medium
CN114724007A (en) Training classification model, data classification method, device, equipment, medium and product
US20230138380A1 (en) Self-contrastive learning for image processing
US20190392311A1 (en) Method for quantizing a histogram of an image, method for training a neural network and neural network training system
US20240054605A1 (en) Methods and systems for wavelet domain-based normalizing flow super-resolution image reconstruction
WO2021228611A1 (en) Device and method for training a neural network
CN111630530A (en) Data processing system and data processing method
CN116704217A (en) Model training method, device and storage medium based on difficult sample mining
CN113822386B (en) Image identification method, device, equipment and medium
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN116245769A (en) Image processing method, device, equipment and storage medium
CN112927168B (en) Image enhancement method based on longicorn stigma search and differential evolution hybrid algorithm
US20220108220A1 (en) Systems And Methods For Performing Automatic Label Smoothing Of Augmented Training Data
EP4007173A1 (en) Data storage method, and data acquisition method and apparatus therefor
WO2021199226A1 (en) Learning device, learning method, and computer-readable recording medium
JP7047665B2 (en) Learning equipment, learning methods and learning programs
KR20230015186A (en) Method and Device for Determining Saturation Ratio-Based Quantization Range for Quantization of Neural Network
CN114529899A (en) Method and system for training convolutional neural networks
KR20220165121A (en) Method and apparatus for progressive image resolution improvement using neural ordinary differential equation, and image super-resolution method using the smae
GB2610531A (en) Optimizing capacity and learning of weighted real-valued logic

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21722480

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21722480

Country of ref document: EP

Kind code of ref document: A1