WO2021228611A1 - Device and method for training a neural network - Google Patents

Device and method for training a neural network

Info

Publication number
WO2021228611A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient
function
gradients
modification function
modified
Prior art date
Application number
PCT/EP2021/061615
Other languages
French (fr)
Inventor
Erik Reinhard
Philippe Guillotel
Original Assignee
Interdigital Ce Patent Holdings
Priority date
Filing date
Publication date
Application filed by Interdigital Ce Patent Holdings filed Critical Interdigital Ce Patent Holdings
Publication of WO2021228611A1 publication Critical patent/WO2021228611A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A device trains a neural network by performing a gradient descent method including modifying a gradient of an optimization function using a gradient modification function to obtain a modified gradient and using the modified gradient in a gradient descent algorithm. The modifying and the using can be iterated until a minimum is found and a result corresponding to the minimum can be output.

Description

DEVICE AND METHOD FOR TRAINING A NEURAL NETWORK
TECHNICAL FIELD
The present disclosure relates generally to neural networks and in particular to training of neural networks.

BACKGROUND
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Generally speaking, training an artificial neural network amounts to solving the following non-convex optimization problem:

$$\min_{x \in \mathbb{R}^d} f(x)$$
As this problem is typically intractable, instead of a global optimum an ε-stationary point x is sought, such that $\|\nabla f(x)\| \le \epsilon$; see Jingzhao Zhang et al., "Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition," arXiv preprint arXiv:1905.11881 (2019).
Gradient descent is a scheme whereby the solution x is iteratively refined. Step k + 1 is computed from step k as follows:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$
In this equation, η is the step size, more commonly known as the learning rate, which is normally set to a small fixed value. This value is often modified to either stabilize or accelerate learning; see for example Zhang (2019). For example, in clipped gradient descent, the above update scheme is modified as follows:

$$x_{k+1} = x_k - \eta_c \, \nabla f(x_k)$$

where

$$\eta_c = \min\left\{\eta,\; \frac{\gamma \eta}{\|\nabla f(x_k)\|}\right\}$$

with γ a clipping threshold. Likewise, another variant, normalized gradient descent, manipulates the learning rate η, replacing it with η_n as follows:

$$x_{k+1} = x_k - \eta_n \, \nabla f(x_k), \qquad \eta_n = \frac{\eta}{\|\nabla f(x_k)\|}$$
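The clipped and normalized variants above can be sketched in a few lines. This is an illustrative sketch, not the patented method: the toy objective, function names, and parameter values are all assumptions, and `numpy` is assumed to be available.

```python
import numpy as np

def clipped_gd_step(x, grad_f, eta=0.1, gamma=1.0):
    """One step of clipped gradient descent: the effective learning rate
    is min(eta, gamma * eta / ||grad||), so steps driven by large
    gradients are clipped while small gradients use the plain rate."""
    g = grad_f(x)
    norm = np.linalg.norm(g)
    eta_c = min(eta, gamma * eta / norm) if norm > 0 else eta
    return x - eta_c * g

def normalized_gd_step(x, grad_f, eta=0.1):
    """One step of normalized gradient descent: the gradient is divided
    by its norm, so every step has length eta."""
    g = grad_f(x)
    norm = np.linalg.norm(g)
    return x - eta * g / norm if norm > 0 else x

# Toy objective f(x) = ||x||^2, whose gradient is 2x.
grad = lambda x: 2.0 * x
x = np.array([3.0, 4.0])
x_clip = clipped_gd_step(x, grad)   # gradient norm is 10, so the rate is clipped
x_norm = normalized_gd_step(x, grad)
```

Note that both variants only rescale the whole gradient vector by a scalar; they never reshape individual gradient components, which is the distinction the present principles build on.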
The well-known and often-used Adam optimizer adjusts the learning rate as a function of the training iteration k, as follows (see also Diederik P. Kingma and Jimmy Ba, "Adam: A Method for Stochastic Optimization," arXiv preprint arXiv:1412.6980, 2014):

$$\eta_k = \eta \, \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k}$$

where β1 and β2 are the exponential decay rates of Adam's first- and second-moment estimates.
In this case, the gradients are scaled by values that depend on the iteration k. However, at each iteration, all gradients are scaled by the same amount. This is also true of variants of the Adam optimizer; see Ange Tato and Roger Nkambou, "Improving Adam Optimizer" (2018); Zijun Zhang, "Improved Adam Optimizer for Deep Neural Networks," in 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pp. 1-2, IEEE, 2018; and Ilya Loshchilov and Frank Hutter, "Fixing Weight Decay Regularization in Adam," arXiv preprint arXiv:1711.05101 (2017).
As is known, the learning rate is an important, and often manually chosen, hyper-parameter. If it is too large, then instability may occur, which affects the ability to converge. If it is too small, the network may converge unnecessarily slowly, and the solution arrived at may be sub-optimal if it lies at a local rather than a global optimum.
It will thus be appreciated that there is a desire for a solution that addresses at least some of the shortcomings of training neural networks. The present principles provide such a solution.

SUMMARY OF DISCLOSURE
In a first aspect, the present principles are directed to a method including modifying a gradient of an optimization function using a gradient modification function to obtain a modified gradient and using the modified gradient in a gradient descent algorithm.
In a second aspect, the present principles are directed to a device including memory configured to store program code instructions and at least one hardware processor configured to execute the program code instructions to modify a gradient of an optimization function using a gradient modification function to obtain a modified gradient and use the modified gradient in a gradient descent algorithm.
In a third aspect, the present principles are directed to a computer program product which is stored on a non-transitory computer readable medium and includes program code instructions executable by a processor for implementing the steps of a method according to any embodiment of the first aspect.

BRIEF DESCRIPTION OF DRAWINGS
Features of the present principles will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 illustrates a device according to an embodiment of the present principles;
Figure 2 illustrates different functions based on the hyperbolic tangent;
Figure 3 illustrates a flowchart for a method according to a first embodiment of the present principles; and
Figure 4 illustrates a weighted quality assessment of experimental methods.

DESCRIPTION OF EMBODIMENTS
Figure 1 illustrates a device 100 according to an embodiment of the present principles. The device 100 typically includes a user input interface 110, at least one hardware processor (“processor”) 120, memory 130, and a network interface 140. The device 100 can further include a display interface or a display 150. A non-transitory storage medium 170 stores computer-readable instructions that, when executed by a processor, perform the method described with reference to Figure 3. The user input interface 110, which for example can be implemented as a keyboard, a mouse or a touch screen, is configured to receive input from a user. The processor 120 is configured to execute program code instructions to perform a method of gradient descent according to at least one method of the present principles. The memory 130, which can be at least partly non-transitory, is configured to store the program code instructions to be executed by the processor 120, parameters, image data, intermediate results and so on. The network interface 140 is configured for communication with external devices (not shown) over any suitable connection 180, wired or wireless.
The processor 120 is configured to reshape gradients prior to performing gradient descent. This reshaping may occur in combination with learning rate adjustments, including, for example, the Adam optimizer. The reshaping can take the form of a function s which is applied to the gradients, as follows:

$$\nabla' f(x_k) = s\!\left(\nabla f(x_k)\right)$$
Thus, the function s takes as input a number of input gradients, and outputs a number of reshaped gradients. It is noted that a reshaped gradient may, in certain cases, be the same as the corresponding input gradient; in other words, the reshaping can in some cases leave the input gradient unchanged.
The reshaped gradients can then be used to process data in any suitable gradient descent method, as well as in any other gradient-based optimization algorithm. This notably includes gradient descent applied in the training of neural networks, including deep neural networks, and the training of generative adversarial networks.
The function s may in principle have any shape, but in an embodiment, the function maps gradients with a larger magnitude to gradients with a smaller magnitude. Smaller magnitude gradients can be left unaltered by the function s. An example of such a function is the hyperbolic tangent:

$$s(x) = \tanh(x)$$
A function s based on the hyperbolic tangent can be adapted to have a sharper knee, in the limit becoming a clamping function:

$$s(x) = \begin{cases} x, & |x| \le g \\ \operatorname{sign}(x)\left(g + (1-g)\tanh\!\left(\dfrac{|x| - g}{1 - g}\right)\right), & |x| > g \end{cases}$$
As can be seen, gradients with a large magnitude, i.e. greater than g, are reduced, while gradients with smaller magnitudes, i.e. equal to or smaller than g, are unchanged.
Figure 2 illustrates the function s based on the hyperbolic tangent for the following values of g: 0, 0.3, 0.7 and 1.
As can be seen, the middle section, i.e. for -g ≤ x ≤ g, forms a straight line with slope 1, meaning that the output has the same value as the input. Thus, as already mentioned, small gradients are unaffected while larger values are increasingly reduced, with asymptotes at 1 and -1 for large positive and negative values respectively. The parameterization through the constant g allows the shape of the hyperbolic tangent to be adjusted according to need. For a value of g = 0.0, the function is a standard hyperbolic tangent. For a value of g = 1.0, the function is a clamping function. For values of g between 0 and 1, the curvature of the function s is more or less sharp.
An alternative gradient modification function may be based on the Naka-Rushton equation, here modified to also admit negative values as input:

$$s(x) = \frac{x}{1 + |x|}$$
This function also has a slope of 1 at 0 and asymptotes at +1 and -1, but lacks the adaptability of the hyperbolic-tangent-based function described above. A possible advantage of the modified Naka-Rushton equation, however, can be lower computational complexity.
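A minimal sketch of such a Naka-Rushton-style reshaping, under the assumption that the modified form is x / (1 + |x|); the function name is illustrative.

```python
def s_nr(x):
    """Naka-Rushton-style reshaping modified to admit negative inputs:
    slope 1 at the origin, asymptotes at +1 and -1, and cheaper to
    evaluate than tanh since it needs no exponentials."""
    return x / (1.0 + abs(x))
```

Unlike the tanh-based function above, there is no knee parameter: the amount of compression applied to a given gradient magnitude is fixed.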
Another related function, the Michaelis-Menten equation, may also be appropriate to modify gradients:

$$s(x) = \operatorname{sign}(x)\,\frac{|x|^n}{|x|^n + 1}$$
where the exponent n is a constant usually taken between 0 and 1.
Figure 3 illustrates a flowchart for a method 30 according to a first embodiment of the present principles.
In step S32, the processor 120 calculates a modified gradient or modified gradients as the output of a gradient modification function s applied to a gradient or, depending on the number of dimensions, gradients. As already mentioned, the gradient modification function s can be the same for all dimensions, but can also be different for at least two dimensions.
In step S34, the processor 120 performs gradient descent using the modified gradient(s).
For example, the standard gradient descent function may thus be modified as follows for any desired gradient modification function s:

$$x_{k+1} = x_k - \eta\, s\!\left(\nabla f(x_k)\right)$$
It may be combined with any method that, possibly adaptively, modifies the learning rate parameter η. This includes modifications to the Adam optimizer, as follows:

$$x_{k+1} = x_k - \eta_k\, s\!\left(\nabla f(x_k)\right)$$

where η_k is the iteration-dependent learning rate of the Adam optimizer.
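Steps S32 (reshape the gradient) and S34 (descend with the reshaped gradient) can be sketched as a loop. The function names, stopping rule, and toy objective below are illustrative assumptions, and `numpy` is assumed to be available.

```python
import numpy as np

def reshaped_gradient_descent(grad_f, x0, s, eta=0.1, steps=200, tol=1e-6):
    """Sketch of the method of Figure 3: at each iteration the raw
    gradient is passed through the modification function s (step S32)
    and the reshaped gradient drives the descent update (step S34)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        g_mod = s(g)                  # step S32: reshape the gradient
        x = x - eta * g_mod           # step S34: descent update
        if np.linalg.norm(g_mod) < tol:
            break                     # converged (illustrative criterion)
    return x

# Toy quadratic f(x) = ||x||^2; reshape elementwise with numpy's tanh,
# which bounds each gradient component to (-1, 1).
x_min = reshaped_gradient_descent(lambda x: 2.0 * x, [4.0, -3.0], np.tanh)
```

On this toy problem the reshaping acts like per-component clipping far from the optimum and leaves the small gradients near the optimum essentially unchanged, so the iteration still converges to the minimum at the origin.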
Steps S32 and S34 are typically iterated until convergence is obtained and the result can be stored or output, for example through the network interface 140 or the display interface 150. In an embodiment, step S32 is not performed in every iteration.

EVALUATION
The solutions of the present principles were tested in the context of a dataset of images of human faces (the CelebA dataset: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). The dataset was split into one set for training (90% of the images) and another set for validation (10% of the images). A Deep Convolutional GAN (DCGAN) [see for example Alec Radford et al., "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," arXiv preprint arXiv:1511.06434 (2015)] was modified according to the present principles. The learning rate was set to 0.002. Unless stated otherwise, the latent space is 100-dimensional (N = 100). The batch size is 512, images are 64x64 pixels, and for all results training was stopped after only 15 epochs. In all cases, the Adam optimizer was used.
Results labeled ‘Uniform’ used a latent space of independent and identically distributed (i.i.d.) zero-mean Gaussian random variables with a variance of 1. Results labeled ‘Weighted’ used zero-mean Gaussian random variables. The adjustment of individual gradients using a hyperbolic tangent is indicated by ‘Tanh’, followed by the value of the parameter g, as already described with reference to Figure 2. A direct clamping of the values to be within the range [-1, 1] is indicated with ‘Clamp’, whereas no gradient adjustment is applied to results marked ‘Adam’ (where the standard Adam optimizer is applied). In all other cases, the Adam optimizer is used in conjunction with the solutions of the first embodiment.
Figure 4 illustrates a weighted quality assessment of the experimental methods. As can be seen, the baseline method, ‘Adam’, i.e. without any of the proposed modifications, provides the lowest weighted quality of all. This means that all variants tested improve upon the baseline result by some amount.
This seems to indicate that a non-linear adjustment of gradients with a hyperbolic tangent with a slightly sharpened knee (g = 0.3 or g = 0.7) is a better function for gradient adjustment than hard clamping, not adjusting at all, or using a hyperbolic tangent with an unmodified knee (g = 0.0). As will be appreciated, the present embodiments can help reduce the training requirements of neural networks. The present principles are broadly applicable to all neural networks that require gradient descent optimization (which includes virtually all neural networks).
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. The device can be implemented on a plurality of physical devices working together.
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.

Claims

1. A method comprising: modifying a gradient of an optimization function using a gradient modification function to obtain a modified gradient; and using the modified gradient in a gradient descent algorithm.
2. The method of claim 1, wherein the modifying and the using are iterated until a minimum is found.
3. The method of claim 2, further comprising outputting a result corresponding to the minimum.
4. The method of claim 1, wherein the gradient modification function reduces relatively larger gradient magnitudes more than or as much as relatively smaller gradient magnitudes.
5. The method of claim 1, wherein the gradient modification function leaves magnitudes smaller than a given value unchanged.
6. The method of claim 1, wherein a plurality of gradients are used and the same gradient modification function is applied to gradients of more than one dimension.
7. A device comprising: memory configured to store program code instructions; and at least one hardware processor configured to execute the program code instructions to: modify a gradient of an optimization function using a gradient modification function to obtain a modified gradient; and use the modified gradient in a gradient descent algorithm.
8. The device of claim 7, wherein the program code instructions further cause the at least one hardware processor to iterate the modify and use until a minimum is found.
9. The device of claim 8, wherein the program code instructions further cause the at least one hardware processor to output a result corresponding to the minimum.
10. The device of claim 7, wherein the gradient modification function reduces relatively larger gradient magnitudes more than or as much as relatively smaller gradient magnitudes.
11. The device of claim 7, wherein the gradient modification function leaves magnitudes smaller than a given value unchanged.
12. The device of claim 7, wherein a plurality of gradients are used and the same gradient modification function is applied to gradients of more than one dimension.
13. A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least one hardware processor to perform the method of any one of claims 1-6.
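The claimed method can be illustrated with a short sketch. This is not the patented implementation: the logarithmic soft-clipping function `modify_gradient`, its `threshold` parameter, and the toy quadratic objective below are assumptions chosen only to satisfy the claimed properties — magnitudes below a given value pass through unchanged (claims 5 and 11), larger magnitudes are reduced more (claims 4 and 10), and the same modification function is applied to every dimension (claims 6 and 12).

```python
import math


def modify_gradient(g, threshold=1.0):
    """Gradient modification function (one possible choice).

    Magnitudes at or below `threshold` are returned unchanged;
    larger magnitudes are compressed logarithmically, so bigger
    gradients are reduced more than smaller ones.
    """
    mag = abs(g)
    if mag <= threshold:
        return g
    return math.copysign(threshold + math.log1p(mag - threshold), g)


def gradient_descent(grad_fn, theta, lr=0.04, steps=200, tol=1e-8):
    """Gradient descent using modified gradients.

    The same modification function is applied component-wise to the
    gradient vector; iteration stops when the modified gradient is
    (numerically) zero, i.e. a minimum has been reached.
    """
    for _ in range(steps):
        grads = grad_fn(theta)
        modified = [modify_gradient(g) for g in grads]
        theta = [t - lr * m for t, m in zip(theta, modified)]
        if max(abs(m) for m in modified) < tol:
            break
    return theta
```

As a usage example, minimizing the ill-conditioned quadratic f(x, y) = x^2 + 10y^2 (gradient (2x, 20y)) from the starting point (5, 5) converges to the origin: the large initial gradient in the y direction is compressed rather than taken at full magnitude, which avoids the overshoot that an unmodified step of the same learning rate would produce.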
PCT/EP2021/061615 2020-05-11 2021-05-04 Device and method for training a neural network WO2021228611A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP20305471 2020-05-11
EP20305471.3 2020-05-11

Publications (1)

Publication Number Publication Date
WO2021228611A1 true WO2021228611A1 (en) 2021-11-18

Family

ID=71103320

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/061615 WO2021228611A1 (en) 2020-05-11 2021-05-04 Device and method for training a neural network

Country Status (1)

Country Link
WO (1) WO2021228611A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180469A1 (en) * 2017-12-08 2019-06-13 Nvidia Corporation Systems and methods for dynamic facial analysis using a recurrent neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190180469A1 (en) * 2017-12-08 2019-06-13 Nvidia Corporation Systems and methods for dynamic facial analysis using a recurrent neural network

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
"Deep Learning", 1 January 2016, MIT PRESS, article GOODFELLOW IAN ET AL: "Chapter 10: Sequence Modeling: Recurrent and Recursive Nets", pages: 367 - 415, XP055828572 *
ALEC RADFORD ET AL.: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ARXIV PREPRINT ARXIV:1511.06434, 2015
ANGE TATO, ROGER NKAMBOU: "Improving Adam Optimizer", 2018
DAMIR VODENICAREVIC ET AL: "Nano-oscillator-based classification with a machine learning-compatible architecture", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 August 2018 (2018-08-25), XP081103944, DOI: 10.1063/1.5042359 *
ILYA LOSHCHILOV, FRANK HUTTER: "Fixing Weight Decay Regularization in Adam", ARXIV PREPRINT ARXIV:1711.05101, 2017
JINGZHAO ZHANG ET AL.: "Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition", ARXIV PREPRINT ARXIV:1905.11881, 2019
PASCANU RAZVAN ET AL: "On the difficulty of training Recurrent Neural Networks", PROCEEDINGS OF THE 30TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, ATLANTA, GEORGIA, USA, PMLR 28(3):1310-1318, 2013., vol. 28, 17 June 2013 (2013-06-17), pages 1310 - 1318, XP055828669, Retrieved from the Internet <URL:http://proceedings.mlr.press/v28/pascanu13.pdf> *
WILSON ASHIA ET AL: "Accelerating Rescaled Gradient Descent: Fast Optimization of Smooth Functions", ARXIV:1902.08825V3, 4 January 2020 (2020-01-04), XP055828371, Retrieved from the Internet <URL:https://arxiv.org/pdf/1902.08825.pdf> [retrieved on 20210728] *
ZHANG JINGZHAO ET AL: "Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition", ARXIV:1905.11881V1, 28 March 2019 (2019-03-28), pages 1 - 18, XP055828327, Retrieved from the Internet <URL:https://arxiv.org/pdf/1905.11881v1.pdf> [retrieved on 20210728] *
ZIJUN ZHANG: "2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS)", 2018, IEEE, article "Improved Adam Optimizer for Deep Neural Networks", pages: 1 - 2

Similar Documents

Publication Publication Date Title
JPWO2019004350A1 (en) Training method, training apparatus, program, and non-transitory computer-readable medium
CN109165735B (en) Method for generating sample picture based on generation of confrontation network and adaptive proportion
EP3989534A1 (en) Image collection method and apparatus, and device and storage medium
EP3779891A1 (en) Method and device for training neural network model, and method and device for generating time-lapse photography video
CN105118067A (en) Image segmentation method based on Gaussian smoothing filter
CN112614072B (en) Image restoration method and device, image restoration equipment and storage medium
CN114724007A (en) Training classification model, data classification method, device, equipment, medium and product
US20230138380A1 (en) Self-contrastive learning for image processing
US20190392311A1 (en) Method for quantizing a histogram of an image, method for training a neural network and neural network training system
US20240054605A1 (en) Methods and systems for wavelet domain-based normalizing flow super-resolution image reconstruction
WO2021228611A1 (en) Device and method for training a neural network
CN111630530A (en) Data processing system and data processing method
CN116704217A (en) Model training method, device and storage medium based on difficult sample mining
CN113822386B (en) Image identification method, device, equipment and medium
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN116245769A (en) Image processing method, device, equipment and storage medium
CN112927168B (en) Image enhancement method based on longicorn stigma search and differential evolution hybrid algorithm
US20220108220A1 (en) Systems And Methods For Performing Automatic Label Smoothing Of Augmented Training Data
EP4007173A1 (en) Data storage method, and data acquisition method and apparatus therefor
WO2021199226A1 (en) Learning device, learning method, and computer-readable recording medium
JP7047665B2 (en) Learning equipment, learning methods and learning programs
KR20230015186A (en) Method and Device for Determining Saturation Ratio-Based Quantization Range for Quantization of Neural Network
CN114529899A (en) Method and system for training convolutional neural networks
KR20220165121A (en) Method and apparatus for progressive image resolution improvement using neural ordinary differential equation, and image super-resolution method using the smae
GB2610531A (en) Optimizing capacity and learning of weighted real-valued logic

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21722480

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21722480

Country of ref document: EP

Kind code of ref document: A1