WO2018076331A1 - Neural network training method and device - Google Patents

Neural network training method and device

Info

Publication number
WO2018076331A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
neural network
nonlinear
updated
bit width
Prior art date
Application number
PCT/CN2016/103979
Other languages
English (en)
French (fr)
Inventor
陈云霁
庄毅敏
郭崎
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/103979
Publication of WO2018076331A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • the invention belongs to the technical field of neural networks, and in particular relates to a neural network training method and device.
  • Multi-layer neural networks have attracted more and more attention from academia and industry due to their high recognition accuracy and good parallelism, and are finding more and more applications in fields such as pattern recognition, image processing, and natural language processing.
  • the huge amount of model parameter data in neural networks makes it difficult to apply them to embedded systems.
  • researchers use a variety of ways to reduce the storage space required to store these model parameters.
  • the most common method is to store data using low-precision data representation methods. For example, a 16-bit floating-point data representation method, a 16-bit fixed-point data representation method, and a 1-bit binary data representation method are used.
  • the low-precision data representation method can reduce the data storage space.
  • however, because the neural network model parameter data span a very wide range of values, using a low-precision data representation brings a great loss of precision, which affects the performance of the neural network.
  • the invention provides a neural network training method for training parameters in a neural network, and the method comprises:
  • the parameter is updated according to the gradient value of the parameter to be updated.
  • in step S3, the gradient value Δy of the low bit width transform parameter to be updated is Δy = η · ∂L/∂y, where η is the learning rate of the neural network, L is the loss function of the backward pass, the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation; the gradient value of the parameter to be updated before the nonlinear transformation is Δx = f′(x)Δy; and in step S4 the expression for updating the parameter is x_new = x - Δx.
  • nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
  • the parameters include the weights and biases of the neural network.
  • steps S1 to S4 are repeatedly performed on the updated parameters until the parameter is less than a predetermined threshold, and the training is completed.
  • the invention also provides a neural network training device for training parameters in a neural network, the device comprising:
  • a nonlinear transformation module for nonlinearly transforming parameters using a nonlinear function to obtain transformation parameters
  • a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters
  • a backward gradient transform module configured to obtain, through the backward pass of the neural network, the gradient value of the low bit width transform parameter to be updated, and to obtain, according to the nonlinear function and the gradient value of the low bit width transform parameter to be updated, the gradient value of the parameter to be updated before the nonlinear transformation;
  • the update module is configured to update the parameter according to the gradient value of the parameter to be updated.
  • the gradient value Δy of the low bit width transform parameter to be updated is Δy = η · ∂L/∂y, where η is the learning rate of the neural network, L is the loss function of the backward pass, the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation; the gradient value of the parameter to be updated before the nonlinear transformation is Δx = f′(x)Δy.
  • nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
  • the parameters include the weights and biases of the neural network.
  • the nonlinear transform module, the low bit width conversion module, the inverse gradient transform module, and the update module repeatedly train the updated parameters until the parameter is less than a predetermined threshold, and the training is completed.
  • since the nonlinear transformation function makes the data range and data precision of the parameters controllable, the original precision of the data can be well preserved in the subsequent low bit width conversion, thereby ensuring the performance of the neural network.
  • the parameters obtained by training can be used in a dedicated neural network accelerator; because lower-precision parameters are used, the transmission bandwidth required by the accelerator is reduced, and the low-precision data reduce the hardware area overhead, for example by shrinking the arithmetic units, thereby optimizing the area and power efficiency of the hardware.
  • FIG. 1 is a flowchart of a neural network training method provided by the present invention.
  • FIG. 2 is a schematic structural view of a neural network training device provided by the present invention.
  • the invention provides a neural network training device and method for training parameters in a neural network. First, a nonlinear function is used to nonlinearly transform the parameters to obtain transform parameters; then the transform parameters undergo bit width conversion to obtain low bit width transform parameters; next, the gradient value of the low bit width transform parameter to be updated is obtained through the backward pass of the neural network, and the gradient value of the parameter to be updated before the nonlinear transformation is obtained according to the nonlinear function and the gradient value of the low bit width transform parameter to be updated; finally, the parameter is updated according to its gradient value to be updated.
  • the present invention allows the post-training parameters to have a lower bit width with less loss of precision.
  • FIG. 1 is a flowchart of a neural network training method provided by the present invention. As shown in FIG. 1, the method includes:
  • the nonlinear transformation function used in this step is not unique, and different nonlinear functions can be selected according to the actual use requirements, which can be a hyperbolic tangent series function, a sigmoid series function, and the like.
  • the parameter data of a neural network that normally uses a high-precision data representation span a relatively large range, so converting the original full-precision data into low-precision data through bit width conversion causes a precision loss in the data itself that can affect the performance of the neural network; therefore, a hyperbolic tangent series function can be used for the nonlinear transformation.
  • the hyperbolic tangent series function can take the form p_w = A · tanh(B · w / Max_w). Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters. A controls the transformed data range and B controls the transformed data distribution. The tuning principle is as follows: the tanh transformation scales the data into the range [-1, 1]; by adjusting A, the transformed data can be scaled into the range [-A, A]; by adjusting B, the data can be placed on different segments of the tanh(x) function. When B is very small, most of the data to be transformed falls in the linear segment of tanh(x); when B is very large, most of it falls in the saturated segment; and when B takes a moderate value, most of the data to be transformed lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the data. Therefore, the purpose of the nonlinear transformation in the present invention is to make the range and the distribution of the data controllable.
  • other nonlinear functions or function parameters may be used as needed; for example, for a neural network that uses a 16-bit floating-point representation, since 16-bit floating point can represent a relatively large data range, the precision loss that deserves more attention is the loss caused by individual data values exceeding the 16-bit floating-point representation range, so a nonlinear function can be designed such that this portion of the data falls in the saturated segment while the other data lie in the nonlinear or linear segment.
  • the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the network.
  • the parameters of the neural network are floating point numbers.
  • the original full precision floating point number is nonlinearly transformed and converted into a low precision low bit width floating point number.
  • in the term "low bit width floating-point number", "low bit" means that the number of bits required to store one datum is smaller than the number of bits required for a full-precision floating-point number, and "width" means that, in a statistical sense, the data are distributed relatively evenly within the exponent range that the representation can express.
  • let the loss function of the backward pass be L; the gradient calculated by the backward pass is then ∂L/∂y. This is the gradient of y; by the chain rule, ∂L/∂x = (∂L/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters before the nonlinear transformation.
  • the parameter is updated according to the gradient value of the parameter to be updated.
  • steps S1 to S4 are repeatedly performed on the updated parameters until the parameters are less than a predetermined threshold, and the training is completed.
  • the network training device includes:
  • a nonlinear transformation module for nonlinearly transforming parameters using a nonlinear function to obtain transformation parameters
  • the nonlinear transformation function adopted by the nonlinear transformation module is not unique, and different nonlinear functions can be selected according to the actual use requirements, which can be a hyperbolic tangent series function, a sigmoid series function, and the like.
  • the parameter data of a neural network that normally uses a high-precision data representation span a relatively large range, so converting the original full-precision data into low-precision data through bit width conversion causes a precision loss in the data itself that can affect the performance of the neural network; therefore, a hyperbolic tangent series function can be used for the nonlinear transformation.
  • the hyperbolic tangent series function can take the form p_w = A · tanh(B · w / Max_w). Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters. A controls the transformed data range and B controls the transformed data distribution. The tuning principle is as follows: the tanh transformation scales the data into the range [-1, 1]; by adjusting A, the transformed data can be scaled into the range [-A, A]; by adjusting B, the data can be placed on different segments of the tanh(x) function. When B is very small, most of the data to be transformed falls in the linear segment of tanh(x); when B is very large, most of it falls in the saturated segment; and when B takes a moderate value, most of the data to be transformed lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the data. Therefore, the purpose of the nonlinear transformation in the present invention is to make the range and the distribution of the data controllable.
  • other nonlinear functions or function parameters may be used as needed; for example, for a neural network that uses a 16-bit floating-point representation, since 16-bit floating point can represent a relatively large data range, the precision loss that deserves more attention is the loss caused by individual data values exceeding the 16-bit floating-point representation range, so a nonlinear function can be designed such that this portion of the data falls in the saturated segment while the other data lie in the nonlinear or linear segment.
  • the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the network.
  • the low bit width conversion module is used for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters.
  • the parameters of the neural network are floating point numbers
  • the function of the low bit width conversion module is to convert the original full precision floating point number into a low precision low bit width floating point number after nonlinear transformation.
  • in the term "low bit width floating-point number", "low bit" means that the number of bits required to store one datum is smaller than the number of bits required for a full-precision floating-point number, and "width" means that, in a statistical sense, the data are distributed relatively evenly within the exponent range that the representation can express.
  • a backward gradient transform module configured to obtain, through the backward pass of the neural network, the gradient value of the low bit width transform parameter to be updated, and to obtain, according to the nonlinear function and the gradient value of the low bit width transform parameter to be updated, the gradient value of the parameter to be updated before the nonlinear transformation;
  • let the loss function of the backward pass be L; the gradient calculated by the backward pass is then ∂L/∂y. This is the gradient of y; by the chain rule, ∂L/∂x = (∂L/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters before the nonlinear transformation.
  • specifically, the gradient of y to be updated, Δy = η · ∂L/∂y, can be obtained through the backward pass of the neural network, where η is the learning rate of the neural network; the backward gradient transform module then computes Δx = f′(x)Δy to obtain the gradient value of x to be updated.
  • the update module is configured to update the parameter according to the gradient value of the parameter to be updated.
  • the neural network training device provided by the present invention repeatedly performs steps S1 to S4 on the updated parameters until the parameter is less than a predetermined threshold, and the training is completed.
  • the neural network of this embodiment supports a 16-bit floating-point data representation, and this embodiment uses a hyperbolic tangent series function as the nonlinear transformation function: p_w = A · tanh(B · w / Max_w), where w is the weight data before the transformation, p_w is the weight data after the transformation, Max_w is the maximum of all weight data in the layer, and A and B are constant parameters; A controls the transformed data range and B controls the transformed data distribution.
  • with A = 1 and B = 3, the nonlinear function is p_w = tanh(3 · w / Max_w); since the weight or bias data are small relative to their maximum value, the transformation maps the data into the range [-1, 1] with most of the data concentrated near 0. This functional form makes full use of the nonlinear segment of the function to compress the intervals with little data and make the data denser, while the linear segment of the function stretches the interval near 0 where most of the data are distributed, spreading the data out.
  • the weights and biases of each layer of the neural network are nonlinearly transformed by the nonlinear transform module to obtain the transformed weights and biases, i.e. y = tanh(3 · x / Max_x), where y is the transformed weight or bias, x is the weight or bias before the transformation, and Max_x is the maximum of all x; at the same time, the weight and bias data before the transformation are retained.
  • the transformed weight and bias data, together with the input data of each layer, are converted into the required 16-bit floating-point data by the low bit width floating-point data conversion module; the low-precision floating-point conversion can use direct truncation, i.e., for the original full-precision data, the portion of the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
  • the converted 16-bit floating-point data are used for neural network training, and the gradient value Δy to be updated is obtained through the backward pass of the neural network.
  • in Embodiment 2, a sigmoid series nonlinear transformation function is used; taking the weight data of one convolutional layer as an example, w is the weight data before the transformation, p_w is the transformed weight data, and A and B are constant parameters, where A controls the transformed data range and B controls the transformed data distribution.
  • the weights and biases of each layer of the neural network are nonlinearly transformed by the nonlinear transform module to obtain the transformed weights and biases y = f(x), where y is the transformed weight or bias and x is the weight or bias before the transformation; at the same time, the weight and bias data before the transformation are retained.
  • the transformed weight and bias data, together with the input data of each layer, are converted by low bit width conversion into 16-bit floating-point data; the low-precision floating-point conversion can use direct truncation, i.e., for the original full-precision data, the portion of the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
  • the converted 16-bit floating-point data are used for neural network training, and the gradient value Δy to be updated is obtained through the backward pass of the neural network.
  • the trained neural network of the present invention can be used for a neural network accelerator, a voice recognition device, an image recognition device, an automatic navigation device, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

A neural network training device and method for training parameters in a neural network. The method first performs a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters (S1), and then performs bit width conversion on the transform parameters to obtain low bit width transform parameters (S2); next, the gradient values to be updated of the low bit width transform parameters are obtained through the backward pass of the neural network, and the gradient values to be updated of the parameters before the nonlinear transformation are obtained according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters (S3); finally, the parameters are updated according to their gradient values to be updated (S4). The method enables the trained parameters to have a lower bit width with little loss of precision.

Description

Neural network training method and device
Technical Field
The present invention belongs to the technical field of neural networks, and in particular relates to a neural network training method and device.
Background Art
Owing to their high recognition accuracy and good parallelism, multi-layer neural networks have been receiving more and more attention from academia and industry, and are finding more and more applications in fields such as pattern recognition, image processing, and natural language processing.
Because of their huge amount of model parameter data, neural networks are difficult to apply to embedded systems. Researchers have used a variety of approaches to reduce the storage space required for these model parameters, the most common being low-precision data representations, for example a 16-bit floating-point representation, a 16-bit fixed-point representation, or even a 1-bit binary representation. Compared with the original-precision floating-point representation, a low-precision data representation reduces the data storage space; however, because the values of the neural network model parameters span a very wide range, a low-precision data representation brings a great loss of precision and affects the performance of the neural network.
Summary of the Invention
(1) Technical Problem to Be Solved
The purpose of the present invention is to provide a neural network training method and device for training the parameters in a neural network, so that the trained parameters have a low bit width with little loss of precision.
(2) Technical Solution
The present invention provides a neural network training method for training parameters in a neural network, the method comprising:
S1, performing a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters;
S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
S3, obtaining, through the backward pass of the neural network, the gradient values to be updated of the low bit width transform parameters, and obtaining, according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters, the gradient values to be updated of the parameters before the nonlinear transformation;
S4, updating the parameters according to their gradient values to be updated.
Further, in step S3, the gradient value Δy to be updated of the low bit width transform parameter is:
Δy = η · ∂L/∂y
where η is the learning rate of the neural network and L is the loss function of the backward pass of the neural network; the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
the gradient value to be updated of the parameter before the nonlinear transformation is:
Δx = f′(x)Δy;
in step S4, the expression for updating the parameter is:
x_new = x - Δx.
Further, the nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
Further, the parameters include the weights and biases of the neural network.
Further, steps S1 to S4 are repeated for the updated parameters until the parameters are smaller than a predetermined threshold, at which point training is complete.
The present invention also provides a neural network training device for training parameters in a neural network, the device comprising:
a nonlinear transform module for performing a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters;
a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
a backward gradient transform module for obtaining, through the backward pass of the neural network, the gradient values to be updated of the low bit width transform parameters, and obtaining, according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters, the gradient values to be updated of the parameters before the nonlinear transformation;
an update module for updating the parameters according to their gradient values to be updated.
Further, in the backward gradient transform module, the gradient value Δy to be updated of the low bit width transform parameter is:
Δy = η · ∂L/∂y
where η is the learning rate of the neural network and L is the loss function of the backward pass of the neural network; the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
the gradient value to be updated of the parameter before the nonlinear transformation is:
Δx = f′(x)Δy;
the expression by which the update module updates the parameter is:
x_new = x - Δx.
Further, the nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
Further, the parameters include the weights and biases of the neural network.
Further, the nonlinear transform module, the low bit width conversion module, the backward gradient transform module, and the update module repeatedly train the updated parameters until the parameters are smaller than a predetermined threshold, at which point training is complete.
(3) Beneficial Effects
The present invention has the following advantages:
1. Since the nonlinear transformation function makes the data range and data precision of the parameters controllable, the original precision of the data can be well preserved in the subsequent low bit width conversion, thereby ensuring the performance of the neural network.
2. Since the parameters obtained by training have a low bit width, the storage space required for the parameters is greatly reduced.
3. The parameters obtained by training can be used in a dedicated neural network accelerator. Because lower-precision parameters are used, the transmission bandwidth required by the accelerator is reduced, and the low-precision data reduce the hardware area overhead, for example by shrinking the arithmetic units, thereby optimizing the area and power efficiency of the hardware.
Brief Description of the Drawings
FIG. 1 is a flowchart of the neural network training method provided by the present invention.
FIG. 2 is a schematic structural diagram of the neural network training device provided by the present invention.
Detailed Description of the Embodiments
The present invention provides a neural network training device and method for training parameters in a neural network. The method first performs a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters, and then performs bit width conversion on the transform parameters to obtain low bit width transform parameters; next, the gradient values to be updated of the low bit width transform parameters are obtained through the backward pass of the neural network, and the gradient values to be updated of the parameters before the nonlinear transformation are obtained according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters; finally, the parameters are updated according to their gradient values to be updated. The present invention enables the trained parameters to have a lower bit width with little loss of precision.
FIG. 1 is a flowchart of the neural network training method provided by the present invention. As shown in FIG. 1, the method includes:
S1, performing a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters.
The nonlinear transformation function used in this step is not unique; different nonlinear functions can be chosen according to actual needs, for example a hyperbolic tangent series function or a sigmoid series function.
For example, consider a neural network that uses an 8-bit floating-point representation. The data range that an 8-bit floating-point representation can express is small, while the parameter data of a neural network that normally uses a high-precision representation span a relatively large range, so converting the original full-precision data into low-precision data through bit width conversion causes a precision loss in the data itself that affects the performance of the neural network. A hyperbolic tangent series function can therefore be used for the nonlinear transformation.
The hyperbolic tangent series function can take the following form:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the weight data after the transformation, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters, where A controls the transformed data range and B controls the transformed data distribution. The tuning principle is as follows: since the tanh transformation scales the data into the range [-1, 1], adjusting A scales the transformed data into the range [-A, A], and adjusting B places the data on different segments of the tanh(x) function. When B is very small, most of the data to be transformed falls in the linear segment of tanh(x); when B is very large, most of it falls in the saturated segment; and when B takes a moderate value, most of the data to be transformed lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the data. Therefore, the purpose of the nonlinear transformation in the present invention is to make both the range and the distribution of the data controllable.
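As a concrete illustration of this step, the following is a minimal NumPy sketch of the tanh-family transform in the assumed form p_w = A · tanh(B · w / Max_w); the function name, the use of the largest absolute value as Max_w, and the default values A = 1 and B = 3 are illustrative assumptions rather than part of the patent.

```python
import numpy as np

def nonlinear_transform(w, A=1.0, B=3.0):
    """Step S1 sketch: squeeze full-precision weights w into a controllable
    range/distribution with a tanh-family transform (assumed form)."""
    max_w = np.max(np.abs(w))          # Max_w taken as the largest magnitude in the layer (assumption)
    return A * np.tanh(B * w / max_w)  # output lies in [-A, A]; B selects the tanh segment
```

Small B keeps most weights on the near-linear part of tanh, large B pushes them into saturation, and a moderate B spreads them over the nonlinear segment, matching the tuning principle described above.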
In other scenarios, other nonlinear functions or function parameters can be used as needed. For example, for a neural network that uses a 16-bit floating-point representation, the data range that 16-bit floating point can express is relatively large, so the precision loss that deserves more attention is the loss caused by individual data values exceeding the 16-bit floating-point representation range; a nonlinear function can therefore be designed so that this portion of the data falls in the saturated segment while the rest of the data lies in the nonlinear or linear segment.
In addition, the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the neural network.
S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters.
In general, the parameters of a neural network are floating-point numbers. In this step, the original full-precision floating-point numbers are converted, after the nonlinear transformation, into low-precision, low bit width floating-point numbers. Here, "low bit" in "low bit width floating-point number" means that the number of bits required to store one datum is smaller than the number of bits required for a full-precision floating-point number, and "width" means that, in a statistical sense, the data are distributed relatively evenly within the exponent range that the representation can express. For example, suppose a set of data lies in [-5, 5] and is concentrated in [-1.5, 1.5]. Applying the nonlinear transformation tanh(x), the data in [-1.5, 1.5] lie on the linear segment of the transformation, where the slope is large, so the dense data distribution is spread out, while the data in [-5, -1.5] and [1.5, 5] lie on the strongly curved nonlinear segment, so the originally sparse data become denser; this makes the distribution of the data more uniform. Moreover, the transformation scales the data range into the region [-1, 1].
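The effect described in this example can be sketched as follows; NumPy's float16 is used here only as a stand-in for a generic 16-bit low bit width format, and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data matching the example: values span roughly [-5, 5] but cluster in [-1.5, 1.5].
x = np.clip(rng.normal(0.0, 1.0, 10_000) * 1.5, -5.0, 5.0)

y = np.tanh(x)                 # S1: dense region near 0 is stretched, sparse tails are compressed
y16 = y.astype(np.float16)     # S2 stand-in: convert to a 16-bit floating-point representation

print(float(np.abs(y - y16).max()))  # worst-case precision loss introduced by the 16-bit step
```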
S3, obtaining, through the backward pass of the neural network, the gradient values to be updated of the low bit width transform parameters, and obtaining, according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters, the gradient values to be updated of the parameters before the nonlinear transformation.
In this step, let the nonlinear transformation function used for the nonlinear transformation be y = f(x), where x is the weight or bias parameter before the transformation and y is the weight or bias parameter after the transformation. Let the loss function of the backward pass be L. The gradient computed by the backward pass is then ∂L/∂y. This is the gradient of y; by the chain rule, ∂L/∂x = (∂L/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters before the nonlinear transformation. Specifically, the gradient of y to be updated, obtained through the backward pass of the neural network, is Δy = η · ∂L/∂y, where η is the learning rate of the neural network; the backward gradient transform module then computes Δx = f′(x)Δy to obtain the gradient value of x to be updated.
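A minimal sketch of this backward gradient transform for the tanh-family transform assumed earlier (f(x) = tanh(B · x / Max_x)) might look like the following; treating Max_x as a constant during differentiation is an assumption of this sketch.

```python
import numpy as np

def backward_gradient_transform(x, dL_dy, eta, B=3.0):
    """Step S3 sketch: propagate the gradient of the transformed parameter y
    back to the pre-transform parameter x via delta_x = f'(x) * delta_y."""
    max_x = np.max(np.abs(x))                                    # normalizer, treated as a constant
    f_prime = (B / max_x) * (1.0 - np.tanh(B * x / max_x) ** 2)  # derivative of tanh(B*x/max_x)
    delta_y = eta * dL_dy          # gradient of y to be updated, scaled by the learning rate
    return f_prime * delta_y       # delta_x = f'(x) * delta_y
```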
S4, updating the parameters according to their gradient values to be updated.
In this step, the expression for updating the parameter is:
x_new = x - Δx.
Then, steps S1 to S4 are repeated for the updated parameters until the parameters are smaller than a predetermined threshold, at which point training is complete.
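Putting steps S1 to S4 together, one possible training loop for a single parameter tensor is sketched below. The forward_backward callback, the stopping test on the parameter magnitude (taken literally from the wording above), and the tanh-family transform with constants A and B are all illustrative assumptions.

```python
import numpy as np

def train_parameter(x, forward_backward, eta=0.01, threshold=1e-3, A=1.0, B=3.0):
    """Sketch of one parameter tensor trained with steps S1-S4.
    forward_backward(y16) stands for the network's forward plus backward pass
    and must return dL/dy evaluated at the low bit width parameters y16."""
    while np.max(np.abs(x)) >= threshold:        # repeat S1-S4 until the parameter is below the threshold
        max_x = np.max(np.abs(x))
        y = A * np.tanh(B * x / max_x)           # S1: nonlinear transform
        y16 = y.astype(np.float16)               # S2: low bit width conversion (float16 stand-in)
        dL_dy = forward_backward(y16)            # S3: gradient of y from the backward pass
        f_prime = (A * B / max_x) * (1.0 - np.tanh(B * x / max_x) ** 2)
        x = x - f_prime * (eta * dL_dy)          # S3 + S4: delta_x = f'(x)*delta_y, x_new = x - delta_x
    return x
```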
FIG. 2 is a schematic structural diagram of the neural network training device provided by the present invention. As shown in FIG. 2, the network training device includes:
a nonlinear transform module for performing a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters;
The nonlinear transformation function used by the nonlinear transform module is not unique; different nonlinear functions can be chosen according to actual needs, for example a hyperbolic tangent series function or a sigmoid series function.
For example, consider a neural network that uses an 8-bit floating-point representation. The data range that an 8-bit floating-point representation can express is small, while the parameter data of a neural network that normally uses a high-precision representation span a relatively large range, so converting the original full-precision data into low-precision data through bit width conversion causes a precision loss in the data itself that affects the performance of the neural network. A hyperbolic tangent series function can therefore be used for the nonlinear transformation.
The hyperbolic tangent series function can take the following form:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the weight data after the transformation, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters, where A controls the transformed data range and B controls the transformed data distribution. The tuning principle is as follows: since the tanh transformation scales the data into the range [-1, 1], adjusting A scales the transformed data into the range [-A, A], and adjusting B places the data on different segments of the tanh(x) function. When B is very small, most of the data to be transformed falls in the linear segment of tanh(x); when B is very large, most of it falls in the saturated segment; and when B takes a moderate value, most of the data to be transformed lies in the nonlinear segment, which changes the original distribution of the data without changing the relative order of the data. Therefore, the purpose of the nonlinear transformation in the present invention is to make both the range and the distribution of the data controllable.
In other scenarios, other nonlinear functions or function parameters can be used as needed. For example, for a neural network that uses a 16-bit floating-point representation, the data range that 16-bit floating point can express is relatively large, so the precision loss that deserves more attention is the loss caused by individual data values exceeding the 16-bit floating-point representation range; a nonlinear function can therefore be designed so that this portion of the data falls in the saturated segment while the rest of the data lies in the nonlinear or linear segment.
In addition, the parameters referred to in the present invention are the weights and biases of the neural network; when the neural network is a multi-layer neural network, the parameters are the weights and biases of each layer of the neural network.
a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters.
In general, the parameters of a neural network are floating-point numbers. The role of the low bit width conversion module is to convert the original full-precision floating-point numbers, after the nonlinear transformation, into low-precision, low bit width floating-point numbers. Here, "low bit" in "low bit width floating-point number" means that the number of bits required to store one datum is smaller than the number of bits required for a full-precision floating-point number, and "width" means that, in a statistical sense, the data are distributed relatively evenly within the exponent range that the representation can express. For example, suppose a set of data lies in [-5, 5] and is concentrated in [-1.5, 1.5]. Applying the nonlinear transformation tanh(x), the data in [-1.5, 1.5] lie on the linear segment of the transformation, where the slope is large, so the dense data distribution is spread out, while the data in [-5, -1.5] and [1.5, 5] lie on the strongly curved nonlinear segment, so the originally sparse data become denser; this makes the distribution of the data more uniform. Moreover, the transformation scales the data range into the region [-1, 1].
a backward gradient transform module for obtaining, through the backward pass of the neural network, the gradient values to be updated of the low bit width transform parameters, and obtaining, according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters, the gradient values to be updated of the parameters before the nonlinear transformation;
In the backward gradient transform module, let the nonlinear transformation function used for the nonlinear transformation be y = f(x), where x is the weight or bias parameter before the transformation and y is the weight or bias parameter after the transformation. Let the loss function of the backward pass be L. The gradient computed by the backward pass is then ∂L/∂y. This is the gradient of y; by the chain rule, ∂L/∂x = (∂L/∂y) · f′(x), the gradient of x can be obtained, and the gradient of x is used to update the weight and bias parameters before the nonlinear transformation. Specifically, the gradient of y to be updated, obtained through the backward pass of the neural network, is Δy = η · ∂L/∂y, where η is the learning rate of the neural network; the backward gradient transform module then computes Δx = f′(x)Δy to obtain the gradient value of x to be updated.
an update module for updating the parameters according to their gradient values to be updated.
The expression by which the update module updates the parameter is:
x_new = x - Δx.
With the neural network training device provided by the present invention, steps S1 to S4 are repeated for the updated parameters until the parameters are smaller than a predetermined threshold, at which point training is complete.
Embodiment 1
The neural network of this embodiment supports a 16-bit floating-point data representation, and this embodiment uses a hyperbolic tangent series function as the nonlinear transformation function:
p_w = A · tanh(B · w / Max_w)
Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the weight data after the transformation, Max_w is the maximum of all weight data in that layer, and A and B are constant parameters, where A controls the transformed data range and B controls the transformed data distribution. By adjusting A and B, unevenly distributed data can be spread relatively evenly over the exponent domain and mapped into a specified range.
Specifically, when A = 1 and B = 3, the nonlinear function becomes:
p_w = tanh(3 · w / Max_w)
Since the weight or bias data are small relative to their maximum value, tanh(3 · w / Max_w) maps the data into the range [-1, 1], with most of the data concentrated near 0. This functional form makes full use of the nonlinear segment of the function to compress the intervals where little data is distributed, making the data denser, while using the linear segment of the function to stretch the interval near 0 where most of the data is distributed, spreading the data out.
For the weights and biases of each layer of the neural network, a nonlinear transformation is performed by the nonlinear transform module to obtain the transformed weights and biases, i.e.
y = tanh(3 · x / Max_x)
where y is the transformed weight or bias, x is the weight or bias before the transformation, and Max_x is the maximum of all x. At the same time, the weight and bias data before the transformation are retained.
The transformed weight and bias data, together with the input data of each layer, are converted into the required 16-bit floating-point data by the low bit width floating-point data conversion module. The low-precision floating-point conversion can use direct truncation: for the original full-precision data, the portion of the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
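Direct truncation differs from the round-to-nearest conversion that a plain float16 cast performs. The sketch below truncates a float32 value to the 10 mantissa bits a 16-bit float can hold by masking the low bits; handling of the exponent-range limits of a real 16-bit format is omitted, and the helper name is illustrative.

```python
import numpy as np

def truncate_to_fp16_precision(x):
    """Keep only the precision a 16-bit float can represent (10 mantissa bits)
    and simply discard the rest, instead of rounding to the nearest value."""
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x32.view(np.uint32) & np.uint32(0xFFFFE000)  # drop the 13 low mantissa bits (23 - 10)
    return bits.view(np.float32)

w = np.float32(0.123456789)
print(truncate_to_fp16_precision(w)[0], np.float16(w))  # truncation vs. round-to-nearest
```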
The converted 16-bit floating-point data are used for neural network training, and the gradient values Δy to be updated are obtained through the backward pass of the neural network.
The gradient values to be updated of the weights and biases before the nonlinear transformation are then computed by the backward gradient transform module, i.e.
Δx = f′(x)Δy
and these gradient values are used to update the weights and biases before the nonlinear transformation, i.e. x_new = x - Δx.
The above steps are repeated until training is complete.
Embodiment 2
Embodiment 2 differs from Embodiment 1 in that a sigmoid series nonlinear transformation function is used. Taking the weight data of one convolutional layer of the neural network as an example, w is the weight data before the transformation, p_w is the transformed weight data, and A and B are constant parameters, where A controls the transformed data range and B controls the transformed data distribution. By adjusting A and B, unevenly distributed data can be spread relatively evenly over the exponent domain and mapped into a specified range. Assuming that the weight parameter data of this neural network are generally on the order of 10^-4, a nonlinear function of this family with A and B chosen to match that magnitude can be adopted; this functional form makes full use of the nonlinear and linear segments of the function so that the data range is distributed more uniformly.
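The exact expression of the sigmoid-family transform appears in the original only as an image and is not recoverable from the text, so the sketch below assumes one plausible member of that family, A · sigmoid(B · w); B on the order of 1e4 is chosen only so that weights of order 1e-4 land on the useful linear and nonlinear segments, as the paragraph above suggests.

```python
import numpy as np

def sigmoid_transform(w, A=1.0, B=1.0e4):
    """Embodiment 2 sketch with an assumed sigmoid-family form A * sigmoid(B * w);
    A controls the output range, B controls where the data land on the curve."""
    return A / (1.0 + np.exp(-B * w))

w = np.array([-3e-4, -1e-4, 0.0, 1e-4, 3e-4], dtype=np.float32)
print(sigmoid_transform(w))  # tiny weights are spread across the sigmoid's linear/nonlinear range
```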
For the weights and biases of each layer of the neural network, a nonlinear transformation is performed by the nonlinear transform module to obtain the transformed weights and biases y = f(x), where y is the transformed weight or bias and x is the weight or bias before the transformation. At the same time, the weight and bias data before the transformation are retained.
The transformed weight and bias data, together with the input data of each layer, are converted by low bit width conversion into 16-bit floating-point data. The low-precision floating-point conversion can use direct truncation: for the original full-precision data, the portion of the precision that the 16-bit floating-point representation can express is kept, and the portion beyond that precision is simply discarded.
The converted 16-bit floating-point data are used for neural network training, and the gradient values Δy to be updated are obtained through the backward pass of the neural network.
The gradient values to be updated of the weights and biases before the nonlinear transformation are then computed by the backward gradient transform module, i.e.
Δx = f′(x)Δy
and these gradient values are used to update the weights and biases before the nonlinear transformation, i.e. x_new = x - Δx.
The above steps are repeated until training is complete.
The neural network trained according to the present invention can be used in neural network accelerators, speech recognition devices, image recognition devices, automatic navigation devices, and the like.
The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A neural network training method for training parameters in a neural network, characterized in that the method comprises:
    S1, performing a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters;
    S2, performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
    S3, obtaining, through the backward pass of the neural network, the gradient values to be updated of the low bit width transform parameters, and obtaining, according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters, the gradient values to be updated of the parameters before the nonlinear transformation;
    S4, updating the parameters according to their gradient values to be updated.
  2. The neural network training method according to claim 1, characterized in that, in step S3, the gradient value Δy to be updated of the low bit width transform parameter is:
    Δy = η · ∂L/∂y
    where η is the learning rate of the neural network and L is the loss function of the backward pass of the neural network; the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
    the gradient value to be updated of the parameter before the nonlinear transformation is:
    Δx = f′(x)Δy;
    in step S4, the expression for updating the parameter is:
    x_new = x - Δx.
  3. The neural network training method according to claim 1, characterized in that the nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
  4. The neural network training method according to claim 1, characterized in that the parameters include the weights and biases of the neural network.
  5. The neural network training method according to claim 1, characterized in that steps S1 to S4 are repeated for the updated parameters until the parameters are smaller than a predetermined threshold, at which point training is complete.
  6. A neural network training device for training parameters in a neural network, characterized in that the device comprises:
    a nonlinear transform module for performing a nonlinear transformation on the parameters using a nonlinear function to obtain transform parameters;
    a low bit width conversion module for performing low bit width conversion on the transform parameters to obtain low bit width transform parameters;
    a backward gradient transform module for obtaining, through the backward pass of the neural network, the gradient values to be updated of the low bit width transform parameters, and obtaining, according to the nonlinear function and the gradient values to be updated of the low bit width transform parameters, the gradient values to be updated of the parameters before the nonlinear transformation;
    an update module for updating the parameters according to their gradient values to be updated.
  7. The neural network training device according to claim 6, characterized in that, in the backward gradient transform module, the gradient value Δy to be updated of the low bit width transform parameter is:
    Δy = η · ∂L/∂y
    where η is the learning rate of the neural network and L is the loss function of the backward pass of the neural network; the nonlinear transformation function is y = f(x), x is the parameter before the nonlinear transformation, and y is the weight or bias parameter after the nonlinear transformation;
    the gradient value to be updated of the parameter before the nonlinear transformation is:
    Δx = f′(x)Δy;
    the expression by which the update module updates the parameter is:
    x_new = x - Δx.
  8. The neural network training device according to claim 6, characterized in that the nonlinear function is a hyperbolic tangent series function or a sigmoid series function.
  9. The neural network training device according to claim 6, characterized in that the parameters include the weights and biases of the neural network.
  10. The neural network training device according to claim 6, characterized in that the nonlinear transform module, the low bit width conversion module, the backward gradient transform module, and the update module repeatedly train the updated parameters until the parameters are smaller than a predetermined threshold, at which point training is complete.
PCT/CN2016/103979 2016-10-31 2016-10-31 Neural network training method and device WO2018076331A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (zh) 2016-10-31 2016-10-31 Neural network training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103979 WO2018076331A1 (zh) 2016-10-31 2016-10-31 Neural network training method and device

Publications (1)

Publication Number Publication Date
WO2018076331A1 true WO2018076331A1 (zh) 2018-05-03

Family

ID=62024220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103979 WO2018076331A1 (zh) 2016-10-31 2016-10-31 Neural network training method and device

Country Status (1)

Country Link
WO (1) WO2018076331A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5280564A (en) * 1991-02-20 1994-01-18 Honda Giken Kogyo Kabushiki Kaisha Neural network having an optimized transfer function for each neuron
CN1846218A (zh) * 2003-09-09 2006-10-11 西麦恩公司 Artificial neural network
CN105550748A (zh) * 2015-12-09 2016-05-04 四川长虹电器股份有限公司 Method for constructing a novel neural network based on the hyperbolic tangent function
CN105787439A (zh) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human joint positioning method based on a convolutional neural network
CN105976027A (zh) * 2016-04-29 2016-09-28 北京比特大陆科技有限公司 Data processing method and apparatus, and chip

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555508A (zh) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Artificial neural network adjustment method and device
WO2020019236A1 (en) * 2018-07-26 2020-01-30 Intel Corporation Loss-error-aware quantization of a low-bit neural network
CN111198714A (zh) * 2018-11-16 2020-05-26 上海寒武纪信息科技有限公司 Retraining method and related products
CN111198714B (zh) * 2018-11-16 2022-11-18 寒武纪(西安)集成电路有限公司 Retraining method and related products
CN112114874A (zh) * 2020-08-20 2020-12-22 北京百度网讯科技有限公司 Data processing method and apparatus, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
WO2017185412A1 (zh) Apparatus and method for neural network operations supporting fixed-point numbers with a small number of bits
WO2018076331A1 (zh) Neural network training method and device
CN107340993B (zh) Operation apparatus and method
CN110969250B (zh) Neural network training method and device
Wang et al. Multi-scale dilated convolution of convolutional neural network for image denoising
CN110659734B (zh) Low-bit quantization method for depthwise separable convolution structures
CN113052868B (zh) Matting model training and image matting method and device
CN111985523A (zh) Power-of-two-exponent deep neural network quantization method based on knowledge distillation training
CN109284761B (zh) Image feature extraction method, apparatus, and device, and readable storage medium
CN106203625A (zh) Deep neural network training method based on multiple pre-training
DE112020003600T5 (de) Machine learning hardware with reduced-precision parameter components for efficient parameter updating
CN104504015A (zh) Learning algorithm based on dynamic incremental dictionary updating
CN111311530B (zh) Multi-focus image fusion method based on directional filters and a deconvolutional neural network
CN107863111A (zh) Interaction-oriented speech corpus processing method and device
CN109389222A (zh) Fast adaptive neural network optimization method
WO2023020456A1 (zh) Network model quantization method and apparatus, device, and storage medium
WO2019037409A1 (zh) Neural network training system and method, and computer-readable storage medium
WO2020118553A1 (zh) Convolutional neural network quantization method and apparatus, and electronic device
CN112257466B (zh) Model compression method applied to small machine translation devices
WO2020253692A1 (zh) Quantization method for deep learning network parameters
CN112561050B (zh) Neural network model training method and device
CN111382854A (zh) Convolutional neural network processing method, apparatus, and device, and storage medium
CN108694414A (zh) Digital forensics file fragment classification method based on digital image conversion and deep learning
CN114781603B (zh) High-precision activation function for CNN model image classification tasks
US11823043B2 (en) Machine learning with input data domain transformation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16919904

Country of ref document: EP

Kind code of ref document: A1
