WO2017185335A1 - Apparatus and method for performing a batch normalization operation - Google Patents

Apparatus and method for performing a batch normalization operation

Info

Publication number
WO2017185335A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
unit
instruction
batch normalization
batch
Prior art date
Application number
PCT/CN2016/080695
Other languages
English (en)
French (fr)
Inventor
刘少礼
于涌
陈云霁
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/080695
Publication of WO2017185335A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • The present invention relates to artificial neural network technology, and in particular to an apparatus and method for performing the forward and backward batch normalization operations in an artificial neural network.
  • Multi-layer artificial neural networks are widely used in the fields of pattern recognition, image processing, function approximation, and optimization.
  • In recent years, multi-layer artificial neural networks have attracted increasing attention from both academia and industry owing to their high recognition accuracy and good parallelizability, and the batch normalization operation in multi-layer artificial neural networks is increasingly applied because it accelerates neural network training and improves recognition accuracy.
  • One known method of supporting batch normalization operations is to use a general-purpose processor.
  • This method supports the above algorithm by executing general-purpose instructions with a general-purpose register file and general-purpose functional units.
  • One disadvantage of this approach is that the computational performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations.
  • When multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck.
  • In addition, the general-purpose processor must decode the multi-layer artificial neural network operation into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power overhead.
  • Another known method of supporting batch normalization is to use a graphics processing unit (GPU).
  • This method supports the above algorithm by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units.
  • Because the GPU is a device dedicated to graphics operations and scientific computing, with no special support for multi-layer artificial neural network batch normalization operations, a large amount of front-end decoding work is still required to perform multi-layer artificial neural network operations, bringing substantial extra overhead.
  • In addition, the GPU has only a small on-chip cache, so the model data of multi-layer artificial neural network batch normalization must be transferred repeatedly from off-chip, and off-chip bandwidth becomes the main performance bottleneck.
  • Moreover, the batch normalization operation involves a large number of normalization operations such as summation, and the parallel architecture of the GPU is ill-suited to such workloads.
  • One aspect of the present invention provides an apparatus for performing an artificial neural network batch normalization operation, comprising an instruction storage unit, a controller unit, a data access unit, and an operation module, wherein: the instruction storage unit is configured to read in instructions through the data access unit and cache the read instructions; the controller unit is configured to read instructions from the instruction storage unit and decode them into microinstructions that control the operation module; the data access unit is configured to write data from the external address space into the corresponding data cache unit of the operation module, or to read data from that cache unit back to the external address space; and the operation module is used for the actual computation on the data.
  • Another aspect of the present invention provides a method of performing the batch normalization forward operation using the above apparatus.
  • Let x be each input neuron element and y the corresponding output element; the apparatus computes y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps)+beta in parallel.
  • During training, the forward operation requires dynamic computation of the mean E[x] and the variance var[x].
  • The summation (normalization) operations in the mean and variance computation are performed by the operation module of the apparatus, thereby computing the mean and variance of each iteration of the training process.
  • Another aspect of the present invention provides a method of performing the batch normalization backward operation using the above apparatus.
  • Suppose the gradient arriving at one pixel is dl/dY and the output of the forward pass is Y; then the gradient propagated backward through batch normalization is dl/dx = (alpha/sqrt(var(x)+eps))*(dl/dY-mean(dl/dY)-mean(dl/dY*Y)*Y).
  • The backward pass of batch normalization completes, in parallel through the operation unit, the normalization operations on the neurons, for example taking the mean and the variance.
  • The present invention can be applied in the following (non-limiting) scenarios: data processing; electronic products such as robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cloud servers, cameras, camcorders, projectors, watches, earphones, mobile storage, and wearable devices; transportation such as aircraft, ships, and vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.
  • By adopting the apparatus and instruction set for performing batch normalization operations, the present invention solves the problems of insufficient CPU and GPU computational performance and large front-end decoding overhead, effectively improving support for the forward and backward batch normalization operations.
  • By adopting a dedicated on-chip cache for the batch normalization operation, the present invention fully exploits the reusability of input neurons and intermediate data, avoids repeatedly reading these data from memory, reduces memory access bandwidth, and prevents memory bandwidth from becoming a bottleneck for multi-layer artificial neural network forward-computation performance.
  • By adopting a dedicated operation unit for batch normalization operations, the present invention better balances parallel and serial execution, avoiding the weaknesses that a CPU architecture computes only serially and is slow at large data sizes, while a GPU architecture computes only in parallel and handles normalization operations poorly.
  • The data storage unit and the operation unit cooperate to balance the serial and parallel parts of the normalization computation.
  • FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing a batch normalization operation according to an embodiment of the present invention.
  • FIG. 2 shows an example block diagram of the structure of the operation module in an apparatus for performing a batch normalization operation according to an embodiment of the present invention.
  • FIG. 3 shows an example block diagram of a batch normalization operation process according to an embodiment of the present invention.
  • FIG. 4 shows a flowchart of a batch normalization operation according to an embodiment of the present invention.
  • The batch normalization operation consists of two parts, forward and backward.
  • During training of an artificial neural network, both the forward and backward passes of the batch normalization operation are needed, while during use of the artificial neural network only the forward pass is executed.
  • The parameters obtained during training are used when the artificial neural network is deployed; for example, the mean and variance of the batch normalization operation need not be recomputed.
  • The apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, and an operation module 4.
  • The instruction storage unit 1, the controller unit 2, the data access unit 3, and the operation module 4 can all be implemented by hardware circuits (including, but not limited to, FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
  • The instruction storage unit 1 reads in instructions through the data access unit 3 and caches the read instructions.
  • The instruction storage unit can be implemented by a variety of different memory devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, etc.).
  • The controller unit 2 reads instructions from the instruction storage unit 1, decodes them into microinstructions that control the behavior of other units or modules, and then distributes the respective microinstructions to those units or modules, such as the data access unit 3 and the operation module 4.
  • The data access unit 3 can access the external address space, reading and writing data directly to each cache unit inside the apparatus to complete data loading and storage.
  • The operation module 4 includes an operation unit 41, a data dependency determination unit 42, a neuron cache unit 43, and an intermediate value cache unit 44.
  • The operation unit 41 receives the microinstructions issued by the controller unit 2 and performs arithmetic-logic operations.
  • The data dependency determination unit 42 is responsible for the read and write operations on the neuron cache unit during computation. Before performing a read or write, the data dependency determination unit 42 first ensures that no read-write consistency conflict exists among the data used by different instructions. For example, all microinstructions sent to the data dependency unit 42 are stored in an instruction queue inside the unit; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction earlier in the queue, the read instruction may execute only after the write instruction on which it depends has completed.
  • The neuron cache unit 43 caches the input neuron vector data and the output neuron value data of the operation module 4.
  • The intermediate value cache unit 44 caches the intermediate value data required by the operation module 4 during computation, for example the partial sums and partial sums of squares calculated during the operation. For each operation module 4, the intermediate value cache unit 44 stores the intermediate value data of the batch normalization process. Consider, for example, the forward batch normalization operation during use of the artificial neural network. Suppose x is each input neuron datum and y is the output neuron datum. The learning parameters alpha and beta, which are continually updated during backward training, are used in the formula for computing the output neuron data y.
  • The small constant eps represents a very small quantity, typically on the order of 10^-5, and can also be set to 0 in actual use.
  • The mean E[x] denotes the mean of the input neuron data x, taken with the batch size as the total count, and var[x] denotes the corresponding variance of the input neuron data x, likewise with the batch size as the total count.
  • The input neuron data usually have four dimensions: the input batch (also called number) size, the number of input channels (channel), the input height, and the input width; these four dimensions determine the total number of input data x, and E[x] and var[x] are the mean and variance of the data over the other three dimensions, computed with the batch as the total count.
  • The operation unit computes y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps)+beta in parallel, and the result is returned to the data access unit to obtain the output neurons.
  • Since the apparatus stores data along the three dimensions channel, height, and width, the apparatus can read the input neuron data x in sequence and then perform the subsequent summation, mean, and variance operations.
  • The mean and variance in the batch normalization operation can use the already-computed mean and variance E(x) and var(x), with these parameters stored and used as constants.
  • Alternatively, the mean and variance can be computed from the input data within the forward pass itself, in which case the operation unit computes the mean and variance data each time.
  • In each training iteration, the input neurons pass through the operation unit to compute the mean and variance, and this partial data is placed in the intermediate value cache unit 44 for the subsequent computation of f(x) in that iteration.
  • The present invention also provides an instruction set for performing artificial neural network batch normalization operations on the aforementioned apparatus.
  • The instruction set includes the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction, where:
  • the CONFIG instruction configures the various constants required by the current layer's computation before the batch normalization computation begins;
  • the COMPUTE instruction completes the arithmetic-logic computation of the batch normalization process;
  • the IO instruction reads the input data required for computation from the external address space and stores the data back to the external space after the computation is completed;
  • the NOP instruction clears the microinstructions in all internal microinstruction storage queues of the apparatus, ensuring that all instructions before the NOP instruction have finished executing; the NOP instruction itself contains no operation;
  • the JUMP instruction controls the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow;
  • the MOVE instruction moves data at one address in the apparatus's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation-unit resources during execution.
  • FIG. 3 illustrates an example block diagram of the forward and backward artificial neural network batch normalization operations in accordance with an embodiment of the present invention.
  • For the formula out = (in - middle)/middle, in is the input neuron data and out is the output neuron data.
  • middle is an intermediate value of the operation process; the intermediate value is an intermediate result, such as the mean or the variance, on which the normalization must be performed.
  • The partial intermediate values [middle1, ..., middleN] of the normalization process are computed in parallel by the operation module 4 and stored in the intermediate value cache unit 44.
  • The operation module 4 then uses the intermediate value middle to compute, in parallel for each input neuron datum in, the output neuron datum out, yielding the final output vector.
  • FIG. 4 illustrates a flowchart of the batch normalization forward operation in a training process in accordance with one embodiment.
  • The flowchart describes the process of implementing the forward batch normalization operation shown in FIG. 3 using the apparatus and instruction set of the present invention.
  • In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.
  • In step S2, the operation begins: the controller unit 2 reads this IO instruction from the first address of the instruction storage unit 1 and, according to the decoded microinstructions, the data access unit 3 reads all the corresponding batch normalization forward-operation instructions from the external address space and caches them in the instruction storage unit 1.
  • In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit and, according to the decoded microinstructions, the data access unit 3 reads all the data required by the operation module 4 (for example, the input neuron vectors, the batch size, the learning parameters alpha and beta, the small value eps, the mean, and the variance) from the external address space into the neuron cache unit 43 of the operation module 4.
  • In step S4, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit and, according to the decoded microinstructions, the apparatus configures the batch normalization operation, for example whether this forward pass uses the already-computed mean and variance or computes them from the input.
  • In step S5, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded microinstructions, the operation module 4 reads the input neuron vector from the neuron cache unit, computes the mean and variance of the input neurons, and stores them in the intermediate value cache unit.
  • In step S6, according to the microinstructions decoded from the COMPUTE instruction, the operation module 4 subtracts the mean from the data in the input neuron cache unit and the intermediate value cache unit and divides by the square root of the sum of the variance and the small quantity eps, storing the result back in the intermediate value cache unit.
  • In step S7, according to the microinstructions decoded from the COMPUTE instruction, the operation module 4 reads the learning parameter alpha from the neuron cache unit 43, multiplies it by the intermediate value, adds the learning parameter beta, and returns the result to the neuron cache.
  • In step S8, the controller unit then reads the next IO instruction from the instruction storage unit; according to the decoded microinstructions, the data access unit 3 stores the output neuron vector in the neuron cache unit 43 to the designated address in the external address space, and the operation ends.
  • The forward pass of the batch normalization operation during use differs from the forward pass during training in that step S4 configures the use of the constant mean and variance, so they need not be computed dynamically each time; that is, step S5 is omitted.
  • Everything else is the same as in FIG. 4.
  • The backward pass of the batch normalization operation is similar to the forward pass described above; the difference is that the data operated on differ.
  • Suppose the gradient arriving at one pixel is dl/dY, the gradient propagated backward is dl/dx, and the output of the forward pass is Y, with the remaining parameters having the same meaning as in the forward pass.
  • Then the gradient propagated backward through batch normalization is dl/dx = (alpha/sqrt(var(x)+eps))*(dl/dY-mean(dl/dY)-mean(dl/dY*Y)*Y), where mean is the averaging operation.
  • The backward pass of batch normalization normalizes the gradient data through the operation unit, for example taking the mean and the variance.
  • The operation unit then completes the remaining operations of the formulas in parallel.
  • By using a dedicated operation unit for batch normalization operations, the relationship between parallel and serial execution is better balanced, avoiding the weaknesses that a CPU architecture computes only serially and is slow at large data sizes, while a GPU architecture computes only in parallel and handles normalization poorly.
  • The data storage unit and the operation unit cooperate to balance the serial and parallel parts of the normalization computation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An apparatus for performing a batch normalization operation, comprising an instruction storage unit (1), a controller unit (2), a data access unit (3), and an operation module (4), capable of implementing the batch normalization operation in a multi-layer artificial neural network. The instruction storage unit (1) reads in instructions through the data access unit (3) and caches the read instructions; the controller unit (2) reads instructions from the instruction storage unit (1), decodes them into microinstructions that control the behavior of the other units or modules, and then distributes the respective microinstructions to those units or modules; the data access unit (3) is used to access the external address space and complete the loading and storing of data; and the operation module (4) is used for the forward pass or the backward pass of the batch normalization operation. The apparatus effectively improves support for the forward and backward batch normalization operations in artificial neural networks.

Description

Apparatus and method for performing a batch normalization operation
Technical Field
The present invention relates to artificial neural network technology, and in particular to an apparatus and method for performing the forward and backward batch normalization operations in an artificial neural network.
Background Art
Multi-layer artificial neural networks are widely used in fields such as pattern recognition, image processing, function approximation, and optimization. In recent years they have attracted increasing attention from both academia and industry owing to their high recognition accuracy and good parallelizability, and the batch normalization operation in multi-layer artificial neural networks is increasingly applied because it accelerates neural network training and improves recognition accuracy.
One known method of supporting batch normalization operations is to use a general-purpose processor. This method supports the above algorithm by executing general-purpose instructions with a general-purpose register file and general-purpose functional units. One disadvantage of this method is that the computational performance of a single general-purpose processor is low and cannot meet the performance requirements of typical multi-layer artificial neural network operations. When multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, the general-purpose processor must decode the forward pass of the multi-layer artificial neural network into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power overhead.
Another known method of supporting batch normalization is to use a graphics processing unit (GPU). This method supports the above algorithm by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, with no special support for multi-layer artificial neural network batch normalization operations, a large amount of front-end decoding work is still required to perform multi-layer artificial neural network operations, bringing substantial extra overhead. Moreover, the GPU has only a small on-chip cache, so the model data of multi-layer artificial neural network batch normalization must be transferred repeatedly from off-chip, and off-chip bandwidth becomes the main performance bottleneck. Furthermore, batch normalization involves a large number of normalization operations such as summation, and the parallel architecture of the GPU is ill-suited to such workloads.
Summary of the Invention
One aspect of the present invention provides an apparatus for performing an artificial neural network batch normalization operation, comprising an instruction storage unit, a controller unit, a data access unit, and an operation module, wherein: the instruction storage unit is used to read in instructions through the data access unit and cache the read instructions; the controller unit is used to read instructions from the instruction storage unit and decode them into microinstructions that control the operation module; the data access unit is used to write data from the external address space into the corresponding data cache unit of the operation module, or to read data from that data cache unit back to the external address space; and the operation module is used for the actual computation on the data.
Another aspect of the present invention provides a method of performing the batch normalization forward operation using the above apparatus. During use, let x be each input neuron element and y the output element. The learning parameters alpha and beta, the small constant eps, the mean E[x], and the variance var[x] are all constants obtained during training, and the apparatus completes the forward computation y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps)+beta in parallel to obtain the output neurons. During training, the forward operation must compute the mean E[x] and the variance var[x] dynamically; the summation (normalization) operations of the mean and variance computation are completed by the operation module of the apparatus, thereby computing the mean and variance of each iteration of the training process.
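By way of illustration only, the following minimal software sketch (not the claimed hardware; the NumPy arrays and the function name are assumptions of this sketch) mirrors the forward computation above, computing E[x] and var[x] dynamically in training mode and accepting them as stored constants in use mode:

    import numpy as np

    def batch_norm_forward(x, alpha, beta, eps=1e-5, mean=None, var=None):
        # Training mode: no stored statistics, so compute E[x] and var[x]
        # dynamically with the batch (axis 0) as the total count.
        if mean is None or var is None:
            mean = x.mean(axis=0)
            var = x.var(axis=0)
        # y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps) + beta
        y = alpha * (x - mean) / np.sqrt(var + eps) + beta
        return y, mean, var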
Another aspect of the present invention provides a method of performing the batch normalization backward operation using the above apparatus. Suppose the gradient arriving at one pixel is dl/dY and the output of the forward pass is Y; then the gradient propagated backward through batch normalization is dl/dx = (alpha/sqrt(var(x)+eps))*(dl/dY-mean(dl/dY)-mean(dl/dY*Y)*Y). The gradient of the learning parameter alpha is dl/dalpha = (∑dl/dY)*Y, and the gradient of the learning parameter beta is dl/dbeta = ∑dl/dY. The backward pass of batch normalization completes, in parallel through the operation unit, the normalization operations on the neurons, for example taking the mean and the variance.
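Correspondingly, a software sketch of these backward formulas under the same assumptions (Y denotes the forward-pass output, the reductions run over the batch axis, and the formulas are kept exactly as stated above, including dl/dalpha = (∑dl/dY)*Y):

    import numpy as np

    def batch_norm_backward(dl_dY, Y, var, alpha, eps=1e-5):
        # dl/dx = (alpha/sqrt(var(x)+eps))
        #         * (dl/dY - mean(dl/dY) - mean(dl/dY*Y)*Y)
        inv_std = alpha / np.sqrt(var + eps)
        dl_dx = inv_std * (dl_dY - dl_dY.mean(axis=0)
                           - (dl_dY * Y).mean(axis=0) * Y)
        dl_dalpha = dl_dY.sum(axis=0) * Y   # as written in the text
        dl_dbeta = dl_dY.sum(axis=0)        # dl/dbeta = sum(dl/dY)
        return dl_dx, dl_dalpha, dl_dbeta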
The present invention can be applied in the following (non-limiting) scenarios: data processing; electronic products such as robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cloud servers, cameras, camcorders, projectors, watches, earphones, mobile storage, and wearable devices; transportation such as aircraft, ships, and vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.
By adopting the apparatus and instruction set for performing batch normalization operations, the present invention solves the problems of insufficient CPU and GPU computational performance and large front-end decoding overhead, effectively improving support for the forward and backward batch normalization operations.
By adopting a dedicated on-chip cache for the batch normalization operation, the present invention fully exploits the reusability of input neurons and intermediate data, avoids repeatedly reading these data from memory, reduces memory access bandwidth, and prevents memory bandwidth from becoming a bottleneck for multi-layer artificial neural network forward-computation performance.
By adopting a dedicated operation unit for the batch normalization operation, the present invention better balances parallel and serial execution, avoiding the weaknesses that a CPU architecture computes only serially and is slow at large data sizes, while a GPU architecture computes only in parallel and handles normalization poorly. In the present invention, the data storage unit and the operation unit cooperate to balance the serial and parallel parts of the normalization computation.
Brief Description of the Drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing a batch normalization operation according to an embodiment of the present invention.
FIG. 2 shows an example block diagram of the structure of the operation module in an apparatus for performing a batch normalization operation according to an embodiment of the present invention.
FIG. 3 shows an example block diagram of a batch normalization operation process according to an embodiment of the present invention.
FIG. 4 shows a flowchart of a batch normalization operation according to an embodiment of the present invention.
Detailed Description of Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
The batch normalization operation comprises a forward part and a backward part. Both the forward and backward passes of the batch normalization operation are needed during training of an artificial neural network, whereas only the forward pass is executed during use of the trained network. During use, the parameters obtained from training are employed; for example, data such as the mean and variance of the batch normalization operation need not be recomputed.
FIG. 1 shows the overall structure of an apparatus for performing an artificial neural network batch normalization operation according to the present invention. As shown in FIG. 1, the apparatus comprises an instruction storage unit 1, a controller unit 2, a data access unit 3, and an operation module 4. The instruction storage unit 1, controller unit 2, data access unit 3, and operation module 4 may all be implemented in hardware circuits (including, but not limited to, FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 reads in instructions through the data access unit 3 and caches the read instructions. The instruction storage unit may be implemented with various storage devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, etc.).
The controller unit 2 reads instructions from the instruction storage unit 1, decodes each instruction into microinstructions that control the behavior of the other units or modules, and then distributes the respective microinstructions to those units or modules, for example the data access unit 3 and the operation module 4.
The data access unit 3 can access the external address space, reading and writing data directly to each cache unit inside the apparatus to complete the loading and storing of data.
FIG. 2 shows an example block diagram of the structure of the operation module 4 in an apparatus for performing an artificial neural network batch normalization operation according to an embodiment of the present invention. As shown in FIG. 2, the operation module 4 comprises an operation unit 41, a data dependency determination unit 42, a neuron cache unit 43, and an intermediate value cache unit 44.
The operation unit 41 receives the microinstructions issued by the controller unit 2 and performs arithmetic-logic operations.
The data dependency determination unit 42 is responsible for the read and write operations on the neuron cache unit during computation. Before performing a read or write, the data dependency determination unit 42 first ensures that no read-write consistency conflict exists among the data used by different instructions. For example, all microinstructions sent to the data dependency unit 42 are stored in an instruction queue inside the unit; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction earlier in the queue, the read instruction may execute only after the write instruction on which it depends has completed.
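This conflict rule can be modeled informally as follows (the queue representation, the address-range fields, and the function name are assumptions of this sketch, not the circuit itself):

    from dataclasses import dataclass

    @dataclass
    class MicroOp:
        is_write: bool
        start: int  # first address of the data range touched
        end: int    # one past the last address touched

    def must_wait(queue, read_op):
        # A read may issue only after every earlier write in the queue
        # whose data range overlaps the read's data range has completed.
        return any(op.is_write
                   and op.start < read_op.end
                   and read_op.start < op.end
                   for op in queue)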
The neuron cache unit 43 caches the input neuron vector data and the output neuron value data of the operation module 4.
The intermediate value cache unit 44 caches the intermediate value data required by the operation module 4 during computation, for example the partial sums and partial sums of squares produced during the operation. For each operation module 4, the intermediate value cache unit 44 stores the intermediate value data of the batch normalization process. Consider, for example, the forward batch normalization operation during use of the artificial neural network. Let x be each input neuron datum and y the output neuron datum. The learning parameters alpha and beta are continually updated during backward training and are used in the formula that later computes the output neuron data y. The small constant eps represents a very small quantity, typically on the order of 10^-5, and may also be set to 0 in practice. The mean E[x] denotes the mean of the input neuron data x taken with the batch size as the total count, and var[x] denotes the corresponding variance of the input neuron data x, likewise with the batch size as the total count. In artificial neural network algorithms, the input neuron data usually have four dimensions: the input batch (also called number) size, the number of input channels (channel), the input height, and the input width; these four dimensions determine the total number of input data x, and E[x] and var[x] are the mean and variance of the data over the other three dimensions, computed with the batch as the total count. The operation unit 41 can complete the computation y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps)+beta in parallel, where sqrt denotes the square-root operation; the constant data of this process can be stored in the intermediate value cache unit, and the obtained result is returned to the data access unit to produce the output neurons. Moreover, since the apparatus stores data along the three dimensions channel, height, and width, the apparatus can read the input neuron data x in sequence and then perform the subsequent summation, mean, and variance operations.
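To make the dimension convention above concrete, a small sketch (the array shapes are purely illustrative) computes E[x] and var[x] with the batch as the total count, so that the statistics retain the shape of the other three dimensions:

    import numpy as np

    x = np.random.randn(32, 16, 8, 8)  # (batch, channel, height, width)
    # Reducing over axis 0 only: the batch is the total count, and E[x]
    # and var[x] keep the (channel, height, width) shape.
    E_x = x.mean(axis=0)    # shape (16, 8, 8)
    var_x = x.var(axis=0)   # shape (16, 8, 8)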
For the forward pass of batch normalization, the mean and variance used during the batch normalization operation may be the already-computed mean and variance E(x) and var(x), stored and used as constants, or they may be computed from the input data within the forward pass itself. In the latter case the operation unit computes the mean and variance data each time: in every training iteration, the input neurons pass through the operation unit to compute the mean and variance, and this partial data is placed in the intermediate value cache unit 44 for the subsequent f(x) computation of that iteration.
The present invention also provides an instruction set for performing artificial neural network batch normalization operations on the aforementioned apparatus. The instruction set comprises the CONFIG instruction, the COMPUTE instruction, the IO instruction, the NOP instruction, the JUMP instruction, and the MOVE instruction (a toy software model follows this list), wherein:
the CONFIG instruction configures, before the batch normalization computation begins, the various constants required by the current layer's computation;
the COMPUTE instruction completes the arithmetic-logic computation of the batch normalization process;
the IO instruction reads the input data required for computation from the external address space and stores the data back to the external space after the computation completes;
the NOP instruction clears the microinstructions in all internal microinstruction storage queues of the apparatus, guaranteeing that all instructions preceding the NOP have finished executing; the NOP instruction itself contains no operation;
the JUMP instruction controls the jump of the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow;
the MOVE instruction moves data at one address in the apparatus's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation-unit resources during execution.
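As the promised toy software model of this six-instruction set (the opcode names come from the text; the enum representation is an assumption of this sketch):

    from enum import Enum, auto

    class Opcode(Enum):
        CONFIG = auto()   # constants of the current layer, set before computing
        COMPUTE = auto()  # arithmetic-logic step of the batch normalization
        IO = auto()       # load from / store to the external address space
        NOP = auto()      # drain all microinstruction queues; no operation
        JUMP = auto()     # redirect the next-instruction address
        MOVE = auto()     # copy between internal addresses, bypassing the operation unit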
FIG. 3 shows an example block diagram of the forward and backward artificial neural network batch normalization operations according to an embodiment of the present invention. For the formula out = (in - middle)/middle, in is the input neuron data and out is the output neuron data; middle is an intermediate value of the operation process, being an intermediate result, such as the mean or the variance, on which the normalization must be performed. The partial intermediate values [middle1, ..., middleN] of the normalization process are computed in parallel by the operation module 4 and stored in the intermediate value cache unit 44. The operation module 4 then uses the intermediate value middle to compute, in parallel for each input neuron datum in, the output neuron datum out, yielding the final output vector.
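The two-phase pattern of FIG. 3 can be sketched in software as follows (the block partitioning stands in for the parallel hardware lanes; the function and variable names are assumptions of this sketch):

    import numpy as np

    def normalize_two_phase(in_data, n_blocks=4):
        # Phase 1: partial intermediate values [middle1, ..., middleN],
        # one per block, later combined into the single value middle.
        blocks = np.array_split(in_data, n_blocks)
        partials = [(b.sum(), b.size) for b in blocks]
        total = sum(p[0] for p in partials)
        count = sum(p[1] for p in partials)
        middle = total / count
        # Phase 2: elementwise out = (in - middle)/middle.
        return (in_data - middle) / middle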
FIG. 4 shows a flowchart of the batch normalization forward operation during training according to one embodiment. The flowchart describes the process of implementing the forward batch normalization operation of FIG. 3 using the apparatus and instruction set of the present invention.
In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.
In step S2, the operation begins: the controller unit 2 reads this IO instruction from the first address of the instruction storage unit 1 and, according to the decoded microinstructions, the data access unit 3 reads all the corresponding batch normalization forward-operation instructions from the external address space and caches them in the instruction storage unit 1.
In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit and, according to the decoded microinstructions, the data access unit 3 reads all the data required by the operation module 4 (for example, the input neuron vectors, the batch size, the learning parameters alpha and beta, the small value eps, the mean, and the variance) from the external address space into the neuron cache unit 43 of the operation module 4.
In step S4, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit and, according to the decoded microinstructions, the apparatus configures the batch normalization operation, for example whether this forward pass uses the already-computed mean and variance or computes them from the input.
In step S5, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit and, according to the decoded microinstructions, the operation module 4 reads the input neuron vector from the neuron cache unit, computes the mean and variance of the input neurons, and stores them in the intermediate value cache unit.
In step S6, according to the microinstructions decoded from the COMPUTE instruction, the operation module 4 subtracts the mean from the data in the input neuron cache unit and the intermediate value cache unit and divides by the square root of the sum of the variance and the small quantity eps, storing the result back in the intermediate value cache unit.
In step S7, according to the microinstructions decoded from the COMPUTE instruction, the operation module 4 reads the learning parameter alpha from the neuron cache unit 43, multiplies it by the intermediate value, adds the learning parameter beta, and returns the result to the neuron cache.
In step S8, the controller unit then reads the next IO instruction from the instruction storage unit and, according to the decoded microinstructions, the data access unit 3 stores the output neuron vector in the neuron cache unit 43 to the designated address in the external address space, whereupon the operation ends.
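Using the toy Opcode model sketched earlier, the instruction stream of steps S1 to S8 could be laid out as follows (the operand strings are purely hypothetical placeholders, not an encoding defined by the invention):

    program = [
        (Opcode.IO, "load_forward_instructions"),   # S1/S2
        (Opcode.IO, "load_inputs_and_parameters"),  # S3
        (Opcode.CONFIG, "dynamic_statistics"),      # S4
        (Opcode.COMPUTE, "mean_and_variance"),      # S5
        (Opcode.COMPUTE, "normalize"),              # S6
        (Opcode.COMPUTE, "scale_and_shift"),        # S7
        (Opcode.IO, "store_output_neurons"),        # S8
    ]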
The forward pass of the batch normalization operation during use differs from the forward pass during training in that step S4 configures the use of the constant mean and variance, so they need not be computed dynamically each time; that is, step S5 is omitted. Everything else is the same as in FIG. 4.
The backward pass of the batch normalization operation is similar to the forward pass described above; the difference lies in the data being operated on. Suppose the gradient arriving at one pixel is dl/dY, the gradient propagated backward is dl/dx, and the output of the forward pass is Y, with the remaining parameters having the same meaning as in the forward pass. Then the gradient propagated backward through batch normalization is dl/dx = (alpha/sqrt(var(x)+eps))*(dl/dY-mean(dl/dY)-mean(dl/dY*Y)*Y), where mean is the averaging operation. The gradient of the learning parameter alpha is dl/dalpha = (∑dl/dY)*Y, and the gradient of the learning parameter beta is dl/dbeta = ∑dl/dY; the values of the learning parameters are updated by these two gradients. The backward pass of batch normalization normalizes the gradient data, for example taking the mean and the variance, through the operation unit, after which the operation unit completes the remaining operations of the formulas in parallel.
By adopting the apparatus and instruction set for performing batch normalization operations, the problems of insufficient CPU and GPU computational performance and large front-end decoding overhead are solved, and support for the forward and backward batch normalization operations is effectively improved.
By adopting a dedicated on-chip cache for the batch normalization operation, the reusability of input neurons and intermediate data is fully exploited, repeated reads of these data from memory are avoided, memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a bottleneck for multi-layer artificial neural network forward-computation performance.
By adopting a dedicated operation unit for the batch normalization operation, the relationship between parallel and serial execution is better balanced, avoiding the weaknesses that a CPU architecture computes only serially and is slow at large data sizes, while a GPU architecture computes only in parallel and handles normalization poorly. In the present invention, the data storage unit and the operation unit cooperate to balance the serial and parallel parts of the normalization computation.
The specific embodiments described above further elaborate the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (7)

  1. An apparatus for performing a batch normalization operation, comprising an instruction storage unit, a controller unit, a data access unit, and an operation module, wherein
    the instruction storage unit reads in instructions through the data access unit and caches the read instructions;
    the controller unit reads instructions from the instruction storage unit, decodes them into microinstructions that control the behavior of other units or modules, and then distributes the respective microinstructions to those units or modules;
    the data access unit is used to access the external address space and complete the loading and storing of data; and
    the operation module is used for the forward pass or the backward pass of the batch normalization operation.
  2. The apparatus for performing a batch normalization operation according to claim 1, wherein the operation module comprises an operation unit, a data dependency determination unit, a neuron cache unit, and an intermediate value cache unit, wherein
    the operation unit is used to receive the microinstructions issued by the controller unit and perform arithmetic-logic operations;
    the data dependency determination unit is used to perform read and write operations on the neuron cache unit while ensuring that no read-write consistency conflict exists among the data used by different instructions;
    the neuron cache unit is used to cache input neuron data and output neuron data; and
    the intermediate value cache unit is used to cache the intermediate value data required by the operation module during computation.
  3. The apparatus for performing a batch normalization operation according to claim 2, wherein
    the operation unit performs the following computation in the forward pass of the batch normalization operation:
    y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps)+beta, where x is the input neuron data and y is the output neuron data; alpha and beta are learning parameters, continually updated during backward training, that are used in the formula for later computing the output neuron data y; eps is a small constant; the mean E[x] denotes the mean of the input neuron data x taken with the batch size as the total count; and var[x] denotes the corresponding variance of the input neuron data x, likewise taken with the batch size as the total count.
  4. The apparatus for performing a batch normalization operation according to claim 2, wherein
    the operation unit performs the following computation in the backward pass of the batch normalization operation:
    supposing the gradient arriving at one pixel is dl/dY, the gradient propagated backward is dl/dx, and the output of the forward pass is Y, with the remaining parameters having the same meaning as in the forward pass, the gradient propagated backward through batch normalization is dl/dx = (alpha/sqrt(var(x)+eps))*(dl/dY-mean(dl/dY)-mean(dl/dY*Y)*Y), where mean is the averaging operation; the gradient of the learning parameter alpha is dl/dalpha = (∑dl/dY)*Y, and the gradient of the learning parameter beta is dl/dbeta = ∑dl/dY; the values of the learning parameters are updated by these two gradients.
  5. A method for performing a batch normalization operation, comprising the following steps:
    reading in instructions with an instruction storage unit and caching the read instructions;
    decoding the instructions into microinstructions that control an operation module; and
    performing the forward pass or the backward pass of the batch normalization operation with the operation module.
  6. The method for performing a batch normalization operation according to claim 5, wherein the operation module caches input neuron data and output neuron data with a neuron cache unit, and caches the intermediate value data required during computation with an intermediate value cache unit.
  7. The method for performing a batch normalization operation according to claim 5 or 6, wherein
    the operation unit performs the following computation in the forward pass of the batch normalization operation:
    y = f(x) = alpha*(x-E[x])/sqrt(var(x)+eps)+beta, where x is the input neuron data and y is the output neuron data; alpha and beta are learning parameters, continually updated during backward training, that are used in the formula for later computing the output neuron data y; eps is a small constant; the mean E[x] denotes the mean of the input neuron data x taken with the batch size as the total count; and var[x] denotes the corresponding variance of the input neuron data x, likewise taken with the batch size as the total count.
PCT/CN2016/080695 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation WO2017185335A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080695 WO2017185335A1 (zh) 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080695 WO2017185335A1 (zh) 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation

Publications (1)

Publication Number Publication Date
WO2017185335A1 (zh) 2017-11-02

Family

ID=60160652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080695 WO2017185335A1 (zh) 2016-04-29 2016-04-29 Apparatus and method for performing a batch normalization operation

Country Status (1)

Country Link
WO (1) WO2017185335A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 Execution method of convolution extension instructions and related products
CN110097181A (zh) * 2018-01-30 2019-08-06 上海寒武纪信息科技有限公司 Apparatus and method for performing the forward operation of an artificial neural network
CN111222632A (zh) * 2018-11-27 2020-06-02 中科寒武纪科技股份有限公司 Computing apparatus, computing method, and related products
CN112789627A (zh) * 2018-09-30 2021-05-11 华为技术有限公司 Neural network processor, data processing method, and related device


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019656A (zh) * 2012-12-04 2013-04-03 中国科学院半导体研究所 Dynamically reconfigurable multi-level parallel single-instruction multiple-data array processing system
CN105528191A (zh) * 2015-12-01 2016-04-27 中国科学院计算技术研究所 Data accumulation apparatus and method, and digital signal processing apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IOFFE, S ET AL.: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", COMPUTER SCIENCE, 2 March 2015 (2015-03-02), XP055266268 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 Execution method of convolution extension instructions and related products
CN109754062B (zh) * 2017-11-07 2024-05-14 上海寒武纪信息科技有限公司 Execution method of convolution extension instructions and related products
CN110097181A (zh) * 2018-01-30 2019-08-06 上海寒武纪信息科技有限公司 Apparatus and method for performing the forward operation of an artificial neural network
CN112789627A (zh) * 2018-09-30 2021-05-11 华为技术有限公司 Neural network processor, data processing method, and related device
CN112789627B (zh) * 2018-09-30 2023-08-22 华为技术有限公司 Neural network processor, data processing method, and related device
CN111222632A (zh) * 2018-11-27 2020-06-02 中科寒武纪科技股份有限公司 Computing apparatus, computing method, and related products

Similar Documents

Publication Publication Date Title
KR102470264B1 (ko) Apparatus and method for executing backward training of a fully connected layer neural network
WO2017185391A1 (zh) Apparatus and method for performing convolutional neural network training
CN109284825B (zh) Apparatus and method for performing LSTM operations
KR102486030B1 (ko) Apparatus and method for executing the forward operation of a fully connected layer neural network
CN107316078B (zh) Apparatus and method for performing artificial neural network self-learning operations
KR102175044B1 (ko) Apparatus and method for executing backward training of an artificial neural network
WO2017185347A1 (zh) Apparatus and method for performing recurrent neural network and LSTM operations
WO2017185336A1 (zh) Apparatus and method for performing a pooling operation
WO2017185386A1 (zh) Apparatus and method for performing the forward operation of a convolutional neural network
CN106991476B (zh) Apparatus and method for performing the forward operation of an artificial neural network
WO2017185396A1 (zh) Apparatus and method for performing matrix addition/subtraction operations
CN107886166B (zh) Apparatus and method for performing artificial neural network operations
WO2017185393A1 (zh) Apparatus and method for performing a vector inner-product operation
WO2017185335A1 (zh) Apparatus and method for performing a batch normalization operation
CN107315568B (zh) Apparatus for performing vector logic operations
WO2017185392A1 (zh) Apparatus and method for performing four basic vector arithmetic operations
WO2017185248A1 (zh) Apparatus and method for performing artificial neural network self-learning operations
CN111651206A (zh) Apparatus and method for performing a vector outer-product operation
WO2018058452A1 (zh) Apparatus and method for performing artificial neural network operations
CN107341546B (zh) Apparatus and method for performing a batch normalization operation
WO2017185419A1 (zh) Apparatus and method for performing vector maximum-minimum operations
CN111860772B (zh) Apparatus and method for performing an artificial neural network pooling operation
WO2018024094A1 (zh) Operation apparatus and operating method thereof

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899847

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899847

Country of ref document: EP

Kind code of ref document: A1