This application claims priority to the previously filed Chinese patent application 201610663201.9, "A Method for Optimizing an Artificial Neural Network", and Chinese patent application 201610663563.8, "A Deep Processing Unit for Implementing an ANN".
Embodiments
Part of the content of this application was previously published in the inventor Yao Song's academic paper "Going Deeper With Embedded FPGA Platform for Convolutional Neural Network" (February 2016). This application makes further improvements on that basis.
In this application, the improvements of the present invention are mainly illustrated for a CNN, taking image processing as an example. Deep neural networks (DNNs) and recurrent neural networks (RNNs) are similar to CNNs in this respect.
Basic Concepts of CNN
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help understand the CNN-based image classification algorithms analyzed in this application, we first describe the basics of CNNs, then introduce the Image-Net dataset and existing CNN models.
As shown in Fig. 1(a), a typical CNN consists of a series of layers that run in an ordered sequence.
The parameters of a CNN model are referred to as "weights". The first layer of a CNN reads the input image and outputs a series of feature maps. Each following layer reads the feature maps produced by the preceding layer and outputs new feature maps. Finally, a classifier outputs the probability of each category that the input image may belong to. The CONV layer (convolutional layer) and the FC layer (fully connected layer) are the two basic layer types in a CNN; a CONV layer is usually followed by a pooling layer.
For example, for a CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For a CONV layer, n_in and n_out denote the numbers of input and output feature maps, respectively.
For an FC layer, n_in and n_out denote the lengths of the input and output feature vectors, respectively.
Definition of the CONV layer (Convolutional layer): a CONV layer takes a series of feature maps as input and obtains its output feature maps by convolution with convolution kernels.
A nonlinear layer, i.e., a nonlinear activation function, usually attached to a CONV layer, is applied to each element of the output feature maps.
A CONV layer can be expressed by Expression 1:

f_i^out = Σ_{j=1}^{n_in} f_j^in ⊗ g_{i,j} + b_i,  1 ≤ i ≤ n_out  (1)

where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
Definition of the FC layer (Fully-Connected layer): an FC layer applies a linear transformation to its input feature vector:

f_out = W f_in + b  (2)

where W is an n_out × n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Consequently, in Expression 2 the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually attached to a CONV layer, it outputs the maximum or average value of each subarea (subarea) in each feature map. Max pooling can be expressed by Expression 3:

f_i^out(x, y) = max_{0≤k,l<p} f_i^in(x·p + k, y·p + l)  (3)

where p is the size of the pooling kernel. This nonlinear "down-sampling" not only reduces the feature map size and the computation for the next layer, but also provides a form of translation invariance.
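For illustration only, the following is a minimal NumPy sketch of Expressions 1 to 3. The function names, array shapes, and the use of valid (un-padded) convolution are assumptions of the sketch, not part of the claimed hardware:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of one feature map x with one kernel k."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r+kh, c:c+kw] * k)
    return out

def conv_layer(f_in, g, b):
    """Expression 1: f_out[i] = sum_j conv(f_in[j], g[i][j]) + b[i]."""
    n_out, n_in = len(g), len(f_in)
    return [sum(conv2d(f_in[j], g[i][j]) for j in range(n_in)) + b[i]
            for i in range(n_out)]

def fc_layer(f_in, W, b):
    """Expression 2: f_out = W @ f_in + b."""
    return W @ f_in + b

def max_pool(f, p):
    """Expression 3: maximum over each p x p subarea of the feature map."""
    h, w = f.shape[0] // p, f.shape[1] // p
    return f[:h*p, :w*p].reshape(h, p, w, p).max(axis=(1, 3))
```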
A CNN can be used for image classification during forward inference. But before a CNN is used for any task, it should first be trained on a dataset. It has recently been shown that a CNN model trained on a large dataset for a given task can be used for other tasks with only a minor adjustment of the network weights while achieving high accuracy; this minor adjustment is called "fine-tuning". Training a CNN is mostly done on large servers. For the embedded FPGA platform, we focus on accelerating the inference process of the CNN.
The Image-Net Dataset
The Image-Net dataset is regarded as the standard benchmark for evaluating the performance of image classification and object detection algorithms. So far, the Image-Net dataset has collected more than 14,000,000 images in more than 21,000 categories. Image-Net releases a subset with 1,000 categories and 1,200,000 images for the ILSVRC classification task, which has greatly promoted the development of computer vision techniques. In this application, all the CNN models are trained with the ILSVRC 2014 training set and evaluated with the ILSVRC 2014 validation set.
Existing CNN Models
In ILSVRC 2012, the SuperVision team won first place in the image classification task using AlexNet, with a top-5 accuracy of 84.7%. CaffeNet is a replica of AlexNet with minor changes. Both AlexNet and CaffeNet consist of 5 CONV layers and 3 FC layers.
In ILSVRC 2013, the Zeiler-and-Fergus (ZF) network won first place in the image classification task, with a top-5 accuracy of 88.8%. The ZF network also has 5 CONV layers and 3 FC layers.
Fig. 1(b) illustrates a typical CNN from the perspective of the input-output data flow.
The CNN shown in Fig. 1(b) comprises 5 CONV groups CONV1, CONV2, CONV3, CONV4 and CONV5, 3 FC layers FC1, FC2 and FC3, and a softmax decision function, where each CONV group includes 3 convolutional layers.
Fig. 2 is a schematic diagram of the software optimization and hardware implementation of an artificial neural network.
As shown in Fig. 2, in order to accelerate the CNN, a complete technical solution is proposed from the perspectives of the optimization flow and the hardware architecture.
The lower part of Fig. 2 shows the artificial neural network model. The middle part of Fig. 2 illustrates how the CNN model is compressed to reduce the memory footprint and the amount of operations while minimizing the accuracy loss.
The upper part of Fig. 2 shows the dedicated hardware provided for the compressed CNN.
As shown in the upper part of Fig. 2, the hardware architecture includes two modules: PS and PL.
The general-purpose processing system (processing system, PS) includes: a CPU and an external memory (EXTERNAL MEMORY).
The programmable logic module (Programmable Logic, PL) includes: a DMA, a computing complex, input/output buffers, a controller, etc.
As shown in Fig. 2 PL is provided with:Complicated calculations (Computing Complex), input block, output buffer,
Controller and direct memory access (DMA).
Calculating core includes multiple processing units (PEs), and it is responsible for CONV layers, tether layer and FC layers in artificial neural network
Most calculating task.
Chip buffering area includes input block and output buffer, prepares the data that PEs is used and stores result.
Controller, instruction on external memory storage is obtained, to instruction decoding (if desired), and to all moulds in PL
Block is allocated, except DMA.
DMAs is used to transmit data and instruction of the external memory storage (such as DDR) between PL.
The PS includes a general-purpose processor (CPU) 8110 and an external memory 8120.
The external memory stores the model parameters, data and instructions of all the artificial neural networks.
The PS is a hard core: its hardware structure is fixed, and it is scheduled by software.
The PL is programmable hardware logic whose hardware structure is variable. For example, the programmable logic module (PL) can be an FPGA.
It should be noted that, according to an embodiment of the present invention, although the DMA is on the PL side, it is directly controlled by the CPU and transfers data from the EXTERNAL MEMORY into the PL.
Therefore, the hardware architecture shown in Fig. 2 is only a functional division, and the boundary between the above PL and PS is not absolute. For example, in an actual implementation, the PL and the CPU can be realized on one SoC, such as a Xilinx Zynq chip, while the external memory can be realized by another memory chip connected with the CPU in the SoC.
Fig. 3 shows the optimization flow performed before the artificial neural network is deployed to the hardware chip.
The input of Fig. 3 is the original artificial neural network.
Step 405: Compression
The compression step can include pruning the CNN model. Network pruning has been proven to be an effective method to reduce the complexity and overfitting of a network; see, for example, the article by B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon".
The priority application 201610663201.9, "A Method for Optimizing an Artificial Neural Network", which is incorporated in this application by reference, proposes a method of compressing a CNN network by pruning.
First, an initialization step: the weights of the convolutional layers and FC layers are initialized to random values, whereby a fully connected ANN is generated whose connections have weight parameters.
Second, a training step: the ANN is trained, and its weights are adjusted according to the accuracy of the ANN until the accuracy reaches a predetermined standard.
For example, the training step adjusts the weights of the ANN based on a stochastic gradient descent algorithm, i.e., the weight values are adjusted randomly and the adjustment is selected based on the resulting change in the ANN's accuracy. For an introduction to the stochastic gradient algorithm, refer to the above-mentioned "Learning both weights and connections for efficient neural networks".
The accuracy can be quantified as the difference between the prediction results of the ANN and the correct results on a training dataset.
Third, a pruning step: based on a predetermined condition, the unimportant connections in the ANN are found and pruned. Specifically, the weight parameters of the pruned connections are no longer saved.
The predetermined condition includes any one of the following: the weight parameter of a connection is 0; or the weight parameter of a connection is smaller than a predetermined value.
Fourth, a fine-tuning step: the pruned connections are re-set as connections whose weight parameter value is zero, i.e., the pruned connections are recovered and assigned the weight value 0.
Finally, it is determined whether the accuracy of the ANN reaches the predetermined standard. If not, the second, third and fourth steps are repeated.
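The four steps above can be summarized in the following sketch. The train and evaluate callbacks and the magnitude threshold are hypothetical placeholders standing in for the SGD training and the predetermined condition:

```python
import numpy as np

def prune_ann(weights, train, evaluate, threshold, target_accuracy):
    """Iterative pruning sketch: weights is a list of NumPy arrays,
    train() adjusts them in place (e.g. by SGD), evaluate() returns accuracy."""
    while True:
        train(weights)                        # step 2: train the ANN
        # step 3: prune unimportant connections (weight 0 or |w| < threshold)
        masks = [np.abs(w) >= threshold for w in weights]
        # step 4: recover pruned connections with weight value 0
        for w, m in zip(weights, masks):
            w[~m] = 0.0
        if evaluate(weights) >= target_accuracy:   # final accuracy check
            return weights, masks             # otherwise repeat steps 2-4
```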
Step 410: Data Fixed-Point Quantization
For a fixed-point number, its value is expressed as follows:

n = Σ_{i=0}^{bw-1} B_i · 2^{-fl} · 2^i  (4)

where bw is the bit width of the number, B_i is the i-th binary digit, and fl is the fractional length (fractional length), which can be negative.
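As a sketch of Expression 4, a bw-bit integer n with fractional length fl represents the real value n · 2^(-fl); the conversion helpers below are illustrative only (the saturation behavior on overflow is an assumption):

```python
def float_to_fixed(x, bw, fl):
    """Quantize a real number x to a bw-bit fixed-point integer with
    fractional length fl (fl may be negative), saturating on overflow."""
    n = int(round(x * 2.0 ** fl))
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return max(lo, min(hi, n))

def fixed_to_float(n, fl):
    """Recover the real value represented by the fixed-point integer n."""
    return n * 2.0 ** (-fl)
```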
In order to achieve the highest accuracy while converting floating-point numbers into fixed-point numbers, the inventors propose a dynamic-precision data quantization strategy and an automatic workflow.
Unlike previous static-precision quantization strategies, in the data quantization flow proposed by us, fl changes dynamically across different layers and feature map sets while staying static within one layer, so as to minimize the truncation error of each layer.
The proposed quantization flow mainly consists of two phases.
(1) The weight quantization phase:
The goal of the weight quantization phase is to find the optimal fl for the weights of one layer, as in Expression 5:

fl = argmin_{fl} Σ |W_float − W(bw, fl)|  (5)

where W is a weight and W(bw, fl) represents the fixed-point format of W under the given bw and fl.
In one embodiment, the dynamic range of the weights of each layer is analyzed first, e.g., estimated by sampling. Then, fl is initialized so as to avoid data overflow. Furthermore, we search for the optimal fl in the neighborhood of the initial fl.
According to another embodiment, in the weight fixed-point quantization step, the optimal fl is found in another way, as in Expression 6, where i represents a certain bit among the bw bits and k_i is the weight of that bit. In the manner of Expression 6, different weights are given to different bits, and the optimal fl is then calculated.
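A minimal sketch of the weight quantization phase under Expression 5 follows. The overflow-free initialization and the search radius around it are assumptions consistent with the embodiment described above:

```python
import numpy as np

def quant(W, bw, fl):
    """Round W to bw-bit fixed point with fractional length fl, then de-quantize."""
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return np.clip(np.round(W * 2.0 ** fl), lo, hi) * 2.0 ** (-fl)

def best_fl_for_weights(W, bw, radius=2):
    """Expression 5: fl = argmin sum |W_float - W(bw, fl)|, searched in the
    neighborhood of an initial fl chosen to avoid overflow."""
    fl0 = bw - 1 - int(np.ceil(np.log2(np.max(np.abs(W)) + 1e-12)))
    candidates = range(fl0 - radius, fl0 + radius + 1)
    return min(candidates,
               key=lambda fl: float(np.sum(np.abs(W - quant(W, bw, fl)))))
```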
(2) The data quantization phase.
The data quantization phase aims to find the optimal fl for the feature map sets between two layers of the CNN model.
In this phase, the CNN is run on a training dataset (benchmark). The training dataset can be dataset 0.
According to one embodiment of the present invention, the weight quantization of all the CONV layers and FC layers of the CNN is completed first, and data quantization is carried out afterwards. At this point, the training dataset is input to the CNN whose weights have been quantized and is processed layer by layer through the CONV layers and FC layers, yielding the input feature maps of each layer.
For the input feature maps of each layer, a greedy algorithm is used to compare, layer by layer, the data of the fixed-point CNN model with the data of the floating-point CNN model, so as to reduce the accuracy loss. The optimization target of each layer is shown in Expression 7:

fl = argmin_{fl} Σ |x+_float − x+(bw, fl)|  (7)

In Expression 7, A represents the computation of one layer (e.g., a certain CONV layer or FC layer), x represents the input, and in x+ = A·x, x+ represents the output of this layer. It is worth noting that, for a CONV layer or an FC layer, the direct result x+ has a longer bit width than the given standard, so truncation is needed when the optimal fl is selected. Finally, the entire data quantization configuration is generated.
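The layer-by-layer greedy comparison can be sketched as below; the layers argument (a list of callables computing A·x) and the candidate range of fl values are assumptions of the sketch:

```python
import numpy as np

def quant(x, bw, fl):
    """Truncate x to bw-bit fixed point with fractional length fl."""
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return np.clip(np.round(x * 2.0 ** fl), lo, hi) * 2.0 ** (-fl)

def quantize_feature_maps(layers, x, bw, fl_range=range(-8, 16)):
    """Greedy pass of Expression 7: for each layer A, pick the fl minimizing
    sum |x+_float - x+(bw, fl)|, truncate, and feed the result onward."""
    fls = []
    for A in layers:                  # A is the computation of one CONV/FC layer
        x_plus = A(x)                 # direct result with longer bit width
        fl = min(fl_range,
                 key=lambda f: float(np.sum(np.abs(x_plus - quant(x_plus, bw, f)))))
        fls.append(fl)
        x = quant(x_plus, bw, fl)     # truncated output becomes the next input
    return fls
```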
According to another embodiment, in the data fixed-point quantization step, the optimal fl is found in another way, as in Expression 8, where i represents a certain bit among the bw bits and k_i is the weight of that bit. Similar to the manner of Expression 6, different weights are given to different bits, and the optimal fl is then calculated.
The above data quantization step yields the optimal fl.
In addition, according to another embodiment, the weight quantization and the data quantization are not performed one after the other, but alternately.
In terms of the order of the data processing flow, the convolutional layers (CONV layers) and fully connected layers (FC layers) of the ANN are in a serial relationship, and each feature map set is obtained as the training dataset is processed layer by layer through the CONV layers and FC layers of the ANN.
Specifically, the weight quantization step and the data quantization step are performed alternately, following this serial relationship: after the weight quantization step completes the fixed-point quantization of a certain layer, the data quantization step is performed on the feature map set output by that layer.
First Embodiment
In the priority applications, the inventors proposed a co-design of a general-purpose processor and a special accelerator, but did not discuss how to efficiently exploit the flexibility of the general-purpose processor and the computing power of the special accelerator, e.g., how to transfer instructions, transfer data, perform computation, etc. In this application, the inventors propose further optimized solutions.
Fig. 4 shows a further improvement over the hardware structure of Fig. 2.
In Fig. 4, the CPU controls the DMA, and the DMA is responsible for moving the data. Specifically, the CPU controls the DMA to move the instructions in the external memory (DDR) into a FIFO; the special accelerator then fetches the instructions from the FIFO and executes them.
Likewise, under the control of the CPU, the DMA moves the data required by the special accelerator from the DDR into a FIFO, and the accelerator takes the data from the FIFO when computing. The CPU also maintains the moving of the accelerator's output data.
At run time, the CPU needs to constantly monitor the state of the DMA: when the input FIFO is not full, data needs to be moved from the DDR into the input FIFO; when the output FIFO is not empty, data needs to be moved from the output FIFO back into the DDR.
In addition, the special accelerator of Fig. 4 includes: a controller, a computing complex (computation complex) and buffers (buffer).
The computing complex includes: convolvers, an adder tree, a nonlinear module, etc.
The size of a convolution kernel generally has only a few options, such as 3 × 3, 5 × 5 and 7 × 7. For example, the two-dimensional convolver designed for the convolution operation uses a 3 × 3 window.
The adder tree (AD) sums all the results of the convolvers. The nonlinear (NL) module applies a nonlinear activation function to the input data stream; for example, the function can be the ReLU function. In addition, a max-pooling module (not shown) is used for the pooling operation, e.g., applying a specific 2 × 2 window to the input data stream and outputting the maximum value therein.
The buffers include: an input data buffer, an output data buffer, and a bias shift (bias shift) module.
The bias shift module supports the conversion between dynamic quantization ranges, e.g., shifting for the weights or, as another example, shifting for the data.
The input data buffer can further include an input data buffer and a weight buffer. The input data buffer can be a line buffer (line buffer), which holds the data needed by the computation and releases the data continuously, so as to realize the reuse of the data.
Fig. 5 shows the FIFO interaction between the CPU and the special accelerator.
There are 3 classes of FIFOs in the architecture diagram shown in Fig. 5; correspondingly, the CPU's control over the DMA is also of three kinds.
In the first embodiment, the CPU and the special accelerator communicate entirely through FIFOs, with three classes of FIFO buffers between them: instruction, input data, and output data. Specifically, under the control of the CPU, the DMA is responsible for the transfer of input data, output data and instructions between the external memory and the special accelerator, with an input data FIFO, an output data FIFO and an instruction FIFO provided between the DMA and the special accelerator.
For the special accelerator, this design is simple: it only needs to care about computing, not about the data, since the data operations are completely controlled by the CPU.
However, in some application scenarios the solution shown in Fig. 5 also has shortcomings.
First, performing the scheduling consumes CPU resources. For example, the CPU has to constantly monitor the state of each FIFO so as to receive and send data at any moment. Monitoring the states and handling the data according to the different states consumes a large amount of CPU time. In some applications, the cost of the CPU's FIFO monitoring and data handling can be so large that the CPU is almost fully occupied and has no time to process other tasks (reading pictures, preprocessing pictures, etc.).
Second, multiple FIFOs need to be provided in the special accelerator, which also occupies PL resources.
Second Embodiment
The characteristics of the second embodiment are as follows. First, the special accelerator shares the external memory with the CPU, and both can read the external memory. Second, the CPU controls only the instruction input of the special accelerator. In this way, the CPU and the special accelerator operate cooperatively, with the CPU undertaking the tasks that the special accelerator cannot complete.
As shown in Fig. 6, in the second embodiment, the special accelerator (PL) interacts directly with the external memory (DDR). Accordingly, the input FIFO and the output FIFO between the DMA and the special accelerator (shown in Fig. 5) are eliminated, and only 1 FIFO is retained to transmit instructions between the DMA and the special accelerator, which saves resources.
The CPU no longer needs to carry out complicated scheduling of the input and output data, as the special accelerator accesses the data directly from the external memory (DDR). While the artificial neural network is running, the CPU can carry out other processing, such as reading the image data to be processed from a camera.
Therefore, the second embodiment solves the problem of the CPU being overloaded: the CPU is freed to handle more tasks. However, the special accelerator must itself perform and control the data accesses to the external memory (DDR).
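Under the second embodiment, the CPU-side work shrinks to the sketch below (accessor names again hypothetical); the accelerator fetches and writes its own data in the shared DDR while the CPU does other work:

```python
def cpu_loop_second_embodiment(dma, accel, ddr, camera):
    """Second-embodiment sketch: the CPU only feeds instructions through the
    single remaining FIFO; the accelerator accesses the DDR by itself."""
    dma.copy(src=ddr.instructions, dst=dma.instr_fifo)   # the one FIFO kept
    accel.start()
    while accel.busy():
        ddr.write(camera.read())     # CPU is free for I/O and other tasks
```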
Improvement of the First and Second Embodiments
In the first and second embodiments, the CPU controls the accelerator through instructions.
The accelerator may "run away" during operation (i.e., the program enters an endless loop or runs meaninglessly). In the schemes above, the CPU cannot determine whether the accelerator has run away.
In an improved embodiment based on the first or the second embodiment, the inventors additionally provide a "status peripheral" on the CPU, so that the state of the finite state machine (FSM) in the special accelerator (PL) is passed directly to the CPU.
By detecting the state of the finite state machine (FSM), the CPU can learn the running situation of the accelerator. If it finds that the accelerator has run away or is stuck, the CPU can also send a signal to directly reset the accelerator.
Fig. 7 shows an example in which the "status peripheral" is added to the architecture of the first embodiment shown in Fig. 4.
Fig. 8 shows an example in which the "status peripheral" is added to the architecture of the second embodiment shown in Fig. 6.
As shown in Figs. 7 and 8, a finite state machine (FSM) is provided in the controller of the special accelerator, and the state of the finite state machine is passed directly to the CPU's status peripheral (i.e., a monitoring module), so that the CPU can monitor fault conditions of the program such as deadlock.
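The status peripheral can be sketched as a simple watchdog on the CPU side; the register read, the reset signal and the timeout value are illustrative assumptions:

```python
import time

def watchdog(status_peripheral, accel, timeout=1.0):
    """Poll the FSM state exported by the status peripheral; if the state stops
    changing for too long (run-away or stuck program), reset the accelerator."""
    last_state, last_change = status_peripheral.read(), time.monotonic()
    while True:
        state = status_peripheral.read()        # FSM state from the PL
        if state != last_state:
            last_state, last_change = state, time.monotonic()
        elif time.monotonic() - last_change > timeout:
            accel.reset()                       # CPU sends the reset signal
            last_change = time.monotonic()
        time.sleep(0.01)
```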
Comparison of the First and Second Embodiments
The two scheduling strategies of the first and second embodiments each have advantages.
In the embodiment of Fig. 4, the image data can only be transferred to the special accelerator by the CPU scheduling the DMA, so the special accelerator may sit idle for some time. However, because the CPU schedules the data movement, the special accelerator is responsible only for computing; its computing power is fully exploited, and the time to process the data can be shorter.
In the embodiment of Fig. 6, the special accelerator has the ability to access the data by itself, without the CPU scheduling the data movement, so the data processing can be carried out independently on the special accelerator.
The CPU can then be responsible only for the data input and output with the external system. The read operation means, for example, that the CPU reads picture data from a camera (not shown) and transfers it to the external memory; the output operation means that the CPU outputs the recognized result from the external memory to a screen (not shown).
With the embodiment of Fig. 6, the tasks can be pipelined, so that multiple tasks are processed faster. The corresponding cost is that the special accelerator is responsible for both computing and data movement at the same time, which is less efficient, so the processing takes a longer time.
The comparison in Fig. 9 shows the similarities and differences between the processing flows of the first and second embodiments.
An Application of the Second Embodiment: Face Recognition
According to the second embodiment, because of the shared external memory (DDR), the CPU and the special accelerator can jointly complete one computing task.
For example, in a face recognition task, the CPU reads the camera and detects the face in the input image, while the computing core of the special accelerator completes the recognition of the face with the neural network.
Using the co-design of the CPU and the special accelerator, the above neural network computing task can be quickly deployed on an embedded device.
Specifically, referring to Example 2 of Fig. 9, the reading (e.g., from a camera) and preprocessing of the picture run on the CPU, and the processing of the picture is completed on the special accelerator.
Because the above method separates the CPU's tasks from the accelerator's tasks, the CPU and the accelerator can process their tasks completely in parallel.
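The resulting pipelining can be sketched with two threads (all device handles are hypothetical): the CPU reads and preprocesses frame n+1 while the accelerator processes frame n:

```python
import queue
import threading

def pipeline(camera, accel, num_frames):
    """Example-2 sketch: CPU-side reading/preprocessing and accelerator-side
    recognition run fully in parallel on successive frames."""
    q = queue.Queue(maxsize=2)

    def cpu_side():                          # CPU task: read and preprocess
        for _ in range(num_frames):
            q.put(camera.read())
        q.put(None)                          # sentinel: no more frames

    threading.Thread(target=cpu_side, daemon=True).start()
    while (frame := q.get()) is not None:
        accel.process(frame)                 # accelerator task: recognition
```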
Table 1 illustrates a performance comparison between a CPU alone and the second embodiment (the CPU + special accelerator co-design).
Table 1
The comparison CPU is the Tegra K1 produced by NVIDIA. It can be seen that our CPU + special accelerator co-design achieves an obvious speedup on every layer, with an overall speedup of 7×.
An advantage of the present invention is that the rich functionality of the CPU (general-purpose processor) is used to make up for the limited flexibility of the special accelerator (the programmable logic module PL, e.g., an FPGA), while the high computing speed of the special accelerator is used to make up for the CPU, whose computing speed is insufficient for completing the computation in real time.
In addition, the general-purpose processor can be an ARM processor or any other CPU. The programmable logic module can be an FPGA or another programmable special-purpose processor (ASIC).
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts among the embodiments, reference may be made to one another.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method can also be implemented in other ways. The apparatus embodiments described above are only schematic. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions and operations of the apparatuses, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram can represent a module, a program segment, or a part of code, where the module, program segment, or part of code contains one or more executable instructions for realizing the specified logic function. It should also be noted that, in some alternative implementations, the functions marked in the blocks can occur in an order different from the order marked in the drawings. For example, two consecutive blocks can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and any combination of blocks in the block diagrams and/or flowcharts, can be realized by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention. It should be noted that similar labels and letters represent similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in the subsequent drawings.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of changes or replacements within the technical scope disclosed by the present invention, and these shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.