This application claims priority to the previously filed Chinese patent application 201610663201.9, "A Method for Optimizing an Artificial Neural Network", and Chinese patent application 201610663563.8, "A Deep Processing Unit for Implementing an ANN".
Embodiments
Part of the content of this application was previously published in the inventor Yao Song's academic paper "Going Deeper With Embedded FPGA Platform for Convolutional Neural Network" (February 2016). This application makes further improvements on that basis.
In this application, the improvements of the present invention are mainly illustrated for a CNN, taking image processing as an example. Deep neural networks (DNNs) and recurrent neural networks (RNNs) are similar to CNNs in this respect.
Basic Concepts of CNN
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help understand the CNN-based image classification algorithms analyzed in this application, we first describe the basics of CNNs, then introduce the Image-Net dataset and existing CNN models.
As shown in Fig. 1(a), a typical CNN consists of a series of layers that run in an ordered sequence.
The parameters of a CNN model are referred to as "weights". The first layer of a CNN reads the input image and outputs a series of feature maps. Each following layer reads the feature maps produced by the preceding layer and outputs new feature maps. Finally, a classifier outputs the probability of each category that the input image may belong to. The CONV layer (convolutional layer) and the FC layer (fully connected layer) are the two basic layer types in a CNN; a CONV layer is usually followed by a pooling layer.
For example, for a CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For a CONV layer, n_in and n_out denote the numbers of input and output feature maps, respectively.
For an FC layer, n_in and n_out denote the lengths of the input and output feature vectors, respectively.
Definition of the CONV layer (Convolutional layer): a CONV layer takes a series of feature maps as input and obtains its output feature maps by convolution with convolution kernels.
A nonlinear layer, i.e., a nonlinear activation function, usually attached to a CONV layer, is applied to each element of the output feature maps.
A CONV layer can be expressed by Expression 1:

f_i^out = Σ_{j=1}^{n_in} f_j^in ⊗ g_{i,j} + b_i,  1 ≤ i ≤ n_out  (1)

where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
Definition of the FC layer (Fully-Connected layer): an FC layer applies a linear transformation to its input feature vector:

f_out = W f_in + b  (2)

where W is an n_out × n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Consequently, in Expression 2 the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually attached to a CONV layer, it outputs the maximum or average value of each subarea (subarea) in each feature map. Max pooling can be expressed by Expression 3:

f_i^out(x, y) = max_{0≤k,l<p} f_i^in(x·p + k, y·p + l)  (3)

where p is the size of the pooling kernel. This nonlinear "down-sampling" not only reduces the feature map size and the computation for the next layer, but also provides a form of translation invariance.
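For illustration only, the following is a minimal NumPy sketch of Expressions 1 to 3. The function names, array shapes, and the use of valid (un-padded) convolution are assumptions of the sketch, not part of the claimed hardware:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of one feature map x with one kernel k."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r+kh, c:c+kw] * k)
    return out

def conv_layer(f_in, g, b):
    """Expression 1: f_out[i] = sum_j conv(f_in[j], g[i][j]) + b[i]."""
    n_out, n_in = len(g), len(f_in)
    return [sum(conv2d(f_in[j], g[i][j]) for j in range(n_in)) + b[i]
            for i in range(n_out)]

def fc_layer(f_in, W, b):
    """Expression 2: f_out = W @ f_in + b."""
    return W @ f_in + b

def max_pool(f, p):
    """Expression 3: maximum over each p x p subarea of the feature map."""
    h, w = f.shape[0] // p, f.shape[1] // p
    return f[:h*p, :w*p].reshape(h, p, w, p).max(axis=(1, 3))
```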
A CNN can be used for image classification during forward inference. But before a CNN is used for any task, it should first be trained on a dataset. It has recently been shown that a CNN model trained on a large dataset for a given task can be used for other tasks with only a minor adjustment of the network weights while achieving high accuracy; this minor adjustment is called "fine-tuning". Training a CNN is mostly done on large servers. For the embedded FPGA platform, we focus on accelerating the inference process of the CNN.
The Image-Net Dataset
The Image-Net dataset is regarded as the standard benchmark for evaluating the performance of image classification and object detection algorithms. So far, the Image-Net dataset has collected more than 14,000,000 images in more than 21,000 categories. Image-Net releases a subset with 1,000 categories and 1,200,000 images for the ILSVRC classification task, which has greatly promoted the development of computer vision techniques. In this application, all the CNN models are trained with the ILSVRC 2014 training set and evaluated with the ILSVRC 2014 validation set.
Existing CNN Models
In ILSVRC 2012, the SuperVision team won first place in the image classification task using AlexNet, with a top-5 accuracy of 84.7%. CaffeNet is a replica of AlexNet with minor changes. Both AlexNet and CaffeNet consist of 5 CONV layers and 3 FC layers.
In ILSVRC 2013, the Zeiler-and-Fergus (ZF) network won first place in the image classification task, with a top-5 accuracy of 88.8%. The ZF network also has 5 CONV layers and 3 FC layers.
Fig. 1(b) illustrates a typical CNN from the perspective of the input-output data flow.
The CNN shown in Fig. 1(b) comprises 5 CONV groups CONV1, CONV2, CONV3, CONV4 and CONV5, 3 FC layers FC1, FC2 and FC3, and a softmax decision function, where each CONV group includes 3 convolutional layers.
Fig. 2 is a schematic diagram of the software optimization and hardware implementation of an artificial neural network.
As shown in Fig. 2, in order to accelerate the CNN, a complete technical solution is proposed from the perspectives of the optimization flow and the hardware architecture.
The lower part of Fig. 2 shows the artificial neural network model. The middle part of Fig. 2 illustrates how the CNN model is compressed to reduce the memory footprint and the amount of operations while minimizing the accuracy loss.
The upper part of Fig. 2 shows the dedicated hardware provided for the compressed CNN.
As shown in the upper part of Fig. 2, the hardware architecture includes two modules: PS and PL.
The general-purpose processing system (processing system, PS) includes: a CPU and an external memory (EXTERNAL MEMORY).
The programmable logic module (Programmable Logic, PL) includes: a DMA, a computing complex, input/output buffers, a controller, etc.
As shown in Fig. 2 PL is provided with:Complicated calculations (Computing Complex), input block, output buffer,
Controller and direct memory access (DMA).
Calculating core includes multiple processing units (PEs), and it is responsible for CONV layers, tether layer and FC layers in artificial neural network
Most calculating task.
Chip buffering area includes input block and output buffer, prepares the data that PEs is used and stores result.
Controller, instruction on external memory storage is obtained, to instruction decoding (if desired), and to all moulds in PL
Block is allocated, except DMA.
DMAs is used to transmit data and instruction of the external memory storage (such as DDR) between PL.
The PS includes a general-purpose processor (CPU) 8110 and an external memory 8120.
The external memory stores the model parameters, data and instructions of all the artificial neural networks.
The PS is a hard core: its hardware structure is fixed, and it is scheduled by software.
The PL is programmable hardware logic whose hardware structure is variable. For example, the programmable logic module (PL) can be an FPGA.
It should be noted that, according to an embodiment of the present invention, although the DMA is on the PL side, it is directly controlled by the CPU and transfers data from the EXTERNAL MEMORY into the PL.
Therefore, the hardware architecture shown in Fig. 2 is only a functional division, and the boundary between the above PL and PS is not absolute. For example, in an actual implementation, the PL and the CPU can be realized on one SoC, such as a Xilinx Zynq chip, while the external memory can be realized by another memory chip connected with the CPU in the SoC.
Fig. 3 shows the optimization flow performed before the artificial neural network is deployed to the hardware chip.
The input of Fig. 3 is the original artificial neural network.
Step 405: Compression
The compression step can include pruning the CNN model. Network pruning has been proven to be an effective method to reduce the complexity and overfitting of a network; see, for example, the article by B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon".
The priority application 201610663201.9, "A Method for Optimizing an Artificial Neural Network", which is incorporated in this application by reference, proposes a method of compressing a CNN network by pruning.
First, an initialization step: the weights of the convolutional layers and FC layers are initialized to random values, whereby a fully connected ANN is generated whose connections have weight parameters.
Second, a training step: the ANN is trained, and its weights are adjusted according to the accuracy of the ANN until the accuracy reaches a predetermined standard.
For example, the training step adjusts the weights of the ANN based on a stochastic gradient descent algorithm, i.e., the weight values are adjusted randomly and the adjustment is selected based on the resulting change in the ANN's accuracy. For an introduction to the stochastic gradient algorithm, refer to the above-mentioned "Learning both weights and connections for efficient neural networks".
The accuracy can be quantified as the difference between the prediction results of the ANN and the correct results on a training dataset.
Third, a pruning step: based on a predetermined condition, the unimportant connections in the ANN are found and pruned. Specifically, the weight parameters of the pruned connections are no longer saved.
The predetermined condition includes any one of the following: the weight parameter of a connection is 0; or the weight parameter of a connection is smaller than a predetermined value.
Fourth, a fine-tuning step: the pruned connections are re-set as connections whose weight parameter value is zero, i.e., the pruned connections are recovered and assigned the weight value 0.
Finally, it is determined whether the accuracy of the ANN reaches the predetermined standard. If not, the second, third and fourth steps are repeated.
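The four steps above can be summarized in the following sketch. The train and evaluate callbacks and the magnitude threshold are hypothetical placeholders standing in for the SGD training and the predetermined condition:

```python
import numpy as np

def prune_ann(weights, train, evaluate, threshold, target_accuracy):
    """Iterative pruning sketch: weights is a list of NumPy arrays,
    train() adjusts them in place (e.g. by SGD), evaluate() returns accuracy."""
    while True:
        train(weights)                        # step 2: train the ANN
        # step 3: prune unimportant connections (weight 0 or |w| < threshold)
        masks = [np.abs(w) >= threshold for w in weights]
        # step 4: recover pruned connections with weight value 0
        for w, m in zip(weights, masks):
            w[~m] = 0.0
        if evaluate(weights) >= target_accuracy:   # final accuracy check
            return weights, masks             # otherwise repeat steps 2-4
```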
Step 410: Data Fixed-Point Quantization
For a fixed-point number, its value is expressed as follows:

n = Σ_{i=0}^{bw-1} B_i · 2^{-fl} · 2^i  (4)

where bw is the bit width of the number, B_i is the i-th binary digit, and fl is the fractional length (fractional length), which can be negative.
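As a sketch of Expression 4, a bw-bit integer n with fractional length fl represents the real value n · 2^(-fl); the conversion helpers below are illustrative only (the saturation behavior on overflow is an assumption):

```python
def float_to_fixed(x, bw, fl):
    """Quantize a real number x to a bw-bit fixed-point integer with
    fractional length fl (fl may be negative), saturating on overflow."""
    n = int(round(x * 2.0 ** fl))
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return max(lo, min(hi, n))

def fixed_to_float(n, fl):
    """Recover the real value represented by the fixed-point integer n."""
    return n * 2.0 ** (-fl)
```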
In order to achieve the highest accuracy while converting floating-point numbers into fixed-point numbers, the inventors propose a dynamic-precision data quantization strategy and an automatic workflow.
Unlike previous static-precision quantization strategies, in the data quantization flow proposed by us, fl changes dynamically across different layers and feature map sets while staying static within one layer, so as to minimize the truncation error of each layer.
The proposed quantization flow mainly consists of two phases.
(1) The weight quantization phase:
The goal of the weight quantization phase is to find the optimal fl for the weights of one layer, as in Expression 5:

fl = argmin_{fl} Σ |W_float − W(bw, fl)|  (5)

where W is a weight and W(bw, fl) represents the fixed-point format of W under the given bw and fl.
In one embodiment, the dynamic range of the weights of each layer is analyzed first, e.g., estimated by sampling. Then, fl is initialized so as to avoid data overflow. Furthermore, we search for the optimal fl in the neighborhood of the initial fl.
According to another embodiment, in the weight fixed-point quantization step, the optimal fl is found in another way, as in Expression 6, where i represents a certain bit among the bw bits and k_i is the weight of that bit. In the manner of Expression 6, different weights are given to different bits, and the optimal fl is then calculated.
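A minimal sketch of the weight quantization phase under Expression 5 follows. The overflow-free initialization and the search radius around it are assumptions consistent with the embodiment described above:

```python
import numpy as np

def quant(W, bw, fl):
    """Round W to bw-bit fixed point with fractional length fl, then de-quantize."""
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return np.clip(np.round(W * 2.0 ** fl), lo, hi) * 2.0 ** (-fl)

def best_fl_for_weights(W, bw, radius=2):
    """Expression 5: fl = argmin sum |W_float - W(bw, fl)|, searched in the
    neighborhood of an initial fl chosen to avoid overflow."""
    fl0 = bw - 1 - int(np.ceil(np.log2(np.max(np.abs(W)) + 1e-12)))
    candidates = range(fl0 - radius, fl0 + radius + 1)
    return min(candidates,
               key=lambda fl: float(np.sum(np.abs(W - quant(W, bw, fl)))))
```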
(2) The data quantization phase.
The data quantization phase aims to find the optimal fl for the feature map sets between two layers of the CNN model.
In this phase, the CNN is run on a training dataset (benchmark). The training dataset can be dataset 0.
According to one embodiment of the present invention, the weight quantization of all the CONV layers and FC layers of the CNN is completed first, and data quantization is carried out afterwards. At this point, the training dataset is input to the CNN whose weights have been quantized and is processed layer by layer through the CONV layers and FC layers, yielding the input feature maps of each layer.
For the input feature maps of each layer, a greedy algorithm is used to compare, layer by layer, the data of the fixed-point CNN model with the data of the floating-point CNN model, so as to reduce the accuracy loss. The optimization target of each layer is shown in Expression 7:

fl = argmin_{fl} Σ |x+_float − x+(bw, fl)|  (7)

In Expression 7, A represents the computation of one layer (e.g., a certain CONV layer or FC layer), x represents the input, and in x+ = A·x, x+ represents the output of this layer. It is worth noting that, for a CONV layer or an FC layer, the direct result x+ has a longer bit width than the given standard, so truncation is needed when the optimal fl is selected. Finally, the entire data quantization configuration is generated.
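The layer-by-layer greedy comparison can be sketched as below; the layers argument (a list of callables computing A·x) and the candidate range of fl values are assumptions of the sketch:

```python
import numpy as np

def quant(x, bw, fl):
    """Truncate x to bw-bit fixed point with fractional length fl."""
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    return np.clip(np.round(x * 2.0 ** fl), lo, hi) * 2.0 ** (-fl)

def quantize_feature_maps(layers, x, bw, fl_range=range(-8, 16)):
    """Greedy pass of Expression 7: for each layer A, pick the fl minimizing
    sum |x+_float - x+(bw, fl)|, truncate, and feed the result onward."""
    fls = []
    for A in layers:                  # A is the computation of one CONV/FC layer
        x_plus = A(x)                 # direct result with longer bit width
        fl = min(fl_range,
                 key=lambda f: float(np.sum(np.abs(x_plus - quant(x_plus, bw, f)))))
        fls.append(fl)
        x = quant(x_plus, bw, fl)     # truncated output becomes the next input
    return fls
```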
According to another embodiment, in the data fixed-point quantization step, the optimal fl is found in another way, as in Expression 8, where i represents a certain bit among the bw bits and k_i is the weight of that bit. Similar to the manner of Expression 6, different weights are given to different bits, and the optimal fl is then calculated.
The above data quantization step yields the optimal fl.
In addition, according to another embodiment, the weight quantization and the data quantization are not performed one after the other, but alternately.
In terms of the order of the data processing flow, the convolutional layers (CONV layers) and fully connected layers (FC layers) of the ANN are in a serial relationship, and each feature map set is obtained as the training dataset is processed layer by layer through the CONV layers and FC layers of the ANN.
Specifically, the weight quantization step and the data quantization step are performed alternately, following this serial relationship: after the weight quantization step completes the fixed-point quantization of a certain layer, the data quantization step is performed on the feature map set output by that layer.
First Embodiment
In the priority applications, the inventors proposed a co-design of a general-purpose processor and a special accelerator, but did not discuss how to efficiently exploit the flexibility of the general-purpose processor and the computing power of the special accelerator, e.g., how to transfer instructions, transfer data, perform computation, etc. In this application, the inventors propose further optimized solutions.
Fig. 4 shows a further improvement over the hardware structure of Fig. 2.
In Fig. 4, the CPU controls the DMA, and the DMA is responsible for moving the data. Specifically, the CPU controls the DMA to move the instructions in the external memory (DDR) into a FIFO; the special accelerator then fetches the instructions from the FIFO and executes them.
Likewise, under the control of the CPU, the DMA moves the data required by the special accelerator from the DDR into a FIFO, and the accelerator takes the data from the FIFO when computing. The CPU also maintains the moving of the accelerator's output data.
At run time, the CPU needs to constantly monitor the state of the DMA: when the input FIFO is not full, data needs to be moved from the DDR into the input FIFO; when the output FIFO is not empty, data needs to be moved from the output FIFO back into the DDR.
In addition, the special accelerator of Fig. 4 includes: a controller, a computing complex (computation complex) and buffers (buffer).
The computing complex includes: convolvers, an adder tree, a nonlinear module, etc.
The size of a convolution kernel generally has only a few options, such as 3 × 3, 5 × 5 and 7 × 7. For example, the two-dimensional convolver designed for the convolution operation uses a 3 × 3 window.
The adder tree (AD) sums all the results of the convolvers. The nonlinear (NL) module applies a nonlinear activation function to the input data stream; for example, the function can be the ReLU function. In addition, a max-pooling module (not shown) is used for the pooling operation, e.g., applying a specific 2 × 2 window to the input data stream and outputting the maximum value therein.
The buffers include: an input data buffer, an output data buffer, and a bias shift (bias shift) module.
The bias shift module supports the conversion between dynamic quantization ranges, e.g., shifting for the weights or, as another example, shifting for the data.
The input data buffer can further include an input data buffer and a weight buffer. The input data buffer can be a line buffer (line buffer), which holds the data needed by the computation and releases the data continuously, so as to realize the reuse of the data.
Fig. 5 shows the FIFO interaction between the CPU and the special accelerator.
There are 3 classes of FIFOs in the architecture diagram shown in Fig. 5; correspondingly, the CPU's control over the DMA is also of three kinds.
In the first embodiment, the CPU and the special accelerator communicate entirely through FIFOs, with three classes of FIFO buffers between them: instruction, input data, and output data. Specifically, under the control of the CPU, the DMA is responsible for the transfer of input data, output data and instructions between the external memory and the special accelerator, with an input data FIFO, an output data FIFO and an instruction FIFO provided between the DMA and the special accelerator.
For the special accelerator, this design is simple: it only needs to care about computing, not about the data, since the data operations are completely controlled by the CPU.
However, in some application scenarios the solution shown in Fig. 5 also has shortcomings.
First, performing the scheduling consumes CPU resources. For example, the CPU has to constantly monitor the state of each FIFO so as to receive and send data at any moment. Monitoring the states and handling the data according to the different states consumes a large amount of CPU time. In some applications, the cost of the CPU's FIFO monitoring and data handling can be so large that the CPU is almost fully occupied and has no time to process other tasks (reading pictures, preprocessing pictures, etc.).
Second, multiple FIFOs need to be provided in the special accelerator, which also occupies PL resources.
Second Embodiment
The characteristics of the second embodiment are as follows. First, the special accelerator shares the external memory with the CPU, and both can read the external memory. Second, the CPU controls only the instruction input of the special accelerator. In this way, the CPU and the special accelerator operate cooperatively, with the CPU undertaking the tasks that the special accelerator cannot complete.
As shown in Fig. 6, in the second embodiment, the special accelerator (PL) interacts directly with the external memory (DDR). Accordingly, the input FIFO and the output FIFO between the DMA and the special accelerator (shown in Fig. 5) are eliminated, and only 1 FIFO is retained to transmit instructions between the DMA and the special accelerator, which saves resources.
The CPU no longer needs to carry out complicated scheduling of the input and output data, as the special accelerator accesses the data directly from the external memory (DDR). While the artificial neural network is running, the CPU can carry out other processing, such as reading the image data to be processed from a camera.
Therefore, the second embodiment solves the problem of the CPU being overloaded: the CPU is freed to handle more tasks. However, the special accelerator must itself perform and control the data accesses to the external memory (DDR).
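Under the second embodiment, the CPU-side work shrinks to the sketch below (accessor names again hypothetical); the accelerator fetches and writes its own data in the shared DDR while the CPU does other work:

```python
def cpu_loop_second_embodiment(dma, accel, ddr, camera):
    """Second-embodiment sketch: the CPU only feeds instructions through the
    single remaining FIFO; the accelerator accesses the DDR by itself."""
    dma.copy(src=ddr.instructions, dst=dma.instr_fifo)   # the one FIFO kept
    accel.start()
    while accel.busy():
        ddr.write(camera.read())     # CPU is free for I/O and other tasks
```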
Improvement of the First and Second Embodiments
In the first and second embodiments, the CPU controls the accelerator through instructions.
The accelerator may "run away" during operation (i.e., the program enters an endless loop or runs meaninglessly). In the schemes above, the CPU cannot determine whether the accelerator has run away.
In an improved embodiment based on the first or the second embodiment, the inventors additionally provide a "status peripheral" on the CPU, so that the state of the finite state machine (FSM) in the special accelerator (PL) is passed directly to the CPU.
By detecting the state of the finite state machine (FSM), the CPU can learn the running situation of the accelerator. If it finds that the accelerator has run away or is stuck, the CPU can also send a signal to directly reset the accelerator.
Fig. 7 shows an example in which the "status peripheral" is added to the architecture of the first embodiment shown in Fig. 4.
Fig. 8 shows an example in which the "status peripheral" is added to the architecture of the second embodiment shown in Fig. 6.
As shown in Figs. 7 and 8, a finite state machine (FSM) is provided in the controller of the special accelerator, and the state of the finite state machine is passed directly to the CPU's status peripheral (i.e., a monitoring module), so that the CPU can monitor fault conditions of the program such as deadlock.
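The status peripheral can be sketched as a simple watchdog on the CPU side; the register read, the reset signal and the timeout value are illustrative assumptions:

```python
import time

def watchdog(status_peripheral, accel, timeout=1.0):
    """Poll the FSM state exported by the status peripheral; if the state stops
    changing for too long (run-away or stuck program), reset the accelerator."""
    last_state, last_change = status_peripheral.read(), time.monotonic()
    while True:
        state = status_peripheral.read()        # FSM state from the PL
        if state != last_state:
            last_state, last_change = state, time.monotonic()
        elif time.monotonic() - last_change > timeout:
            accel.reset()                       # CPU sends the reset signal
            last_change = time.monotonic()
        time.sleep(0.01)
```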
Comparison of the First and Second Embodiments
The two scheduling strategies of the first and second embodiments each have advantages.
In the embodiment of Fig. 4, the image data can only be transferred to the special accelerator by the CPU scheduling the DMA, so the special accelerator may sit idle for some time. However, because the CPU schedules the data movement, the special accelerator is responsible only for computing; its computing power is fully exploited, and the time to process the data can be shorter.
In the embodiment of Fig. 6, the special accelerator has the ability to access the data by itself, without the CPU scheduling the data movement, so the data processing can be carried out independently on the special accelerator.
The CPU can then be responsible only for the data input and output with the external system. The read operation means, for example, that the CPU reads picture data from a camera (not shown) and transfers it to the external memory; the output operation means that the CPU outputs the recognized result from the external memory to a screen (not shown).
With the embodiment of Fig. 6, the tasks can be pipelined, so that multiple tasks are processed faster. The corresponding cost is that the special accelerator is responsible for both computing and data movement at the same time, which is less efficient, so the processing takes a longer time.
The comparison in Fig. 9 shows the similarities and differences between the processing flows of the first and second embodiments.
An Application of the Second Embodiment: Face Recognition
According to the second embodiment, because of the shared external memory (DDR), the CPU and the special accelerator can jointly complete one computing task.
For example, in a face recognition task, the CPU reads the camera and detects the face in the input image, while the computing core of the special accelerator completes the recognition of the face with the neural network.
Using the co-design of the CPU and the special accelerator, the above neural network computing task can be quickly deployed on an embedded device.
Specifically, referring to Example 2 of Fig. 9, the reading (e.g., from a camera) and preprocessing of the picture run on the CPU, and the processing of the picture is completed on the special accelerator.
Because the above method separates the CPU's tasks from the accelerator's tasks, the CPU and the accelerator can process their tasks completely in parallel.
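The resulting pipelining can be sketched with two threads (all device handles are hypothetical): the CPU reads and preprocesses frame n+1 while the accelerator processes frame n:

```python
import queue
import threading

def pipeline(camera, accel, num_frames):
    """Example-2 sketch: CPU-side reading/preprocessing and accelerator-side
    recognition run fully in parallel on successive frames."""
    q = queue.Queue(maxsize=2)

    def cpu_side():                          # CPU task: read and preprocess
        for _ in range(num_frames):
            q.put(camera.read())
        q.put(None)                          # sentinel: no more frames

    threading.Thread(target=cpu_side, daemon=True).start()
    while (frame := q.get()) is not None:
        accel.process(frame)                 # accelerator task: recognition
```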
Table 1 illustrates a performance comparison between a CPU alone and the second embodiment (the CPU + special accelerator co-design).
Table 1
The comparison CPU is the Tegra K1 produced by NVIDIA. It can be seen that our CPU + special accelerator co-design achieves an obvious speedup on every layer, with an overall speedup of 7×.
An advantage of the present invention is that the rich functionality of the CPU (general-purpose processor) is used to make up for the limited flexibility of the special accelerator (the programmable logic module PL, e.g., an FPGA), while the high computing speed of the special accelerator is used to make up for the CPU, whose computing speed is insufficient for completing the computation in real time.
In addition, the general-purpose processor can be an ARM processor or any other CPU. The programmable logic module can be an FPGA or another programmable special-purpose processor (ASIC).
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts among the embodiments, reference may be made to one another.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method can also be implemented in other ways. The apparatus embodiments described above are only schematic. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions and operations of the apparatuses, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram can represent a module, a program segment, or a part of code, where the module, program segment, or part of code contains one or more executable instructions for realizing the specified logic function. It should also be noted that, in some alternative implementations, the functions marked in the blocks can occur in an order different from the order marked in the drawings. For example, two consecutive blocks can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and any combination of blocks in the block diagrams and/or flowcharts, can be realized by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention. It should be noted that similar labels and letters represent similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in the subsequent drawings.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of changes or replacements within the technical scope disclosed by the present invention, and these shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.