An OPU instruction set definition method for CNN acceleration
Technical field
The present invention relates to the field of instruction set definition methods for CNN accelerators, and in particular to an OPU instruction set definition method for CNN acceleration.
Background art
Deep convolutional neural networks (CNNs) achieve very high accuracy in applications such as visual object recognition, speech recognition and object detection. This accuracy, however, comes at a high computational cost, so the computation must be accelerated with compute clusters, GPUs or FPGAs. Among these, FPGA accelerators offer high energy efficiency, good flexibility and strong computing capability, and they stand out for CNN applications on edge devices, such as speech recognition and visual object recognition on smartphones. Developing such an accelerator usually involves architecture exploration and optimization, RTL programming, hardware implementation and software-hardware interface development. As the field has developed, automatic compilers for FPGA-based CNN (convolutional neural network) acceleration have been studied in depth; configurable platforms provide abundant parallel computing resources and high energy efficiency, which makes them an ideal choice for CNN acceleration at the edge and in the data center. However, as DNN (deep neural network) algorithms are applied to increasingly complex computer vision tasks, such as face recognition, license plate recognition and gesture recognition, cascades of multiple DNNs are widely used to obtain better performance. These new application scenarios require different networks to be executed in sequence, so the FPGA device must be reconfigured repeatedly, which is time-consuming. In addition, every update to the customer's network architecture requires the RTL code to be regenerated and the entire implementation flow to be repeated, which takes even longer.
In recent years, automatic accelerator generators that can quickly deploy CNNs to FPGAs have become another research focus. In the prior art, researchers developed Deep weaver, which maps CNN algorithms onto hand-optimized design templates according to the resource allocation and hardware organization supplied by the design planner. Others proposed a compiler based on an RTL module library, which consists of multiple optimized, hand-coded Verilog templates that describe the computation and data flow of different layer types. Both works achieve performance comparable to custom-designed accelerators. Other researchers provided an HLS-based compiler that focuses mainly on bandwidth optimization through memory access reorganization, and still others proposed a systolic array architecture to achieve a higher FPGA operating frequency. However, existing FPGA acceleration work aims to generate a specific, separate accelerator for each CNN. This secures reasonably high performance for RTL-based or HLS-RTL-based templates, but the complexity of upgrading the hardware is high whenever the target network is adjusted.
Therefore, a completely new CNN acceleration system is proposed in which no specific hardware description code has to be generated for an individual network and the FPGA does not have to be re-programmed. The entire deployment flow is completed through instruction configuration, and different target networks are configured by instructions without reconfiguring the FPGA accelerator. An OPU (Overlay Processor Unit) instruction set is defined, a compiler generates an instruction sequence from the defined instruction set, and the OPU executes the compiled instructions to accelerate the CNN. To realize this, one must consider how to define the instructions so that networks of different structures can be mapped and reorganized onto a specific structure, giving the instruction-controlled processor good generality.
On the other hand, when external memory is used, the cycle counts of memory read and write operations cannot be modeled accurately, because refresh time and other overheads may be incurred while the external memory is in use. If an instruction is executed immediately after decoding, the order of operations can be controlled only by the order of the instruction sequence, and without an accurate model of the operation cycles, controlling the start points of operations that execute in parallel becomes intractable. Meanwhile, the start conditions of the main operations vary only within a limited range, and an operation is usually triggered once the preceding steps have reached a certain state, so the execution time of each instruction is highly uncertain. An instruction set definition method is therefore needed that overcomes these problems: it should provide an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure, improve the generality of the instruction-controlled processor, allow different target networks to be configured by instructions, and realize general CNN acceleration through the OPU.
Summary of the invention
The object of the invention is to provide an OPU instruction set definition method for CNN acceleration. The method provides an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure and improves the generality of the instruction-controlled processor, so that different networks can be realized without reconfiguring the FPGA.
The technical solution adopted by the invention is as follows:
An OPU instruction set definition method for CNN acceleration comprises defining conditional instructions, defining unconditional instructions and setting the instruction granularity;
Defining the conditional instructions comprises the following steps:
constructing the conditional instructions, which include a memory read instruction, a memory write instruction, a data fetch instruction, a data post-processing instruction and a compute instruction;
setting the registers and the execution mode of the conditional instructions, wherein a conditional instruction is executed once its trigger condition, which is hard-coded in hardware, is satisfied, and the registers include parameter registers and trigger condition registers;
setting the parameter configuration mode of the conditional instructions, wherein the parameters are configured by the unconditional instructions;
Defining the unconditional instructions comprises the following steps:
defining the parameters of the unconditional instructions;
defining the execution mode of the unconditional instructions, wherein an unconditional instruction is executed directly once it has been read;
Setting the instruction granularity comprises the following steps:
collecting statistics on CNN networks and acceleration requirements;
determining the computation mode from the statistics and the selected parallel input and output channels, and setting the instruction granularity accordingly.
Preferably, the memory read instruction performs the read operation in Mode A1 or in Mode A2, with a granularity of n numbers read in per operation, n > 1;
Mode A1: n numbers are read consecutively, starting from a specified address;
Mode A2: n numbers are read according to an address stream, in which the addresses are non-contiguous. There are three post-read operations: 1. no operation after reading; 2. splicing to a specified length after reading; 3. splitting into a specified length after reading. There are four on-chip storage locations for the data that has been read: the feature map memory module, the inner-product parameter memory module, the bias parameter memory module and the instruction memory module;
The configurable parameters of the memory read instruction include the start address, the number of operands, the post-read processing mode and the on-chip storage location.
Preferably, the memory write instruction performs the write operation in Mode B1 or in Mode B2, with a granularity of n numbers written out per operation, n > 1;
Mode B1: n numbers are written consecutively, starting from a specified address;
Mode B2: n numbers are written according to a destination address stream, in which the addresses are non-contiguous;
The configurable parameters of the memory write instruction include the start address and the number of operands.
Preferably, the data fetch instruction reads data from the on-chip feature map memory and the inner-product parameter memory according to different data read modes and data rearrangement modes, and rearranges the data that has been read; its granularity is 64 input data operated on simultaneously. The configurable parameters of the data fetch instruction cover reading the feature map memory and reading the inner-product parameter memory: for the feature map memory they include the read address constraints (the minimum and maximum addresses), the read stride and the rearrangement mode; for the inner-product parameter memory they include the read address constraints and the read mode.
Preferably, the data post-processing instruction performs one or more of the following operations: pooling, activation, fixed-point truncation, rounding and element-wise vector addition; its granularity is a multiple of 64 data per operation. The configurable parameters of the data post-processing instruction include the pooling type, the pooling size, the activation type and the fixed-point truncation position.
Preferably, the compute instruction performs vector inner-product operations arranged according to vectors of different lengths, with a granularity of 32; the basic computing unit used for the vector inner-product operation is an inner-product module for two vectors of length 32, and the adjustable parameters of the compute instruction include the number of output results.
Preferably, the unconditional instructions provide parameter updates. The parameters include: the length, width and channel count of the on-chip feature map memory module; the input length and width of the current layer; the input channel count and output channel count of the current layer; the start address of the memory read operation; the read mode selection; the start address of the memory write operation; the write mode selection; the data fetch mode and constraints; the computation mode setting; the parameters related to the pooling operation; the parameters related to the activation operation; and the settings for data shifting, truncation and rounding.
Preferably, the method further comprises setting an instruction sequence definition mode, specifically: where the instruction sequence would contain several consecutive identical instructions, only a single instruction is defined, and that instruction is executed repeatedly until the contents of the trigger condition register and the parameter registers are updated.
Preferably, the method further comprises defining the instruction length, wherein all instructions have a uniform length.
Preferably, the minimum unit of the computation mode over parallel input and output channels is a vector inner product of length 32.
In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are:
1. the present invention is by defining conditional instruction, defining imperative statement and setting instruction granularity, imperative statement
Configuration parameter is provided for conditional instruction, trigger condition is arranged in conditional instruction, and trigger condition is written firmly within hardware, has ready conditions
Corresponding register is arranged in instruction, executes after trigger condition satisfaction, imperative statement directly executes after being read, alternative parameter
Content of registers avoids existing operation cycle uncertainty from leading to not predict instruction reorder greatly, realizes that Accurate Prediction instruction is suitable
Sequence;Calculating mode is determined according to the parallel input of CNN network, acceleration demand and selection and output channel, and instruction particle is set
Degree is realized and recombinates the network mapping of different structure to specific structure, using the various sizes of network of parallel computation mode adaptive
Kernel size, solve the universality of instruction set alignment processing device, be that CNN OverDrive Processor ODP according to instruction completes different mesh
The configuration for marking network accelerates general CNN accelerating velocity and provides applicable OPU instruction set;
2. trigger condition is arranged in conditional instruction of the invention, the sequence that existing instruction sequence fully relies on setting is avoided
The shortcomings that time-consuming is executed, realizes that memory reading is continuously operated with model identical, without the fixed intervals sequence by setting
It executes, greatly shortens the length of instruction sequence, conducive to the acceleration for realizing different target network by instruction rapid configuration;
3. the parallel input and output channel according to statistical result and selection determine calculating mode, and instruction particle is arranged
Degree, can by parameter regulation parallel section input channel to calculate more output channels simultaneously, or parallel more multiple input path with
Calculating wheel number is reduced, input channel and output channel are 32 multiple in universal CNN structure, and choosing 32 can as basis unit
The peak use rate of computing unit is effectively ensured;
It is updated 4. imperative statement of the invention provides parameter, the synchronous parameter of renewal frequency is classified into same without item
To make full use of all bits of instruction in part instruction, reduces total instruction and call item number;
5. a single instruction is only arranged in the present invention when having continuous a plurality of repetitive instruction, which is repeatedly executed at predetermined intervals, until
Trigger condition register and parameter register keep content until being updated, and are conducive to realize different target by instruction rapid configuration
The acceleration of network.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the invention and should therefore not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a flow chart of the instruction set definition method of the invention;
Fig. 2 is a schematic diagram of the triggered operation of the conditional instructions of the invention;
Fig. 3 is a schematic diagram of the parallel computation mode of the invention;
Fig. 4 is a schematic diagram of the instruction set of the invention;
Fig. 5 is a structural schematic diagram of the OPU based on the instruction set of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it; that is, the described embodiments are only some, not all, of the embodiments of the invention. The components of the embodiments, as generally described and illustrated in the drawings, can be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
It should be noted that relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants of them are intended to be non-exclusive, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes it.
The features and performance of the invention are described in further detail below with reference to the embodiments.
Embodiment 1
An OPU instruction set definition method for CNN acceleration comprises defining conditional instructions, defining unconditional instructions and setting the instruction granularity.
When the defined instruction set is used for CNN acceleration, the instruction types, the operation corresponding to each instruction type, the general parameters and the instruction granularity must be defined; the general parameters include the instruction length and the instruction sequence. The OPU runs instructions in the following steps:
Step 1: read an instruction block (the instruction set is the collection of all instructions; an instruction block is a group of consecutive instructions, and the instructions for executing one network comprise multiple instruction blocks);
Step 2: fetch the unconditional instructions in the instruction block and execute them directly, decoding the parameters they contain and writing them into the corresponding registers; fetch the conditional instructions in the instruction block, set the trigger conditions according to the conditional instructions, and go to Step 3;
Step 3: check whether the trigger condition is satisfied; if so, execute the conditional instruction; if not, do not execute it;
Step 4: check whether the read instruction for the next instruction block, which is contained in the instructions, satisfies its trigger condition; if so, return to Step 1 and continue executing instructions; otherwise the register parameters and the trigger condition set by the current conditional instruction remain unchanged until the trigger condition is satisfied.
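Steps 1 to 3 above can be sketched in Python as a small software model. This is only an illustrative sketch, not the OPU hardware: the class names, the event names such as "prev_write_done" and the parameter names are hypothetical and do not come from the original text.

```python
from dataclasses import dataclass, field

@dataclass
class Unconditional:
    """U-type instruction: carries parameter updates (names are hypothetical)."""
    params: dict

@dataclass
class Conditional:
    """C-type instruction: names an operation and the event it waits for."""
    op: str        # e.g. "read", "fetch", "compute"
    trigger: str   # hypothetical event name

@dataclass
class OPUModel:
    param_regs: dict = field(default_factory=dict)
    trigger_regs: dict = field(default_factory=dict)  # op -> awaited event
    log: list = field(default_factory=list)

    def load_block(self, block):
        """Steps 1-2: read an instruction block and decode it."""
        for ins in block:
            if isinstance(ins, Unconditional):
                # Executed directly on read: overwrite parameter registers.
                self.param_regs.update(ins.params)
            else:
                # Only arm the trigger; execution waits for the event.
                self.trigger_regs[ins.op] = ins.trigger

    def signal(self, event):
        """Step 3: run every conditional operation whose trigger matches."""
        for op, trig in self.trigger_regs.items():
            if trig == event:
                self.log.append((op, dict(self.param_regs)))

opu = OPUModel()
opu.load_block([
    Unconditional({"read_start_addr": 0x100, "read_mode": "A1"}),
    Conditional(op="read", trigger="prev_write_done"),
])
opu.signal("prev_fetch_done")   # trigger not met: nothing executes
opu.signal("prev_write_done")   # trigger met: the read executes
```

In this model the unconditional instruction takes effect as soon as it is decoded, while the conditional read runs only when the awaited event arrives, using whatever the parameter registers hold at that moment.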
The instruction set defined in this application is used in an OPU-based CNN acceleration system; the OPU structure is shown in Fig. 5. The OPU is implemented on an FPGA or an ASIC. The final executable instructions are generated from the defined instructions, and by running them the OPU accelerates different target CNN networks. The technical means adopted are: defining conditional instructions, defining unconditional instructions and setting the instruction granularity, as shown in the flow chart of Fig. 1. For the conditional instructions, their composition is defined, comprising six classes of instructions; their registers and execution mode are set, the execution mode being execution once the trigger condition hard-coded in hardware is satisfied, and the registers including parameter registers and trigger condition registers; and their parameter configuration mode is set, namely configuration by the unconditional instructions. Defining the unconditional instructions includes defining their parameters and defining their execution mode, which is direct execution. The instruction length is defined to be uniform, and the instruction set has the structure shown in Fig. 4. The instruction granularity is set by collecting statistics on CNN networks and acceleration requirements, determining the computation mode from the statistics and the selected parallel input and output channels, and setting the granularity accordingly. The OPU instruction set includes conditional instructions (C-type instructions) and unconditional instructions (U-type instructions); the resulting instruction sequence is shown in Fig. 4.
The conditional instructions are listed in Table 1:
Table 1
Instruction name | Instruction function
r | Memory read instruction
w | Memory write instruction
f | Data fetch instruction
c | Compute instruction
p | Data post-processing instruction
The granularity of each class of instruction is set according to the CNN network structure and the acceleration requirements. For the memory read instruction, the granularity is set according to the characteristics of CNN acceleration to n numbers read in per operation, n > 1. For the memory write instruction, the granularity is likewise set to n numbers written out per operation, n > 1. For the data fetch instruction, according to the structure of CNN networks, the granularity is 64, that is, 64 input data are operated on simultaneously. For the data post-processing instruction, the granularity is a multiple of 64 data per operation. For the compute instruction, because the input and output channel counts of the network are multiples of 32, the granularity is 32.
The parameters defined by the unconditional instructions are shown in Table 2:
Table 2
The computation mode parallelizes over input and output channels. Parameters can adjust it to parallelize over part of the input channels so that more output channels are computed at the same time, or to parallelize over more input channels so that fewer computation rounds are needed. Input and output channel counts are multiples of 32 in common CNN structures, so this embodiment chooses a vector inner product of length 32 as the minimum unit of the parallel input/output channel computation mode, which effectively ensures maximum utilization of the computing units. When the instruction set is used for CNN acceleration, the parallel input/output channel computation mode is as shown in Fig. 3. In each clock cycle, a slice of size 1*1 and depth ICS (the input channel slice) is read together with the corresponding kernel elements; these elements follow the natural data storage pattern and need only a small bandwidth. Parallelism is realized across input channels (ICS) and output channels (OCS, the number of kernel sets involved). Fig. 3(c) further illustrates the computation process. In cycle 0 of round 0, the input channel slice at position (0,0) is read; in the next cycle the read jumps by the stride x to position (0,2), and reading continues until all pixels corresponding to kernel position (0,0) have been processed. Round 1 then begins, reading from position (0,1) all pixels corresponding to kernel position (0,1). Computing a data block of size IN*IM*IC with OC kernel sets requires Kx*Ky*(IC/ICS)*(OC/OCS) rounds. This computation mode unifies the data pattern for any kernel size or stride, greatly simplifies data management before computation, achieves higher efficiency with lower resource consumption, and adapts to the kernel sizes of networks of various sizes.
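The round structure described above can be checked with a small pure-Python sketch. It is a software model under stated assumptions rather than the OPU datapath: the nesting order of the output-channel groups, kernel positions and input-channel slices is an assumption (the text only gives the round count), and the tensor sizes are arbitrary toy values.

```python
import itertools
import random

def conv_direct(x, w, stride):
    """Reference convolution. x[row][col][ic], w[oc][ky][kx][ic]."""
    ih, iw, ic = len(x), len(x[0]), len(x[0][0])
    oc, ky_n, kx_n = len(w), len(w[0]), len(w[0][0])
    oh, ow = (ih - ky_n) // stride + 1, (iw - kx_n) // stride + 1
    out = [[[0] * oc for _ in range(ow)] for _ in range(oh)]
    for oy, ox, o in itertools.product(range(oh), range(ow), range(oc)):
        out[oy][ox][o] = sum(
            x[oy * stride + ky][ox * stride + kx][c] * w[o][ky][kx][c]
            for ky in range(ky_n) for kx in range(kx_n) for c in range(ic))
    return out

def conv_by_rounds(x, w, stride, ics, ocs):
    """Same result, built from 1x1 slices of depth ICS, one kernel position per round."""
    ih, iw, ic = len(x), len(x[0]), len(x[0][0])
    oc, ky_n, kx_n = len(w), len(w[0]), len(w[0][0])
    oh, ow = (ih - ky_n) // stride + 1, (iw - kx_n) // stride + 1
    out = [[[0] * oc for _ in range(ow)] for _ in range(oh)]
    rounds = 0
    for o0, ky, kx, c0 in itertools.product(
            range(0, oc, ocs), range(ky_n), range(kx_n), range(0, ic, ics)):
        rounds += 1
        for oy, ox in itertools.product(range(oh), range(ow)):
            # One cycle: read a 1x1 slice of depth ICS, stepping by the stride.
            px = x[oy * stride + ky][ox * stride + kx][c0:c0 + ics]
            for o in range(o0, o0 + ocs):
                ker = w[o][ky][kx][c0:c0 + ics]
                out[oy][ox][o] += sum(a * b for a, b in zip(px, ker))
    return out, rounds

random.seed(0)
IH = IW = 5
IC = OC = 4
K = 3
x = [[[random.randint(-8, 8) for _ in range(IC)] for _ in range(IW)]
     for _ in range(IH)]
w = [[[[random.randint(-8, 8) for _ in range(IC)] for _ in range(K)]
      for _ in range(K)] for _ in range(OC)]
result, rounds = conv_by_rounds(x, w, stride=2, ics=2, ocs=2)
```

With Kx = Ky = 3, IC = OC = 4 and ICS = OCS = 2, the loop runs 3*3*(4/2)*(4/2) = 36 rounds and reproduces the direct convolution, which is why the same slice-based read pattern serves any kernel size or stride.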
In summary, existing FPGA acceleration work aims to generate a specific, separate accelerator for each CNN. To realize different networks without reconfiguring the FPGA, this application provides an acceleration processor that is controlled by the instructions defined herein. The prior art offers no technical teaching for the instructions of the OPU instruction set defined in this application, because the hardware, the system and the coverage of prior-art FPGA acceleration systems differ from those of this application. The invention determines the computation mode from the CNN networks, the acceleration requirements and the selected parallel input and output channels, and sets the instruction granularity accordingly, so that networks of different structures are mapped and reorganized onto a specific structure and the kernel sizes of networks of different sizes are accommodated; this solves the generality problem of the processor that corresponds to the instruction set. Conditional instructions and unconditional instructions are defined: the unconditional instructions provide configuration parameters for the conditional instructions; each conditional instruction has a trigger condition that is hard-coded in hardware and a corresponding register, and it executes once the trigger condition is satisfied; an unconditional instruction executes directly once it has been read and replaces the contents of the parameter registers. Conditional instructions thus run according to their trigger conditions with parameters supplied by the unconditional instructions, so the instruction execution order is accurate and unaffected by other factors. This overcomes the problem in CNN acceleration systems that the execution time of each instruction is highly uncertain and the instruction order cannot be predicted accurately. The invention provides an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure; the instruction set and the corresponding processor, the OPU, can be implemented on an FPGA or an ASIC. This improves the generality of the instruction-controlled processor: the OPU can accelerate different target CNN networks without hardware reconfiguration.
Embodiment 2
Based on Embodiment 1, the six kinds of conditional instructions of this application include the memory read instruction, the memory write instruction, the data fetch instruction, the data post-processing instruction and the compute instruction. A conditional instruction is executed once its trigger condition, which is hard-coded in hardware, is satisfied; the registers of the conditional instructions include parameter registers and trigger condition registers; and the conditional instructions are configured with parameters by the unconditional instructions.
The memory read instruction performs the read operation in Mode A1 or in Mode A2; its configurable parameters include the start address, the number of operands, the post-read processing mode and the on-chip storage location.
Mode A1: n numbers are read consecutively, starting from a specified address, where n is a positive integer;
Mode A2: n numbers are read according to an address stream, in which the addresses are non-contiguous. There are three post-read operations: 1. no operation after reading; 2. splicing to a specified length after reading; 3. splitting into a specified length after reading. There are four on-chip storage locations for the data that has been read: the feature map memory module, the inner-product parameter memory module, the bias parameter memory module and the instruction memory module;
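The two read modes and the choice of on-chip destination can be modeled in a few lines of Python. This is a hedged sketch: the function names, the buffer names and the list-based memory are hypothetical, and the splice and split post-read operations are both modeled as regrouping the values into chunks of a given length, which is only one possible reading of the text.

```python
ON_CHIP = ("feature_map", "inner_product_param", "bias", "instruction")

def read_a1(mem, start, n):
    """Mode A1: read n numbers consecutively from a start address."""
    return mem[start:start + n]

def read_a2(mem, addr_stream):
    """Mode A2: read one number per address of a non-contiguous stream."""
    return [mem[a] for a in addr_stream]

def regroup(values, length):
    """Simplified stand-in for splicing/splitting to a specified length."""
    return [values[i:i + length] for i in range(0, len(values), length)]

def memory_read(mem, buffers, dest, mode, n, start=0, addr_stream=None,
                post="none", length=None):
    """Read n numbers, apply the post-read step, store in an on-chip buffer."""
    if dest not in ON_CHIP or n < 1:
        raise ValueError("invalid destination or operand count")
    if mode == "A1":
        data = read_a1(mem, start, n)
    else:
        data = read_a2(mem, addr_stream[:n])
    if post in ("splice", "split"):
        data = regroup(data, length)
    buffers[dest].extend(data)
    return data

external = list(range(100, 132))            # toy external memory
buffers = {name: [] for name in ON_CHIP}
contiguous = memory_read(external, buffers, "feature_map", "A1", n=4, start=2)
scattered = memory_read(external, buffers, "bias", "A2", n=4,
                        addr_stream=[0, 8, 16, 24], post="splice", length=2)
```

Here the A1 call returns four consecutive values, while the A2 call gathers values from scattered addresses and regroups them in pairs before storing them in the selected buffer.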
The memory write instruction performs the write operation in Mode B1 or in Mode B2; its configurable parameters include the start address and the number of operands.
Mode B1: n numbers are written consecutively, starting from a specified address;
Mode B2: n numbers are written according to a destination address stream, in which the addresses are non-contiguous;
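The two write modes mirror the read modes. The following Python sketch, with hypothetical function names and a list standing in for external memory, shows the difference between a contiguous write and a write driven by a destination address stream.

```python
def write_b1(mem, start, values):
    """Mode B1: write the values consecutively from a start address."""
    for offset, value in enumerate(values):
        mem[start + offset] = value

def write_b2(mem, addr_stream, values):
    """Mode B2: write one value per address of a non-contiguous stream."""
    if len(addr_stream) != len(values):
        raise ValueError("address stream and data must have equal length")
    for addr, value in zip(addr_stream, values):
        mem[addr] = value

external = [0] * 16                          # toy external memory
write_b1(external, start=4, values=[1, 2, 3])
write_b2(external, addr_stream=[0, 9, 15], values=[7, 8, 9])
```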
The data fetch instruction reads data from the on-chip feature map memory and the inner-product parameter memory according to different data read modes and data rearrangement modes, and rearranges the data that has been read. The configurable parameters of the data fetch and rearrangement instruction cover reading the feature map memory and reading the inner-product parameter memory: for the feature map memory they include the read address constraints (the minimum and maximum addresses), the read stride and the rearrangement mode; for the inner-product parameter memory they include the read address constraints and the read mode.
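A feature map fetch bounded by address constraints and a read stride might look like the following Python sketch. The text does not enumerate the rearrangement modes, so the sketch only groups the fetched values into beats of 64 with zero padding; that grouping, the function name and the flat address model are all assumptions made for illustration.

```python
GROUP = 64  # data fetch granularity stated in the text

def fetch_feature_map(fm_mem, min_addr, max_addr, stride):
    """Read addresses min_addr..max_addr (inclusive) with the given stride,
    then arrange the values into zero-padded groups of 64."""
    values = [fm_mem[a] for a in range(min_addr, max_addr + 1, stride)]
    pad = (-len(values)) % GROUP
    values += [0] * pad                      # hypothetical padding policy
    return [values[i:i + GROUP] for i in range(0, len(values), GROUP)]

feature_map = list(range(256))               # toy on-chip feature map memory
strided = fetch_feature_map(feature_map, 0, 255, stride=2)
padded = fetch_feature_map(feature_map, 0, 100, stride=1)
```

The first call yields 128 values in two full groups; the second yields 101 values, so its second group is completed with zeros.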
The data post-processing instruction performs one or more of the following operations: pooling, activation, fixed-point truncation, rounding and element-wise vector addition. Its configurable parameters include the pooling type, the pooling size, the activation type and the fixed-point truncation position.
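The post-processing operations can be chained as in the Python sketch below. The order of the stages, the choice of ReLU and max pooling, the round-to-nearest right shift and the saturation to signed 8 bits are all assumptions made for illustration; the text only names the operation types and their parameters.

```python
def max_pool2d(fm, size):
    """Non-overlapping max pooling over a 2D map (single channel)."""
    h, w = len(fm), len(fm[0])
    return [[max(fm[y + dy][x + dx]
                 for dy in range(size) for dx in range(size))
             for x in range(0, w - size + 1, size)]
            for y in range(0, h - size + 1, size)]

def truncate_fixed(value, shift, bits=8):
    """Round-to-nearest right shift, then saturate to a signed `bits` range."""
    if shift > 0:
        value = (value + (1 << (shift - 1))) >> shift
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def post_process(fm, residual=None, activation="relu", pool=None,
                 pool_size=2, shift=0):
    if residual is not None:                 # element-wise vector addition
        fm = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(fm, residual)]
    if activation == "relu":
        fm = [[max(v, 0) for v in row] for row in fm]
    if pool == "max":
        fm = max_pool2d(fm, pool_size)
    return [[truncate_fixed(v, shift) for v in row] for row in fm]

accumulators = [[-4, 10, 300, 7],
                [5, 6, -1, 2000],
                [1, 2, 3, 4],
                [8, -9, 10, 11]]
processed = post_process(accumulators, pool="max", pool_size=2, shift=2)
```

In this example the value 2000 survives pooling, becomes 500 after the shift by 2, and is then saturated to 127 by the 8-bit truncation.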
The compute instruction performs vector inner-product operations arranged according to vectors of different lengths. The basic computing unit used for the vector inner-product operation is an inner-product module for two vectors of length 32 (32 is the vector length; each vector contains 32 8-bit data), and the adjustable parameters of the compute instruction include the number of output results.
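One way to picture how length-32 inner-product units serve vectors of different lengths is sketched below in Python. The idea of summing the partial results of adjacent units to form fewer, longer inner products is an assumption about how the "number of output results" parameter might be used; the function names and the four-unit configuration are hypothetical.

```python
UNIT = 32  # length of the basic inner-product unit stated in the text

def unit_dot(a, b):
    """Basic unit: inner product of two length-32 vectors."""
    if len(a) != UNIT or len(b) != UNIT:
        raise ValueError("basic unit expects vectors of length 32")
    return sum(x * y for x, y in zip(a, b))

def compute(a_vecs, b_vecs, num_outputs):
    """Combine per-unit partial sums into `num_outputs` results."""
    partial = [unit_dot(a, b) for a, b in zip(a_vecs, b_vecs)]
    if len(partial) % num_outputs != 0:
        raise ValueError("units must divide evenly among outputs")
    per_output = len(partial) // num_outputs
    return [sum(partial[i * per_output:(i + 1) * per_output])
            for i in range(num_outputs)]

a_vecs = [[1] * UNIT for _ in range(4)]
b_vecs = [[k] * UNIT for k in (1, 2, 3, 4)]
four_short = compute(a_vecs, b_vecs, num_outputs=4)   # four length-32 products
two_medium = compute(a_vecs, b_vecs, num_outputs=2)   # two length-64 products
one_long = compute(a_vecs, b_vecs, num_outputs=1)     # one length-128 product
```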
Each conditional instruction has a trigger condition that is hard-coded in hardware and a corresponding register; once the trigger condition is satisfied, the memory read, memory write, data fetch, data post-processing or computation is executed. After the instructions have been compiled, the OPU reads them upon a start signal sent from the GUI and runs them in the parallel computation mode defined by the instructions, completing the acceleration of different target networks. The trigger conditions are hard-coded in hardware. For example, the memory read module has six instruction trigger conditions, including: 1. trigger when the previous memory read has completed and the previous data fetch and rearrangement has completed; 2. trigger when the previous memory write operation has completed; 3. trigger when the previous data post-processing operation has completed; and so on. Because the conditional instructions carry trigger conditions, the time-consuming drawback of existing instruction sequences, which rely entirely on a preset execution order, is avoided: memory reads that operate repeatedly in the same mode do not have to be executed in a preset order at fixed intervals, which greatly shortens the instruction sequence, further speeds up instruction execution, and facilitates accelerating different target networks through fast instruction-based configuration. As shown in Fig. 2, consider two operations, read and write. The initial TCI is set at t0; the memory read is triggered at t1 and executes from t1 to t5. The TCI for the next trigger can be updated at any time between t1 and t5; the current TCI is stored until a new instruction updates it. In this case, when the memory read operates repeatedly in the same mode, no further instruction is needed (at times t6 and t12 the operation is triggered by the same TCI), which shortens the instruction sequence by more than 10x. Meanwhile, conditional instructions execute once their trigger conditions are satisfied and their configuration parameters are supplied by unconditional instructions, so instruction execution is accurate and the instruction stalls caused by large timing uncertainty are avoided.
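The effect of a stored trigger condition can be illustrated with a short Python model of the timeline described for Fig. 2. The class, the event names and the use of a string as the stored TCI value are hypothetical; only the time points t0, t1, t6 and t12 are taken from the text.

```python
class ReadModule:
    """Toy memory-read module with a persistent trigger-condition register."""

    def __init__(self):
        self.tci = None          # stored trigger condition
        self.params = None       # stored parameter registers
        self.executions = []

    def configure(self, tci, params):
        # Corresponds to one instruction updating the registers.
        self.tci, self.params = tci, params

    def on_event(self, time, event):
        # The stored TCI stays armed, so the same event fires again later.
        if event == self.tci:
            self.executions.append((time, self.params["mode"]))

reader = ReadModule()
instructions_issued = 0

reader.configure("fetch_done", {"mode": "A1"})   # t0: the only instruction
instructions_issued += 1

events = [(1, "fetch_done"), (3, "write_done"),
          (6, "fetch_done"), (12, "fetch_done")]
for time, event in events:
    reader.on_event(time, event)
```

A single configuring instruction produces three reads, at t1, t6 and t12, because the stored condition keeps matching; an instruction stream without stored triggers would need one instruction per read.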
Embodiment 3
Based on Embodiment 1: when the instruction set is used for CNN acceleration, the instruction sequence often contains several consecutive identical instructions. The definition mode of the instruction sequence is therefore specified when the instruction set is defined, specifically: where the instruction sequence would contain several consecutive identical instructions, only a single instruction is defined, and that instruction is executed repeatedly until the contents of the trigger condition register and the parameter registers are updated. Only the first of several consecutive identical instructions is defined; the trigger condition register and the parameter registers hold their contents until they are updated, which facilitates accelerating different target networks through fast instruction-based configuration.
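The sequence rule above amounts to keeping only the first instruction of every run of identical consecutive instructions. A minimal Python sketch, with instructions represented as hypothetical (name, mode) tuples:

```python
from itertools import groupby

def compress(sequence):
    """Keep only the first of each run of identical consecutive instructions."""
    return [instruction for instruction, _run in groupby(sequence)]

raw_sequence = [
    ("r", "A1"), ("r", "A1"), ("r", "A1"),   # three identical reads
    ("f", "mode0"),
    ("c", "mode0"), ("c", "mode0"),          # two identical computes
    ("r", "A1"),                             # a new run, so it is kept
]
compressed = compress(raw_sequence)
```

Only consecutive repeats are merged: the final read is kept because other instructions separate it from the earlier reads.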
The unconditional instructions must define many kinds of parameters, which would make the instructions long. To reduce the instruction length, a unified way of organizing the unconditional-instruction parameters is defined: parameters are grouped when their update frequencies coincide, that is, parameters that are updated at the same frequency are placed in the same unconditional instruction so that all bits of the instruction are used. This reduces the total number of instructions, greatly shortens the instruction length, and facilitates accelerating different target networks through fast instruction-based configuration.
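Grouping parameters by update frequency and packing each group into one fixed-length word can be sketched in Python as follows. The 32-bit word size, the bit widths, the frequency labels and the parameter names are all hypothetical; the text states only that the instruction length is uniform and that parameters with the same update frequency share an instruction.

```python
WORD_BITS = 32  # assumed uniform instruction length; the text gives no value

def group_by_frequency(params):
    """params maps name -> (bit_width, update_frequency_label)."""
    groups = {}
    for name, (width, freq) in params.items():
        groups.setdefault(freq, []).append((name, width))
    for freq, fields in groups.items():
        if sum(width for _, width in fields) > WORD_BITS:
            raise ValueError(f"group {freq!r} does not fit in one instruction")
    return groups

def pack(fields, values):
    """Pack the field values into one word, lowest bits first."""
    word, offset = 0, 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << offset
        offset += width
    return word

def unpack(fields, word):
    """Inverse of pack: recover the field values from a word."""
    values, offset = {}, 0
    for name, width in fields:
        values[name] = (word >> offset) & ((1 << width) - 1)
        offset += width
    return values

PARAMS = {  # hypothetical widths and update frequencies
    "input_width": (10, "per_layer"),
    "input_height": (10, "per_layer"),
    "input_channels": (12, "per_layer"),
    "read_start_addr": (24, "per_block"),
    "read_mode": (1, "per_block"),
}
groups = group_by_frequency(PARAMS)
layer_values = {"input_width": 224, "input_height": 224, "input_channels": 64}
layer_word = pack(groups["per_layer"], layer_values)
```

The three per-layer parameters fill exactly 32 bits of one word, so a layer change costs one unconditional instruction instead of three; a real encoding would also need room for an opcode, which this sketch omits.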
The above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its protection scope.