CN110058882A

CN110058882A - Method for defining an OPU instruction set for CNN acceleration

Info

Publication number: CN110058882A
Application number: CN201910192455.0A
Authority: CN
Inventors: 喻韵璇; 王铭宇
Original assignee: Chengdu Star Innovation Technology Co Ltd
Current assignee: Shenzhen Biong Core Technology Co ltd
Priority date: 2019-03-14
Filing date: 2019-03-14
Publication date: 2019-07-26
Anticipated expiration: 2039-03-14
Also published as: CN110058882B

Abstract

The invention discloses a method for defining an OPU instruction set for CNN acceleration, relating to the field of instructions for CNN acceleration processors. The method includes defining conditional instructions, defining unconditional instructions, and setting the instruction granularity. Unconditional instructions provide configuration parameters for conditional instructions. Each conditional instruction has a trigger condition that is hard-wired in hardware and a corresponding trigger-condition register, and it executes once its trigger condition is satisfied. An unconditional instruction executes immediately after being read and replaces the contents of the parameter registers. According to the CNN network and the acceleration requirements, the invention selects a computation mode that parallelizes over input and output channels and sets the instruction granularity accordingly. The instruction set avoids the problem that large uncertainty in operation cycles makes the instruction ordering unpredictable. The instruction set and its corresponding processor, the OPU, can be implemented on an FPGA or as an ASIC; the OPU can accelerate different target CNN networks without hardware reconfiguration.

Description

Method for defining an OPU instruction set for CNN acceleration

Technical field

The present invention relates to the field of methods for defining CNN accelerator instruction sets, and in particular to a method for defining an OPU instruction set for CNN acceleration.

Background art

Deep convolutional neural networks (CNNs) achieve very high accuracy in a variety of applications, such as visual object recognition, speech recognition, and object detection. However, this breakthrough in accuracy comes at a high computational cost and has to be driven by acceleration on computing clusters, GPUs, and FPGAs. Among these, FPGA accelerators offer high energy efficiency, good flexibility, and strong computing capability, and they stand out especially in CNN applications on edge devices, such as speech recognition and visual object recognition on smartphones. Designing an FPGA CNN accelerator usually involves architecture exploration and optimization, RTL programming, hardware implementation, and software-hardware interface development. As the field has developed, automatic compilers for FPGA-based CNN (convolutional neural network) acceleration have been studied in depth; configurable platforms provide abundant parallel computing resources and high energy efficiency, making them an ideal choice for CNN acceleration at the edge and in data centers. However, as DNN (deep neural network) algorithms are applied to increasingly complex computer vision tasks, such as face recognition, license plate recognition, and gesture recognition, cascades of multiple DNNs are widely used to obtain better performance. These new application scenarios require different networks to be executed in sequence, so the FPGA device has to be reconfigured constantly, which is time-consuming. In addition, every update to the customer's network architecture requires regenerating the RTL code and repeating the entire implementation flow, which takes even longer.

In recent years, automatic accelerator generators that can quickly deploy CNNs to FPGAs have become another focus of research. In the prior art, researchers have developed Deep weaver, which maps CNN algorithms onto hand-optimized design templates according to the resource allocation and hardware organization provided by the design planner. Others have proposed a compiler based on an RTL module library, which consists of multiple optimized, hand-coded Verilog templates describing the computation and data flow of different layer types. Both of these works achieve performance comparable to custom-designed accelerators. Other researchers have provided an HLS-based compiler that focuses mainly on bandwidth optimization through memory access reorganization, and still others have proposed a systolic array architecture to achieve a higher FPGA operating frequency. However, existing FPGA acceleration work aims to generate a specific, separate accelerator for each CNN. This guarantees reasonably high performance for RTL-based or HLS-RTL-based templates, but makes hardware upgrades highly complex whenever the target network changes. Therefore, to avoid generating a network-specific hardware description and re-burning the FPGA, and to have the entire deployment flow completed by instruction configuration, a completely new CNN acceleration system is proposed in which different target networks are configured through instructions without reconfiguring the FPGA accelerator. An OPU (Overlay Processor Unit) instruction set is defined, a compiler produces an instruction sequence from the defined instruction set, and the OPU executes the compiled instructions to accelerate the CNN. Realizing this acceleration requires considering how to define the instructions so that networks of different structures can be mapped and reorganized onto a specific structure, giving the instruction-controlled processor good generality.

On the other hand, when external memory is involved, cycle-accurate modeling of memory read and write operations is imprecise, because refresh time and other overheads may be incurred while the external memory is in use. If instructions are executed immediately after decoding, the order of operations can only be controlled by the order of the instruction sequence; without accurately modeled operation cycles, controlling the start point of operations that execute in parallel becomes intractable. Meanwhile, the start conditions of the main operations vary only within a limited set, and an operation is usually triggered after several preceding steps reach a certain state. As a result, the execution time of each instruction is highly uncertain. An instruction set definition method is therefore needed that overcomes the above problems: it should provide an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure, improve the generality of the instruction-controlled processor, allow different target networks to be configured by instructions, and realize general CNN acceleration through the OPU.

Summary of the invention

The object of the invention is to provide a method for defining an OPU instruction set for CNN acceleration. The method provides an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure and improves the generality of the instruction-controlled processor, so that different networks can be run without reconfiguring the FPGA.

The technical solution adopted by the invention is as follows:

A method for defining an OPU instruction set for CNN acceleration includes defining conditional instructions, defining unconditional instructions, and setting the instruction granularity.

Defining conditional instructions includes the following steps:

Constructing the conditional instructions, which include a memory read instruction, a memory write instruction, a data fetch instruction, a data post-processing instruction, and a compute instruction;

Setting the registers and the execution mode of the conditional instructions, where the execution mode is to execute once the trigger condition hard-wired in hardware is satisfied, and the registers include parameter registers and trigger-condition registers;

Setting the parameter configuration mode of the conditional instructions, where the parameters are configured by unconditional instructions.

Defining unconditional instructions includes the following steps:

Defining the parameters of the unconditional instructions;

Defining the execution mode of the unconditional instructions, which is to execute directly after being read.

Setting the instruction granularity includes the following steps:

Collecting statistics on CNN networks and acceleration requirements;

Determining the computation mode from the statistics and the selected parallel input and output channels, and setting the instruction granularity.

Preferably, the memory read instruction performs a read operation in Mode A1 or in Mode A2, with a granularity of n numbers read per operation, n > 1;

Mode A1: read n numbers onward from a specified address;

Mode A2: read n numbers according to an address stream, where the addresses in the stream are non-contiguous. There are three post-read operations: 1. no operation after reading; 2. splicing to a specified length after reading; 3. splitting to a specified length after reading. There are four on-chip storage destinations after a read: the feature-map memory module, the inner-product parameter memory module, the bias parameter memory module, and the instruction memory module;

The adjustable parameters of the memory read instruction include the start address, the number of operands, the post-read processing mode, and the on-chip storage location.
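The two read modes and the adjustable parameters listed above can be pictured with a small software model. This is only a sketch under stated assumptions: the names `ReadParams`, `PostRead`, `Destination`, and `memory_read` are hypothetical, the memory is modeled as a plain Python list, and the post-read operation and destination are carried as fields without being modeled further.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence


class PostRead(Enum):
    """The three post-read operations named in the text (not modeled further here)."""
    NONE = 0
    SPLICE = 1
    SPLIT = 2


class Destination(Enum):
    """The four on-chip destinations named in the text."""
    FEATURE_MAP = 0
    INNER_PRODUCT_PARAM = 1
    BIAS = 2
    INSTRUCTION = 3


@dataclass
class ReadParams:
    """Hypothetical record of the adjustable parameters of a memory read."""
    start_address: int
    count: int                                   # n, the number of operands
    post_read: PostRead = PostRead.NONE
    destination: Destination = Destination.FEATURE_MAP
    address_stream: Optional[Sequence[int]] = None   # given only for Mode A2


def memory_read(memory: Sequence[int], p: ReadParams) -> List[int]:
    """Return the values selected by Mode A1 (contiguous) or Mode A2 (address stream)."""
    if p.address_stream is None:
        # Mode A1: n values onward from the start address.
        return list(memory[p.start_address:p.start_address + p.count])
    # Mode A2: one value per address; the addresses need not be contiguous.
    return [memory[a] for a in p.address_stream[:p.count]]


memory = list(range(100, 120))
a1 = memory_read(memory, ReadParams(start_address=4, count=3))
a2 = memory_read(memory, ReadParams(start_address=0, count=3,
                                    address_stream=[1, 7, 12]))
print(a1, a2)
```

In this model Mode A2 ignores the start address and is driven entirely by the address stream; whether the hardware combines the two is not stated in the text.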

Preferably, the memory write instruction performs a write operation in Mode B1 or in Mode B2, with a granularity of n numbers written per operation, n > 1;

Mode B1: write n numbers onward from a specified address;

Mode B2: write n numbers according to a destination address stream, where the addresses in the stream are non-contiguous;

The adjustable parameters of the memory write instruction include the start address and the number of operands.

Preferably, the data fetch instruction reads data from the on-chip feature-map memory and the inner-product parameter memory according to different data read patterns and data rearrangement patterns, and rearranges the data that has been read; its granularity is 64 input data operated on simultaneously. The adjustable parameters of the data fetch instruction cover reading the feature-map memory and reading the inner-product parameter memory: for the feature-map memory they include the read address constraints (the minimum and maximum addresses), the read stride, and the rearrangement mode; for the inner-product parameter memory they include the read address constraints and the read mode.

Preferably, the data post-processing instruction performs one or more of pooling, activation, fixed-point truncation, rounding, and element-wise vector addition; its granularity is a multiple of 64 data per operation. The adjustable parameters of the data post-processing instruction include the pooling type, the pooling size, the activation type, and the fixed-point truncation position.

Preferably, the compute instruction performs vector inner-product operations, allocating the computation according to different vector lengths; its granularity is 32. The basic computing unit for the inner-product operation is an inner-product module that takes two vectors of length 32, and the adjustable parameters of the compute instruction include the number of output results.

Preferably, the unconditional instructions provide parameter updates. The parameters include the length, width, and channel count of the on-chip feature-map memory module; the length and width of the current layer's input; the current layer's input channel count and output channel count; the memory read start address and read mode selection; the memory write start address and write mode selection; the data fetch mode and constraints; the computation mode setting; the pooling-related parameters; the activation-related parameters; and the data shift, truncation, and rounding settings.

Preferably, the method further includes setting an instruction sequence definition mode: if the instruction sequence contains multiple consecutive repeated instructions, only a single instruction is defined, and that instruction is executed repeatedly until the contents of the trigger-condition register and the parameter registers are updated.

Preferably, the method further includes defining the instruction length, which is uniform across instructions.

Preferably, the minimum unit of the parallel input/output-channel computation mode is a vector inner product of length 32.

In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are:

1. the present invention is by defining conditional instruction, defining imperative statement and setting instruction granularity, imperative statement Configuration parameter is provided for conditional instruction, trigger condition is arranged in conditional instruction, and trigger condition is written firmly within hardware, has ready conditions Corresponding register is arranged in instruction, executes after trigger condition satisfaction, imperative statement directly executes after being read, alternative parameter Content of registers avoids existing operation cycle uncertainty from leading to not predict instruction reorder greatly, realizes that Accurate Prediction instruction is suitable Sequence；Calculating mode is determined according to the parallel input of CNN network, acceleration demand and selection and output channel, and instruction particle is set Degree is realized and recombinates the network mapping of different structure to specific structure, using the various sizes of network of parallel computation mode adaptive Kernel size, solve the universality of instruction set alignment processing device, be that CNN OverDrive Processor ODP according to instruction completes different mesh The configuration for marking network accelerates general CNN accelerating velocity and provides applicable OPU instruction set；

2. trigger condition is arranged in conditional instruction of the invention, the sequence that existing instruction sequence fully relies on setting is avoided The shortcomings that time-consuming is executed, realizes that memory reading is continuously operated with model identical, without the fixed intervals sequence by setting It executes, greatly shortens the length of instruction sequence, conducive to the acceleration for realizing different target network by instruction rapid configuration；

3. the parallel input and output channel according to statistical result and selection determine calculating mode, and instruction particle is arranged Degree, can by parameter regulation parallel section input channel to calculate more output channels simultaneously, or parallel more multiple input path with Calculating wheel number is reduced, input channel and output channel are 32 multiple in universal CNN structure, and choosing 32 can as basis unit The peak use rate of computing unit is effectively ensured；

It is updated 4. imperative statement of the invention provides parameter, the synchronous parameter of renewal frequency is classified into same without item To make full use of all bits of instruction in part instruction, reduces total instruction and call item number；

5. a single instruction is only arranged in the present invention when having continuous a plurality of repetitive instruction, which is repeatedly executed at predetermined intervals, until Trigger condition register and parameter register keep content until being updated, and are conducive to realize different target by instruction rapid configuration The acceleration of network.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings show only some embodiments of the invention and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can derive other related drawings from them without creative effort.

Fig. 1 is a flow chart of the instruction set definition method of the invention;

Fig. 2 is a schematic diagram of how conditional instructions of the invention trigger operations;

Fig. 3 is a schematic diagram of the parallel computation mode of the invention;

Fig. 4 is a schematic diagram of the instruction set of the invention;

Fig. 5 is a schematic diagram of the structure of the OPU based on the instruction set of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and not to limit it; that is, the described embodiments are only some of the embodiments of the invention, not all of them. The components of the embodiments of the invention that are generally described and illustrated in the drawings can be arranged and designed in a variety of different configurations.

Therefore, the following detailed description of the embodiments of the invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

It should be noted that relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to be non-exclusive, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.

The features and performance of the invention are described in further detail below with reference to the embodiments.

Embodiment 1

A method for defining an OPU instruction set for CNN acceleration includes defining conditional instructions, defining unconditional instructions, and setting the instruction granularity.

When the defined instruction set is used for CNN acceleration, the instruction types, the operation corresponding to each instruction type, the general parameter definitions, and the instruction granularity must be defined; the general parameter definitions include the instruction length and the instruction sequence. Running OPU instructions includes the following steps:

Step 1: read an instruction block (the instruction set is the collection of all instructions; an instruction block is a group of consecutive instructions, and the instructions for executing one network comprise multiple instruction blocks);

Step 2: fetch the unconditional instructions in the instruction block and execute them directly, decoding the parameters they contain and writing them into the corresponding registers; fetch the conditional instructions in the instruction block, set the trigger conditions according to the conditional instructions, and go to step 3;

Step 3: judge whether the trigger condition is satisfied; if so, execute the conditional instruction; if not, do not execute it;

Step 4: judge whether the read instruction for the next instruction block contained in the instructions satisfies its trigger condition; if so, return to step 1 and continue executing instructions; otherwise the register parameters and the trigger condition set by the current conditional instruction remain unchanged until the trigger condition is satisfied.
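Steps 2 and 3 above can be sketched as a small software model. This is a minimal illustration, not the OPU hardware: the classes `Unconditional` and `Conditional`, the function `run_block`, the event names, and the convention that finishing an operation raises an `<op>_done` event are all assumptions made for the example. Events also stay set once raised, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union


@dataclass
class Unconditional:
    params: Dict[str, int]      # parameter updates carried by the instruction


@dataclass
class Conditional:
    op: str                     # operation name, e.g. "read" or "compute"
    trigger: str                # event that must have occurred before it runs


Instruction = Union[Unconditional, Conditional]


def run_block(block: List[Instruction], registers: Dict[str, int],
              events: Set[str]) -> Tuple[List[str], List[Conditional]]:
    """Run one instruction block; return (operations executed, still waiting)."""
    executed: List[str] = []
    pending: List[Conditional] = []
    for ins in block:
        if isinstance(ins, Unconditional):
            registers.update(ins.params)     # executes directly when read
        else:
            pending.append(ins)              # waits for its trigger condition
    progress = True
    while pending and progress:
        progress = False
        for ins in list(pending):
            if ins.trigger in events:
                executed.append(ins.op)
                events.add(ins.op + "_done")  # hypothetical completion event
                pending.remove(ins)
                progress = True
    return executed, pending


registers: Dict[str, int] = {}
block: List[Instruction] = [
    Unconditional({"read_start_address": 0, "read_count": 64}),
    Conditional("compute", trigger="fetch_done"),
    Conditional("fetch", trigger="read_done"),
    Conditional("read", trigger="start"),
]
executed, waiting = run_block(block, registers, events={"start"})
print(executed, waiting, registers)
```

In this model the conditional instructions run in the order their triggers are satisfied (read, then fetch, then compute), not in the order they appear in the block, which is the property the trigger mechanism is meant to provide.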

The instruction set defined in this application is used in an OPU-based CNN acceleration system; the OPU structure is shown in Fig. 5. The OPU is implemented on an FPGA or as an ASIC. The final executable instructions are generated from the defined instructions, and by running them the OPU can accelerate different target CNN networks. The technical means adopted are: defining conditional instructions, defining unconditional instructions, and setting the instruction granularity; the flow chart is shown in Fig. 1. For conditional instructions, their composition is defined, and the conditional instructions comprise six types of instructions; the registers and execution mode of the conditional instructions are set, where the execution mode is to execute once the trigger condition hard-wired in hardware is satisfied, and the registers include parameter registers and trigger-condition registers; the parameter configuration mode of the conditional instructions is set, where the parameters are configured by unconditional instructions. Defining unconditional instructions includes defining their parameters and defining their execution mode, namely direct execution. The instruction length is defined to be uniform, and the instruction set has the structure shown in Fig. 4. Setting the instruction granularity consists of collecting statistics on CNN networks and acceleration requirements, determining the computation mode from the statistics and the selected parallel input and output channels, and setting the instruction granularity. The OPU instruction set includes conditional instructions (C-type instructions) and unconditional instructions (U-type instructions); the resulting instruction sequence is shown in Fig. 4.

The composition of the conditional instructions is shown in Table 1:

Table 1

Instruction name	Instruction function
r	Memory read instruction
w	Memory write instruction
f	Data fetch instruction
c	Compute instruction
p	Data post-processing instruction

The instruction granularity of each instruction type is set according to the CNN network structure and the acceleration requirements. For the memory read instruction, the granularity is set according to CNN acceleration characteristics to n numbers read per operation, n > 1. For the memory write instruction, the granularity is set according to CNN acceleration characteristics to n numbers written per operation, n > 1. For the data fetch instruction, according to the structure of CNN networks, the granularity is a multiple of 64, that is, 64 input data are operated on simultaneously. For the data post-processing instruction, the granularity is a multiple of 64 data per operation. For the compute instruction, because the product of the network's input and output channel counts is a multiple of 32, the granularity is 32.

The parameters defined by the unconditional instructions are shown in Table 2:

Table 2

The computation mode parallelizes over input and output channels. Through parameters, part of the input channels can be processed in parallel so that more output channels are computed at the same time, or more input channels can be processed in parallel to reduce the number of computation rounds. In common CNN structures the input and output channel counts are multiples of 32, so this embodiment selects a vector inner product of length 32 as the minimum unit of the parallel input/output-channel computation mode, which effectively ensures maximum utilization of the computing units. When the instruction set is used for CNN acceleration, the parallel input/output-channel computation mode is as shown in Fig. 3. In each clock cycle, a slice of size 1*1 and depth ICS input channels is read together with the corresponding kernel elements; these elements match the natural data storage pattern and require only a small bandwidth. Parallelism is realized across input channels (ICS) and output channels (OCS, the number of kernel sets involved). Fig. 3(c) further illustrates the computation process. In cycle 0 of round 0, the input-channel slice at position (0,0) is read; in the next cycle, the read jumps by the stride x to position (0,2); reading continues until all pixels corresponding to kernel position (0,0) have been computed. Round 1 then begins, reading all pixels corresponding to kernel position (0,1) starting from position (0,1). Computing a block of data of size IN*IM*IC with OC sets of kernels requires Kx*Ky*(IC/ICS)*(OC/OCS) rounds. With this computation mode, data for any kernel size or stride is handled in a uniform way, which greatly simplifies data management before computation, achieves higher efficiency with less resource consumption, and adapts to the kernel sizes of networks of different sizes.
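The round structure described above can be checked with a small pure-Python model. This is a sketch under assumptions, not the hardware data path: the function names and the nested-list tensor layout (`x[row][col][channel]`, `w[out_channel][ky][kx][channel]`) are chosen for the example, and `math.ceil` is used so that channel counts that are not exact multiples of ICS or OCS still give a round count, a case the text does not address. Each round handles one kernel position, one group of ICS input channels, and one group of OCS output channels; inside a round, every step reads one 1*1 slice.

```python
import math
from typing import List, Tuple

Tensor3 = List[List[List[int]]]


def num_rounds(kx: int, ky: int, ic: int, oc: int, ics: int, ocs: int) -> int:
    """Round count Kx*Ky*(IC/ICS)*(OC/OCS) from the text."""
    return kx * ky * math.ceil(ic / ics) * math.ceil(oc / ocs)


def conv_by_rounds(x: Tensor3, w: List[Tensor3], stride: int,
                   ics: int, ocs: int) -> Tuple[Tensor3, int]:
    """Convolve by accumulating 1*1 channel slices, one kernel position per round."""
    h, wd, ic = len(x), len(x[0]), len(x[0][0])
    oc, ky_n, kx_n = len(w), len(w[0]), len(w[0][0])
    oh, ow = (h - ky_n) // stride + 1, (wd - kx_n) // stride + 1
    out = [[[0] * oc for _ in range(ow)] for _ in range(oh)]
    rounds = 0
    for ky in range(ky_n):
        for kx in range(kx_n):
            for c0 in range(0, ic, ics):
                for o0 in range(0, oc, ocs):
                    rounds += 1
                    for oy in range(oh):
                        for ox in range(ow):
                            # One step: a 1*1 slice of up to ICS channels.
                            px = x[oy * stride + ky][ox * stride + kx][c0:c0 + ics]
                            for o in range(o0, min(o0 + ocs, oc)):
                                kern = w[o][ky][kx][c0:c0 + ics]
                                out[oy][ox][o] += sum(a * b for a, b in zip(px, kern))
    return out, rounds


def conv_direct(x: Tensor3, w: List[Tensor3], stride: int) -> Tensor3:
    """Reference convolution used only to check the round-based version."""
    h, wd, ic = len(x), len(x[0]), len(x[0][0])
    oc, ky_n, kx_n = len(w), len(w[0]), len(w[0][0])
    oh, ow = (h - ky_n) // stride + 1, (wd - kx_n) // stride + 1
    return [[[sum(x[oy * stride + ky][ox * stride + kx][c] * w[o][ky][kx][c]
                  for ky in range(ky_n) for kx in range(kx_n) for c in range(ic))
              for o in range(oc)] for ox in range(ow)] for oy in range(oh)]


# Small deterministic example: 5*5 input, 4 channels, 4 kernels of 3*3, stride 2.
x = [[[(r * 7 + c * 3 + ch) % 5 - 2 for ch in range(4)] for c in range(5)]
     for r in range(5)]
w = [[[[(o + 2 * ky + 3 * kx + ch) % 3 - 1 for ch in range(4)] for kx in range(3)]
      for ky in range(3)] for o in range(4)]
out, rounds = conv_by_rounds(x, w, stride=2, ics=2, ocs=2)
print(rounds, out == conv_direct(x, w, stride=2))
```

With stride 2, the first two reads of round 0 land on positions (0,0) and (0,2), matching the walk-through in the text.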

In summary, existing FPGA acceleration work aims to generate a specific, separate accelerator for each CNN. To run different networks without reconfiguring the FPGA, this application provides an acceleration processor controlled by the instructions defined herein. The prior art offers no technical teaching for the instructions in the OPU instruction set of this application, because the hardware, system, and coverage differ from those of prior FPGA acceleration systems. The invention determines the computation mode from the CNN network, the acceleration requirements, and the selected parallel input and output channels, and sets the instruction granularity accordingly, so that networks of different structures are mapped and reorganized onto a specific structure and the kernel sizes of networks of different sizes are accommodated; this solves the generality problem of the processor corresponding to the instruction set. Conditional instructions and unconditional instructions are defined: unconditional instructions provide configuration parameters for conditional instructions; each conditional instruction has a trigger condition hard-wired in hardware and a corresponding register, and executes once the trigger condition is satisfied; an unconditional instruction executes directly after being read and replaces the parameter register contents. Conditional instructions therefore run according to their trigger conditions with parameters supplied by unconditional instructions, so the instruction execution order is accurate and unaffected by other factors. This overcomes the problem in CNN acceleration systems that the execution time of each instruction is highly uncertain and the instruction order cannot be predicted accurately. The invention thus provides an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure. The instruction set and its corresponding processor, the OPU, can be implemented on an FPGA or as an ASIC, which improves the generality of the instruction-controlled processor, and the OPU can accelerate different target CNN networks without hardware reconfiguration.

Embodiment 2

Based on Embodiment 1, the conditional instructions of this application comprise six types of instructions, including the memory read instruction, the memory write instruction, the data fetch instruction, the data post-processing instruction, and the compute instruction. A conditional instruction executes once the trigger condition hard-wired in hardware is satisfied; the registers of a conditional instruction include parameter registers and a trigger-condition register; and the parameters of a conditional instruction are configured by unconditional instructions.

The memory read instruction performs a read operation in Mode A1 or in Mode A2; its adjustable parameters include the start address, the number of operands, the post-read processing mode, and the on-chip storage location.

Mode A1: read n numbers onward from a specified address, where n is a positive integer;

The memory write instruction performs a write operation in Mode B1 or in Mode B2; its adjustable parameters include the start address and the number of operands.

Mode B1: write n numbers onward from a specified address;

The data fetch instruction reads data from the on-chip feature-map memory and the inner-product parameter memory according to different data read patterns and data rearrangement patterns, and rearranges the data that has been read. The adjustable parameters of the data fetch and rearrangement instruction cover reading the feature-map memory and reading the inner-product parameter memory: for the feature-map memory they include the read address constraints (the minimum and maximum addresses), the read stride, and the rearrangement mode; for the inner-product parameter memory they include the read address constraints and the read mode.

The data post-processing instruction performs one or more of pooling, activation, fixed-point truncation, rounding, and element-wise vector addition; its adjustable parameters include the pooling type, the pooling size, the activation type, and the fixed-point truncation position.
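As an illustration of how such post-processing steps might be chained, the sketch below applies an activation, a pooling step, and a fixed-point truncation with rounding to a short list of accumulator values. Everything specific here is an assumption for the example: ReLU as the activation type, 1-D max pooling in place of 2-D pooling, round-to-nearest before the shift, saturation to signed 8 bits, and the order of the steps. The text only names the operations and their adjustable parameters.

```python
from typing import List


def relu(values: List[int]) -> List[int]:
    """Activation step (ReLU chosen as an example activation type)."""
    return [max(0, v) for v in values]


def max_pool_1d(values: List[int], size: int) -> List[int]:
    """Pooling step, simplified to 1-D max pooling over windows of `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def truncate_fixed_point(value: int, shift: int, bits: int = 8) -> int:
    """Drop `shift` low-order bits with round-to-nearest, then saturate to `bits`."""
    if shift > 0:
        value = (value + (1 << (shift - 1))) >> shift
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def post_process(acc: List[int], pool_size: int, shift: int) -> List[int]:
    """Activation, then pooling, then truncation (an assumed order)."""
    pooled = max_pool_1d(relu(acc), pool_size)
    return [truncate_fixed_point(v, shift) for v in pooled]


result = post_process([1000, -200, 48, 5000], pool_size=2, shift=4)
print(result)  # [63, 127]
```

Here `shift` plays the role of the fixed-point truncation position and `pool_size` the pooling size; the second output saturates because 5000 shifted right by 4 bits exceeds the signed 8-bit range.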

The compute instruction performs vector inner-product operations, allocating the computation according to different vector lengths. The basic computing unit for the inner-product operation is an inner-product module that takes two vectors of length 32 (32 is the vector length; each vector contains 32 8-bit data), and the adjustable parameters of the compute instruction include the number of output results.
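The idea of building longer inner products out of a fixed 32-element unit can be sketched as follows. This is an illustrative software model only: the function names are hypothetical, and zero-padding the tail of a vector whose length is not a multiple of 32 is an assumption, since the text does not say how such lengths are handled.

```python
from typing import Sequence

UNIT = 32  # length of the basic inner-product module stated in the text


def inner_product_32(a: Sequence[int], b: Sequence[int]) -> int:
    """Model of the basic unit: inner product of two length-32 vectors."""
    assert len(a) == UNIT and len(b) == UNIT
    return sum(x * y for x, y in zip(a, b))


def inner_product(a: Sequence[int], b: Sequence[int]) -> int:
    """Inner product of equal-length vectors, computed 32 elements at a time."""
    assert len(a) == len(b)
    pad = (-len(a)) % UNIT                 # zero-pad up to a multiple of 32
    a_p = list(a) + [0] * pad
    b_p = list(b) + [0] * pad
    return sum(inner_product_32(a_p[i:i + UNIT], b_p[i:i + UNIT])
               for i in range(0, len(a_p), UNIT))


a = list(range(1, 71))   # 70 values, all within the signed 8-bit range
b = [2] * 70
print(inner_product(a, b))  # 4970
```

A length-70 vector is split into three unit operations here, the last one mostly padding, which shows why channel counts that are multiples of 32 keep the units fully used.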

Trigger conditions are set for the conditional instructions and are hard-wired in hardware. Each conditional instruction has corresponding registers and, once its trigger condition is met, executes its operation: read storage, write storage, data fetch, data post-processing, or compute. After the instructions are compiled, the OPU reads them in response to the start signal sent from the GUI and runs them in the parallel computing mode defined by the instructions, thereby accelerating different target networks.

For example, the instruction for the memory read module has six trigger conditions in total, including: (1) triggered when the previous memory read and the previous data fetch and reorganization have both completed; (2) triggered when the previous write storage operation has completed; (3) triggered when the previous data post-processing operation has completed; and so on.

Setting trigger conditions for the conditional instructions avoids the drawback of existing instruction sequences, which depend entirely on a preset order and are therefore time-consuming to execute. Memory reads can run continuously in the same mode without being executed in a preset order at fixed intervals. This greatly shortens the instruction sequence, further speeds up instruction execution, and helps accelerate different target networks through rapid instruction configuration.

As shown in Fig. 2, consider two operations, read and write. The initial TCI is set at t0 and triggers a memory read at t1, which executes from t1 to t5. The TCI for the next trigger condition can be updated at any time between t1 and t5. The current TCI is stored until a new instruction updates it. In this case, when memory reads run continuously in the same mode, no further instruction is needed (at times t6 and t12 the operation is triggered by the same TCI), which shortens the instruction sequence by more than 10x. In addition, a conditional instruction executes only after its trigger condition is met, and its configuration parameters are supplied by unconditional parameter instructions, so instructions execute accurately, avoiding the instruction stalls that large uncertainty causes in existing approaches.
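The trigger mechanism described above can be modelled in software as follows. This is a simplified sketch and not the hardware design: the class `ConditionalUnit`, its methods, and the event names are hypothetical. It shows a stored trigger condition being reused, so that one configuring instruction is followed by several executions.

```python
class ConditionalUnit:
    """Hypothetical model of one module (e.g. memory read) driven by a TCI."""

    def __init__(self, name):
        self.name = name
        self.tci = None      # trigger condition register, kept until overwritten
        self.params = {}     # parameter register, kept until overwritten
        self.runs = 0        # number of times the operation has executed

    def configure(self, tci=None, **params):
        """Model an unconditional instruction: takes effect as soon as it is read."""
        if tci is not None:
            self.tci = tci
        self.params.update(params)

    def notify(self, event):
        """Model a hardware event; execute only if it matches the stored TCI."""
        if event == self.tci:
            self.runs += 1
            return True
        return False


reader = ConditionalUnit("memory_read")
# A single instruction sets the trigger condition and the read parameters.
reader.configure(tci="post_process_done", start_address=0)

# Hypothetical event stream produced by other modules.
events = ["post_process_done", "write_done",
          "post_process_done", "post_process_done"]
fired = [reader.notify(event) for event in events]
```

In this run the read executes three times although only one configuring instruction was issued, which is the effect the text attributes to keeping the TCI until it is updated.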

Embodiment 3

Based on Embodiment 1: when accelerating a CNN, the instruction sequence often contains runs of consecutive repeated instructions. The way instruction sequences are defined is therefore specified when the instruction set is defined. Specifically, if the sequence contains several consecutive repeated instructions, only a single instruction is set, and that instruction is executed repeatedly until the contents of the trigger condition register and the parameter register are updated. In other words, only the first of several consecutive repeated instructions is defined, and the trigger condition register and the parameter register hold their contents until they are updated. This helps accelerate different target networks through rapid instruction configuration.
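The sequence definition rule above amounts to keeping only the first instruction of each run of identical consecutive instructions. A minimal Python sketch follows, with the function name `compress_sequence` and the instruction strings chosen purely for illustration.

```python
def compress_sequence(instructions):
    """Keep only the first instruction of every run of consecutive repeats."""
    compressed = []
    for instruction in instructions:
        # A new entry is needed only when the instruction differs from the
        # one already held (the registers would otherwise keep their contents).
        if not compressed or compressed[-1] != instruction:
            compressed.append(instruction)
    return compressed


original = ["read", "read", "read", "compute", "compute", "read"]
shortened = compress_sequence(original)
```

Non-adjacent repeats are kept, because the registers have been updated in between and must be set again.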

An unconditional instruction needs to define many kinds of parameters, which makes the instruction long. To reduce the instruction length, a unified method for unconditional instruction parameters is defined: parameters are grouped according to update frequency, so that parameters updated at the same frequency are placed in the same unconditional instruction. This makes full use of all the bits of an instruction, reduces the total number of instructions issued, and greatly shortens the instruction length, which helps accelerate different target networks through rapid instruction configuration.
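Grouping parameters with the same update frequency into one instruction can be pictured as packing several bit fields into a single fixed-width word. The sketch below is only an illustration: the 32-bit word width, the field widths, and the helper names `pack_fields` and `unpack_fields` are assumptions, since the source does not give the instruction encoding.

```python
def pack_fields(fields, word_bits=32):
    """Pack (value, width) pairs into one word, lowest field first."""
    word = 0
    used = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its field")
        word |= value << used
        used += width
    if used > word_bits:
        raise ValueError("fields exceed the instruction word")
    return word


def unpack_fields(word, widths):
    """Recover the field values from a packed word, lowest field first."""
    values = []
    for width in widths:
        values.append(word & ((1 << width) - 1))
        word >>= width
    return values


# Hypothetical parameters that change together, e.g. once per layer:
# a 4-bit mode, an 8-bit channel count, and a 1-bit flag.
word = pack_fields([(3, 4), (10, 8), (1, 1)])
fields = unpack_fields(word, [4, 8, 1])
```

One packed word then replaces three separate parameter updates, which is the reduction in instruction count that the text describes.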

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An OPU instruction set definition method for CNN acceleration, characterized in that it comprises defining conditional instructions, defining unconditional instructions, and setting the instruction granularity;

Defining the conditional instructions comprises the following steps:

constructing the conditional instructions, which include a read storage instruction, a write storage instruction, a data fetch instruction, a data post-processing instruction, and a compute instruction;

setting the registers and the execution mode of the conditional instructions, wherein the execution mode is to execute once the hard-wired trigger condition is met, and the registers include a parameter register and a trigger condition register;

Defining the unconditional instructions comprises the following steps:

defining the parameters of the unconditional instructions;

Setting the instruction granularity comprises the following steps:

analyzing the CNN network and the acceleration requirements;

2. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the read storage instruction includes performing a read storage operation in mode A1 and performing a read storage operation in mode A2, and its granularity is n numbers read in per operation, n > 1;

Mode A1: n numbers are read sequentially starting from a specified address;

Mode A2: n numbers are read according to an address stream, wherein the addresses in the address stream are non-contiguous; there are three post-read operations: 1. no operation after reading; 2. splicing to a specified length after reading; 3. splitting into a specified length after reading; and there are four on-chip storage locations after the read operation: the feature map storage module, the inner-product parameter storage module, the bias parameter storage module, and the instruction storage module;

The adjustable parameters of the read storage instruction include the start address, the number of operands, the post-read processing mode, and the on-chip storage location.

3. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the write storage instruction includes performing a write storage operation in mode B1 and performing a write storage operation in mode B2, and its granularity is n numbers written out per operation, n > 1;

Mode B1: n numbers are written sequentially starting from a specified address

4. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the data fetch instruction includes reading data from the on-chip feature map memory and the inner-product parameter memory according to different data read modes and data reorganization modes, and reorganizing and arranging the data that has been read, and its granularity is 64 input data values operated on simultaneously; the adjustable parameters of the data fetch instruction cover reading the feature map memory and reading the inner-product parameter memory, wherein reading the feature map memory includes the read address constraints (the minimum address and the maximum address), the read stride, and the rearrangement mode; and reading the inner-product parameter memory includes the read address constraints and the read mode.

5. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the data post-processing instruction includes one or more of the operations of pooling, activation, fixed-point truncation, rounding, and element-wise vector addition, and its granularity is a multiple of 64 data values per operation; the adjustable parameters of the data post-processing instruction include the pooling type, pooling size, activation type, and fixed-point truncation position.

6. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the compute instruction performs vector inner-product operations dispatched according to vectors of different lengths, and its granularity is 32; the basic computing unit for the inner-product operation is an inner-product module that takes two vectors of length 32; and the adjustable parameters of the compute instruction include the number of output results.

7. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the unconditional instructions provide parameter updates, and the parameters include: the length, width, and number of channels of the on-chip feature map storage module; the input length and width of the current layer; the number of input channels and the number of output channels of the current layer; the start address of the read storage operation; the read operation mode selection; the start address of the write storage operation; the write operation mode selection; the data fetch mode and constraints; the compute mode setting; the pooling-related parameter settings; the activation-related parameter settings; and the settings for data shift, truncation, and rounding operations.

8. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that it further comprises setting the instruction sequence definition mode, specifically: if the instruction sequence contains several consecutive repeated instructions, only a single instruction is set, and that instruction is executed repeatedly until the contents of the trigger condition register and the parameter register are updated.

9. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that it further comprises defining the instruction length, wherein the instruction length is uniform.

10. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the minimum unit corresponding to the parallel input and output channel computing mode is a vector inner product of length 32.