CN110058882A - Method for defining an OPU instruction set for CNN acceleration - Google Patents

Method for defining an OPU instruction set for CNN acceleration

Info

Publication number
CN110058882A
CN110058882A (application CN201910192455.0A)
Authority
CN
China
Prior art keywords
instruction
parameter
mode
reading
opu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910192455.0A
Other languages
Chinese (zh)
Other versions
CN110058882B (en)
Inventor
喻韵璇
王铭宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Biong Core Technology Co ltd
Original Assignee
Chengdu Star Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Star Innovation Technology Co Ltd filed Critical Chengdu Star Innovation Technology Co Ltd
Priority to CN201910192455.0A priority Critical patent/CN110058882B/en
Publication of CN110058882A publication Critical patent/CN110058882A/en
Application granted granted Critical
Publication of CN110058882B publication Critical patent/CN110058882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a method for defining an OPU instruction set for CNN acceleration, and relates to the field of instructions for CNN acceleration processors. The method includes defining conditional instructions, defining unconditional instructions, and setting the instruction granularity. Unconditional instructions provide configuration parameters for conditional instructions; conditional instructions carry trigger conditions that are hard-wired in hardware, together with corresponding trigger-condition registers, and execute once their trigger conditions are met, while unconditional instructions execute directly upon being read and replace the contents of the parameter registers. The invention selects a computation mode with parallel input and output channels according to the CNN network and the acceleration requirements, and sets the instruction granularity accordingly. The instruction set of the invention avoids the problem that large uncertainty in operation cycles makes the instruction order impossible to predict. The instruction set and the corresponding OPU processor can be implemented on an FPGA or an ASIC; the OPU can accelerate different target CNN networks without hardware reconfiguration.

Description

Method for defining an OPU instruction set for CNN acceleration
Technical field
The present invention relates to the field of methods for defining CNN accelerator instruction sets, and in particular to a method for defining an OPU instruction set for CNN acceleration.
Background technique
Deep convolutional neural networks (CNNs) achieve very high accuracy in a variety of applications, such as visual object recognition, speech recognition, and object detection. However, this breakthrough in accuracy comes at a high computational cost, and requires acceleration by computing clusters, GPUs, or FPGAs. Among these, FPGA accelerators offer high energy efficiency, good flexibility, and strong computing power, and stand out in edge-device CNN applications such as speech recognition and visual object recognition on smartphones. Building such accelerators usually involves architecture exploration and optimization, RTL programming, hardware implementation, and software-hardware interface development. As research on automatic compilers for FPGA-accelerated CNNs (convolutional neural networks) has deepened, configurable platforms providing abundant parallel computing resources and high energy efficiency have become an ideal choice for CNN acceleration at the edge and in data centers. However, as DNN (deep neural network) algorithms develop toward more complex computer-vision tasks such as face recognition, license-plate recognition, and gesture recognition, cascaded structures of multiple DNNs are widely used to obtain better performance. These new application scenarios require sequential execution of different networks, and therefore require the FPGA device to be reconfigured repeatedly, which is time-consuming. On the other hand, each new update of a customer's network architecture leads to regeneration of the RTL code and a rerun of the entire implementation process, which takes even longer.
In recent years, automatic accelerator generators for CNNs deployed on FPGAs have become another focus. In the prior art, researchers developed Deep Weaver, which maps CNN algorithms to hand-optimized design templates according to the resource allocation and hardware organization provided by a design planner; others proposed a compiler based on an RTL module library, composed of multiple optimized hand-coded Verilog templates that describe the computation and data flow of different layer types. Both of these works achieve performance comparable to custom-designed accelerators. Other researchers provided an HLS-based compiler that focuses mainly on bandwidth optimization through memory-access reorganization, and yet others proposed a systolic-array architecture to achieve a higher FPGA operating frequency. However, existing FPGA acceleration work aims to generate a specific, standalone accelerator for each different CNN. This guarantees reasonably high performance for RTL-based or HLS-RTL-based templates, but makes hardware upgrades highly complex when the target network changes. It is therefore desirable to avoid generating a specific hardware description for each individual network: without reflashing the FPGA, the entire deployment flow is completed by instruction configuration, and different target networks are configured by instructions without reconstructing the FPGA accelerator. A completely new CNN acceleration system is proposed: an OPU (Overlay Processor Unit) instruction set is defined, a compiler compiles the defined instruction set into instruction sequences, and the OPU executes the compiled instructions to realize CNN acceleration. To realize such CNN acceleration, one must consider how to define instructions that map and reorganize networks of different structures onto a specific structure, so that the instruction-controlled processor has good universality. On the other hand, when external memory is involved, cycle-accurate simulation of memory read and write operations is unreliable, because refresh times and other overheads may arise during external-memory use. If an instruction executes immediately after decoding, the operation order can only be controlled by the sequence of the instruction stream; without cycle-accurate simulation, controlling the start points of operations executed in parallel becomes intractable. Meanwhile, the initial conditions of the main operations vary within limits and are usually triggered only after a certain state has been reached in the preceding steps, so the execution time of each instruction is highly uncertain. Therefore, a method of defining an instruction set is needed to overcome the above problems: an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure, optimizes the universality of the instruction-controlled processor, completes the configuration of different target networks according to instructions, and realizes general CNN acceleration through the OPU.
Summary of the invention
The object of the present invention is to provide a method for defining an OPU instruction set for CNN acceleration, which provides an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure and optimizes the universality of the instruction-controlled processor, so as to realize different networks without reconstructing the FPGA.
The technical solution adopted by the invention is as follows:
A method for defining an OPU instruction set for CNN acceleration includes the following steps:
defining conditional instructions, defining unconditional instructions, and setting the instruction granularity;
Defining conditional instructions includes the following steps:
constructing the conditional instructions, which include a memory-read instruction, a memory-write instruction, a data-fetch instruction, a data post-processing instruction, and a compute instruction;
setting the registers and the execution mode of the conditional instructions, where the execution mode is to execute after a hard-wired trigger condition is met, and the registers include parameter registers and trigger-condition registers;
setting the parameter-configuration mode of the conditional instructions, where parameters are configured according to the unconditional instructions;
Defining unconditional instructions includes the following steps:
defining the parameters of the unconditional instructions;
defining the execution mode of the unconditional-instruction parameters, where the execution mode is direct execution upon being read;
Setting the instruction granularity includes the following steps:
analyzing the CNN network and the acceleration requirements;
determining the computation mode according to the statistics and the selected parallel input and output channels, and setting the instruction granularity.
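The two instruction families above can be sketched as simple data structures. This is a minimal illustration, assuming hypothetical field names; the patent does not give a concrete binary encoding:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the two instruction families described above.
# Field names are assumptions, not the patent's actual encoding.

@dataclass
class ConditionalInstruction:
    """Executes only after its hard-wired trigger condition is met."""
    opcode: str                                  # one of: r, w, f, c, p
    trigger_condition: str                       # e.g. "mem_read_done" (hypothetical name)
    params: dict = field(default_factory=dict)   # supplied by unconditional instructions

@dataclass
class UnconditionalInstruction:
    """Executes immediately when read; updates the parameter registers."""
    params: dict = field(default_factory=dict)

def apply_unconditional(regfile: dict, instr: UnconditionalInstruction) -> None:
    # Direct execution: replace parameter-register contents.
    regfile.update(instr.params)
```

A usage example: `apply_unconditional(regs, UnconditionalInstruction({"layer_width": 8}))` leaves `regs["layer_width"] == 8`, modeling how an unconditional instruction configures a later conditional one.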
Preferably, the memory-read instruction includes performing read operations in Mode A1 and in Mode A2; the granularity is n numbers per read, n > 1;
Mode A1: read n numbers sequentially starting from a specified address;
Mode A2: read n numbers according to an address stream, where the addresses in the stream are non-contiguous. Three post-read operations are supported: 1. no operation after reading; 2. splicing into a specified length after reading; 3. splitting into a specified length after reading. Four on-chip storage destinations follow the read operation: the feature-map memory module, the inner-product parameter memory module, the bias parameter memory module, and the instruction memory module;
The memory-read instruction may carry parameters including the start address, the number of operands, the post-read processing mode, and the on-chip storage location.
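Under stated assumptions (one number per address, addresses as plain integers), the two read modes reduce to two address generators, and the "splice" post-read operation to a regrouping step. A minimal sketch, not the hardware's actual addressing logic:

```python
def mode_a1_addresses(start: int, n: int) -> list:
    """Mode A1: n consecutive addresses starting from `start`."""
    return [start + i for i in range(n)]

def mode_a2_addresses(address_stream: list, n: int) -> list:
    """Mode A2: n addresses taken from a (possibly non-contiguous) stream."""
    return address_stream[:n]

def splice(values: list, length: int) -> list:
    """Post-read operation 2 (illustrative): group read values into
    chunks of a specified length."""
    return [values[i:i + length] for i in range(0, len(values), length)]
```

For example, `mode_a2_addresses([5, 9, 2, 7, 1], 3)` yields the non-contiguous addresses `[5, 9, 2]`, while Mode A1 from address 100 yields `[100, 101, 102, ...]`.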
Preferably, the memory-write instruction includes performing write operations in Mode B1 and in Mode B2; the granularity is n numbers per write, n > 1;
Mode B1: write n numbers sequentially starting from a specified address;
Mode B2: write n numbers according to a destination-address stream, where the addresses in the stream are non-contiguous;
The memory-write instruction may carry parameters including the start address and the number of operands.
Preferably, the data-fetch instruction includes reading data from the on-chip feature-map memory and the inner-product parameter memory according to different data-read patterns and data-reorganization layouts, and reorganizing and arranging the read data; the granularity is 64 input data operated on simultaneously. The data-fetch instruction may carry parameters for reading the feature-map memory and the inner-product parameter memory, where reading the feature-map memory involves read-address constraints (the minimum and maximum addresses), the read stride, and the rearrangement pattern, and reading the inner-product parameter memory involves read-address constraints and the read mode.
Preferably, the data post-processing instruction includes one or more of the following operations: pooling, activation, fixed-point cutting, rounding, and element-wise vector addition; the granularity is a multiple of 64 data per operation. The data post-processing instruction may carry parameters including the pooling type, pooling size, activation type, and fixed-point cut position.
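A minimal sketch of such a post-processing chain, assuming integer fixed-point data, ReLU as the activation, and a truncating right-shift as the fixed-point cut (all assumptions; the patent only names the operation classes). The patent operates on multiples of 64 values per step; this sketch simply maps over whatever list it is given:

```python
def post_process(values, bias=None, activation=None, cut_bits=None):
    """Illustrative post-processing chain: element-wise vector addition,
    activation, then fixed-point cutting."""
    out = list(values)
    if bias is not None:                        # element-wise vector addition
        out = [v + b for v, b in zip(out, bias)]
    if activation == "relu":                    # activation (ReLU assumed)
        out = [max(0, v) for v in out]
    if cut_bits is not None:                    # fixed-point cut: drop low bits
        out = [v >> cut_bits for v in out]
    return out
```

For example, `post_process([-2, 3], bias=[1, 1], activation="relu")` gives `[0, 4]`: the bias add produces `[-1, 4]` and ReLU clamps the negative value.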
Preferably, the compute instruction includes vector inner-product operations unrolled for vectors of different lengths; the granularity is 32, and the basic computing unit used for the inner product is a vector-inner-product module for two vectors of length 32. The compute instruction may carry an adjustable parameter specifying the number of output results.
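The decomposition of a long inner product into length-32 base units can be sketched as follows. This is a functional model only; the accumulation order across tiles is an assumption, and vector lengths are assumed to be multiples of 32, matching the channel-count property stated elsewhere in the document:

```python
BASE = 32  # length of the basic vector-inner-product unit

def inner_product(a, b):
    """Compute dot(a, b) by tiling into length-32 partial inner products,
    mirroring the length-32 compute unit described above."""
    assert len(a) == len(b) and len(a) % BASE == 0
    total = 0
    for i in range(0, len(a), BASE):            # one base unit per tile
        total += sum(x * y for x, y in zip(a[i:i + BASE], b[i:i + BASE]))
    return total
```

A length-64 inner product thus takes two passes through the base unit, with the partial sums accumulated.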
Preferably, the unconditional instructions provide parameter updates. The parameters include the length, width, and channel count of the on-chip feature-map storage; the input length, input width, and input-channel count of the current layer; the output-channel count; the start address and mode selection of the memory-read operation; the start address and mode selection of the memory-write operation; the data-fetch mode and its constraints; the computation-mode setting; the pooling-related parameters; the activation-related parameters; and the data shift, shear, and rounding operations.
Preferably, the method further includes setting an instruction-sequence definition mode. Specifically: if the instruction sequence contains multiple consecutive identical instructions, a single instruction is set instead, and that instruction is executed repeatedly until the contents of the trigger-condition register and the parameter registers are updated.
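The collapsing of repeated instructions described above amounts to run-length compression of the instruction sequence; a hedged sketch (the instruction representation is hypothetical — in hardware the repeat is ended by a register update rather than an explicit count):

```python
def compress(seq):
    """Collapse runs of identical instructions into (instr, count) pairs.
    In the patent's scheme the single instruction re-executes until the
    trigger-condition and parameter registers are updated; here an explicit
    repeat count stands in for that condition."""
    out = []
    for ins in seq:
        if out and out[-1][0] == ins:
            out[-1] = (ins, out[-1][1] + 1)     # extend the current run
        else:
            out.append((ins, 1))                # start a new run
    return out
```

For example, the sequence `r r r c` is emitted as a single repeated `r` followed by `c`, greatly shortening the instruction sequence for layers with many identical memory reads.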
Preferably, the method further includes defining the instruction length; the instruction length is uniform.
Preferably, the minimum unit of the parallel input- and output-channel computation mode is a vector inner product of length 32.
In conclusion by adopting the above-described technical solution, the beneficial effects of the present invention are:
1. the present invention is by defining conditional instruction, defining imperative statement and setting instruction granularity, imperative statement Configuration parameter is provided for conditional instruction, trigger condition is arranged in conditional instruction, and trigger condition is written firmly within hardware, has ready conditions Corresponding register is arranged in instruction, executes after trigger condition satisfaction, imperative statement directly executes after being read, alternative parameter Content of registers avoids existing operation cycle uncertainty from leading to not predict instruction reorder greatly, realizes that Accurate Prediction instruction is suitable Sequence;Calculating mode is determined according to the parallel input of CNN network, acceleration demand and selection and output channel, and instruction particle is set Degree is realized and recombinates the network mapping of different structure to specific structure, using the various sizes of network of parallel computation mode adaptive Kernel size, solve the universality of instruction set alignment processing device, be that CNN OverDrive Processor ODP according to instruction completes different mesh The configuration for marking network accelerates general CNN accelerating velocity and provides applicable OPU instruction set;
2. trigger condition is arranged in conditional instruction of the invention, the sequence that existing instruction sequence fully relies on setting is avoided The shortcomings that time-consuming is executed, realizes that memory reading is continuously operated with model identical, without the fixed intervals sequence by setting It executes, greatly shortens the length of instruction sequence, conducive to the acceleration for realizing different target network by instruction rapid configuration;
3. the parallel input and output channel according to statistical result and selection determine calculating mode, and instruction particle is arranged Degree, can by parameter regulation parallel section input channel to calculate more output channels simultaneously, or parallel more multiple input path with Calculating wheel number is reduced, input channel and output channel are 32 multiple in universal CNN structure, and choosing 32 can as basis unit The peak use rate of computing unit is effectively ensured;
It is updated 4. imperative statement of the invention provides parameter, the synchronous parameter of renewal frequency is classified into same without item To make full use of all bits of instruction in part instruction, reduces total instruction and call item number;
5. a single instruction is only arranged in the present invention when having continuous a plurality of repetitive instruction, which is repeatedly executed at predetermined intervals, until Trigger condition register and parameter register keep content until being updated, and are conducive to realize different target by instruction rapid configuration The acceleration of network.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. For those of ordinary skill in the art, other relevant drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow chart of the instruction-set definition method of the invention;
Fig. 2 is a schematic diagram of the trigger operation of a conditional instruction of the invention;
Fig. 3 is a schematic diagram of the parallel computation mode of the invention;
Fig. 4 is a schematic diagram of the instruction set of the invention;
Fig. 5 is a schematic diagram of the OPU structure based on the instruction set of the invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be appreciated that the specific embodiments described here are only used to explain the present invention, not to limit it; that is, the described embodiments are only a part of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention generally described and illustrated here in the drawings can be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments of the present invention provided in the drawings is not intended to limit the claimed scope of the invention, but merely represents selected embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
It should be noted that relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
The features and performance of the present invention are described in further detail below with reference to the embodiments.
Embodiment 1
A method for defining an OPU instruction set for CNN acceleration includes defining conditional instructions, defining unconditional instructions, and setting the instruction granularity;
When the defined instruction set is used for CNN acceleration, the instruction types, the operation corresponding to each kind of instruction, the common parameter definitions, and the instruction granularity must be defined; the common parameter definitions include the instruction length and the instruction sequence. OPU instruction execution proceeds as follows. Step 1: read an instruction block (the instruction set is the aggregate of all instructions; an instruction block is a group of consecutive instructions, and the instructions for executing one network comprise multiple instruction blocks). Step 2: obtain the unconditional instructions in the block and execute them directly, decoding the parameters they contain and writing them into the corresponding registers; obtain the conditional instructions in the block, set their trigger conditions, and proceed to step 3. Step 3: judge whether each trigger condition is met; if so, execute the conditional instruction; if not, do not execute it. Step 4: judge whether the read of the next instruction block satisfies its trigger condition; if so, return to step 1 and continue executing instructions; otherwise the register parameters and the trigger conditions set by the current instructions remain unchanged until the trigger condition is met.
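The four-step flow above can be sketched as a small event loop. This is a behavioral model under stated assumptions — the tuple encoding, register names, and trigger names are all hypothetical, and the patent does not specify them:

```python
def run_block(block, registers, condition_met):
    """Execute one instruction block.

    `block` is a list of ("U", params) or ("C", trigger, op) tuples;
    `condition_met` maps a trigger name to True/False.
    Unconditional instructions execute immediately and update the
    registers (step 2); conditional instructions arm their triggers and
    execute only once the trigger is satisfied (step 3).
    Returns the list of conditional operations that actually executed.
    """
    executed = []
    pending = []
    for instr in block:
        if instr[0] == "U":                  # step 2: direct execution
            registers.update(instr[1])
        else:                                # conditional: arm the trigger
            pending.append(instr)
    for _, trigger, op in pending:           # step 3: check trigger conditions
        if condition_met.get(trigger, False):
            executed.append(op)
    return executed
```

In this model a block with one unconditional instruction and two conditional ones, only one of whose triggers has fired, updates the registers and executes exactly one operation; the other waits, as in step 4.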
The instruction set defined by the application is used in an OPU-based CNN acceleration system; the OPU structure is shown in Fig. 5. The OPU is implemented on an FPGA or an ASIC, and the final operating instructions are generated according to the defined instructions; executing these instructions realizes the acceleration of different target CNN networks. The technical means used are as follows: defining conditional instructions, defining unconditional instructions, and setting the instruction granularity, with the flow chart shown in Fig. 1. Defining the conditional instructions defines their composition; the conditional instructions comprise five classes of instructions. The registers and execution mode of the conditional instructions are set: the execution mode is to execute after the hard-wired trigger condition is met, and the registers include parameter registers and trigger-condition registers. The parameter-configuration mode of the conditional instructions is set: parameters are configured according to the unconditional instructions. Defining the unconditional instructions includes defining their parameters and defining their execution mode, namely direct execution. The instruction length is defined as uniform, and the instruction set has the structure shown in Fig. 4. Setting the instruction granularity: analyze the CNN network and the acceleration requirements, determine the computation mode according to the statistics and the selected parallel input and output channels, and set the instruction granularity. The OPU instruction set includes conditional instructions (C-type instructions) and unconditional instructions (U-type instructions); the resulting instruction sequence is shown in Fig. 4.
The composition of the conditional instructions is shown in Table 1:
Table 1

Instruction name    Instruction function
r                   memory-read instruction
w                   memory-write instruction
f                   data-fetch instruction
c                   compute instruction
p                   data post-processing instruction
The instruction granularity of each instruction class is set according to the CNN network structure and the acceleration requirements. For the memory-read instruction, the granularity set according to the CNN acceleration characteristics is n numbers per read, n > 1; for the memory-write instruction, it is n numbers per write, n > 1. For the data-fetch instruction, according to the structure of the CNN network, the granularity is a multiple of 64, operating on 64 input data simultaneously. For the data post-processing instruction, the granularity is a multiple of 64 data per operation. For the compute instruction, because the network input- and output-channel counts are multiples of 32, its granularity is 32.
The parameters defined by the unconditional instructions are shown in Table 2:
Table 2
The computation mode uses parallel input and output channels: parameters can adjust the number of parallel input channels so as to compute more output channels simultaneously, or use more parallel input channels to reduce the number of computation rounds. Input- and output-channel counts in common CNN structures are multiples of 32, so this embodiment selects a vector inner product with a minimum unit of 32 in the parallel input- and output-channel computation mode, which effectively ensures peak utilization of the computing units. When the instruction set is used for CNN acceleration, the parallel input- and output-channel computation mode is illustrated in Fig. 3. In each clock cycle, a slice of size 1*1 with a depth of ICS input channels is read together with the corresponding kernel elements; these elements follow the natural data-storage pattern and require only a small bandwidth. Parallelism is realized over the input channels (ICS) and the output channels (OCS, the number of kernel sets involved). Fig. 3(c) further illustrates the computation process. In round 0, cycle 0, the input-channel slice at position (0,0) is read; in the next cycle we jump by stride x and read position (0,2); reading continues until all pixels corresponding to kernel position (0,0) have been computed. Then we enter round 1 and start reading all pixels corresponding to kernel position (0,1), beginning from position (0,1). Computing a data block of size IN*IM*IC with OC kernel sets requires Kx*Ky*(IC/ICS)*(OC/OCS) rounds. This computation mode can handle uniform data of any kernel size or stride, greatly simplifies data management before computation, achieves higher efficiency with less resource consumption, and adapts to the kernel sizes of networks of various sizes.
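The round count stated above, Kx*Ky*(IC/ICS)*(OC/OCS), can be checked with a short helper. Function and parameter names are illustrative; channel counts are assumed divisible by the parallelism factors, consistent with the multiples-of-32 property stated in the text:

```python
def num_rounds(kx, ky, ic, oc, ics, ocs):
    """Number of computation rounds for a Kx*Ky kernel with IC input
    channels and OC output channels, processing ICS input channels and
    OCS kernel sets in parallel per round."""
    assert ic % ics == 0 and oc % ocs == 0
    return kx * ky * (ic // ics) * (oc // ocs)
```

For example, a 3*3 kernel with 64 input and 64 output channels at ICS = OCS = 32 needs 3*3*2*2 = 36 rounds; doubling the parallel input channels would halve that, which is the trade-off the parameters expose.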
In summary, because existing FPGA acceleration work aims to generate a specific, standalone accelerator for each different CNN, the present application, in order to realize different networks without reconstructing the FPGA, provides an acceleration processor controlled by the instructions defined herein. The instructions of the application's OPU instruction set have no counterpart in the prior art, because their hardware, system, and coverage differ from those of existing FPGA acceleration systems. The invention determines the computation mode according to the CNN network, the acceleration requirements, and the selected parallel input and output channels, and sets the instruction granularity, realizing the mapping and reorganization of networks of different structures onto a specific structure, adapting to the kernel sizes of networks of various sizes, and solving the universality of the processor targeted by the instruction set. Conditional and unconditional instructions are defined: the unconditional instructions provide configuration parameters for the conditional instructions; the conditional instructions carry hard-wired trigger conditions and corresponding registers and execute once their trigger conditions are met; the unconditional instructions execute directly upon being read and replace the parameter-register contents. The conditional instructions thus run according to trigger conditions while the unconditional instructions supply their configuration parameters, so the instruction execution order is accurate and unaffected by other factors, overcoming the problem in CNN acceleration systems that the execution time of each instruction is highly uncertain and the instruction order cannot be accurately predicted. The invention provides an OPU instruction set that maps and reorganizes networks of different structures onto a specific structure; the instruction set and the corresponding OPU processor can be implemented on an FPGA or an ASIC, improving the universality of the instruction-controlled processor. The OPU can accelerate different target CNN networks without hardware reconfiguration.
Embodiment 2
Based on embodiment 1, six kinds in the conditional instruction of the application instructions: including read store instruction, write store instruction, Data grabber instruction, Data Post instruction and computations;Conditional instruction is held after the trigger condition for meeting hardware write-in Row, conditional instruction register includes parameter register and trigger condition register;Conditional instruction according to imperative statement into Row parameter configuration.
The read-memory instruction performs a read operation either in Mode A1 or in Mode A2; the configurable parameters of the read operation instruction include the start address, the number of operands, the post-read processing mode and the on-chip storage location.
Mode A1: read n words consecutively starting from the specified address, where n is a positive integer;
Mode A2: read n words according to an address stream whose addresses are non-contiguous. There are three post-read operations: 1. no operation after reading; 2. splicing into a specified length after reading; 3. splitting into a specified length after reading. There are four on-chip storage locations for the read data: the feature-map storage module, the inner-product parameter storage module, the bias parameter storage module and the instruction storage module.
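The two addressing modes and the three post-read operations can be modeled as follows. This is a minimal illustrative sketch: the function names, the numeric mode codes and the 8-bit word granularity are assumptions for the sketch, not the patent's actual instruction encoding.

```python
def read_mode_a1(mem, start, n):
    """Mode A1: read n words consecutively starting at `start`."""
    return [mem[start + i] for i in range(n)]

def read_mode_a2(mem, address_stream):
    """Mode A2: read one word per (possibly non-contiguous) address."""
    return [mem[addr] for addr in address_stream]

def post_read(words, mode, width=1):
    """Post-read operation (codes are illustrative): 0 = no operation,
    1 = splice `width` words into one wider word,
    2 = split each word into `width` 8-bit pieces."""
    if mode == 0:
        return words
    if mode == 1:
        return [sum(w << (8 * j) for j, w in enumerate(words[i:i + width]))
                for i in range(0, len(words), width)]
    if mode == 2:
        return [(w >> (8 * j)) & 0xFF for w in words for j in range(width)]
    raise ValueError("unknown post-read mode")
```

The result of `post_read` would then be written to one of the four on-chip storage modules selected by the instruction's storage-location parameter.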
The write-memory instruction performs a write operation either in Mode B1 or in Mode B2; the configurable parameters of the write operation instruction include the start address and the number of operands.
Mode B1: write n words consecutively starting from the specified address;
Mode B2: write n words according to a target address stream whose addresses are non-contiguous;
The data fetch instruction includes reading data from the on-chip feature-map memory and the inner-product parameter memory according to different data read patterns and data recombination and arrangement patterns, and recombining and arranging the read data. The configurable parameters of the data fetch-and-recombine instruction include those for reading the feature-map memory and those for reading the inner-product parameter memory: reading the feature-map memory is constrained by a lowest address and a highest address, a read stride and a rearrangement pattern, while reading the inner-product parameter memory is constrained by the read address and the read mode.
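The stride-bounded fetch and the rearrangement step can be sketched as below; the function names and the representation of the rearrangement pattern as an index permutation are illustrative assumptions, not the patent's encoding.

```python
def fetch_feature_map(mem, lowest, highest, stride):
    """Read words from feature-map memory within the address bounds
    [lowest, highest], advancing by `stride` between reads."""
    return [mem[a] for a in range(lowest, highest + 1, stride)]

def rearrange(words, pattern):
    """Recombine and arrange fetched words according to a permutation
    pattern (one output slot per pattern index)."""
    return [words[i] for i in pattern]
```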
The data post-processing instruction performs one or more of pooling, activation, fixed-point cutting, rounding and element-wise vector addition; the configurable parameters of the post-processing instruction include the pooling type, the pooling size, the activation type and the fixed-point cut position.
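Illustrative versions of these post-processing operations are sketched below; the 8-bit output width, round-to-nearest behavior and saturation are assumptions consistent with the description, not the patent's exact arithmetic.

```python
def max_pool(vals, size):
    """1-D max pooling with window and stride `size` (sketch of the idea)."""
    return [max(vals[i:i + size]) for i in range(0, len(vals), size)]

def relu(vals):
    """ReLU activation, one candidate for the activation-type parameter."""
    return [max(v, 0) for v in vals]

def vec_add(a, b):
    """Element-wise (aligned) vector addition."""
    return [x + y for x, y in zip(a, b)]

def fixed_point_cut(v, cut_pos, width=8):
    """Drop `cut_pos` fraction bits with round-to-nearest, then saturate
    to a signed `width`-bit result."""
    r = (v + (1 << (cut_pos - 1))) >> cut_pos if cut_pos > 0 else v
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, r))
```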
The compute instruction includes expanding inner-product operations for vectors of different lengths. The basic compute unit used by the vector inner-product operation consists of two inner-product modules of length 32 (32 is the vector length; each vector holds 32 pieces of 8-bit data); the configurable parameter of the compute instruction is the number of output results.
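The folding of longer vectors onto the 32-element inner-product module can be sketched as follows; the function names and the accumulate-per-pass scheme are illustrative assumptions based on the description.

```python
VEC_LEN = 32  # length of one inner-product module, per the description

def inner_product(a, b):
    """One inner-product module: 32 multiplies plus a summation."""
    assert len(a) == len(b) == VEC_LEN
    return sum(x * y for x, y in zip(a, b))

def long_inner_product(a, b):
    """Fold a vector whose length is a multiple of 32 onto the module by
    accumulating one 32-element partial inner product per pass."""
    assert len(a) == len(b) and len(a) % VEC_LEN == 0
    return sum(inner_product(a[i:i + VEC_LEN], b[i:i + VEC_LEN])
               for i in range(0, len(a), VEC_LEN))
```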
Each conditional instruction has trigger conditions hard-wired in hardware and a corresponding register, and performs its read-memory, write-memory, data fetch, data post-processing or compute operation once the trigger condition is satisfied. After the instructions are compiled and the OPU reads them upon the start signal sent by the GUI, the OPU runs the instructions in the parallel computing mode they define, completing the acceleration of different target networks. The trigger conditions are hard-wired in hardware; for example, the memory-read module has six instruction trigger conditions, including: 1. trigger when the previous memory read and the previous data fetch-and-recombine have both completed; 2. trigger when the previous data write has completed; 3. trigger when the previous data post-processing has completed; and so on. Setting trigger conditions for conditional instructions avoids the drawback of an instruction sequence that relies entirely on a preset order and is time-consuming to execute; memory reads in the same pattern can run continuously without being sequenced at fixed preset intervals, which greatly shortens the instruction sequence, further speeds up instruction execution, and facilitates the acceleration of different target networks through rapid instruction configuration. As shown in Fig. 2 for a read operation and a write operation: the initial TCI is set at t0 and triggers a memory read at t1, which executes from t1 to t5; the next trigger condition TCI can be updated at any time point between t1 and t5, and the current TCI is stored until it is replaced by a new instruction. In this case, when memory reads run continuously in the same pattern, no new instruction is needed (at times t6 and t12 the operation is triggered by the same TCI), which shortens the instruction sequence by more than 10x. Meanwhile, a conditional instruction executes after its trigger condition is met and its configuration parameters are provided by the unconditional parameter instructions, so instruction execution is exact and the instruction stalls caused by timing uncertainty in the prior art are avoided.
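The trigger-condition mechanism can be modeled as below. This is a hypothetical sketch: a conditional instruction stays resident and fires every time its hard-wired trigger condition appears in the event stream, reusing the same register contents until a new instruction overwrites them; the event names and class layout are illustrative, not the patent's encoding.

```python
class ConditionalInstruction:
    def __init__(self, trigger, action):
        self.trigger = trigger  # trigger-condition register contents
        self.action = action    # operation plus its parameter registers

    def update(self, trigger, action):
        """A newly read instruction replaces the register contents."""
        self.trigger, self.action = trigger, action

def run(instr, event_stream):
    """Fire the resident instruction at every cycle whose event set
    satisfies its trigger condition; no refetch between firings."""
    log = []
    for events in event_stream:
        if instr.trigger in events:
            log.append(instr.action)
    return log
```

A read instruction triggered by a "fetch done" event thus fires repeatedly (as at t6 and t12 in Fig. 2) without any new instruction being issued.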
Embodiment 3
Based on Embodiment 1, when accelerating a CNN the instruction sequence may contain multiple consecutive identical instructions. The definition of the instruction sequence is therefore specified when the instruction set is defined: if an instruction repeats consecutively, only a single instruction is placed in the sequence, and that instruction is executed repeatedly until the contents of the trigger-condition register and the parameter register are updated. Defining only the first of several consecutive identical instructions, with the trigger-condition register and the parameter register holding their contents until updated, facilitates the acceleration of different target networks through rapid instruction configuration.
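The instruction-sequence rule above amounts to collapsing runs of identical instructions, since the hardware keeps re-executing the resident instruction until its registers are overwritten. A minimal sketch (the function name is illustrative):

```python
def collapse_repeats(sequence):
    """Store only the first of each run of consecutive identical
    instructions; the hardware re-executes it until its trigger-condition
    and parameter registers are updated."""
    out = []
    for ins in sequence:
        if not out or out[-1] != ins:  # keep only the first of a run
            out.append(ins)
    return out
```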
An unconditional instruction must define many kinds of parameters, which makes the corresponding instruction long. To reduce the instruction length, a unified scheme for unconditional-instruction parameters is defined: parameters that update at the same frequency are unified, i.e. placed in the same unconditional instruction, so that every bit of the instruction is fully used. This reduces the total number of instruction invocations, greatly shortens the instruction length, and facilitates the acceleration of different target networks through rapid instruction configuration.
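The parameter-unification rule can be sketched as a grouping by update frequency, where each group would become one unconditional instruction; the parameter names and frequency labels below are illustrative assumptions, not the patent's field layout.

```python
def group_parameters(params):
    """Map each update frequency to the list of parameters sharing it;
    parameters in one group are packed into one unconditional
    instruction to use all of its bits."""
    groups = {}
    for name, freq in params.items():
        groups.setdefault(freq, []).append(name)
    return groups
```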
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. An OPU instruction set definition method for CNN acceleration, characterized by comprising: defining conditional instructions, defining unconditional instructions and setting the instruction granularity;
Defining the conditional instructions comprises the following steps:
constructing the conditional instructions, the conditional instructions including a read-memory instruction, a write-memory instruction, a data fetch instruction, a data post-processing instruction and a compute instruction;
setting the registers and the execution mode of the conditional instructions, the execution mode being execution after the hardware-written trigger condition is met, and the registers including parameter registers and trigger-condition registers;
setting the parameter configuration mode of the conditional instructions, the configuration mode being parameter configuration according to the unconditional instructions;
Defining the unconditional instructions comprises the following steps:
defining the parameters of the unconditional instructions;
defining the execution mode of the unconditional-instruction parameters, the execution mode being direct execution after the instruction is read;
Setting the instruction granularity comprises the following steps:
collecting statistics on the CNN networks and the acceleration requirements;
determining the computing mode according to the statistical results and the selected parallel input and output channels, and setting the instruction granularity.
2. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the read-memory instruction includes performing a read operation in Mode A1 or in Mode A2, the granularity being n words read in each time, n > 1;
Mode A1: reading n words consecutively starting from the specified address;
Mode A2: reading n words according to an address stream whose addresses are non-contiguous; there are three post-read operations: 1. no operation after reading; 2. splicing into a specified length after reading; 3. splitting into a specified length after reading; and four on-chip storage locations for the read data: the feature-map storage module, the inner-product parameter storage module, the bias parameter storage module and the instruction storage module;
the configurable parameters of the read operation instruction include the start address, the number of operands, the post-read processing mode and the on-chip storage location.
3. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the write-memory instruction includes performing a write operation in Mode B1 or in Mode B2, the granularity being n words written out each time, n > 1;
Mode B1: writing n words consecutively starting from the specified address;
Mode B2: writing n words according to a target address stream whose addresses are non-contiguous;
the configurable parameters of the write operation instruction include the start address and the number of operands.
4. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the data fetch instruction includes reading data from the on-chip feature-map memory and the inner-product parameter memory according to different data read patterns and data recombination and arrangement patterns and recombining and arranging the read data, the granularity being 64 input data operated on simultaneously; the configurable parameters of the data fetch instruction include those for reading the feature-map memory and those for reading the inner-product parameter memory, wherein reading the feature-map memory is constrained by a lowest address and a highest address, a read stride and a rearrangement pattern, and reading the inner-product parameter memory is constrained by the read address and the read mode.
5. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the data post-processing instruction includes one or more of pooling, activation, fixed-point cutting, rounding and element-wise vector addition, the granularity being 64 data operated on each time; the configurable parameters of the post-processing instruction include the pooling type, the pooling size, the activation type and the fixed-point cut position.
6. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the compute instruction includes expanding inner-product operations for vectors of different lengths, the granularity being 32; the basic compute unit used by the vector inner-product operation consists of two inner-product modules of length 32, and the configurable parameter of the compute instruction includes the number of output results.
7. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the unconditional instructions provide parameter updates, the parameters including: the length, width and channel number of the on-chip feature storage module; the input length, width and input channel number of the current layer; the output channel number; the start address and mode selection of the read-memory operation; the start address and mode selection of the write-memory operation; the data fetch mode and its constraints; the computing mode setting; the pooling operation parameters; the activation operation parameters; and the data shift, shear and rounding parameters.
8. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized by further comprising setting the instruction-sequence definition mode, specifically: if an instruction repeats consecutively in the instruction sequence, only a single instruction is set, and that instruction is executed repeatedly until the contents of the trigger-condition register and the parameter register are updated.
9. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized by further comprising defining the instruction length, the instruction length being uniform.
10. The OPU instruction set definition method for CNN acceleration according to claim 1, characterized in that: the parallel input and output channels correspond to a computing-mode minimum unit of a 32-length vector inner product.
CN201910192455.0A 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration Active CN110058882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910192455.0A CN110058882B (en) 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration


Publications (2)

Publication Number Publication Date
CN110058882A true CN110058882A (en) 2019-07-26
CN110058882B CN110058882B (en) 2023-01-06

Family

ID=67316909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910192455.0A Active CN110058882B (en) 2019-03-14 2019-03-14 OPU instruction set definition method for CNN acceleration

Country Status (1)

Country Link
CN (1) CN110058882B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6088783A (en) * 1996-02-16 2000-07-11 Morton; Steven G DPS having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
CA2725130A1 (en) * 2008-05-29 2009-12-03 Axis Semiconductor Inc. Method & apparatus for real-time data processing
CN104834503A (en) * 2014-02-12 2015-08-12 Imagination Technologies Ltd. Processor with granular add immediates capability & methods
WO2016171846A1 (en) * 2015-04-23 2016-10-27 Google Inc. Compiler for translating between a virtual image processor instruction set architecture (isa) and target hardware having a two-dimensional shift array structure
CN107533750A (en) * 2015-04-23 2018-01-02 Google Inc. Virtual image processor instruction set architecture (ISA) and memory model, and exemplary target hardware having a two-dimensional shift array structure
CN107679620A (en) * 2017-04-19 2018-02-09 Beijing Deephi Technology Co., Ltd. Artificial neural network processing unit
CN107704922A (en) * 2017-04-19 2018-02-16 Beijing Deephi Technology Co., Ltd. Artificial neural network processing unit
US20180307974A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with multiple instruction units
CN108197705A (en) * 2017-12-29 2018-06-22 Nationz Technologies Inc. Convolutional neural network hardware accelerator, convolution calculation method and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MOHAMED S. ABDELFATTAH ET AL: "DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration", arXiv *
YIN WEI: "Research on a parallel acceleration architecture for convolutional neural networks based on FPGA", China Masters' Theses Full-text Database *
MA KE: "Design and implementation of a microprocessor with convolutional neural network extension instructions", China Masters' Theses Full-text Database *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516790A (en) * 2019-08-16 2019-11-29 Inspur Electronic Information Industry Co., Ltd. Convolutional network acceleration method, device and system
WO2021031350A1 (en) * 2019-08-16 2021-02-25 Inspur Electronic Information Industry Co., Ltd. Convolutional network acceleration method, device and system
CN110516790B (en) * 2019-08-16 2023-08-22 Inspur Electronic Information Industry Co., Ltd. Convolutional network acceleration method, device and system
CN111563579A (en) * 2020-04-28 2020-08-21 Shenzhen Yicheng Autonomous Driving Technology Co., Ltd. Data-stream-based CNN acceleration method, device, equipment and storage medium
CN111563579B (en) * 2020-04-28 2023-09-22 Shenzhen Yicheng Autonomous Driving Technology Co., Ltd. Data-stream-based CNN acceleration method, device, equipment and storage medium
WO2022028220A1 (en) * 2020-08-06 2022-02-10 Tencent Technology (Shenzhen) Co., Ltd. Neural network model computing chip, method and apparatus, device and medium
CN111932436A (en) * 2020-08-25 2020-11-13 Chengdu Star Innovation Technology Co., Ltd. Deep learning processor architecture for intelligent parking
CN111932436B (en) * 2020-08-25 2024-04-19 Chengdu Star Innovation Technology Co., Ltd. Deep learning processor architecture for intelligent parking
CN112257843A (en) * 2020-09-23 2021-01-22 Zhejiang University System for expanding instruction set based on MobileNetV1 network inference task
CN112257843B (en) * 2020-09-23 2022-06-28 Zhejiang University System for expanding instruction set based on MobileNetV1 network inference task

Also Published As

Publication number Publication date
CN110058882B (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN110058882A (en) It is a kind of for CNN accelerate OPU instruction set define method
CN110058883B (en) CNN acceleration method and system based on OPU
US8001510B1 (en) Automated method of architecture mapping selection from constrained high level language description via element characterization
CA2963088C (en) Apparatus and method for scheduling distributed workflow tasks
Srinivasan et al. Hardware software partitioning with integrated hardware design space exploration
CN110956272B (en) Method and system for realizing data processing
IL265682A (en) Systems and methods for configuring programmable logic devices for deep learning networks
US20170185700A1 (en) Selective Execution For Partitioned Parallel Simulations
CN110187969A Distributed big data parallel computing method based on GPU
Kamthe et al. A stochastic approach to estimating earliest start times of nodes for scheduling DAGs on heterogeneous distributed computing systems
CN109656872A (en) Dynamic partially reconfigurable on-chip system software and hardware partitioning method
Jain et al. Machine learned machines: Adaptive co-optimization of caches, cores, and on-chip network
Wodecki A block approach to earliness-tardiness scheduling problems
CN103049310A (en) Multi-core simulation parallel accelerating method based on sampling
Boucheneb et al. Optimal reachability in cost time Petri nets
Shu et al. ROAM: memory-efficient large DNN training via optimized operator ordering and memory layout
Alur et al. Ranking automata and games for prioritized requirements
US20230140809A1 (en) Machine learning based contention delay prediction in multicore architectures
CN106055862A (en) Novel efficient heuristic-type two-stage parallel branch-and-bound method
Sima et al. Runtime decision of hardware or software execution on a heterogeneous reconfigurable platform
Michalska et al. Tabu search for partitioning dynamic dataflow programs
CN110928253B (en) Dynamic weighting heuristic scheduling method for automatic manufacturing system
Damschen et al. WCET guarantees for opportunistic runtime reconfiguration
Jordans et al. An efficient method for energy estimation of application specific instruction-set processors
Meng et al. Accelerating monte-carlo tree search on cpu-fpga heterogeneous platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200616

Address after: Room 305, building 9, meizhuang new village, 25 Yangzi Jiangbei Road, Weiyang District, Yangzhou City, Jiangsu Province 225000

Applicant after: Liang Lei

Address before: 610094 China (Sichuan) Free Trade Pilot Area, Chengdu City, Sichuan Province, 1402, Block 199, Tianfu Fourth Street, Chengdu High-tech Zone

Applicant before: Chengdu Star Innovation Technology Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20221215

Address after: 518017 1110, Building 3, Northwest Shenjiu Science and Technology Pioneer Park, the intersection of Taohua Road and Binglang Road, Fubao Community, Fubao Street, Shenzhen, Guangdong

Applicant after: Shenzhen biong core technology Co.,Ltd.

Address before: Room 305, Building 9, Meizhuang New Village, No. 25, Yangzijiang North Road, Weiyang District, Yangzhou City, Jiangsu Province, 225000

Applicant before: Liang Lei

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant