CN106708780A - Low complexity branch processing circuit of uniform dyeing array towards SIMT framework - Google Patents

Low complexity branch processing circuit of uniform dyeing array towards SIMT framework

Info

Publication number
CN106708780A
Authority
CN
China
Prior art keywords
predicate register
control unit
unit
stack cell
storehouse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611140108.6A
Other languages
Chinese (zh)
Inventor
牛少平
田泽
韩鹏
韩一鹏
许宏杰
张骏
魏艳艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN201611140108.6A
Publication of CN106708780A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors; single instruction multiple data [SIMD] multiprocessors

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The invention belongs to the technical field of integrated circuits and provides a low-complexity branch processing circuit for a unified shader array based on the SIMT architecture. The circuit comprises a predicate register unit (1), a predicate stack unit (2) and a control unit (3). The circuit can meet the requirements of different numbers of parallel units and different numbers of contexts, and the circuit implementing the mechanism has good timing performance and good scalability.

Description

Low-complexity branch processing circuit for a unified shader array based on the SIMT architecture
Technical field
The invention belongs to the technical field of integrated circuits and relates to a low-complexity branch processing circuit for a unified shader array based on the SIMT architecture.
Background
The unified shader array performs the unified shading of vertices and pixels in a unified-shader graphics processor. Parallel processing in the unified shader array is based on SIMT: there are 16 main parallel execution units, and once an instruction is issued it must execute on all 16 execution units simultaneously. Programs, however, need flow-control instructions such as conditional jumps. Because the input data of the 16 parallel units differ, the condition may evaluate differently in different parallel units, so the jump decisions diverge as well.
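The divergence problem described above can be illustrated with a small sketch. It assumes one condition bit per parallel unit packed into an integer mask (bit i for unit i); the function names and the sample input data are hypothetical and are not taken from the patent.

```python
NUM_LANES = 16  # number of parallel execution units stated in the background


def condition_mask(inputs, predicate):
    """Pack one condition result per lane into a bitmask (bit i = lane i)."""
    mask = 0
    for lane, value in enumerate(inputs):
        if predicate(value):
            mask |= 1 << lane
    return mask


def is_divergent(mask, num_lanes=NUM_LANES):
    """True when some lanes satisfy the condition and others do not."""
    full = (1 << num_lanes) - 1
    return mask != 0 and mask != full


# Hypothetical per-lane input data: lane i holds the value i.
inputs = list(range(NUM_LANES))
mask = condition_mask(inputs, lambda x: x % 2 == 0)
print(f"{mask:016b}", is_divergent(mask))
```

When the mask is neither all zeros nor all ones, a single shared jump decision cannot be correct for every lane, which is the situation the branch processing circuit is meant to handle.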
Summary of the invention
Object of the invention:
The invention proposes a low-complexity branch processing circuit for a unified shader array based on the SIMT architecture. The circuit can meet the requirements of different numbers of parallel units and different numbers of contexts, and the circuit implementing the mechanism has good timing performance and good scalability.
Technical solution:
A low-complexity branch processing circuit for a unified shader array based on the SIMT architecture, comprising:
a predicate register unit (1), a predicate stack unit (2) and a control unit (3);
Predicate register unit (1): When the instruction execution module (4) executes a condition-evaluation instruction, it outputs the result of the evaluation together with the context number to the predicate register unit (1), which stores the value in the predicate register of that context according to the context number. When the branch processing circuit executes a POP instruction, the control unit (3) reads the value from the predicate stack unit (2) and writes it into the predicate register of that context in the predicate register unit (1). When the branch processing circuit executes an INV instruction, the control unit (3) inverts the previous predicate register value, bitwise-ANDs it with the value at the top of the predicate stack, and writes the result into the predicate register of that context in the predicate register unit (1). The predicate register unit (1) outputs the predicate register value of every context to the control unit (3);
Control unit (3): connected to the task scheduling module (5), the IFID module (6), the predicate register unit (1) and the predicate stack unit (2). The control unit (3) receives the branch processing instructions issued by the IFID module (6), which comprise the POP, INV and PUSH instructions. When executing a POP instruction, the control unit (3) reads the value of the context from the predicate stack unit (2) and transfers it to the predicate register unit (1). When executing a PUSH instruction, it writes the predicate register value of the current context, taken from the predicate register unit (1), into the predicate stack unit (2). When executing an INV instruction, it obtains the value at the top of the stack from the predicate stack unit (2), bitwise-ANDs that value with the bitwise inverse of the predicate register value, and returns the result to the predicate register unit (1). When the IFID module (6) issues a non-branch-processing instruction, the control unit (3) bitwise-ANDs the predicate register value from the predicate register unit (1) with the TaskMask from the task scheduling module (5) and transfers the result to the instruction execution module (4);
Predicate stack unit (2): receives three types of operation from the control unit (3): POP, PUSH and INV. For a POP operation, the predicate stack unit (2) reads the data from the top of the stack of the context selected by the context number supplied by the control unit (3) and returns it to the control unit (3). For a PUSH operation, it writes data to the top of the stack of the context selected by the context number supplied by the control unit (3). For an INV operation, it returns the data at the top of the stack of the selected context but does not perform the pop (read-out) operation, that is, the content of the stack is not affected.
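The interaction of the three units can be sketched as a small behavioural model. This is only an illustration under stated assumptions, not the patented circuit: the class and method names are hypothetical, all lanes are assumed to start enabled, and the PUSH, condition, INV, POP ordering used for the if/else example is an assumed usage pattern that the text does not spell out.

```python
class BranchProcessingModel:
    """Behavioural sketch of units (1), (2) and (3); not the patented circuit."""

    def __init__(self, num_contexts=32, num_lanes=16, depth=32):
        self.lane_mask = (1 << num_lanes) - 1
        self.depth = depth
        # Predicate register unit (1): one register per context.
        # Assumption: every lane starts enabled.
        self.pred = [self.lane_mask] * num_contexts
        # Predicate stack unit (2): one stack per context.
        self.stack = [[] for _ in range(num_contexts)]

    def set_condition(self, ctx, result):
        """Execution module (4) reports a condition result for context ctx."""
        self.pred[ctx] = result & self.lane_mask

    def push(self, ctx):
        """PUSH: copy the context's predicate register onto its stack."""
        if len(self.stack[ctx]) >= self.depth:
            raise OverflowError("predicate stack is full")
        self.stack[ctx].append(self.pred[ctx])

    def pop(self, ctx):
        """POP: move the top of the context's stack into its register."""
        self.pred[ctx] = self.stack[ctx].pop()

    def inv(self, ctx):
        """INV: (~predicate) & stack top; the stack itself is left untouched."""
        top = self.stack[ctx][-1]
        self.pred[ctx] = ~self.pred[ctx] & top & self.lane_mask

    def issue_mask(self, ctx, task_mask):
        """Mask sent to the execution module for a non-branch instruction."""
        return self.pred[ctx] & task_mask


# Hypothetical if/else sequence on context 0 of a 4-lane model.
model = BranchProcessingModel(num_contexts=2, num_lanes=4, depth=2)
model.push(0)                     # save the enclosing predicate (0b1111)
model.set_condition(0, 0b0101)    # lanes 0 and 2 satisfy the condition
then_mask = model.issue_mask(0, 0b1111)
model.inv(0)                      # switch to the lanes that failed the condition
else_mask = model.issue_mask(0, 0b1111)
model.pop(0)                      # restore the enclosing predicate
```

Keeping a separate register and stack per context means that a branch in one context never disturbs the predicate state of another, which is why context 1 in the example stays fully enabled throughout.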
Beneficial effects:
1. For multiple contexts, an indicator of whether each thread actually executes (excute mask) is produced from whether the data are valid (data mask) and whether the condition succeeded (predict mask). This ensures that the shader array of the SIMT structure executes issued instructions correctly, and the efficiency of the mechanism is unaffected as the number of parallel units increases;
2. Regarding multiple contexts, the invention is designed using the example of 8 warps each executing for 4 cycles, 32 contexts in total; more contexts can be supported in the same way by adding data mask registers, predict mask registers, predict mask stacks and so on;
3. The design structure of the invention is simple, its scalability is high, and the circuit implementation is efficient.
Brief description of the drawings
Fig. 1 is a structural block diagram of the branch processing mechanism described in the invention.
Detailed description
The technical solution of the invention is described clearly and completely below with reference to the drawing and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the invention without creative work fall within the scope of protection of the invention.
A low-complexity branch processing circuit for a unified shader array based on the SIMT architecture, as shown in Fig. 1, comprises:
a predicate register unit (1), a predicate stack unit (2) and a control unit (3).
In the predicate register unit (1), each context has one predicate register, which stores the predicate register value of that context. The bit width of a predicate register equals the number of parallel units, and the number of predicate registers equals the number of contexts in which programs run.
In the predicate stack unit (2), each context has one predicate stack, which is used when the program executes nested branches to push and pop the predicate register. The bit width of a predicate stack equals the number of parallel units, the depth of a predicate stack equals the number of program nesting levels, and the number of predicate stacks equals the number of contexts in which programs run.
The control unit (3) reads and writes the predicate stacks when executing PUSH, POP and INV instructions, produces the new predicate register value, and bitwise-ANDs PredictMask with TaskMask to produce ExcuteMask.
Embodiment
1. Predicate register unit
The predicate register is the Predicate Mask. Within 1 SSC, 1 cycle corresponds to one 20-bit Predicate Mask. To support 8 warps each running for 4 cycles, the predicate register needs 32 sets of contexts. The Predicate Mask of an SFU is obtained by OR-ing the Predicate Masks of the 4 SCs in the same SPU.
The value of the predicate register is affected by the following: the results of conditional jump instructions executed by the SCs, and POP and INV operations on the predicate stack.
2. Control unit
This unit is responsible for generating the Excute Mask and for the PUSH, POP and INV operations on the predicate stack.
The DataMask of the SCs and SFU inside one SPU is identical to the TaskMask of that SPU, and the ExcuteMask is the bitwise AND of the DataMask and the PredicateMask.
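The mask combination stated above is a single bitwise AND. The sketch below assumes the 20-bit mask width used in this embodiment; the function name and the sample mask values are hypothetical.

```python
def execute_mask(data_mask, predicate_mask, width=20):
    """ExcuteMask in the text: lanes with valid data AND a true predicate."""
    lanes = (1 << width) - 1
    return data_mask & predicate_mask & lanes


# Hypothetical masks: the upper 12 lanes carry valid data,
# and every second lane satisfied the condition.
data_mask = 0b1111_1111_1111_0000_0000
predicate_mask = 0b1010_1010_1010_1010_1010
result = execute_mask(data_mask, predicate_mask)
print(f"{result:020b}")
```

A lane therefore executes only if it both carries valid data and satisfied the current branch condition; clearing either bit is enough to disable it.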
The PUSH operation reads the content of the predicate register (1 cycle, 20 bits in total) and writes it into the predicate stack;
The POP operation reads the predicate information from the top of the predicate stack and writes it into the predicate register;
The INV operation reads the content of the current predicate register (call it m) and the predicate information at the top of the predicate stack (call it n), inverts m bitwise, bitwise-ANDs the result with n, and writes the result ((~m) & n) into the predicate register.
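A short worked example of the INV formula, assuming 20-bit masks as in this embodiment; the sample values of m and n are hypothetical. It also checks that, when m only enables lanes that n enables, the "if" lanes (m) and the "else" lanes ((~m) & n) split n without overlap.

```python
WIDTH = 20
LANES = (1 << WIDTH) - 1


def inv(m, n):
    """Return (~m) & n, clipped to WIDTH bits because Python ints are unbounded."""
    return ~m & n & LANES


n = 0x0FFFF  # hypothetical saved predicate: the lower 16 lanes were active
m = 0x000F0  # hypothetical current predicate: lanes 4 to 7 took the 'if' path
else_mask = inv(m, n)
print(f"{else_mask:05X}")

# The two paths cover exactly the lanes that were active before the branch.
covers_all = (m | else_mask) == n
no_overlap = (m & else_mask) == 0
```

The AND with n is what keeps lanes that were already disabled by an enclosing branch from being re-enabled on the "else" path.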
3. Predicate stack unit
This stack stores the Predicate Mask. To support 8 warps of 4 cycles each, 32 sets of the stack are needed. For each context the stack is 20 bits wide and 32 entries deep (supporting 32 levels of nested conditional branches).
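The storage implied by these figures follows from simple arithmetic. The sketch below counts only the mask bits themselves; stack pointers and control logic are not counted, and the function name is hypothetical.

```python
def predicate_storage_bits(warps=8, cycles_per_warp=4, width=20, depth=32):
    """Return (contexts, register bits, stack bits) for the given configuration."""
    contexts = warps * cycles_per_warp
    register_bits = contexts * width        # one predicate register per context
    stack_bits = contexts * depth * width   # one predicate stack per context
    return contexts, register_bits, stack_bits


contexts, register_bits, stack_bits = predicate_storage_bits()
print(contexts, register_bits, stack_bits)
```

Because every term is a plain product, supporting more contexts, wider masks or deeper nesting only changes the parameters, which matches the scalability argument made in the beneficial effects.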

Claims (1)

1. A low-complexity branch processing circuit for a unified shader array based on the SIMT architecture, characterized by comprising:
a predicate register unit (1), a predicate stack unit (2) and a control unit (3);
Predicate register unit (1): when the instruction execution module (4) executes a condition-evaluation instruction, it outputs the result of the evaluation together with the context number to the predicate register unit (1), which stores the value in the predicate register of that context according to the context number; when the branch processing circuit executes a POP instruction, the control unit (3) reads the value from the predicate stack unit (2) and writes it into the predicate register of that context in the predicate register unit (1); when the branch processing circuit executes an INV instruction, the control unit (3) inverts the previous predicate register value, bitwise-ANDs it with the value at the top of the predicate stack, and writes the result into the predicate register of that context in the predicate register unit (1); the predicate register unit (1) outputs the predicate register value of every context to the control unit (3);
Control unit (3): connected to the task scheduling module (5), the IFID module (6), the predicate register unit (1) and the predicate stack unit (2); the control unit (3) receives the branch processing instructions issued by the IFID module (6), the branch processing instructions comprising the POP, INV and PUSH instructions; when executing a POP instruction, the control unit (3) reads the value of the context from the predicate stack unit (2) and transfers it to the predicate register unit (1); when executing a PUSH instruction, the control unit (3) writes the predicate register value of the current context, taken from the predicate register unit (1), into the predicate stack unit (2); when executing an INV instruction, the control unit (3) obtains the value at the top of the stack from the predicate stack unit (2), bitwise-ANDs that value with the bitwise inverse of the predicate register value, and returns the result to the predicate register unit (1); when the IFID module (6) issues a non-branch-processing instruction, the control unit (3) bitwise-ANDs the predicate register value from the predicate register unit (1) with the TaskMask from the task scheduling module (5) and transfers the result to the instruction execution module (4);
Predicate stack unit (2): receives three types of operation from the control unit (3), namely POP, PUSH and INV; for a POP operation, the predicate stack unit (2) reads the data from the top of the stack of the context selected by the context number supplied by the control unit (3) and returns it to the control unit (3); for a PUSH operation, the predicate stack unit (2) writes data to the top of the stack of the context selected by the context number supplied by the control unit (3); for an INV operation, the predicate stack unit (2) returns the data at the top of the stack of the selected context but does not perform the pop (read-out) operation, that is, the content of the stack is not affected.
CN201611140108.6A 2016-12-12 2016-12-12 Low complexity branch processing circuit of uniform dyeing array towards SIMT framework Pending CN106708780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611140108.6A CN106708780A (en) 2016-12-12 2016-12-12 Low complexity branch processing circuit of uniform dyeing array towards SIMT framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611140108.6A CN106708780A (en) 2016-12-12 2016-12-12 Low complexity branch processing circuit of uniform dyeing array towards SIMT framework

Publications (1)

Publication Number Publication Date
CN106708780A true CN106708780A (en) 2017-05-24

Family

ID=58937111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611140108.6A Pending CN106708780A (en) 2016-12-12 2016-12-12 Low complexity branch processing circuit of uniform dyeing array towards SIMT framework

Country Status (1)

Country Link
CN (1) CN106708780A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7543136B1 (en) * 2005-07-13 2009-06-02 Nvidia Corporation System and method for managing divergent threads using synchronization tokens and program instructions that include set-synchronization bits
US7617384B1 (en) * 2006-11-06 2009-11-10 Nvidia Corporation Structured programming control flow using a disable mask in a SIMD architecture
US20110078690A1 (en) * 2009-09-28 2011-03-31 Brian Fahs Opcode-Specified Predicatable Warp Post-Synchronization
CN102640132A (en) * 2009-09-28 2012-08-15 辉达公司 Efficient predicated execution for parallel processors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐元旭: "一种易实现的SIMT调度模型分析", 《微电子学与计算机》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163358A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method
CN110163358B (en) * 2018-02-13 2021-01-05 安徽寒武纪信息科技有限公司 Computing device and method
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112579164B (en) * 2020-12-05 2022-10-25 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method

Similar Documents

Publication Publication Date Title
CN108171317B (en) Data multiplexing convolution neural network accelerator based on SOC
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN111433758B (en) Programmable operation and control chip, design method and device thereof
CN107220023B (en) Embedded configurable FIFO memory
CN105912501B (en) A kind of SM4-128 Encryption Algorithm realization method and systems based on extensive coarseness reconfigurable processor
CN103221933A (en) Method and apparatus for moving data to a SIMD register file from a general purpose register file
Peng et al. A 2.92-Gb/s/W and 0.43-Gb/s/MG flexible and scalable CGRA-based baseband processor for massive MIMO detection
CN102053816B (en) Data shuffling unit with switch matrix memory and shuffling method thereof
Liu et al. An energy-efficient coarse-grained dynamically reconfigurable fabric for multiple-standard video decoding applications
Peemen et al. The neuro vector engine: Flexibility to improve convolutional net efficiency for wearable vision
CN112486903B (en) Reconfigurable processing unit, reconfigurable processing unit array and operation method thereof
CN102508803A (en) Matrix transposition memory controller
CN110704364A (en) Automatic dynamic reconstruction method and system based on field programmable gate array
Meloni et al. A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC
CN106708780A (en) Low complexity branch processing circuit of uniform dyeing array towards SIMT framework
CN108628693B (en) Processor debugging method and system
CN106406820A (en) Multi-issue instruction parallel processing method and device of network processor micro engine
Xu et al. HeSA: Heterogeneous systolic array architecture for compact CNNs hardware accelerators
Vokorokos et al. Innovative operating memory architecture for computers using the data driven computation model
Rossi et al. Application space exploration of a heterogeneous run-time configurable digital signal processor
CN106155979B (en) A kind of DES algorithm secret key expansion system and extended method based on coarseness reconstruction structure
CN109933372B (en) Multi-mode dynamic switchable architecture low-power-consumption processor
EP2965221B1 (en) Parallel configuration of a reconfigurable instruction cell array
CN102129495B (en) Method for reducing power consumption of reconfigurable operator array structure
CN109800867A (en) A kind of data calling method based on FPGA chip external memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170524
