CN115495155B

CN115495155B - Hardware loop processing device suitable for general-purpose processor

Info

Publication number: CN115495155B
Application number: CN202211442932.2A
Authority: CN
Inventors: 李东声
Original assignee: Beijing Shudu Information Technology Co ltd
Current assignee: Beijing Shudu Information Technology Co ltd
Priority date: 2022-11-18
Filing date: 2022-11-18
Publication date: 2023-03-24
Anticipated expiration: 2042-11-18
Also published as: CN115495155A

Abstract

The invention provides a hardware loop processing device suitable for a general-purpose processor, comprising a microarchitecture that applies loop instructions. The microarchitecture comprises a hardware loop predictor that predicts whether execution exits the loop body or returns to its entry address. The predicted address is sent to an instruction cache unit, which retrieves the instruction at that address and sends it to an instruction decoding unit. The instruction is decoded to generate control signals that direct an instruction execution unit to perform the required operation. When execution of the instruction completes, commit information is returned through an instruction commit unit to the hardware loop predictor, which is updated accordingly. The hardware loop predictor comprises a loop buffer, a jump-out state tracking table, and a control unit. The invention significantly improves instruction compilation efficiency and hardware execution efficiency and reduces restrictions on instruction set design.

Description

Hardware loop processing device suitable for general-purpose processor

Technical Field

The present invention relates to hardware loop prediction, and more particularly to a hardware loop processing apparatus suitable for a general-purpose processor.

Background

Loops are a common construct in computer programs. In software a loop is typically written as for (i = 0; i < N; i++) { loop body }, and in general-purpose processor instruction set designs the compiler typically compiles the last instruction of the loop body into a branch instruction. A branch instruction is an instruction in a computer program that can cause the computer to begin executing a different sequence of instructions, deviating from its default behavior of executing instructions in order. For example, a conditional branch instruction can implement a two-way branch (if … else …), which in a loop indicates whether the loop is exited or executed again.

The design of the processor hardware loop components is strongly related to the instruction set design. For loop operations, there are generally three implementations in instruction set design:

(1) Add a conditional branch instruction at the end of the loop body; exit once the counting condition is met, and continue looping otherwise.

(2) Design a dedicated loop (LOOP) instruction that gives the loop count and the loop body entry/exit information.

(3) Replace the branch instruction with ordinary instructions that achieve the same effect.

Currently, mainstream general-purpose processors use scheme (1) to implement loops, and their instruction sets generally do not include a dedicated loop instruction. In an ARM instruction set example, the loop continues or exits depending on whether the value in register R1 equals the immediate #1.

General-purpose processors that support a custom instruction set, dedicated artificial intelligence processors, and digital signal processors (DSPs) use scheme (2); one example is the loop instruction design in the Qualcomm Hexagon V5x instruction set.

However, such an instruction design generally comes with added restrictions.

As a result, scheme (2) in practice supports at most two levels of loop nesting. Scheme (3) is generally not a practical option because of its poor compilation efficiency.

The three implementations mentioned in the above section all have their inherent drawbacks in current industry products:

With respect to scheme (1), without branch prediction the processor must wait until a branch instruction (e.g., a conditional jump instruction) has passed through the execution stage before the next instruction can enter the fetch stage. A branch predictor avoids this wait by determining whether a branch (e.g., a conditional jump) is more likely to be taken. The instruction on the more likely path (e.g., the next sequential instruction or a different instruction) is then fetched, and one or more instructions starting with the predicted instruction are executed speculatively. If the processor later detects a branch misprediction, the pipeline is flushed (the speculatively executed instructions are discarded) and restarted with the correct instruction. Predicting a loop requires more than predicting the conditional branch at the end of the loop body: the total iteration count of the current loop must be estimated and the current iteration number must be tracked in real time. This increases the implementation overhead of the branch predictor, and the prediction accuracy is difficult to keep high, so a loop predictor is not even regarded as a necessary part of the branch prediction unit in general-purpose processor designs. Furthermore, current artificial intelligence processors and shallow-pipeline processors typically have no branch predictor at all, mainly because stalls and flushes of a shallow pipeline are considered tolerable. In addition, the instruction set of an artificial intelligence processor differs from that of a general-purpose superscalar processor, so for artificial intelligence processors and shallow-pipeline general-purpose processors, prior-art schemes tend to be tightly coupled to the instruction set even when branch prediction is implemented.

With respect to scheme (2), a significant drawback is that the instruction set architecture (ISA) definition of branch instructions is compromised: to avoid performance penalties, various restrictions are imposed, such as forbidding branch instructions inside the loop body or forbidding nesting of loop bodies, which makes programming unfriendly for software designers.

With respect to scheme (3), if the compiler uses ordinary instructions instead of a branch instruction, one branch instruction requires 5 to 6 ordinary instructions to describe the same behavior, which is very inefficient.

References

[1] Hexagon V5x Programmer's Reference Manual. Qualcomm Technologies, Inc.

[2] ARM® Architecture Reference Manual: ARMv8, for ARMv8-A architecture profile. ARM Limited.

Disclosure of Invention

The invention provides a hardware loop processing device suitable for a general-purpose processor. Targeting reduced-instruction-set general-purpose processors and artificial intelligence processors, it offers a hardware loop design with low hardware resource overhead, high performance benefit, and few restrictions on instruction set design, and overcomes the shortcomings of the prior art. The technical solution is as follows:

A hardware loop processing apparatus suitable for a general-purpose processor comprises a microarchitecture that applies loop instructions, the microarchitecture comprising:

a hardware loop predictor, for providing a prediction of whether execution exits the loop body or returns to its entry address;

an instruction cache unit, for receiving the predicted entry address returned by the hardware loop predictor and retrieving the instruction at the identified address, the fetched instruction being sent to an instruction decoding unit;

an instruction decoding unit, in which the instruction is decoded to generate control signals that control the operation of an instruction execution unit so as to perform the required operation;

an instruction execution unit, which sends execution completion information to an instruction commit unit when execution of the instruction completes;

an instruction commit unit, which returns the commit information to the hardware loop predictor for updating.

The loop instruction needs to pass three parameters, directly or indirectly, to the microarchitecture: the loop body entry address, the loop body exit address, and the loop count.

The first design form of the loop instruction is as follows:

LOOP begin, iteration

LOOPEND

The loop body start address begin and the loop count iteration are given in the LOOP instruction either through a register Rs or as immediate values; the loop body exit address is the address of the LOOPEND instruction.

The second design form of the loop instruction is as follows: LOOP begin, end, iteration

All three parameters are given in a single LOOP instruction. The loop body entry address begin and the loop count iteration are given either through a register Rs1 or as immediate values; the register Rs2 holds either the loop body exit address directly or the offset of the exit address relative to the entry address.

The hardware loop predictor is accessed each time the instruction cache unit fetches instructions. When an exit of the loop body is indexed, it predicts the address of the instruction block to be executed next. A predicted fetch block comprises one or more sequential instructions in the memory address space and is identified by the address of its first instruction, which serves as a tag. If no loop body exit address is identified within the fetched instruction block, the starting address of the next fetch block is either the sequential address following the last instruction of the current block or an address predicted by other branch prediction components.

The instruction cache unit retrieves the instruction at the identified address from the memory system using the input address. If the input address hits in the instruction cache unit, the corresponding instruction is output directly from the instruction cache unit; otherwise the instruction is requested from a lower level of the memory hierarchy and is output from the instruction cache unit once retrieved.

The hardware loop predictor comprises a loop buffer, a jump-out state tracking table, and a control unit. The loop buffer determines whether the loop is predicted to terminate and exit or to return to the start address and continue; the jump-out state tracking table records the exit indices of hardware loops that have been exited; and the control unit creates, updates, and clears the entries of the loop buffer.

The loop buffer may hold a plurality of entries, each comprising, in order, a valid bit, a tag, a loop count, and a loop start address. The valid bit indicates whether the entry is valid. The tag and the loop start address are each part of a program counter; the tag is derived from the loop body exit address given in the loop instruction and is obtained from a register or from the address of LOOPEND. The initial loop count and the loop start address are obtained from the information parsed from the loop instruction, and the loop count is initialized to the total number of times the loop is to be executed.

When the loop is predicted to terminate, the instruction stream is predicted to continue fetching sequentially; otherwise it jumps to the loop start address stored in the entry.

The jump-out state tracking table records the exit indices of hardware loops that have been exited: for a given loop instruction, an N-bit register is maintained at each pipeline stage from the predictor output to the commit stage of the loop instruction, where N depends on the maximum number of entries in the loop buffer.

The control unit creates, updates, and clears loop buffer entries. Creating means that a loop buffer entry is created when a new LOOP instruction is received; updating means that the loop counter in the corresponding entry is decremented by 1 each time the loop body exit address is passed; clearing means that all entries in the loop buffer are emptied after the outermost loop body has been correctly exited.

If the instruction of the previous iteration has not yet been confirmed as correctly committed when the instruction stream reaches the loop body exit address again, the pipeline is stalled, so that any loop level allows speculative execution to run ahead by only one iteration.

The hardware loop processing device suitable for a general-purpose processor implements an efficient hardware loop predictor with low hardware resource overhead in the chip design of general-purpose processors and artificial intelligence processors based on a reduced instruction set (RISC) or very long instruction word (VLIW) architecture. It significantly improves instruction compilation efficiency and hardware execution efficiency and reduces restrictions on instruction set design.

Drawings

FIG. 1 is a schematic diagram of the microarchitectural design of the hardware loop processing apparatus;

FIG. 2 is a schematic diagram of the internal structure of the hardware loop predictor;

FIG. 3 is a schematic diagram showing the details of an entry in the loop buffer.

Detailed Description

In general-purpose processors, artificial intelligence inference/training processors, and digital signal processor chips, loops are a common programming construct, so it is worthwhile to design a hardware loop predictor that predicts real loop instructions. Compared with a general branch predictor, the matched design of the instruction set and the hardware loop processing device achieves a prediction accuracy of 100% and can significantly improve the operating efficiency of the processor.

The hardware loop processing device comprises a microarchitecture designed around a hardware loop predictor, and loop operation is realized in combination with a loop instruction. The loop instruction needs to pass three parameters, directly or indirectly, to the hardware: the loop body entry address, the loop body exit address, and the loop count. The instruction has two design forms:

Form 1: LOOP begin, iteration

LOOPEND

The loop body start address begin and the loop count iteration may be given in the LOOP instruction through a register Rs or as immediate values. The loop body exit address is the address of the LOOPEND instruction.

Form 2: LOOP begin, end, iteration

Compared with Form 1, Form 2 needs no dedicated LOOPEND instruction, but all three parameters must be given in a single LOOP instruction, which places demands on the instruction encoding length; the form can be chosen flexibly according to the actual situation. As in Form 1, the loop body entry address begin and the loop count iteration may be given through a register Rs1 or as immediate values. The register Rs2 may hold either the loop body exit address directly or the offset of the exit address relative to the entry address; the offset option reduces register overhead but requires additional computational resources.
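As a minimal sketch of how both forms resolve to the same three parameters, the Python below models a decoded loop instruction. The names LoopParams, from_form1, from_form2, and the end_is_offset flag are hypothetical and serve only to illustrate the text; they are not the encoding defined by the invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopParams:
    begin: int      # loop body entry address
    end: int        # loop body exit address
    iteration: int  # loop count


def from_form1(begin: int, iteration: int, loopend_addr: int) -> LoopParams:
    # Form 1: the exit address is the address of the LOOPEND instruction.
    return LoopParams(begin, loopend_addr, iteration)


def from_form2(begin: int, rs2: int, iteration: int, end_is_offset: bool) -> LoopParams:
    # Form 2: Rs2 holds either the exit address itself or its offset from begin.
    # The offset variant costs one extra addition (the "additional computation").
    end = begin + rs2 if end_is_offset else rs2
    return LoopParams(begin, end, iteration)
```

For instance, from_form1(0x100, 8, 0x120), from_form2(0x100, 0x120, 8, False), and from_form2(0x100, 0x20, 8, True) all describe the same loop.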

As shown in FIG. 1, the microarchitecture of the hardware loop processing apparatus includes a hardware loop predictor that predicts whether execution exits the loop body or returns to its entry address. The predicted address is sent to an instruction cache unit, which retrieves the instruction at the identified address; the fetched instruction is sent to an instruction decoding unit, where it is decoded to generate control signals that control the operation of an instruction execution unit to perform the required operation. When execution of the instruction completes, commit information is returned to the hardware loop predictor through the instruction commit unit for updating. The instruction cache unit retrieves the instruction from the memory system using the input address: if the input address hits in the instruction cache unit, the corresponding instruction is output directly from the instruction cache unit; otherwise the instruction is requested from a lower level of the memory hierarchy and is output from the instruction cache unit once retrieved.
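The hit/miss behavior of the instruction cache unit described above can be sketched as follows. This is an illustrative model only: the class name InstructionCache, the use of a dictionary as the lower memory level, and the misses counter are assumptions and do not come from the patent.

```python
class InstructionCache:
    def __init__(self, lower_level):
        self.lines = {}                 # address -> cached instruction
        self.lower_level = lower_level  # hypothetical lower memory level (a dict)
        self.misses = 0

    def fetch(self, address):
        if address not in self.lines:
            # Miss: request the instruction from the lower level, then keep it.
            self.misses += 1
            self.lines[address] = self.lower_level[address]
        # Hit, or just filled: the instruction is output from the cache.
        return self.lines[address]
```

Fetching the same address twice results in one miss and one hit, and in both cases the instruction is delivered by the cache itself.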

The hardware loop predictor is accessed each time the instruction cache unit fetches instructions. When an exit of the loop body is indexed, the address of the next instruction block to be executed is predicted; a predicted fetch block may include one or more sequential instructions within the memory address space. The block is identified by the address of its first instruction, which serves as a tag, and the number of instructions per block is typically fixed, for example 1, 2, 4, 8, 16, or 32.

If no loop body exit address is identified within the fetched instruction block, the starting address of the next fetched instruction block is either the sequential address following the address of the last instruction of the current instruction block or a predicted (jump) address dependent on other branch prediction components.
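The selection of the next fetch block address can be sketched as below, assuming fixed-size instructions. The function name and all parameters (loop_exits as a set of exit addresses, loop_target as a callback standing in for the loop predictor, branch_prediction as the output of other predictors) are hypothetical, and giving the loop predictor priority over other predictors is an assumption of this sketch.

```python
def next_fetch_address(block_start, block_size, instr_bytes,
                       loop_exits, loop_target, branch_prediction=None):
    """Return the start address of the next fetch block."""
    block_end = block_start + block_size * instr_bytes
    # If a loop body exit lies inside the block, the loop predictor decides.
    for addr in range(block_start, block_end, instr_bytes):
        if addr in loop_exits:
            return loop_target(addr)
    # Otherwise use another predictor's address if one is available ...
    if branch_prediction is not None:
        return branch_prediction
    # ... or fall through to the sequential address after the block.
    return block_end
```

With a block of four 4-byte instructions starting at 0x100 and no loop exit inside it, the next block starts at 0x110.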

The internal structure of the hardware loop predictor is shown in FIG. 2 and comprises a loop buffer, a jump-out state tracking table, and a control unit. The loop buffer determines whether the loop is predicted to terminate and exit or to return to the start address and continue; the jump-out state tracking table records the exit indices of hardware loops that have been exited; and the control unit creates, updates, and clears the entries of the loop buffer.

The loop buffer may hold a plurality of entries. As shown in FIG. 3, each entry comprises, in order, a valid bit, a tag, a loop count, and a loop start address. The tag and the loop start address are each typically part of a program counter (typically a 32-bit binary number). The tag is derived from the loop body exit address given in the instruction and, in the invention, can be obtained from a register or from the address of LOOPEND; the initial loop count and the loop start address are obtained from the information parsed from the instruction. The loop count is initialized to the total number of times the loop needs to be executed.
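One possible software model of a loop buffer entry is sketched below. The field order follows the description; the 12-bit tag width (TAG_BITS), the choice of the low address bits as the tag, and the names LoopBufferEntry and make_entry are assumptions made for illustration, since the text only states that the tag is part of the program counter.

```python
from dataclasses import dataclass

TAG_BITS = 12  # assumed tag width; not specified in the description


@dataclass
class LoopBufferEntry:
    valid: bool = False
    tag: int = 0    # part of the loop body exit address
    count: int = 0  # remaining loop count
    start: int = 0  # loop start address


def make_entry(begin: int, end: int, iteration: int) -> LoopBufferEntry:
    # Tag taken from the low bits of the exit address (an assumption);
    # count initialized to the total loop count parsed from the instruction.
    tag = end & ((1 << TAG_BITS) - 1)
    return LoopBufferEntry(valid=True, tag=tag, count=iteration, start=begin)
```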

When the loop buffer is indexed, the valid bit determines whether the entry is valid. If the entry is valid and the tag matches (the tag comes from part of the address of the last instruction in the loop body, so a match means that the fetched instruction block contains the last instruction of the loop body), the decision is made according to the current loop count. If the count is 0, the loop is predicted to terminate and exit. Otherwise, execution returns to the loop start address and the current loop count in the entry is decremented by 1. That is, when the loop is predicted to terminate the instruction stream is predicted to continue fetching sequentially, and otherwise it jumps to the loop start address stored in the entry. The number of entries in the loop predictor can be chosen according to the actual application scenario. When a dedicated loop instruction is defined in the instruction set architecture and the total loop count and loop body length are given, the prediction accuracy of the loop buffer is expected to be 100%.
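The decision rule above can be sketched in a few lines. The entry is modeled as a plain dictionary and the tag is compared against the full address for simplicity; both choices, along with the function name predict and the fallthrough parameter, are assumptions of this sketch.

```python
def predict(entry, addr, fallthrough):
    """Return the predicted next fetch address for the instruction at addr."""
    # entry: {"valid": bool, "tag": int, "count": int, "start": int}
    if not (entry["valid"] and entry["tag"] == addr):
        return fallthrough      # not a tracked loop exit: no loop prediction
    if entry["count"] == 0:
        return fallthrough      # predicted to terminate: fetch sequentially
    entry["count"] -= 1         # predicted to continue: consume one round
    return entry["start"]       # jump back to the loop start address


entry = {"valid": True, "tag": 0x120, "count": 2, "start": 0x100}
trace = [predict(entry, 0x120, 0x124) for _ in range(3)]
```

Under this literal reading of the rule, an entry whose count starts at 2 is predicted to jump back twice and then fall through, so trace holds 0x100, 0x100, 0x124. How the initial count relates to the iteration operand of the LOOP instruction is not pinned down by this sketch.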

Since branch instructions are allowed within the loop body, a loop jump may occur on a wrong instruction path and must be corrected afterwards. Because the LOOP instruction differs from a normal branch instruction, the instruction execution unit cannot record all the information of the hardware loop and therefore cannot correct a wrong loop jump, so the predictor itself must have a recovery mechanism.

The jump-out state tracking table records the exit indices (possibly at several nesting levels) of hardware loops that have been exited. While a given loop instruction is in flight, that is, in the pipeline from the predictor output to the commit stage of the loop instruction, each stage maintains an N-bit register, where N depends on the maximum number of entries in the loop buffer. The commit stage of the loop instruction is handled by the instruction commit unit. When the hardware loop predictor predicts that the current address is a loop body exit and the current loop count is 0, the corresponding bit of the jump-out state tracking table for the pipeline stage holding the current loop instruction is set to 1. After the instruction at that address has been committed successfully, the corresponding bits of all jump-out state tracking tables in the preceding pipeline stages are cleared to 0. When the instruction stream is found to be on a wrong path, the pipeline is flushed and, at the same time, the loop counts of the loop buffer entries whose bits are 1 are restored according to the contents of the latest jump-out state tracking table.
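A simplified model of the jump-out state tracking table is sketched below: one N-bit mask per pipeline stage, with bit i standing for loop buffer entry i. The class and method names are hypothetical. The sketch only reports which entries need recovery on a flush (taken here as the union of bits still set in any stage, which is an assumption); the value a count is restored to is not specified in the description and is left out.

```python
class JumpOutTracker:
    def __init__(self, num_stages, num_entries):
        self.num_entries = num_entries
        self.stage_bits = [0] * num_stages  # one N-bit register per stage

    def mark_exit(self, stage, entry_index):
        # Predictor saw a loop body exit with count 0 for this entry.
        self.stage_bits[stage] |= 1 << entry_index

    def commit(self, entry_index):
        # Exit instruction committed correctly: clear the bit in every stage.
        for s in range(len(self.stage_bits)):
            self.stage_bits[s] &= ~(1 << entry_index)

    def entries_to_recover(self):
        # On a flush: entries whose bit is still 1 in some in-flight stage.
        pending = 0
        for bits in self.stage_bits:
            pending |= bits
        return [i for i in range(self.num_entries) if (pending >> i) & 1]
```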

The role of the control unit is to create, update, and clear loop buffer entries. As mentioned above, a new loop buffer entry is created when a new LOOP instruction is received. When loop bodies are nested at multiple levels, several entries of the loop buffer are valid at the same time. Each time the loop body exit address is passed, the loop counter in the corresponding entry is decremented by 1. When the instruction stream is on a wrong path, the pipeline flush must restore the loop counts of the loop buffer entries whose bits are 1, according to the contents of the latest jump-out state tracking table. When the outermost loop body has been correctly exited (the instruction at the exit commits correctly and no pipeline flush occurs), all entries in the loop buffer are cleared.
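The create, update, and clear operations can be sketched as follows. The class name LoopBufferControl, the dictionary layout of an entry, and the choice to raise an error when the buffer is full are assumptions; the description does not say what happens when no free entry is available.

```python
class LoopBufferControl:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # several entries are valid at once for nested loops

    def create(self, begin, end, iteration):
        # A new LOOP instruction was received: allocate an entry.
        if len(self.entries) >= self.capacity:
            raise RuntimeError("loop buffer full")  # assumed behavior
        self.entries.append({"start": begin, "tag": end, "count": iteration})

    def update(self, exit_addr):
        # The loop body exit address was passed: decrement the matching counter.
        for e in self.entries:
            if e["tag"] == exit_addr and e["count"] > 0:
                e["count"] -= 1

    def clear(self):
        # The outermost loop body was correctly exited: empty all entries.
        self.entries.clear()
```

In a two-level nest, passing the inner exit address changes only the inner entry's counter, and clear() is called once the outermost exit has committed.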

Note that the loop body allows speculative jumps; that is, the instruction stream may continue executing before the instruction at the current exit address is confirmed as committed. However, if the instruction of the previous iteration has not yet been confirmed as correctly committed when the instruction stream reaches the loop body exit address again, the pipeline is stalled. In other words, any loop level allows speculative execution to run ahead by only one iteration.
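The one-iteration speculation limit can be sketched as a small gate keyed by the loop body exit address. SpeculationGate and its methods are hypothetical names, and tracking outstanding rounds in a set is only one possible way to model the rule.

```python
class SpeculationGate:
    """Allow each loop level at most one uncommitted round."""

    def __init__(self):
        self.pending = set()  # exits whose previous round is not yet committed

    def reach_exit(self, exit_addr):
        # Returns True when the pipeline must stall at this exit.
        if exit_addr in self.pending:
            return True
        self.pending.add(exit_addr)
        return False

    def commit(self, exit_addr):
        # The instruction at the exit was confirmed as correctly committed.
        self.pending.discard(exit_addr)
```

Reaching the same exit twice without an intervening commit stalls on the second arrival; after the commit the stream may proceed again.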

For compiler instruction scheduling, care should be taken that, after the LOOP instruction is issued and before the loop body entry address is reached, enough cycles of other instructions are inserted to cover the time the hardware pipeline needs to resolve the LOOP instruction.

The invention has the following beneficial effects:

(1) The predictor gives its prediction in the same clock cycle in which the fetch address is computed, so pipeline flushes caused by prediction are avoided, zero-bubble prediction is achieved, and the prediction accuracy reaches 100%.

(2) The design is general and places no restrictions on the loop instruction design of the instruction set architecture (ISA). Existing schemes often restrict branch instructions at instruction set design time, for example by forbidding branch instructions inside the loop body or forbidding nested loop bodies.

(3) The design is friendly to compilers: no useless instructions that occupy machine cycles need to be inserted to handle loops.

Claims

1. A hardware loop processing apparatus suitable for a general-purpose processor, comprising a microarchitecture that applies loop instructions, the microarchitecture including:

a hardware loop predictor, for providing a prediction of whether execution exits the loop body or returns to its entry address; the hardware loop predictor is accessed each time the instruction cache unit fetches instructions and, when an exit of the loop body is indexed, can predict the address of the instruction block to be executed next, wherein the predicted fetch block comprises one or more sequential instructions in the memory address space and is identified by the address of the first instruction in the block, which serves as a tag; if no loop body exit address is identified within the fetched instruction block, the starting address of the next fetch block is either the sequential address following the last instruction of the current block or an address predicted by other branch prediction components;

2. The hardware loop processing apparatus of claim 1, wherein: the loop instruction needs to pass three parameters, directly or indirectly, to the microarchitecture: the loop body entry address, the loop body exit address, and the loop count;

the first design form of the loop instruction is as follows:

LOOP begin, iteration

LOOPEND

3. The hardware loop processing apparatus of claim 1, wherein: the instruction cache unit retrieves the instruction at the identified address from the memory system using the input address; if the input address hits in the instruction cache unit, the corresponding instruction is output directly from the instruction cache unit; otherwise the instruction is requested from a lower level of the memory hierarchy and is output from the instruction cache unit once retrieved.

4. The hardware loop processing apparatus of claim 1, wherein: the hardware loop predictor comprises a loop buffer, a jump-out state tracking table, and a control unit, wherein the loop buffer determines whether the loop is predicted to terminate and exit or to return to the start address and continue, the jump-out state tracking table records the exit indices of hardware loops that have been exited, and the control unit creates, updates, and clears the entries of the loop buffer.

5. The hardware loop processing apparatus of claim 4, wherein: the loop buffer may hold a plurality of entries, each comprising, in order, a valid bit, a tag, a loop count, and a loop start address; the valid bit indicates whether the entry is valid; the tag and the loop start address are each part of a program counter; the tag is derived from the loop body exit address given in the loop instruction and is obtained from a register or from the address of LOOPEND; the initial loop count and the loop start address are obtained from the information parsed from the loop instruction, and the loop count is initialized to the total number of times the loop is to be executed.

6. The hardware loop processing apparatus of claim 5, wherein: when the loop is predicted to terminate, the instruction stream is predicted to continue fetching sequentially; otherwise it jumps to the loop start address stored in the entry.

7. The hardware loop processing apparatus of claim 4, wherein: the jump-out state tracking table records the exit indices of hardware loops that have been exited, that is, an N-bit register is maintained at each pipeline stage from the predictor output to the commit stage of the loop instruction, where N depends on the maximum number of entries in the loop buffer.

8. The hardware loop processing apparatus of claim 4, wherein: the control unit creates, updates, and clears loop buffer entries, wherein creating means that a loop buffer entry is created when a new LOOP instruction is received; updating means that the loop counter in the corresponding entry is decremented by 1 each time the loop body exit address is passed; and clearing means that all entries in the loop buffer are emptied after the outermost loop body has been correctly exited.

9. The hardware loop processing apparatus of claim 8, wherein: if the instruction of the previous iteration has not yet been confirmed as correctly committed when the instruction stream reaches the loop body exit address again, the pipeline is stalled, so that any loop level allows speculative execution to run ahead by only one iteration.