WO2018082344A1 - Software and hardware cooperative branch instruction prediction method and apparatus - Google Patents

Software and hardware cooperative branch instruction prediction method and apparatus

Info

Publication number
WO2018082344A1
WO2018082344A1 · PCT/CN2017/093343 · CN2017093343W
Authority
WO
WIPO (PCT)
Prior art keywords
branch
register
prediction
instruction
bit sequence
Prior art date
Application number
PCT/CN2017/093343
Other languages
English (en)
French (fr)
Inventor
施慧
卢腾
顾成成
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2018082344A1 publication Critical patent/WO2018082344A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • the present invention relates to the field of electronic technologies, and in particular, to a software and hardware cooperative branch instruction prediction method and apparatus.
  • a branch is a point at which the flow of a running program may change.
  • branches are divided into unconditional branches and conditional branches.
  • an unconditional branch only requires the CPU (Central Processing Unit) to execute instructions in order, while a conditional branch must wait for the processing result of the preceding instruction before it is decided whether the program's direction of execution changes.
  • Branch prediction is a technique used by modern processors to increase CPU execution speed: the branch flow of the program is predicted first, and then one of the branch paths is fetched and decoded in advance to reduce the time spent waiting for the decoder.
  • a conditional branch of a program is executed only after the preceding instruction has produced its result in the pipeline; therefore, while the CPU waits for that result, the pipeline sits idle waiting for the branch instruction, wasting clock cycles.
  • FIG. 1 is a flowchart of branch instruction execution provided by a prior art solution. In this execution of the branch instruction, 7 clock cycles are needed to complete the branch jump decision. If the CPU can predict whether the branch will be taken before the result of the previous instruction is available, the corresponding instructions can be executed in advance, avoiding idle waiting in the pipeline and increasing CPU speed.
  • FIG. 2 is a flowchart of another branch instruction execution provided by a prior art solution. During branch instruction execution, if the branch instruction is predicted correctly, the program can jump within 5 clock cycles; otherwise, 11 clock cycles are required.
  • FIG. 3 is a schematic diagram of a dynamic branch prediction method provided by a prior art solution.
  • the method is 2-Level Adaptive Training.
  • for each branch jump instruction there is a corresponding Branch History Register (HR).
  • the branch history register accumulates the result of each jump: on a taken branch the register is shifted left and a 1 is shifted in; on a not-taken branch it is shifted left and a 0 is shifted in.
  • the resulting history jump record is stored in the HR.
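  • As an illustration of this kind of two-level scheme (the widths, table size, and function names below are assumptions made for clarity, not values taken from the cited design), a minimal C sketch of a history register indexing a pattern table of 2-bit saturating counters:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HIST_BITS 4                     /* assumed history length */
#define PT_SIZE   (1u << HIST_BITS)     /* one 2-bit counter per history pattern */

typedef struct {
    uint8_t hr;              /* branch history register: 1 = taken, 0 = not taken */
    uint8_t pt[PT_SIZE];     /* pattern table of 2-bit saturating counters (0..3) */
} two_level_predictor;

/* Predict using the state indexed by the current history pattern. */
static bool predict(const two_level_predictor *p)
{
    return p->pt[p->hr & (PT_SIZE - 1)] >= 2;   /* counter >= 2 means "taken" */
}

/* Train: update the indexed counter with the actual outcome, then shift the
 * outcome into the history register (shift left, fill 1 for taken, 0 otherwise). */
static void update(two_level_predictor *p, bool taken)
{
    uint8_t idx = p->hr & (PT_SIZE - 1);
    if (taken && p->pt[idx] < 3) p->pt[idx]++;
    if (!taken && p->pt[idx] > 0) p->pt[idx]--;
    p->hr = (uint8_t)((p->hr << 1) | (taken ? 1u : 0u));
}

int main(void)
{
    two_level_predictor p = {0};
    /* A regular loop branch: taken three times, then not taken, repeated. */
    bool pattern[] = { true, true, true, false, true, true, true, false };
    for (int i = 0; i < 8; i++) {
        bool guess = predict(&p);
        printf("predicted %d, actual %d\n", guess, pattern[i]);
        update(&p, pattern[i]);
    }
    return 0;
}
```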
  • Embodiments of the present invention provide a software and hardware cooperative branch instruction prediction method and apparatus, which can solve the technical problem of low branch prediction accuracy in prior art solutions.
  • the present application provides a software and hardware cooperative branch instruction prediction method, including: a processor includes a branch prediction register, a count register, and a branch predictor; the branch prediction register obtains a bit sequence representing the data length of the operation data and outputs the bit sequence of the data length to the branch predictor in order; the count register counts the number of valid bits of the bit sequence of the data length and, according to the number of valid bits of the bit sequence, instructs the branch prediction register to output the bit sequence of the data length to the branch predictor in order; the branch predictor guides branch instruction prediction according to the bit sequence of the data length, thereby improving the accuracy of branch prediction and improving program performance.
  • for example, a data copy instruction has four different lengths: 8 bytes, 4 bytes, 2 bytes, and 1 byte.
  • for an array whose length is 10 bytes, the length can be converted into the bit sequence 1010.
  • according to the bit pattern 1010, an 8-byte copy instruction and a 2-byte copy instruction are executed in turn, while no 4-byte copy instruction or 1-byte copy instruction is needed.
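  • For illustration only, a minimal C sketch of this length-to-bit-sequence selection, with memcpy standing in for the size-specific copy instructions; the function name and the assumption that the length fits in the four size bits (at most 15 bytes) are choices made here, not part of the original text:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Each bit of the length, from the 8-byte position down to the 1-byte position,
 * says whether the copy instruction of that size is needed. For length 10 = 1010b
 * the 8-byte and 2-byte copies are taken, the 4-byte and 1-byte copies are skipped. */
static void copy_by_length_bits(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t off = 0;
    for (size_t chunk = 8; chunk >= 1; chunk >>= 1) {
        if (len & chunk) {                         /* one bit of the sequence 1010 */
            memcpy(dst + off, src + off, chunk);   /* stands in for the chunk-sized copy instruction */
            off += chunk;
        }
    }
}

int main(void)
{
    uint8_t src[10], dst[10] = {0};
    for (int i = 0; i < 10; i++) src[i] = (uint8_t)('A' + i);
    copy_by_length_bits(dst, src, sizeof src);     /* 10 -> 1010b -> 8-byte + 2-byte copy */
    printf("%.10s\n", (const char *)dst);          /* prints ABCDEFGHIJ */
    return 0;
}
```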
  • the branch prediction register and the count register are each operated through an extended instruction set, where the extended instruction set includes an opcode, a register operand, a first immediate operand, and a second immediate operand; the opcode is an instruction mnemonic; the register operand stores the data length value for branch prediction; the first immediate operand identifies the start bit of the length used for branch prediction, and the second immediate operand identifies the end bit of that length.
  • the extended instruction set may include several new instructions for directly operating the branch prediction register and the count register.
  • the format of the instructions is not necessarily fixed; they can read input from the user and feed it to the branch prediction module, providing a software-to-hardware interaction function.
  • the instructions do not affect the correctness of program execution and only affect performance.
  • the current jump instruction during program execution is obtained; the history jump record corresponding to the current jump instruction is looked up from the branch history register and used as an index into a preset pattern table to find the corresponding operation state; the operation state is input to the branch predictor for branch instruction prediction.
  • the prediction result of the branch instruction prediction is obtained, and the prediction result is input into the branch history register.
  • the execution result is stored in the branch history register as the history jump record, so that it can be input to the state machine corresponding to the branch history register for training, adaptively optimizing the prediction result of that state machine.
  • the prediction result of the branch prediction register has the highest priority and can override the prediction results of other branch predictors.
  • the present application provides a software and hardware cooperative branch instruction prediction apparatus configured to implement the method and functions performed by the resource multiplexing apparatus in the above first aspect; it is implemented by hardware/software, and the hardware/software includes units corresponding to the above functions.
  • the present application provides a software and hardware cooperative branch instruction prediction device, including a processor, a memory, and a communication bus, where the communication bus is used to implement connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the software and hardware cooperative branch instruction prediction method provided in the above first aspect.
  • Figure 1 is a flow chart showing the execution of a branch instruction provided by the prior art solution
  • FIG. 3 is a schematic diagram of a dynamic branch prediction method provided by a prior art solution
  • FIG. 4 is a schematic diagram of irregular consecutive jump branches provided by a prior art solution
  • FIG. 5 is a schematic flowchart of consecutive branch decisions provided by the present invention.
  • FIG. 6 is a schematic diagram of a copy instruction combination manner provided by the present invention.
  • FIG. 7 is a schematic structural diagram of a hardware module for branch prediction according to an embodiment of the present invention.
  • FIG. 8 is a schematic flowchart of a software and hardware cooperative branch instruction prediction method according to an embodiment of the present invention.
  • FIG. 9 is a schematic flow chart of a branch instruction prediction provided by the present invention.
  • FIG. 10 is a schematic diagram of an extended instruction set according to an embodiment of the present invention.
  • FIG. 11 is a schematic flowchart of branch prediction by using an extended instruction set according to the present invention.
  • FIG. 12 is a schematic structural diagram of a software and hardware cooperative branch instruction predicting apparatus according to an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of a software and hardware cooperative branch instruction prediction apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of consecutive branch decisions provided by the present invention. From a performance point of view, after the data has been processed with the largest-byte copy instruction, a scenario of consecutive branch decisions arises when the tail data is processed.
  • FIG. 6 is a schematic diagram of a copy instruction combination manner provided by the present invention.
  • the data copy instructions include four different lengths: 8 bytes, 4 bytes, 2 bytes, and 1 byte. To copy a 10-byte array using only the 1-byte copy instruction, 1*10 operations are needed, which is obviously unfavorable for program performance.
  • the optimal instruction combination is an 8-byte copy instruction together with a 2-byte copy instruction; with that combination, the copy is completed in two operations. In order to accurately predict the 8-byte copy instruction and the 2-byte copy instruction, the specific implementation method is described below.
  • FIG. 7 is a schematic structural diagram of a hardware module for branch prediction according to an embodiment of the present invention.
  • in prior art solutions, the branch prediction hardware module includes a branch history register and a branch predictor, and the branch history register accumulates the result of each jump.
  • for each branch jump instruction, there is a corresponding branch history record in the branch history register.
  • on a taken branch the record is shifted left and a 1 is shifted in; on a not-taken branch it is shifted left and a 0 is shifted in.
  • the generated history jump record is stored in the branch history register.
  • when branch prediction is required, the record corresponding to the current jump instruction is used as an index into the pattern table to find the corresponding state, which is then used for prediction.
  • the branch prediction hardware module provided by the invention adds a Branch Prediction Register (BPR) and a count register (Counter) to the existing hardware module; this branch prediction hardware module can be used in combination with the existing method or independently, further improving the branch prediction probability.
  • BPR: Branch Prediction Register.
  • Counter: count register.
  • FIG. 8 is a schematic flowchart of a software and hardware cooperative branch instruction prediction method according to an embodiment of the present invention, including:
  • the branch prediction register obtains a bit sequence representing the data length of the operation data, and outputs the bit sequence of the data length to the branch predictor in order.
  • the count register counts the number of valid bits of the bit sequence of the data length, and instructs the branch prediction register, according to the number of valid bits of the bit sequence, to output the bit sequence of the data length to the branch predictor in order.
  • the branch predictor guides branch instruction prediction according to the bit sequence of the data length. The prediction of the branch prediction register has the highest priority and can override the prediction results of other branch predictors.
  • for a series of data operations, the data tail processing flow contains consecutive branches, and whether each branch jumps depends on the length of the data.
  • for example, a data length of 10 bytes can first be converted into the bit sequence 1010; the branch prediction register sends the bit sequence 1010 to the branch predictor bit by bit, and when one bit has been read out of the branch prediction register, the bit sequence 1010 is shifted right as a whole and padded with 0.
  • during this process, the count register indicates the number of valid bits of the bit sequence 1010 remaining in the branch prediction register.
  • each time a branch prediction is made, the count value of the count register is decremented by 1; when the count reaches 0, the branch prediction register no longer enables branch prediction.
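  • The register/counter interplay can be modeled in software as a behavioral sketch; the structure and function names (bpr_unit, bpr_load, bpr_predict_taken) are invented here, the widths are assumptions, and the most significant valid bit is consumed first so that the 8-byte decision comes before the smaller ones:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the added hardware: a branch prediction register (BPR)
 * holding the length bits and a counter holding the number of valid bits. */
typedef struct {
    uint32_t bpr;      /* bit sequence of the data length, e.g. 1010b for 10 bytes */
    uint32_t counter;  /* number of valid bits still to be consumed */
} bpr_unit;

/* Load the unit, mirroring the effect of the extended instruction. */
static void bpr_load(bpr_unit *u, uint32_t length_bits, uint32_t nbits)
{
    u->bpr = length_bits;
    u->counter = nbits;
}

/* One prediction: the predictor consumes the highest remaining valid bit and the
 * counter is decremented; once the counter reaches 0, prediction is disabled. */
static bool bpr_predict_taken(bpr_unit *u, bool *enabled)
{
    if (u->counter == 0) { *enabled = false; return false; }
    bool taken = (u->bpr >> (u->counter - 1)) & 1u;
    u->counter--;
    *enabled = true;
    return taken;
}

int main(void)
{
    bpr_unit u;
    bpr_load(&u, 0xA /* 1010b, data length 10 */, 4);
    const char *size[] = { "8-byte", "4-byte", "2-byte", "1-byte" };
    for (int i = 0; i < 4; i++) {
        bool enabled;
        bool taken = bpr_predict_taken(&u, &enabled);
        if (!enabled) break;
        printf("%s copy branch predicted %s\n", size[i], taken ? "taken" : "not taken");
    }
    return 0;
}
```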
  • FIG. 9 is a schematic flowchart of a branch instruction prediction.
  • for data whose length is 10 bytes, after the data is first copied with the largest-byte copy instruction (the 8-byte copy instruction), a scenario of consecutive branch decisions arises; according to the bit sequence 1010, the branch prediction register instructs the branch predictor that the remaining data should be handled with the 2-byte copy instruction, so the 10-byte data is finally copied with one 8-byte copy instruction and one 2-byte copy instruction.
  • FIG. 10 is a schematic diagram of an extended instruction set according to an embodiment of the present invention.
  • the extended instruction set includes an opcode, a register operand, a first immediate operand, and a second immediate operand; the opcode is an instruction mnemonic; the register operand stores the data length value for branch prediction; the first immediate operand identifies the start bit of the length used for branch prediction, and the second immediate operand identifies the end bit of that length.
  • the value of the branch prediction register is "register operand >> first immediate operand", and the count value of the count register is "second immediate operand - first immediate operand".
  • the entire instruction is similar to a bit-extraction operation, and the extracted result is sent into the branch prediction register for branch prediction.
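  • One plausible reading of these semantics, expressed as a C helper: the mnemonic bpset and the chosen immediate values are illustrative assumptions; the helper only performs the bit extraction described above.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t bpr; uint32_t counter; } bpr_state;

/* bpset is an invented name for the extended instruction: the register operand
 * carries the data length, the two immediates delimit the bit range loaded into
 * the BPR, and their difference becomes the counter value. */
static bpr_state bpset(uint32_t reg_operand, uint32_t imm1, uint32_t imm2)
{
    bpr_state s;
    s.bpr = reg_operand >> imm1;    /* "register operand >> first immediate operand" */
    s.counter = imm2 - imm1;        /* "second immediate operand - first immediate operand" */
    return s;
}

int main(void)
{
    /* Data length 10 with the full four size bits selected (imm1 = 0, imm2 = 4). */
    bpr_state s = bpset(10u, 0u, 4u);
    printf("BPR=0x%x counter=%u\n", s.bpr, s.counter);   /* prints BPR=0xa counter=4 */
    return 0;
}
```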
  • FIG. 11 provides a schematic flowchart of branch prediction using the extended instruction set. After the data is processed with the largest-byte copy instruction, the branch predictor is instructed, through the extended instruction set and the newly added hardware module, to process the tail data. Before the consecutive branches begin, the instructions for the added branch predictor can be inserted manually or inserted automatically by the compiler; by pre-reading the conditions that decide the branch jumps, the program can perform branch prediction accurately at run time, improving program performance.
  • the extended instruction set may include several newly added instructions for directly operating the newly added hardware modules.
  • the format of the instructions is not necessarily fixed; they can read input from the user and feed it to the branch prediction register.
  • this provides a software-to-hardware interaction function; the instructions do not affect the correctness of program execution and only affect performance.
  • after the extended instruction set and the new hardware modules are implemented, the branch predictor can be operated directly for branch prediction during software development.
  • the specific steps are as follows: first, identify the consecutive-branch scenario; use the software-hardware interaction instructions to optimize the consecutive-branch scenario; once the interaction instructions take effect, the key factors affecting branch prediction are input to the branch prediction register; the branch prediction register module, according to the user input, directly instructs the branch predictor to perform accurate branch prediction, thereby improving program performance.
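  • A hypothetical usage pattern for such an interaction instruction in tail handling is sketched below; bpr_hint is an invented placeholder for the extended instruction, and the size checks mirror the 8/4/2/1-byte copy branches:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invented placeholder: on real hardware this would emit the software-hardware
 * interaction instruction that loads the length bits into the BPR and the bit
 * count into the counter before the consecutive tail branches execute. */
static inline void bpr_hint(size_t length_bits, unsigned nbits)
{
    (void)length_bits;
    (void)nbits;
}

void copy_tail(uint8_t *dst, const uint8_t *src, size_t len)
{
    bpr_hint(len & 0xF, 4);   /* e.g. len = 10 -> 1010b, four branch decisions to come */
    size_t off = 0;
    if (len & 8) { memcpy(dst + off, src + off, 8); off += 8; }  /* predicted from bit 3 */
    if (len & 4) { memcpy(dst + off, src + off, 4); off += 4; }  /* predicted from bit 2 */
    if (len & 2) { memcpy(dst + off, src + off, 2); off += 2; }  /* predicted from bit 1 */
    if (len & 1) { memcpy(dst + off, src + off, 1); }            /* predicted from bit 0 */
}
```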
  • when branch instruction prediction is performed through the extended instruction set and the newly added hardware module, the current jump instruction during program execution is obtained; the history jump record corresponding to the current jump instruction is looked up from the branch history register and used as an index into a preset pattern table to find the corresponding operation state; the operation state is input to the branch predictor for branch instruction prediction, and the two mechanisms then cooperate to complete branch instruction prediction jointly, improving the accuracy of branch instruction prediction.
  • the prediction result of the branch instruction prediction may be obtained; the prediction result is input into the branch history register, and the execution result is stored in the branch history register as the history jump record.
  • the prediction result of the corresponding state machine is thereby adaptively optimized.
  • FIG. 12 is a schematic structural diagram of a software and hardware cooperative branch instruction predicting apparatus according to an embodiment of the present invention. As shown in the figure, the device in the embodiment of the present invention includes:
  • a branch prediction register 1201, configured to obtain a bit sequence representing the data length of the operation data, and output the bit sequence of the data length to the branch predictor in order;
  • a count register 1202, configured to count the number of valid bits of the bit sequence of the data length, and instruct the branch prediction register, according to the number of valid bits of the bit sequence, to output the bit sequence of the data length to the branch predictor in order;
  • a branch predictor 1203, configured to guide branch instruction prediction according to the bit sequence of the data length.
  • for a series of data operations, the data tail processing flow contains consecutive branches, and whether each branch jumps depends on the length of the data.
  • for example, a data length of 10 bytes can first be converted into the bit sequence 1010; the branch prediction register 1201 sends the bit sequence 1010 to the branch predictor 1203 bit by bit, and when one bit has been read out of the branch prediction register 1201, the bit sequence 1010 is shifted right as a whole and padded with 0.
  • during this process, the count register 1202 indicates the number of valid bits of the bit sequence 1010 remaining in the branch prediction register 1201; each time a branch prediction is made, the count value of the count register 1202 is decremented by 1, and when it reaches 0 the branch prediction register 1201 no longer enables branch prediction.
  • FIG. 9 is a schematic flowchart of a branch instruction prediction.
  • for data whose length is 10 bytes, after the data is first copied with the largest-byte copy instruction (the 8-byte copy instruction), a scenario of consecutive branch decisions arises; according to the bit sequence 1010, the branch prediction register 1201 instructs the branch predictor 1203 to handle the remaining data with the 2-byte copy instruction, so the 10-byte data is finally copied with one 8-byte copy instruction and one 2-byte copy instruction.
  • FIG. 10 is a schematic diagram of an extended instruction set according to an embodiment of the present invention.
  • the extended instruction set includes an opcode, a register operand, a first immediate operand, and a second immediate operand; the opcode is an instruction mnemonic; the register operand stores the data length value for branch prediction; the first immediate operand identifies the start bit of the length used for branch prediction, and the second immediate operand identifies the end bit of that length.
  • the value of the branch prediction register is "register operand >> first immediate operand", and the count value of the count register is "second immediate operand - first immediate operand".
  • the entire instruction is similar to a bit-extraction operation, and the extracted result is sent into the branch prediction register for branch prediction.
  • FIG. 11 provides a schematic flowchart of branch prediction using the extended instruction set. After the data is processed with the largest-byte copy instruction, the branch predictor is instructed, through the extended instruction set and the newly added hardware module, to process the tail data. Before the consecutive branches begin, the instructions for the added branch predictor can be inserted manually or inserted automatically by the compiler; by pre-reading the conditions that decide the branch jumps, the program can perform branch prediction accurately at run time, improving program performance.
  • the extended instruction set may include several newly added instructions for directly operating the newly added hardware modules.
  • the format of the instructions is not necessarily fixed; they can read input from the user and feed it to the branch prediction register.
  • this provides a software-to-hardware interaction function; the instructions do not affect the correctness of program execution and only affect performance.
  • after the extended instruction set and the new hardware modules are implemented, the branch predictor can be operated directly for branch prediction during software development.
  • the specific steps are as follows: identify the consecutive-branch scenario; use the software-hardware interaction instructions to optimize the consecutive-branch scenario; once the interaction instructions take effect, the key factors affecting branch prediction are input to the branch prediction register; the branch prediction register module, according to the user input, directly instructs the branch predictor to perform accurate branch prediction, thereby improving program performance.
  • when branch instruction prediction is performed through the extended instruction set and the newly added hardware module, the current jump instruction during program execution is obtained; the history jump record corresponding to the current jump instruction is looked up from the branch history register and used as an index into a preset pattern table to find the corresponding operation state; the operation state is input to the branch predictor for branch instruction prediction, and the two mechanisms then cooperate to complete branch instruction prediction jointly, improving the accuracy of branch instruction prediction.
  • the prediction result of the branch instruction prediction may be obtained; the prediction result is input into the branch history register, and the execution result is stored in the branch history register as the history jump record.
  • the prediction result of the corresponding state machine is thereby adaptively optimized.
  • FIG. 13 is a schematic structural diagram of a software and hardware cooperative branch instruction prediction apparatus according to the present invention.
  • the apparatus can include at least one processor 1301, such as a CPU, at least one communication interface 1302, at least one memory 1303, and at least one communication bus 1304.
  • the communication bus 1304 is used to implement connection communication between these components.
  • the communication interface 1302 of the device in the embodiment of the present invention is used for signaling or data communication with other node devices.
  • the memory 1303 may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • the memory 1303 can also optionally be at least one storage device located remotely from the aforementioned processor 1301.
  • a set of program code is stored in the memory 1303, and the processor 1301 executes the program stored in the memory 1303 to perform the method performed by the above software and hardware cooperative branch instruction prediction apparatus, or to implement the functions implemented by that apparatus.
  • the program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • ROM: Read-Only Memory.
  • RAM: Random Access Memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Embodiments of the present invention disclose a software and hardware cooperative branch instruction prediction method. The method is applied to a processor that includes a branch prediction register, a count register, and a branch predictor, and includes: the branch prediction register obtains a bit sequence representing the data length of the operation data and outputs the bit sequence of the data length to the branch predictor in order; the count register counts the number of valid bits of the bit sequence of the data length and, according to the number of valid bits of the bit sequence, instructs the branch prediction register to output the bit sequence of the data length to the branch predictor in order; the branch predictor guides branch instruction prediction according to the bit sequence of the data length. Embodiments of the present invention further disclose a software and hardware cooperative branch instruction prediction apparatus. With the embodiments of the present invention, the accuracy of consecutive branch prediction can be improved.

Description

Software and hardware cooperative branch instruction prediction method and apparatus
Technical Field
The present invention relates to the field of electronic technologies, and in particular to a software and hardware cooperative branch instruction prediction method and apparatus.
Background
A branch is a point at which the flow of a running program may change. Branches are divided into unconditional branches and conditional branches. An unconditional branch only requires the CPU (Central Processing Unit) to execute instructions in order, whereas a conditional branch must wait for the processing result before it is decided whether the program's direction of execution changes. Branch prediction is a technique used by modern processors to increase CPU execution speed: the branch flow of the program is predicted first, and then one of the branch paths is fetched and decoded in advance to reduce the time spent waiting for the decoder.
A conditional branch of a program is executed only after the preceding instruction has produced its result in the pipeline. Therefore, while the CPU waits for the result of that instruction, the pipeline sits idle waiting for the branch instruction, which wastes clock cycles. As shown in FIG. 1, FIG. 1 is a flowchart of branch instruction execution provided by a prior art solution. In this execution of the branch instruction, 7 clock cycles are needed to complete the branch jump decision. If the CPU can predict whether the branch will be taken before the result of the previous instruction is available, the corresponding instructions can be executed in advance, avoiding idle waiting in the pipeline and increasing CPU speed. However, if the result of the previous instruction later proves the branch prediction wrong, all instructions and results already loaded into the pipeline must be flushed, and the correct instructions must be loaded and processed again, which makes instruction execution even slower. As shown in FIG. 2, FIG. 2 is a flowchart of another branch instruction execution provided by a prior art solution. During branch instruction execution, if the branch instruction is predicted correctly, the program can jump within 5 clock cycles; otherwise, 11 clock cycles are required.
In prior art solutions, branch prediction can be divided into two categories: dynamic branch prediction and static branch prediction. As shown in FIG. 3, FIG. 3 is a schematic diagram of a dynamic branch prediction method provided by a prior art solution; the method is 2-Level Adaptive Training. First, for each branch jump instruction there is a corresponding Branch History Register (HR), which accumulates the result of each jump: on a taken branch the register is shifted left and a 1 is shifted in, and on a not-taken branch it is shifted left and a 0 is shifted in, producing a history jump record that is stored in the HR. Then, when branch prediction is needed, the HR record corresponding to the current jump instruction is treated as selecting among different state machines and is used as an index into the Pattern Table (PT) to find the corresponding state, from which Sc is obtained for prediction. Finally, the actual result of this branch execution is input to the state machine corresponding to that HR record for training, adaptively optimizing the prediction result of the state machine. However, 2-Level Adaptive Training relies on history records for its predictions: it predicts regular loop-type branches well, but it cannot effectively cover irregular consecutive jump branches. If consecutive branches are executed repeatedly in a real application, the overhead of branch mispredictions degrades CPU performance. As shown in FIG. 4, FIG. 4 provides a schematic diagram of irregular consecutive jump branches. In such consecutive jump branches, the individual jumps are unrelated, their behavior is uncertain, and the branch outcomes depend largely on the real-time state of the running program, so the history jump record has little reference value, which hurts the accuracy of branch prediction.
Summary
Embodiments of the present invention provide a software and hardware cooperative branch instruction prediction method and apparatus, which can solve the technical problem of low branch prediction accuracy in prior art solutions.
According to a first aspect, this application provides a software and hardware cooperative branch instruction prediction method, including: a processor includes a branch prediction register, a count register, and a branch predictor; the branch prediction register obtains a bit sequence representing the data length of the operation data and outputs the bit sequence of the data length to the branch predictor in order; the count register counts the number of valid bits of the bit sequence of the data length and, according to the number of valid bits of the bit sequence, instructs the branch prediction register to output the bit sequence of the data length to the branch predictor in order; the branch predictor guides branch instruction prediction according to the bit sequence of the data length, thereby improving the accuracy of branch prediction and improving program performance. For example, a data copy instruction has four different lengths: 8 bytes, 4 bytes, 2 bytes, and 1 byte. For an array whose length is 10 bytes, the length can be converted into the bit sequence 1010; according to the bit pattern 1010, an 8-byte copy instruction and a 2-byte copy instruction can be executed in turn, while no 4-byte copy instruction or 1-byte copy instruction is needed.
In another possible design, the branch prediction register and the count register are each operated through an extended instruction set, where the extended instruction set includes an opcode, a register operand, a first immediate operand, and a second immediate operand; the opcode is an instruction mnemonic; the register operand is used to store the data length value for branch prediction; the first immediate operand is used to identify the start bit of the length used for branch prediction, and the second immediate operand is used to identify the end bit of that length.
In another possible design, the extended instruction set may include several newly added instructions for directly operating the branch prediction register and the count register. The format of the instructions is not necessarily fixed; they can read input from the user and feed it to the branch prediction module, providing a software-to-hardware interaction function. The instructions do not affect the correctness of program execution and only affect performance.
In another possible design, the current jump instruction during program execution is obtained; the history jump record corresponding to the current jump instruction is looked up from the branch history register and used as an index into a preset pattern table to find the corresponding operation state; the operation state is input to the branch predictor for branch instruction prediction.
In another possible design, after branch instruction prediction is performed according to the bit sequence of the data length, the prediction result of the branch instruction prediction is obtained; the prediction result is input into the branch history register, and the execution result is stored in the branch history register as the history jump record, so that it can be input to the state machine corresponding to the branch history register for training, adaptively optimizing the prediction result of that state machine.
In another possible design, the prediction result of the branch prediction register has the highest priority and can override the prediction results of other branch predictors.
According to a second aspect, this application provides a software and hardware cooperative branch instruction prediction apparatus, where the apparatus is configured to implement the method and functions performed by the resource multiplexing apparatus in the first aspect above; it is implemented by hardware/software, and the hardware/software includes units corresponding to the above functions.
According to a third aspect, this application provides a software and hardware cooperative branch instruction prediction device, including a processor, a memory, and a communication bus, where the communication bus is used to implement connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the software and hardware cooperative branch instruction prediction method provided in the first aspect above.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a flowchart of branch instruction execution provided by a prior art solution;
FIG. 2 is a flowchart of another branch instruction execution provided by a prior art solution;
FIG. 3 is a schematic diagram of a dynamic branch prediction method provided by a prior art solution;
FIG. 4 is a schematic diagram of irregular consecutive jump branches provided by a prior art solution;
FIG. 5 is a schematic flowchart of consecutive branch decisions provided by the present invention;
FIG. 6 is a schematic diagram of a copy instruction combination provided by the present invention;
FIG. 7 is a schematic structural diagram of a branch prediction hardware module according to an embodiment of the present invention;
FIG. 8 is a schematic flowchart of a software and hardware cooperative branch instruction prediction method according to an embodiment of the present invention;
FIG. 9 is a schematic flowchart of branch instruction prediction provided by the present invention;
FIG. 10 is a schematic diagram of an extended instruction set according to an embodiment of the present invention;
FIG. 11 is a schematic flowchart of branch prediction using an extended instruction set provided by the present invention;
FIG. 12 is a schematic structural diagram of a software and hardware cooperative branch instruction prediction apparatus according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a software and hardware cooperative branch instruction prediction device according to an embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In real business scenarios, for example, a Spark big data project performs a large number of operations on data, such as copying (memcpy) and computation (CRC checking). Depending on the size of the operated data, the CPU uses different instructions to complete the operation, which creates a consecutive-branch scenario. As shown in FIG. 5, FIG. 5 is a schematic flowchart of consecutive branch decisions provided by the present invention. From a performance point of view, after the data has been processed with the largest-byte copy instruction, a scenario of consecutive branch decisions arises when the tail data is processed.
As shown in FIG. 6, FIG. 6 is a schematic diagram of a copy instruction combination provided by the present invention. Taking the ARM A57 processor chip as an example, the data copy instructions include four different lengths: 8 bytes, 4 bytes, 2 bytes, and 1 byte. To copy a 10-byte array using only the 1-byte copy instruction, 1*10 operations are needed, which is obviously unfavorable for program performance. The optimal instruction combination is an 8-byte copy instruction together with a 2-byte copy instruction; with that combination, the copy is completed in two operations. In order to accurately predict the 8-byte copy instruction and the 2-byte copy instruction, the specific implementation method is described below.
Please refer to FIG. 7. FIG. 7 is a schematic structural diagram of a branch prediction hardware module according to an embodiment of the present invention. In prior art solutions, the branch prediction hardware module includes a branch history register and a branch predictor. The branch history register accumulates the result of each jump: for each branch jump instruction, there is a corresponding branch history record in the branch history register; on a taken branch the record is shifted left and a 1 is shifted in, on a not-taken branch it is shifted left and a 0 is shifted in, and the generated history jump record is stored in the branch history register. When branch prediction is needed, the branch HR record corresponding to the current jump instruction is treated as selecting among different state machines and used as an index into the pattern table to find the corresponding state, from which Sc is obtained for prediction; finally, the actual result of this branch execution is input to the state machine corresponding to that branch HR record for training. The branch prediction hardware module provided by the present invention adds a Branch Prediction Register (BPR) and a count register (Counter) to the existing hardware module; this branch prediction hardware module can be used in combination with existing methods or independently, further improving the branch prediction probability.
Based on the above hardware structure and technical principle, the present invention provides a software and hardware cooperative branch instruction prediction method. As shown in FIG. 8, FIG. 8 is a schematic flowchart of a software and hardware cooperative branch instruction prediction method according to an embodiment of the present invention, including:
S801. The branch prediction register obtains a bit sequence representing the data length of the operation data, and outputs the bit sequence of the data length to the branch predictor in order.
S802. The count register counts the number of valid bits of the bit sequence of the data length, and instructs, according to the number of valid bits of the bit sequence, the branch prediction register to output the bit sequence of the data length to the branch predictor in order.
S803. The branch predictor guides branch instruction prediction according to the bit sequence of the data length. The prediction of the branch prediction register has the highest priority and can override the prediction results of other branch predictors.
In a specific implementation, for a series of data operations, the data tail processing flow contains consecutive branches, and whether each branch jumps depends on the length of the data. For example, to copy a 10-byte array, the 10-byte data length can first be converted into the bit sequence 1010. The branch prediction register sends the bit sequence 1010 to the branch predictor bit by bit; when one bit has been read out of the branch prediction register, the bit sequence 1010 is shifted right as a whole and padded with 0. While the branch prediction register sends the bit sequence 1010 to the branch predictor bit by bit, the count register indicates the number of valid bits of the bit sequence 1010 remaining in the branch prediction register; each time a branch prediction is made, the count value of the count register is decremented by 1, and when the count value of the count register reaches 0, the branch prediction register no longer enables branch prediction. As shown in FIG. 9, FIG. 9 provides a schematic flowchart of branch instruction prediction. For data whose length is 10 bytes, after the data is first copied with the largest-byte copy instruction (the 8-byte copy instruction), a scenario of consecutive branch decisions arises; according to the bit sequence 1010, the branch prediction register instructs the branch predictor that the remaining data should be handled with the 2-byte copy instruction, so the 10-byte data is finally copied with one 8-byte copy instruction and one 2-byte copy instruction.
Optionally, the branch prediction register and the count register may each be operated through an extended instruction set. As shown in FIG. 10, FIG. 10 is a schematic diagram of an extended instruction set according to an embodiment of the present invention. The extended instruction set includes an opcode, a register operand, a first immediate operand, and a second immediate operand; the opcode is an instruction mnemonic; the register operand is used to store the data length value for branch prediction; the first immediate operand is used to identify the start bit of the length used for branch prediction, and the second immediate operand is used to identify the end bit of that length. The value of the branch prediction register is "register operand >> first immediate operand", and the count value of the count register is "second immediate operand - first immediate operand". The entire instruction is similar to a bit-extraction operation, and the extracted result is sent into the branch prediction register for branch prediction.
As shown in FIG. 11, FIG. 11 provides a schematic flowchart of branch prediction using the extended instruction set. After the data is processed with the largest-byte copy instruction, the branch predictor is instructed, through the extended instruction set and the newly added hardware module, to process the tail data. Before the consecutive branches begin, the instructions for the added branch predictor can be inserted manually or inserted automatically by the compiler; by pre-reading the conditions that decide the branch jumps, the program can perform branch prediction accurately at run time, improving program performance.
It should be noted that the extended instruction set may include several newly added instructions for directly operating the newly added hardware modules. The format of the instructions is not necessarily fixed; they can read input from the user and feed it to the branch prediction register, providing a software-to-hardware interaction function. The instructions do not affect the correctness of program execution and only affect performance.
Finally, after the extended instruction set and the newly added hardware modules have been implemented, the branch predictor can be operated directly for branch prediction during software development. The specific steps are as follows: first, identify the consecutive-branch scenario; use the software-hardware interaction instructions to optimize the consecutive-branch scenario; once the interaction instructions take effect, the key factors affecting branch prediction are input to the branch prediction register; the branch prediction register module, according to the user input, directly instructs the branch predictor to perform accurate branch prediction, improving program performance.
Optionally, when branch instruction prediction is performed through the extended instruction set and the newly added hardware modules, the current jump instruction during program execution needs to be obtained; the history jump record corresponding to the current jump instruction is looked up from the branch history register and used as an index into a preset pattern table to find the corresponding operation state; the operation state is input to the branch predictor for branch instruction prediction, and the two mechanisms then cooperate to complete branch instruction prediction jointly, improving the accuracy of branch instruction prediction.
Optionally, the prediction result of the branch instruction prediction may be obtained; the prediction result is input into the branch history register, and the execution result is stored in the branch history register as the history jump record, so that it can be input to the state machine corresponding to the branch history register for training, adaptively optimizing the prediction result of that state machine.
Please refer to FIG. 12. FIG. 12 is a schematic structural diagram of a software and hardware cooperative branch instruction prediction apparatus according to an embodiment of the present invention. As shown in the figure, the apparatus in this embodiment of the present invention includes:
a branch prediction register 1201, configured to obtain a bit sequence representing the data length of the operation data, and output the bit sequence of the data length to the branch predictor in order;
a count register 1202, configured to count the number of valid bits of the bit sequence of the data length, and instruct the branch prediction register, according to the number of valid bits of the bit sequence, to output the bit sequence of the data length to the branch predictor in order;
a branch predictor 1203, configured to guide branch instruction prediction according to the bit sequence of the data length.
In a specific implementation, for a series of data operations, the data tail processing flow contains consecutive branches, and whether each branch jumps depends on the length of the data. For example, to copy a 10-byte array, the 10-byte data length can first be converted into the bit sequence 1010. The branch prediction register 1201 sends the bit sequence 1010 to the branch predictor 1203 bit by bit; when one bit has been read out of the branch prediction register 1201, the bit sequence 1010 is shifted right as a whole and padded with 0. While the branch prediction register 1201 sends the bit sequence 1010 to the branch predictor 1203 bit by bit, the count register 1202 indicates the number of valid bits of the bit sequence 1010 remaining in the branch prediction register 1201; each time a branch prediction is made, the count value of the count register 1202 is decremented by 1, and when the count value of the count register 1202 reaches 0, the branch prediction register 1201 no longer enables branch prediction. As shown in FIG. 9, FIG. 9 provides a schematic flowchart of branch instruction prediction. For data whose length is 10 bytes, after the data is first copied with the largest-byte copy instruction (the 8-byte copy instruction), a scenario of consecutive branch decisions arises; according to the bit sequence 1010, the branch prediction register 1201 instructs the branch predictor 1203 to handle the remaining data with the 2-byte copy instruction, so the 10-byte data is finally copied with one 8-byte copy instruction and one 2-byte copy instruction.
Optionally, the branch prediction register and the count register may each be operated through an extended instruction set. As shown in FIG. 10, FIG. 10 is a schematic diagram of an extended instruction set according to an embodiment of the present invention. The extended instruction set includes an opcode, a register operand, a first immediate operand, and a second immediate operand; the opcode is an instruction mnemonic; the register operand is used to store the data length value for branch prediction; the first immediate operand is used to identify the start bit of the length used for branch prediction, and the second immediate operand is used to identify the end bit of that length. The value of the branch prediction register is "register operand >> first immediate operand", and the count value of the count register is "second immediate operand - first immediate operand". The entire instruction is similar to a bit-extraction operation, and the extracted result is sent into the branch prediction register for branch prediction.
As shown in FIG. 11, FIG. 11 provides a schematic flowchart of branch prediction using the extended instruction set. After the data is processed with the largest-byte copy instruction, the branch predictor is instructed, through the extended instruction set and the newly added hardware module, to process the tail data. Before the consecutive branches begin, the instructions for the added branch predictor can be inserted manually or inserted automatically by the compiler; by pre-reading the conditions that decide the branch jumps, the program can perform branch prediction accurately at run time, improving program performance.
It should be noted that the extended instruction set may include several newly added instructions for directly operating the newly added hardware modules. The format of the instructions is not necessarily fixed; they can read input from the user and feed it to the branch prediction register, providing a software-to-hardware interaction function. The instructions do not affect the correctness of program execution and only affect performance.
Finally, after the extended instruction set and the newly added hardware modules have been implemented, the branch predictor can be operated directly for branch prediction during software development. The specific steps are as follows: identify the consecutive-branch scenario; use the software-hardware interaction instructions to optimize the consecutive-branch scenario; once the interaction instructions take effect, the key factors affecting branch prediction are input to the branch prediction register; the branch prediction register module, according to the user input, directly instructs the branch predictor to perform accurate branch prediction, improving program performance.
Optionally, when branch instruction prediction is performed through the extended instruction set and the newly added hardware modules, the current jump instruction during program execution needs to be obtained; the history jump record corresponding to the current jump instruction is looked up from the branch history register and used as an index into a preset pattern table to find the corresponding operation state; the operation state is input to the branch predictor for branch instruction prediction, and the two mechanisms then cooperate to complete branch instruction prediction jointly, improving the accuracy of branch instruction prediction.
Optionally, the prediction result of the branch instruction prediction may be obtained; the prediction result is input into the branch history register, and the execution result is stored in the branch history register as the history jump record, so that it can be input to the state machine corresponding to the branch history register for training, adaptively optimizing the prediction result of that state machine.
Please continue to refer to FIG. 13. FIG. 13 is a schematic structural diagram of a software and hardware cooperative branch instruction prediction device provided by the present invention. As shown in the figure, the device may include at least one processor 1301, for example a CPU, at least one communication interface 1302, at least one memory 1303, and at least one communication bus 1304. The communication bus 1304 is used to implement connection and communication between these components. The communication interface 1302 of the device in this embodiment of the present invention is used for signaling or data communication with other node devices. The memory 1303 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory. Optionally, the memory 1303 may also be at least one storage apparatus located remotely from the foregoing processor 1301. The memory 1303 stores a set of program code, and the processor 1301 executes the program stored in the memory 1303 to perform the method performed by the above software and hardware cooperative branch instruction prediction apparatus, or to implement the functions implemented by the above software and hardware cooperative branch instruction prediction apparatus.
It should be noted that, for ease of description, each of the foregoing method embodiments is described as a series of action combinations. However, a person skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. In addition, a person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
A person of ordinary skill in the art may understand that all or some of the steps of the methods in the foregoing embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The method provided in the embodiments of the present invention and the related device and system are described in detail above. Specific examples are used in this specification to describe the principles and implementations of the present invention, and the descriptions of the above embodiments are merely intended to help understand the method and core idea of the present invention. Meanwhile, a person of ordinary skill in the art may make changes to the specific implementations and application scope according to the idea of the present invention. In conclusion, the content of this specification shall not be construed as a limitation on the present invention.

Claims (11)

  1. A software and hardware cooperative branch instruction prediction method, wherein the method is applied to a processor, the processor comprises a branch prediction register, a count register, and a branch predictor, and the method comprises:
    obtaining, by the branch prediction register, a bit sequence representing a data length of operation data, and outputting the bit sequence of the data length to the branch predictor in order;
    counting, by the count register, a number of valid bits of the bit sequence of the data length, and instructing, according to the number of valid bits of the bit sequence, the branch prediction register to output the bit sequence of the data length to the branch predictor in order;
    guiding, by the branch predictor, branch instruction prediction according to the bit sequence of the data length.
  2. The method according to claim 1, wherein the method further comprises:
    operating the branch prediction register and the count register respectively through an extended instruction set, wherein the extended instruction set comprises an opcode, a register operand, a first immediate operand, and a second immediate operand.
  3. The method according to claim 1, wherein the processor further comprises a branch history register, and the method further comprises:
    obtaining a current jump instruction during program execution;
    looking up, from the branch history register, a history jump record corresponding to the current jump instruction, and indexing into a preset pattern table to find a corresponding operation state;
    inputting the operation state to the branch predictor for branch instruction prediction.
  4. The method according to claim 3, wherein after the performing branch instruction prediction according to the bit sequence of the data length, the method further comprises:
    obtaining a prediction result of the branch instruction prediction;
    inputting the prediction result into the branch history register, wherein the execution result is stored in the branch history register as the history jump record.
  5. The method according to any one of claims 1 to 4, wherein the prediction result of the branch prediction register has the highest priority.
  6. A software and hardware cooperative branch instruction prediction apparatus, wherein the apparatus comprises a branch prediction register, a count register, and a branch predictor, wherein:
    the branch prediction register is configured to obtain a bit sequence representing a data length of operation data, and output the bit sequence of the data length to the branch predictor in order;
    the count register is configured to count a number of valid bits of the bit sequence of the data length, and instruct, according to the number of valid bits of the bit sequence, the branch prediction register to output the bit sequence of the data length to the branch predictor in order;
    the branch predictor is configured to guide branch instruction prediction according to the bit sequence of the data length.
  7. The apparatus according to claim 6, wherein the apparatus further comprises:
    a data operation module, configured to operate the branch prediction register and the count register respectively through an extended instruction set, wherein the extended instruction set comprises an opcode, a register operand, a first immediate operand, and a second immediate operand.
  8. The apparatus according to claim 6, wherein the apparatus comprises:
    a branch prediction module, configured to obtain a current jump instruction during program execution, look up, from a branch history register, a history jump record corresponding to the current jump instruction, index into a preset pattern table to find a corresponding operation state, and input the operation state to the branch predictor for branch instruction prediction.
  9. The apparatus according to claim 8, wherein the apparatus further comprises:
    a data storage module, configured to obtain a prediction result of the branch instruction prediction, and input the prediction result into the branch history register, wherein the execution result is stored in the branch history register as the history jump record.
  10. The apparatus according to any one of claims 6 to 9, wherein the prediction result of the branch prediction register has the highest priority.
  11. A software and hardware cooperative branch instruction prediction device, comprising a memory, a communication bus, and a processor, wherein the memory is configured to store program code, and the processor is configured to invoke the program code to perform the following operations:
    obtaining, by a branch prediction register, a bit sequence representing a data length of operation data, and outputting the bit sequence of the data length to a branch predictor in order;
    counting, by the count register, a number of valid bits of the bit sequence of the data length, and instructing, according to the number of valid bits of the bit sequence, the branch prediction register to output the bit sequence of the data length to the branch predictor in order;
    guiding, by the branch predictor, branch instruction prediction according to the bit sequence of the data length.
PCT/CN2017/093343 2016-11-07 2017-07-18 Software and hardware cooperative branch instruction prediction method and apparatus WO2018082344A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610973045.6A CN108062236A (zh) 2016-11-07 2016-11-07 Software and hardware cooperative branch instruction prediction method and apparatus
CN201610973045.6 2016-11-07

Publications (1)

Publication Number Publication Date
WO2018082344A1 true WO2018082344A1 (zh) 2018-05-11

Family

ID=62076601

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/093343 WO2018082344A1 (zh) 2016-11-07 2017-07-18 Software and hardware cooperative branch instruction prediction method and apparatus

Country Status (2)

Country Link
CN (1) CN108062236A (zh)
WO (1) WO2018082344A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1601463A (zh) * 2003-09-24 2005-03-30 三星电子株式会社 低功率消耗的分支预测装置和方法
CN101763248A (zh) * 2008-12-25 2010-06-30 世意法(北京)半导体研发有限责任公司 用于多模式分支预测器的***和方法
US20150046682A1 (en) * 2013-08-12 2015-02-12 International Business Machines Corporation Global branch prediction using branch and fetch group history

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1200343C (zh) * 2002-03-22 2005-05-04 中国科学院计算技术研究所 适用于上下文切换的分支预测方法
US7437537B2 (en) * 2005-02-17 2008-10-14 Qualcomm Incorporated Methods and apparatus for predicting unaligned memory access
US8904155B2 (en) * 2006-03-17 2014-12-02 Qualcomm Incorporated Representing loop branches in a branch history register with multiple bits
US7523298B2 (en) * 2006-05-04 2009-04-21 International Business Machines Corporation Polymorphic branch predictor and method with selectable mode of prediction
CN101477455B (zh) * 2009-01-22 2011-06-29 浙江大学 无预测延时的分支预测控制方法
GB201300608D0 (en) * 2013-01-14 2013-02-27 Imagination Tech Ltd Indirect branch prediction
CN104423929B (zh) * 2013-08-21 2017-07-14 华为技术有限公司 一种分支预测方法及相关装置
JP6457836B2 (ja) * 2015-02-26 2019-01-23 ルネサスエレクトロニクス株式会社 プロセッサおよび命令コード生成装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1601463A (zh) * 2003-09-24 2005-03-30 三星电子株式会社 低功率消耗的分支预测装置和方法
CN101763248A (zh) * 2008-12-25 2010-06-30 世意法(北京)半导体研发有限责任公司 用于多模式分支预测器的***和方法
US20150046682A1 (en) * 2013-08-12 2015-02-12 International Business Machines Corporation Global branch prediction using branch and fetch group history

Also Published As

Publication number Publication date
CN108062236A (zh) 2018-05-22

Similar Documents

Publication Publication Date Title
EP0021399B1 (en) A method and a machine for multiple instruction execution
JP3630118B2 (ja) スレッド終了方法及び装置並びに並列プロセッサシステム
US10572404B2 (en) Cyclic buffer pointer fixing
RU2614583C2 (ru) Определение профиля пути, используя комбинацию аппаратных и программных средств
RU2417407C2 (ru) Способы и устройство для моделирования поведения предсказания переходов явного вызова подпрограммы
US9542169B2 (en) Generating SIMD code from code statements that include non-isomorphic code statements
JP2021103577A (ja) 循環命令の処理方法、電子機器、コンピュータ可読記憶媒体及びコンピュータプログラム
JP6633119B2 (ja) 自律的メモリの方法及びシステム
US5812809A (en) Data processing system capable of execution of plural instructions in parallel
US5313644A (en) System having status update controller for determining which one of parallel operation results of execution units is allowed to set conditions of shared processor status word
US9395986B2 (en) Compiling method and compiling apparatus
JP5941488B2 (ja) 条件付きショート前方分岐の計算的に等価な述語付き命令への変換
TW202009692A (zh) 在中央處理單元(cpu)中執行指令的方法
TW201732561A (zh) 用於控制流向終止的模式特定結束分支
US9442729B2 (en) Minimizing bandwidth to track return targets by an instruction tracing system
JPH0348537B2 (zh)
EP2577464B1 (en) System and method to evaluate a data value as an instruction
WO2024103913A1 (zh) 一种定时器队列设置方法、装置、设备及可读存储介质
WO2018082344A1 (zh) 一种软硬件协同分支指令预测方法及装置
CN108345534B (zh) 生成和处理跟踪流的装置和方法
WO2016201699A1 (zh) 指令处理方法及设备
CN112445587A (zh) 一种任务处理的方法以及任务处理装置
JP2010140233A (ja) エミュレーションシステム及びエミュレーション方法
US10565036B1 (en) Method of synchronizing host and coprocessor operations via FIFO communication
WO2023185799A1 (zh) 一种指令翻译方法及其相关设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17867493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17867493

Country of ref document: EP

Kind code of ref document: A1