CN117093270B - Instruction sending method, device, equipment and storage medium - Google Patents

Instruction sending method, device, equipment and storage medium

Info

Publication number
CN117093270B
CN117093270B
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN202311049367.8A
Other languages
Chinese (zh)
Other versions
CN117093270A
Inventor
Name withheld at the inventor's request
Current Assignee
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd
Priority to CN202311049367.8A
Publication of CN117093270A
Application granted
Publication of CN117093270B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3851 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining


Abstract

The embodiments of the present application disclose an instruction sending method, apparatus, device, and storage medium. The method includes: acquiring a plurality of instructions; selecting at least one target instruction from the plurality of instructions based on relevant constraints, where the constraints include at least that, when the plurality of instructions correspond to at least two instruction types, each instruction type corresponds to at most one target instruction; and, in the current clock cycle, sending each of the at least one target instruction to an instruction queue matching its instruction type.

Description

Instruction sending method, device, equipment and storage medium
Technical Field
The present application relates, but is not limited, to the field of image processing technologies, and in particular to an instruction sending method, apparatus, device, and storage medium.
Background
A processor in a current graphics processing unit (GPU) performs task processing with the thread bundle (warp) as its basic unit. For a single thread bundle, instructions are issued from an instruction issue unit to a downstream instruction queue, and the instruction queue decides whether to issue the instruction at its head to the corresponding arithmetic logic unit (ALU) for execution according to whether that instruction is free of data dependencies. When an ALU completes the task of an instruction, the completion is reported back to the instruction queue so that the dependencies of other instructions on that instruction's output data can be released.
For a single thread bundle, instructions can only be issued to one instruction queue in program order, and that one instruction queue serves multiple ALUs; that is, only one instruction can be issued at a time, which wastes cycles on the many ALUs that could otherwise operate in a pipelined manner.
Disclosure of Invention
In view of this, embodiments of the present application at least provide a method, an apparatus, a device, and a storage medium for sending instructions.
The technical scheme of the embodiment of the application is realized as follows:
in one aspect, an embodiment of the present application provides a method for sending an instruction, where the method includes:
Acquiring a plurality of instructions; selecting at least one target instruction from the plurality of instructions based on relevant constraints, where the constraints include at least that, when the plurality of instructions correspond to at least two instruction types, each instruction type corresponds to at most one target instruction; and, in the current clock cycle, sending each target instruction to an instruction queue matching its instruction type.
In another aspect, an embodiment of the present application provides an instruction sending device, including:
An instruction acquisition module, configured to acquire a plurality of instructions;
an instruction screening module, configured to select at least one target instruction from the plurality of instructions based on relevant constraints, where the constraints include at least that, when the plurality of instructions correspond to at least two instruction types, each instruction type corresponds to at most one target instruction;
and an instruction sending module, configured to send, in the current clock cycle, each target instruction to the instruction queue matching the instruction type to which it belongs.
In yet another aspect, an embodiment of the present application provides a computer device including a memory and a processor, where the memory stores a computer program executable on the processor, and where the processor implements some or all of the steps of the above method when the program is executed.
In yet another aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs some or all of the steps of the above-described method.
In the embodiments of the present application, a plurality of instructions are first acquired; at least one target instruction is then selected from them based on relevant constraints; finally, in the current clock cycle, each target instruction is sent to the instruction queue matching its instruction type. By screening the plurality of instructions against the relevant constraints and storing each selected target instruction in the queue of its type, multiple different instructions can be sent simultaneously through the multiple typed queues to different ALUs for execution; the various ALUs are used as evenly as possible, and the execution efficiency of a single thread bundle is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the aspects of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of a single-threaded bundle instruction serial transmission process in the related art;
FIG. 2 is a schematic flow chart of an alternative method for sending instructions according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an alternative method for sending instructions according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative method for sending instructions according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a multi-instruction parallel sending process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of parallel multi-instruction issue with the same type according to an embodiment of the present application;
FIG. 7 is a schematic diagram of parallel multi-instruction issue with pre-data dependency provided by an embodiment of the present application;
FIG. 8 is a diagram illustrating parallel multi-instruction issue with issued instructions according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a composition structure of an instruction sending device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application.
Detailed Description
The technical solution of the present application will be further elaborated with reference to the accompanying drawings and embodiments, which should not be construed as limiting the application; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
The term "first/second/third" is merely to distinguish similar objects and does not represent a particular ordering of objects, it being understood that the "first/second/third" may be interchanged with a particular order or precedence, as allowed, to enable embodiments of the application described herein to be implemented in other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application are suitable for the following explanation.
Thread blocks are split into minimum execution units called thread bundles (warps), each containing a fixed number of threads (or fewer), e.g., 32 threads. When multiple thread blocks execute in the same compute unit, the thread bundles in that compute unit may come from the same thread block or from different thread blocks, and all threads in the same thread bundle execute in a single instruction, multiple data (SIMD) manner.
The pipeline stages of a GPU include: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write back (WB, i.e., updating the result of instruction execution into a register). Each thread bundle has a program counter (PC) that records the address of the next instruction to be executed in the thread bundle, i.e., the fetch address; the value in the program counter indicates the location of the next instruction in main memory. When an instruction is fetched, the value in the program counter is incremented automatically, and after the instruction is executed and its data written back, the computer obtains the next instruction address from the program counter. Fetching is the first stage of the pipeline; the fetched instruction data is fed to the subsequent stages for processing, realizing the running process of the whole compute unit.
FIG. 1 is a schematic diagram of single-thread-bundle serial instruction issue in the related art. As shown in FIG. 1, a plurality of instructions are read from memory in each clock cycle; for example, the read instructions are, in order, instruction A1, instruction B, instruction C, instruction D, and instruction A2. According to the issue conditions, only one instruction is issued to the instruction queue at a time, in program order: for example, instruction A1 at the current PC value 101 is issued to instruction queue 102, then the PC moves to the next instruction B in program order (that is, the next PC value 103 is the address of instruction B), and instruction B is issued to the same instruction queue 102. Accordingly, instruction queue 102 can issue only one instruction to one ALU at a time, i.e., at most one instruction per clock cycle, so the effective bandwidth in the single-thread-bundle case is no more than 1, resulting in low ALU execution efficiency.
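The serial baseline of FIG. 1 can be sketched as follows. This is a minimal illustrative Python model, not the patent's implementation: the name `serial_issue` is invented, and the PC is simplified to advance by 1 per instruction.

```python
from collections import deque

def serial_issue(instructions, start_pc):
    """Related-art behaviour: issue instructions one per clock cycle,
    in program order, into a single instruction queue."""
    queue = deque()
    trace = []  # (cycle, pc, instruction) triples
    pc = start_pc
    for cycle, inst in enumerate(instructions):
        queue.append(inst)          # at most one instruction per cycle
        trace.append((cycle, pc, inst))
        pc += 1                     # PC moves to the next instruction in order
    return queue, trace

# Five instructions therefore need five cycles, regardless of how many
# ALUs sit behind the queue.
queue, trace = serial_issue(["A1", "B", "C", "D", "A2"], start_pc=100)
```

The trace makes the bandwidth limitation explicit: one queue entry per cycle, so effective issue bandwidth never exceeds 1.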
The embodiment of the application provides an instruction sending method which can be executed by a processor of computer equipment. The computer device may be a server, a notebook computer, a tablet computer, a desktop computer, a smart television, a set-top box, a mobile device (such as a mobile phone, a portable video player, a personal digital assistant, a dedicated messaging device, and a portable game device) and the like with instruction scheduling capability. Fig. 2 is an optional flowchart of an instruction sending method according to an embodiment of the present application, as shown in fig. 2, the method includes steps S210 to S230 as follows:
In step S210, a plurality of instructions are acquired.
Here, the plurality of instructions are read from the memory unit once every clock cycle, and instruction types of different instructions may be the same or different.
Step S220, selecting at least one target instruction from the plurality of instructions based on the related constraint.
Here, the relevant constraints include at least that, when the plurality of instructions correspond to at least two instruction types, each instruction type corresponds to at most one target instruction. That is, if the fetched instructions fall into several instruction types, at most that many target instructions can be obtained.
It should be noted that a single thread bundle may use a plurality of ALUs: instructions belonging to the same type are processed by the ALU matching that type, and instructions belonging to different instruction types can be sent simultaneously to the ALUs matching their respective types for computation.
In some embodiments, the instructions are classified according to different computing scenarios to obtain the instruction type of each instruction, and at most one target instruction is selected for any given instruction type.
In some embodiments, before screening the fetched instructions against the relevant constraints, a long-term data dependency check is performed on them based on issue conditions, which include but are not limited to: the target ALU to be sent to is idle; fewer than 8 instructions of the single thread bundle are currently issued and unfinished; and the number of instructions being processed by the same ALU does not exceed a specified number (e.g., no more than 3), i.e., if 3 instructions are already in flight on an ALU, a 4th is not issued until one of them completes.
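The issue conditions above can be sketched as a predicate. This is an illustrative Python model, not an actual hardware interface; the function name and parameters are invented, and the thresholds of 8 and 3 are simply the example values from the text.

```python
def passes_issue_conditions(target_alu_idle, warp_outstanding, alu_inflight,
                            max_warp_outstanding=8, max_alu_inflight=3):
    """Long-term dependency / issue-condition check for one candidate."""
    if not target_alu_idle:
        return False   # the target ALU must be idle
    if warp_outstanding >= max_warp_outstanding:
        return False   # fewer than 8 issued-and-unfinished per thread bundle
    if alu_inflight >= max_alu_inflight:
        return False   # e.g. at most 3 per ALU; a 4th waits for a completion
    return True
```

Each candidate instruction would be tested against this predicate before it enters the constraint-based screening step.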
Step S230, sending the at least one target instruction to an instruction queue matching the type according to the instruction type to which the at least one target instruction belongs in the current clock cycle.
Here, a clock cycle is a clock cycle of workload sampling of the processing unit; in the embodiments of the present application, 1 cycle can be understood as 1 clock cycle. For each target instruction's instruction type, the target instruction is sent to the queue of that type, and target instructions of different instruction types are sent simultaneously to their respective type-matched queues.
In the embodiments of the present application, a plurality of instructions are first acquired; at least one target instruction is then selected from them based on relevant constraints; finally, in the current clock cycle, each target instruction is sent to the instruction queue matching its instruction type. By screening the plurality of instructions against the relevant constraints and storing each selected target instruction in the queue of its type, multiple different instructions can be sent simultaneously through the multiple typed queues to different ALUs for execution; the various ALUs are used as evenly as possible, and the execution efficiency of a single thread bundle is improved.
Fig. 3 is an optional flowchart of an instruction sending method according to an embodiment of the present application, as shown in fig. 3, the method includes steps S310 to S350 as follows:
in step S310, a plurality of instructions are acquired.
Step S320, determining at least two instruction types corresponding to the plurality of instructions according to a calculation scene.
Here, computing scenarios include different instruction usage or application scenarios, and the plurality of instructions may be allocated into several instruction classes in advance according to the actual computing scenario.
It should be noted that instructions of the different instruction types include high-power-consumption instructions that represent workload in a multi-core processing unit, and may include any of integer instructions, single-precision instructions, double-precision instructions, matrix instructions, memory instructions, and other instruction types. Integer instructions may include, for example, integer computation, type conversion, and comparison instructions; single-precision instructions may include single-precision addition, multiply-add, division, and transcendental-function operation instructions; double-precision instructions may include double-precision multiplication, addition, fused multiply-add (FMA), division, and transcendental-function operation instructions; matrix instructions may include matrix addition, multiplication, division, inversion, and similar operation instructions; memory instructions may include off-chip memory instructions that read and write off-chip memory, shared memory instructions that read and write on-chip shared memory, and so on.
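The classification into the instruction types named above can be sketched as a lookup. This is illustrative Python only: the opcode names and the specific mapping are invented for the example, not taken from any real ISA.

```python
# Hypothetical opcode -> instruction-type mapping (all names invented).
TYPE_OF_OPCODE = {
    "iadd": "integer", "icmp": "integer", "cvt": "integer",
    "fadd": "single", "ffma": "single",
    "dadd": "double", "dfma": "double",
    "mma": "matrix",
    "ld.global": "memory", "st.shared": "memory",
}

def classify(instructions):
    """Group fetched instructions by instruction type (compute scenario)."""
    groups = {}
    for inst in instructions:
        groups.setdefault(TYPE_OF_OPCODE[inst], []).append(inst)
    return groups
```

With such a grouping in hand, the screening step can then keep at most one instruction per group as a target instruction.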
Step S330, setting a type-matching instruction queue for each instruction type of the instructions.
Here, the instruction queue of each instruction type stores instructions of the corresponding type, and instructions without data dependencies can be sent to ALUs of the corresponding type for execution in the same clock cycle.
Step S340, selecting at least one target instruction from the plurality of instructions based on the related constraint.
Here, the relevant constraint includes at least one target instruction corresponding to each instruction type in a case where the plurality of instructions corresponds to at least two instruction types.
It will be appreciated that the at least one target instruction satisfying the relevant constraints is classified into different instruction types; when at least two candidate instructions of the same type exist among the plurality of instructions read at one time, only one of them is selected as the target instruction.
The execution order of steps S330 and S340 is not fixed: they may be performed simultaneously, or step S340 may be performed before step S330, which is not limited in the embodiments of the present application.
Step S350, sending the at least one target instruction to the instruction queue matching the type according to the instruction type to which the at least one target instruction belongs in the current clock cycle.
For example, an instruction of an instruction type a is sent to a first instruction queue corresponding to the type a in a current clock cycle, and an instruction of an instruction type B is sent to a second instruction queue corresponding to the type B.
In the embodiments of the present application, the plurality of instructions are classified by the actual computing scenario, and each screened target instruction satisfying the relevant constraints is stored in the instruction queue of its type, so that multiple different instructions can be sent simultaneously through the typed instruction queues to different ALUs for execution; the various ALUs are used as evenly as possible, further improving the execution efficiency of a single thread bundle.
In some embodiments, after the step S350, the method further includes the following steps S360 to S370:
step S360, performing data dependency detection on the target instruction contained in the instruction queue of each type.
Here, the data dependency detection is a near-term, timing-dependent data dependency detection intended to determine whether the value of a given variable is affected by the value of another variable. The at least one target instruction screened from the plurality of instructions read at one time has, by default, already passed the long-term data dependency check.
In implementation, the different types of instruction queues respectively perform data dependency detection on the target instructions included in the different types of instruction queues, and the target instructions can be detected in parallel by using N (total number of instruction queues) detection threads or in series, which is not limited by the embodiment of the present application.
It will be appreciated that there are different types of data dependencies in computing. If instruction s2 depends on instruction s1, the following cases are possible: s1 writes a memory location and s2 reads it (read-after-write); s1 reads a location and s2 writes it (write-after-read); s1 writes a location and s2 also writes it (write-after-write); s1 reads a location and s2 also reads it (read-after-read, which is not a true hazard).
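These four cases correspond to the classic hazard classes and can be sketched over sets of read/written locations. Illustrative Python; the function name and the set-based representation are invented for the example.

```python
def hazard(s1_writes, s1_reads, s2_writes, s2_reads):
    """Classify the dependency of instruction s2 on s1.
    Each argument is a set of registers / memory locations."""
    if s1_writes & s2_reads:
        return "RAW"   # s1 writes, s2 reads: true dependency
    if s1_reads & s2_writes:
        return "WAR"   # s1 reads, s2 writes: anti-dependency
    if s1_writes & s2_writes:
        return "WAW"   # both write: output dependency
    if s1_reads & s2_reads:
        return "RAR"   # both read: not a real hazard
    return None        # no shared locations: independent
```

Only RAW, WAR, and WAW dependencies would actually delay issue; RAR and independent pairs can proceed in parallel.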
Step S370, sending each target instruction whose data dependencies have been released to the corresponding arithmetic logic unit for execution.
Here, the arithmetic logic unit (ALU) is a combinational digital circuit, built from a series of gates, with which a processor performs arithmetic and bitwise operations on integers. Most ALUs can perform: integer arithmetic operations (addition, subtraction, and sometimes multiplication and division, though at higher cost); bitwise logic operations (AND, OR, NOT, XOR); and shift operations (shifting a word left or right by a given number of bits, with or without sign extension).
In implementation, the near-term data dependency check is performed on the instructions in each typed queue, and the target instructions whose data dependencies are released are issued, realizing parallel execution on the different types of ALUs.
Through the above embodiments, after the target instructions are stored in the corresponding queues according to their allocated instruction types, each queue analyzes the dependencies of the instructions it contains; when their dependencies are released, multiple instructions of different types can be sent simultaneously to different ALUs for execution, improving the execution efficiency of a single thread bundle.
In some embodiments, fig. 4 is a schematic flowchart of an alternative method for sending instructions according to an embodiment of the present application, as shown in fig. 4, step S220 or step S340 "selecting at least one target instruction from the plurality of instructions based on the relevant constraint conditions" includes the following steps S410 to S420:
step S410, a first number of instructions are read from the plurality of instructions as candidate instructions, the candidate instructions conforming to the long-term data dependency.
Here, the first number is at least two, and the first number of candidate instructions fetched in different clock cycles may vary, and may be specifically determined based on factors such as the number of types of arithmetic logic units, the processing bandwidth of the single-threaded bundle, the computing scenario, and the like. In the implementation, a first number of candidate instructions are read according to the distance between the values of the program counter, and then the candidate instructions are classified and issued to the instruction queues with matched types.
The long term data dependencies include, but are not limited to, the following conditions: the state of the target ALU to be sent is idle; the number of the simultaneous issued instructions of the single thread bundle is less than 8, and no unfinished instructions exist at present; the number of instructions processed by the same ALU in multiple thread bundles cannot exceed a specified number (e.g., not more than 3 instructions); there are 3 instructions and no more instruction 4 is issued when another instruction is completed.
In some embodiments, the first number is greater than or equal to 2, and the first number is determined based on at least one of the following: the number of types of arithmetic logic units, the processing bandwidth and the computational scenario.
Here, different computing scenarios correspond to different instruction types, and instruction types are allocated in real time according to the computing scenario. In some embodiments, where instructions are relatively short, the first number is capped by the number of distinct arithmetic-logic-unit types or the number of instruction types included in the single thread bundle.
In some embodiments, the instruction lengths of different instructions may differ; for example, where each instruction is long, the processing bandwidth may allow 2 instructions, and where each instruction is short, it may allow 4, so the first number and the instruction length need to match the processing bandwidth.
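The matching of the first number to instruction length can be sketched as below. Illustrative Python only: the 2-vs-4 split comes from the example in the text, but the length threshold of 8 and the function name are invented assumptions.

```python
def first_number(instruction_length, long_threshold=8):
    """Pick how many candidate instructions to fetch per cycle so that
    (count x instruction length) stays within the processing bandwidth:
    fewer candidates for long instructions, more for short ones."""
    return 2 if instruction_length >= long_threshold else 4
```

A real design would derive the threshold from the actual fetch-path width rather than a fixed constant.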
Thus, by reading multiple candidate instructions in practical applications, instructions of different types can be sent simultaneously, ensuring that the arithmetic processing units of all types are used as evenly as possible and improving instruction execution efficiency under a single thread bundle.
Step S420, selecting at least one target instruction from the candidate instructions based on the relevant constraint.
Here, the selected at least one target instruction may be synchronously sent to the type-matching instruction queue.
In some embodiments, the relevant constraints include: when at least two of the candidate instructions belong to the same instruction type, the target instruction is the one whose address is closest to the value of the program counter, where the value of the program counter characterizes the address of the instruction currently to be executed.
Here, the program counter (PC) stores the address of the instruction to be executed. It should be noted that there is a direct path between the program counter and the MAR register of main memory, and the address of the next instruction can be formed by incrementing the PC by 1.
In implementation, the instruction whose address is closest to the value of the program counter is selected from the at least two instructions as the target instruction. In this way, only one target instruction of each instruction type is taken from the candidate instructions, ensuring that the simultaneously sent instructions fall into different instruction classes.
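The per-type, closest-to-PC selection can be sketched as follows. Illustrative Python; the `(address, type, instruction)` tuple layout and the function name are invented representations.

```python
def select_targets(candidates, pc):
    """From candidate (address, type, instruction) tuples, keep at most
    one target per instruction type: the one closest to the PC value."""
    best = {}   # type -> (address, instruction)
    for addr, itype, inst in candidates:
        if itype not in best or abs(addr - pc) < abs(best[itype][0] - pc):
            best[itype] = (addr, inst)
    return {itype: inst for itype, (addr, inst) in best.items()}

# A1 and A2 share a type, so only A1 (closest to PC=100) survives,
# alongside B of the other type.
selected = select_targets(
    [(100, "int", "A1"), (101, "fp", "B"), (104, "int", "A2")], pc=100)
```

The surviving set contains at most one instruction per type, matching the constraint stated above.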
In some implementations, the relevant constraint includes that the target instruction cannot depend on an untransmitted instruction that does not belong to the candidate instruction.
Here, the unsent instruction is an instruction that is not sent in the current clock cycle. For example, the non-sent instruction is an instruction except the candidate instruction in at least two instructions corresponding to the same instruction type.
The value of the program counter characterizes the target instruction currently to be issued; candidate instructions whose addresses lie after the value of the program counter and that do not depend on the unsent instruction may still be issued as target instructions.
In some embodiments, the method further comprises: in the case that a fourth candidate instruction depends on a third candidate instruction, the at least one target instruction includes the fourth candidate instruction and the third candidate instruction, wherein an instruction type of the fourth candidate instruction is different from an instruction type of the third candidate instruction.
Here, for at least two mutually dependent instructions belonging to different instruction types, the instructions may be sent simultaneously to the instruction queues matching their respective types. In this way, the instructions are classified into instruction queues according to their instruction types, and different instruction queues correspond to different ALUs, which breaks through the limitation that only one instruction can be sent in the same clock cycle and allows the ALUs of different types to be used uniformly.
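A minimal sketch of this behavior, under the assumption that each instruction queue records the addresses its instructions depend on and dispatches only when those dependencies have completed. The helper names and tuple layout are illustrative, not from the original:

```python
from collections import defaultdict, deque

def enqueue_targets(queues, targets):
    """Send each target instruction to the queue matching its type in the
    same clock cycle, even when one target depends on another target of a
    different type; the dependency is resolved later, inside the queue."""
    for addr, itype, deps in targets:
        queues[itype].append((addr, deps))

def dispatch_ready(queues, completed):
    """Each queue independently dispatches its head instruction once all
    of its recorded dependencies appear in the completed set."""
    issued = []
    for itype, q in queues.items():
        if q and q[0][1] <= completed:  # all dependencies completed?
            issued.append(q.popleft()[0])
    return issued
```

Here a type-B instruction that depends on a type-A instruction is enqueued in the same cycle as its producer; queue B simply holds it until the producer completes.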
In the above embodiment, after the candidate instructions conforming to long-term data dependency are read, at least one target instruction is obtained by judging and screening according to the relevant constraints. This ensures that the instructions sent simultaneously belong to different instruction types and meet the constraint requirements, so that the screened at least one target instruction can be sent simultaneously to different instruction queues, the various ALUs are used as uniformly as possible, and the execution efficiency under a single thread bundle is improved.
In some embodiments, after the sending of the at least one target instruction is completed, the method further comprises: determining a total number of instructions of the target instruction that are consecutively issued adjacent to a value of a program counter within the current clock cycle; based on the instruction total number, a value of the program counter is updated.
Here, the number of consecutively transmitted instructions adjacent to the value of the program counter is the total number of target instructions simultaneously transmitted in the current clock cycle. The increment value of the program counter may be determined by the total number of instructions, and the value of the program counter may be updated based on the increment value.
Illustratively, in the current clock cycle, the four consecutive target instructions closest to the value of the program counter are simultaneously sent to the instruction queues of their respective types, and the value of the program counter is advanced by 4 instruction positions from its original value.
In this way, the value of the program counter is updated in real time after each transmission is completed, ensuring that the at least one target instruction expected to be transmitted in the next clock cycle is subsequently screened based on the updated value of the program counter.
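The PC-update rule above amounts to counting how many consecutively addressed instructions, starting at the PC, were sent this cycle. A sketch, assuming unit-stride instruction addresses (the function name is illustrative):

```python
def next_pc(pc, sent_addrs):
    """Advance the PC by the number of consecutively sent instructions
    whose addresses start at, and are adjacent to, the current PC value.
    `sent_addrs` is the set of addresses sent in the current clock cycle."""
    inc = 0
    while pc + inc in sent_addrs:
        inc += 1
    return pc + inc
```

If addresses 0 through 3 were all sent, the PC advances by 4; if address 1 was held back (a second same-type instruction, as in Fig. 6), the PC stops at 1 even though addresses 2 and 3 were also sent.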
In some embodiments, the method further comprises: after the value of the program counter is updated, marking the sent instruction whose instruction address is still located after the updated value of the program counter, to obtain a marked instruction; and in response to the marked instruction existing in the plurality of instructions read next time, skipping the sending flow of the marked instruction.
Here, the sent instructions located after the updated value of the program counter, among the plurality of instructions read in the current clock cycle, are marked; when marked instructions exist among the plurality of instructions read at one time according to the program counter in a subsequent clock cycle, they are not sent repeatedly. In this way, by marking sent instructions at particular locations, unnecessary sending time overhead in the scheduling of subsequent instructions is reduced.
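The mark-and-skip step can be sketched as two small helpers, again as an illustrative model with assumed names and integer addresses:

```python
def mark_sent_after_pc(sent_addrs, new_pc):
    """Mark the sent instructions whose addresses still lie at or after
    the updated PC value, so a later read starting at the PC skips them."""
    return {a for a in sent_addrs if a >= new_pc}

def filter_marked(fetched_addrs, marked):
    """When the next batch read at the PC contains marked (already sent)
    instructions, skip their sending flow."""
    return [a for a in fetched_addrs if a not in marked]
```

For example, if addresses 0, 2 and 3 were sent and the PC only advanced to 1, addresses 2 and 3 are marked and are dropped from the next batch read at the PC.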
The instruction sending method described above is described below in connection with a specific embodiment. It should be noted, however, that this specific embodiment is only for better illustrating the present application and is not meant to be an undue limitation on the present application.
Under an existing single thread bundle, the same instruction queue can only issue instructions to a plurality of ALUs one by one, in instruction order, and the instruction queue can only issue one instruction to one ALU in each clock cycle, so the execution efficiency of the ALUs under a single thread bundle is low. The present embodiment is directed at this problem.
In the implementation, a plurality of instructions are fetched from the storage unit once in each clock cycle, and one or more target instructions are synchronously sent to instruction queues of different types according to the issue conditions and the relevant constraint conditions. The at least one target instruction is stored into the instruction queue of the corresponding type according to the instruction types allocated in advance, each instruction queue analyzes the data dependency of its own instructions, and when instructions without data dependency exist in at least one instruction queue, one or more target instructions can be issued simultaneously to the ALUs for execution.
The issue conditions correspond to long-term data dependency detection (which requires a long time and whose duration is uncertain), including but not limited to: the state of the target ALU to be sent to is idle; the number of instructions simultaneously issued by a single thread bundle is less than 8, and there are currently no unfinished instructions; the number of instructions processed by the same ALU across multiple thread bundles cannot exceed a specified number (e.g., no more than 3 instructions); and when 3 instructions are outstanding, a 4th instruction is not issued until one of them completes.
Relevant constraints include, but are not limited to: the sent target instructions belong to different instruction types (the instruction types are allocated in advance according to the computing scenario); when at least two instructions of the same type exist among a plurality of instructions read at one time, only one of them is sent; in the case that a read instruction depends on an instruction not sent in the current clock cycle, the read instruction is temporarily not sent; instructions among the plurality read at one time according to the PC value that were already sent in a history cycle are not sent again; and the increment of the PC value equals the number of consecutively sent instructions nearest to the PC.
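Leaving aside the issue conditions (ALU idle state, in-flight instruction limits), the relevant constraints listed above can be sketched as a single-cycle selection function. The names and data layout are illustrative assumptions, not the patented implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    addr: int                            # instruction address
    itype: str                           # pre-assigned instruction type
    deps: set = field(default_factory=set)  # addresses this instruction depends on

def select_targets(fetched, pc, marked=frozenset()):
    """One-cycle selection under the listed constraints: skip already-sent
    (marked) instructions, keep at most one instruction per type (the one
    closest to the PC), and hold back any instruction that depends on an
    instruction left unsent this cycle."""
    live = [i for i in fetched if i.addr not in marked]
    # At most one instruction per type, nearest to the PC first.
    chosen = {}
    for ins in sorted(live, key=lambda i: abs(i.addr - pc)):
        chosen.setdefault(ins.itype, ins)
    target_addrs = {i.addr for i in chosen.values()}
    unsent = {i.addr for i in live if i.addr not in target_addrs}
    # An instruction depending on an unsent instruction is not sent yet.
    return sorted((i for i in chosen.values() if not (i.deps & unsent)),
                  key=lambda i: i.addr)
```

In a Fig. 5-like scenario (four independent instructions of four types) all four are selected; in a Fig. 7-like scenario (two type-A instructions, with C depending on the unsent A2) only A1 and D are selected.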
In some embodiments, when the N instructions currently fetched belong to N different instruction types and all pass the long-term data dependency check (which requires a long time and whose duration is uncertain), the N instructions are issued simultaneously to the instruction queues matching their respective types, and the PC value is advanced by N.
Fig. 5 is a schematic diagram of a multi-instruction parallel sending process provided in an embodiment of the present application. As shown in Fig. 5, take as an example that the plurality of instructions read simultaneously include instructions A1, B, C and D belonging to different instruction types, where the current PC value is the address of instruction A1 and the instructions are independent of one another. Instructions A1, B, C and D are selected as candidate instructions according to their instruction types, and under the condition that the long-term data dependency check is satisfied, the four candidate instructions are sent simultaneously, as target instructions, to the instruction queues of their respective types. That is, instruction A1 is sent to instruction queue A, instruction B is sent to instruction queue B, instruction C is sent to instruction queue C, and instruction D is sent to instruction queue D. After the sending is completed, the current PC value is advanced by 4, and the next PC value is the address of instruction A2.
In some embodiments, when a plurality of instructions of the same type are read at the same time, for example when the same instruction type includes a first candidate instruction and a second candidate instruction, the first candidate instruction, which is closest to the address pointed to by the PC value, is sent first to the instruction queue of the corresponding type. If an instruction behind the second candidate instruction (corresponding to a third candidate instruction) does not depend on the second candidate instruction and satisfies the long-term data dependency check, it can be issued simultaneously with the first candidate instruction, as a target instruction, to the instruction queue of its corresponding type. After this round of instruction sending, the PC value is moved to the position of the second candidate instruction.
Fig. 6 is a schematic diagram of parallel sending of multiple instructions including instructions of the same type. As shown in Fig. 6, take as an example that the plurality of instructions read simultaneously include instructions A1 and A2 of the same type, together with instructions C and D of different instruction types. Because instructions A1 and A2 are of the same instruction type, only one of them is selected as the candidate instruction for that type in the current clock cycle; for convenience of operation, instruction A1, closest to the current PC value, may be selected as the candidate instruction. Instructions C and D, which are located behind the other instruction A2, may be issued simultaneously with instruction A1 provided that they do not depend on instruction A2. That is, instruction A1 is sent to instruction queue A, instruction C is sent to instruction queue C, and instruction D is sent to instruction queue D. After this round of instruction sending, the current PC value is moved to the position of instruction A2 and updated as the next PC value.
In some embodiments, when a second candidate instruction cannot be issued because an instruction of the same type exists among the plurality of instructions read simultaneously, at least one third candidate instruction that follows the second candidate instruction and is about to be issued is temporarily not issued if it has a data dependency on the second candidate instruction, and after this round of sending the PC moves to the position of the second candidate instruction.
Fig. 7 is a schematic diagram of parallel multi-instruction sending with a preceding data dependency, provided in an embodiment of the present application. As shown in Fig. 7, take as an example that the plurality of instructions read simultaneously include instructions A1 and A2 of the same type, together with instructions C and D belonging to different instruction types. Since instructions A1 and A2 belong to the same instruction type, instruction A1, which is closest to the PC value and satisfies the data dependency detection, is sent first in the current clock cycle, and the other same-type instruction A2 is sent in the next clock cycle. Since instruction C depends on instruction A2, which has not yet been sent, instruction C is not sent in the current clock cycle either.
That is, instruction A1 is sent to instruction queue A and instruction D is sent to instruction queue D. After this instruction is sent, the current PC value is moved to the position of instruction A2 and updated to the next PC value. Instruction C may then be sent concurrently with instruction A2 in the next clock cycle.
In some embodiments, after the PC value is updated, the sent instructions read in the current cycle that lie after the updated PC value are marked; when instructions among the plurality read at one time according to the PC value were already issued in a previous cycle, they are not issued again this time.
FIG. 8 is a schematic diagram of parallel multi-instruction sending in the presence of already-sent instructions, provided in an embodiment of the present application. As shown in FIG. 8, the plurality of instructions read according to the current PC value include instructions A1, C, D and B, where instruction D is a marked instruction that was already sent in a history cycle and is not sent again this time. Instruction C depends on instruction A1, but instructions A1, C and B all belong to different instruction types and have no long-term data dependency, so they can be sent simultaneously, as three target instructions, to the instruction queues matching their respective types. That is, instruction A1 is sent to instruction queue A, instruction B is sent to instruction queue B, and instruction C is sent to instruction queue C. After this round of instruction sending, the current PC value is moved to the position of instruction A2 and updated as the next PC value.
The instruction sending method provided by the embodiments of the present application can send a plurality of instructions simultaneously to different ALUs for execution under a single thread bundle, improving the instruction execution efficiency of the GPU under a single thread bundle.
The embodiments of the present application classify instructions according to different usage scenarios, and each type of instruction is executed by its corresponding ALU, so that the various ALUs are used as uniformly as possible. When instructions are transmitted, a plurality of instructions can be sent simultaneously to the instruction queues corresponding to the various ALU types; when the instructions are divided into a plurality of types, at most that many instructions can be transmitted simultaneously. The plurality of instruction queues can also perform data dependency judgment on their respective instructions, and when the instructions are mutually independent, a plurality of different instructions can be sent simultaneously to different ALUs for execution, thereby improving the execution efficiency under a single thread bundle.
Based on the foregoing embodiments, an embodiment of the present application provides an instruction sending apparatus. The sub-modules and units included in each module of the apparatus may be implemented by a processor in a computer device; of course, they may also be implemented by specific logic circuits. In an implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
The instruction sending device performs a series of functions such as fetching, decoding, scheduling, distributing, etc. on the thread bundle running on the computing unit, so that a plurality of computing cores of the computing unit, such as Stream Processors (SPs), run the thread bundle. For example, each computing core includes an arithmetic logic unit, a floating point computing unit, and the like. The multiple thread bundles in a thread block may be executed simultaneously or in time-sharing fashion depending on the number of compute cores in the compute unit. The multiple threads in each thread bundle execute the same instruction, and the result obtained after the instruction execution is updated to the register corresponding to each thread bundle.
Fig. 9 is a schematic structural diagram of an instruction sending device according to an embodiment of the present application, and as shown in fig. 9, an instruction sending device 900 includes: an instruction acquisition module 910, an instruction filtering module 920, and an instruction sending module 930, wherein:
the instruction acquisition module 910 is configured to acquire a plurality of instructions;
the instruction filtering module 920 is configured to select at least one target instruction from the plurality of instructions based on relevant constraint conditions; wherein the relevant constraint conditions at least comprise: in the case that the plurality of instructions correspond to at least two instruction types, each instruction type corresponds to at most one target instruction;
the instruction sending module 930 is configured to send, in the current clock cycle, the at least one target instruction to an instruction queue matching the type according to the instruction type to which the at least one target instruction belongs.
In some possible embodiments, the instruction filtering module 920 includes: the first screening submodule is used for reading a first number of instructions from the plurality of instructions to serve as candidate instructions, and the candidate instructions accord with long-term data dependence; and the second screening sub-module is used for selecting at least one target instruction from the candidate instructions based on the related constraint conditions.
In some possible embodiments, the relevant constraint includes that the target instruction cannot depend on an untransmitted instruction that does not belong to the candidate instruction.
In some possible embodiments, the relevant constraints include: in the case that at least two instructions belong to the same instruction type in the candidate instructions, the address of the target instruction is closest to the value of a program counter; wherein the value of the program counter characterizes the address of the currently to-be-executed instruction.
In some possible embodiments, the relevant constraints include: in the case that a fourth candidate instruction depends on a third candidate instruction, the at least one target instruction includes the fourth candidate instruction and the third candidate instruction, wherein an instruction type of the fourth candidate instruction is different from an instruction type of the third candidate instruction.
In some possible embodiments, the apparatus further comprises: a send instruction determining module configured to determine a total number of instructions of the target instruction that are consecutively sent adjacent to a value of a program counter in the current clock cycle; and the counting value updating module is used for updating the value of the program counter based on the total number of the instructions.
In some possible embodiments, the apparatus further comprises: the instruction marking module is used for marking the sent instruction with the instruction address still positioned behind the updated value of the program counter after the value of the program counter is updated to obtain a marked instruction; and the default logic module is used for responding to the existence of the marked instruction in the plurality of instructions read next time and skipping the sending flow of the marked instruction.
In some possible embodiments, the first number is greater than or equal to 2, and the first number is determined based on at least one of the following factors: the number of types of arithmetic logic units, the processing bandwidth and the computational scenario.
In some possible embodiments, the apparatus further comprises: the instruction classification module is used for determining at least two instruction types corresponding to the plurality of instructions according to a calculation scene; and the queue configuration module is used for setting the instruction queues with matched types for the instructions of each instruction type.
In some possible embodiments, the apparatus further comprises: a dependency detection module, configured to perform data dependency detection on the target instruction included in the instruction queue of each type; and the instruction execution module is used for sending the target instruction for releasing the data dependence to a corresponding arithmetic logic unit for execution.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. In some embodiments, the functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to perform the methods described in the embodiments of the methods, and for technical details that are not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the description of the embodiments of the methods of the present disclosure for understanding.
In the embodiments of the present application, if the above-mentioned instruction sending method is implemented in the form of a software functional module and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, or the part thereof contributing to the related art, may essentially be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, the embodiments of the application are not limited to any specific hardware, software, or firmware, or to any combination of hardware, software, and firmware.
The embodiment of the application provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor realizes part or all of the steps in the method when executing the program.
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs some or all of the steps of the above-described method. The computer readable storage medium may be transitory or non-transitory.
Embodiments of the present application provide a computer program comprising computer readable code which, when run in a computer device, causes a processor in the computer device to perform some or all of the steps for carrying out the above method.
Embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, performs some or all of the steps of the above-described method. The computer program product may be realized in particular by means of hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It should be noted here that: the above description of various embodiments is intended to emphasize the differences between the various embodiments, the same or similar features being referred to each other. The above description of apparatus, storage medium, computer program and computer program product embodiments is similar to that of method embodiments described above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus, the storage medium, the computer program and the computer program product of the present application, reference should be made to the description of the embodiments of the method of the present application.
It should be noted that, fig. 10 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application, and as shown in fig. 10, the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002, and a memory 1003, wherein:
the processor 1001 generally controls the overall operation of the computer device 1000.
The communication interface 1002 may enable the computer device to communicate with other terminals or servers over a network.
The memory 1003 is configured to store instructions and applications executable by the processor 1001, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or processed by the respective modules in the processor 1001 and the computer device 1000, which may be implemented by a FLASH memory (FLASH) or a random access memory (Random Access Memory, RAM). Data transfer may be performed between the processor 1001, the communication interface 1002, and the memory 1003 via the bus 1004.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present application, the sequence number of each step/process described above does not mean that the execution sequence of each step/process should be determined by its functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present application. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are only illustrative; for example, the division of the units is only one kind of logical function division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, and the foregoing program may be stored in a computer readable storage medium, where the program, when executed, performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
Or the above-described integrated units of the application may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the related art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage device, a ROM, a magnetic disk, or an optical disk.
The foregoing is merely an embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application.

Claims (11)

1. A method of instruction transmission, applied to a single thread bundle, the method comprising:
Acquiring a plurality of instructions;
Selecting at least one target instruction from the plurality of instructions based on relevant constraint conditions; wherein the relevant constraint conditions at least comprise: in the case that the plurality of instructions correspond to at least two instruction types, each instruction type corresponds to at most one target instruction;
transmitting the at least one target instruction to an instruction queue matched with the type according to the instruction type of the at least one target instruction in the current clock cycle;
The selecting at least one target instruction from the plurality of instructions based on the relevant constraint includes:
Reading a first number of instructions from the plurality of instructions as candidate instructions, the candidate instructions conforming to a long-term data dependency;
Selecting at least one target instruction from the candidate instructions based on related constraint conditions; wherein the relevant constraint includes that the target instruction cannot depend on an untransmitted instruction that does not belong to the candidate instruction.
2. The method of claim 1, wherein the associated constraint comprises: in the case that at least two instructions belong to the same instruction type in the candidate instructions, the address of the target instruction is closest to the value of a program counter; wherein the value of the program counter characterizes the address of the currently to-be-executed instruction.
3. The method of claim 1, wherein the associated constraint comprises:
in the case that a fourth candidate instruction depends on a third candidate instruction, the at least one target instruction includes the fourth candidate instruction and the third candidate instruction, wherein an instruction type of the fourth candidate instruction is different from an instruction type of the third candidate instruction.
4. The method of claim 2, wherein after the sending of the at least one target instruction is completed, the method further comprises:
determining a total number of target instructions issued consecutively from the value of the program counter within the current clock cycle; and
updating the value of the program counter based on the total number of instructions.
5. The method of claim 4, wherein the method further comprises:
after the value of the program counter is updated, marking any sent instruction whose instruction address still lies after the updated value of the program counter, obtaining a marked instruction; and
when instructions are next read from the plurality of instructions, skipping the sending flow of the marked instruction.
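Claims 4 and 5 together describe advancing the program counter past the run of consecutively issued instructions and marking any out-of-order sends that now lie beyond the new counter value. A hedched sketch follows; representing instruction addresses as integer indices and the issued set as a plain set are assumptions of the example, not details from the patent.

```python
def update_pc(pc, issued_addresses):
    """Sketch of claim 4: advance the program counter past the run of
    issued instructions whose addresses are consecutive starting at PC."""
    consecutive = 0
    while pc + consecutive in issued_addresses:
        consecutive += 1
    return pc + consecutive

def mark_skipped(issued_addresses, new_pc):
    """Sketch of claim 5: mark sent instructions whose addresses still lie
    after the updated PC, so their send flow is skipped on the next read."""
    return {addr for addr in issued_addresses if addr >= new_pc}
```

For example, if instructions at addresses 0, 1, and 3 were issued in one cycle with the counter at 0, the counter advances to 2 (the run 0–1 is consecutive), and the instruction at address 3 is marked so it is not sent a second time.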
6. The method of any one of claims 1 to 5, wherein the first number is greater than or equal to 2, and the first number is determined based on at least one of: the number of types of arithmetic logic units, the processing bandwidth, and the computation scenario.
7. The method of any one of claims 1 to 5, further comprising:
determining, according to a computation scenario, the at least two instruction types corresponding to the plurality of instructions; and
setting, for the instructions of each instruction type, an instruction queue matching that instruction type.
8. The method of any one of claims 1 to 5, further comprising:
performing data dependency detection on the target instructions contained in the instruction queue of each type; and
sending a target instruction whose data dependencies have been released to a corresponding arithmetic logic unit for execution.
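Claim 8 describes draining each per-type queue toward its arithmetic logic unit once dependencies are released. A minimal sketch, assuming a dictionary of per-type FIFO queues and caller-supplied `deps_resolved` and `execute` callbacks (none of these names come from the patent):

```python
from collections import deque

def dispatch(queues, deps_resolved, execute):
    """Sketch of claim 8: for each per-type instruction queue, perform
    data-dependency detection on the instruction at the head and send
    dependency-free target instructions to the matching ALU in order."""
    for itype, queue in queues.items():
        # Stop at the first head-of-queue instruction whose data
        # dependency is not yet released, preserving FIFO order.
        while queue and deps_resolved(queue[0]):
            execute(itype, queue.popleft())
```

A queue whose head instruction still has an unresolved dependency simply stalls until a later dispatch pass, while the other type queues continue to drain independently.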
9. An instruction sending apparatus, wherein the apparatus comprises:
an instruction acquisition module, configured to acquire a plurality of instructions;
an instruction screening module, configured to select at least one target instruction from the plurality of instructions based on a related constraint condition, wherein, in the case that the plurality of instructions correspond to at least two instruction types, the related constraint condition at least comprises that each instruction type corresponds to at most one target instruction; and
an instruction sending module, configured to send, in the current clock cycle, the at least one target instruction to an instruction queue matching the instruction type to which the at least one target instruction belongs;
wherein the instruction screening module is further configured to read a first number of instructions from the plurality of instructions as candidate instructions, the candidate instructions satisfying long-term data dependencies, and to select the at least one target instruction from the candidate instructions based on the related constraint condition, wherein the related constraint condition comprises that a target instruction cannot depend on an unsent instruction that does not belong to the candidate instructions.
10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the program.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202311049367.8A 2023-08-18 2023-08-18 Instruction sending method, device, equipment and storage medium Active CN117093270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311049367.8A CN117093270B (en) 2023-08-18 2023-08-18 Instruction sending method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311049367.8A CN117093270B (en) 2023-08-18 2023-08-18 Instruction sending method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117093270A (en) 2023-11-21
CN117093270B (en) 2024-06-14

Family

ID=88774706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311049367.8A Active CN117093270B (en) 2023-08-18 2023-08-18 Instruction sending method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117093270B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017016255A1 (en) * 2015-07-29 2017-02-02 深圳市中兴微电子技术有限公司 Parallel processing method and apparatus for multiple launch instructions of micro-engine, and storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US5203002A (en) * 1989-12-27 1993-04-13 Wetzel Glen F System with a multiport memory and N processing units for concurrently/individually executing 2N-multi-instruction-words at first/second transitions of a single clock cycle
KR100957060B1 (en) * 2007-12-12 2010-05-13 엠텍비젼 주식회사 Scheduler and method for scheduling instruction and the record medium recoded the program realizing the same
CN111488177A (en) * 2020-04-14 2020-08-04 腾讯科技(深圳)有限公司 Data processing method, data processing device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN117093270A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
US11983534B2 (en) Calculation method and related product
US10235180B2 (en) Scheduler implementing dependency matrix having restricted entries
CN110073329B (en) Memory access device, computing device and device applied to convolutional neural network operation
US20240078285A1 (en) Systems and methods of instructions to accelerate multiplication of sparse matrices using bitmasks that identify non-zero elements
US6484255B1 (en) Selective writing of data elements from packed data based upon a mask using predication
CN111656367A (en) System and architecture for neural network accelerator
TW202125287A (en) Apparatuses, methods, and systems for instructions of a matrix operations accelerator
US20180181398A1 (en) Apparatus and methods of decomposing loops to improve performance and power efficiency
JP2014525619A (en) Data processing system
CN109992302A (en) Merger on the room and time of remote atomic operation
US9286125B2 (en) Processing engine implementing job arbitration with ordering status
CN112181657A (en) Video processing method and device, electronic equipment and storage medium
CN114153500A (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
US20140047221A1 (en) Fusing flag-producing and flag-consuming instructions in instruction processing circuits, and related processor systems, methods, and computer-readable media
TW201712534A (en) Decoding information about a group of instructions including a size of the group of instructions
CN117093270B (en) Instruction sending method, device, equipment and storage medium
US11256543B2 (en) Processor and instruction scheduling method
US20210173662A1 (en) Processor unit for multiply and accumulate operations
US10241794B2 (en) Apparatus and methods to support counted loop exits in a multi-strand loop processor
CN112463218B (en) Instruction emission control method and circuit, data processing method and circuit
CN111052078A (en) Selecting ordered instruction selection using an out-of-order instruction selector
CN115344826A (en) Computing device, operating method, and machine-readable storage medium
CN117561501A (en) Multithread data processing method and device
US20220197696A1 (en) Condensed command packet for high throughput and low overhead kernel launch
US11210091B2 (en) Method and apparatus for processing data splicing instruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant