WO2023173642A1

WO2023173642A1 - Instruction scheduling method, processing circuit and electronic device

Info

Publication number: WO2023173642A1
Application number: PCT/CN2022/107512
Authority: WO
Inventors: 王磊; 常亮; 许飞翔; 侯红朝; 姚飞; 仇小钢
Original assignee: 海飞科(南京)信息技术有限公司
Priority date: 2022-03-14
Filing date: 2022-07-22
Publication date: 2023-09-21
Also published as: CN114610394A; CN114610394B

Abstract

An instruction scheduling method, a processing circuit, an electronic device, a computer-readable storage medium, and a computer program product. The method comprises: determining a state indicator associated with a target instruction, the state indicator being used for indicating a state of a resource associated with the target instruction (810); determining, on the basis of the state indicator and the type of the target instruction, whether the target instruction is ready (820); in response to determining that the target instruction is ready, executing a target stage of the target instruction (830), wherein the target stage is determined on the basis of the type of the target instruction; and in response to the completion of an access operation of the target instruction with regard to a register (840), updating the state indicator (850). The method can use the state indicator to efficiently manage the data dependency between instructions, thereby improving the system performance and reducing the circuit complexity.

Description

指令调度的方法、处理电路和电子设备Instruction scheduling method, processing circuit and electronic device

技术领域Technical field

本公开的实施例一般地涉及电子领域，更具体而言涉及一种用于指令调度的方法、处理电路、电子设备、计算机可读存储介质和计算机程序产品。Embodiments of the present disclosure relate generally to the field of electronics, and more particularly to a method for instruction scheduling, a processing circuit, an electronic device, a computer-readable storage medium, and a computer program product.

背景技术Background technique

一些处理单元(例如，AIGPU)采用了load-store(加载-存储)架构，除了存取指令之外的其他指令都使用寄存器中操作数。这些指令从寄存器堆(register file–RF)中读取数据，然后送进执行单元计算，最后将结果回写到寄存器堆。为了能够每个时钟周期内读写多个数据，寄存器一般是多端口的，并分成多块。寄存器堆通常很小，其读取速度接近执行单元且延时是固定的，一般都放置在执行单元旁边。AIGPU的每个线程都有自己的寄存器堆，并有固定的执行单元。Some processing units (for example, AIGPU) adopt a load-store (load-store) architecture, and other instructions except access instructions use register operands. These instructions read data from the register file (register file-RF), then send it to the execution unit for calculation, and finally write the result back to the register file. In order to be able to read and write multiple data per clock cycle, registers are generally multi-ported and divided into multiple blocks. The register file is usually small, its read speed is close to the execution unit and the latency is fixed, and it is generally placed next to the execution unit. Each thread of AIGPU has its own register file and a fixed execution unit.

为了简化硬件实现，一些方法强制要求同类的读指令必须按序执行，这就加剧了读数据造成的延迟。In order to simplify hardware implementation, some methods force similar read instructions to be executed in order, which aggravates the delay caused by reading data.

发明内容Contents of the invention

本公开的实施例提供了一种用于指令调度的方案。Embodiments of the present disclosure provide a solution for instruction scheduling.

在第一方面，提供了一种用于指令调度的方法。该方法包括确定与目标指令相关联的状态指示符，该状态指示符用于指示与目标指令相关联的资源的状态；基于状态指示符和目标指令的类型，确定目标指令是否就绪；响应于确定目标指令就绪，执行目标指令的目标阶段，其中该目标阶段基于目标指令的类型而被确定；以及响应于目标指令的、针对资源的访问操作执行完成，更新状态指示符。In a first aspect, a method for instruction scheduling is provided. The method includes determining a status indicator associated with the target instruction, the status indicator being used to indicate a status of a resource associated with the target instruction; determining whether the target instruction is ready based on the status indicator and the type of the target instruction; in response to determining The target instruction is ready, a target phase of the target instruction is executed, wherein the target phase is determined based on the type of the target instruction; and in response to completion of the access operation to the resource for the target instruction, the status indicator is updated.

在一些实施例中，基于状态指示符和目标指令的类型确定目标指令是否就绪包括：响应于目标指令的类型指示针对资源的数据生产操作，确定状态指示符是否为第一值，第一值指示资源中的数据已经被消费；以及响应于状态指示符为第一值，确定目标指令就绪。In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of the target instruction includes: in response to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value, the first value indicates The data in the resource has been consumed; and in response to the status indicator being the first value, it is determined that the target instruction is ready.

在一些实施例中，更新状态指示符包括：响应于目标指令的、针对资源的访问操作执行完成，将状态指示符更新为第二值，第二值指示资源中的数据能够被消费。In some embodiments, updating the status indicator includes: in response to completion of the access operation for the resource in response to the target instruction, updating the status indicator to a second value, the second value indicating that the data in the resource can be consumed.

在一些实施例中，执行所述目标指令的目标阶段包括：执行所述目标指令的回写阶段。In some embodiments, executing the target phase of the target instruction includes executing a writeback phase of the target instruction.

在一些实施例中，基于状态指示符和目标指令的类型确定目标指令是否就绪包括：响应于目标指令的类型指示针对资源的数据消费操作，确定状态指示符是否为第二值，第二值指示资源中的数据能够被消费；以及响应于状态指示符为第二值，确定目标指令就绪。In some embodiments, determining whether the target instruction is ready based on the status indicator and the type of the target instruction includes: in response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value, the second value indicating The data in the resource can be consumed; and in response to the status indicator being the second value, determining that the target instruction is ready.

在一些实施例中，执行目标指令的目标阶段包括：发出目标指令。In some embodiments, executing the target phase of the target instruction includes issuing the target instruction.

在一些实施例中，更新状态指示符包括：响应于目标指令的、针对资源的访问操作执行完成，将状态指示符更新为第一值，第一值指示资源中的数据已经被消费。In some embodiments, updating the status indicator includes: in response to completion of the access operation for the resource in response to the target instruction, updating the status indicator to a first value, the first value indicating that the data in the resource has been consumed.

在一些实施例中，目标指令为第一指令，资源为第一资源，第一指令指示针对第一资源的数据生产操作，并且执行目标指令的目标阶段包括：在第二指令的执行期间，发出第一指令，第二指令指示针对第二资源的数据生产操作，第一资源不同于第二资源。In some embodiments, the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and the target phase of executing the target instruction includes: during execution of the second instruction, issuing A first instruction and a second instruction indicating a data production operation for a second resource, the first resource being different from the second resource.

在一些实施例中，目标指令为第三指令，资源为第三资源，第三指令指示针对第三资源的数据消费操作，执行目标指令的目标阶段包括：响应于第四指令将目标指示符设置为第一值，发出第三指令，第四指令指示针对第三资源的数据生产操作，第四指令先于第三指令被发出；并且方法还包括：响应于第三指令将目标指示符更新为第二值，使第五指令被执行，第五指令指示针对第三资源的数据生产操作，第五指令先于第三指令并且晚于第四指令被发出。In some embodiments, the target instruction is a third instruction, the resource is a third resource, the third instruction indicates a data consumption operation for the third resource, and the target phase of executing the target instruction includes: setting the target indicator in response to the fourth instruction. is the first value, issuing a third instruction, the fourth instruction indicating a data production operation for the third resource, the fourth instruction being issued before the third instruction; and the method further includes: in response to the third instruction, updating the target indicator to The second value causes a fifth instruction to be executed, the fifth instruction indicates a data production operation for the third resource, and the fifth instruction is issued before the third instruction and later than the fourth instruction.

在一些实施例中，响应于确定目标指令就绪发出目标指令包括：响应于确定目标指令就绪，确定已经发出且未完成的指令的数目是否小于阈值；以及响应于确定数目小于阈值，发出目标指令。In some embodiments, issuing the target instruction in response to determining that the target instruction is ready includes: in response to determining that the target instruction is ready, determining whether the number of issued and outstanding instructions is less than a threshold; and in response to determining that the number is less than the threshold, issuing the target instruction.

在一些实施例中，确定与目标指令相关联的资源的状态指示符包括：作为内存加载指令发出目标指令；以及在目标指令作为内存加载指令的第一阶段，确定与目标指令相关联的资源的状态指示符；并且响应于确定目标指令就绪执行目标指令的目标阶段包括：响应于确定目标指令就绪，作为运算指令发出重新发出目标指令。In some embodiments, determining the status indicator of the resource associated with the target instruction includes: issuing the target instruction as a memory load instruction; and in a first phase of the target instruction being a memory load instruction, determining the status indicator of the resource associated with the target instruction. a status indicator; and in response to determining that the target instruction is ready, executing the target phase of the target instruction includes: in response to determining that the target instruction is ready, re-issuing the target instruction as an operation instruction.

在一些实施例中，确定目标指令是否就绪包括：确定第一内存加载指令是否将状态指示符设置为第二值。In some embodiments, determining whether the target instruction is ready includes determining whether the first memory load instruction sets the status indicator to a second value.

在一些实施例中，该方法还包括：在发出目标指令之后，发出与状态指示符相关联的第二内存加载指令，而无需确认状态指示符是否为第一值。In some embodiments, the method further includes, after issuing the target instruction, issuing a second memory load instruction associated with the status indicator without confirming whether the status indicator is the first value.

在一些实施例中，资源可以包括以下至少一项：寄存器、内存地址、队列或处理器资源。In some embodiments, resources may include at least one of: a register, a memory address, a queue, or a processor resource.

在本公开的第二方面，提供了一种处理电路，该处理电路包括片上存储器、流处理器和处理引擎。该处理电路被配置为执行第一方面及其实现方式的任一方法。In a second aspect of the present disclosure, a processing circuit is provided that includes an on-chip memory, a stream processor, and a processing engine. The processing circuit is configured to perform any method of the first aspect and its implementation.

在本公开的第三方面，提供了一种电子设备。该电子设备包括处理电路，该处理电路被配置为执行第一方面及其实现方式的任一方法。In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes a processing circuit configured to perform any method of the first aspect and its implementation.

在本公开的第四方面，提供了一种计算机可读存储介质。计算机可读存储介质存储有指令，指令在被处理电路执行时使得处理电路执行第一方面及其实现方式的任一方法。In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when executed by the processing circuit, cause the processing circuit to perform any method of the first aspect and its implementation.

在本公开的第五方面，提供了一种计算机程序产品。计算机程序产品包括指令，指令在被处理电路执行时使得处理电路执行第一方面及其实现方式的任一方法。In a fifth aspect of the disclosure, a computer program product is provided. The computer program product includes instructions which, when executed by the processing circuit, cause the processing circuit to perform any method of the first aspect and its implementation.

可以理解地，上述提供的第二方面的处理电路、第三方面的电子设备、第四方面的计算机存储介质或第五方面的计算机程序产品均可以用于执行第一方面所提供的方法。因此，关于第一方面的解释或者说明同样适用于第二方面、第三方面、第四方面和第五方面。此外，第二方面、第三方面、第四方面和第五方面所能达到的有益效果可参考对应方法中的有益效果，此处不再赘述。It can be understood that the processing circuit of the second aspect, the electronic device of the third aspect, the computer storage medium of the fourth aspect, or the computer program product of the fifth aspect provided above can all be used to execute the method provided by the first aspect. Therefore, the explanations or explanations regarding the first aspect are equally applicable to the second, third, fourth and fifth aspects. In addition, the beneficial effects that can be achieved in the second aspect, the third aspect, the fourth aspect and the fifth aspect can be referred to the beneficial effects in the corresponding methods, and will not be described again here.

应当理解，发明内容部分中所描述的内容并非旨在限定本公开的关键或重要特征，亦非用于限制本公开的范围。本公开的其他特征通过以下的描述将变得容易理解。It should be understood that nothing described in this summary is intended to identify key or important features of the disclosure, nor to limit the scope of the disclosure. Other features of the present disclosure will become readily apparent from the following description.

附图说明Description of the drawings

通过结合附图对本公开示例性实施例进行更详细的描述，本公开的上述以及其他目的、特征和优势将变得更加明显，其中，在本公开示例性实施例中，相同的参考标号通常代表相同部件。The above and other objects, features and advantages of the present disclosure will become more apparent by describing the exemplary embodiments of the present disclosure in more detail with reference to the accompanying drawings, wherein, in the exemplary embodiments of the present disclosure, the same reference numerals generally represent Same parts.

图1示出了本公开的多个实施例能够在其中实现的示例环境的示意图；1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented;

图2示出了根据本公开的一些实施例的处理电路的示意框图；Figure 2 shows a schematic block diagram of a processing circuit in accordance with some embodiments of the present disclosure;

图3示出了根据本公开的一些实施例的三维张量的示意框图；Figure 3 shows a schematic block diagram of a three-dimensional tensor according to some embodiments of the present disclosure;

图4示出了根据本公开的一些实施例的指令调度过程；Figure 4 illustrates an instruction scheduling process in accordance with some embodiments of the present disclosure;

图5示出了根据本公开的另一些实施例的指令调度过程；Figure 5 illustrates an instruction scheduling process according to other embodiments of the present disclosure;

图6示出了根据本公开的又一些实施例的指令调度过程；Figure 6 illustrates an instruction scheduling process according to further embodiments of the present disclosure;

图7示出了根据本公开的又一些实施例的指令调度过程；Figure 7 illustrates an instruction scheduling process according to further embodiments of the present disclosure;

图8示出了根据本公开的一些实施例的流处理方法的示例过程的流程图。Figure 8 shows a flowchart of an example process of a stream processing method according to some embodiments of the present disclosure.

具体实施方式Detailed ways

下面将参照附图更详细地描述本公开的优选实施例。虽然附图中示出了本公开的优选实施例，然而应该理解，本公开可以以各种形式实现而不应被这里阐述的实施例限制。相反，提供这些实施例是为了使本公开更加透彻和完整，并且能够将本公开的范围完整地传达给本领域的技术人员。Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

在本文中使用的术语“包括”及其变形表示开放性包括，即“包括但不限于”。除非特别申明，术语“或”表示“和/或”。术语“基于” 表示“至少部分地基于”。术语“一个示例实施例”和“一个实施例”表示“至少一个示例实施例”。术语“另一实施例”表示“至少一个另外的实施例”。术语“第一”、“第二”等等可以指代不同的或相同的对象。下文还可能包括其他明确的和隐含的定义。As used herein, the term "include" and its variations mean an open inclusion, ie, "including but not limited to." Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on." The terms "one example embodiment" and "an embodiment" mean "at least one example embodiment." The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," etc. may refer to different or the same object. Other explicit and implicit definitions may be included below.

处理器通常采用两种方式解决指令之间对寄存器的依赖关系，即保证一条指令正确使用了前面一条指令输出的结果寄存器。Processors usually use two methods to resolve the dependency on registers between instructions, that is, ensuring that one instruction correctly uses the result register output by the previous instruction.

在一些传统方案中，如果前面指令的执行时长在任何情况下都是恒定的，处理器或软件可以总是安排后面的有依赖关系的指令在这个时长之后开始执行。In some traditional solutions, if the execution time of the previous instruction is constant under any circumstances, the processor or software can always arrange for subsequent instructions with dependencies to start executing after this time.

在另一些传统方案中，如果指令的执行时间是不确定的，如访存指令每次执行的时长是不一定的，硬件记录每条指令的输出寄存器，并跟踪它们的完成时间，同时硬件译码每条指令使用的(多个)输入寄存器，并与未完成的输入寄存器进行比较，发现依赖关系。In other traditional solutions, if the execution time of the instruction is uncertain, such as the execution time of each memory access instruction is uncertain, the hardware records the output register of each instruction and tracks their completion time. At the same time, the hardware translates Code the input register(s) used by each instruction and compare it with the unfinished input registers to find dependencies.

根据本公开的实施例，能够通过资源(例如，寄存器、内存地址、队列或处理器单元等)的状态指示符来有效地解决指令之间的资源依赖性，从而提高指令调度的效率，提高***性能并降低电路实现的复杂程度。According to embodiments of the present disclosure, resource dependencies between instructions can be effectively resolved through status indicators of resources (such as registers, memory addresses, queues or processor units, etc.), thereby improving the efficiency of instruction scheduling and improving the system. performance and reduce circuit implementation complexity.

示例环境Example environment

图1示出了本公开的多个实施例能够在其中实现的示例环境100的示意图。示例环境100例如可以是诸如计算机之类的具有计算能力的电子设备。在一个实施例中，示例环境100例如包括中央处理器(CPU)20、***存储器10、北桥/存储器桥30、加速器子***40、设备存储器50和南桥/输入输出(IO)桥60。***存储器10例如可以是诸如动态随机存取存储器(DRAM)之类的易失性存储器。北桥/存储器桥30例如集成了内存控制器、PCIe控制器等，其负责CPU 20和高速接口之间的数据交换以及桥接CPU 20和南桥/IO桥60。南桥/IO桥60用于计算机的低速接口，例如串行高级技术接口(SATA)控制器等。加速器子***40例如可以包括诸如图形处理器(GPU) 和人工智能(AI)加速器等用于对图形、视频等数据进行加速处理的装置或芯片。在本公开中，加速器子***40也可以被称为“处理电路”。Figure 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer. In one embodiment, example environment 100 includes, for example, central processing unit (CPU) 20, system memory 10, northbridge/memory bridge 30, accelerator subsystem 40, device memory 50, and southbridge/input-output (IO) bridge 60. System memory 10 may be, for example, volatile memory such as dynamic random access memory (DRAM). The northbridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the southbridge/IO bridge 60. Southbridge/IO bridge 60 is used for low-speed interfaces of computers, such as Serial Advanced Technology Interface (SATA) controllers, etc. The accelerator subsystem 40 may include, for example, a device or chip such as a graphics processing unit (GPU) and an artificial intelligence (AI) accelerator for accelerating processing of graphics, video, and other data. In this disclosure, accelerator subsystem 40 may also be referred to as "processing circuitry."

继续参考图1，设备存储器50例如可以是诸如DRAM之类的位于加速器子***40外部的易失性存储器。在本公开中，设备存储器50也被称为片外存储器，即，位于加速器子***40的芯片外部的存储器。相对而言，加速器子***40的芯片内部也具有易失性存储器，例如一级(L1)高速缓存(cache)以及可选的二级(L2)高速缓存，其可以被统一称为“片上存储器”Continuing with reference to FIG. 1 , device memory 50 may be, for example, volatile memory such as DRAM external to accelerator subsystem 40 . In this disclosure, device memory 50 is also referred to as off-chip memory, ie, memory located outside the chip of accelerator subsystem 40 . Relatively speaking, the chip of the accelerator subsystem 40 also has volatile memory, such as a first-level (L1) cache (cache) and an optional second-level (L2) cache, which can be collectively referred to as "on-chip memory." "

应当理解，虽然在图1中示出了本公开的多个实施例能够在其中实现的一种示例环境100，但是本公开不限于此。本公开的一些实施例也可以在诸如ARM架构和RISC-V架构之类的具有诸如GPU之类的加速器子***的一些应用环境中使用。It should be understood that although an example environment 100 in which various embodiments of the present disclosure can be implemented is shown in FIG. 1, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in some application environments such as ARM architecture and RISC-V architecture with accelerator subsystems such as GPUs.

图2示出了根据本公开的一个实施例的处理电路200的示意框图。处理电路200例如可以是图1中加速器子***40的芯片的一种具体实现方式。处理电路200例如是诸如GPU之类的处理电路芯片。在一个实施例中，处理电路200包括流处理器(SP)210、页表装置220、处理引擎(PE)单元230、直接存储器访问(DMA)控制器240、L1高速缓存(cache)260和L2高速缓存250。Figure 2 shows a schematic block diagram of processing circuit 200 according to one embodiment of the present disclosure. The processing circuit 200 may be, for example, a specific implementation of the chip of the accelerator subsystem 40 in FIG. 1 . The processing circuit 200 is, for example, a processing circuit chip such as a GPU. In one embodiment, processing circuit 200 includes stream processor (SP) 210, page table device 220, processing engine (PE) unit 230, direct memory access (DMA) controller 240, L1 cache (cache) 260, and L2 Cache 250.

处理电路200由诸如CPU 20之类的主机设备控制，并且接收来自CPU 20的指令。SP 210对来自CPU 20的指令进行分析，并且将经分析的操作指派给PE单元230、页表装置220和DMA控制器240进行处理。页表装置220用于管理处理电路200的片上虚拟存储。在本公开中，L2高速缓存250和诸如图1中的设备存储器50之类的片外存储器构成虚拟存储***。页表装置220由SP 210、PE单元230和DMA控制器240共同维护。The processing circuit 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20. The SP 210 analyzes instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing. The page table device 220 is used to manage the on-chip virtual storage of the processing circuit 200 . In the present disclosure, L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual storage system. Page table device 220 is jointly maintained by SP 210, PE unit 230 and DMA controller 240.

PE单元230包括多个处理引擎(processing engine,PE)PE_1、PE_2……PE_N，其中N表示大于1的整数。PE单元230中的每个PE可以是单指令多线程(SIMT)装置。在PE中，每个线程可以具有自己的寄存器堆(register file)，并且每个PE的所有线程还共享一个统一寄存器堆(uniform register file)。多个PE可以并行地执行相同或不同的处理工作，可以并行地进行下文所述的地址转换和存储器中目标数据的访问，从而减少处理时间。可以理解，多个PE处理的目标元素并不相同，并且目标元素所在的段、页、缓存行和元素的属性、尺寸、维度排序等可以有所不同，如下文具体描述。The PE unit 230 includes a plurality of processing engines (PE) PE_1, PE_2...PE_N, where N represents an integer greater than 1. Each PE in PE unit 230 may be a single instruction multi-thread (SIMT) device. In PE, each thread can have its own register file (register file), and all threads of each PE also share a unified register file (uniform register file). Multiple PEs can perform the same or different processing work in parallel, and can perform address translation and access to target data in the memory as described below in parallel, thereby reducing processing time. It can be understood that the target elements processed by multiple PEs are not the same, and the segments, pages, cache lines where the target elements are located, and the attributes, sizes, dimension ordering, etc. of the elements may be different, as described in detail below.

每个线程可以在自己的寄存器堆与存储器子***之间做线程级的数据交换。每个线程有自己的算数逻辑执行单元并使用自己的存储地址，其采用典型的寄存器存取架构(load-store architecture)。每个执行单元包括一个支持多种数据类型的浮点/定点单元以及一个算数逻辑单元。Each thread can perform thread-level data exchange between its own register file and memory subsystem. Each thread has its own arithmetic logic execution unit and uses its own storage address, which uses a typical load-store architecture. Each execution unit includes a floating-point/fixed-point unit that supports multiple data types and an arithmetic logic unit.

大多数的指令执行算数和逻辑运算，例如，浮点和定点数的加、减、乘、除，或者逻辑与、或、非等。操作数来自于寄存器。存储器读写指令可以提供寄存器与片上/片外存储器之间的数据交换。一般地，PE中所有的执行单元可以同步地执行相同指令。通过使用谓词(predicate)寄存器，可以屏蔽部分执行单元，从而实现分支指令的功能。Most instructions perform arithmetic and logical operations, such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, NOT, etc. The operands come from registers. Memory read and write instructions can provide data exchange between registers and on-chip/off-chip memory. Generally, all execution units in a PE can execute the same instructions synchronously. By using the predicate register, part of the execution unit can be masked to implement the function of the branch instruction.

在一个实施例中，图2的处理电路200可以例如执行如下操作：1)组建页表项内容和初始状态；2)将诸如图1中的设备存储器50之类的片外存储器上的数据搬运至片上存储器，例如L2高速缓存250；3)启动和执行程序；4)定义各个段并对张量以及存储的属性进行描述；5)在程序执行完成时，将执行结果的数据写入至片外存储器。In one embodiment, the processing circuit 200 of FIG. 2 may, for example, perform the following operations: 1) assemble page table entry content and initial state; 2) transfer data on an off-chip memory such as the device memory 50 in FIG. 1 to on-chip memory, such as L2 cache 250; 3) start and execute the program; 4) define each segment and describe the tensor and stored attributes; 5) when the program execution is completed, write the execution result data to the chip external memory.

可以理解，在公开的实施例中，处理电路200所处理的数据主要针对多维张量。例如，在一个实施例中，张量可以是四维张量，其具有四个维度D1、D2、D3和D4，并且张量在各维上的尺寸可以不同。在另一些实施例中，张量可以是一维、二维、三维或更多维张量，本公开对此不进行限制。It can be understood that in the disclosed embodiments, the data processed by the processing circuit 200 is mainly aimed at multi-dimensional tensors. For example, in one embodiment, the tensor may be a four-dimensional tensor with four dimensions D1, D2, D3, and D4, and the tensor may have different dimensions in each dimension. In other embodiments, the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited by this disclosure.

此外，在本公开的实施例中，张量内部可以支持诸如uint8、int8、 bfloat16、float16、uint16、int16、float32、int32、uint32以及其他自定义元素类型，本公开对此也不进行限制。对于张量的寻址而言，其以元素为基本单位。例如，如果元素类型为int8，则元素以字节为单位。再例如，如果元素类型为int16，则寻址基本单位为双字节，依此类推。In addition, in the embodiment of the present disclosure, the tensor can internally support such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types, and the present disclosure does not limit this. For tensor addressing, the basic unit is element. For example, if the element type is int8, the element is in bytes. For another example, if the element type is int16, the basic unit of addressing is double bytes, and so on.

在一些情形中，张量所包含的数据量可能较大，而L2高速缓存250的容量有限，因此无法将张量整体加载至片上的L2高速缓存250。在本公开的一些实施例中，为了便于张量的并行处理，可以将张量划分为至少一个段。在张量仅包括一个段的情形下，张量即为段。而在张量包括多个段的情形下，段为张量的一部分。CPU 20可以通过指令指定段的各个部分由哪个PE进行处理。In some cases, the amount of data contained in the tensor may be large, and the capacity of the L2 cache 250 is limited, so the entire tensor cannot be loaded into the on-chip L2 cache 250 . In some embodiments of the present disclosure, to facilitate parallel processing of the tensor, the tensor may be divided into at least one segment. In the case where a tensor consists of only one segment, the tensor is a segment. In the case of a tensor containing multiple segments, the segments are part of the tensor. The CPU 20 can specify which PE to process each part of the segment through instructions.

张量的存储结构Tensor storage structure

图3示出了根据本公开的一个实施例的三维张量300的示意框图。三维张量300具有三个维度D1、D2和D3，并且包括第一段S1、第二段S2和第三段S3。CPU 20可以指定段S1的张量元素由PE_1、PE_2、PE_3、PE_4、PE_5、PE_6、PE_7和PE_8处理。此外，CPU 20还指定了第二段S2的张量元素由PE_1-PE_4处理。在本公开的实施例中，每个段所具有的尺寸可以不同，因此编程人员可以基于设计需要灵活配置段。实际上，页的划分可以在任意一个或多个维上实施，并且各维上划分的页数是相互独立的。Figure 3 shows a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3. CPU 20 may specify that the tensor elements of segment S1 are processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. In addition, CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1-PE_4. In embodiments of the present disclosure, each segment may have different sizes, so programmers can flexibly configure the segments based on design needs. In fact, page division can be implemented in any one or more dimensions, and the number of pages divided in each dimension is independent of each other.

在一个实施例中，可以将张量数据存储于片上的高速存储器，例如L2高速缓存250。但由于片上的高速存储器的容量较少，因此在张量规模较大时，编程人员可以将张量划分为多个段，每个段描述张量一部分。核心程序(kernel)可以分多次启动，每次由DMA控制器240提前将张量的一个段由片外存储搬运到片内存储，并供kernel操作使用。在多次启动kernel后，张量包含的所有段均被处理，整个运行过程结束。当片上的高速存储器足以容纳kernel所要访问的所有张量时，一个张量仅需要一个段描述即可，kernel也只需要启动一次。In one embodiment, tensor data may be stored in on-chip high-speed memory, such as L2 cache 250. However, due to the small capacity of the on-chip high-speed memory, when the tensor size is large, programmers can divide the tensor into multiple segments, and each segment describes a part of the tensor. The core program (kernel) can be started multiple times. Each time, the DMA controller 240 transfers a segment of the tensor from off-chip storage to on-chip storage in advance and makes it available for kernel operations. After starting the kernel multiple times, all segments contained in the tensor are processed and the entire running process ends. When the on-chip high-speed memory is enough to accommodate all the tensors that the kernel needs to access, a tensor only needs one segment description, and the kernel only needs to be started once.

进一步地，在本公开的一些实施例中，在一个段内，还可以设置至少一个页以进一步细分张量。例如，在第一段S1中，具有4个页P[1]、P[2]、P[3]和P[4]。第二段S2仅具有一个页。在本公开的实施例中，每个段所具有的页的数目可以不同，因此编程人员可以基于设计需要灵活配置段内页的尺寸。例如，将页配置为适于整体存入L2高速缓存250。Further, in some embodiments of the present disclosure, within a segment, at least one page can also be set to further subdivide the tensor. For example, in the first segment S1, there are 4 pages P[1], P[2], P[3] and P[4]. The second segment S2 has only one page. In embodiments of the present disclosure, each segment may have a different number of pages, so programmers can flexibly configure the size of pages within a segment based on design needs. For example, the page is configured to fit into L2 cache 250 in its entirety.

如上所述，当对张量寻址时，最小的寻址单元是以元素为单元。一个页通常可以包括多个元素。目标元素所在的页在本文中被称为“目标元素页”。在本公开的一些实施例中，页可以包括多个缓存行。目标元素页可以位于L2高速缓存250中时，如果PE经由L1高速缓存260读取目标元素，则L2高速缓存250需要将L2高速缓存250中的包括目标元素在内的一小部分的物理地址连续的数据整体传输至L1高速缓存260。这一小部分数据也被称为缓存行(cache line)数据，而这种缓存机制基于空间邻近性原理。PE从L1高速缓存260读取数据仅需几个时钟周期，而L1高速缓存260从L2高速缓存250读取数据可能需要几十个甚至上百个时钟周期。因此，期望减少L1高速缓存260从L2高速缓存250读取数据的次数。虽然在此以“缓存行”来描述从L2高速缓存250到L1高速缓存260的最小传输数据单位，但在本公开中，这部分数据可以并不必然按行或列排列，一个“缓存行”里面的数据分布在多个维上，且各维上分布的数据尺寸不限于1。PE对一个段内的数据进行并行处理，PE的分配在数据的逻辑地址空间展开，独立于段的物理存储结构，具体如下文描述。As mentioned above, when addressing a tensor, the smallest unit of addressing is the element. A page can usually contain multiple elements. The page on which the target element is located is referred to herein as the "target element page". In some embodiments of the present disclosure, a page may include multiple cache lines. While the target element page may be located in the L2 cache 250, if the PE reads the target element via the L1 cache 260, the L2 cache 250 needs to concatenate the physical addresses of a small portion of the L2 cache 250 including the target element. The data is transferred to L1 cache 260 in its entirety. This small part of data is also called cache line data, and this caching mechanism is based on the principle of spatial proximity. It only takes a few clock cycles for the PE to read data from the L1 cache 260, while it may take dozens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times L1 cache 260 reads data from L2 cache 250. Although a "cache line" is used here to describe the smallest unit of data transferred from the L2 cache 250 to the L1 cache 260, in this disclosure, this part of the data may not necessarily be arranged in rows or columns. A "cache line" The data inside is distributed in multiple dimensions, and the size of the data distributed in each dimension is not limited to 1. PE performs parallel processing on the data within a segment. The allocation of PE is expanded in the logical address space of the data and is independent of the physical storage structure of the segment, as described below.

在图3中，第一页P[1]中的第一组缓存行被指定由PE_1处理，第二组缓存行被指定由PE_2处理。虽然在此以顺序示出了张量由多个PE依序处理，但是可以理解张量数据的处理独立于PE的顺序，本公开对此不进行限制。例如图3中的PE_2表示部分的张量数据可以由PE_M处理，其中M表示不大于N的任意整数。In Figure 3, the first set of cache lines in the first page P[1] is designated to be processed by PE_1, and the second set of cache lines is designated to be processed by PE_2. Although the tensor is shown to be processed by multiple PEs in sequence here, it can be understood that the processing of tensor data is independent of the order of the PEs, and the present disclosure is not limited to this. For example, PE_2 in Figure 3 represents part of the tensor data that can be processed by PE_M, where M represents any integer not greater than N.

示例调度过程一Example scheduling process one

如下文将详细介绍的，处理电路200可以通过使用状态指示符来管理不同资源之间的依赖性。为了方便描述，以下将以寄存器作为资源的示例来描述状态指示符的机制。应当理解，其他适当类型的资源也是使用的，其示例包括但不限于：内存地址、队列和处理器单元等。As will be described in detail below, processing circuitry 200 can manage dependencies between different resources through the use of status indicators. For convenience of description, the following uses registers as examples of resources to describe the mechanism of status indicators. It should be understood that other suitable types of resources are used, examples of which include, but are not limited to, memory addresses, queues, processor units, etc.

在一些实施例中，处理电路200可以使用状态指示符(也称为，“令牌(token)”)来管理数据相关性。令牌是一种状态值，其可以用于指示对应的资源(例如，寄存器)中数据的状态。In some embodiments, processing circuitry 200 may use status indicators (also referred to as "tokens") to manage data dependencies. A token is a status value that can be used to indicate the status of data in a corresponding resource (for example, a register).

与传统的基于硬件的寄存器状态不同，token为开发者提供了一种基于软件的状态管理策略，其无需在实现过程中通过寄存器标识去访问对应的硬件状态，而是可以通过灵活的token管理来解决数据依赖性问题。Different from traditional hardware-based register status, token provides developers with a software-based status management strategy. It does not need to access the corresponding hardware status through register identification during the implementation process, but can be managed through flexible token management. Resolve data dependency issues.

示例性地，如果token为第一值(例如，1)，则表示其对应的寄存器中的数据已经准备好了，并且还没有被使用或消费。相反，如果token为第二值(例如，0)，则表示其对应的寄存器中的数据还没有准备好。For example, if the token is the first value (for example, 1), it means that the data in its corresponding register has been prepared and has not been used or consumed yet. On the contrary, if the token is the second value (for example, 0), it means that the data in its corresponding register is not ready yet.

在一些实施例中，处理电路200可以根据寄存器的token值和指令的类型来确定指令是否可以被发出。具体地，如果指令是数据消费指令，即，其指示对寄存器中数据的数据消费操作，则处理电路200可以确定与寄存器对应的token是否为1。如果是，则该指令可以被执行对应的阶段。否则，如果token为0，则该指令需要等待被执行对应的阶段。In some embodiments, the processing circuit 200 may determine whether the instruction can be issued based on the token value of the register and the type of instruction. Specifically, if the instruction is a data consumption instruction, that is, it indicates a data consumption operation on the data in the register, the processing circuit 200 may determine whether the token corresponding to the register is 1. If so, the instruction can be executed at the corresponding stage. Otherwise, if the token is 0, the instruction needs to wait for the corresponding stage to be executed.

在一些实施例中，该阶段是基于指令的类型而被确定的。示例性地，如果该指令是数据消费指令，则该阶段例如可以是指令的发出阶段。也即，只有在token为1时，该数据消费指令才可以被发出。In some embodiments, this stage is determined based on the type of instruction. For example, if the instruction is a data consumption instruction, this stage may be, for example, the issuing stage of the instruction. That is, the data consumption instruction can be issued only when the token is 1.

在一些实施例中，如果指令是数据生产指令，即其指示对寄存器中数据的数据生成操作，则处理电路可以确定与寄存器对应的token是否为0。如果是，则该指令可以被发出，否则，如果token为1，则该指令需要等待发出。In some embodiments, if the instruction is a data production instruction, that is, it indicates a data generation operation on the data in the register, the processing circuit may determine whether the token corresponding to the register is 0. If it is, the instruction can be issued, otherwise, if the token is 1, the instruction needs to wait to be issued.

在一些实施例中，该阶段是基于指令的类型而被确定的。如果该指令是数据生产指令，则该阶段例如可以是指令的回写阶段。也即，该数据生产指令可以先被发出，而在回写阶段检查token是否为0。也即，数据生产指令可以等待token为0，才能够执行数据回写。In some embodiments, this stage is determined based on the type of instruction. If the instruction is a data production instruction, this phase may be, for example, the write-back phase of the instruction. That is, the data production command can be issued first, and whether the token is 0 is checked during the write-back phase. That is, the data production instruction can wait for the token to be 0 before performing data writeback.

在一些实施例中，处理电路200还可以在指令的、针对寄存器的访问操作执行完成时，更新寄存器所对应的token。示例性地，如果该指令是数据消费指令，则可以在完成数据从token到执行单元的送入后，将该token置为0。在另一示例中，如果该指令是数据生产指令，则可以在完成数据到寄存器的回写后，将该token置为1。In some embodiments, the processing circuit 200 may also update the token corresponding to the register when the access operation of the instruction to the register is completed. For example, if the instruction is a data consumption instruction, the token can be set to 0 after the data is sent from the token to the execution unit. In another example, if the instruction is a data production instruction, the token can be set to 1 after writing back the data to the register.

图4示出了根据本公开的一些实施例的示例指令调度过程400。在图4中，“i”表示指令的发射点，“s”表示对应token的置位点(即，设置为1)，“c”表示token的复位点(即，设置为0)。Figure 4 illustrates an example instruction scheduling process 400 in accordance with some embodiments of the present disclosure. In Figure 4, "i" represents the launch point of the instruction, "s" represents the set point of the corresponding token (that is, set to 1), and "c" represents the reset point of the token (that is, set to 0).

如图4所示，指令Load RF[0],MemA在完成数据加载后，对应的token被设置为1。在一些实施例中，开发者可以在指令中指示要对特定的token执行特定的操作。例如，开发者可以编写指令Load RF[0],MemA(no clear,check and set token1)，即该指令没有clear操作，并需要检查“token1”，并在完成后将“token1”置位为1。As shown in Figure 4, the instruction Load RF[0], after MemA completes data loading, the corresponding token is set to 1. In some embodiments, developers can indicate in instructions that specific operations are to be performed on specific tokens. For example, developers can write the instruction Load RF[0],MemA(no clear,check and set token1), that is, the instruction has no clear operation and needs to check "token1" and set "token1" to 1 after completion .

相应地，在检测到token被设置为1后，指令Add RF[x],RF[0]可以被发出，并在完成数据送入后，将token设置为0。在一些实施例中，开发者可以在指令中指示要对特定的token执行特定的操作。例如，开发者可以编写指令Add RF[x],RF[0](check and clear token1,no set)，即该指令没有set操作，并需要检查“token1”，并在完成后将“token1”设置为0。Correspondingly, after detecting that the token is set to 1, the instruction Add RF[x], RF[0] can be issued, and after completing the data sending, the token is set to 0. In some embodiments, developers can indicate in instructions that specific operations are to be performed on specific tokens. For example, developers can write the instruction Add RF[x], RF[0] (check and clear token1, no set), that is, the instruction has no set operation and needs to check "token1" and set "token1" after completion is 0.

进一步地，指令Load RF[0],MemB例如可以先被发出，以执行访存操作。随后，在等待token被设置为0后，该指令可以执行数据回写阶段，并在完成数据加载后，将token设置为1。在一些实施例中，开发者可以在指令中指示要对特定的token执行特定的操作。例如，开发者可以编写指令Load RF[0],MemB(no clear,check and set token1)，即该指令没有clear操作，并需要检查“token1”，并在完成后将“token1”置位为1。Further, the instruction Load RF[0],MemB may be issued first, for example, to perform a memory access operation. Subsequently, after waiting for the token to be set to 0, the instruction can perform the data write-back phase and set the token to 1 after completing the data loading. In some embodiments, developers can indicate in instructions that specific operations are to be performed on specific tokens. For example, developers can write the instruction Load RF[0],MemB(no clear, check and set token1), that is, the instruction does not have a clear operation and needs to check "token1" and set "token1" to 1 after completion .

类似地，指令Add RF[y],RF[0]可以在检测到token为1后被发出，并在完成数据送入后，将token设置为0。在一些实施例中，开发者可以在指令中指示要对特定的token执行特定的操作。例如，开发者可以编写指令Add RF[y],RF[0](check and clear token1,no set)，即该指令没有set操作，并需要检查“token1”，并在完成后将“token1”设置为0。Similarly, the instruction Add RF[y],RF[0] can be issued after detecting that the token is 1, and after completing the data sending, the token is set to 0. In some embodiments, developers can indicate in instructions that specific operations are to be performed on specific tokens. For example, developers can write the instruction Add RF[y],RF[0](check and clear token1,no set), that is, the instruction has no set operation and needs to check "token1" and set "token1" after completion is 0.

进一步地，指令Load RF[0],MemC可以被发出，以执行数据访存操作，并等待token被设置为0后执行数据回写阶段。，在完成数据加载后，该指令可以将token设置为1。在一些实施例中，开发者可以在指令中指示要对特定的token执行特定的操作。例如，开发者可以编写指令Load RF[0],MemA(no clear,check and set token1)，即该指令没有clear操作，并需要检查“token1”，并在完成后将“token1”置位为1。Further, the instruction Load RF[0], MemC can be issued to perform the data memory access operation, and wait for the token to be set to 0 before performing the data write-back phase. , this instruction can set the token to 1 after completing the data loading. In some embodiments, developers can indicate in instructions that specific operations are to be performed on specific tokens. For example, developers can write the instruction Load RF[0],MemA(no clear,check and set token1), that is, the instruction has no clear operation and needs to check "token1" and set "token1" to 1 after completion .

类似地，指令Add RF[z],RF[0]可以在检测到token为1后被发出，并在完成数据送入后，将token设置为0。在一些实施例中，开发者可以在指令中指示要对特定的token执行特定的操作。例如，开发者可以编写指令Add RF[z],RF[0](check and clear token1,no set)，即该指令没有set操作，并需要检查“token1”，并在完成后将“token1”设置为0。Similarly, the instruction Add RF[z],RF[0] can be issued after detecting that the token is 1, and after completing the data sending, the token is set to 0. In some embodiments, developers can indicate in instructions that specific operations are to be performed on specific tokens. For example, developers can write the instruction Add RF[z],RF[0](check and clear token1,no set), that is, the instruction has no set operation and needs to check "token1" and set "token1" after completion is 0.

基于这样的方式，本公开的实施例能够利用状态指示符(即，token)来有效地管理指令之间的资源或数据依赖性，从而提高指令调度的效率，降低***的复杂度。Based on this approach, embodiments of the present disclosure can use status indicators (ie, tokens) to effectively manage resource or data dependencies between instructions, thereby improving the efficiency of instruction scheduling and reducing system complexity.

示例调度过程二Example scheduling process two

在一些实施例中，处理电路200可以通过将读指令迁前移来进一步挺高指令的执行效率。示例性地，每个读指令可以使用不同的token，结果写入不同的寄存器。每个Add指令也因此查看不同的token，使用不同的寄存器的操作数。In some embodiments, the processing circuit 200 may further improve instruction execution efficiency by moving read instructions forward. For example, each read instruction can use a different token, and the result is written to a different register. Each Add instruction therefore looks at a different token and uses different register operands.

在一些实施例中，多个读指令可以被要求按序完成。因此，处理电路可以仅使最后一条度指令更新token，并且是第一条Add指令检查该token后才发出。基于这样的方式，由于多个读指令的执行完成时间接近，处理电路可以在不增加过多等待时间的情况下，降低***复杂度。In some embodiments, multiple read instructions may be required to complete sequentially. Therefore, the processing circuit can only make the last instruction update the token, and issue it after the first Add instruction checks the token. Based on this approach, since the execution completion times of multiple read instructions are close to each other, the processing circuit can reduce system complexity without increasing excessive waiting time.

图5示出了根据本公开的一些实施例的示例指令调度过程500。如图5所示，指令Load RF[0],MemA；Load RF[1],MemB和Load RF[2],MemC可以按序发出，并且按序执行。相应地，指令Add RF[x],RF[0]可以检查与“RF[0]”所对应的token，并在token为1后发出。指令Add RF[y],RF[1]可以检查与RF[1]所对应的token，并在token为1后发出。指令Add RF[z],RF[2]可以检查与RF[2]所对应的token，并在token为1后发出。Figure 5 illustrates an example instruction scheduling process 500 in accordance with some embodiments of the present disclosure. As shown in Figure 5, the instructions Load RF[0], MemA; Load RF[1], MemB and Load RF[2], MemC can be issued in order and executed in order. Correspondingly, the command Add RF[x],RF[0] can check the token corresponding to "RF[0]" and issue it after the token is 1. The command Add RF[y],RF[1] can check the token corresponding to RF[1] and issue it after the token is 1. The command Add RF[z], RF[2] can check the token corresponding to RF[2] and issue it after the token is 1.

基于这样的方式，本公开的实施例可以使得多个读指令并行执行，从而提高***的加载效率。Based on this approach, embodiments of the present disclosure can enable multiple read instructions to be executed in parallel, thereby improving the loading efficiency of the system.

示例调度过程三Example scheduling process three

在一些实施例中，处理电路200可以通过单个token来解决多指令之间的数据依赖性。在示例调度过程中，处理电路200需要维护三个寄存器，以及对应的三个token。为了进一步降低***开销，处理电路200还可以通过利用单个token来调度以上指令的执行。In some embodiments, the processing circuit 200 can resolve data dependencies between multiple instructions through a single token. In the example scheduling process, the processing circuit 200 needs to maintain three registers and corresponding three tokens. In order to further reduce system overhead, the processing circuit 200 can also schedule the execution of the above instructions by using a single token.

图6示出了根据本公开的一些实施例的示例指令调度过程600。如图6所示，第一条读指令写完RF[0]之后将token[0]置位为1，下一条读指令就不能写RF[0]了。它要等待token[0]复位到0，这需要下一条Add指令来做只有Add指令可以将token[0]清0，Add指令在发射时查看token[0]。当token[0]变为1，Add指令可以发射和执行，在读取RF[0]后就可以将token[0]复位为0。Figure 6 illustrates an example instruction scheduling process 600 in accordance with some embodiments of the present disclosure. As shown in Figure 6, after the first read instruction writes RF[0], token[0] is set to 1, and the next read instruction cannot write RF[0]. It has to wait for token[0] to be reset to 0, which requires the next Add instruction. Only the Add instruction can clear token[0] to 0. The Add instruction checks token[0] when it is launched. When token[0] becomes 1, the Add instruction can be issued and executed, and token[0] can be reset to 0 after reading RF[0].

具体地，过程600的调度过程为：起始时，token[0]的值为0。三条读指令可以按序发射，以用于执行访存操作，从A、B、C三个存储地址读取数据并将按序把数据送回到RF[0]的写回队列上。如图6所示，三条读指令可以依次被发射，而不依赖于其他读指令执行完成。Specifically, the scheduling process of process 600 is: at the beginning, the value of token[0] is 0. Three read instructions can be issued in sequence to perform memory access operations, read data from the three storage addresses A, B, and C and send the data back to the write-back queue of RF[0] in sequence. As shown in Figure 6, three read instructions can be issued sequentially without relying on the completion of other read instructions.

进一步地，第一条Add指令在发射级等待，直到token[0]置位为1；第一条读指令取回数据，并且排列在写回RF[0]的第一位，检查token[0]的值为0，即可以写入RF[0]，并将token[0]置为1。Further, the first Add instruction waits at the launch level until token[0] is set to 1; the first read instruction retrieves data and is ranked first in writing back RF[0], checking token[0 ] is 0, that is, you can write RF[0] and set token[0] to 1.

此时，第二条读指令可能已经从MemB地址读取了数据并排在回写RF[0]的队列第二位，当第一条读指令写入RF[0]后，第二条读指令排在队列第一位，同时第三条读指令可能返回并排在第二位。At this time, the second read instruction may have read data from the MemB address and is in the second position of the queue to write back RF[0]. When the first read instruction writes RF[0], the second read instruction queued first, while a third read instruction may return and queue second.

或者，第二条读指令可能处于开始取数队列的第一位，并且第三条读指令可能处于开始取数队列的第二位。Alternatively, the second read instruction may be first in the queue to start fetching, and the third read instruction may be second in the queue to start fetching.

进一步地，第二条读指令检查token[0]为1(被第一条读指令置位)，等待其变为0。Further, the second read instruction checks that token[0] is 1 (set by the first read instruction) and waits for it to become 0.

随后第一条Add指令检测到token[0]为1，发射并执行，然后将token[0]清0。第二条Add指令在发射级等待token[0]为1，当token[0]从1变成0后，第二条读指令的数据回写到RF[0]并将token[0]置位为1。Then the first Add instruction detects that token[0] is 1, launches and executes it, and then clears token[0] to 0. The second Add instruction waits for token[0] to be 1 at the launch level. When token[0] changes from 1 to 0, the data of the second read instruction is written back to RF[0] and token[0] is set. is 1.

第二条Add既可以发射并执行，然后将token[0]清0。此时，第三条读指令应该已经从MemB地址读取了数据并等待回写RF[0]。当token[0]从1变成0后，第三条读指令可以回写RF[0]并将token[0]置为1。第三条Add随后发射并执行，然后将token[0]复位为0。The second Add can be launched and executed, and then clears token[0] to 0. At this time, the third read instruction should have read data from the MemB address and is waiting to write back RF[0]. When token[0] changes from 1 to 0, the third read instruction can write back RF[0] and set token[0] to 1. The third Add is then launched and executed, and then resets token[0] to 0.

基于这样的方式，本公开的实施例能够利用单个token来有效地管理多个指令之间的数据相关性，并且降低所使用的寄存器的数目。Based on this approach, embodiments of the present disclosure can use a single token to effectively manage data dependencies between multiple instructions and reduce the number of registers used.

在一些实施例中，处理电路200还可以在指令发射这一级增加资源来维持发射的指令数目来解决可能的死锁问题。具体地，处理电路200可以在回写队列和取数队列外添加一个指令发射排队，其中排队的长度就是允许发射的读指令最大数目，这可以保证读指令不被堵塞。In some embodiments, the processing circuit 200 can also add resources at the instruction issuance level to maintain the number of instructions issued to solve possible deadlock problems. Specifically, the processing circuit 200 can add an instruction issuance queue outside the write-back queue and the fetch queue, where the length of the queue is the maximum number of read instructions allowed to be issued, which can ensure that the read instructions are not blocked.

应当理解，增加指令发射排队虽然增加了一定资源，但这与增加寄存器数目相比还是更合算的,尤其是在单指令多线程的处理器中。It should be understood that although increasing the instruction issue queue increases certain resources, it is still more cost-effective than increasing the number of registers, especially in a single-instruction multi-thread processor.

在一些实施例中，处理电路200还可以通过前窥指令来解决可能的指令阻塞问题。虽然指令前窥需要增加资源，但其所需资源量与降低的寄存器使用量相比是合理的。使用适量的寄存器并结合前窥机制，本公开的实施例能够有效地隐藏DRAM存储延迟，大大提高性能。In some embodiments, the processing circuit 200 may also resolve possible instruction blocking issues by peeking ahead of instructions. Although instruction lookahead requires additional resources, the amount required is reasonable compared to the reduced register usage. Using an appropriate amount of registers and combined with a look-ahead mechanism, embodiments of the present disclosure can effectively hide DRAM storage delays and greatly improve performance.

示例调度过程四Example scheduling process four

在一些实施例中，处理电路200可以利用token来实现访存和计算的复合指令，从而进一步提高执行效率。在一些实施例中，处理电路200例如可以允许使用与读指令配合使用的计算指令(例如，如图7中所示的mm指令)。In some embodiments, the processing circuit 200 can use tokens to implement composite instructions of memory access and calculation, thereby further improving execution efficiency. In some embodiments, processing circuitry 200 may allow the use of compute instructions (eg, mm instructions as shown in Figure 7) for use with read instructions, for example.

在一些实施例中，与普通的数据消费指令不同，该计算指令可以被两次发出，并且具有两个不同的执行阶段。在第一阶段，该计算指令可以作为普通的读存储指令发出。在第二阶段，在检测到对应的token被设置为1后，该计算指令可以作为数据运算指令而被发出，并在将数据转移到执行单元后，将对应的token置为0。In some embodiments, unlike ordinary data consumption instructions, the calculation instructions can be issued twice and have two different execution stages. In the first stage, the calculation instruction can be issued as an ordinary read storage instruction. In the second stage, after detecting that the corresponding token is set to 1, the calculation instruction can be issued as a data operation instruction, and after transferring the data to the execution unit, the corresponding token is set to 0.

图7示出了根据本公开的一些实施例的示例指令调度过程700。如图7所示，Load RF[0],MemA至mm RF[z],RF[0]可以依次被发出。其中，三个mm指令作为普通读存储指令被发出。Figure 7 illustrates an example instruction scheduling process 700 in accordance with some embodiments of the present disclosure. As shown in Figure 7, Load RF[0], MemA to mm RF[z], RF[0] can be sent out in sequence. Among them, three mm instructions are issued as ordinary read storage instructions.

如图7所示，指令Load RF[0],MemA完成后将token置位为1，此时指令mm RF[x],RF[0]可以作为数据运算指令被重新发出，以消费寄存器中所存储的数据。进一步地，当数据被转移到执行单元后，对应的token可以被清0。As shown in Figure 7, after the instruction Load RF[0], MemA is completed, the token is set to 1. At this time, the instruction mm RF[x], RF[0] can be reissued as a data operation instruction to consume the information in the register. stored data. Further, when the data is transferred to the execution unit, the corresponding token can be cleared to 0.

在token被清0后，指令Load RF[0],MemB可以执行，并且当其完成数据加载后，可以将token置位为1。进一步地，指令mm RF[y],RF[0]可以作为数据运算指令被重新发出，以消费寄存器中所存储的数据。进一步地，当数据被转移到执行单元后，对应的token可以被清0。After the token is cleared to 0, the command Load RF[0], MemB can be executed, and when it completes the data loading, the token can be set to 1. Further, the instructions mm RF[y], RF[0] can be reissued as data operation instructions to consume the data stored in the register. Further, when the data is transferred to the execution unit, the corresponding token can be cleared to 0.

类似地，在token在此被清0后，指令Load RF[0],MemC可以执行，并且当其完成数据加载后，可以将token置位为1。进一步地，指令mm RF[z],RF[0]可以作为数据运算指令被重新发出，以消费寄存器中所存储的数据。进一步地，当数据被转移到执行单元后，对应的token可以被清0。Similarly, after the token is cleared to 0 here, the instruction Load RF[0], MemC can be executed, and when it completes the data loading, the token can be set to 1. Further, the instructions mm RF[z], RF[0] can be reissued as data operation instructions to consume the data stored in the register. Further, when the data is transferred to the execution unit, the corresponding token can be cleared to 0.

在一些实施例中，mm指令例如可以是用于执行矩阵乘运算的指令。In some embodiments, mm instructions may be, for example, instructions for performing matrix multiplication operations.

基于这样的方式，本公开的实施例能够自调节读数据和运算指令，使得程序不受存储延迟的影响，从而能够使得用来隐藏读数据延迟的寄存器数量可以最小化。Based on this approach, embodiments of the present disclosure can self-adjust read data and operation instructions so that the program is not affected by storage delays, thereby minimizing the number of registers used to hide read data delays.

指令调度的示例过程Example process of instruction scheduling

图8示出了根据本公开的一些实施例的指令调度方法800的流程图。在一个实施例中，方法800例如可以由诸如GPU之类的处理电路200(或加速器子***40)实施，因此上面针对图1至图3描述的各个方面可以选择性地适用于方法800。Figure 8 illustrates a flow diagram of an instruction scheduling method 800 in accordance with some embodiments of the present disclosure. In one embodiment, the method 800 may be implemented, for example, by a processing circuit 200 (or accelerator subsystem 40) such as a GPU, and thus various aspects described above with respect to FIGS. 1-3 may be selectively applicable to the method 800.

在框810，处理电路200确定与目标指令相关联的状态指示符，该状态指示符用于指示与目标指令相关联的资源的状态。在框820，处理电路200基于状态指示符和目标指令的类型来确定目标指令是否就绪。响应于在框820确定目标指令就绪，方法800进行到框830，在框830，处理电路200执行目标指令的目标阶段，该目标阶段是基于目标指令的类型而被确定。在框840，处理电路200确定目标指令的、针对资源的访问操作是否执行完成。响应于在框840确定访问操作执行完成，则方法800进行到框850，在框850，处理电路200更新状态指示符。At block 810, processing circuitry 200 determines a status indicator associated with the target instruction that indicates the status of the resource associated with the target instruction. At block 820, processing circuitry 200 determines whether the target instruction is ready based on the status indicator and the type of the target instruction. In response to determining at block 820 that the target instruction is ready, the method 800 proceeds to block 830 where the processing circuit 200 executes a target stage of the target instruction, the target stage being determined based on the type of the target instruction. At block 840, processing circuitry 200 determines whether execution of the target instruction's access operation to the resource is complete. In response to determining at block 840 that the access operation is complete, the method 800 proceeds to block 850 where the processing circuit 200 updates the status indicator.

本公开可以是方法、处理电路、电子设备、计算机存储介质和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质，其上载有用于执行本公开的各个方面的计算机可读程序指令。The disclosure may be a method, processing circuit, electronic device, computer storage medium, and/or computer program product. A computer program product may include a computer-readable storage medium having thereon computer-readable program instructions for performing various aspects of the present disclosure.

计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括：便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身，诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如，通过光纤电缆的光脉冲)、或者通过电线传输的电信号。Computer-readable storage media may be tangible devices that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above. More specific examples (non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or Flash memory), Static Random Access Memory (SRAM), Compact Disk Read Only Memory (CD-ROM), Digital Versatile Disk (DVD), Memory Stick, Floppy Disk, Mechanical Coding Device, such as a printer with instructions stored on it. Protruding structures in hole cards or grooves, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or through electrical wires. transmitted electrical signals.

这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备，或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令，并转发该计算机可读程序指令，以供存储在各个计算/处理设备中的计算机可读存储介质中。Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to various computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage on a computer-readable storage medium in the respective computing/processing device .

用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码，编程语言包括面向对象的编程语言—诸如Smalltalk、C++等，以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中，远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机，或者，可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中，通过利用计算机可读程序指令的状态信息来个性化定制电子电路，例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA)，该电子电路可以执行计算机可读程序指令，从而实现本公开的各个方面。Computer program instructions for performing operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages. Source code or object code written in any combination of programming languages including object-oriented programming languages - such as Smalltalk, C++, etc., and conventional procedural programming languages - such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server implement. In situations involving remote computers, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as an Internet service provider through the Internet). connect). In some embodiments, by utilizing state information of computer-readable program instructions to personalize an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), the electronic circuit can Computer readable program instructions are executed to implement various aspects of the disclosure.

这里参照根据本公开实施例的方法、装置(***)和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解，流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合，都可以由计算机可读程序指令实现。Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理单元，从而生产出一种机器，使得这些指令在通过计算机或其它可编程数据处理装置的处理单元执行时，产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中，这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作，从而，存储有指令的计算机可读介质则包括一个制造品，其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, thereby producing a machine such that the instructions, when executed by a processing unit of the computer or other programmable data processing apparatus, , resulting in an apparatus that implements the functions/actions specified in one or more blocks in the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions cause the computer, programmable data processing device and/or other equipment to work in a specific manner. Therefore, the computer-readable medium storing the instructions includes An article of manufacture that includes instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上，使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤，以产生计算机实现的过程，从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process , thereby causing instructions executed on a computer, other programmable data processing apparatus, or other equipment to implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

附图中的流程图和框图显示了根据本公开的多个实施例的***、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分，模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个连续的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或动作的专用的基于硬件的***来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions that contains one or more executable functions for implementing the specified logical functions instruction. In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts. , or can be implemented using a combination of specialized hardware and computer instructions.

以上已经描述了本公开的各实施方式，上述说明是示例性的，并非穷尽性的，并且也不限于所披露的各实施方式。在不偏离明的各实施方式的范围和精神的情况下，对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择，旨在最好地解释各实施方式的原理、实际应用或对市场中的技术的改进，或者使本技术领域的其他普通技术人员能理解本文披露的各实施方式。The various embodiments of the present disclosure have been described above. The above description is illustrative, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the disclosed embodiments. The terminology used herein is chosen to best explain the principles of the various embodiments, practical applications, or improvements to the technology in the market, or to enable other persons of ordinary skill in the art to understand the various embodiments disclosed herein.

Claims

一种指令调度的方法，所述方法包括：A method of instruction scheduling, the method includes:

确定与目标指令相关联的状态指示符，所述状态指示符用于指示与所述目标指令相关联的资源的状态；determining a status indicator associated with the target instruction, the status indicator being used to indicate the status of a resource associated with the target instruction;

基于所述状态指示符和所述目标指令的类型，确定所述目标指令是否就绪；determining whether the target instruction is ready based on the status indicator and the type of the target instruction;

响应于确定所述目标指令就绪，执行所述目标指令的目标阶段，所述目标阶段基于所述类型而被确定；以及In response to determining that the target instruction is ready, executing a target phase of the target instruction, the target phase being determined based on the type; and

响应于所述目标指令的、针对所述资源的访问操作执行完成，更新所述状态指示符。In response to completion of execution of the access operation for the resource by the target instruction, the status indicator is updated.
根据权利要求1所述的方法，其中基于所述状态指示符和所述目标指令的类型确定所述目标指令是否就绪包括：The method of claim 1, wherein determining whether the target instruction is ready based on the status indicator and the type of the target instruction includes:

响应于所述目标指令的所述类型指示针对所述资源的数据生产操作，确定所述状态指示符是否为第一值，所述第一值指示所述资源中的数据已经被消费；以及responsive to the type of the target instruction indicating a data production operation for the resource, determining whether the status indicator is a first value indicating that data in the resource has been consumed; and

响应于所述状态指示符为所述第一值，确定所述目标指令就绪。In response to the status indicator being the first value, the target instruction is determined to be ready.
根据权利要求2所述的方法，其中执行所述目标指令的目标阶段包括：执行所述目标指令的回写阶段。The method of claim 2, wherein executing a target phase of the target instruction includes executing a writeback phase of the target instruction.
根据权利要求2所述的方法，其中更新所述状态指示符包括：The method of claim 2, wherein updating the status indicator includes:

响应于所述目标指令的、针对所述资源的访问操作执行完成，将所述状态指示符更新为第二值，所述第二值指示所述资源中的数据能够被消费。In response to completion of the access operation for the resource by the target instruction, the status indicator is updated to a second value, the second value indicating that the data in the resource can be consumed.
根据权利要求1所述的方法，其中基于所述状态指示符和所述目标指令的类型确定所述目标指令是否就绪包括：The method of claim 1, wherein determining whether the target instruction is ready based on the status indicator and the type of the target instruction includes:

响应于所述目标指令的所述类型指示针对所述资源的数据消费操作，确定所述状态指示符是否为第二值，所述第二值指示所述资源中的数据能够被消费；以及In response to the type of the target instruction indicating a data consumption operation for the resource, determining whether the status indicator is a second value indicating that data in the resource can be consumed; and

响应于所述状态指示符为所述第二值，确定所述目标指令就绪。In response to the status indicator being the second value, the target instruction is determined to be ready.
根据权利要求5所述的方法，其中执行所述目标指令的目标阶段包括：发出所述目标指令。The method of claim 5, wherein executing a target phase of the target instruction includes issuing the target instruction.
根据权利要求5所述的方法，其中更新所述状态指示符包括：The method of claim 5, wherein updating the status indicator includes:

响应于所述目标指令的、针对所述资源的访问操作执行完成，将所述状态指示符更新为第一值，所述第一值指示所述资源中的数据已经被消费。In response to completion of the access operation for the resource by the target instruction, the status indicator is updated to a first value, the first value indicating that the data in the resource has been consumed.
根据权利要求1所述的方法，其中所述目标指令为第一指令，所述资源为第一资源，所述第一指令指示针对所述第一资源的数据生产操作，并且执行所述目标指令的目标阶段包括：The method of claim 1, wherein the target instruction is a first instruction, the resource is a first resource, the first instruction indicates a data production operation for the first resource, and the target instruction is executed The target stages include:

在第二指令的执行期间，发出所述第一指令，所述第二指令指示针对第二资源的数据生产操作，所述第一资源不同于所述第二资源。The first instruction is issued during execution of a second instruction that instructs a data production operation for a second resource, the first resource being different than the second resource.
根据权利要求1所述的方法，其中所述目标指令为第三指令，所述资源为第三资源，所述第三指令指示针对所述第三资源的数据消费操作，The method of claim 1, wherein the target instruction is a third instruction, the resource is a third resource, and the third instruction indicates a data consumption operation for the third resource,

执行所述目标指令的目标阶段包括：响应于第四指令将所述目标指示符设置为第一值，发出所述第三指令，所述第四指令指示针对所述第三资源的数据生产操作，所述第四指令先于所述第三指令被发出；并且Executing a target phase of the target instruction includes issuing the third instruction in response to a fourth instruction setting the target indicator to a first value, the fourth instruction indicating a data production operation for the third resource , the fourth instruction is issued before the third instruction; and

所述方法还包括：响应于所述第三指令将所述目标指示符更新为第二值，使第五指令被执行，所述第五指令指示针对所述第三资源的数据生产操作，所述第五指令先于所述第三指令并且晚于所述第四指令被发出。The method further includes: updating the target indicator to a second value in response to the third instruction, causing a fifth instruction to be executed, the fifth instruction indicating a data production operation for the third resource, so The fifth instruction is issued before the third instruction and after the fourth instruction.
根据权利要求1所述的方法，其中响应于确定所述目标指令就绪发出所述目标指令包括：The method of claim 1, wherein issuing the target instruction in response to determining that the target instruction is ready includes:

响应于确定所述目标指令就绪，确定已经发出且未完成的指令的数目是否小于阈值；以及In response to determining that the target instruction is ready, determining whether the number of issued and outstanding instructions is less than a threshold; and

响应于确定所述数目小于所述阈值，发出所述目标指令。The target instruction is issued in response to determining that the number is less than the threshold.
根据权利要求1所述的方法，其中确定与目标指令相关联的资源的状态指示符包括：作为内存加载指令发出所述目标指令；以及在所述目标指令作为所述内存加载指令的第一阶段，确定与所述目标指令相关联的所述资源的所述状态指示符；并且The method of claim 1 , wherein determining a status indicator of a resource associated with a target instruction includes: issuing the target instruction as a memory load instruction; and in a first phase of the target instruction as the memory load instruction , determining the status indicator of the resource associated with the target instruction; and

响应于确定所述目标指令就绪执行所述目标指令的所述目标阶段包括：响应于确定所述目标指令就绪，作为运算指令发出重新发出所述目标指令。Executing the target phase of the target instruction in response to determining that the target instruction is ready includes re-issuing the target instruction as an operational instruction in response to determining that the target instruction is ready.
根据权利要求11所述的方法，其中确定所述目标指令是否就绪包括：The method of claim 11, wherein determining whether the target instruction is ready includes:

确定第一内存加载指令是否将所述状态指示符设置为第二值。It is determined whether the first memory load instruction sets the status indicator to a second value.
根据权利要求11所述的方法，还包括：The method of claim 11, further comprising:

在发出所述目标指令之后，发出与所述状态指示符相关联的第二内存加载指令，而无需确认所述状态指示符是否为第一值。After issuing the target instruction, a second memory load instruction associated with the status indicator is issued without confirming whether the status indicator is the first value.
根据权利要求1至13中任一项所述的方法，其中所述资源包括以下至少一项：寄存器、内存地址、队列或处理器资源。The method of any one of claims 1 to 13, wherein the resource includes at least one of: a register, a memory address, a queue or a processor resource.
一种处理电路，包括片上存储器、流处理器和处理引擎，其中所述处理电路被配置为执行根据权利要求1至14中任一项所述的方法。A processing circuit comprising an on-chip memory, a stream processor and a processing engine, wherein the processing circuit is configured to perform the method according to any one of claims 1 to 14.
一种电子设备，包括片外存储存储器和处理电路，其中所述处理电路被配置为执行根据权利要求1至14中任一项所述的方法。An electronic device comprising an off-chip storage memory and a processing circuit, wherein the processing circuit is configured to perform the method according to any one of claims 1 to 14.
一种计算机可读存储介质，其上存储有一条或多条计算机指令，其中所述一条或多条计算机指令被处理电路执行以实现根据权利要求1至14中任一项所述的方法。A computer-readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processing circuit to implement the method according to any one of claims 1 to 14.
一种计算机程序产品，包括计算机可执行指令，其中所述计算机可执行指令在被处理电路执行时实现根据权利要求1至14中任一项所述的方法。A computer program product comprising computer-executable instructions, wherein the computer-executable instructions, when executed by processing circuitry, implement a method according to any one of claims 1 to 14.