CN118296084A - Data processing apparatus, instruction synchronization method, electronic apparatus, and storage medium - Google Patents

Data processing apparatus, instruction synchronization method, electronic apparatus, and storage medium

Info

Publication number
CN118296084A
CN118296084A (application CN202410711364.4A)
Authority
CN
China
Prior art keywords
instruction
synchronization
hardware computing
computing units
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410711364.4A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request (请求不公布姓名)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410711364.4A priority Critical patent/CN118296084A/en
Publication of CN118296084A publication Critical patent/CN118296084A/en
Pending legal-status Critical Current

Landscapes

  • Multi Processors (AREA)

Abstract

The present disclosure provides a data processing apparatus, an instruction synchronization method, an electronic apparatus, and a non-transitory computer-readable storage medium. The data processing apparatus may include N hardware computing units, where the N hardware computing units have a corresponding synchronization cache block and the synchronization cache block is used to cache a synchronization indication signal. Each hardware computing unit is configured to execute the instructions of an instruction sequence in order, and is further configured to: in response to completing execution of the instruction preceding a target instruction, and before executing the target instruction, send a first signal to the synchronization cache block, where the target instruction is an instruction that the N hardware computing units need to execute synchronously; and execute the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the instruction preceding the target instruction. The data processing apparatus emulates the instruction synchronization operation through the synchronization cache block, which expands the number of available synchronization units and reduces the constraints on operator development.

Description

Data processing apparatus, instruction synchronization method, electronic apparatus, and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing apparatus, an instruction synchronization method, an electronic apparatus, and a storage medium.
Background
In some data processing apparatuses, a plurality of hardware computing units may be used to perform a computing task in parallel; for example, each hardware computing unit may perform a part of the computing task, or each hardware computing unit may perform the same computing task on different input data in parallel. Because the hardware computing units perform the computing task in parallel, their instruction sequences are generally approximately the same, and instruction timing synchronization is often required during execution, for example, requiring each hardware computing unit to execute a certain instruction synchronously or within a very short window period, so as to ensure coordination and correct operation order among the hardware computing units.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data processing apparatus including N hardware computing units, where N is a positive integer greater than 1. The N hardware computing units have a corresponding synchronization cache block, and the synchronization cache block is used to cache a synchronization indication signal. Each hardware computing unit is configured to execute the instructions of an instruction sequence in order, and is further configured to: in response to completing execution of the instruction preceding a target instruction, and before executing the target instruction, send a first signal to the synchronization cache block, where the target instruction is an instruction that the N hardware computing units need to execute synchronously; and execute the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the instruction preceding the target instruction, where the synchronization indication signal is determined according to the first signals sent to the synchronization cache block by the N hardware computing units.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the first signal is an atomic addition operation signal, and the synchronization indication signal is determined by accumulating the received first signals. Executing the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the instruction preceding the target instruction includes: cyclically reading the synchronization indication signal, and executing the target instruction in response to the value of the synchronization indication signal being equal to a preset value; and resetting the synchronization indication signal.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the preset value includes a first value or an initial value, where the first value is the sum of the initial value and the accumulated values of the N first signals sent by the N hardware computing units; and resetting the synchronization indication signal includes resetting the synchronization indication signal to the initial value.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the synchronization indication signal is an atomic variable, the atomic addition operation signal increments the atomic variable by 1, and the first value is the sum of the initial value and N.
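The atomic-counter barrier described above can be sketched in software. The following is a minimal illustration, hypothetical and not from the patent itself: Python threads stand in for the hardware computing units, and a lock stands in for the hardware atomic-add. It shows the first signal as an atomic increment, the cyclic read until the synchronization indication signal reaches the preset value (initial value + N), and the reset to the initial value:

```python
import threading
import time

class SyncCacheBlock:
    """Stands in for the synchronization cache block holding the atomic variable."""
    def __init__(self, initial=0):
        self.initial = initial
        self.value = initial
        self._lock = threading.Lock()  # emulates hardware atomicity

    def atomic_add(self, delta=1):
        """The 'first signal': an atomic addition operation."""
        with self._lock:
            self.value += delta

    def read(self):
        with self._lock:
            return self.value

    def reset(self):
        """Reset the synchronization indication signal to the initial value."""
        with self._lock:
            self.value = self.initial

def sync_point(block, n_units):
    """Send the first signal, then cyclically read until all N units have arrived."""
    block.atomic_add(1)
    while block.read() < block.initial + n_units:  # preset value = initial + N
        time.sleep(0)  # busy-wait; yield so the other 'units' can run

N = 4
block = SyncCacheBlock(initial=0)
arrived = []

def unit(uid):
    # ... the instruction preceding the target instruction would execute here ...
    sync_point(block, N)   # no unit passes this point until all N have signaled
    arrived.append(uid)    # the 'target instruction' runs after synchronization

threads = [threading.Thread(target=unit, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
block.reset()              # reset after the synchronization round completes
```

Note that in a real multi-round barrier the reset itself must be coordinated (for example by sense reversal or a designated resetting unit), otherwise one unit's early reset can strand units still spinning; the patent text leaves this detail to the implementation.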
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the target instruction includes a data load instruction, where the data load instruction is used to load the same piece of target data from the memory, and the N hardware computing units trigger the multicast characteristic when executing the data load instruction simultaneously.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, after all of the N hardware computing units execute the target instruction, one of the N hardware computing units loads the target data from the memory into the one hardware computing unit, and broadcasts the target data to the remaining N-1 hardware computing units.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the target instruction includes a logic operation instruction and an operation instruction, where the logic operation instruction is used to perform an arithmetic logic operation in a cache, and the operation instruction is the next instruction executed after the logic operation instruction in the instruction sequence. The number of synchronization cache blocks corresponding to the N hardware computing units is 2: the synchronization cache blocks include a first synchronization cache block used to cache a first synchronization indication signal and a second synchronization cache block used to cache a second synchronization indication signal, where the first synchronization indication signal is used to indicate whether all of the N hardware computing units have executed the instruction preceding the logic operation instruction, and the second synchronization indication signal is used to indicate whether all of the N hardware computing units have executed the logic operation instruction.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, in response to the target instruction being the logic operation instruction, sending the first signal to the synchronization cache block before executing the target instruction includes: in response to completing execution of the instruction preceding the logic operation instruction in the instruction sequence, sending the first signal to the first synchronization cache block before executing the logic operation instruction. Executing the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the preceding instruction includes: cyclically reading the first synchronization indication signal, and executing the logic operation instruction in response to the value of the first synchronization indication signal being equal to a preset value, where the first synchronization indication signal is determined according to the first signals sent to the first synchronization cache block by the N hardware computing units; and resetting the first synchronization indication signal.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, in response to the target instruction being the operation instruction, sending the first signal to the synchronization cache block before executing the target instruction includes: in response to completing execution of the logic operation instruction, sending the first signal to the second synchronization cache block before executing the operation instruction. Executing the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the preceding instruction includes: cyclically reading the second synchronization indication signal, and executing the operation instruction in response to the value of the second synchronization indication signal being equal to a preset value, where the second synchronization indication signal is determined according to the first signals sent to the second synchronization cache block by the N hardware computing units; and resetting the second synchronization indication signal.
For example, in a data processing apparatus provided by at least one embodiment of the present disclosure, the arithmetic logic operation includes a reduction or accumulation.
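The two-block scheme described above amounts to two barriers around the in-cache reduction: one before the logic operation instruction and one before the following operation instruction. A hypothetical sketch, not from the patent, in which Python's `threading.Barrier` stands in for each synchronization cache block and a lock-protected cell stands in for the in-cache accumulator:

```python
import threading

N = 4
pre_barrier = threading.Barrier(N)   # stands in for the first synchronization cache block
post_barrier = threading.Barrier(N)  # stands in for the second synchronization cache block
total = [0]                          # the accumulated value (the 'data A' kept in cache)
total_lock = threading.Lock()        # emulates the cache's atomic accumulate
results = []

def unit(partial):
    pre_barrier.wait()               # sync 1: all units reach the logic operation together
    with total_lock:
        total[0] += partial          # the in-cache reduction (logic operation instruction)
    post_barrier.wait()              # sync 2: no unit proceeds until all have accumulated
    results.append(total[0])         # the following operation instruction sees the full sum

threads = [threading.Thread(target=unit, args=(p,)) for p in (1, 2, 3, 4)]
for t in threads: t.start()
for t in threads: t.join()
```

Without the second barrier, a fast unit could read `total` before slower units had accumulated their partial results; with it, every unit observes the complete reduction (here, 10).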
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, the type of hardware computing unit includes thread hardware, an execution unit, a computing unit, or a programmable multiprocessor.
For example, in the data processing apparatus provided in at least one embodiment of the present disclosure, the synchronization buffer block is located in a memory area that is accessible to all of the N hardware computing units.
For example, in a data processing apparatus provided in at least one embodiment of the present disclosure, the synchronization buffer block is located in a shared memory of the data processing apparatus in response to the type of the N hardware computing units being an execution unit or thread hardware, and the synchronization buffer block is located in a global memory of the data processing apparatus in response to the type of the N hardware computing units being a computing unit or a programmable multiprocessor.
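The placement rule in the preceding paragraph can be restated as a small lookup. This is an illustrative sketch only; the unit-type and memory-region names are hypothetical labels for the categories named in the text:

```python
def sync_block_location(unit_type):
    """Map a hardware computing unit type to the memory region that holds
    its synchronization cache block, per the embodiment described above."""
    if unit_type in ("execution_unit", "thread_hardware"):
        return "shared_memory"   # region accessible to all units of these types
    if unit_type in ("compute_unit", "programmable_multiprocessor"):
        return "global_memory"   # region accessible across computing units
    raise ValueError(f"unknown hardware computing unit type: {unit_type}")
```

The design point is simply that the synchronization cache block must live in the lowest memory level that all N participating units can access.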
At least one embodiment of the present disclosure provides an instruction synchronization method for a data processing apparatus including N hardware computing units, N being a positive integer greater than 1. The instruction synchronization method includes: in response to each hardware computing unit completing execution of the instruction preceding a target instruction in its instruction sequence, and before the target instruction is executed, causing that hardware computing unit to send a first signal to a corresponding synchronization cache block, where the target instruction is an instruction that the N hardware computing units need to execute synchronously; and in response to a synchronization indication signal indicating that all of the N hardware computing units have executed the instruction preceding the target instruction, causing the N hardware computing units to execute the target instruction synchronously, where the synchronization indication signal is determined according to the first signals sent to the synchronization cache block by the N hardware computing units.
For example, in the instruction synchronization method provided in at least one embodiment of the present disclosure, the first signal is an atomic addition operation signal, and the synchronization indication signal is determined by accumulating the received first signals. Executing the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the instruction preceding the target instruction includes: cyclically reading the synchronization indication signal, and executing the target instruction in response to the value of the synchronization indication signal being equal to a preset value; and resetting the synchronization indication signal.
For example, in the instruction synchronization method provided in at least one embodiment of the present disclosure, the preset value includes a first value or an initial value, where the first value is the sum of the initial value and the accumulated values of the N first signals sent by the N hardware computing units; and resetting the synchronization indication signal includes resetting the synchronization indication signal to the initial value.
For example, in the instruction synchronization method provided in at least one embodiment of the present disclosure, the target instruction includes a data load instruction, where the data load instruction is used to load the same piece of target data from the memory, and the N hardware computing units trigger the multicast feature when executing the data load instruction simultaneously. After the N hardware computing units are caused to execute the target instruction synchronously, the instruction synchronization method further includes: causing one of the N hardware computing units to load the target data from the memory into that hardware computing unit and broadcast the target data to the remaining N-1 hardware computing units.
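The multicast load step can be illustrated as follows. This is a hypothetical sketch, not the patent's implementation: a `threading.Barrier` stands in for the synchronization, a dict stands in for memory, and the count of real memory reads shows that synchronized execution turns N loads into one load plus a broadcast:

```python
import threading

MEMORY = {"target": [1, 2, 3]}        # hypothetical memory holding the target data

class MulticastLoader:
    def __init__(self, n_units):
        self.barrier = threading.Barrier(n_units)
        self.shared = None            # broadcast destination visible to all units
        self.load_count = 0           # number of actual reads from memory

    def load(self, key):
        idx = self.barrier.wait()     # all N units issue the load instruction together
        if idx == 0:                  # exactly one unit performs the memory read...
            self.shared = MEMORY[key]
            self.load_count += 1
        self.barrier.wait()           # ...then the data is broadcast to the other N-1
        return self.shared

loader = MulticastLoader(4)
results = []

def unit():
    results.append(loader.load("target"))

threads = [threading.Thread(target=unit) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

`Barrier.wait` returns a distinct arrival index per party, which is used here to pick the single loading unit; the second `wait` guarantees the data is in place before any unit reads it.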
For example, in the instruction synchronization method provided in at least one embodiment of the present disclosure, the target instruction includes a logic operation instruction, which is used to perform an arithmetic logic operation in a cache, and an operation instruction, which is the next instruction in the instruction sequence to be executed after the logic operation instruction. The instruction synchronization method includes: performing a first instruction synchronization before each hardware computing unit executes the logic operation instruction according to the instruction sequence, and performing a second instruction synchronization before each hardware computing unit executes the operation instruction according to the instruction sequence.
At least one embodiment of the present disclosure provides an electronic device, including: a processor; and a memory, wherein the memory has stored therein computer readable code which, when executed by the processor, performs the instruction synchronization method of any embodiment of the present disclosure.
At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the instruction synchronization method of any embodiment of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure or of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure; other drawings may be derived from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 shows a schematic block diagram of a General Purpose Graphics Processor (GPGPU);
FIG. 2 illustrates a schematic block diagram of a data processing apparatus provided by at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an instruction synchronization process provided in at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process for instruction synchronization according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a process for instruction synchronization provided by another embodiment of the present disclosure;
FIG. 6 is a schematic flow chart diagram of an instruction synchronization method provided by at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure;
Fig. 8 is a schematic diagram of a non-transitory computer readable storage medium according to at least one embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. It will be apparent that the described embodiments are merely some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without inventive effort fall within the scope of the present disclosure.
Furthermore, as used in the present disclosure and the claims, unless the context clearly indicates otherwise, the words "a," "an," and "the" do not denote the singular and may include the plural. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but are merely used to distinguish one element from another. Likewise, words such as "comprising" or "including" mean that the elements or items preceding the word encompass the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" or "coupled" are not limited to physical or mechanical connections, and may include electrical connections, whether direct or indirect.
Flowcharts are used in this disclosure to describe the steps of methods according to embodiments of the present disclosure. It should be understood that the preceding or following steps are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously, and other operations may be added to these processes. It will be understood that the terms and terminology used herein have the meanings commonly understood by those skilled in the art.
Fig. 1 shows a schematic structural diagram of a General Purpose Graphics Processor (GPGPU).
As shown in fig. 1, the general-purpose graphics processor is essentially an array of programmable multiprocessors; for example, each programmable multiprocessor may be a streaming processor cluster (Streaming Processor Cluster, SPC), such as the streaming processor cluster 1 shown in fig. 1. In a general-purpose graphics processor, one streaming processor cluster may process one computing task, or a plurality of streaming processor clusters may together process one computing task, with data shared among the streaming processor clusters through a global cache or global memory.
As shown in fig. 1, taking the streaming processor cluster 1 as an example, one streaming processor cluster includes a plurality of computing units, for example, computing unit 1, computing unit 2, ..., computing unit N in fig. 1, where N is a positive integer. Each computing unit (Computation Unit, CU) is used to perform arithmetic logic operations such as accumulation, reduction, and conventional addition, subtraction, multiplication, and division. One computing unit includes a plurality of compute cores, each including an arithmetic logic unit (ALU), a floating-point computing unit, and the like, for performing specific computing tasks. Furthermore, each computing unit includes registers (e.g., the register file of fig. 1) for hierarchically storing source and destination data associated with the computing tasks, and a shared memory for sharing data among the compute cores of that computing unit.
In parallel computing, computing tasks are typically performed by multiple threads. Before execution in a general-purpose graphics processor (also referred to as a parallel computing processor), these threads are grouped into thread blocks, and the thread blocks are then distributed to individual computing units by a thread block distribution module (not shown in FIG. 1). All threads in a thread block must be assigned to the same computing unit for execution. Each thread block is further split into minimum execution units called thread bundles (warps), each of which contains a fixed number of threads (or fewer), e.g., 32 threads. Thread bundles run on execution units (EU) and threads run on thread hardware, e.g., 1 thread bundle runs on 1 execution unit and 1 thread runs on 1 thread hardware. Multiple thread blocks may be executed in the same computing unit or in different computing units.
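The splitting of a thread block into thread bundles can be sketched as follows (an illustrative helper, not part of the patent; the warp size of 32 is the example value from the text):

```python
def split_into_warps(n_threads, warp_size=32):
    """Split a thread block of n_threads into thread bundles (warps):
    each warp holds warp_size threads, except possibly the last,
    which holds the remainder."""
    full, rem = divmod(n_threads, warp_size)
    return [warp_size] * full + ([rem] if rem else [])
```

For instance, a thread block of 100 threads yields three full warps of 32 threads and one partial warp of 4.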
In each computing unit, a thread bundle scheduling/dispatching module (not shown in FIG. 1) schedules, dispatches, and distributes thread bundles so that the multiple compute cores of the computing unit run them. Depending on the number of compute cores in the computing unit, the multiple thread bundles in a thread block may be executed simultaneously or in a time-shared manner. The threads in each thread bundle execute the same instruction. Memory access instructions may be issued to the shared memory in the computing unit, or further to the mid-level cache or the global memory, to perform read/write operations.
The artificial intelligence chip may be implemented in the form of a graphics processor (GPU), a general-purpose graphics processor, a tensor processor (TPU), a data processor (DPU), or the like. An artificial intelligence chip may be used to implement an artificial intelligence model, such as an artificial neural network (Artificial Neural Network, ANN).
Operators in artificial intelligence models generally refer to basic mathematical operations or operations used in the model network layer. These operators are used to construct the various layers and components of the artificial intelligence model to effect the transfer, conversion and computation of data. They are the basic building blocks of artificial intelligence models, defining the structure and operational flow of the model, including input, output and intermediate computation. In the model, the connection relation between operators forms a directed graph reflecting the calculation sequence of different operations in the artificial intelligence model. By combining these operators, a complex and powerful artificial intelligence model can be constructed for processing a variety of complex tasks and data.
To enhance operator performance of artificial intelligence models, artificial intelligence chips typically have broadcast and multicast features. When all threads (threads) in the same thread bundle (warp) read the same data in the shared memory (group shared memory, GSM), if each thread executes an instruction for reading the data, multiple read operations from the memory need to be executed, increasing instruction execution delay and reducing operator performance. At this time, the data loading instruction can be synchronously executed by each thread to trigger the multicast feature, for example, the data loading operation for the same block of data only needs to load data once to the shared memory and then broadcast the data to all threads, so as to improve the performance of the operator. Because the programmable multiprocessors are independent of each other, in order to enable the multicast feature, it is necessary to perform timing synchronization once for the multiple programmable multiprocessors when a scenario occurs in which the multiple programmable multiprocessors read the same block of data.
The artificial intelligence chip's cache (e.g., global cache) may be provided with computing capability so that it can perform some arithmetic logic unit (Arithmetic Logic Unit, ALU) operations. For example, arithmetic logic operations may include various arithmetic and logical operations, such as addition, subtraction, multiplication, division, bit operations, comparison, accumulation, and reduction. For reduction or accumulation calculations, it is necessary to accumulate the computation results, for example in the cache using its in-memory computing capability. Because cache space is limited, if the data A to be accumulated is not in the cache, data A is moved from the memory into the cache, and the computation result B is then accumulated onto data A. This algorithm requires that data A remain in the cache until the computation results of all batches have been accumulated; otherwise data A must be frequently transferred from memory to cache, affecting operator performance. However, these computation results may come from different programmable multiprocessors. If some programmable multiprocessors have completed their computation and accumulation, they will continue to perform other operations, and those operations are likely to occupy cache space and evict data A from the cache, requiring data A to be transferred from memory to cache again, which degrades the performance of the programmable multiprocessors that have not yet completed the accumulation. Therefore, in this case, timing synchronization is often required among the plurality of programmable multiprocessors.
Instruction timing synchronization is generally achieved by a hardware synchronization unit provided in the processor, which provides a preset number of instruction synchronization signals for each level of execution unit, for example, 10 instruction synchronization signals for synchronization between computing units, 10 for synchronization between programmable multiprocessors, and so on. An instruction synchronization signal can only be used for synchronization at its own level or below. For example, when synchronization of computing units is required, the instruction synchronization signals for synchronization between execution units or thread hardware cannot be used; the signals for synchronization between programmable multiprocessors can be used, but this degrades performance, because synchronization of the computing units then depends on synchronization between the programmable multiprocessors.
Because the synchronization operation depends on the hardware synchronization units natively provided by the hardware, operator development is greatly limited when the hardware synchronization units are few. Since the number of synchronization signals provided by the hardware synchronization unit is limited, some instruction synchronization scenarios cannot be supported: for example, the multicast feature cannot be enabled, so multiple data load operations must be executed, reducing operator efficiency, or computing efficiency is reduced by frequent transfers of data from memory to cache. As a result, it is difficult to exploit the full performance of the hardware.
At least one embodiment of the present disclosure provides a data processing apparatus, an instruction synchronization method, an electronic apparatus, and a non-transitory computer-readable storage medium. The data processing apparatus may include N hardware computing units, N being a positive integer greater than 1, where the N hardware computing units have a corresponding synchronization cache block used to cache a synchronization indication signal. Each hardware computing unit is configured to execute the instructions of an instruction sequence in order, and is further configured to: in response to completing execution of the instruction preceding a target instruction, and before executing the target instruction, send a first signal to the synchronization cache block, where the target instruction is an instruction that the N hardware computing units need to execute synchronously; and execute the target instruction in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the instruction preceding the target instruction, where the synchronization indication signal is determined according to the first signals sent to the synchronization cache block by the N hardware computing units.
In the data processing apparatus provided in at least one embodiment of the present disclosure, the instruction synchronization operation is emulated by the synchronization cache block, so that synchronization signals are exchanged among different hardware computing units through the synchronization cache block and the synchronization indication signal. This expands the number of available synchronization units and reduces the constraints on operator development. In principle, the number of synchronization cache blocks is unlimited, so a large number of virtual synchronization units can be added, providing a large number of synchronization signals for instruction synchronization.
FIG. 2 illustrates a schematic block diagram of a data processing apparatus provided by at least one embodiment of the present disclosure.
As shown in fig. 2, the data processing apparatus 100 includes N hardware computing units 101, N being a positive integer greater than 1.
For example, instruction synchronization is required between the N hardware computing units. For example, for a certain target instruction, the N hardware computing units are required to execute it synchronously, where synchronization means either that the N hardware computing units execute the target instruction at the same time, or that they execute it within a very short window period, that is, the interval between the moments at which the N hardware computing units execute the target instruction is very short.
It should be noted that the data processing apparatus may further include more hardware computing units; the N hardware computing units are those, among all hardware computing units included in the data processing apparatus, that need to perform instruction synchronization.
The data processing device may be a central processing unit, a graphics processor, a general-purpose graphics processor, a tensor processor, a neural network processor, etc.; the type of the data processing device is not particularly limited by the present disclosure.
For example, the data processing apparatus may be a computer or an integrated circuit chip, etc., and the hardware structure and hardware form of the data processing apparatus are not particularly limited in the present disclosure.
For example, the data processing device may include multiple levels of hardware computing units, e.g., the data processing device may be a graphics processor or a general purpose graphics processor, and the types of hardware computing units may include thread hardware, execution units, computing units, programmable multiprocessors and the like. For example, where the processor is a graphics processor or a general-purpose graphics processor, the programmable multiprocessor may be a streaming processor cluster. With respect to the relationships among the thread hardware, the execution units, the computation units, and the streaming processor clusters, reference may be made to the foregoing description related to fig. 1, and no further description is given here.
For other configurations or types of data processing devices, different types of hardware computing units may be provided; the present disclosure is not particularly limited in this regard.
For example, the N hardware computing units may be of the same type, that is, at the same hierarchical level, e.g., all computing units. Alternatively, the N hardware computing units may be of different types, e.g., belonging to different tiers: some computing units, some streaming processor clusters, etc. The present disclosure is not particularly limited in this respect. In the present disclosure, instruction synchronization is not limited to hardware at the same level; instruction synchronization may also be performed between higher-level and lower-level hardware computing units.
For example, as shown in fig. 2, N hardware computing units 101 have corresponding synchronization buffer blocks 102 for buffering the synchronization instruction signals.
For example, N hardware computing units 101 share the same synchronization cache block 102, that is, N hardware computing units all use the same synchronization cache block to perform instruction synchronization operations.
For example, the synchronization cache block is located in a memory area that all of the N hardware computing units in the data processing apparatus 100 can access; for example, a block of memory space is allocated from that memory area as the synchronization cache block. Therefore, the number of synchronization cache blocks can be quite large, theoretically unlimited, and the blocks can serve as virtual synchronization units that greatly expand the number of synchronization units, reducing the constraints on operator development.
For example, in response to the type of N hardware computing units being execution units or thread hardware, the synchronization cache block is located in a shared memory of the data processing device.
For example, when the hardware computing unit is an execution unit or thread hardware, as shown in fig. 1, it may access the shared memory within the computing unit, and thus a memory region may be allocated from the shared memory as a synchronous cache block. Because the access speed of the shared memory is relatively high, compared with the use of other memory areas (such as global memory), the instruction synchronization efficiency can be improved.
For example, where the hardware computing unit is a computing unit or a programmable multiprocessor (e.g., a cluster of stream processors), as shown in FIG. 1, it may access global memory in the data processing apparatus, and thus a block of memory may be allocated from the global memory as a synchronized cache block.
Unlike a hardware synchronization unit, whose hardware connection relations are fixed, the instruction synchronization operation using a synchronization cache block is not limited to synchronization within the same layer: synchronization between different layers works exactly the same way as synchronization within the same layer, which reduces the performance loss caused by using a high-level synchronization unit for low-level synchronization.
For example, the synchronization cache blocks can be allocated to the hardware computing units before the task is executed; the allocation may be fixed or adjusted according to different computing tasks.
For example, synchronization cache blocks may be allocated to the hardware computing units that need to perform instruction synchronization in a computing task, and different numbers of synchronization cache blocks may be allocated according to the instructions to be synchronized. For example, as described below, for a data load instruction that is intended to trigger the multicast feature, 1 synchronization cache block may be allocated to the hardware computing units that need to perform the instruction synchronization, and for a logical operation instruction that performs an arithmetic logic operation in the cache, 2 synchronization cache blocks may be allocated to the hardware computing units that need to perform the instruction synchronization.
For example, each hardware computing unit is configured to execute the respective instructions in turn in accordance with the instruction sequence.
For example, for any one hardware computing unit, a series of instructions are executed in order of their instruction sequences to direct the hardware computing unit to perform a particular operation or task. For example, sequences of instructions typically include operations for loading data, processing data, storing data, controlling program flow, and the like. The instruction sequence includes a plurality of instructions that each tell the hardware computing unit to perform a particular action, such as addition, subtraction, comparison, jump, etc. When a computer executes a program, it executes the instructions one by one in the order of the sequence of instructions until the program ends or a particular interrupt or exception is encountered. The instruction sequence may be written directly by a programmer or may be converted to machine language instructions by a compiler after being written using a high-level programming language.
For example, the N hardware computing units may have the same instruction sequence or different instruction sequences. For example, where N hardware computing units are each configured to perform the same computing task, the N hardware computing units may have the same instruction sequence when each hardware computing unit is configured to perform a portion of the computing task.
For example, each hardware computing unit may be configured to: in response to having executed the previous instruction of the target instruction, and before executing the target instruction, send a first signal to the synchronization cache block, where the target instruction is an instruction that the N hardware computing units need to execute synchronously; and in response to the synchronization indication signal indicating that all N hardware computing units have executed the previous instruction of the target instruction, execute the target instruction, where the synchronization indication signal is determined according to the first signals sent to the synchronization cache block by the N hardware computing units.
For example, the target instruction is an instruction that needs to be executed synchronously by N hardware computing units, and for N hardware computing units, the target instruction may be the same instruction, for example, the same instruction that needs to be executed synchronously by N hardware computing units, or the target instruction may be a different instruction for different hardware computing units, for example, the N hardware computing units need to execute different instructions synchronously. The present disclosure is not particularly limited thereto.
For example, after a hardware computing unit has executed the previous instruction of the target instruction (the instruction immediately preceding the target instruction that requires timing synchronization), and before it executes the target instruction, the hardware computing unit sends a first signal to the synchronization cache block, then waits and cyclically reads the synchronization indication signal until the signal indicates that all N hardware computing units have executed the previous instruction of the target instruction. In this way, the N hardware computing units are guaranteed to execute the target instruction simultaneously or almost simultaneously (within a short window period), realizing instruction timing synchronization for the target instruction.
For example, the synchronization indication signal indicating that all of the N hardware computing units are about to execute the target instruction means that all N hardware computing units have executed the previous instruction of the target instruction and may start executing the target instruction.
For example, the first signal is an atomic addition operation signal (for example, an atomic add signal) and the synchronization indication signal is an atomic variable. When an atomic addition operation is executed, other operations must wait until the addition completes; therefore, no accumulation error can occur even when multiple hardware computing units send the first signal simultaneously.
Of course, the first signal may take other signal forms capable of performing an accumulation operation, which is not particularly limited by the present disclosure.
The operations performed for each hardware computing unit are substantially the same, and thus the specific operations of any one hardware computing unit for instruction synchronization are described in detail below.
For example, executing the target instruction by the hardware computing unit in response to the synchronization indication signal indicating that all N hardware computing units have executed the previous instruction of the target instruction includes: cyclically reading the synchronization indication signal, and executing the target instruction in response to the value of the synchronization indication signal being equal to a preset value; and resetting the synchronization indication signal.
For example, the preset value includes a first value or an initial value, where the first value is the sum of the initial value and the accumulated values of the N first signals sent by the N hardware computing units; resetting the synchronization indication signal includes resetting it to the initial value.
For example, in one specific example, the atomic addition operation accumulates 1 onto the synchronization indication signal, which takes the form of an atomic variable, and the first value is the sum of the initial value and N.
For example, the initial value may be set to 0.
For example, the synchronization indication signal is an atomic variable whose initial value is a. After any hardware computing unit executes the previous instruction of the target instruction, and before it executes the target instruction, it sends a first signal to the synchronization cache block 102. Each time the synchronization cache block 102 receives a first signal, the synchronization indication signal is incremented by b; once the first signals sent by all N hardware computing units 101 have been received, the value of the synchronization indication signal has been updated to a + N×b. Here a and b may be integers or floating-point numbers.
For example, each hardware computing unit cyclically reads the synchronization indication signal and determines whether its value is a + N×b. If not, it continues to wait; when the value is a + N×b, all of the N hardware computing units 101 have executed the previous instruction of the target instruction, i.e., are about to execute the target instruction, and at this time each hardware computing unit may execute the target instruction.
After that, the hardware computing unit 101 resets the synchronization indication signal to the initial value a so that the next round of instruction synchronization can be performed.
Therefore, the behavior of a hardware synchronization unit is simulated in software, so that instruction synchronization is performed among different hardware computing units. A large number of synchronization signals can thus be added for instruction synchronization scenarios, allowing the hardware's full performance to be exploited.
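To make the protocol above concrete, the following is a minimal Python sketch, offered as an illustration only and not as the disclosed hardware design: a lock-protected counter stands in for the atomic variable cached in the synchronization cache block, threads stand in for the N hardware computing units, and all names are hypothetical.

```python
import threading

class SyncCacheBlock:
    """Software stand-in for the synchronization cache block: it caches one
    synchronization indication signal, an atomic variable with initial value a
    that each 'first signal' (an atomic add) increments by b."""

    def __init__(self, n_units, a=0, b=1):
        self.n, self.a, self.b = n_units, a, b
        self.signal = a                   # the synchronization indication signal
        self._lock = threading.Lock()     # models hardware atomicity

    def send_first_signal(self):
        # The 'first signal': an atomic accumulation on the indication signal.
        with self._lock:
            self.signal += self.b

    def wait_for_sync(self):
        # Cyclically read the signal until it equals a + N*b (the first value)
        # or has already been reset to a (the initial value) by another unit.
        first_value = self.a + self.n * self.b
        while True:
            with self._lock:
                if self.signal == first_value:
                    self.signal = self.a  # reset for the next round
                    return
                if self.signal == self.a:
                    return                # another unit already reset it

N = 4
block = SyncCacheBlock(N)
log, log_lock = [], threading.Lock()

def hardware_unit(uid):
    with log_lock:
        log.append(("prev_done", uid))    # previous instruction has finished
    block.send_first_signal()             # before executing the target instruction
    block.wait_for_sync()                 # spin-read the indication signal
    with log_lock:
        log.append(("target", uid))       # now execute the target instruction

threads = [threading.Thread(target=hardware_unit, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because a unit only begins spinning after sending its own first signal, and the signal only returns to the initial value through a reset that requires all N contributions, every "target" entry in `log` necessarily appears after all "prev_done" entries, which is exactly the timing-synchronization guarantee described above.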
Fig. 3 is a schematic diagram of an instruction synchronization process according to at least one embodiment of the present disclosure.
As shown in fig. 3, for any hardware computing unit, each instruction is executed sequentially according to the instruction sequence, and after the previous instruction of the target instruction is executed, the instruction synchronization operation as shown in the dashed box part in fig. 3 is executed.
Specifically, as shown in fig. 3, after the previous instruction of the target instruction is executed and before the target instruction is executed, a first signal, for example an atomic add signal, is sent to the synchronization cache block, so that an accumulation operation is performed on the synchronization indication signal. After that, the synchronization indication signal in the synchronization cache block is cyclically read.
When the value of the synchronization indication signal equals a preset value, such as the first value or the initial value, all N hardware computing units have run up to the point just before the target instruction (that is, all N units have executed the previous instruction of the target instruction). At this time the target instruction is executed, realizing instruction timing synchronization among the N hardware computing units. The synchronization indication signal is then reset to the initial value, completing one round of the instruction synchronization operation. Thereafter, execution of subsequent instructions may continue according to the instruction sequence.
For example, in some embodiments, when a hardware computing unit reads that the value of the synchronization indication signal is the first value, it executes the target instruction and resets the synchronization indication signal to the initial value; when a hardware computing unit reads that the value of the synchronization indication signal is the initial value, it only executes the target instruction and does not need to reset the synchronization indication signal again.
Two scenarios for instruction timing synchronization using a data processing apparatus provided in at least one embodiment of the present disclosure are specifically described below.
For example, in some embodiments, the target instruction comprises a data Load instruction (Load instruction) that loads the same block of target data from memory. For example, N hardware computing units may trigger a multicast feature when executing the data load instruction simultaneously.
For example, after the N hardware computing units synchronously execute the target instruction, one of the N hardware computing units loads the target data from the memory into itself and broadcasts the target data to the remaining N-1 hardware computing units.
That is, after the multicast feature is triggered, only one hardware computing unit's data load instruction is actually executed and reads the target data from the memory; the other hardware computing units obtain the target data through the broadcast from that unit. Data loading is therefore executed only once, which reduces the latency cost and performance cost of data loading and improves operator performance.
Fig. 4 is a schematic diagram of a process of instruction synchronization according to an embodiment of the disclosure.
As shown in fig. 4, the N hardware computing units that need instruction synchronization include hardware computing unit 1, hardware computing unit 2, ..., hardware computing unit N.
For each hardware computing unit, the instructions are executed in the order of the instruction sequence. As shown in fig. 4, the times at which the N hardware computing units finish executing the previous instruction (instruction 1) of the data load instruction (the target instruction) differ; if each hardware computing unit directly started to execute the data load instruction without instruction synchronization, the multicast feature could not be enabled.
For example, referring to the process described above, for each hardware computing unit, after instruction 1 is executed and before the data load instruction is executed, a first signal (for example an atomic addition operation signal that accumulates 1) is sent to the synchronization cache block to perform an accumulation operation on the synchronization indication signal. Then the synchronization indication signal is cyclically read, and the data load instruction is executed when its value is N or 0 (assuming the initial value is 0).
Therefore, each hardware computing unit starts to execute the data load instruction only after all N hardware computing units have executed instruction 1, so the data load instructions are executed simultaneously or almost simultaneously, enabling the multicast feature and improving operator performance.
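As a hedged sketch of this scenario (not the actual multicast hardware), the following Python code uses the standard library's `threading.Barrier` as a stand-in for the synchronization cache block; exactly one unit then performs the real memory load and the others receive the broadcast copy. All names are hypothetical, and only the observable effect (one load, N receivers) is modeled.

```python
import threading

N = 4
barrier = threading.Barrier(N)      # stands in for the synchronization cache block
memory_reads = []                   # records every real load from "memory"
broadcast = {}                      # channel that models the multicast
delivered = threading.Event()
results = [None] * N

def hardware_unit(uid):
    # ... instruction 1 finishes at a different time on each unit ...
    idx = barrier.wait()            # instruction synchronization before the load
    if idx == 0:                    # exactly one unit performs the actual load
        memory_reads.append("load")
        broadcast["data"] = "target_data"
        delivered.set()
    delivered.wait()                # the remaining N-1 units wait for the broadcast
    results[uid] = broadcast["data"]

threads = [threading.Thread(target=hardware_unit, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the run, `memory_reads` contains a single entry while every slot of `results` holds the target data, mirroring the claim that only one data load is executed once the multicast feature is triggered.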
For example, in other embodiments, the target instructions include a logical operation instruction and an operation instruction. For example, according to the instruction synchronization process provided in the foregoing embodiments, one instruction synchronization may be performed with the logical operation instruction as the target instruction, and another instruction synchronization may be performed with the operation instruction as the target instruction.
The logical operation instruction is used to perform an arithmetic logical operation in the cache, e.g., the arithmetic logical operation may include a reduction or accumulation. For example, the logic operation instructions executed by the N hardware computing units are used to perform the same arithmetic logic operation, such as providing addends for performing accumulation operations, respectively. For example, the logical operation instruction may be a reduce instruction or an accumulate instruction that is performed in a global cache.
An operation instruction is the next instruction in the instruction sequence to be executed immediately after a logical operation instruction. The operation instructions may be the same instruction or different instructions for different hardware computing units.
When the logical operation instruction is executed, some hardware computing units may finish their accumulation early and go on to execute other operations; those operations can flush the logical operation result (for example, the accumulation result) out of the cache into memory, causing the data to be loaded from memory multiple times. To avoid this, instruction synchronization is performed once before the logical operation instruction is executed, and once again after the logical operation instruction and before the operation instruction. In this way no other operation can occur while the logical operation instruction is being executed, the logical operation result is not flushed from the cache to memory, repeated loading of data from memory is avoided, the latency and performance loss caused by data reads are reduced, and operator performance is improved.
For example, in this embodiment, two synchronization cache blocks are provided for the two instruction synchronizations, respectively, and the two blocks cache different synchronization indication signals.
For example, the number of the synchronization buffer blocks corresponding to the N hardware computing units is 2, including a first synchronization buffer block and a second synchronization buffer block, and the synchronization instruction signal includes a first synchronization instruction signal and a second synchronization instruction signal.
For example, the first synchronization buffer block is used for buffering the first synchronization indication signal, and the second synchronization buffer block is used for buffering the second synchronization indication signal.
For example, the first synchronization indication signal is used to indicate whether all of the N hardware computing units have executed a previous instruction of the logical operation instruction. For example, the second synchronization instruction signal is used to indicate whether all of the N hardware computing units have executed the logic operation instruction.
For example, when the logical operation instruction is the target instruction, sending the first signal to the synchronization cache block in response to having executed the previous instruction of the target instruction and before executing the target instruction includes: in response to having executed, according to the instruction sequence, the previous instruction of the logical operation instruction, and before executing the logical operation instruction, sending a first signal to the first synchronization cache block. Executing the target instruction in response to the synchronization indication signal indicating that all N hardware computing units have executed the previous instruction of the target instruction includes: cyclically reading the first synchronization indication signal, and executing the logical operation instruction in response to the value of the first synchronization indication signal being equal to a preset value, where the first synchronization indication signal is determined according to the first signals sent to the first synchronization cache block by the N hardware computing units; and resetting the first synchronization indication signal.
For example, when the operation instruction is the target instruction, sending the first signal to the synchronization cache block in response to having executed the previous instruction of the target instruction and before executing the target instruction includes: in response to having executed the logical operation instruction, and before executing the operation instruction, sending a first signal to the second synchronization cache block. Executing the target instruction in response to the synchronization indication signal indicating that all N hardware computing units have executed the logical operation instruction includes: cyclically reading the second synchronization indication signal, and executing the operation instruction in response to the value of the second synchronization indication signal being equal to a preset value, where the second synchronization indication signal is determined according to the first signals sent to the second synchronization cache block by the N hardware computing units; and resetting the second synchronization indication signal.
Fig. 5 is a schematic diagram illustrating a process of instruction synchronization according to another embodiment of the present disclosure.
As shown in fig. 5, the N hardware computing units that need instruction synchronization include hardware computing unit 1, hardware computing unit 2, ..., hardware computing unit N.
For each hardware computing unit, the instructions are executed in the order of the instruction sequence. As shown in fig. 5, the times at which the N hardware computing units finish executing the logical operation instruction differ; if each hardware computing unit directly started executing the operation instruction without instruction synchronization, the logical operation result might be flushed out of the cache, causing multiple data load operations from memory.
In this embodiment, as shown in fig. 5, instruction synchronization is performed twice: instruction timing synchronization is performed once before each hardware computing unit executes the logical operation instruction, and once again before each unit executes the operation instruction. This ensures that no other operation occurs while the logical operation instruction is being executed, avoids the operation result being flushed out of the cache, reduces the latency caused by repeated data load operations from memory, and improves operator performance.
For example, in the 1st instruction synchronization, instruction timing synchronization is performed using the first synchronization cache block and the first synchronization indication signal. Referring to the foregoing procedure, for each hardware computing unit, after instruction 1 is executed and before the logical operation instruction is executed, a first signal (for example an atomic addition operation signal that accumulates 1) is sent to the first synchronization cache block to perform an accumulation operation on the first synchronization indication signal. Then the first synchronization indication signal is cyclically read, and the logical operation instruction is executed when its value is N or 0 (assuming the initial value is 0).
For example, in the 2nd instruction synchronization, instruction timing synchronization is performed using the second synchronization cache block and the second synchronization indication signal. Referring to the foregoing procedure, for each hardware computing unit, after the logical operation instruction is executed and before the operation instruction is executed, a first signal (for example an atomic addition operation signal that accumulates 1) is sent to the second synchronization cache block to perform an accumulation operation on the second synchronization indication signal. Then the second synchronization indication signal is cyclically read, and the operation instruction is executed when its value is N or 0 (assuming the initial value is 0).
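The two-round synchronization can be sketched as follows, as an illustration under stated assumptions rather than the patented implementation: two `threading.Barrier` objects stand in for the first and second synchronization cache blocks, a lock-protected dictionary stands in for the accumulation kept in the cache, and all names are hypothetical.

```python
import threading

N = 4
sync1 = threading.Barrier(N)   # stands in for the first synchronization cache block
sync2 = threading.Barrier(N)   # stands in for the second synchronization cache block
cache = {"acc": 0}             # accumulation result kept in the "cache"
cache_lock = threading.Lock()
results = [None] * N

def hardware_unit(uid):
    sync1.wait()                    # sync 1: all units reach the logical operation
    with cache_lock:
        cache["acc"] += uid + 1     # logical operation: each unit adds one addend
    sync2.wait()                    # sync 2: the accumulation is fully complete
    results[uid] = cache["acc"]     # operation instruction reads the final result

threads = [threading.Thread(target=hardware_unit, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the second synchronization, a fast unit could read the accumulation before slower units had contributed; the patent's concern about the result being flushed out of the cache is the hardware analogue of this partial-read hazard. With both synchronizations in place, every unit observes the complete sum 1 + 2 + 3 + 4 = 10.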
Of course, depending on the scenario, a third synchronization cache block, a third synchronization indication signal, and so on may also be provided; the process of performing instruction timing synchronization with more cache blocks and synchronization indication signals is similar to the foregoing and is not repeated here.
In at least one embodiment of the present disclosure, a virtual synchronization cache block is provided to replace the hardware synchronization unit, and the synchronization behavior of the hardware synchronization unit is simulated in software, so that a large number of virtual synchronization units can be added and a large number of synchronization signals can be provided for instruction synchronization scenarios, allowing the hardware's full performance to be exploited. In addition, because the synchronization behavior of the hardware synchronization unit is simulated in software, it is not constrained by the fixed hardware connection relations of a hardware synchronization unit: the instruction synchronization operation is not limited to synchronization within the same layer, synchronization between different layers works exactly the same way as synchronization within the same layer, and the performance loss caused by using a high-level synchronization unit for low-level synchronization is reduced.
At least one embodiment of the present disclosure also provides an instruction synchronization method for a data processing apparatus including N hardware computing units.
For the description of the data processing apparatus and the N hardware computing units, refer to the related content of the foregoing data processing apparatus; the repeated description is omitted.
Fig. 6 is a schematic flow chart diagram of an instruction synchronization method provided in at least one embodiment of the present disclosure. As shown in fig. 6, the instruction synchronization method provided in at least one embodiment of the present disclosure at least includes steps S10 and S20.
In step S10, for each hardware computing unit, in response to the hardware computing unit having executed, according to the instruction sequence, the previous instruction of the target instruction, the hardware computing unit sends a first signal to the synchronization cache block before executing the target instruction.
For example, the target instruction is an instruction that the N hardware computing units need to execute synchronously. For the content of the target instruction, refer to the related description of the data processing apparatus; the repeated description is omitted.
In step S20, in response to the synchronization instruction signal indicating that all of the N hardware computing units have executed the previous instruction of the target instruction, the N hardware computing units are caused to synchronously execute the target instruction.
For example, the synchronization instruction signal is determined according to the first signals sent to the synchronization buffer block by the N hardware computing units.
For example, in some embodiments, the synchronization indication signal is determined by performing an accumulation operation on the received first signal, e.g., the first signal is an atomic addition operation signal, e.g., an atomic add signal, and the synchronization indication signal is an atomic variable.
For example, step S20 may include: circularly reading the synchronous indication signal, and executing a target instruction in response to the value of the synchronous indication signal being equal to a preset value; the synchronization indication signal is reset.
For example, the preset value includes a first value or an initial value, where the first value is a sum of accumulated values of N first signals sent by the N hardware computing units and the initial value. For example, resetting the synchronization indication signal includes: the synchronization indication signal is reset to an initial value.
For the specific process of the instruction synchronization operation in this embodiment, reference may be made to the related description of the foregoing data processing apparatus, which is not repeated here.
For example, the target instruction includes a data load instruction used to load the same block of target data from the memory; that is, the data load instructions executed by the N hardware computing units all load the same block of target data from the memory. The N hardware computing units trigger the multicast feature when they execute the data load instruction simultaneously.
For example, after the N hardware computing units synchronously execute the target instruction, the instruction synchronization method further includes: causing one of the N hardware computing units to load the target data from the memory into itself and to broadcast the target data to the remaining N-1 hardware computing units.
An instruction synchronization method according to at least one embodiment of the present disclosure performs one instruction synchronization before the data load instruction is executed, so as to trigger the multicast feature. After the multicast feature is triggered, only the data load instruction of one hardware computing unit is actually executed and the target data is read from the memory; the other hardware computing units obtain the target data through the broadcast from that hardware computing unit. The data load is therefore performed only once, which reduces the latency and performance overhead caused by data loading and improves operator performance.
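The single-load-plus-broadcast behavior described above can be illustrated with a small simulation. This is a hedged sketch, not the disclosed hardware mechanism: thread 0 stands in for the one hardware computing unit that actually loads the target data, a `threading.Event` stands in for the multicast broadcast, and all names (`MEMORY`, `broadcast_slot`, etc.) are hypothetical.

```python
import threading

N = 4
MEMORY = {"target": [1, 2, 3]}      # hypothetical backing memory
load_count = 0                       # counts how many real loads occur
broadcast_slot = {}                  # where the broadcast data lands
arrive = threading.Barrier(N)        # stands in for the synchronization buffer block
ready = threading.Event()            # stands in for the multicast broadcast
results = [None] * N

def unit(i):
    global load_count
    arrive.wait()                    # instruction synchronization before the load
    if i == 0:
        load_count += 1              # the single data load actually executed
        broadcast_slot["data"] = MEMORY["target"]
        ready.set()                  # broadcast to the remaining N-1 units
    else:
        ready.wait()                 # the other units wait for the broadcast
    results[i] = broadcast_slot["data"]

threads = [threading.Thread(target=unit, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(load_count, results[2])        # 1 [1, 2, 3]
```

Despite four units "executing" the data load instruction, the memory is read only once, which is the latency saving the text describes.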
For the specific instruction synchronization process when the target instruction is a data load instruction, reference may be made to the related description of the foregoing data processing apparatus, which is not repeated here.
For example, the target instructions include a logical operation instruction and an operation instruction. For example, the logical operation instruction is used to perform an arithmetic logic operation in the cache; for example, the arithmetic logic operation may include a reduction or an accumulation. For example, the operation instruction is the next instruction in the instruction sequence, to be executed immediately after the logical operation instruction.
In this scenario, two instruction synchronizations are required. For example, the first instruction synchronization is performed before each hardware computing unit executes the logical operation instruction according to the instruction sequence, and the second instruction synchronization is performed before each hardware computing unit executes the operation instruction according to the instruction sequence.
For example, in this embodiment, two synchronization buffer blocks are provided for the two instruction synchronizations respectively, and the two synchronization buffer blocks buffer different synchronization indication signals.
For example, the number of synchronization buffer blocks corresponding to the N hardware computing units is 2, including a first synchronization buffer block and a second synchronization buffer block, and the synchronization indication signal includes a first synchronization indication signal and a second synchronization indication signal.
For example, the first synchronization buffer block is used for buffering the first synchronization indication signal, and the second synchronization buffer block is used for buffering the second synchronization indication signal.
For example, the first synchronization indication signal is used to indicate whether all of the N hardware computing units have executed the previous instruction of the logical operation instruction. For example, the second synchronization indication signal is used to indicate whether all of the N hardware computing units have executed the logical operation instruction.
For example, when the logical operation instruction is taken as the target instruction, step S10 may include: in response to the previous instruction of the logical operation instruction having been executed according to the instruction sequence, sending the first signal to the first synchronization buffer block before executing the logical operation instruction. Step S20 may include: cyclically reading the first synchronization indication signal, and executing the logical operation instruction in response to the value of the first synchronization indication signal being equal to the preset value, where the first synchronization indication signal is determined according to the first signals sent to the first synchronization buffer block by the N hardware computing units; and resetting the first synchronization indication signal.
For example, when the operation instruction is taken as the target instruction, step S10 may include: in response to the logical operation instruction having been executed, sending the first signal to the second synchronization buffer block before executing the operation instruction. Step S20 may include: cyclically reading the second synchronization indication signal, and executing the operation instruction in response to the value of the second synchronization indication signal being equal to the preset value, where the second synchronization indication signal is determined according to the first signals sent to the second synchronization buffer block by the N hardware computing units; and resetting the second synchronization indication signal.
For the specific instruction synchronization process when the target instructions are a logical operation instruction and an operation instruction, reference may be made to the related description of the foregoing data processing apparatus, which is not repeated here.
When the logical operation instruction is executed, some hardware computing units might otherwise run ahead and execute other operations after finishing their accumulation; those operations could flush the logical operation result (for example, the accumulation result) out of the cache to the memory, causing the data to be loaded from the memory multiple times. To avoid this, one instruction synchronization is performed before the logical operation instruction is executed, and another instruction synchronization is performed after the logical operation instruction and before the operation instruction. In this way, no other operation can occur during the execution of the logical operation instruction that would flush the logical operation result out of the cache to the memory, repeated data loading from the memory is avoided, the latency and performance loss caused by data reading are reduced, and operator performance is improved.
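The two-synchronization scheme, one synchronization before the logical operation instruction and another before the following operation instruction, can be simulated as follows. This is an illustrative sketch under the assumption that a `threading.Barrier` adequately models each synchronization buffer block; the shared accumulation stands in for the arithmetic logic operation performed in the cache, and all names are hypothetical.

```python
import threading

N = 4
values = [1, 2, 3, 4]
sync1 = threading.Barrier(N)   # models the first synchronization buffer block
sync2 = threading.Barrier(N)   # models the second synchronization buffer block
acc = 0
acc_lock = threading.Lock()
seen = [None] * N

def unit(i):
    global acc
    sync1.wait()               # 1st instruction synchronization: all units enter
                               # the logical operation together
    with acc_lock:
        acc += values[i]       # logical operation instruction (accumulation)
    sync2.wait()               # 2nd instruction synchronization: no unit runs
                               # ahead while partial results are still in cache
    seen[i] = acc              # operation instruction reads the complete result

threads = [threading.Thread(target=unit, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)                    # [10, 10, 10, 10]
```

Because the second synchronization fences the accumulation, every unit's operation instruction observes the full sum; removing `sync2.wait()` would let a fast unit read a partial result, which is exactly the hazard the text describes.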
For example, N hardware computing units share the same synchronous cache block.
For example, the types of hardware computing units include thread hardware, execution units, computing units, or programmable multiprocessors.
For example, the synchronization buffer block is located in a memory area that is accessible to all N hardware computing units in the data processing apparatus.
For details of the N hardware computing units and the synchronization buffer block, reference may be made to the related description of the foregoing data processing apparatus, which is not repeated here.
In at least one embodiment of the present disclosure, a virtual synchronization buffer block is provided to replace a hardware synchronization unit, and the synchronization behavior of the hardware synchronization unit is simulated in software. A large number of virtual synchronization units, and thus a large number of synchronization signals, can therefore be added, providing sufficient support for instruction timing synchronization. Instruction timing synchronization is no longer limited by the number of synchronization signals, the constraints on operator development are reduced, and the full performance of the hardware can be exploited. In addition, because the synchronization behavior of the hardware synchronization unit is simulated in software, it is not constrained by the fixed hardware connections of the hardware synchronization unit: the instruction synchronization operation is not limited to synchronization within the same level, synchronization across different levels works the same way as synchronization within the same level, and the performance loss caused by using a higher-level synchronization unit for lower-level synchronization is reduced.
With respect to a specific implementation procedure involved in the instruction synchronization method according to the embodiments of the present disclosure, reference may be made to the data processing apparatus according to some embodiments of the present disclosure described above in connection with fig. 2 to 5, and a description thereof will not be repeated here.
At least one embodiment of the present disclosure also provides an electronic device. Fig. 7 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 7, the electronic device includes a processor 201, a communication interface 202, a memory 203, and a communication bus 204. The processor 201, the communication interface 202, and the memory 203 communicate with each other via the communication bus 204; these components may also communicate with each other via a network connection. The present disclosure does not limit the type and functionality of the network. It should be noted that the components of the electronic device shown in fig. 7 are exemplary only and not limiting, and the electronic device may have other components as required by practical applications.
For example, the memory 203 is used to store computer-readable instructions in a non-transitory manner. The processor 201 is configured to implement the instruction synchronization method according to any of the above embodiments when executing the computer-readable instructions. For the specific implementation of each step of the instruction synchronization method and related explanation, reference may be made to the above embodiments of the instruction synchronization method, which are not repeated here.
For example, other implementations of the instruction synchronization method realized by the processor 201 executing computer-readable instructions stored in the memory 203 are the same as those mentioned in the foregoing method embodiments and are not repeated here. The processor 201 may perform various actions and processes according to programs stored in the memory 203. In particular, the processor 201 may be an integrated circuit with signal processing capability; a general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. For example, a processor here may refer to any processor capable of data processing, such as the aforementioned data processing apparatus.
For example, the communication bus 204 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
For example, the communication interface 202 is used to enable communication between an electronic device and other devices.
For example, the processor 201 may control other components in the electronic device to perform desired functions. The processor 201 may be a device with data processing and/or program execution capabilities, such as a Central Processing Unit (CPU), a Network Processor (NP), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The Central Processing Unit (CPU) may be of the X86 or ARM architecture, etc.
For example, the memory 203 may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory can include, for example, Read-Only Memory (ROM), hard disk, Erasable Programmable Read-Only Memory (EPROM), portable Compact Disc Read-Only Memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 201 may execute them to implement various functions of the electronic device. Various applications, various data, and the like may also be stored in the storage medium.
For example, a detailed description of a procedure of the electronic device performing the instruction synchronization method may refer to a related description in an embodiment of the instruction synchronization method, and a detailed description is omitted.
Of course, the architecture shown in fig. 7 is merely exemplary, and one or more components of the electronic device shown in fig. 7 may be omitted as may be practical in implementing different devices.
Fig. 8 is a schematic diagram of a non-transitory computer readable storage medium according to at least one embodiment of the present disclosure.
For example, as shown in fig. 8, the storage medium 300 may be a non-transitory computer-readable storage medium, and the one or more computer-readable instructions 301 may be stored non-transitory on the storage medium 300. For example, computer readable instructions 301, when executed by a processor, may perform one or more steps in accordance with the instruction synchronization methods described above.
For example, the storage medium 300 may be applied to the above-described electronic device, and for example, the storage medium 300 may include a memory in the electronic device.
For example, the storage medium may include a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), portable Compact Disc Read-Only Memory (CD-ROM), flash memory, any combination of the foregoing, or other suitable storage media.
For example, for a description of the storage medium 300, reference may be made to the description of the memory in the embodiment of the electronic device, which is not repeated here.
Those skilled in the art will appreciate that various modifications and improvements can be made to the disclosure. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Further, while the present disclosure makes various references to certain units in a system according to embodiments of the present disclosure, any number of different units may be used and run on a client and/or server. The units are merely illustrative, and different aspects of the systems and methods may use different units.
Flowcharts are used in this disclosure to describe the steps of methods according to embodiments of the present disclosure. It should be understood that the preceding and following steps are not necessarily performed in that exact order. Rather, the various steps may be processed in reverse order or simultaneously. Also, other operations may be added to these processes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the methods described above may be implemented by a computer program to instruct related hardware, and the program may be stored in a computer readable storage medium, such as a read only memory, a magnetic disk, or an optical disk. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiment may be implemented in the form of hardware, or may be implemented in the form of a software functional module. The present disclosure is not limited to any specific form of combination of hardware and software.
Unless defined otherwise, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.

Claims (20)

1. A data processing apparatus, comprising N hardware computing units, N being a positive integer greater than 1,
The N hardware computing units are provided with corresponding synchronous cache blocks, the synchronous cache blocks are used for caching synchronous indication signals,
Each hardware computing unit is configured to execute the respective instructions in turn according to a sequence of instructions,
Each hardware computing unit is further configured to:
in response to a previous instruction of the target instruction having been executed, and before the target instruction is executed, sending a first signal to the synchronous cache block, wherein the target instruction is an instruction that needs to be synchronously executed by the N hardware computing units;
And responding to the synchronous indication signal to indicate that all the N hardware computing units have executed the previous instruction of the target instruction, and executing the target instruction, wherein the synchronous indication signal is determined according to the first signals sent to the synchronous cache block by the N hardware computing units.
2. The data processing apparatus according to claim 1, wherein the first signal is an atomic addition operation signal, and the synchronization indication signal is determined by performing an accumulation operation on the received first signals,
wherein the hardware computing unit executing the target instruction, in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the previous instruction of the target instruction, comprises:
Circularly reading the synchronous indication signal, and executing the target instruction in response to the value of the synchronous indication signal being equal to a preset value;
resetting the synchronization indication signal.
3. The data processing apparatus according to claim 2, wherein the preset value includes a first value or an initial value, wherein the first value is the initial value plus the sum of the accumulated values of the N first signals sent by the N hardware computing units;
Resetting the synchronization indication signal includes: resetting the synchronization indication signal to the initial value.
4. A data processing apparatus according to claim 3, wherein the synchronization indication signal is an atomic variable, the atomic addition operation signal comprises incrementing the atomic variable by 1,
The first value is the sum of the initial value and N.
5. The data processing apparatus of any of claims 1-4, wherein the target instructions comprise a data load instruction to load the same block of target data from memory,
The N hardware computing units trigger a multicast feature when executing the data loading instruction simultaneously.
6. The data processing apparatus of claim 5, wherein after each of the N hardware computing units executes the target instruction, one of the N hardware computing units loads the target data from the memory into itself and broadcasts the target data to the remaining N-1 hardware computing units.
7. The data processing apparatus according to any one of claims 1 to 4, wherein the target instruction includes a logical operation instruction for performing an arithmetic logical operation in a cache and an operation instruction that is a next instruction in the instruction sequence to be executed after the logical operation instruction,
The number of the synchronous cache blocks corresponding to the N hardware computing units is 2, comprising a first synchronous cache block and a second synchronous cache block,
The synchronization indication signal comprises a first synchronization indication signal and a second synchronization indication signal,
The first synchronous buffer block is used for buffering the first synchronous indication signal, the second synchronous buffer block is used for buffering the second synchronous indication signal,
The first synchronization indication signal is used for indicating whether all of the N hardware computing units have executed a previous instruction of the logical operation instruction,
the second synchronization indication signal is used to indicate whether all of the N hardware computing units have executed the logical operation instruction.
8. The data processing apparatus according to claim 7, wherein, in response to the target instruction being the logical operation instruction,
wherein the hardware computing unit sending the first signal to the synchronous cache block, in response to a previous instruction of the target instruction having been executed and before the target instruction is executed, comprises:
in response to the previous instruction of the logical operation instruction having been executed according to the instruction sequence, sending the first signal to the first synchronous cache block before executing the logical operation instruction;
wherein the hardware computing unit executing the target instruction, in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the previous instruction of the target instruction, comprises:
cyclically reading the first synchronization indication signal, and executing the logical operation instruction in response to the value of the first synchronization indication signal being equal to a preset value, wherein the first synchronization indication signal is determined according to the first signals sent to the first synchronous cache block by the N hardware computing units; and
And resetting the first synchronization indication signal.
9. The data processing apparatus according to claim 7, wherein, in response to the target instruction being the operation instruction,
wherein the hardware computing unit sending the first signal to the synchronous cache block, in response to a previous instruction of the target instruction having been executed and before the target instruction is executed, comprises:
in response to the logical operation instruction having been executed, sending the first signal to the second synchronous cache block before executing the operation instruction;
wherein the hardware computing unit executing the target instruction, in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the previous instruction of the target instruction, comprises:
The second synchronous indication signal is circularly read, and the operation instruction is executed in response to the fact that the value of the second synchronous indication signal is equal to a preset value, wherein the second synchronous indication signal is determined according to the first signals sent to the second synchronous cache block by the N hardware computing units;
resetting the second synchronization indication signal.
10. The data processing apparatus of claim 7, wherein the arithmetic logic operation comprises a reduction or accumulation.
11. The data processing apparatus of any of claims 1-4, wherein the type of hardware computing unit comprises thread hardware, an execution unit, a computing unit, or a programmable multiprocessor.
12. The data processing apparatus according to any one of claims 1 to 4, wherein the synchronization buffer block is located in a memory area accessible to all of the N hardware computing units.
13. The data processing apparatus of claim 12, wherein the synchronization buffer block is located in a shared memory of the data processing apparatus in response to the type of the N hardware computing units being an execution unit or thread hardware,
And in response to the N hardware computing units being of a type of computing unit or a programmable multiprocessor, the synchronous cache block is located in a global memory of the data processing device.
14. An instruction synchronization method for a data processing apparatus including N hardware computing units, N being a positive integer greater than 1, the instruction synchronization method comprising:
for each hardware computing unit, in response to the hardware computing unit having executed a previous instruction of a target instruction according to an instruction sequence, sending, by the hardware computing unit, a first signal to the synchronous cache block before the target instruction is executed, wherein the target instruction is an instruction that needs to be synchronously executed by the N hardware computing units;
And responding to the synchronization indication signal to indicate that all the N hardware computing units have executed the previous instruction of the target instruction, so that the N hardware computing units synchronously execute the target instruction, wherein the synchronization indication signal is determined according to the first signals sent to the synchronous cache blocks by the N hardware computing units.
15. The instruction synchronization method according to claim 14, wherein the first signal is an atomic addition operation signal, and the synchronization indication signal is determined by performing an accumulation operation on the received first signals,
wherein, in response to the synchronization indication signal indicating that all of the N hardware computing units have executed the previous instruction of the target instruction, causing the N hardware computing units to synchronously execute the target instruction comprises:
for each hardware computing unit, circularly reading the synchronous indication signals by the hardware computing unit, and executing the target instruction by the hardware computing unit in response to the numerical value of the synchronous indication signals being equal to a preset value;
resetting the synchronization indication signal.
16. The instruction synchronization method according to claim 15, wherein the preset value includes a first value or an initial value, the first value being the initial value plus the sum of the accumulated values of the N first signals sent by the N hardware computing units;
Resetting the synchronization indication signal includes: resetting the synchronization indication signal to the initial value.
17. The instruction synchronization method according to any one of claims 14-16, wherein the target instruction comprises a data load instruction for loading the same block of target data from a memory, the N hardware computing units triggering a multicast feature when executing the data load instruction simultaneously,
After the N hardware computing units are caused to synchronously execute the target instruction, the instruction synchronization method further includes:
And enabling one hardware computing unit in the N hardware computing units to load the target data from the memory to the one hardware computing unit, and broadcasting the target data to the rest N-1 hardware computing units.
18. The instruction synchronization method according to any one of claims 14 to 16, wherein the target instruction includes a logical operation instruction for performing an arithmetic logical operation in a cache and an operation instruction that is a next instruction in the instruction sequence to be executed after the logical operation instruction,
the instruction synchronization method comprises: performing a first instruction synchronization before each hardware computing unit executes the logical operation instruction according to the instruction sequence, and performing a second instruction synchronization before each hardware computing unit executes the operation instruction according to the instruction sequence.
19. An electronic device, comprising:
a processor; and
A memory, wherein the memory has stored therein computer readable code which, when executed by the processor, performs the instruction synchronization method of any of claims 14-18.
20. A non-transitory computer readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the instruction synchronization method of any of claims 14-18.
CN202410711364.4A 2024-06-04 2024-06-04 Data processing apparatus, instruction synchronization method, electronic apparatus, and storage medium Pending CN118296084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410711364.4A CN118296084A (en) 2024-06-04 2024-06-04 Data processing apparatus, instruction synchronization method, electronic apparatus, and storage medium


Publications (1)

Publication Number Publication Date
CN118296084A 2024-07-05

Family

ID=91686863




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination