WO2024124807A1 - AI chip, tensor processing method, and electronic device


Info

Publication number
WO2024124807A1
Authority
WO
WIPO (PCT)
Prior art keywords
addressing
tensor
addressing mode
vector
mode
Prior art date
Application number
PCT/CN2023/096078
Other languages
French (fr)
Chinese (zh)
Inventor
王平
罗前
顾铭秋
Original Assignee
成都登临科技有限公司
上海登临科技有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202211597989.XA external-priority patent/CN115658146B/en
Priority claimed from CN202211597988.5A external-priority patent/CN115599442B/en
Application filed by 成都登临科技有限公司, 上海登临科技有限公司
Publication of WO2024124807A1 publication Critical patent/WO2024124807A1/en

Definitions

  • the present application belongs to the field of computer technology, and specifically relates to an AI chip, an electronic device, and a tensor processing method.
  • Step 1, the processing unit (processing core) calculates the addresses of the tensor elements corresponding to one or more source operands; Step 2, the one or more source operands are read from the memory into the vector operation unit according to the above addresses; Step 3, the vector operation unit calculates the result.
  • the AI chip only supports real-time addressing: after calculating the address of a tensor element, the one or more source operands must be read from the memory into the vector operation unit according to that address immediately. In this way, if an error occurs in the operation, re-addressing is required, resulting in low operation efficiency and high overhead.
  • step 3 is a valid calculation, and the other steps will cause additional costs.
  • the address calculation cost of the tensor element corresponding to step 1 is relatively high, and this part of the ineffective calculation accounts for a large proportion in the current common systems.
  • the first aspect of the present application provides an AI chip and a tensor processing method to address the problem that the AI chip in the related technology supports only real-time addressing, so that if an error occurs partway through an operation, re-addressing is required, resulting in low computing efficiency and high overhead.
  • an AI chip which includes: a vector register, an engine unit, and a vector operation unit; wherein the engine unit is used to obtain the source operands required for the vector operation instruction from the original tensor data according to the tensor addressing mode, and the number of source operands obtained by the tensor addressing mode at one time is greater than or equal to 1; the vector register is connected to the engine unit, and the vector register is used to store the source operands obtained by the engine unit; the vector operation unit is connected to the vector register, and the vector operation unit is used to obtain the source operands of the vector operation instruction from the vector register for operation to obtain the operation result.
  • the engine unit may determine one or more tensor elements at one time using the tensor addressing mode.
  • the engine unit adopts the tensor addressing mode to obtain the source operands required for the vector operation instruction from the tensor data, so that at least one source operand can be obtained by one addressing, which greatly improves the efficiency of obtaining the source operands, so that only fewer instructions are needed to find the required source operands; at the same time, the vector register is used to store the source operands obtained by the engine unit, so that the vector operation unit can directly obtain the source operands of the vector operation instruction from the vector register for operation.
  • This design realizes the separation of addressing and operation: the source operands required for the vector operation instruction can be acquired in advance and then read directly when the operation is performed later. Even if an error occurs partway through the operation, there is no need to re-address; the corresponding source operand can be obtained directly from the vector register, which improves the efficiency of the operation.
  • the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  • the parameters included in the tensor addressing mode are expanded from the three parameters included in the original slice addressing mode to more parameters (including size information for characterizing the shape of the data obtained by addressing), thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
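As a rough sketch, the expanded four-parameter mode (written below as start:stop:step:size, matching the 0:8:2:5 notation used in the figures) can be modeled in Python. The function name and the flat one-dimensional layout are illustrative assumptions, not the patent's actual hardware interface:

```python
def new_slice_addresses(start, stop, step, size):
    """Model of the four-parameter addressing mode start:stop:step:size:
    each addressing step yields `size` consecutive element indices
    beginning at the current addressing pointer."""
    ptr = start
    while ptr < stop:
        yield list(range(ptr, ptr + size))
        ptr += step  # the step parameter sets the pointer's offset amplitude

# One addressing step now returns `size` element addresses instead of one:
blocks = list(new_slice_addresses(0, 8, 2, 5))  # the 0:8:2:5 mode
```

With the size parameter, each addressing step yields a whole block of operand addresses, which is where the reduction in redundant instructions comes from.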
  • the parameters of the tensor addressing mode also include characteristic parameters that characterize the situation where the shape of the data obtained by addressing is retained as an incomplete shape.
  • a characteristic parameter (such as represented by partial) that characterizes the retention of an incomplete shape of the data shape obtained by addressing can be introduced, so that by configuring the value of the characteristic parameter, it can be flexibly determined whether to retain the data contained in the incomplete shape.
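A minimal sketch of how such a partial flag might behave, again assuming a one-dimensional layout and treating all names as illustrative:

```python
def new_slice_with_partial(data, start, stop, step, size, partial):
    """When the last addressed block runs past the end of the data,
    the (hypothetical) `partial` flag decides whether the incomplete
    shape is retained (True) or discarded (False)."""
    blocks = []
    for ptr in range(start, stop, step):
        block = data[ptr:ptr + size]
        if len(block) == size or partial:
            blocks.append(block)
    return blocks

data = list(range(10))
kept = new_slice_with_partial(data, 0, 10, 4, 4, partial=True)
dropped = new_slice_with_partial(data, 0, 10, 4, 4, partial=False)
```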
  • the tensor addressing mode includes a nested dual tensor addressing mode
  • the dual tensor addressing mode includes: an outer iterative addressing mode and an inner iterative addressing mode, wherein the inner iterative addressing mode is nested within the outer iterative addressing mode and performs addressing on the tensor data obtained by the outer iterative addressing mode.
  • a nested dual tensor addressing mode is used for addressing to realize independent addressing of an external iterative addressing mode and an internal iterative addressing mode.
  • the addressing efficiency is higher: only one instruction is needed to accomplish the addressing that would otherwise require two single-layer instructions, and the internal iterative addressing mode performs addressing based on the tensor data obtained in the external iterative addressing mode.
  • the data addressed by the external iterative addressing mode is only an intermediate result and does not need to be read or written.
  • the data reading and writing of the first layer is reduced.
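The two-level scheme can be sketched as follows; the list-based layout and the split of parameters between the outer and inner modes are assumptions for illustration. Note that the outer mode's candidate areas are intermediate results that are never written out:

```python
def dual_addressing(data, outer, inner):
    """Nested dual tensor addressing: the outer iterative mode selects
    candidate areas; the inner iterative mode addresses within each
    candidate area to produce the final source operands."""
    o_start, o_stop, o_step, o_size = outer
    i_start, i_stop, i_step = inner
    operands = []
    for o in range(o_start, o_stop, o_step):
        candidate = data[o:o + o_size]  # intermediate result: not read or written out
        operands.append(candidate[i_start:i_stop:i_step])
    return operands

data = list(range(16))
# One instruction's worth of parameters replaces two single-layer passes:
out = dual_addressing(data, outer=(0, 8, 4, 4), inner=(0, 4, 2))
```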
  • the engine unit is also connected to an external memory, and the external memory is used to store the original tensor data.
  • when a vector register is connected to an external memory through the engine unit, the external memory can be used directly to store the original tensor data instead of the source operands obtained by the engine unit, so that the storage space of the external memory does not need to be designed to be very large (since tensor operations involve a large amount of repeated data, the amount of data addressed using the tensor addressing mode may be larger than the original tensor data).
  • there are multiple vector registers, and different vector registers store different source operands.
  • the engine unit includes: multiple addressing engines, one addressing engine corresponds to one vector register, and each addressing engine is used to obtain the source operand required for the vector operation instruction from the corresponding original tensor data according to the respective tensor addressing mode.
  • each addressing engine can obtain the source operand required for the vector operation instruction from the corresponding original tensor data according to its own tensor addressing mode, which not only improves the addressing efficiency but also does not interfere with each other.
  • the engine unit also includes: a main engine, used to send control commands to each of the addressing engines, control each of the addressing engines to address according to their respective tensor addressing modes, and obtain source operands required for vector operation instructions from the corresponding original tensor data.
  • a main engine is set to send control commands to each addressing engine to control the independent addressing of each addressing engine, and the main engine is used for centralized control to ensure that the shape of the source operands obtained by each addressing engine is consistent, thereby ensuring the accuracy of the operation.
  • the control commands include three types: an Advance command, a Reset command, and a NOP command.
  • when the main engine sends a control command to each of the addressing engines, it sends a combined control command composed of at least one of the three control commands, so as to control each of the addressing engines to address different dimensions of the original tensor data according to their respective tensor addressing modes.
  • a combined control command composed of at least one of an Advance command, a Reset command, and a NOP command is sent to the addressing engine.
  • the combined control command is, for example, "W cmd: NOP; H cmd: NOP" or "W cmd: Advance; H cmd: NOP", so as to control each addressing engine to address different dimensions of the original tensor data (such as width W and height H) according to their respective tensor addressing modes.
  • Multiple functions can be realized by combining simple commands.
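The per-dimension dispatch might look like the following sketch, where the command semantics (Advance moves the pointer by the step, Reset returns it to the start, NOP leaves it unchanged) are inferred from the command names and are assumptions rather than the patent's definitions:

```python
def apply_command(pointer, cmd, step, start):
    """Assumed semantics of the three control commands for one dimension."""
    if cmd == "Advance":
        return pointer + step  # move the addressing pointer forward
    if cmd == "Reset":
        return start           # return to the starting address
    return pointer             # NOP: leave the pointer unchanged

def apply_combined(pointers, combined, steps, starts):
    """A combined command such as "W cmd: Advance; H cmd: NOP" controls
    each dimension of the tensor independently."""
    return {dim: apply_command(pointers[dim], combined.get(dim, "NOP"),
                               steps[dim], starts[dim])
            for dim in pointers}

new = apply_combined(pointers={"W": 4, "H": 2},
                     combined={"W": "Advance", "H": "NOP"},
                     steps={"W": 2, "H": 1},
                     starts={"W": 0, "H": 0})
```

Composing the three simple commands per dimension is what lets a small command set express many traversal patterns.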
  • the embodiment of the first aspect of the present application further provides a tensor processing method, the tensor processing method comprising:
  • source operands required for the vector operation instruction are obtained from the original tensor data, wherein the number of source operands obtained by the tensor addressing mode at one time is greater than or equal to 1; the obtained source operands are stored in the vector register; the source operands of the vector operation instruction are obtained from the vector register for operation to obtain the operation result.
  • the method can be applied to the AI chip of the first aspect and/or the electronic device of the second aspect.
  • one or more tensor elements may be obtained by addressing at one time using the tensor addressing mode.
  • the source operands required for the vector operation instructions are obtained from the original tensor data according to the tensor addressing mode, including: using the addressing engine to obtain the source operands required for the vector operation instructions from the original tensor data according to the tensor addressing mode under the control command sent by the main engine.
  • the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  • the tensor addressing mode includes a nested dual tensor addressing mode
  • the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode; obtaining source operands required for vector operation instructions from original tensor data according to the tensor addressing mode, including: using the external iterative addressing mode to select at least one candidate area from the original tensor data, and the candidate area contains multiple tensor elements; using the internal iterative addressing mode to obtain the source operands required for vector operation instructions from the at least one candidate area.
  • an AI chip including: a vector register, an engine unit, and a vector operation unit; wherein the vector register is used to store tensor data required for operation; an engine unit, connected to the vector register, and the engine unit is used to obtain source operands required for vector operation instructions from the tensor data stored in the vector register according to a tensor addressing mode, wherein the number of source operands obtained by the engine unit at one time according to the tensor addressing mode is greater than or equal to 1; a vector operation unit, connected to the engine unit, and the vector operation unit is used to operate on the source operands obtained by the engine unit according to the vector operation instruction to obtain an operation result.
  • the engine unit adopts a tensor addressing mode to obtain the source operands required for the vector operation instruction from the tensor data stored in the vector register, so that at least one source operand can be obtained by one addressing, wherein the engine unit can obtain one or more tensor elements by one addressing using the tensor addressing mode, which greatly improves the efficiency of obtaining the source operands, so that only fewer instructions are needed to find the required source operands; at the same time, the engine unit is set between the vector register and the vector operation unit, so that the vector register can directly store the original tensor data required for the operation.
  • although the amount of data addressed using the tensor addressing mode may be larger than the original tensor data, this design does not require the storage space of the vector register to be very large.
  • there are multiple vector registers, and different vector registers store different tensor data.
  • different vector registers store different tensor data, so that when addressing, the tensor data stored in each vector register can be addressed in parallel at the same time, which can improve the addressing efficiency.
  • the engine unit includes: multiple addressing engines, one addressing engine corresponds to one vector register, and each addressing engine is used to obtain the source operand required for the vector operation instruction from the corresponding vector register according to the respective tensor addressing mode.
  • the engine unit also includes: a main engine, used to send control commands to each of the addressing engines, control each of the addressing engines to address according to their respective tensor addressing modes, and obtain the source operands required for the vector operation instructions from the corresponding vector registers.
  • the addressing engine keeps the addressing pointer unchanged when addressing the first dimension of the tensor data according to the Advance command sent by the main engine, so as to continuously obtain the source operand pointed to by the current addressing pointer.
  • the addressing engine is used to keep the addressing pointer unchanged when addressing the first dimension of the tensor data according to the Advance command sent by the main engine, so as to continuously obtain the source operand pointed to by the current addressing pointer, thereby realizing repeated reading of data and achieving the purpose of broadcasting without the need for re-addressing, which can improve the addressing efficiency.
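The broadcast behaviour can be sketched like this (a toy model: the `broadcast_first_dim` flag and the list layout are illustrative assumptions):

```python
def read_operands(data, start, step, advances, broadcast_first_dim):
    """On each Advance along the first dimension, a broadcasting
    addressing engine keeps the pointer unchanged, so the same source
    operand is read repeatedly without re-addressing."""
    pointer = start
    reads = []
    for _ in range(advances):
        reads.append(data[pointer])
        if not broadcast_first_dim:
            pointer += step
    return reads

normal = read_operands([10, 20, 30], 0, 1, 3, broadcast_first_dim=False)
broadcast = read_operands([10, 20, 30], 0, 1, 3, broadcast_first_dim=True)
```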
  • the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  • the slice addressing mode of the related technology is improved and expanded, so that the parameters included in the tensor addressing mode (which can also be regarded as a new slice addressing mode) are expanded from the three parameters (start, stop, step) included in the original slice addressing mode to include more parameters, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
  • the extended parameters by configuring the extended parameters, more flexible and diverse addressing can be supported.
  • the parameters of the tensor addressing mode also include a characteristic parameter that characterizes whether the incomplete shape of the data obtained by addressing is retained.
  • a characteristic parameter (such as represented by partial) can be introduced to characterize the retention of an incomplete shape of the data shape obtained by addressing, so that by configuring the value of the characteristic parameter, it can be flexibly determined whether to retain the data contained in the incomplete shape.
  • the tensor addressing mode includes a nested dual tensor addressing mode
  • the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode, wherein the internal iterative addressing mode addresses the tensor data addressed by the external iterative addressing mode.
  • a nested dual tensor addressing mode can be used for addressing to achieve independent addressing of an external iterative addressing mode and an internal iterative addressing mode.
  • the addressing efficiency is higher: only one instruction is needed to accomplish the addressing that would otherwise require two single-layer instructions, and the internal iterative addressing mode performs addressing on the basis of the external iterative addressing mode.
  • the data addressed by the external iterative addressing mode is only an intermediate result and does not need to be read or written.
  • the data reading and writing of the first layer is reduced.
  • the parameters of the external iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, a size representing the shape of the data obtained by addressing, and a characteristic parameter representing the retention of an incomplete shape of the data obtained by addressing.
  • an addressing mode including the above five parameters is used for addressing, so that each addressing in the external iterative addressing mode can obtain a candidate area of a complete shape, thereby facilitating subsequent addressing in the internal iterative addressing mode based on the candidate area of the complete shape, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
  • the parameters of the inner iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  • an addressing mode including the above four parameters is used for addressing, so that each addressing can obtain a data shape including multiple tensor elements, and the source operand obtained by one addressing is greater than or equal to 1, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
  • the external iterative addressing mode is used to select at least one candidate area from the tensor data, and the candidate area contains multiple tensor elements; the internal iterative addressing mode is used to obtain the source operand required for the vector operation instruction from the at least one candidate area.
  • the tensor elements of the candidate area can participate in data reading and writing as part of the addressing result.
  • the tensor elements of the candidate area may not participate in data reading and writing, but only serve as intermediate results of the addressing process. Whether to output data of the tensor elements of the candidate area can be flexibly configured and decided according to actual needs.
  • the external iterative addressing mode and the internal iterative addressing mode are each one of a new slice addressing mode, a slice addressing mode, and an index addressing mode
  • the new slice addressing mode includes more addressing parameters than the slice addressing mode
  • the additional addressing parameters include: a size characterizing the shape of the data obtained by addressing.
  • both the external iterative addressing mode and the internal iterative addressing mode support multiple addressing modes, so that the entire addressing mode is compatible with the addressing mode of the related technology (slice addressing mode, index addressing mode), and is compatible with the tensor addressing mode (new slice addressing mode) proposed in this application, thereby increasing the flexibility and ease of use of the solution.
  • An embodiment of the second aspect of the present application also provides a tensor processing method, including: obtaining source operands required for vector operation instructions from tensor data according to a tensor addressing mode, wherein the number of source operands obtained by one addressing according to the tensor addressing mode is greater than or equal to 1; according to the vector operation instruction, performing operations on the obtained source operands to obtain operation results.
  • the method can be applied to the AI chip of the first aspect and/or the electronic device of the second aspect.
  • for the principles and beneficial effects of the method, please refer to the relevant descriptions of the embodiments or implementations of the other aspects.
  • obtaining source operands required for vector operation instructions from tensor data according to the tensor addressing mode may include: using the addressing engine to obtain the source operands required for the vector operation instructions from the tensor data according to the tensor addressing mode under the control command sent by the main engine.
  • the tensor addressing mode includes a nested dual tensor addressing mode
  • the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode; wherein, obtaining source operands required for vector operation instructions from tensor data according to the tensor addressing mode includes: selecting at least one candidate area from the tensor data using the external iterative addressing mode, and the candidate area contains multiple tensor elements; and obtaining the source operands required for vector operation instructions from the at least one candidate area using the internal iterative addressing mode.
  • the parameters of the external iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, a size representing the shape of the data obtained by addressing, and a characteristic parameter representing the retention of an incomplete shape of the data obtained by addressing; and/or, the parameters of the inner iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  • An embodiment of the present application also provides an electronic device, comprising: a memory for storing tensor data required for operations; an AI chip as described in the first aspect of the present application, the AI chip is connected to the memory, and the AI chip is used to obtain source operands required for vector operation instructions from the original tensor data stored in the memory according to the tensor addressing mode, and according to the vector operation instruction, the obtained source operands are operated to obtain operation results; or an AI chip as described in the second aspect of the present application, the AI chip is connected to the memory, and the AI chip is used to write the tensor data stored in the memory into the vector register in the AI chip.
  • FIG. 1A is a schematic diagram of the structure of an AI chip and a memory in the related art.
  • FIGS. 1B and 1C are schematic diagrams showing the principle of addressing using a slice addressing mode in the related art.
  • FIG. 2 shows a schematic structural diagram of a connection between an AI chip and a memory provided in an embodiment of the first aspect of the present application.
  • FIG. 3A shows a first schematic diagram of the principle of broadcasting provided by an embodiment of the present application.
  • FIG. 3B shows a second schematic diagram of the principle of broadcasting provided by an embodiment of the present application.
  • FIG. 3C shows a third schematic diagram of the principle of broadcasting provided by an embodiment of the present application.
  • FIG. 4 shows a schematic diagram of the principle of a first tensor addressing mode provided by an embodiment of the first aspect of the present application.
  • FIG. 5 is a schematic diagram showing the principle of a second tensor addressing mode provided by an embodiment of the first aspect of the present application.
  • FIG. 6 shows a schematic diagram of the structure of another AI chip provided in an embodiment of the first aspect of the present application.
  • FIG. 7 is a schematic diagram showing the principle of addressing according to the 0:8:2:5 addressing mode provided by an embodiment of the first aspect of the present application.
  • FIG. 8 is a schematic diagram showing the principle of addressing according to the 0:5:2:1 addressing mode provided by an embodiment of the first aspect of the present application.
  • FIG. 9 shows a schematic diagram of a tensor processing method provided by an embodiment of the first aspect of the present application.
  • FIG. 10 shows a schematic structural diagram of a first AI chip provided in an embodiment of the second aspect of the present application.
  • FIG. 11 shows a schematic diagram of the structure of a second AI chip provided in an embodiment of the second aspect of the present application.
  • FIG. 12 shows a schematic structural diagram of a tensor processing method provided by an embodiment of the second aspect of the present application.
  • FIG. 13A shows a schematic structural diagram of an electronic device provided by an embodiment of the first aspect of the present application.
  • FIG. 13B shows a schematic structural diagram of an electronic device provided in an embodiment of the second aspect of the present application.
  • FIG. 14 is a schematic diagram showing the principle of addressing according to the 0:4:1:3 addressing mode provided by an embodiment of the first aspect of the present application.
  • for the calculation process between tensors, the related technology takes a long time to calculate the addresses of the tensor elements corresponding to the source operands; that is, a large part of the time is wasted on calculating those addresses.
  • this is mainly because the related technology is based on the slice addressing mode of numpy (Numerical Python, an open-source numerical computing extension of Python).
  • start indicates the starting address of the addressing
  • stop indicates the ending address of the addressing
  • step is equivalent to stride, which indicates the offset of the addressing pointer.
  • This slice addressing mode can only obtain one tensor element at a time, for example the nine elements (0,0), (2,0), (4,0), (0,2), (2,2), (4,2), (0,4), (2,4), (4,4) in the illustrated example.
  • this slice addressing mode can only obtain one tensor element at a time. Since tensor calculations involve a large number of source operands, if you want to obtain multiple tensor elements corresponding to many source operands, you need to use a large number of instructions for multiple addressing. This will result in a large number of instructions being wasted on addressing, that is, wasted on ineffective calculations.
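For instance, with the slice parameters 0:5:2 applied to both dimensions of a 5x5 tensor (the setting implied by the element list above), one addressing instruction is needed per element:

```python
# Related-art slice addressing start:stop:step, one element per step.
start, stop, step = 0, 5, 2
visited = [(r, c) for c in range(start, stop, step)
                  for r in range(start, stop, step)]
# Nine separate addressing steps are needed for the nine elements.
```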
  • the present application provides a new AI chip that can determine the addresses of the tensor elements corresponding to at least one source operand with one addressing. This greatly improves the efficiency of calculating tensor-element addresses, so that only a few instructions are needed to find the required source operands, which greatly reduces redundant instructions, increases effective instruction density, improves processing performance, and simplifies programming.
  • the vector register is used to store the source operand obtained by the engine unit, so that the vector operation unit can directly obtain the source operand of the vector operation instruction from the vector register for operation.
  • This design method can support the early acquisition of the source operand required for the vector operation instruction, and then directly read it when performing the operation, thereby realizing the separation of addressing and operation. Even if an error occurs in the operation in the middle, there is no need to re-address, and the corresponding source operand can be directly obtained from the vector register, which can improve the efficiency of the operation.
  • This AI chip can accelerate the addressing and calculation of tensors at the hardware level.
  • the hardware can support multiple tensor addressing modes, has good compatibility, and is conducive to quickly completing calculations between tensors.
  • the embodiments of the present application also propose new addressing modes for tensor addressing in terms of flexibility, ease of use, and processing efficiency.
  • on the one hand, a new slice-based tensor addressing mode is designed by further expanding the existing slice addressing; on the other hand, a nested addressing mode that can perform dual tensor addressing is designed.
  • These new addressing modes can be applied to the AI chip provided in the embodiments of the present application, which helps reduce the total number of instructions required for calculations between tensors, enables flexible addressing, and improves hardware processing performance and tensor processing efficiency.
  • the AI chip includes a vector register, an engine unit, and a vector operation unit.
  • the vector register is directly connected to the vector operation unit, and the engine unit and the vector operation unit can be directly connected.
  • the engine unit is used to obtain the source operands required for the vector operation instructions from the original tensor data according to the tensor addressing mode.
  • the number of source operands obtained by the tensor addressing mode at one time is greater than or equal to 1.
  • Tensor data is multidimensional data, which is usually expanded in some way (such as linear layout, tiled layout, etc.) and stored in devices with storage functions such as memory, on-chip memory, etc.
  • the vector register is used to store the source operands obtained by the engine unit. In this way, even if an error occurs in the operation, there is no need to re-address, and the corresponding source operand can be directly obtained from the vector register.
  • the vector operation unit is connected to the vector register, and is used to obtain the source operand of the vector operation instruction from the vector register to perform the operation and obtain the operation result.
  • the vector operation instruction refers to an instruction that can calculate more than two operands at the same time.
  • the type of vector operation instruction can be various operation types such as addition, subtraction, multiplication, multiplication and accumulation.
  • the vector operation unit is directly connected to the vector register. Since the data stored in the vector register is the source operand required by the vector operation instruction addressed from the original tensor data according to the tensor addressing mode, the vector operation unit can obtain the content of the tensor addressing by directly accessing the vector register.
  • the engine unit when the vector register is directly connected to the engine unit, the engine unit is also connected to the external memory. At this time, it is equivalent to the vector register being connected to the external memory through the engine unit.
  • the external memory is used to store original tensor data. Taking the vector operation instruction with three operands, such as A*B+C as an example, A, B, and C are all source operands, where A, B, and C can be a single element or an array containing multiple elements.
  • the memory is used to store the original tensor data where operand A is located, the original tensor data where operand B is located, and the original tensor data where operand C is located.
  • the engine unit obtains the operand A of the vector operation instruction from the original tensor data where operand A is located according to the tensor addressing mode and stores it in the vector register; likewise, it obtains the operand B of the vector operation instruction from the original tensor data where operand B is located and stores it in the vector register, and obtains the operand C of the vector operation instruction from the original tensor data where operand C is located and stores it in the vector register.
  • the above-mentioned AI chip can be an integrated circuit chip with data processing capabilities, which can be used to process operations between tensors.
  • it can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices.
  • the general-purpose processor can also be a microprocessor or the AI chip can also be any conventional processor, such as a graphics processor (GPU), a general-purpose graphics processor (GPGPU), etc.
  • the first type is to perform one-to-one operations on the tensor elements in two or more tensor data with the same dimension size (called element-wise operation for short). For example, assuming that the width (W) and height (H) of two tensor data are both 5, that is, W*H is 5*5, then when performing one-to-one operations on the tensor elements in the two tensor data, one-to-one operations are performed between the tensor elements at corresponding positions.
  • the second type is to operate a tensor element in one tensor data with a group of tensor elements in another tensor data (called broadcasting). For example, for an image tensor (the first operand), each pixel value of the image needs to be divided by 255 (the second operand). Assume that the dimension information of the two operands are: the number of channels, width, and height of the first operand are 3, 224, 224 respectively; the number of channels, width, and height of the second operand are 1, 1, 1 respectively.
  • the second operand will be broadcasted in three dimensions (i.e., the number of channels, width, and height mentioned above), that is, before the division operation, the second operand with dimension (1, 1, 1) (value is 255) needs to be expanded into a tensor with dimension (3, 224, 224), and the value of each dimension of this tensor is 255.
  • the dimensions of the two operands are the same, and the division operation can be performed, and then the tensor elements in the two tensor data can be operated one-to-one.
  • For a second operand whose dimensions are (3, 1, 224), it is only necessary to broadcast in the width direction, that is, to expand it into a tensor with dimensions (3, 224, 224). At this time, the two operands have the same dimensions and the division can be performed.
  • the two tensor data must have the same number of dimensions, and at least one dimension of one of the tensor data must be 1.
  • elements of the broadcast tensor data are copied along the dimensions where broadcasting occurs, while all other dimensions remain unchanged.
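The broadcast matching described above follows the same rules as numpy-style broadcasting; the following is a minimal illustrative sketch (numpy is used here purely for demonstration, not as part of the chip design):

```python
import numpy as np

# Image tensor from the example above: (channels, width, height) = (3, 224, 224).
image = np.full((3, 224, 224), 128.0, dtype=np.float32)

# Second operand with dimensions (1, 1, 1); broadcasting conceptually
# expands it to (3, 224, 224) before the element-wise division.
divisor = np.full((1, 1, 1), 255.0, dtype=np.float32)
result = image / divisor
print(result.shape)   # (3, 224, 224)

# Broadcasting also works when only some dimensions are 1: a (3, 1, 224)
# operand is expanded along the width dimension only.
partial_op = np.full((3, 1, 224), 255.0, dtype=np.float32)
print((image / partial_op).shape)   # (3, 224, 224)
```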
  • the address calculation mode of common tensor elements (such as the slice addressing mode of the related technology) is also relatively regular, where slicing refers to accessing tensor elements in a certain order or span.
  • the AI chip provided by this application also proposes a new tensor addressing mode based on the full use of the regularity of tensor operations in related technologies in order to improve the addressing efficiency of tensor elements.
  • This tensor addressing mode improves and expands the slice addressing mode of related technologies, so that the parameters contained in the tensor addressing mode (which can also be regarded as a new slice addressing mode) are expanded from the three parameters contained in the original slice addressing mode to more parameters, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
  • the parameters of the tensor addressing mode in this application include: the starting address representing the starting point of addressing (such as represented by start), the end address representing the end point of addressing (such as represented by stop), the step length representing the offset amplitude of the addressing pointer (such as represented by step), and the size of the data shape obtained by addressing (such as represented by size). Then the expression of the tensor addressing mode can be recorded as [start: stop: step: size]. That is, the new slice addressing mode contains more addressing parameters than the slice addressing mode. More addressing parameters include: the size of the data shape obtained by addressing.
  • size is used to describe the size of the data shape obtained by addressing; that is, at each step, all element points contained in a complete shape are extracted instead of just one point.
  • the values of start, stop, step, and size in the above expression are all configurable to suit calculations between various tensors.
  • the parameters of the tensor addressing mode also include: a characteristic parameter indicating whether an incomplete shape in the data obtained by addressing is retained (for example expressed by partial, used to reflect the integrity of a local tensor data shape).
  • the expression of the tensor addressing mode is [start: stop: step: size: partial].
  • the parameters of the tensor addressing mode in this application add two parameters, size and partial.
  • one step can determine multiple tensor elements.
  • One addressing can determine the addresses of the tensor elements corresponding to one or more source operands. For example, in the above example, one addressing (i.e., one step) determines the addresses of 9 tensor elements, whereas with the method of the related art, 9 addressings, and accordingly more instructions, are required to determine the addresses of these 9 tensor elements.
  • the addressing method provided by the embodiment of the first aspect of the present application can greatly improve the efficiency of calculating the address of the tensor element corresponding to the source operand, so that only fewer instructions are needed to find the required source operand, greatly reducing redundant instructions, improving the effective instruction density, improving performance, and simplifying programming.
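The gain from the size parameter can be illustrated with a small one-dimensional software model of the extended slice addressing mode [start : stop : step : size : partial] (a hypothetical helper for illustration, not the hardware implementation itself):

```python
# Illustrative model: at each step the addressing pointer yields the
# addresses of ALL elements of one size-wide shape, not a single point.
def slice_addresses(start, stop, step, size, partial=True):
    """Return, per step, the list of element addresses covered by one shape."""
    shapes = []
    for pos in range(start, stop, step):
        shape = [a for a in range(pos, pos + size) if a < stop]
        if shape and (partial or len(shape) == size):
            shapes.append(shape)
    return shapes

# One addressing step yields `size` addresses at once:
print(slice_addresses(0, 8, 2, 3))              # [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7]]
# partial=False drops the incomplete trailing shape:
print(slice_addresses(0, 8, 2, 3, partial=False))
# size=1 degenerates to the classic 3-parameter slice of the related art:
print(slice_addresses(0, 5, 2, 1))              # [[0], [2], [4]]
```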
  • In some cases, each step addresses only one tensor element, for example by setting size to 1. The addressing mode in this case is similar to the addressing mode of the related art (the addressing logic of the related art can be made compatible on the hardware simply by changing parameter values), and its schematic diagram is shown in Figure 1. Alternatively, the existing addressing mode can be used directly for addressing.
  • Alternatively, the required tensor element can be further selected from the shape obtained according to size and step; that is, when sliding one step, a single required tensor element can be extracted from the multiple tensor elements contained in the resulting shape. Therefore, even when only one tensor element needs to be addressed per step, size need not be 1. In this way, the tensor addressing mode provided in the embodiments of the present application is compatible with the slice addressing effect of the related art, balancing processing efficiency and addressing flexibility.
  • the number of vector registers in the AI chip can be multiple, and different vector registers may store different source operands (part of the tensor data). Take a vector operation instruction with three operands, such as A*B+C, as an example. At this time, the number of vector registers may include 3, for example, vector register 1, vector register 2, and vector register 3. Vector register 1 can be used to store source operand A, vector register 2 can be used to store source operand B, and vector register 3 can be used to store source operand C. This application does not limit the specific instruction expression form.
  • each operand corresponds to an addressing mode.
  • the addressing modes of multiple operands match each other. Broadcast operations may occur during the matching process.
  • the shapes of multiple operands after matching are the same.
  • one or more addressing engines may be used to address tensor data stored in the multiple vector registers.
  • the engine unit includes a plurality of addressing engines, one addressing engine corresponding to one vector register, and each addressing engine is used to obtain the source operands required for the vector operation instruction from the corresponding original tensor data according to its own tensor addressing mode.
  • Taking three addressing engines as an example, respectively denoted addressing engine 1, addressing engine 2, and addressing engine 3, each can be used for addressing.
  • addressing engine 1 corresponds to vector register 1, which is used to obtain the source operand A required for the vector operation instruction from the original tensor data where the source operand A is located according to the tensor addressing mode of the addressing engine 1;
  • addressing engine 2 corresponds to vector register 2, which is used to obtain the source operand B required for the vector operation instruction from the original tensor data where the source operand B is located according to the tensor addressing mode of the addressing engine 2;
  • addressing engine 3 corresponds to vector register 3, which is used to obtain the source operand C required for the vector operation instruction from the original tensor data where the source operand C is located according to the tensor addressing mode of the addressing engine 3.
  • the same addressing engine can also obtain multiple source operands required for vector operation instructions from multiple original tensor data.
  • addressing engine 1 can obtain the source operand A, source operand B and source operand C in the aforementioned example.
  • each addressing engine such as the above-mentioned addressing engine 1, addressing engine 2, and addressing engine 3 are independent and do not interfere with each other.
  • each addressing engine uses an independent tensor addressing mode for independent addressing, that is, each source operand of the vector operation instruction corresponds to an independent tensor addressing mode.
  • the above-mentioned source operand A corresponds to tensor addressing mode 1
  • source operand B corresponds to tensor addressing mode 2
  • source operand C corresponds to tensor addressing mode 3.
  • The types of parameters contained in tensor addressing mode 1, tensor addressing mode 2, and tensor addressing mode 3 can be the same; for example, they can all contain the above-mentioned five parameters (start: stop: step: size: partial), although the specific parameter values may differ.
  • all addressing engines may adopt the tensor addressing mode provided by this application (which can be regarded as a new slice addressing mode) for addressing, or some of the addressing engines may adopt the tensor addressing mode provided by this application for addressing, and the remaining addressing engines may adopt the addressing mode of the related technology (index addressing mode, numpy's basic slice addressing mode) for addressing.
  • the aforementioned tensor addressing mode 1, tensor addressing mode 2, and tensor addressing mode 3 may be different types of addressing modes in addition to being the same type of addressing mode (such as a new slice addressing mode).
  • tensor addressing mode 1 corresponds to the tensor addressing mode (new slice addressing mode) in this application
  • tensor addressing mode 2 adopts a 3-parameter slice addressing mode in the related field
  • tensor addressing mode 3 adopts an index addressing mode in the related field. That is, compared with addressing modes that all adopt related technologies, the addressing efficiency can also be improved to a certain extent and the number of instructions required for addressing can be reduced.
  • each addressing engine may be configured to support the capability of processing broadcast operations in addition to its own independent addressing mode.
  • the engine unit also includes: a main engine, which is connected to each addressing engine respectively, and the main engine is used to send control commands to each addressing engine, control each addressing engine to address according to the tensor addressing mode, and obtain the source operands required for the vector operation instructions from the corresponding original tensor data.
  • a main engine By introducing a main engine to centrally control these independent addressing engines, the addressing efficiency can be improved.
  • the main engine sends control commands to the addressing engine of the operand at each step, so that the corresponding addressing engine traverses each dimension on the tensor data in order from low dimension to high dimension, and addresses the corresponding dimension.
  • The control commands include the Advance command, the Reset command, and the NOP command. In each step, the main engine sends the addressing engine of each operand a combined control command composed of at least one of these three commands, thereby controlling each addressing engine to address the different dimensions of the original tensor data according to its own tensor addressing mode.
  • the Advance command means advancing (moving forward) by one step of the configured step size in the current dimension.
  • the Reset command means that the addressing pointer has reached the end of the current dimension and addressing must restart from the beginning of that dimension.
  • the NOP command indicates that no action is taken on this dimension.
  • Step 1 W cmd: NOP; H cmd: NOP, where cmd represents a command. The status at this time is shown in (1) of Figure 5.
  • Step 2 W cmd: Advance; H cmd: NOP. The status at this time is shown in (2) in Figure 5.
  • Step 3 W cmd: Advance; H cmd: NOP. The status at this time is shown in (3) of Figure 5.
  • Step 4 W cmd: Reset; H cmd: Advance. The status at this time is shown in (4) of Figure 5.
  • Step 5 W cmd: Advance; H cmd: NOP. The status at this time is shown in (5) of Figure 5.
  • Step 6 W cmd: Advance; H cmd: NOP. The status at this time is shown in (6) of Figure 5.
  • Step 7 W cmd: Reset; H cmd: Advance. The status at this time is shown in (7) of Figure 5.
  • Step 8 W cmd: Advance; H cmd: NOP. The status at this time is shown in (8) in Figure 5.
  • Step 9 W cmd: Advance; H cmd: NOP. The status at this time is shown in (9) in Figure 5.
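The nine-step sequence above can be reproduced with a small software sketch of how a main engine might drive the W and H commands for a low-to-high dimension traversal (an illustrative model; the command names simply mirror the description above):

```python
def command_sequence(w_steps, h_steps):
    """Per step, emit (W cmd, H cmd) for a W-major traversal of a W*H tensor."""
    cmds = []
    for h in range(h_steps):
        for w in range(w_steps):
            if h == 0 and w == 0:
                cmds.append(("NOP", "NOP"))        # first step: start position
            elif w == 0:
                cmds.append(("Reset", "Advance"))  # W wraps around, H advances
            else:
                cmds.append(("Advance", "NOP"))    # advance along W only
    return cmds

# A 3*3 traversal reproduces Steps 1 through 9 listed above:
for i, (w_cmd, h_cmd) in enumerate(command_sequence(3, 3), start=1):
    print(f"Step {i}  W cmd: {w_cmd}; H cmd: {h_cmd}")
```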
  • the addressing engine corresponding to the tensor data keeps the addressing pointer unchanged when it receives the Advance command while addressing the first dimension of the tensor data, so that the addressing engine will always read the source operand pointed to by the current addressing pointer when reading the source operand from the vector register, thereby achieving the purpose of broadcasting.
  • the main engine when the main engine sends a control command to control the addressing engine to address the first dimension of the tensor data, when the addressing engine reaches the end of the first dimension of the tensor data, the main engine will not send a Reset command, and can keep sending Advance commands so as to keep reading the source operand pointed to by the current addressing pointer until the broadcast is completed, and then send other control commands.
  • addressing engine 1 is responsible for obtaining the operands in the array "np.arange(3)”
  • addressing engine 2 is responsible for obtaining operand 5.
  • the main engine sends an Advance command to addressing engine 1 and addressing engine 2, and addressing engine 1 obtains operand 0 in the array "np.arange(3)", and addressing engine 2 obtains operand 5;
  • the main engine sends an Advance command to addressing engine 1 and addressing engine 2, and addressing engine 1 takes a step forward to obtain operand 1 in the array "np.arange(3)".
  • When addressing engine 2 receives the Advance command, it keeps the addressing pointer unchanged and still outputs operand 5. At the next moment, the main engine sends an Advance command to addressing engine 1 and addressing engine 2; addressing engine 1 takes a step forward to obtain operand 2 in the array "np.arange(3)", while addressing engine 2, on receiving the Advance command, again keeps its addressing pointer unchanged and still outputs operand 5.
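The "np.arange(3)" plus 5 broadcast described above can be sketched as follows (a hypothetical software model of the two engines; the real hardware interface may differ):

```python
# An engine holding a broadcast operand ignores Advance and keeps its
# addressing pointer fixed, so it keeps emitting the same operand.
class AddressingEngine:
    def __init__(self, data, broadcast=False):
        self.data = data
        self.broadcast = broadcast
        self.ptr = 0 if broadcast else -1

    def on_advance(self):
        if not self.broadcast:
            self.ptr += 1           # normal engine steps forward
        return self.data[self.ptr]  # broadcast engine re-reads the same element

engine1 = AddressingEngine([0, 1, 2])            # operands of np.arange(3)
engine2 = AddressingEngine([5], broadcast=True)  # scalar operand 5

# The main engine sends Advance to both engines at each of three moments:
pairs = [(engine1.on_advance(), engine2.on_advance()) for _ in range(3)]
print(pairs)   # [(0, 5), (1, 5), (2, 5)]
```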
  • the tensor addressing mode may also include a nested dual tensor addressing mode, the dual tensor addressing mode including: an outer iterative addressing mode (outer iterator) and an inner iterative addressing mode (inner iterator), wherein the inner iterative addressing mode performs addressing based on the tensor data obtained in the outer iterative addressing mode.
  • the outer iterator addressing mode (outer iterator) can be the aforementioned new slice addressing mode using the parameters of start: stop: step: size: partial.
  • the tensor data of a local area (for example, the shape defined by these parameters) can be quickly selected, providing the data basis for the inner iterative addressing mode.
  • the nested dual tensor addressing mode is used for addressing. Compared with other addressing modes, the addressing efficiency is higher: a single instruction can implement addressing that would otherwise require two single-layer instructions.
  • the inner iterative addressing mode performs addressing based on the tensor data obtained by the outer iterative addressing mode. Compared with addressing using a single-layer tensor addressing mode, the data reading and writing of the first layer (the outer iterative addressing mode) is reduced.
  • the expressions of the outer iterative addressing mode and the inner iterative addressing mode may each be [start: stop: step: size] or [start: stop: step: size: partial], and the parameter values in the two expressions may differ.
  • the traversed part in FIG7 can be regarded as a feature map of size 8*8.
  • Taking the inner iterative addressing mode with parameter values 0:5:2:1 as an example, its addressing principle is shown in Figure 8.
  • With the four-parameter addressing mode 0:8:2:5, the tensor elements contained in the shapes (1), (2), (3), and (4) in Figure 7 can be obtained.
  • the inner iterative addressing mode then performs addressing based on the tensor data obtained by the outer iterative addressing mode; that is, the multiple tensor elements obtained in (1), (2), (3), and (4) in Figure 7 are each addressed using the 0:5:2:1 mode.
  • the outer iteration addressing mode is used to select at least one candidate region from the tensor data, and the candidate region includes multiple tensor elements, such as selecting candidate regions (1), (2), (3), (4) and the like in FIG7;
  • the inner iteration addressing mode is used to obtain the source operand required for the vector operation instruction from at least one candidate region.
  • the inner iteration addressing mode performs addressing based on the data of (1), (2), (3), (4) in FIG7 to obtain the source operand required for the vector operation instruction.
  • the inner iterative addressing mode does not necessarily wait until the outer iterative addressing mode has completed all addressing before addressing. For example, when the outer iterative addressing mode addresses the 5*5 elements contained in (1) in FIG. 7, the inner iterative addressing mode can address on this basis. After the inner iterative addressing mode completes addressing of the data contained in (1) in FIG. 7, the outer iterative addressing mode can continue to address. When the outer iterative addressing mode addresses the 5*5 elements contained in (2) in FIG. 7, the inner iterative addressing mode performs traversal addressing on this basis. After the inner iterative addressing mode completes addressing of the data contained in (2) in FIG.
  • the outer iterative addressing mode continues, and so on, until all addressing in the 8*8 feature map is completed. In this way, the data addressed by the outer iterative addressing mode is only an intermediate result and does not need to be read or written. Compared with addressing using a single-layer tensor addressing mode, the data reading and writing of the first layer (the outer iterative addressing mode) is reduced.
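The nested traversal described above can be sketched in software: the outer iterator [0:8:2:5] selects 5*5 candidate regions from the 8*8 feature map, and the inner iterator [0:5:2:1] addresses single elements inside each region (illustrative helper names, not the hardware interface; incomplete trailing shapes are dropped here, matching the four complete regions of Figure 7):

```python
def slice_starts(start, stop, step, size):
    """Start offsets of each complete size-wide window along one axis."""
    return [p for p in range(start, stop, step) if p + size <= stop]

fmap = [[r * 8 + c for c in range(8)] for r in range(8)]  # the 8*8 feature map

outer = slice_starts(0, 8, 2, 5)   # [0, 2] per axis -> four 5*5 regions
inner = slice_starts(0, 5, 2, 1)   # [0, 2, 4] per axis -> 9 elements per region

operands = []
for oy in outer:                   # outer iterator: pick a candidate region
    for ox in outer:
        for iy in inner:           # inner iterator: address within that region
            for ix in inner:
                operands.append(fmap[oy + iy][ox + ix])

print(len(operands))               # 4 regions * 9 elements = 36 source operands
```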
  • the outer iterative addressing mode and the inner iterative addressing mode can adopt not only the addressing mode shown in this application (the new slice addressing mode) but also other addressing modes, such as the existing slice addressing mode or index addressing mode, so there can be many combinations. Some exemplary combinations are shown in Table 1 below.
  • With the new slice addressing mode, each time the pointer moves one step it can obtain the 5*5 tensor elements contained in one of the shapes shown in (1), (2), (3), and (4) in Figure 7.
  • If the outer iterative addressing mode instead adopts the slice addressing mode or index addressing mode of the related art, only one element point can be obtained per sliding step, so multiple addressings are required to obtain the tensor elements contained in the shapes shown in (1), (2), (3), and (4) in Figure 7.
  • This ensures that, whichever addressing mode the outer iterative addressing mode adopts, the data obtained is consistent with that obtained using the new slice addressing mode, thereby providing the data basis for the inner iterative addressing mode. The difference is that the new slice addressing mode obtains all elements of one shape per sliding step, whereas the addressing modes of the related art require multiple addressings (sliding multiple steps) to obtain the elements that the new slice addressing mode obtains in a single step.
  • the embodiment of the first aspect of the present application further provides a tensor processing method, as shown in Figure 9.
  • the principle of the tensor processing method is described below in conjunction with Figure 9. The method can be applied to the aforementioned AI chip and the aforementioned electronic device.
  • S1: Obtain the source operands required for the vector operation instruction from tensor data according to a tensor addressing mode, wherein the number of source operands obtained by one addressing in the tensor addressing mode is greater than or equal to 1.
  • the tensor addressing mode in the embodiment of the first aspect of the present application improves and expands the slice addressing mode of the related art to include more parameters.
  • the parameters of the tensor addressing mode include: the starting address representing the starting point of addressing (such as represented by start), the end address representing the end point of addressing (such as represented by stop), the step length representing the offset amplitude of the addressing pointer (such as represented by step), and the size of the data shape obtained by addressing (such as represented by size). Then the expression of the tensor addressing mode is [start: stop: step: size].
  • size is used to describe the size of the data shape obtained by addressing, that is, all element points contained in a shape will be extracted at each step, instead of just one point.
  • the address of the tensor element corresponding to at least one source operand can be determined by one addressing, which can greatly improve the efficiency of calculating the address of the tensor element corresponding to the source operand, so that only fewer instructions are needed to find the required source operand, which greatly reduces redundant instructions, improves the effective instruction density, improves performance, and simplifies programming.
  • the engine unit in the above-mentioned AI chip may obtain the source operand required for the vector operation instruction from the original tensor data stored in the external memory according to the tensor addressing mode.
  • the engine unit After the engine unit obtains the source operand according to the tensor addressing mode, it can store the obtained source operand in the vector register, wherein different source operands can be stored in different vector registers, so that when the operands are subsequently read for calculation, they can be read in parallel, thereby improving efficiency.
  • Then, the obtained source operands can be operated on according to the vector operation instruction to obtain the operation result.
  • the vector operation unit in the above AI chip can obtain the source operands required by the vector operation instruction from the vector register to perform the operation and obtain the operation result.
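Putting the two steps together, a minimal end-to-end software sketch of the method (the helper `gather` and the operand data are hypothetical, chosen only to show the flow):

```python
def gather(tensor, start, stop, step, size):
    """First step: one pass of the addressing mode yields size operands per step."""
    out = []
    for pos in range(start, stop, step):
        out.extend(tensor[pos:min(pos + size, stop)])
    return out

data_a = [float(i) for i in range(8)]        # original tensor data of operand A
data_b = [float(i) * 10 for i in range(8)]   # original tensor data of operand B

a = gather(data_a, 0, 8, 4, 2)   # A addressed by its own tensor addressing mode
b = gather(data_b, 0, 8, 4, 2)   # B addressed independently

# Second step: run the vector operation (here a multiply-accumulate style
# A*B + 1) over all gathered source operands in one pass.
result = [x * y + 1.0 for x, y in zip(a, b)]
print(result)   # [1.0, 11.0, 161.0, 251.0]
```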
  • efficient addressing can be performed in various links of tensor processing.
  • the address calculation of the tensor elements will be involved. This is because when reading the tensor to be calculated, it is necessary to read the data according to the address of the tensor element.
  • the tensor processing method provided in the embodiment of the first aspect of the present application has the same implementation principle and technical effects as those of the aforementioned AI chip embodiment.
  • the inventors also noticed that in the step of performing vector operations, since ineffective calculations will cause additional costs in the system, there is a need for improved AI chips and tensor processing methods.
  • Figure 10 shows the principle of an AI chip according to an embodiment of the second aspect of the present application.
  • the AI chip may include a vector register, an engine unit, and a vector operation unit.
  • the vector register is connected to the vector operation unit through the engine unit.
  • the vector register is used to store the tensor data required for the operation, wherein the tensor data is a set of multidimensional data, which is usually expanded in some way (such as linear layout, tiled layout, etc.) and stored in a device with storage function such as memory, on-chip memory, register, etc.
  • the tensor data stored in the vector register can be moved from a memory outside the AI chip.
  • the engine unit is connected to the vector register, and the engine unit is used to obtain the source operands required for the vector operation instruction from the tensor data according to the tensor addressing mode, wherein the number of source operands obtained by the engine unit at one time according to the tensor addressing mode is greater than or equal to 1.
  • the vector operation unit is connected to the engine unit, and the vector operation unit is used to operate on the source operand obtained by the engine unit according to the vector operation instruction to obtain the operation result.
  • the vector operation instruction refers to an instruction that can calculate more than two operands at the same time.
  • the type of vector operation instruction can be various operation types such as addition, subtraction, multiplication, multiplication and accumulation.
  • the AI chip according to the above exemplary embodiment can be a chip such as an integrated circuit chip, which has data processing capabilities and can be used to process operations between tensors.
  • the specific examples of the above AI chip can refer to the description of the specific examples of the AI chip of the first aspect of this application.
  • the number of vector registers in the AI chip may be multiple, and different vector registers may store different tensor data.
  • Taking A*B+C as an example, A, B, and C are all source operands, where each of A, B, and C can be a single (tensor) element or an array containing multiple elements.
  • the number of vector registers may include 3, for example, vector register 1, vector register 2, and vector register 3.
  • Vector register 1 can be used to store the tensor data where the source operand A is located
  • vector register 2 can be used to store the tensor data where the source operand B is located
  • vector register 3 can be used to store the tensor data where the source operand C is located.
  • each operand corresponds to an addressing mode, and the addressing modes of multiple operands can match each other.
  • a broadcast operation may be required, and the shapes of the multiple operands after matching are the same.
  • one or more addressing engines may be used to address tensor data stored in the multiple vector registers.
  • the engine unit includes: a plurality of addressing engines, one addressing engine corresponds to one vector register, wherein each addressing engine is used to obtain the source operand required for the vector operation instruction from the corresponding vector register according to the respective tensor addressing mode.
  • the three addressing engines can be used to address the tensor data stored in the three vector registers respectively.
  • addressing engine 1 corresponds to vector register 1, and is used to obtain the source operand required for the vector operation instruction from the corresponding vector register 1 according to the tensor addressing mode of the addressing engine 1;
  • addressing engine 2 corresponds to vector register 2, and is used to obtain the source operand required for the vector operation instruction from the corresponding vector register 2 according to the tensor addressing mode of the addressing engine 2;
  • addressing engine 3 corresponds to vector register 3, and is used to obtain the source operand required for the vector operation instruction from the corresponding vector register 3 according to the tensor addressing mode of the addressing engine 3.
  • the tensor addressing modes used by each addressing engine can be the same or different.
  • the same addressing engine may obtain the source operands required for the vector operation instruction from multiple vector registers. For example, addressing engine 1 obtains source operand A required for the vector operation instruction from vector register 1, addressing engine 1 obtains source operand B required for the vector operation instruction from vector register 2, and addressing engine 1 obtains source operand C required for the vector operation instruction from vector register 3.
  • the addressing engines, such as the above-mentioned addressing engine 1, addressing engine 2, and addressing engine 3, are independent and do not interfere with each other.
  • each addressing engine uses an independent tensor addressing mode for independent addressing, that is, each source operand in the vector operation instruction corresponds to an independent tensor addressing mode.
  • the above-mentioned source operand A corresponds to tensor addressing mode 1
  • source operand B corresponds to tensor addressing mode 2
  • source operand C corresponds to tensor addressing mode 3.
  • Tensor addressing mode 1, tensor addressing mode 2, and tensor addressing mode 3 may contain the same types of parameters; for example, they may all contain the above five types of parameters (start: stop: step: size: partial). They may differ in the specific values of those parameters.
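A minimal sketch of per-operand addressing modes with the five parameters named above. The 1-D window semantics (a pointer stepping from start toward stop, extracting size consecutive elements per step, with partial deciding whether a clipped trailing window is kept) are an assumption for illustration, not the patent's definition:

```python
from dataclasses import dataclass

@dataclass
class AddressingMode:
    # The five parameter types named in the text: start:stop:step:size:partial.
    start: int
    stop: int
    step: int
    size: int
    partial: bool = False  # keep an incomplete (clipped) trailing shape?

    def windows(self):
        """Return the index blocks produced by each addressing step."""
        out = []
        for p in range(self.start, self.stop, self.step):
            w = list(range(p, min(p + self.size, self.stop)))
            if len(w) == self.size or self.partial:
                out.append(w)
        return out

# Independent modes for the three source operands (same parameter types,
# different values — the values here are illustrative)
mode_a = AddressingMode(start=0, stop=8, step=2, size=2)
mode_b = AddressingMode(start=0, stop=8, step=2, size=2)
mode_c = AddressingMode(start=1, stop=8, step=2, size=2)
```

With these values, mode_a yields four complete windows, while mode_c drops its clipped final window because partial is False.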
  • each addressing engine may be configured to support the capability of processing broadcast operations in addition to its own independent addressing mode.
  • the engine unit of the AI chip may also include: a main engine, the main engine is connected to each addressing engine, and the main engine is used to send control commands to each addressing engine to control each addressing engine to address according to the tensor addressing mode, and obtain the source operand required for the vector operation instruction from the corresponding vector register.
  • the main engine can send a control command to the addressing engine of the operand at each step, so that the addressing engine can traverse each dimension of the tensor data in order from low dimension to high dimension according to the control command, and address the corresponding dimension.
  • control command may include an Advance command, a Reset command, and a NOP command, which will not be repeated here.
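The Advance, Reset, and NOP commands above can be sketched as simple pointer updates that the main engine combines per dimension; the pointer arithmetic and the low-to-high traversal shown here are an illustrative assumption:

```python
# Sketch of the three control commands named in the text. The main engine
# combines one command per dimension (e.g. "W cmd: Reset; H cmd: Advance").
def apply_command(pointer, cmd, start, step):
    if cmd == "Advance":
        return pointer + step   # move the addressing pointer forward
    if cmd == "Reset":
        return start            # return the pointer to the start
    if cmd == "NOP":
        return pointer          # no operation: pointer unchanged
    raise ValueError(cmd)

# Traverse a 2-D tensor low dimension (W) first, then high dimension (H):
w = h = 0
trace = []
for _ in range(3):                           # three W steps
    trace.append((h, w))
    w = apply_command(w, "Advance", 0, 1)    # "W cmd: Advance; H cmd: NOP"
# carry into the next row: "W cmd: Reset; H cmd: Advance"
w = apply_command(w, "Reset", 0, 1)
h = apply_command(h, "Advance", 0, 1)
trace.append((h, w))
```

The trace walks three positions along W and then wraps to the next H row, which is the "low dimension to high dimension" traversal order described above.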
  • the tensor addressing mode of the present application may also include a nested dual tensor addressing mode, which consists of an outer iterative addressing mode (outer iterator) and an inner iterative addressing mode (inner iterator), where the inner iterative addressing mode performs addressing on the tensor data obtained by the outer iterative addressing mode. That is, the AI chip provided by the embodiment of the second aspect of the present application supports not only tensor addressing in the aforementioned new slice addressing mode but also tensor addressing in the dual tensor addressing mode.
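The nesting can be sketched as two iterators, where the outer one selects a candidate region and the inner one addresses inside it; the region sizes and the drop-partial policy below are illustrative assumptions:

```python
import numpy as np

# Sketch of the nested dual tensor addressing mode: the outer iterator
# picks a candidate region, and the inner iterator addresses within it.
# Parameter values are illustrative, not taken from the patent.
data = np.arange(16).reshape(4, 4)

def iterate(start, stop, step, size):
    """Yield (begin, end) ranges of one addressing dimension."""
    for p in range(start, stop, step):
        if p + size <= stop:          # drop partial windows
            yield p, p + size

operands = []
for r0, r1 in iterate(0, 4, 2, 2):        # outer: 2-row bands
    band = data[r0:r1]                    # intermediate result only;
                                          # not read/written externally
    for c0, c1 in iterate(0, 4, 2, 2):    # inner: address within the band
        operands.append(band[:, c0:c1])
```

Note that `band` is only an intermediate result held between the two iterators, matching the point that the outer mode's data needs no extra reads and writes.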
  • the embodiment of the second aspect of the present application further provides a tensor processing method, as shown in Figure 12.
  • the principle of the tensor processing method is described below in conjunction with Figure 12. The method can be applied to the aforementioned AI chip and the aforementioned electronic device.
  • S10: Obtain the source operands required for the vector operation instruction from the tensor data according to the tensor addressing mode, wherein the number of source operands obtained in one addressing by the tensor addressing mode is greater than or equal to 1.
  • the tensor addressing mode in the embodiment of the second aspect of the present application improves and expands the slice addressing mode of the related art to include more parameters.
  • the parameters of the tensor addressing mode include: the starting address representing the starting point of addressing (such as represented by start), the end address representing the end point of addressing (such as represented by stop), the step length representing the offset amplitude of the addressing pointer (such as represented by step), and the size of the data shape obtained by addressing (such as represented by size).
  • the expression of the tensor addressing mode can be written as [start: stop: step: size].
  • size is used to describe the size of the data shape obtained by addressing, that is, all element points contained in a shape will be extracted at each step, not just one point.
  • one or more tensor elements may be determined in a single step, and the addresses of the tensor elements corresponding to at least one source operand can be determined in one addressing. This can greatly improve the efficiency of calculating the addresses of the tensor elements corresponding to the source operands, so that fewer instructions are needed to find the required source operands, which greatly reduces redundant instructions, improves effective instruction density, improves performance, and simplifies programming.
  • the engine unit in the above-mentioned AI chip may obtain the source operand required for the vector operation instruction from the tensor data stored in the vector register according to the tensor addressing mode.
  • the obtained source operands can be operated according to the vector operation instruction to obtain the operation result.
  • the vector operation unit in the above AI chip can operate on the obtained source operands according to the vector operation instruction to obtain the operation result.
  • the tensor processing method provided in the embodiment of the second aspect of the present application has the same implementation principle and technical effects as those of the aforementioned AI chip embodiment.
  • efficient addressing can be performed in each stage of tensor processing in which the address calculation of tensor elements is involved.
  • the present application also provides an electronic device.
  • the electronic device may include: a memory and an AI chip as described in the first aspect of the present application.
  • the AI chip can be connected to a memory to obtain source operands required for vector operation instructions from the original tensor data stored in the memory according to the tensor addressing mode, and perform operations on the obtained source operands according to the vector operation instructions to obtain operation results.
  • the electronic device may include: a memory and an AI chip as described in the second aspect of the present application.
  • the AI chip can be connected to a memory to write tensor data stored in the memory into a vector register in the AI chip.
  • a memory is used to store the tensor data required for operation, and can be any of various common memories, such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), and electrically erasable programmable read-only memory (EEPROM).
  • the random access memory can be a static random access memory (Static Random Access Memory, SRAM) or a dynamic random access memory (Dynamic Random Access Memory, DRAM).
  • the memory can be a single data rate (Single Data Rate, SDR) memory or a double data rate (Double Data Rate, DDR) memory.
  • the engine unit obtains the operand A of the vector operation instruction from the original tensor data where operand A is located according to the tensor addressing mode, and writes it into the vector register 1.
  • the engine unit obtains the operand B of the vector operation instruction from the original tensor data where operand B is located according to the tensor addressing mode, and writes it into the vector register 2.
  • the engine unit obtains the operand C of the vector operation instruction from the original tensor data where operand C is located according to the tensor addressing mode, and writes it into the vector register 3.
  • the vector operation unit directly obtains the source operand A corresponding to the vector operation instruction from the vector register 1, obtains the source operand B corresponding to the vector operation instruction from the vector register 2, and obtains the source operand C corresponding to the vector operation instruction from the vector register 3 to perform the operation to obtain the operation result.
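The separation of addressing and operation described above can be sketched as follows: the engine unit fills the vector registers once, and the vector operation unit then reads from them directly, so a retried operation needs no re-addressing. All names and the gather/compute logic are illustrative assumptions:

```python
# Sketch of the addressing/operation split: registers 1-3 hold the
# pre-fetched operands A, B, C; the operation reads registers only.
vector_registers = {}

def engine_fetch(reg_id, source_tensor, indices):
    """'Addressing': gather the operand and park it in a vector register."""
    vector_registers[reg_id] = [source_tensor[i] for i in indices]

def vector_op():
    """The operation reads registers directly — no address calculation."""
    a = vector_registers[1]   # source operand A
    b = vector_registers[2]   # source operand B
    c = vector_registers[3]   # source operand C
    return [x * y + z for x, y, z in zip(a, b, c)]

engine_fetch(1, [10, 20, 30, 40], [0, 1, 2])
engine_fetch(2, [1, 2, 3, 4],     [0, 1, 2])
engine_fetch(3, [5, 5, 5, 5],     [0, 1, 2])

first = vector_op()
retry = vector_op()   # registers still hold the operands: no re-addressing
```

Because the registers retain the operands, `retry` reproduces `first` without repeating the addressing step, which is the efficiency benefit the text claims for an operation that must be rerun after an error.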
  • the engine unit will generate 3*3*4 addresses, and then read the data at these addresses to obtain 3*3*4 data (i.e., operands), and send them to the vector register for storage.
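One possible reading of the 3*3*4 figure is four addressing steps (range(start=0, stop=4, step=1)), each extracting a complete 3x3 shape (size=3 per axis), giving 3*3*4 = 36 addresses. This interpretation, and the block anchoring used below, are assumptions for illustration:

```python
import numpy as np

# Hypothetical 8x8 tensor large enough to hold every 3x3 block.
data = np.arange(8 * 8).reshape(8, 8)

start, stop, step, size = 0, 4, 1, 3   # the 0:4:1:3 addressing mode
addresses = []
for p in range(start, stop, step):     # four addressing steps
    for r in range(size):              # each step extracts a full 3x3
        for c in range(size):          # shape, not just one point
            addresses.append((p + r, p + c))

operands = [data[r, c] for r, c in addresses]  # read the addressed data
```

The engine would then send these 36 operands to the vector register for storage, as the text describes.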
  • the principle of the engine unit addressing according to the 0:4:1:3 addressing mode is shown in Figure 14.
  • the original tensor data of all three operands may first be written into the memory, then addressed in turn according to the tensor addressing mode, with the addressed data written into the vector registers; alternatively, the original tensor data of one operand may be written into the memory, addressed according to the tensor addressing mode, and the addressed operand written into its vector register, with this process repeated for each remaining operand.
  • the order in which the original tensor data of each operand is written into the memory and the order in which the vector register is written from the memory are not limited.
  • the original tensor data containing different operands may be addressed by the same addressing engine, or the original tensor data containing different operands may be addressed by different addressing engines.
  • the original tensor data of operand A in the memory is written into vector register 1
  • the original tensor data of operand B in the memory is written into vector register 2
  • the original tensor data of operand C in the memory is written into vector register 3.
  • the engine unit needs to address the tensor data according to the tensor addressing mode.
  • the principle of the engine unit addressing according to the 0:4:1:3 addressing mode is shown in Figure 14.
  • the original tensor data of all three operands may first be written into the memory, and then the original tensor data of each operand written into the vector registers in sequence; alternatively, the original tensor data of one operand may be written into the memory and then into its vector register, with this process repeated for each remaining operand.
  • the order in which the original tensor data of each operand is written into the memory and the order in which the original tensor data of the operand is written into the vector register are not limited.
  • the above-mentioned electronic devices can be but are not limited to mobile phones, tablets, computers, servers, vehicle-mounted devices, wearable devices, edge boxes and other electronic devices.
  • the present application provides an AI chip, an electronic device and a tensor processing method, which belongs to the field of computer technology.
  • the AI chip includes: a vector register, an engine unit, and a vector operation unit; the engine unit is used to obtain the source operands required for the vector operation instruction from the original tensor data according to the tensor addressing mode; the vector register is used to store the source operands obtained by the engine unit; the vector operation unit is used to obtain the source operands of the vector operation instruction from the vector register for operation to obtain the operation result.
  • the vector register is used to store the source operands obtained by the engine unit, so that the vector operation unit can directly obtain the source operands required for the vector operation instruction from the vector register for operation.
  • This design separates addressing from operation and supports fetching the source operands required for the vector operation instruction in advance; during the subsequent operation they can be read directly, which improves operation efficiency.
  • the present application also provides another AI chip, electronic device and tensor processing method.
  • the AI chip includes: a vector register, an engine unit, and a vector operation unit; the vector register is used to store the tensor data required for the operation; the engine unit is connected to the vector register, and the engine unit is used to obtain the source operands required for the vector operation instruction from the tensor data stored in the vector register according to the tensor addressing mode, and the number of source operands obtained by the engine unit at one time according to the tensor addressing mode is greater than or equal to 1; the vector operation unit is connected to the engine unit, and the vector operation unit is used to operate on the source operands obtained by the engine unit according to the vector operation instruction to obtain the operation result.
  • the tensor addressing mode is used for addressing, so that at least one source operand can be obtained by one addressing, which greatly improves the efficiency of obtaining the source operand, so that only fewer instructions are needed to find the required source operand.
  • the AI chip, electronic device, and tensor processing method of the present application are reproducible and can be used in a variety of industrial applications.
  • the AI chip, electronic device, and tensor processing method of the present application can be used in the field of computer technology.

Landscapes

  • Executing Machine-Instructions (AREA)

Abstract

An AI chip, an electronic device, and a tensor processing method, relating to the technical field of computers. The AI chip comprises: a vector register, an engine unit, and a vector operation unit. The engine unit is used for acquiring, according to a tensor addressing mode, a source operand required for a vector operation instruction from original tensor data; the vector register is used for storing the source operand acquired by the engine unit; and the vector operation unit is used for acquiring the source operand for the vector operation instruction from the vector register to perform operation to obtain an operation result. The vector register is used to store the source operand acquired by the engine unit, so that the vector operation unit can directly acquire the source operand required for the vector operation instruction from the vector register to perform operation; such a design method can achieve the separation of addressing and operation, and can support acquiring the source operand required for the vector operation instruction in advance and directly reading the source operand during subsequent operations, thereby facilitating improvement of the operation efficiency. Also provided are another AI chip, electronic device, and tensor processing method. The AI chip comprises: a vector register, an engine unit, and a vector operation unit. 
The vector register is used for storing tensor data required for operation; the engine unit is connected to the vector register, and the engine unit is used for acquiring, according to a tensor addressing mode, a source operand required for a vector operation instruction from the tensor data stored in the vector register, wherein the number of source operands acquired by the engine unit in one addressing according to the tensor addressing mode is greater than or equal to 1; the vector operation unit is connected to the engine unit, and the vector operation unit is used for performing, according to the vector operation instruction, operation on the source operand acquired by the engine unit to obtain an operation result.

Description

AI chip, tensor processing method and electronic device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application with application number 202211597988.5, entitled "AI chip, electronic device and tensor processing method", filed with the State Intellectual Property Office of China on December 14, 2022, and to the Chinese patent application with application number 202211597989.X, entitled "AI chip, tensor processing method and electronic device", filed with the State Intellectual Property Office of China on December 14, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

The present application belongs to the field of computer technology, and specifically relates to an AI chip, an electronic device, and a tensor processing method.
Background

With the development of artificial intelligence (AI), neural networks have become one of the most popular AI technologies today. Neural networks often involve calculations between tensors. The architecture of a current AI chip used for such calculations is shown in Figure 1A. When performing calculations between tensors, the general steps are as follows: Step 1, the processing unit (processing core) calculates the addresses of the tensor elements corresponding to one or more source operands; Step 2, the one or more source operands are read from the memory into the vector operation unit according to those addresses; Step 3, the vector operation unit calculates the result.

Some problems remain when calculating the addresses of the tensor elements corresponding to the source operands. On the one hand, the AI chip supports only real-time addressing: after the address of a tensor element is calculated, the one or more source operands must be read from the memory into the vector operation unit immediately according to that address, so if an error occurs during the operation, re-addressing is required, resulting in low operation efficiency and high overhead. On the other hand, among the steps for performing calculations between tensors in the architecture shown in Figure 1A, only Step 3 is effective computation; the other steps incur additional cost. The address calculation of the tensor elements in Step 1 is expensive, and this non-effective computation accounts for a large proportion of currently common systems.
Summary

In view of this, the first aspect of the present application provides an AI chip and a tensor processing method, to address the problem in the related art that an AI chip supports only real-time addressing, so that if an error occurs during an operation, re-addressing is required, resulting in low operation efficiency and high overhead.

According to an embodiment of the first aspect of the present application, an AI chip is provided. The AI chip includes a vector register, an engine unit, and a vector operation unit. The engine unit is used to obtain the source operands required by a vector operation instruction from original tensor data according to a tensor addressing mode, where the number of source operands obtained in one addressing by the tensor addressing mode is greater than or equal to 1. The vector register is connected to the engine unit and is used to store the source operands obtained by the engine unit. The vector operation unit is connected to the vector register and is used to obtain the source operands of the vector operation instruction from the vector register and perform an operation to obtain an operation result. One or more tensor elements may be determined by the engine unit in one addressing using the tensor addressing mode.

In the embodiment of the first aspect of the present application, the engine unit uses the tensor addressing mode to obtain the source operands required by the vector operation instruction from the tensor data, so that at least one source operand can be obtained in one addressing. This greatly improves the efficiency of obtaining source operands, so that fewer instructions are needed to find the required source operands. At the same time, the vector register stores the source operands obtained by the engine unit, so that the vector operation unit can obtain them directly from the vector register for operation. This design separates addressing from operation and supports fetching the source operands required by the vector operation instruction in advance; during the subsequent operation they can be read directly. Even if an error occurs during the operation, re-addressing is not needed: the corresponding source operands are simply obtained from the vector register again, which improves operation efficiency.

In an optional implementation combined with the first aspect of the present application, the parameters of the tensor addressing mode include: a start address representing the starting point of addressing, an end address representing the end point of addressing, a step representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.

In the embodiment of the first aspect of the present application, the slice addressing mode of the related art is improved and extended: the parameters of the tensor addressing mode (which can also be regarded as a new slice addressing mode) are extended from the three parameters of the original slice addressing mode to more parameters (including size information characterizing the shape of the data obtained by addressing), which improves addressing efficiency and reduces redundant instructions.

In an optional implementation combined with the first aspect of the present application, the parameters of the tensor addressing mode further include a characteristic parameter characterizing whether data whose addressed shape is incomplete is retained.

In the embodiment of the first aspect of the present application, by introducing a characteristic parameter (for example, denoted partial) characterizing whether an incomplete addressed shape is retained, the value of the characteristic parameter can be configured to flexibly decide whether to retain the data contained in the incomplete shape.

In an optional implementation combined with the first aspect of the present application, the tensor addressing mode includes a nested dual tensor addressing mode, which includes an outer iterative addressing mode and an inner iterative addressing mode, wherein the inner iterative addressing mode performs addressing on the tensor data obtained by the outer iterative addressing mode.

In the embodiment of the first aspect of the present application, the nested dual tensor addressing mode enables independent addressing by the outer and inner iterative addressing modes. Compared with a single addressing mode, addressing is more efficient: one instruction achieves the addressing of two single-layer instructions. Moreover, because the inner iterative addressing mode addresses on the basis of the tensor data obtained by the outer iterative addressing mode, the data addressed by the outer mode is only an intermediate result and does not need to be read or written, reducing the data reads and writes of the first layer (the outer iterative addressing mode) compared with single-layer tensor addressing.
In an optional implementation combined with the first aspect of the present application, the engine unit is further connected to an external memory, and the external memory is used to store the original tensor data.

In the embodiment of the first aspect of the present application, when the vector register is connected to the external memory through the engine unit, the external memory can be used directly to store the original tensor data rather than the source operands obtained by the engine unit, so that the storage space of the external memory does not need to be designed very large (since tensor operations involve a large amount of repeated data, the amount of data addressed using the tensor addressing mode may exceed that of the original tensor data).

In an optional implementation combined with the first aspect of the present application, there are multiple vector registers, and different vector registers store different source operands.

In the embodiment of the first aspect of the present application, by providing multiple vector registers, with different vector registers storing different source operands, the required operands can be read from the vector registers simultaneously during a vector operation, improving operation efficiency.

In an optional implementation combined with the first aspect of the present application, the engine unit includes multiple addressing engines, one addressing engine corresponding to one vector register, and each addressing engine is used to obtain the source operands required by the vector operation instruction from the corresponding original tensor data according to its own tensor addressing mode.

In the embodiment of the first aspect of the present application, with multiple addressing engines, each corresponding to one vector register, each addressing engine can obtain the source operands required by the vector operation instruction from the corresponding original tensor data according to its own tensor addressing mode, which improves addressing efficiency without mutual interference.

In an optional implementation combined with the first aspect of the present application, the engine unit further includes a main engine, used to send control commands to each addressing engine, controlling each addressing engine to address according to its own tensor addressing mode and obtain the source operands required by the vector operation instruction from the corresponding original tensor data.

In the embodiment of the first aspect of the present application, the main engine sends control commands to each addressing engine to control the engines to address independently; centralized control by the main engine ensures that the shapes of the source operands obtained by the addressing engines are consistent, thereby ensuring the accuracy of the operation.

In an optional implementation combined with the first aspect of the present application, the control commands include three commands: an Advance command, a Reset command, and a NOP (no-operation) command. Each time the main engine sends a control command to an addressing engine, it sends a combined control command composed of at least one of the three commands, to control each addressing engine to address different dimensions of the original tensor data according to its own tensor addressing mode.

In the embodiment of the first aspect of the present application, a combined control command composed of at least one of the Advance, Reset, and NOP commands is sent to the addressing engine, for example, "W cmd: NOP; H cmd: NOP" or "W cmd: Advance; H cmd: NOP", thereby controlling each addressing engine to address different dimensions of the original tensor data (such as width W and height H) according to its own tensor addressing mode; multiple functions can be realized simply by combining simple commands.
本申请的第一方面的实施例还提供了一种张量处理方法,该张量处理方法包括:The embodiment of the first aspect of the present application further provides a tensor processing method, the tensor processing method comprising:
根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数,其中,所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;将获取的源操作数存储至向量寄存器;从所述向量寄存器中获取所述向量运算指令的源操作数进行运算,得到运算结果。According to the tensor addressing mode, source operands required for the vector operation instruction are obtained from the original tensor data, wherein the number of source operands obtained by the tensor addressing mode at one time is greater than or equal to 1; the obtained source operands are stored in the vector register; the source operands of the vector operation instruction are obtained from the vector register for operation to obtain the operation result.
该方法可应用于前述第一方面的AI芯片和/或前述第二方面的电子设备。该方法中,采用张量寻址模式一次寻址获得的张量元素可以有一个或多个。关于该方法的原理、有益效果,可参见其他方面实施例或实施方式的相关描述。The method can be applied to the AI chip of the first aspect and/or the electronic device of the second aspect. In the method, there can be one or more tensor elements obtained by addressing at one time using the tensor addressing mode. For the principles and beneficial effects of the method, please refer to the relevant description of the embodiments or implementation methods of other aspects.
在本申请的第一方面的可选的实施方式中,根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数,包括:利用寻址引擎在主引擎发送的控制命令下,根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数。 In an optional implementation of the first aspect of the present application, the source operands required for the vector operation instructions are obtained from the original tensor data according to the tensor addressing mode, including: using the addressing engine to obtain the source operands required for the vector operation instructions from the original tensor data according to the tensor addressing mode under the control command sent by the main engine.
在本申请的第一方面的可选的实施方式中,所述张量寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。In an optional implementation of the first aspect of the present application, the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
在本申请的第一方面的可选的实施方式中,所述张量寻址模式包括嵌套的双重张量寻址模式,所述双重张量寻址模式包括:外迭代寻址模式和内迭代寻址模式;根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数,包括:利用所述外迭代寻址模式从所述原始张量数据中选定至少一个候选区域,所述候选区域包含多个张量元素;利用所述内迭代寻址模式从所述至少一个候选区域中获取向量运算指令所需的源操作数。In an optional implementation of the first aspect of the present application, the tensor addressing mode includes a nested dual tensor addressing mode, and the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode; obtaining source operands required for vector operation instructions from original tensor data according to the tensor addressing mode, including: using the external iterative addressing mode to select at least one candidate area from the original tensor data, and the candidate area contains multiple tensor elements; using the internal iterative addressing mode to obtain the source operands required for vector operation instructions from the at least one candidate area.
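The nested dual addressing above can be sketched in software, assuming one-dimensional data and (start, stop, step, size) parameters at both levels (the function and its parameter layout are illustrative assumptions, not the hardware implementation): the external iteration selects candidate regions as intermediate results, and the internal iteration extracts the source operands from each region.

```python
import numpy as np

def outer_inner_addressing(data, outer, inner):
    """outer/inner are (start, stop, step, size) tuples along one axis."""
    o_start, o_stop, o_step, o_size = outer
    i_start, i_stop, i_step, i_size = inner
    results = []
    for base in range(o_start, o_stop, o_step):
        # Candidate region selected by the external iteration (intermediate
        # result only; it need not be read out or written back).
        region = data[base:base + o_size]
        for off in range(i_start, i_stop, i_step):
            # Internal iteration addresses inside the candidate region.
            results.append(region[off:off + i_size])
    return results

data = np.arange(10)
ops = outer_inner_addressing(data, outer=(0, 8, 4, 4), inner=(0, 3, 2, 2))
# candidate regions [0..3] and [4..7]; operands [0,1], [2,3], [4,5], [6,7]
```

A single dual-mode instruction thus covers what would otherwise take two single-layer addressing passes.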
此外,为了改善相关技术的寻址方式需要大量的指令才能寻找到所需的源操作数从而导致寻址效率低的问题,根据本申请的第二方面提供了另一种AI芯片及张量处理方法。In addition, in order to improve the problem that the addressing method of the related technology requires a large number of instructions to find the required source operands, resulting in low addressing efficiency, another AI chip and tensor processing method are provided according to the second aspect of the present application.
根据本申请的第二方面的实施例提供了一种AI芯片,包括:向量寄存器、引擎单元、向量运算单元;其中,向量寄存器用于存储运算所需的张量数据;引擎单元,与所述向量寄存器连接,所述引擎单元用于根据张量寻址模式,从所述向量寄存器存储的张量数据中获取向量运算指令所需的源操作数,其中,所述引擎单元根据所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;向量运算单元,与所述引擎单元连接,所述向量运算单元用于根据所述向量运算指令对所述引擎单元获取的源操作数进行运算,得到运算结果。According to an embodiment of the second aspect of the present application, an AI chip is provided, including: a vector register, an engine unit, and a vector operation unit; wherein the vector register is used to store tensor data required for operation; an engine unit, connected to the vector register, and the engine unit is used to obtain source operands required for vector operation instructions from the tensor data stored in the vector register according to a tensor addressing mode, wherein the number of source operands obtained by the engine unit at one time according to the tensor addressing mode is greater than or equal to 1; a vector operation unit, connected to the engine unit, and the vector operation unit is used to operate on the source operands obtained by the engine unit according to the vector operation instruction to obtain an operation result.
本申请的第二方面的实施例中,引擎单元采用张量寻址模式从向量寄存器存储的张量数据中获取向量运算指令所需的源操作数,使得一次寻址可获得至少一个源操作数,其中,引擎单元采用张量寻址模式一次寻址获得的张量元素可以有一个或多个,极大提高了获取源操作数的效率,使得只需要较少的指令便可寻找到所需的源操作数;同时,将引擎单元设置于向量寄存器与向量运算单元之间,使得向量寄存器可以直接存储运算所需的原始张量数据,由于张量运算中会涉及到大量重复的数据,因此利用张量寻址模式寻址后的数据的数据量有可能会大于原始张量数据,这样的设计方式,使得向量寄存器的存储空间不用设计得很大。In an embodiment of the second aspect of the present application, the engine unit uses the tensor addressing mode to obtain the source operands required by a vector operation instruction from the tensor data stored in the vector register, so that at least one source operand can be obtained per addressing; the engine unit may obtain one or more tensor elements per addressing, which greatly improves the efficiency of obtaining source operands, so that only a few instructions are needed to find the required source operands. At the same time, placing the engine unit between the vector register and the vector operation unit allows the vector register to directly store the original tensor data required for the operation; since tensor operations involve a large amount of repeated data, the amount of data produced by tensor-mode addressing may exceed that of the original tensor data, and with this design the storage space of the vector register does not need to be very large.
可选的,所述向量寄存器的数量为多个,不同的所述向量寄存器存储不同的张量数据。该实施方式中,通过设置多个向量寄存器,不同向量寄存器存储不同的张量数据,这样在进行寻址时,可以同时对各个向量寄存器存储的张量数据进行并行寻址,可以提高寻址效率。Optionally, there are multiple vector registers, and different vector registers store different tensor data. In this implementation, by setting multiple vector registers, different vector registers store different tensor data, so that when addressing, the tensor data stored in each vector register can be addressed in parallel at the same time, which can improve the addressing efficiency.
在本申请的第二方面的可选的实施方式中,所述引擎单元包括:多个寻址引擎,一个所述寻址引擎对应一个所述向量寄存器,每个所述寻址引擎,用于根据各自的张量寻址模式从对应的向量寄存器中获取向量运算指令所需的源操作数。In an optional implementation of the second aspect of the present application, the engine unit includes: multiple addressing engines, one addressing engine corresponds to one vector register, and each addressing engine is used to obtain the source operand required for the vector operation instruction from the corresponding vector register according to the respective tensor addressing mode.
该实施方式中,通过设置多个寻址引擎,且一个寻址引擎对应一个向量寄存器,这样在寻址时,每个寻址引擎可以根据各自的张量寻址模式从对应的向量寄存器中获取向量运算指令所需的源操作数,不仅能提高寻址效率,且互不干扰。In this implementation, multiple addressing engines are set, and one addressing engine corresponds to one vector register. In this way, when addressing, each addressing engine can obtain the source operand required for the vector operation instruction from the corresponding vector register according to its own tensor addressing mode, which not only improves the addressing efficiency but also does not interfere with each other.
在本申请的第二方面的可选的实施方式中,所述引擎单元还包括:主引擎,用于向每个所述寻址引擎发送控制命令,控制每个所述寻址引擎根据各自的张量寻址模式进行寻址,从对应的向量寄存器中获取向量运算指令所需的源操作数。In an optional implementation of the second aspect of the present application, the engine unit also includes: a main engine, used to send control commands to each of the addressing engines, control each of the addressing engines to address according to their respective tensor addressing modes, and obtain the source operands required for the vector operation instructions from the corresponding vector registers.
该实施方式中,通过设置主引擎向每个寻址引擎发送控制命令,来控制各个寻址引擎独立寻址,通过主引擎来集中控制,以保证各个寻址引擎所获取的源操作数的形状(shape)一致,从而保证运算的准确性。In this implementation, the main engine is set to send control commands to each addressing engine to control each addressing engine to address independently, and the main engine is used for centralized control to ensure that the shapes of source operands obtained by each addressing engine are consistent, thereby ensuring the accuracy of the operation.
在本申请的第二方面的可选的实施方式中,若所述张量数据的第一维度需要广播,寻址引擎根据所述主引擎发送的Advance前进命令在所述张量数据的第一维度上进行寻址时,保持寻址指针不变,以持续获取当前寻址指针所指向的源操作数。In an optional implementation of the second aspect of the present application, if the first dimension of the tensor data needs to be broadcast, the addressing engine keeps the addressing pointer unchanged when addressing the first dimension of the tensor data according to the Advance command sent by the main engine, so as to continuously obtain the source operand pointed to by the current addressing pointer.
该实施方式中,若张量数据的第一维度需要广播时,寻址引擎用于根据主引擎发送的Advance前进命令在张量数据的第一维度上进行寻址时,保持寻址指针不变,以持续获取当前寻址指针所指向的源操作数,从而实现数据的重复读取,达到广播的目的,而不需要重新去寻址,可以提高寻址效率。In this implementation, if the first dimension of the tensor data needs to be broadcast, the addressing engine is used to keep the addressing pointer unchanged when addressing the first dimension of the tensor data according to the Advance command sent by the main engine, so as to continuously obtain the source operand pointed to by the current addressing pointer, thereby realizing repeated reading of data and achieving the purpose of broadcasting without the need for re-addressing, which can improve the addressing efficiency.
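A minimal sketch of the assumed broadcast semantics: when a dimension is flagged for broadcast, an Advance command leaves the addressing pointer in that dimension unchanged, so the same source operand is read repeatedly without re-addressing.

```python
# Assumed semantics of Advance on a broadcast dimension: the pointer holds
# still, so the element it points to is re-read on every access.

def advance(ptr, step, broadcast):
    return ptr if broadcast else ptr + step

row = [7, 8, 9]
ptr = 0
reads = []
for _ in range(3):
    reads.append(row[ptr])
    ptr = advance(ptr, step=1, broadcast=True)
# reads == [7, 7, 7]: the element is broadcast instead of stepping through row
```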
在本申请的第二方面的可选的实施方式中,所述张量寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。In an optional implementation of the second aspect of the present application, the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
本申请的第二方面的实施例中,通过对相关技术的切片寻址模式进行了改进与扩展,使得张量寻址模式(也可以看成是新切片寻址模式)所包含的参数,从原有切片寻址模式包含的三个参数(start,stop,step)扩展成包含更多参数,从而可以提高寻址效率,达到减少冗余指令的目的,另外,通过对扩展增加的参数进行配置可以支持更灵活更多样的寻址。In the embodiment of the second aspect of the present application, the slice addressing mode of the related technology is improved and expanded, so that the parameters included in the tensor addressing mode (which can also be regarded as a new slice addressing mode) are expanded from the three parameters (start, stop, step) included in the original slice addressing mode to include more parameters, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions. In addition, by configuring the extended parameters, more flexible and diverse addressing can be supported.

在本申请的第二方面的可选的实施方式中,所述张量寻址模式的参数还包括表征寻址所得数据形状为非完整形状的保留情况的特征参数。In an optional implementation of the second aspect of the present application, the parameters of the tensor addressing mode further include a characteristic parameter representing whether data whose addressed shape is an incomplete shape is retained.
本申请的第二方面的实施例中,可以通过引入用于表征寻址所得数据形状为非完整形状的保留情况的特征参数(如用partial表示),使得可以通过配置特征参数的值,来灵活决定是否保留非完整形状所包含的数据。In an embodiment of the second aspect of the present application, a characteristic parameter (such as represented by partial) can be introduced to characterize the retention of an incomplete shape of the data shape obtained by addressing, so that by configuring the value of the characteristic parameter, it can be flexibly determined whether to retain the data contained in the incomplete shape.
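The extended slice mode and the partial parameter can be sketched as follows (an assumed reading of the parameters, not the exact hardware semantics): each addressing step yields a window of `size` consecutive elements, and `partial` decides whether a trailing incomplete window is retained or discarded.

```python
import numpy as np

def new_slice(data, start, stop, step, size, partial=False):
    """Extended slice mode: (start, stop, step, size, partial).

    The pointer visits start, start+step, ... up to stop; each visit yields a
    window of `size` consecutive elements. An incomplete trailing window is
    kept only when partial=True.
    """
    out = []
    for p in range(start, stop, step):
        window = data[p:p + size]
        if len(window) == size or partial:
            out.append(window)
    return out

data = np.arange(6)  # [0 1 2 3 4 5]
full = new_slice(data, 0, 5, 2, 3, partial=False)
# pointers 0, 2, 4 -> windows [0,1,2], [2,3,4]; [4,5] is incomplete, dropped
kept = new_slice(data, 0, 5, 2, 3, partial=True)
# with partial=True the incomplete window [4,5] is retained as well
```

Configuring `partial` thus flexibly decides whether the data contained in an incomplete shape is kept.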
在本申请的第二方面的可选的实施方式中,所述张量寻址模式包括嵌套的双重张量寻址模式,所述双重张量寻址模式包括:外迭代寻址模式和内迭代寻址模式,其中,所述内迭代寻址模式在所述外迭代寻址模式寻址得到的张量数据上进行寻址。In an optional implementation of the second aspect of the present application, the tensor addressing mode includes a nested dual tensor addressing mode, and the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode, wherein the internal iterative addressing mode addresses the tensor data addressed by the external iterative addressing mode.
本申请的第二方面的实施例中,可采用嵌套的双重张量寻址模式进行寻址,以实现外迭代寻址模式和内迭代寻址模式的独立寻址,相比于单一的寻址模式来说,寻址效率更高,使得只需要一条指令就可实现2条单层指令的寻址,并且内迭代寻址模式是在外迭代寻址模式的基础上进行寻址的,外迭代寻址模式寻址到的数据仅是一个中间结果,并不需要读写,相比于采用单层的张量寻址模式的寻址,减少了第一层(外迭代寻址模式)的数据读写。In an embodiment of the second aspect of the present application, a nested dual tensor addressing mode can be used for addressing to achieve independent addressing of an external iterative addressing mode and an internal iterative addressing mode. Compared with a single addressing mode, the addressing efficiency is higher, so that only one instruction is needed to achieve the addressing of two single-layer instructions, and the internal iterative addressing mode is addressed on the basis of the external iterative addressing mode. The data addressed by the external iterative addressing mode is only an intermediate result and does not need to be read or written. Compared with addressing using a single-layer tensor addressing mode, the data reading and writing of the first layer (external iterative addressing mode) is reduced.
在本申请的第二方面的可选的实施方式中,所述外迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸、表征寻址所得数据形状为非完整形状的保留情况的特征参数。In an optional implementation of the second aspect of the present application, the parameters of the external iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, a size representing the shape of the data obtained by addressing, and a characteristic parameter representing the retention of an incomplete shape of the data obtained by addressing.
本申请的第二方面的实施例中,采用包含上述5参数的寻址模式进行寻址,使得在外迭代寻址模式下每次寻址都可以得到一个完整形状(shape)的候选区域,从而便于后续进一步基于完整形状的候选区域进行内迭代寻址模式下的寻址,可以提高寻址效率,达到减少冗余指令的目的。In the embodiment of the second aspect of the present application, an addressing mode including the above five parameters is used for addressing, so that each addressing in the external iterative addressing mode can obtain a candidate area of a complete shape, thereby facilitating subsequent addressing in the internal iterative addressing mode based on the candidate area of the complete shape, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
在本申请的第二方面的可选的实施方式中,所述内迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。In an optional implementation of the second aspect of the present application, the parameters of the inner iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
本申请的第二方面的实施例中,采用包含上述4参数的寻址模式进行寻址,使得每次寻址都可以得到一个包含多个张量元素的数据形状(shape),一次寻址获取的源操作数大于等于1,从而可以提高寻址效率,达到减少冗余指令的目的。In an embodiment of the second aspect of the present application, an addressing mode including the above four parameters is used for addressing, so that each addressing can obtain a data shape including multiple tensor elements, and the source operand obtained by one addressing is greater than or equal to 1, thereby improving the addressing efficiency and achieving the purpose of reducing redundant instructions.
在本申请的第二方面的可选的实施方式中,所述外迭代寻址模式,用于从所述张量数据中选定至少一个候选区域,所述候选区域包含多个张量元素;所述内迭代寻址模式,用于从所述至少一个候选区域中获取向量运算指令所需的源操作数。可选的,在一些应用场景下,候选区域的张量元素可以参与数据读写,作为寻址结果的一部分,在一些应用场景下,候选区域的张量元素可以不参与数据读写,仅作为寻址过程的中间结果。具体是否将候选区域的张量元素进行数据输出,可根据实际需求进行灵活配置决定。In an optional implementation of the second aspect of the present application, the external iterative addressing mode is used to select at least one candidate area from the tensor data, and the candidate area contains multiple tensor elements; the internal iterative addressing mode is used to obtain the source operand required for the vector operation instruction from the at least one candidate area. Optionally, in some application scenarios, the tensor elements of the candidate area can participate in data reading and writing as part of the addressing result. In some application scenarios, the tensor elements of the candidate area may not participate in data reading and writing, but only serve as intermediate results of the addressing process. Whether to output data of the tensor elements of the candidate area can be flexibly configured and decided according to actual needs.
在本申请的第二方面的可选的实施方式中,所述外迭代寻址模式、内迭代寻址模式均为新切片寻址模式、切片寻址模式、索引寻址模式中的一种,所述新切片寻址模式相比于所述切片寻址模式包含更多寻址参数,所述更多寻址参数包括:表征寻址所得数据形状的尺寸。In an optional implementation of the second aspect of the present application, the external iterative addressing mode and the internal iterative addressing mode are both one of a new slice addressing mode, a slice addressing mode, and an index addressing mode, and the new slice addressing mode includes more addressing parameters than the slice addressing mode, and the more addressing parameters include: characterizing the size of the data shape obtained by addressing.
本申请的第二方面的实施例中,关于双重张量寻址模式,无论是外迭代寻址模式还是内迭代寻址模式,均支持多种寻址模式,从而使得整个寻址方式既兼容相关技术的寻址方式(切片寻址模式、索引寻址模式),又可以兼容本申请提出的张量寻址方式(新切片寻址模式),增加了方案的灵活性和易用性。In the embodiment of the second aspect of the present application, regarding the dual tensor addressing mode, both the external iterative addressing mode and the internal iterative addressing mode support multiple addressing modes, so that the entire addressing mode is compatible with the addressing mode of the related technology (slice addressing mode, index addressing mode), and is compatible with the tensor addressing mode (new slice addressing mode) proposed in this application, thereby increasing the flexibility and ease of use of the solution.
本申请的第二方面的实施例还提供了一种张量处理方法,包括:根据张量寻址模式从张量数据中获取向量运算指令所需的源操作数,其中,根据所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;根据所述向量运算指令,对获取的源操作数进行运算,得到运算结果。An embodiment of the second aspect of the present application also provides a tensor processing method, including: obtaining source operands required for vector operation instructions from tensor data according to a tensor addressing mode, wherein the number of source operands obtained by one addressing according to the tensor addressing mode is greater than or equal to 1; according to the vector operation instruction, performing operations on the obtained source operands to obtain operation results.
本申请的第二方面的实施例中,采用张量寻址模式一次寻址获得的张量元素可以有一个或多个。该方法可应用于前述第一方面的AI芯片和/或前述第二方面的电子设备。关于该方法的原理、有益效果,可参见其他方面实施例或实施方式的相关描述。In the embodiment of the second aspect of the present application, there may be one or more tensor elements obtained by addressing at one time using the tensor addressing mode. The method can be applied to the AI chip of the first aspect and/or the electronic device of the second aspect. For the principles and beneficial effects of the method, please refer to the relevant descriptions of the embodiments or implementation methods of other aspects.
可选的,根据张量寻址模式从张量数据中获取向量运算指令所需的源操作数,可包括:利用寻址引擎在主引擎发送的控制命令下,根据张量寻址模式从张量数据中获取向量运算指令所需的源操作数。Optionally, obtaining source operands required for vector operation instructions from tensor data according to the tensor addressing mode may include: using the addressing engine to obtain the source operands required for the vector operation instructions from the tensor data according to the tensor addressing mode under the control command sent by the main engine.
在本申请的第二方面的可选的实施方式中,所述张量寻址模式包括嵌套的双重张量寻址模式,所述双重张量寻址模式包括:外迭代寻址模式和内迭代寻址模式;其中,根据张量寻址模式从张量数据中获取向量运算指令所需的源操作数,包括:利用所述外迭代寻址模式从所述张量数据中选定至少一个候选区域,所述候选区域包含多个张量元素;利用所述内迭代寻址模式从所述至少一个候选区域中获取向量运算指令所需的源操作数。In an optional implementation of the second aspect of the present application, the tensor addressing mode includes a nested dual tensor addressing mode, and the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode; wherein, obtaining source operands required for vector operation instructions from tensor data according to the tensor addressing mode includes: selecting at least one candidate area from the tensor data using the external iterative addressing mode, and the candidate area contains multiple tensor elements; and obtaining the source operands required for vector operation instructions from the at least one candidate area using the internal iterative addressing mode.
在本申请的第二方面的可选的实施方式中,所述外迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸、表征寻址所得数据形状为非完整形状的保留情况的特征参数;和/或,所述内迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。In an optional implementation of the second aspect of the present application, the parameters of the external iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, a size representing the shape of the data obtained by addressing, and a characteristic parameter representing whether data of an incomplete addressed shape is retained; and/or the parameters of the internal iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
本申请的实施例还提供了一种电子设备,包括:存储器,用于存储运算所需的张量数据;如本申请的第一方面所述的AI芯片,所述AI芯片与所述存储器连接,所述AI芯片,用于根据张量寻址模式从所述存储器存储的原始张量数据中,获取向量运算指令所需的源操作数,并根据所述向量运算指令,对获取的源操作数进行运算,得到运算结果;或者如本申请的第二方面所述的AI芯片,所述AI芯片与所述存储器连接,所述AI芯片用于将所述存储器中存储的张量数据写入所述AI芯片中的向量寄存器。An embodiment of the present application also provides an electronic device, comprising: a memory for storing tensor data required for operations; an AI chip as described in the first aspect of the present application, the AI chip is connected to the memory, and the AI chip is used to obtain source operands required for vector operation instructions from the original tensor data stored in the memory according to the tensor addressing mode, and according to the vector operation instruction, the obtained source operands are operated to obtain operation results; or an AI chip as described in the second aspect of the present application, the AI chip is connected to the memory, and the AI chip is used to write the tensor data stored in the memory into the vector register in the AI chip.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了更清楚地说明本申请实施例或相关技术中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。通过附图所示,本申请的上述及其它目的、特征和优势将更加清晰。在全部附图中相同的附图标记指示相同的部分。并未刻意按实际尺寸等比例缩放绘制附图,重点在于示出本申请的主旨。In order to more clearly illustrate the technical solutions in the embodiments of the present application or the related art, the drawings required for use in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can also be obtained based on these drawings without creative work. As shown in the drawings, the above and other purposes, features and advantages of the present application will be clearer. The same reference numerals indicate the same parts in all the drawings. The drawings are not deliberately scaled to the actual size, and the focus is on showing the main purpose of the present application.
图1A为相关技术中的AI芯片与存储器的结构示意图。FIG. 1A is a schematic diagram of the structure of an AI chip and a memory in the related art.
图1B和图1C为相关技术中的切片寻址模式进行寻址的原理示意图。1B and 1C are schematic diagrams showing the principle of addressing using a slice addressing mode in the related art.
图2示出了本申请的第一方面的实施例提供的一种AI芯片与存储器连接的结构示意图。FIG2 shows a schematic structural diagram of a connection between an AI chip and a memory provided in an embodiment of the first aspect of the present application.
图3A示出了本申请的实施例提供的一种发生广播的原理示意图。FIG. 3A shows a schematic diagram of a principle of broadcasting provided by an embodiment of the present application.
图3B示出了本申请的实施例提供的第二种发生广播的原理示意图。FIG. 3B shows a schematic diagram of the principle of a second broadcasting provided by an embodiment of the present application.
图3C示出了本申请的实施例提供的第三种发生广播的原理示意图。FIG. 3C shows a schematic diagram of a third principle of broadcasting provided by an embodiment of the present application.
图4示出了本申请的第一方面的实施例提供的第一种张量寻址模式的原理示意图。FIG4 shows a schematic diagram showing the principle of a first tensor addressing mode provided by an embodiment of the first aspect of the present application.
图5示出了本申请的第一方面的实施例提供的第二种张量寻址模式的原理示意图。FIG5 is a schematic diagram showing the principle of a second tensor addressing mode provided by an embodiment of the first aspect of the present application.
图6示出了本申请的第一方面的实施例提供的另一种AI芯片的结构示意图。FIG6 shows a schematic diagram of the structure of another AI chip provided in an embodiment of the first aspect of the present application.
图7示出了本申请的第一方面的实施例提供的按照0:8:2:5的寻址模式进行寻址的原理示意图。FIG. 7 is a schematic diagram showing the principle of addressing according to the 0:8:2:5 addressing mode provided by the embodiment of the first aspect of the present application.
图8示出了本申请的第一方面的实施例提供的按照0:5:2:1的寻址模式进行寻址的原理示意图。FIG8 is a schematic diagram showing the principle of addressing according to the 0:5:2:1 addressing mode provided by the embodiment of the first aspect of the present application.
图9示出了本申请的第一方面的实施例提供的一种张量处理方法的示意图。FIG9 shows a schematic diagram of a tensor processing method provided by an embodiment of the first aspect of the present application.
图10示出了本申请的第二方面的实施例提供的第一种AI芯片的结构示意图。FIG10 shows a schematic structural diagram of a first AI chip provided in an embodiment of the second aspect of the present application.
图11示出了本申请的第二方面的实施例提供的第二种AI芯片的结构示意图。FIG11 shows a schematic diagram of the structure of a second AI chip provided in an embodiment of the second aspect of the present application.
图12示出了本申请的第二方面的实施例提供的一种张量处理方法的结构示意图。FIG12 shows a schematic structural diagram of a tensor processing method provided by an embodiment of the second aspect of the present application.
图13A示出了本申请的第一方面的实施例提供的一种电子设备的结构示意图。FIG13A shows a schematic structural diagram of an electronic device provided by an embodiment of the first aspect of the present application.
图13B示出了本申请的第二方面的实施例提供的一种电子设备的结构示意图。FIG13B shows a schematic structural diagram of an electronic device provided in an embodiment of the second aspect of the present application.
图14示出了本申请的第一方面的实施例提供的按照0:4:1:3的寻址模式进行寻址的原理示意图。FIG14 is a schematic diagram showing the principle of addressing according to the 0:4:1:3 addressing mode provided by the embodiment of the first aspect of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。The technical solutions in the embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application.
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that similar numbers and letters represent similar items in the following figures, so once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. The terms "comprises", "comprising" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements, but also includes other elements that are not explicitly listed, or also includes elements that are inherent to such process, method, article or device. In the absence of further limitations, an element defined by the statement "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
发明人经过研究发现,相关技术对于张量之间的计算过程,需要消耗较长的时间去计算源操作数对应的张量元素的地址,即很大一部分时间都浪费在了计算源操作数对应的张量元素的地址上,究其原因,主要是因为相关技术的方式是基于numpy(Numerical Python,为Python的一种开源的数值计算扩展)的切片(slicing)寻址模式进行寻址。例如,对于一个张量数组a,用a[start:stop:step]这种思路提取出一个新数组,并且可以支持多维数组中的任一维度进行处理。其中,start表示寻址的起点地址,stop表示寻址的终点地址,step相当于stride(步长),表示寻址指针的偏移幅度。After research, the inventor found that in the related art, the calculation process between tensors takes a long time to calculate the addresses of the tensor elements corresponding to the source operands; that is, a large part of the time is wasted on calculating those addresses. The main reason is that the related art addresses based on the slicing addressing mode of numpy (Numerical Python, an open source numerical computing extension of Python). For example, for a tensor array a, a[start:stop:step] is used to extract a new array, and this can be applied along any dimension of a multidimensional array. Here, start indicates the starting address of the addressing, stop indicates the ending address, and step is equivalent to stride, indicating the offset amplitude of the addressing pointer.
下面结合图1B和图1C具体说明如何以切片寻址模式进行寻址。在一个实例中,以切片寻址模式start:stop:step=0:5:2为例,其寻址原理图如图1B所示。这种切片寻址模式一次寻址只能得到一个张量元素,比如,示例中的(0,0)、(2,0)、(4,0)、(0,2)、(2,2)、(4,2)、(0,4)、(2,4)、(4,4)。在另一实例中,以切片寻址模式start:stop:step=0:6:2为例,其寻址原理图如图1C所示。这种切片寻址模式一次寻址只能得到一个张量元素,比如,示例中的(0,0)、(2,0)、(4,0)、(0,2)、(2,2)、(4,2)、(0,4)、(2,4)、(4,4)。如图1B和图1C所示出的,这种切片寻址模式一次寻址只能得到一个张量元素,由于张量计算会涉及到大量的源操作数,如果要获取很多源操作数对应的多个张量元素,则需要用大量的指令进行多次寻址,这样就会导致大量指令被浪费在了寻址上,即浪费在了非有效计算上。The following specifically describes how to address in slice addressing mode in conjunction with Figures 1B and 1C. In one example, taking the slice addressing mode start:stop:step=0:5:2 as an example, its addressing principle diagram is shown in Figure 1B. This slice addressing mode can only obtain one tensor element at a time, for example, (0,0), (2,0), (4,0), (0,2), (2,2), (4,2), (0,4), (2,4), (4,4) in the example. In another example, taking the slice addressing mode start:stop:step=0:6:2 as an example, its addressing principle diagram is shown in Figure 1C. This slice addressing mode can only obtain one tensor element at a time, for example, (0,0), (2,0), (4,0), (0,2), (2,2), (4,2), (0,4), (2,4), (4,4) in the example. As shown in FIG. 1B and FIG. 1C , this slice addressing mode can only obtain one tensor element at a time. Since tensor calculations involve a large number of source operands, if you want to obtain multiple tensor elements corresponding to many source operands, you need to use a large number of instructions for multiple addressing. This will result in a large number of instructions being wasted on addressing, that is, wasted on ineffective calculations.
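The 0:5:2 example above can be reproduced with numpy slicing: along each dimension the pointer visits indices 0, 2, 4, so exactly the nine listed elements are selected, one element per (row, column) address.

```python
import numpy as np

# Reproduce the start:stop:step = 0:5:2 slice along both dimensions of a
# 6x6 tensor: rows 0, 2, 4 and columns 0, 2, 4 -> nine single elements.
a = np.arange(36).reshape(6, 6)
sub = a[0:5:2, 0:5:2]        # selects (0,0), (0,2), ..., (4,4)

# Each selected position corresponds to exactly one tensor element, which is
# why many such single-element addresses are needed for a large operand set.
rows, cols = np.meshgrid([0, 2, 4], [0, 2, 4], indexing="ij")
assert (sub == a[rows, cols]).all()
```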
基于此,本申请提供了一种全新的AI芯片,一次寻址便可确定至少一个源操作数对应的张量元素的地址,能极大提高计算源操作数对应的张量元素的地址的效率,使得只需要较少的指令便可寻找到所需的源操作数,极大的减少了冗余指令、提高了有效指令密度、提高了处理性能、并且可以简化编程量。同时,利用向量寄存器来存储引擎单元所获取的源操作数,使得向量运算单元可以直接从向量寄存器中获取向量运算指令的源操作数进行运算,这样的设计方式使得可以支持提前获取向量运算指令所需的源操作数,后续在进行运算时,直接读取即可,从而实现寻址与运算相分离,即便中途运算出错,也不需要重新去寻址,直接从向量寄存器中获取对应的源操作数即可,可以提高运算的效率。Based on this, the present application provides a new AI chip, which can determine the address of at least one tensor element corresponding to a source operand with one addressing, which can greatly improve the efficiency of calculating the address of the tensor element corresponding to the source operand, so that only fewer instructions are needed to find the required source operand, which greatly reduces redundant instructions, improves the effective instruction density, improves processing performance, and simplifies the amount of programming. At the same time, the vector register is used to store the source operand obtained by the engine unit, so that the vector operation unit can directly obtain the source operand of the vector operation instruction from the vector register for operation. This design method can support the early acquisition of the source operand required for the vector operation instruction, and then directly read it when performing the operation, thereby realizing the separation of addressing and operation. Even if an error occurs in the operation in the middle, there is no need to re-address, and the corresponding source operand can be directly obtained from the vector register, which can improve the efficiency of the operation.
该AI芯片能够对张量的寻址、运算进行硬件层面的加速,硬件可支持多种张量寻址模式,兼容性好,有利于快速完成张量之间的计算。This AI chip can accelerate the addressing and calculation of tensors at the hardware level. The hardware can support multiple tensor addressing modes, has good compatibility, and is conducive to quickly completing calculations between tensors.
In addition, the embodiments of the present application also propose new addressing modes for tensor addressing in terms of flexibility, ease of use, and processing efficiency. On the one hand, based on the design idea of traditional slice addressing, which accesses elements of an array sequentially or at a fixed stride, a new slice-based tensor addressing mode is designed as an extension; on the other hand, an addressing mode that can be nested to perform dual tensor addressing is designed. These new addressing modes can all be applied on the AI chip provided by the embodiments of the present application, which helps reduce the total number of instructions required for computations between tensors, enables flexible addressing, improves hardware processing performance, and improves tensor processing efficiency.
To facilitate understanding of the embodiments of the first aspect of the present application, the AI chip and the tensor processing method according to the first aspect of the present application are first described in detail with reference to Figures 2 to 9.
The principle of the AI chip provided by the embodiments of the first aspect of the present application is described below in conjunction with Figure 2. The AI chip includes a vector register, an engine unit, and a vector operation unit. The vector register is directly connected to the vector operation unit, and the engine unit may be directly connected to the vector operation unit.
The engine unit is configured to obtain, according to a tensor addressing mode, the source operands required by a vector operation instruction from original tensor data; the number of source operands obtained in one addressing operation of the tensor addressing mode is greater than or equal to 1. Tensor data is multidimensional data, which is usually laid out in some manner (such as a linear layout or a tiled layout) and stored in a device with a storage function, such as a memory or an on-chip memory.
The vector register is configured to store the source operands obtained by the engine unit. In this way, even if an error occurs midway through the computation, there is no need to re-address; the corresponding source operands are simply fetched from the vector register.
The vector operation unit is connected to the vector register and is configured to obtain the source operands of the vector operation instruction from the vector register, perform the operation, and obtain the operation result. A vector operation instruction is an instruction that can operate on two or more operands at the same time. The type of the vector operation instruction may be any of various operation types such as addition, subtraction, multiplication, and multiply-accumulate.
The vector operation unit is directly connected to the vector register. Since the data stored in the vector register is the source operands required by the vector operation instruction, addressed from the original tensor data according to the tensor addressing mode, the vector operation unit can obtain the result of the tensor addressing by directly accessing the vector register.
When the vector register is directly connected to the engine unit, the engine unit is further connected to an external memory; in this case, the vector register is effectively connected to the external memory through the engine unit. The external memory is configured to store the original tensor data. Taking a three-operand vector operation instruction such as A*B+C as an example, A, B, and C are all source operands, each of which may be a single element or an array containing multiple elements. The memory stores the original tensor data in which operand A is located, the original tensor data in which operand B is located, and the original tensor data in which operand C is located. According to the tensor addressing mode, the engine unit obtains operand A of the vector operation instruction from the original tensor data in which operand A is located and stores it in the vector register, obtains operand B from the original tensor data in which operand B is located and stores it in the vector register, and obtains operand C from the original tensor data in which operand C is located and stores it in the vector register.
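As an illustrative sketch of this data flow (the memory layout and operand values here are simplified assumptions, not the chip's actual organization), the separation of the addressing pass from the computation pass for A*B+C can be modeled as:

```python
# Simplified model: the engine unit addresses each operand's original tensor
# data once and deposits it in a vector register; the vector operation unit
# then reads only the registers, so a failed computation can be retried
# without re-addressing memory.
memory = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}  # original tensor data

# Addressing pass: fill vector registers with operands A, B, C.
vector_registers = {name: list(values) for name, values in memory.items()}

# Computation pass: the three-operand vector instruction A*B+C reads registers only.
result = [a * b + c for a, b, c in zip(vector_registers["A"],
                                       vector_registers["B"],
                                       vector_registers["C"])]
# result == [11, 18, 27]
```

A retry of the computation pass would re-read `vector_registers` directly, never touching `memory` again, which is the addressing/computation separation described above.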
The above AI chip may be an integrated circuit chip with data processing capability, which can be used to process operations between tensors. For example, it may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or another programmable logic device. The general-purpose processor may also be a microprocessor, or the AI chip may be any conventional processor, such as a Graphics Processing Unit (GPU) or a General Purpose computing on Graphics Processing Unit (GPGPU).
The inventors found that the operation process on tensor elements is fairly regular; in general there are the following two operation processes:
First, tensor elements in two or more tensor data of the same dimension sizes are operated on one-to-one (called an element-wise operation). For example, assuming the width (W) and height (H) of two tensor data are both 5, i.e., W*H is 5*5, then when the tensor elements of these two tensor data are operated on one-to-one, the operation is performed one-to-one between the tensor elements at corresponding positions.
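For instance, a one-to-one (element-wise) addition of two 5*5 tensors can be sketched in plain Python (the values are illustrative):

```python
W, H = 5, 5
a = [[i * W + j for j in range(W)] for i in range(H)]  # first 5x5 tensor
b = [[1] * W for _ in range(H)]                        # second 5x5 tensor

# Element-wise operation: each output element pairs the two input elements
# at the same (row, column) position.
c = [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
```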
Second, a single tensor element in one tensor data is operated on with a group of tensor elements in another tensor data (called broadcasting). For example, in a scenario where every pixel value of an image tensor (the first operand) needs to be divided by 255 (the second operand), assume the dimension information of the two operands is as follows: the number of channels, width, and height of the first operand are 3, 224, 224, respectively; the number of channels, width, and height of the second operand are 1, 1, 1, respectively.
In this case, the second operand is broadcast in all three dimensions (i.e., the above channel, width, and height dimensions). That is, before the division, the second operand of dimensions (1, 1, 1) (with value 255) needs to be expanded into a tensor of dimensions (3, 224, 224) in which every element value is 255. The two operands then have the same dimension sizes and the division can be performed, after which the tensor elements of the two tensor data are operated on one-to-one.
As another example, assume the number of channels, width, and height of the first operand are 3, 224, 224, respectively, and those of the second operand are 3, 1, 224, respectively. Then the second operand only needs to be broadcast in the width direction, i.e., the second operand of dimensions (3, 1, 224) is expanded into a tensor of dimensions (3, 224, 224). The two operands then have the same dimension sizes and the division can be performed.
For a broadcast operation to occur, the two tensor data must have the same number of dimensions, and at least one dimension of one of them must be 1. During broadcast processing, the tensor data being broadcast replicates its elements along each dimension in which broadcasting occurs, so that all dimensions match the other operand.
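The compatibility rule just stated can be sketched as a small shape check (an illustration of the rule only, not the chip's implementation):

```python
def broadcast_shape(shape_a, shape_b):
    """NumPy-style rule as stated above: both tensors must have the same number
    of dimensions; where sizes differ, one of them must be 1, and the broadcast
    result takes the larger size in that dimension."""
    if len(shape_a) != len(shape_b):
        raise ValueError("broadcasting requires the same number of dimensions")
    out = []
    for da, db in zip(shape_a, shape_b):
        if da != db and 1 not in (da, db):
            raise ValueError(f"incompatible dimension sizes {da} and {db}")
        out.append(max(da, db))
    return tuple(out)

# The image/255 example: (1, 1, 1) broadcasts in all three dimensions.
assert broadcast_shape((3, 224, 224), (1, 1, 1)) == (3, 224, 224)
# The second example: (3, 1, 224) broadcasts only in the width dimension.
assert broadcast_shape((3, 224, 224), (3, 1, 224)) == (3, 224, 224)
```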
For a better understanding of the above broadcast operation, a description is given below in conjunction with the schematics shown in Figures 3A, 3B, and 3C. In Figure 3A, since the operand 5 and the array "np.arange(3)" differ in size in the width dimension, 5 needs to be broadcast in the width direction before the addition so that its width dimension matches that of the array "np.arange(3)", after which the addition is performed. Similarly, in Figure 3B, since the array "np.arange(3)" and the array "np.ones((3, 3))" differ in size in the height dimension, the array "np.arange(3)" needs to be broadcast in the height direction before the addition so that its height dimension matches that of the array "np.ones((3, 3))", after which the addition is performed. Similarly, in Figure 3C, since the array "np.arange(3).reshape((3, 1))" and the array "np.arange(3)" differ in both the width and height dimensions, the array "np.arange(3).reshape((3, 1))" needs to be broadcast in the width direction and the array "np.arange(3)" needs to be broadcast in the height direction so that the two have the same dimension sizes, after which the addition is performed.
The principles and details of element-wise operations and broadcast operations on tensors are well known in the art and are not described further here.
In addition, the address computation patterns of common tensor elements (such as the slice addressing mode of the related art) are also fairly regular, where slicing refers to accessing tensor elements in a certain order or at a certain stride.
Besides the architectural improvements, in order to improve the addressing efficiency of tensor elements, the inventors of the present application propose a completely new tensor addressing mode that makes full use of the regularity of the tensor operations of the related art. This tensor addressing mode improves and extends the slice addressing mode of the related art, so that the parameters of the tensor addressing mode (which can also be regarded as a new slice addressing mode) are extended from the three parameters of the original slice addressing mode to include more parameters, thereby improving addressing efficiency and reducing redundant instructions.
The parameters of the tensor addressing mode (the new slice addressing mode) in the present application include: a start address characterizing the starting point of addressing (denoted start), an end address characterizing the end point of addressing (denoted stop), a step characterizing the offset of the addressing pointer (denoted step), and a size characterizing the shape of the addressed data (denoted size). The expression of the tensor addressing mode can thus be written as [start:stop:step:size]. That is, compared with the slice addressing mode, the new slice addressing mode contains additional addressing parameters, including the size characterizing the shape of the addressed data. Here, size describes the size of the addressed data shape; that is, at each step all element points contained in a complete shape are extracted, rather than a single point. The values of start, stop, step, and size in the above expression are all configurable, so as to suit computations between various tensors.
Optionally, the parameters of the tensor addressing mode further include: a characteristic parameter characterizing whether an addressed data shape that is incomplete is retained (denoted partial, reflecting the completeness of a local tensor data shape). In this case, the expression of the tensor addressing mode is [start:stop:step:size:partial]. When partial=false, the points contained in an incomplete shape at the edge are discarded, i.e., not retained; when partial=true, the points contained in an incomplete shape at the edge are retained. Compared with the parameters of the slice addressing mode of the related art, the parameters of the tensor addressing mode of the present application add two parameters: size and partial.
For a better understanding, the cases partial=false and partial=true are described below using the tensor addressing mode [start:stop:step:size]=0:6:2:3 as an example, in which start=0, stop=6, step=2, and size=3.
When partial=false, the points contained in an incomplete shape at the edge are discarded, as illustrated in Figure 4. In this case, at each sliding step it is determined, according to size, whether the currently addressed element points can form a complete shape (in this example, since size=3, a complete shape has 3*3 element points); if a complete shape cannot be formed, the points addressed at this step are discarded. When partial=true, the points contained in an incomplete shape at the edge are retained, as illustrated in Figure 5. In this case, each step still determines, according to size, whether the currently addressed points can form a complete shape, but even if a complete shape cannot be formed, the points addressed at this step are retained.
Comparing Figure 4 with Figure 5 shows that when partial=false, the tensor element points contained in the incomplete shapes at the edge in Figure 4 are discarded. In the examples of Figures 4 and 5, a complete shape contains 9 points; a shape at the edge with fewer than 9 points is an incomplete shape. When partial=false, the points contained in an incomplete shape are discarded; when partial=true, they are retained.
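As an illustrative software sketch of the [start:stop:step:size:partial] semantics along a single dimension (assuming, for illustration, a dimension of length 6; the hardware realization in this application may differ), each step anchors a window of `size` consecutive indices, and `partial` decides whether a truncated edge window is kept:

```python
def addressed_windows(start, stop, step, size, dim_len, partial):
    """New slice addressing along one dimension: each step yields a whole
    window of `size` indices rather than a single point."""
    windows = []
    for anchor in range(start, stop, step):
        window = [i for i in range(anchor, anchor + size) if i < dim_len]
        if len(window) == size:       # complete shape: always kept
            windows.append(window)
        elif partial and window:      # incomplete shape at the edge:
            windows.append(window)    # kept only when partial=true
    return windows

# 0:6:2:3 over a dimension of length 6 (cf. Figures 4 and 5):
assert addressed_windows(0, 6, 2, 3, 6, partial=False) == [[0, 1, 2], [2, 3, 4]]
assert addressed_windows(0, 6, 2, 3, 6, partial=True) == [[0, 1, 2], [2, 3, 4], [4, 5]]
# With size=1 the mode degenerates to classic slice addressing (Figure 1B):
assert addressed_windows(0, 5, 2, 1, 5, partial=False) == [[0], [2], [4]]
```

In two dimensions, a full shape at each step is the Cartesian product of one window per dimension, e.g. 3*3 = 9 element points for size=3.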
Comparing Figure 4 (or Figure 5) with Figure 1B shows that with the tensor addressing mode of the present application, one step can determine multiple tensor elements, and a single addressing operation can determine the addresses of the tensor elements corresponding to one or more source operands. For example, in the above example, one addressing operation (i.e., one step) addresses 9 tensor element addresses, whereas the related art requires 9 addressing operations, and accordingly more instructions, to address those 9 tensor elements. Therefore, compared with the addressing of the related art, the addressing provided by the embodiments of the first aspect of the present application greatly improves the efficiency of computing the addresses of the tensor elements corresponding to source operands, so that the required source operands can be found with fewer instructions, which greatly reduces redundant instructions, increases effective instruction density, improves performance, and simplifies programming.
It can be understood that when it is desired that each step address only one tensor element, this can be achieved by configuring size=1. In this case each step addresses only one tensor element, and the addressing mode is similar to that of the related art (the addressing logic of the related art can be supported on the same hardware simply by changing parameter values), as illustrated in Figure 1. Alternatively, an existing addressing mode can be used directly.
It should be noted that when each step is expected to address only one tensor element, the size parameter may also be configured to be greater than 1. In this case, the required tensor element is further selected from the shape obtained according to size and step; that is, at each sliding step, only the one required tensor element is extracted from the multiple tensor elements contained in the obtained shape. Therefore, when each step needs to address only one tensor element, size need not be 1. In this way, the tensor addressing mode provided by the embodiments of the present application can reproduce the slice addressing effect of the related art, balancing processing efficiency and addressing flexibility.
There may be multiple vector registers in the AI chip, and different vector registers may store different source operands (each being a part of the tensor data). Taking a three-operand vector operation instruction such as A*B+C as an example, there may be 3 vector registers, e.g., vector register 1, vector register 2, and vector register 3. Vector register 1 may store source operand A, vector register 2 may store source operand B, and vector register 3 may store source operand C. The present application does not limit the specific form of instruction expression.
In a vector operation instruction, each operand corresponds to an addressing mode, and the addressing modes of the multiple operands are matched with one another; a broadcast operation may occur during matching, after which the multiple operands have the same shape.
It can be understood that different operands (such as the above A and B) may also be stored in the same vector register, as long as the vector register has sufficient space. Therefore, the above example in which different vector registers store different operands should not be understood as limiting the present application.
When there are multiple vector registers, one or more addressing engines may be used to address the tensor data stored for the multiple vector registers.
To improve addressing efficiency, in an optional embodiment, as shown in Figure 6, the engine unit includes multiple addressing engines, one addressing engine corresponding to one vector register, and each addressing engine is configured to obtain, according to its own tensor addressing mode, the source operands required by the vector operation instruction from the corresponding original tensor data. For example, taking the case where 3 vector registers store the 3 required source operands respectively, addressing may be performed by 3 addressing engines (denoted, e.g., addressing engine 1, addressing engine 2, and addressing engine 3). Addressing engine 1 corresponds to vector register 1 and is configured to obtain, according to the tensor addressing mode of addressing engine 1, source operand A required by the vector operation instruction from the original tensor data in which source operand A is located; addressing engine 2 corresponds to vector register 2 and obtains source operand B from the original tensor data in which source operand B is located, according to the tensor addressing mode of addressing engine 2; addressing engine 3 corresponds to vector register 3 and obtains source operand C from the original tensor data in which source operand C is located, according to the tensor addressing mode of addressing engine 3.
It can be understood that in an optional embodiment, the same addressing engine may also obtain, from multiple original tensor data, the multiple source operands required by the vector operation instruction; for example, addressing engine 1 may obtain source operand A, source operand B, and source operand C of the foregoing example.
Each addressing engine, such as the above addressing engine 1, addressing engine 2, and addressing engine 3, is independent and does not interfere with the others. When obtaining the source operands required by the vector operation instruction, each addressing engine performs independent addressing using an independent tensor addressing mode; that is, each source operand of the vector operation instruction corresponds to an independent tensor addressing mode. For example, the above source operand A corresponds to tensor addressing mode 1, source operand B to tensor addressing mode 2, and source operand C to tensor addressing mode 3. Tensor addressing modes 1, 2, and 3 may contain the same kinds of parameters, e.g., all five of the above parameters (start:stop:step:size:partial), but the specific parameter values may differ.
It can be understood that when multiple addressing engines perform independent addressing, all of them may use the tensor addressing mode provided by the present application (which may be regarded as the new slice addressing mode), or some of them may use the tensor addressing mode provided by the present application while the rest use addressing modes of the related art (an index addressing mode, numpy's basic slice addressing mode). For example, in addition to being the same type of addressing mode (such as the new slice addressing mode), the foregoing tensor addressing modes 1, 2, and 3 may also be addressing modes of different types: for instance, tensor addressing mode 1 may be the tensor addressing mode of the present application (the new slice addressing mode), tensor addressing mode 2 may be the three-parameter slice addressing mode of the related field, and tensor addressing mode 3 may be the index addressing mode of the related field. That is, even compared with using addressing modes of the related art throughout, the addressing efficiency can still be improved to a certain extent and the number of instructions required for addressing can be reduced.
As an embodiment, in addition to its own independent addressing mode, each addressing engine may also be configured to support handling broadcast operations.
To facilitate controlling the addressing of the individual addressing engines, in an optional embodiment the engine unit further includes a main engine connected to each addressing engine. The main engine is configured to send control commands to each addressing engine, controlling each addressing engine to perform addressing according to its tensor addressing mode and to obtain the source operands required by the vector operation instruction from the corresponding original tensor data. Introducing a main engine to centrally control these independent addressing engines can improve addressing efficiency. At every step, the main engine sends control commands to the addressing engines of the operands, so that the corresponding addressing engine traverses each dimension of the tensor data in order from the lowest dimension to the highest and performs addressing in the corresponding dimension.
The control commands include an Advance command, a Reset command, and a NOP (no-operation) command. That is, at every step the main engine sends at least one of the above three control commands to the addressing engine of each operand; in other words, each time the main engine sends control commands to the addressing engines, it sends a combined control command composed of at least one of the above three control commands, so as to control each addressing engine to address the different dimensions of the original tensor data according to its own tensor addressing mode.
The Advance command means advancing (moving forward, or increasing) by one step in the current dimension.
The Reset command means that when the end of a dimension is reached, addressing in that dimension restarts from the beginning.
The NOP command means that no action takes place in that dimension.
To better understand the logic by which the main engine controls the addressing engines using the above Advance, Reset, and NOP commands, a description is given below based on the schematic shown in Figure 5:
Step 1: W cmd: NOP; H cmd: NOP, where cmd denotes a command; the state at this point is shown at (1) in Figure 5;
Step 2: W cmd: Advance; H cmd: NOP; the state at this point is shown at (2) in Figure 5;
Step 3: W cmd: Advance; H cmd: NOP; the state at this point is shown at (3) in Figure 5;
Step 4: W cmd: Reset; H cmd: Advance; the state at this point is shown at (4) in Figure 5;
Step 5: W cmd: Advance; H cmd: NOP; the state at this point is shown at (5) in Figure 5;
Step 6: W cmd: Advance; H cmd: NOP; the state at this point is shown at (6) in Figure 5;
Step 7: W cmd: Reset; H cmd: Advance; the state at this point is shown at (7) in Figure 5;
Step 8: W cmd: Advance; H cmd: NOP; the state at this point is shown at (8) in Figure 5;
Step 9: W cmd: Advance; H cmd: NOP; the state at this point is shown at (9) in Figure 5.
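The nine steps above follow a regular pattern: advance the low (W) dimension until it is exhausted, then reset it while advancing the high (H) dimension. A sketch that regenerates this command stream for a 3-by-3 walk (an illustration of the traversal order only, not the main engine's actual circuitry):

```python
def command_stream(w_steps, h_steps):
    """Low-dimension-first traversal: one (W cmd, H cmd) pair per step."""
    cmds = [("NOP", "NOP")]                  # Step 1: initial position
    for h in range(h_steps):
        # advance W until the end of the W dimension is reached
        cmds.extend([("Advance", "NOP")] * (w_steps - 1))
        if h < h_steps - 1:                  # W exhausted: restart W, advance H
            cmds.append(("Reset", "Advance"))
    return cmds

stream = command_stream(3, 3)
assert len(stream) == 9                      # Steps 1-9 above
assert stream[3] == ("Reset", "Advance")     # Step 4
assert stream[6] == ("Reset", "Advance")     # Step 7
assert stream[8] == ("Advance", "NOP")       # Step 9
```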
If the first dimension of the tensor data (which may be any dimension of the tensor data) needs to be broadcast, the addressing engine corresponding to that tensor data keeps its addressing pointer unchanged upon receiving an Advance command while addressing that dimension. When reading source operands from the vector register, the addressing engine therefore always reads the source operand pointed to by the current addressing pointer, thereby achieving the broadcast. Correspondingly, when the main engine sends control commands to drive addressing in the first dimension of the tensor data, it does not send a Reset command when the addressing engine reaches the end of that dimension; instead, it may keep sending Advance commands so that the source operand pointed to by the current addressing pointer is read repeatedly, and only after the broadcast is complete does it send other control commands.
For a better understanding, a description is given below with reference to the schematic diagram shown in Figure 3A. Assume the vector operation unit is to perform the operation shown in Figure 3A: addressing engine 1 is responsible for fetching the operands of the array "np.arange(3)", and addressing engine 2 is responsible for fetching the operand 5. At the initial moment, the main engine sends an Advance command to addressing engine 1 and addressing engine 2; addressing engine 1 fetches operand 0 from "np.arange(3)" and addressing engine 2 fetches operand 5. At the next moment, the main engine again sends an Advance command to both engines; addressing engine 1 advances one step and fetches operand 1 from "np.arange(3)", while addressing engine 2 keeps its addressing pointer unchanged upon receiving the Advance command and still outputs operand 5. At the following moment, the main engine once more sends an Advance command to both engines; addressing engine 1 advances one step and fetches operand 2 from "np.arange(3)", while addressing engine 2 again keeps its addressing pointer unchanged and still outputs operand 5.
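The broadcast behavior just described can be sketched in a few lines of Python. This is an illustrative model only, under the assumption that a broadcast flag per engine is what suppresses the pointer increment; the class and attribute names are not from the patent.

```python
class Engine:
    """Toy addressing engine: a broadcasting engine ignores pointer advances."""
    def __init__(self, data, broadcast=False):
        self.data, self.broadcast, self.i = data, broadcast, 0

    def advance(self):
        value = self.data[self.i]
        if not self.broadcast:   # a broadcasting engine keeps its pointer fixed
            self.i += 1
        return value

e1 = Engine([0, 1, 2])               # operands of np.arange(3)
e2 = Engine([5], broadcast=True)     # scalar operand, replayed by broadcast
pairs = [(e1.advance(), e2.advance()) for _ in range(3)]
assert pairs == [(0, 5), (1, 5), (2, 5)]
assert [a + b for a, b in pairs] == [5, 6, 7]   # np.arange(3) + 5
```

On every Advance, engine 2 re-emits the operand under its unchanged pointer, so the scalar 5 is paired with each element produced by engine 1.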
In an optional embodiment, the tensor addressing mode may further include a nested dual tensor addressing mode. The dual tensor addressing mode includes an outer iterator addressing mode and an inner iterator addressing mode, where the inner iterator addresses on the basis of the tensor data obtained by the outer iterator.
As one embodiment, the outer iterator addressing mode may be the aforementioned new slice addressing mode using the parameters start:stop:step:size:partial. In this case, tensor data of a local region (for example, a shape defined by these parameters) can be selected quickly, providing the data basis for addressing by the inner iterator. In some application scenarios, the local regions selected by the outer iterator are all complete shapes obtained with partial=true, so no additional instruction is needed to indicate a mode switch, nor to handle a possible change of data shape (from complete to incomplete) that the inner iterator addressing mode might otherwise face.
Compared with other addressing modes, addressing with the nested dual tensor addressing mode is more efficient: a single instruction can achieve the addressing that would otherwise require two single-layer instructions. Moreover, because the inner iterator addresses on the basis of the tensor data obtained by the outer iterator, data reads and writes of the first layer (the outer iterator) are reduced compared with addressing using a single-layer tensor addressing mode.
The expressions of the outer and inner iterator addressing modes may each be [start:stop:step:size] or [start:stop:step:size:partial]; the parameter values in the two expressions may differ.
For a better understanding, take the outer iterator expression [start:stop:step:size] = 0:8:2:5 as an example; its schematic diagram is shown in Figure 7, where the traversed part may be regarded as a feature map of size 8*8. Take 0:5:2:1 as the example parameter values of the inner iterator expression; its addressing schematic is shown in Figure 8. Addressing according to the 0:8:2:5 mode yields four groups of elements, namely the elements contained in the shapes (1), (2), (3) and (4) in Figure 7. The inner iterator then addresses on the basis of the tensor data obtained by the outer iterator, that is, it uses the 0:5:2:1 mode to address the tensor elements of each of (1), (2), (3) and (4) in Figure 7.
As can be seen from the examples of Figures 7 and 8, the outer iterator addressing mode is used to select at least one candidate region from the tensor data, each candidate region containing multiple tensor elements, such as the candidate regions (1), (2), (3) and (4) in Figure 7; the inner iterator addressing mode is used to obtain, from the at least one candidate region, the source operands required by the vector operation instruction. The inner iterator addresses on the basis of the data of (1), (2), (3) and (4) in Figure 7 to obtain those source operands.
It should be noted that the inner iterator does not necessarily wait until the outer iterator has finished all of its addressing. For example, once the outer iterator has addressed the 5*5 elements contained in (1) of Figure 7, the inner iterator can address on that basis; after the inner iterator finishes addressing that portion of data, the outer iterator continues. When the outer iterator has addressed the 5*5 elements contained in (2) of Figure 7, the inner iterator traverses them in the same way, after which the outer iterator continues, and so on, until all addressing of the 8*8 feature map is complete. In this way, the data addressed by the outer iterator is only an intermediate result and does not need to be read or written, which reduces the data reads and writes of the first layer (the outer iterator) compared with addressing using a single-layer tensor addressing mode.
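The interleaving of the two iterators can be sketched with nested Python generators. This is a simplified illustration, not the patented implementation: the function names are assumptions, and windows that would cross the feature-map boundary are simply skipped here, which reproduces the four complete 5*5 windows of Figure 7.

```python
def outer_windows(fm, start, stop, step, size):
    # One outer step yields a whole size x size window; windows that would
    # cross the boundary are skipped for simplicity (complete shapes only).
    for r in range(start, stop, step):
        for c in range(start, stop, step):
            if r + size <= stop and c + size <= stop:
                yield [row[c:c + size] for row in fm[r:r + size]]

def inner_points(window, start, stop, step):
    # Inner iterator with size 1: each step yields a single element.
    for r in range(start, stop, step):
        for c in range(start, stop, step):
            yield window[r][c]

feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]  # 8x8
operands = [p
            for w in outer_windows(feature_map, 0, 8, 2, 5)  # windows (1)-(4)
            for p in inner_points(w, 0, 5, 2)]               # 3x3 points each
assert len(operands) == 4 * 9  # four 5x5 windows, nine inner samples per window
```

Because `outer_windows` is a generator, each window is consumed by the inner iterator as soon as it is produced and then discarded, mirroring the point that the outer iterator's result is only an intermediate that needs no read/write of its own.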
It can be understood that, for dual tensor addressing, the outer and inner iterator addressing modes are not limited to the addressing mode shown in this application (the new slice addressing mode); the modes may also be extended, for example by addressing according to the principles of the existing addressing modes (the slice addressing mode and the index addressing mode), so that multiple combinations are possible. Some exemplary combinations are shown in Table 1 below.
Table 1
It should be noted that when the outer iterator uses the new slice addressing mode, each movement of the pointer by one step yields the 5*5 elements contained in one of the shapes (1), (2), (3), (4) shown in Figure 7. When the outer iterator instead uses the slice addressing mode or index addressing mode of the related art, only one element point is obtained per sliding step, so obtaining the elements of the shapes (1), (2), (3), (4) in Figure 7 requires multiple addressing operations. This guarantees that, whichever addressing mode the outer iterator uses, the same data as that obtained with the new slice addressing mode is produced, providing the basis for the inner iterator. That is, the data obtained by the outer iterator is identical regardless of the mode used, for example the multiple tensor elements of the shapes (1), (2), (3), (4) in Figure 7; the difference is that with the new slice addressing mode one sliding step yields all elements of a shape, whereas with the related-art addressing modes multiple addressing operations (sliding multiple steps) are needed to obtain the elements that the new slice addressing mode obtains in a single step.
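The equivalence claimed above can be checked directly. The sketch below is illustrative only (function names are assumptions): one pointer move in the new slice mode collects a whole 5*5 shape, while a point-per-step mode needs 25 moves, yet both produce the same 25 elements.

```python
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]  # 8x8

def new_mode_step(fm, r0, c0, size):
    # One pointer move returns every element of a size x size shape.
    return [fm[r0 + r][c0 + c] for r in range(size) for c in range(size)]

def point_per_step_mode(fm, r0, c0, size):
    # One pointer move returns a single element, so size*size moves are needed.
    out = []
    for r in range(size):
        for c in range(size):
            out.append(fm[r0 + r][c0 + c])   # one addressing operation each
    return out

assert new_mode_step(feature_map, 0, 0, 5) == point_per_step_mode(feature_map, 0, 0, 5)
assert len(new_mode_step(feature_map, 0, 0, 5)) == 25
```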
Based on the same inventive concept, an embodiment of the first aspect of the present application further provides a tensor processing method, as shown in Figure 9. The principle of the tensor processing method is described below with reference to Figure 9. The method may be applied to the aforementioned AI chip and to the aforementioned electronic device.
S1: Obtain the source operands required by a vector operation instruction from tensor data according to a tensor addressing mode, where the number of source operands obtained by one addressing operation of the tensor addressing mode is greater than or equal to 1.
The tensor addressing mode in the embodiment of the first aspect of the present application improves and extends the slice addressing mode of the related art so that it contains more parameters. For example, the parameters of the tensor addressing mode include: a start address characterizing the starting point of addressing (denoted start), an end address characterizing the end point of addressing (denoted stop), a step length characterizing the offset of the addressing pointer (denoted step), and a size characterizing the shape of the addressed data (denoted size). The expression of the tensor addressing mode is then [start:stop:step:size], where size describes the size of the addressed data shape: at each step, all element points contained in one shape are extracted, rather than a single point. A single addressing operation can therefore determine the addresses of the tensor elements corresponding to at least one source operand, greatly improving the efficiency of computing those addresses, so that fewer instructions are needed to find the required source operands. This greatly reduces redundant instructions, increases effective instruction density, improves performance and simplifies programming.
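As a rough one-dimensional illustration of the extended slice mode, the following Python sketch is a non-authoritative model: the function name is an assumption, and the handling of the partial flag (whether incomplete trailing shapes are emitted) is a simplified guess at its semantics. It shows that, unlike plain start:stop:step slicing, each pointer position yields the addresses of a whole size-element shape.

```python
def slice_addresses(start, stop, step, size, partial=True):
    """Yield, per step, the list of addresses covered by one size-shape."""
    ptr = start
    while ptr < stop:
        shape = list(range(ptr, min(ptr + size, stop)))
        # Assumption: partial=True also emits incomplete shapes at the edge.
        if len(shape) == size or partial:
            yield shape
        ptr += step

blocks = list(slice_addresses(start=0, stop=8, step=2, size=5))
# The first step already covers five addresses instead of one.
assert blocks[0] == [0, 1, 2, 3, 4]
assert blocks[1] == [2, 3, 4, 5, 6]
assert len(blocks) == 4   # pointer positions 0, 2, 4, 6
```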
In one embodiment, the engine unit in the above AI chip may obtain the source operands required by the vector operation instruction, according to the tensor addressing mode, from the original tensor data stored in an external memory.
S2: Store the obtained source operands into vector registers.
After obtaining the source operands according to the tensor addressing mode, the engine unit may store them into vector registers, where different source operands may be stored in different vector registers so that, when the operands are subsequently read for computation, they can be read in parallel, improving efficiency.
S3: Obtain the source operands required by the vector operation instruction from the vector registers and perform the operation to obtain the operation result.
After the source operands required by the vector operation instruction are obtained, the obtained source operands can be operated on according to the vector operation instruction to obtain the operation result. In an optional embodiment, the vector operation unit in the above AI chip obtains the required source operands from the vector registers, performs the operation, and obtains the result.
It can be understood that, according to the content about the tensor addressing mode disclosed in the embodiments of the first aspect of the present application, efficient addressing can be performed at every stage of tensor processing. For example, in a computation between tensors, apart from the effective operation step that computes the final result, address computation of tensor elements is involved both when reading the tensors to be computed and when writing back the computed tensor result. This is because reading a tensor to be computed requires reading data according to the addresses of its tensor elements; and after the tensors have been read from the corresponding addresses and the operation between them performed, at the stage of outputting the result it may in some scenarios be necessary to write the tensor operation result to the addresses used for storing the tensor result, a writing process that may also require address computation of tensor elements. Based on the principle of the tensor addressing mode disclosed in this application, tensors can be read and written quickly under efficient addressing.
The tensor processing method provided by the embodiment of the first aspect of the present application has the same implementation principle and technical effects as the aforementioned AI chip embodiment. For brevity, for matters not mentioned in the method embodiment, reference may be made to the corresponding content of the aforementioned AI chip embodiment.
In the embodiment according to the above first aspect of the present application, the inventors further noted that, in the step of performing the vector operation, non-effective computation incurs additional cost in the system, so there is a need for an improved AI chip and tensor processing method.
Next, the AI chip and tensor processing method according to the second aspect of the present application are described in detail with reference to Figures 10 to 13.
Figure 10 shows the principle of an AI chip according to an embodiment of the second aspect of the present application. As shown in the figure, the AI chip may include vector registers, an engine unit and a vector operation unit, where the vector registers are connected to the vector operation unit through the engine unit.
According to the illustrated embodiment, the vector registers are used to store the tensor data required for the operation. Tensor data is a set of multidimensional data, usually laid out in some manner (such as a linear layout or a tiled layout) and stored in a device with a storage function, such as a memory, an on-chip memory or a register. The tensor data stored in the vector registers may be moved in from a memory outside the AI chip. The engine unit is connected to the vector registers and is used to obtain, from the tensor data and according to a tensor addressing mode, the source operands required by a vector operation instruction, where the number of source operands the engine unit obtains in one addressing operation according to the tensor addressing mode is greater than or equal to 1. The vector operation unit is connected to the engine unit and is used to operate on the source operands obtained by the engine unit according to the vector operation instruction, obtaining the operation result. A vector operation instruction is an instruction that can compute on two or more operands simultaneously; exemplary types of vector operation instruction include addition, subtraction, multiplication, multiply-accumulate and other operation types.
The AI chip according to the above exemplary embodiment may be, for example, an integrated circuit chip with data processing capability that can be used to process operations between tensors. For specific examples of the above AI chip, reference may be made to the description of the specific examples of the AI chip of the first aspect of this application.
In an optional embodiment of the second aspect of the present application, the AI chip may contain multiple vector registers, and different vector registers may store different tensor data. Take a three-operand vector operation instruction such as A*B+C as an example, where A, B and C are all source operands and each may be a single (tensor) element or an array containing multiple elements. In this case there may be three vector registers, for example vector register 1, vector register 2 and vector register 3: vector register 1 may be used to store the tensor data containing source operand A, vector register 2 the tensor data containing source operand B, and vector register 3 the tensor data containing source operand C.
As mentioned above, in one vector operation instruction each operand corresponds to one addressing mode, and the addressing modes of the multiple operands can be matched against each other; a broadcast operation may be required during matching, after which the shapes of the multiple operands are identical. The present application does not restrict the specific instruction expression form.
It can be understood that different tensor data may also be stored in the same vector register, as long as the register's space is large enough; therefore, the above example in which different vector registers store different tensor data should not be understood as a limitation of the present application.
When there are multiple vector registers, one or more addressing engines may be used to address the tensor data stored in the multiple vector registers.
To improve addressing efficiency, in an optional embodiment, as shown in Figure 11, the engine unit includes multiple addressing engines, one addressing engine corresponding to one vector register, where each addressing engine obtains the source operands required by the vector operation instruction from its corresponding vector register according to its own tensor addressing mode. For example, when three vector registers are used to store tensor data, three addressing engines may address the tensor data stored in the three vector registers respectively: addressing engine 1 corresponds to vector register 1 and obtains the source operands required by the vector operation instruction from vector register 1 according to the tensor addressing mode of addressing engine 1; addressing engine 2 corresponds to vector register 2 and obtains the required source operands from vector register 2 in the same way; addressing engine 3 corresponds to vector register 3 and obtains the required source operands from vector register 3 in the same way. The tensor addressing modes adopted by the addressing engines may be the same or different.
It can be understood that, in another optional embodiment, a single addressing engine may obtain the source operands required by the vector operation instruction from multiple vector registers; for example, addressing engine 1 may obtain source operand A from vector register 1, source operand B from vector register 2 and source operand C from vector register 3.
Each addressing engine, such as addressing engines 1, 2 and 3 above, is independent and does not interfere with the others. When obtaining the source operands required by the vector operation instruction, each addressing engine performs independent addressing with an independent tensor addressing mode; that is, each source operand of the vector operation instruction corresponds to one independent tensor addressing mode. For example, source operand A above corresponds to tensor addressing mode 1, source operand B to tensor addressing mode 2, and source operand C to tensor addressing mode 3. Tensor addressing modes 1, 2 and 3 may contain the same kinds of parameters, for example all five parameters above (start:stop:step:size:partial); the difference is that the specific parameter values may differ.
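A minimal Python sketch of the per-operand arrangement follows. It is an illustrative assumption, not the patented design: each addressing mode is reduced here to a start/step pair, and the class and variable names are invented for the example. Three independent engines each walk their own vector register to feed a fused A*B+C instruction.

```python
class AddressingEngine:
    """Toy per-operand engine: its own register, its own addressing mode."""
    def __init__(self, register, start=0, step=1):
        self.reg, self.i, self.step = register, start, step

    def next_operand(self):
        value = self.reg[self.i]
        self.i += self.step
        return value

vr1, vr2, vr3 = [1, 2, 3], [10, 20, 30], [100, 100, 100]
eng_a = AddressingEngine(vr1)   # independent mode for operand A
eng_b = AddressingEngine(vr2)   # independent mode for operand B
eng_c = AddressingEngine(vr3)   # independent mode for operand C
results = [eng_a.next_operand() * eng_b.next_operand() + eng_c.next_operand()
           for _ in range(3)]
assert results == [110, 140, 190]   # element-wise A*B+C
```

Because each engine holds its own pointer state, the modes never interfere: changing, say, eng_c's step would alter only how operand C is fetched.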
As one embodiment, in addition to its own independent addressing mode, each addressing engine may also be configured to support handling broadcast operations.
To facilitate control of the addressing of the individual addressing engines, in an optional embodiment the engine unit of the AI chip may further include a main engine connected to every addressing engine. The main engine sends control commands to each addressing engine so as to control it to address according to the tensor addressing mode and obtain the source operands required by the vector operation instruction from the corresponding vector register. Introducing a main engine to centrally control these independent addressing engines can improve addressing efficiency. At every step, the main engine may send a control command to the addressing engine of each operand, so that the addressing engine can, according to the control commands, traverse every dimension of the tensor data in order from the lowest dimension to the highest and address within the corresponding dimension.
As in the embodiment according to the first aspect of the present application, in the embodiment according to the second aspect the control commands may include the Advance command, the Reset command and the NOP (no-operation) command, which are not described again here.
In an optional embodiment of the second aspect of the present application, the tensor addressing mode of the present application may further include a nested dual tensor addressing mode, which includes an outer iterator addressing mode and an inner iterator addressing mode, where the inner iterator addresses on the basis of the tensor data obtained by the outer iterator. That is, the AI chip provided by the embodiment of the second aspect of the present application supports tensor addressing not only in the aforementioned new slice addressing mode but also in the dual tensor addressing mode.
For the outer and inner iterator addressing modes, reference may be made to the detailed description of the first aspect of the present application (in conjunction with Figures 7 and 8).
Based on the same inventive concept, an embodiment of the second aspect of the present application further provides a tensor processing method, as shown in Figure 12. The principle of the tensor processing method is described below with reference to Figure 12. The method may be applied to the aforementioned AI chip and to the aforementioned electronic device.
S10: Obtain the source operands required by a vector operation instruction from tensor data according to a tensor addressing mode, where the number of source operands obtained by one addressing operation of the tensor addressing mode is greater than or equal to 1.
The tensor addressing mode in the embodiment of the second aspect of the present application improves and extends the slice addressing mode of the related art so that it contains more parameters. For example, the parameters of the tensor addressing mode include: a start address characterizing the starting point of addressing (denoted start), an end address characterizing the end point of addressing (denoted stop), a step length characterizing the offset of the addressing pointer (denoted step), and a size characterizing the shape of the addressed data (denoted size). The expression of the tensor addressing mode may then be [start:stop:step:size], where size describes the size of the addressed data shape: at each step, all element points contained in one shape are extracted, rather than a single point. One step can therefore determine one or more tensor elements, and a single addressing operation can determine the addresses of the tensor elements corresponding to at least one source operand, greatly improving the efficiency of computing those addresses, so that fewer instructions are needed to find the required source operands. This greatly reduces redundant instructions, increases effective instruction density, improves performance and simplifies programming.
In one embodiment, the engine unit in the above AI chip may obtain the source operands required by the vector operation instruction, according to the tensor addressing mode, from the tensor data stored in the vector registers.
S20: Operate on the obtained source operands according to the vector operation instruction to obtain the operation result.
After the source operands required by the vector operation instruction are obtained, the obtained source operands can be operated on according to the vector operation instruction to obtain the operation result. In an optional embodiment, the vector operation unit in the above AI chip operates on the obtained source operands according to the vector operation instruction to obtain the operation result.
本申请的第二方面的实施例所提供的张量处理方法,其实现原理及产生的技术效果和前述AI芯片实施例相同,为简要描述,方法实施例部分未提及之处,可参考前述AI芯片实施例中相应内容。The tensor processing method provided in the embodiment of the second aspect of the present application has the same implementation principle and technical effects as those of the aforementioned AI chip embodiment. For the sake of brief description, for matters not mentioned in the method embodiment, reference may be made to the corresponding contents in the aforementioned AI chip embodiment.
需要说明的是,本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。It should be noted that the various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the various embodiments can be referenced to each other.
可以理解的是，根据本申请的第二方面的实施例揭露的关于张量寻址模式的内容，可以在对张量进行处理的各个环节进行高效寻址，例如对于张量之间的计算过程，除了计算出最终结果的这一步有效运算步骤外，不论是要读取待计算的张量，还是要写入计算完毕的张量结果，都会涉及到张量元素的地址计算，这是因为在读取待计算的张量时需要根据张量元素的地址去读取数据，而在从相应的地址读取到待计算的张量并进行张量之间的运算后，如果到了输出结果的环节，在一些场景下可能需要将张量的运算结果写入用于存储张量结果的地址，这个写入运算结果的过程也有可能需要进行张量元素的地址计算。基于本申请揭露的关于张量寻址模式的原理，可以在能够高效寻址的情况下对张量进行快速读写。It can be understood that, based on the tensor addressing mode disclosed in the embodiments of the second aspect of the present application, efficient addressing can be performed at every stage of tensor processing. Take computation between tensors as an example: besides the effective operation step that produces the final result, both reading the tensors to be computed and writing back the computed tensor result involve computing tensor-element addresses. This is because, when reading a tensor to be computed, the data must be read according to the addresses of its tensor elements; and after the tensors have been read from the corresponding addresses and the inter-tensor operation has been performed, some scenarios may require, at the result-output stage, writing the tensor operation result to the addresses used to store the tensor result, a process that may likewise require tensor-element address computation. Based on the principle of the tensor addressing mode disclosed in the present application, tensors can be read and written quickly while being addressed efficiently.
与本申请的第一方面和/或第二方面所提供的AI芯片及其相关的张量处理方法相对应的，本申请还提供了一种电子设备。Corresponding to the AI chip and the related tensor processing method provided in the first aspect and/or the second aspect of the present application, the present application also provides an electronic device.
如图13A所示,在根据本申请的一些实施例中,该电子设备可以包括:存储器和如本申请的第一方面所描述的AI芯片。As shown in FIG. 13A , in some embodiments according to the present application, the electronic device may include: a memory and an AI chip as described in the first aspect of the present application.
在根据本申请的一些可选的实施例中,AI芯片可以与存储器连接,用于根据张量寻址模式从存储器存储的原始张量数据中,获取向量运算指令所需的源操作数,并根据向量运算指令,对获取的源操作数进行运算,得到运算结果。In some optional embodiments of the present application, the AI chip can be connected to a memory to obtain source operands required for vector operation instructions from the original tensor data stored in the memory according to the tensor addressing mode, and perform operations on the obtained source operands according to the vector operation instructions to obtain operation results.
如图13B所示,在根据本申请的一些实施例中,该电子设备可以包括:存储器和如本申请的第二方面所描述的AI芯片。As shown in FIG. 13B , in some embodiments according to the present application, the electronic device may include: a memory and an AI chip as described in the second aspect of the present application.
在根据本申请的一些可选的实施例中,AI芯片可以与存储器连接,用于将存储器中存储的张量数据写入AI芯片中的向量寄存器。In some optional embodiments according to the present application, the AI chip can be connected to a memory to write tensor data stored in the memory into a vector register in the AI chip.
在根据本申请的实施例中，存储器用于存储运算所需的张量数据，其可以是常见的各种存储器，比如可以是随机存取存储器（Random Access Memory，RAM），只读存储器（Read Only Memory，ROM），可编程只读存储器（Programmable Read-Only Memory，PROM），可擦除只读存储器（Erasable Programmable Read-Only Memory，EPROM），电可擦除只读存储器（Electric Erasable Programmable Read-Only Memory，EEPROM）。其中，随机存取存储器又可以是静态随机存取存储器（Static Random Access Memory，SRAM）或动态随机存取存储器（Dynamic Random Access Memory，DRAM）。此外，存储器可以是单倍数据速率（Single Data Rate，SDR）存储器，也可以是双倍数据速率（Double Data Rate，DDR）存储器。In an embodiment according to the present application, the memory is used to store the tensor data required for operations. It can be any of various common memories, such as random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), or electrically erasable programmable read-only memory (EEPROM). The random access memory may in turn be static random access memory (SRAM) or dynamic random access memory (DRAM). In addition, the memory may be single data rate (SDR) memory or double data rate (DDR) memory.
为了更好的理解在示例性的电子设备中根据本申请的第一方面的AI芯片与存储器的交互流程,下面以向量运算指令为三操作数的向量运算指令如D=A*B+C为例来进行说明。In order to better understand the interaction process between the AI chip and the memory according to the first aspect of the present application in an exemplary electronic device, a vector operation instruction with three operands such as D=A*B+C is used as an example for explanation below.
首先,将操作数A所在的原始张量数据(如H*W=4*4)写入存储器,将操作数B所在的原始张量数据(如H*W=4*4)写入存储器,将操作数C所在的原始张量数据(如H*W=4*4)写入存储器,引擎单元根据张量寻址模式从操作数A所在的原始张量数据获取向量运算指令的操作数A,并将其写入向量寄存器1中,引擎单元根据张量寻址模式从操作数B所在的原始张量数据获取向量运算指令的操作数B,并将其写入向量寄存器2中,引擎单元根据张量寻址模式从操作数C所在的原始张量数据获取向量运算指令的操作数C,并将其写入向量寄存器3中。在需要进行张量运算的时候,向量运算单元直接从向量寄存器1中获取向量运算指令对应的源操作数A、从向量寄存器2中获取向量运算指令对应的源操作数B、从向量寄存器3中获取向量运算指令对应的源操作数C进行运算,得到运算结果。First, the original tensor data (such as H*W=4*4) where operand A is located is written into the memory, the original tensor data (such as H*W=4*4) where operand B is located is written into the memory, and the original tensor data (such as H*W=4*4) where operand C is located is written into the memory. The engine unit obtains the operand A of the vector operation instruction from the original tensor data where operand A is located according to the tensor addressing mode, and writes it into the vector register 1. The engine unit obtains the operand B of the vector operation instruction from the original tensor data where operand B is located according to the tensor addressing mode, and writes it into the vector register 2. The engine unit obtains the operand C of the vector operation instruction from the original tensor data where operand C is located according to the tensor addressing mode, and writes it into the vector register 3. When tensor operation is required, the vector operation unit directly obtains the source operand A corresponding to the vector operation instruction from the vector register 1, obtains the source operand B corresponding to the vector operation instruction from the vector register 2, and obtains the source operand C corresponding to the vector operation instruction from the vector register 3 to perform the operation to obtain the operation result.
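The decoupling of addressing from computation described above can be sketched as follows. This is a hypothetical software model only: the dictionary-based memory, register names r1–r3, and function names are illustrative assumptions, not the actual chip interface.

```python
import numpy as np

# Model of the first-aspect data flow for D = A*B + C: the engine unit
# gathers operands from memory into vector registers ahead of time; the
# vector unit later computes from the registers alone, so an operation
# can be retried without re-addressing memory.
memory = {
    "A": np.arange(16).reshape(4, 4),
    "B": np.ones((4, 4), dtype=int),
    "C": np.full((4, 4), 2),
}
vector_registers = {}

def engine_fetch(name, reg):
    """Engine unit: address memory and deposit the operand in a register."""
    vector_registers[reg] = memory[name].copy()

def vector_unit_fma():
    """Vector unit: reads only the registers; no addressing involved."""
    a, b, c = (vector_registers[r] for r in ("r1", "r2", "r3"))
    return a * b + c

engine_fetch("A", "r1"); engine_fetch("B", "r2"); engine_fetch("C", "r3")
D = vector_unit_fma()
print(D[0])  # first row of A*B + C → [2 3 4 5]
```

The point of the sketch is the ordering: all three `engine_fetch` calls complete before `vector_unit_fma` runs, mirroring how the operands are staged in vector registers 1–3 before the tensor operation is issued.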
在根据本申请的示例性实施方式中,假设张量寻址模式[start:stop:step:size]=0:4:1:3,则对于每个4*4的原始张量数据,引擎单元都会产生3*3*4的地址,之后会读取这些地址所在的数据,得到3*3*4的数据(即为操作数),并发送给向量寄存器进行存储。其中,引擎单元按照0:4:1:3的寻址模式进行寻址的原理如图14所示。In an exemplary embodiment according to the present application, assuming that the tensor addressing mode [start: stop: step: size] = 0:4:1:3, for each 4*4 original tensor data, the engine unit will generate 3*3*4 addresses, and then read the data at these addresses to obtain 3*3*4 data (i.e., operands), and send them to the vector register for storage. The principle of the engine unit addressing according to the 0:4:1:3 addressing mode is shown in Figure 14.
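The 3*3*4 address count can be reproduced with a short sketch, under the reading (inferred from the description of Figure 14, and stated here as an assumption) that size=3 makes each pointer position yield a complete 3*3 window, and that element addresses follow a row-major layout.

```python
def window_addresses(h, w, start, stop, step, size):
    """Model of 2-D addressing in mode [start:stop:step:size]: slide a
    size*size window over an h*w tensor; every pointer position yields
    all size*size element addresses of the window (row-major offsets).
    Only windows that fit entirely within [start, stop) are kept."""
    starts = [p for p in range(start, stop, step) if p + size <= stop]
    return [[(i + di) * w + (j + dj)
             for di in range(size) for dj in range(size)]
            for i in starts for j in starts]

wins = window_addresses(4, 4, start=0, stop=4, step=1, size=3)
# 2 pointer positions per dimension -> 4 windows of 3*3 addresses,
# i.e. 3*3*4 = 36 addresses for one 4*4 tensor.
print(len(wins), len(wins[0]))  # → 4 9
```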
可以理解的是,在根据本申请的示例性实施方式中,在存储器的存储空间够用的情况下,上述过程中,可以是将上述3个操作数所在原始张量数据全部写入存储器中后,再依次根据张量寻址模式进行寻址,将寻址到的数据写入向量寄存器中;也可以是先将其中某个操作数所在原始张量数据写入存储器中后,再根据张量寻址模式对该操作数所在原始张量数据进行寻址,将寻址到的操作数写入向量寄存器中,之后,再将其中某个操作数所在原始张量数据写入存储器中后,之后再根据张量寻址模式对该操作数所在原始张量数据进行寻址,将寻址到的操作数写入向量寄存器中。本申请中,不对各个操作数所在原始张量数据写入存储器的先后顺序,以及从存储器中写入向量寄存器的先后顺序进行限定。It can be understood that, in an exemplary embodiment according to the present application, when the storage space of the memory is sufficient, in the above process, the original tensor data of the above three operands can be all written into the memory, and then addressed in turn according to the tensor addressing mode, and the addressed data is written into the vector register; or the original tensor data of one of the operands is first written into the memory, and then the original tensor data of the operand is addressed according to the tensor addressing mode, and the addressed operand is written into the vector register, and then the original tensor data of one of the operands is written into the memory, and then the original tensor data of the operand is addressed according to the tensor addressing mode, and the addressed operand is written into the vector register. In this application, the order in which the original tensor data of each operand is written into the memory and the order in which the vector register is written from the memory are not limited.
此外,在根据本申请的实施方式中,在上述过程中,可以是由同一个寻址引擎对不同操作数所在的原始张量数据进行寻址,也可以是由不同的寻址引擎对不同操作数所在的原始张量数据进行寻址。In addition, in an implementation according to the present application, in the above process, the original tensor data containing different operands may be addressed by the same addressing engine, or the original tensor data containing different operands may be addressed by different addressing engines.
为了更好的理解在示例性的电子设备中根据本申请的第二方面的AI芯片与存储器的交互流程,下面以向量运算指令为三操作数的向量运算指令,如D=A*B+C为例来进行说明。In order to better understand the interaction process between the AI chip and the memory according to the second aspect of the present application in an exemplary electronic device, the following is an illustration using a vector operation instruction with three operands, such as D=A*B+C.
首先，将操作数A所在的原始张量数据（如H*W=4*4）写入存储器，将操作数B所在的原始张量数据（如H*W=4*4）写入存储器，将操作数C所在的原始张量数据（如H*W=4*4）写入存储器，把存储器中操作数A所在的原始张量数据写入向量寄存器1，把存储器中操作数B所在的原始张量数据写入向量寄存器2，把存储器中操作数C所在的原始张量数据写入向量寄存器3。在需要进行张量运算的时候，需要通过引擎单元根据张量寻址模式从向量寄存器存储的张量数据中获取向量运算指令所需的源操作数。假设张量寻址模式[start:stop:step:size]=0:4:1:3，则对于每个4*4的原始张量数据，引擎单元都会产生3*3*4的地址，之后会读取这些地址所在的数据，得到3*3*4的数据，并发送给向量运算单元进行运算。其中，引擎单元按照0:4:1:3的寻址模式进行寻址的原理如图14所示。First, the original tensor data containing operand A (e.g., H*W=4*4) is written into the memory, as are the original tensor data containing operand B (e.g., H*W=4*4) and the original tensor data containing operand C (e.g., H*W=4*4); the original tensor data containing operand A is then written from the memory into vector register 1, that containing operand B into vector register 2, and that containing operand C into vector register 3. When a tensor operation is required, the engine unit obtains the source operands required by the vector operation instruction from the tensor data stored in the vector registers according to the tensor addressing mode. Assuming the tensor addressing mode [start:stop:step:size] = 0:4:1:3, then for each 4*4 block of original tensor data the engine unit generates 3*3*4 addresses, reads the data at those addresses to obtain 3*3*4 data elements, and sends them to the vector operation unit for operation. The principle by which the engine unit performs addressing in the 0:4:1:3 addressing mode is shown in Figure 14.
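A minimal model of this second-aspect flow can be sketched as follows, assuming (hypothetically) that register contents behave like NumPy arrays and that the 0:4:1:3 mode extracts every complete 3*3 window. The register names and helper function are illustrative only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Second-aspect model: the full 4*4 tensors already sit in vector
# registers; at compute time the engine unit addresses *within the
# registers* in mode 0:4:1:3 (complete 3*3 windows) and streams the
# selected operands straight to the vector unit.
regs = {r: np.arange(16).reshape(4, 4) for r in ("r1", "r2", "r3")}

def engine_address(reg, size=3):
    """Engine unit: extract every complete size*size window from a register."""
    return sliding_window_view(regs[reg], (size, size)).reshape(-1, size, size)

a, b, c = (engine_address(r) for r in ("r1", "r2", "r3"))
d = a * b + c     # vector unit: elementwise D = A*B + C on the operands
print(a.shape)    # 4 windows of 3*3 operands each, i.e. 3*3*4 in total
```

Here the contrast with the first aspect is where addressing happens: the registers hold raw tensor data, and the engine unit's windowed addressing runs between the registers and the vector operation unit.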
可以理解的是,在根据本申请的示例性实施方式中,在存储器的存储空间够用的情况下,上述过程中,可以是将上述3个操作数所在原始张量数据全部写入存储器中后,再依次将各个操作数所在原始张量数据写入向量寄存器中;也可以是先将其中某个操作数所在原始张量数据写入存储器中后,再将该操作数所在原始张量数据写入向量寄存器中,之后,再将其中某个操作数所在原始张量数据写入存储器中后,之后再将该操作数所在原始张量数据写入向量寄存器中。本申请中,不对各个操作数所在原始张量数据写入存储器的先后顺序,以及从存储器中写入向量寄存器的先后顺序进行限定。It is understandable that, in an exemplary embodiment according to the present application, when the storage space of the memory is sufficient, in the above process, the original tensor data of the above three operands may be all written into the memory, and then the original tensor data of each operand may be written into the vector register in sequence; or the original tensor data of one of the operands may be first written into the memory, and then the original tensor data of the operand may be written into the vector register, and then the original tensor data of one of the operands may be written into the memory, and then the original tensor data of the operand may be written into the vector register. In the present application, the order in which the original tensor data of each operand is written into the memory and the order in which the original tensor data of the operand is written into the vector register are not limited.
其中,上述的电子设备可以是但不限于手机、平板、电脑、服务器、车载设备、可穿戴设备、边缘盒子等电子设备。Among them, the above-mentioned electronic devices can be but are not limited to mobile phones, tablets, computers, servers, vehicle-mounted devices, wearable devices, edge boxes and other electronic devices.
需要说明的是,本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。It should be noted that the various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the various embodiments can be referenced to each other.
以上所述，仅为本申请的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应以权利要求的保护范围为准。The above descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
工业实用性Industrial Applicability
本申请提供了一种AI芯片、电子设备及张量处理方法，属于计算机技术领域。该AI芯片包括：向量寄存器、引擎单元、向量运算单元；引擎单元用于根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数；向量寄存器用于存储引擎单元获取的源操作数；向量运算单元用于从向量寄存器中获取向量运算指令的源操作数进行运算，得到运算结果。本申请中，利用向量寄存器来存储引擎单元所获取的源操作数，使得向量运算单元可以直接从向量寄存器中获取向量运算指令所需的源操作数进行运算，这样的设计方式可以实现寻址与运算相分离，可支持提前获取向量运算指令所需的源操作数，后续在进行运算时，直接读取即可，有利于提高运算效率。The present application provides an AI chip, an electronic device, and a tensor processing method, belonging to the field of computer technology. The AI chip includes: a vector register, an engine unit, and a vector operation unit. The engine unit is used to obtain, according to a tensor addressing mode, the source operands required by a vector operation instruction from original tensor data; the vector register is used to store the source operands obtained by the engine unit; and the vector operation unit is used to obtain the source operands of the vector operation instruction from the vector register and operate on them to obtain an operation result. In the present application, the vector register stores the source operands obtained by the engine unit, so that the vector operation unit can obtain the source operands required by the vector operation instruction directly from the vector register. This design decouples addressing from computation and supports fetching the source operands required by a vector operation instruction in advance; when the operation is subsequently performed, the operands can be read directly, which helps improve operation efficiency.
此外,本申请还提供了另一种AI芯片、电子设备及张量处理方法。该AI芯片包括:向量寄存器、引擎单元、向量运算单元;向量寄存器用于存储运算所需的张量数据;引擎单元与向量寄存器连接,引擎单元用于根据张量寻址模式从向量寄存器存储的张量数据中,获取向量运算指令所需的源操作数,引擎单元根据张量寻址模式一次寻址获取的源操作数个数大于或等于1;向量运算单元与引擎单元连接,向量运算单元用于根据向量运算指令对引擎单元获取的源操作数进行运算,得到运算结果。本申请中,采用张量寻址模式进行寻址,使得一次寻址可获得至少一个源操作数,极大提高了获取源操作数的效率,使得只需要较少的指令便可寻找到所需的源操作数。In addition, the present application also provides another AI chip, electronic device and tensor processing method. The AI chip includes: a vector register, an engine unit, and a vector operation unit; the vector register is used to store the tensor data required for the operation; the engine unit is connected to the vector register, and the engine unit is used to obtain the source operands required for the vector operation instruction from the tensor data stored in the vector register according to the tensor addressing mode, and the number of source operands obtained by the engine unit at one time according to the tensor addressing mode is greater than or equal to 1; the vector operation unit is connected to the engine unit, and the vector operation unit is used to operate on the source operands obtained by the engine unit according to the vector operation instruction to obtain the operation result. In the present application, the tensor addressing mode is used for addressing, so that at least one source operand can be obtained by one addressing, which greatly improves the efficiency of obtaining the source operand, so that only fewer instructions are needed to find the required source operand.
可以理解的是,本申请的AI芯片、电子设备及张量处理方法是可以重现的,并且可以用在多种工业应用中。例如,本申请的AI芯片、电子设备及张量处理方法可以用于计算机技术领域。 It is understandable that the AI chip, electronic device, and tensor processing method of the present application are reproducible and can be used in a variety of industrial applications. For example, the AI chip, electronic device, and tensor processing method of the present application can be used in the field of computer technology.

Claims (23)

  1. 一种AI芯片,其中,所述AI芯片包括:An AI chip, wherein the AI chip comprises:
    引擎单元,用于根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数,所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;An engine unit, configured to obtain source operands required for vector operation instructions from raw tensor data according to a tensor addressing mode, wherein the number of source operands obtained by one addressing in the tensor addressing mode is greater than or equal to 1;
    向量寄存器,与所述引擎单元连接,所述向量寄存器用于存储所述引擎单元获取的源操作数;A vector register connected to the engine unit, the vector register being used to store a source operand obtained by the engine unit;
    向量运算单元,与所述向量寄存器连接,所述向量运算单元用于从所述向量寄存器中获取所述向量运算指令的源操作数进行运算,得到运算结果。A vector operation unit is connected to the vector register, and is used to obtain the source operand of the vector operation instruction from the vector register to perform operation and obtain the operation result.
  2. 根据权利要求1所述的AI芯片,其中,所述张量寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。The AI chip according to claim 1, wherein the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  3. 根据权利要求2所述的AI芯片,其中,所述张量寻址模式的参数还包括表征寻址所得数据形状为非完整形状的保留情况的特征参数。The AI chip according to claim 2, wherein the parameters of the tensor addressing mode further include characteristic parameters characterizing a situation in which a shape of the data obtained by addressing is retained as an incomplete shape.
  4. 根据权利要求1至3中的任一项所述的AI芯片,其中,所述张量寻址模式包括嵌套的双重张量寻址模式,所述双重张量寻址模式包括:外迭代寻址模式和内迭代寻址模式,其中,所述内迭代寻址模式在所述外迭代寻址模式寻址得到的张量数据上进行寻址。The AI chip according to any one of claims 1 to 3, wherein the tensor addressing mode includes a nested dual tensor addressing mode, and the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode, wherein the internal iterative addressing mode addresses the tensor data addressed by the external iterative addressing mode.
  5. 根据权利要求1至4中的任一项所述的AI芯片,其中,所述引擎单元还与外部存储器连接,所述外部存储器,用于存储所述原始张量数据;所述引擎单元包括:The AI chip according to any one of claims 1 to 4, wherein the engine unit is further connected to an external memory, and the external memory is used to store the original tensor data; the engine unit comprises:
    多个寻址引擎,一个所述寻址引擎对应一个所述向量寄存器,每个所述寻址引擎,用于根据各自的张量寻址模式从对应的原始张量数据中获取向量运算指令所需的源操作数。Multiple addressing engines, one addressing engine corresponds to one vector register, and each addressing engine is used to obtain source operands required by vector operation instructions from corresponding original tensor data according to the respective tensor addressing modes.
  6. 根据权利要求5所述的AI芯片,其中,所述引擎单元还包括:The AI chip according to claim 5, wherein the engine unit further comprises:
    主引擎,用于向每个所述寻址引擎发送控制命令,控制每个所述寻址引擎根据各自的张量寻址模式进行寻址,从对应的原始张量数据中获取向量运算指令所需的源操作数。The main engine is used to send control commands to each of the addressing engines, control each of the addressing engines to perform addressing according to their respective tensor addressing modes, and obtain source operands required for vector operation instructions from corresponding original tensor data.
  7. 一种张量处理方法,其中,所述张量处理方法包括:A tensor processing method, wherein the tensor processing method comprises:
    根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数,其中,所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;Obtaining source operands required for the vector operation instruction from the original tensor data according to the tensor addressing mode, wherein the number of source operands obtained by the tensor addressing mode at one time is greater than or equal to 1;
    将获取的源操作数存储至向量寄存器;Store the fetched source operand into the vector register;
    从所述向量寄存器中获取所述向量运算指令的源操作数以进行运算,得到运算结果。The source operand of the vector operation instruction is obtained from the vector register to perform operation and obtain the operation result.
  8. 根据权利要求7所述的方法,其中,根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数,包括:The method according to claim 7, wherein obtaining source operands required for the vector operation instruction from the original tensor data according to the tensor addressing mode comprises:
    利用寻址引擎在主引擎发送的控制命令下,根据张量寻址模式从原始张量数据中获取向量运算指令所需的源操作数。The addressing engine is used to obtain the source operands required for the vector operation instruction from the original tensor data according to the tensor addressing mode under the control command sent by the main engine.
  9. 根据权利要求8所述的方法，其中，所述控制命令包括：Advance前进命令、Reset重置命令、NOP空操作命令三个控制命令，所述主引擎每次向每个所述寻址引擎发送控制命令时，发送由所述三个控制命令中的至少一个控制命令组合而成的组合控制命令，以控制每个所述寻址引擎根据各自的张量寻址模式在所述原始张量数据的不同维度进行寻址。The method according to claim 8, wherein the control commands comprise three control commands: an Advance command, a Reset command, and a NOP (no-operation) command; and each time the main engine sends a control command to each of the addressing engines, it sends a combined control command composed of at least one of the three control commands, so as to control each of the addressing engines to perform addressing in different dimensions of the original tensor data according to its respective tensor addressing mode.
  10. 根据权利要求7至9中的任一项所述的方法,其中,所述张量寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。According to the method described in any one of claims 7 to 9, the parameters of the tensor addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  11. 一种AI芯片,其中,包括:An AI chip, comprising:
    向量寄存器,用于存储运算所需的张量数据;Vector registers, used to store tensor data required for operations;
    引擎单元,与所述向量寄存器连接,所述引擎单元用于根据张量寻址模式,从所述向量寄存器存储的张量数据中获取向量运算指令所需的源操作数,其中,所述引擎单元根据所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;An engine unit connected to the vector register, the engine unit being used to obtain source operands required for a vector operation instruction from tensor data stored in the vector register according to a tensor addressing mode, wherein the number of source operands obtained by the engine unit at one time according to the tensor addressing mode is greater than or equal to 1;
    向量运算单元,与所述引擎单元连接,所述向量运算单元用于根据所述向量运算指令对所述引擎单元获取的源操作数进行运算,得到运算结果。A vector operation unit is connected to the engine unit, and is used to operate on the source operand obtained by the engine unit according to the vector operation instruction to obtain an operation result.
  12. 根据权利要求11所述的AI芯片,其中,所述引擎单元包括:The AI chip according to claim 11, wherein the engine unit comprises:
    多个寻址引擎,一个所述寻址引擎对应一个所述向量寄存器,每个所述寻址引擎,用于根据各自的张量寻址模式从对应的向量寄存器中获取向量运算指令所需的源操作数。Multiple addressing engines, one addressing engine corresponds to one vector register, and each addressing engine is used to obtain source operands required by vector operation instructions from the corresponding vector register according to the respective tensor addressing modes.
  13. 根据权利要求12所述的AI芯片,其中,所述引擎单元还包括:The AI chip according to claim 12, wherein the engine unit further comprises:
    主引擎,用于向每个所述寻址引擎发送控制命令,控制每个所述寻址引擎根据各自的张量寻址模式进行寻址,从对应的向量寄存器中获取向量运算指令所需的源操作数。The main engine is used to send control commands to each of the addressing engines, control each of the addressing engines to perform addressing according to their respective tensor addressing modes, and obtain source operands required by vector operation instructions from corresponding vector registers.
  14. 根据权利要求13所述的AI芯片,其中,若所述张量数据的第一维度需要广播,所述寻址引擎用于根据所述主引擎发送的Advance前进命令在所述张量数据的第一维度上进行寻址时,保持寻址指针不变,以持续获取当前寻址指针所指向的源操作数。The AI chip according to claim 13, wherein, if the first dimension of the tensor data needs to be broadcast, the addressing engine is used to keep the addressing pointer unchanged when addressing on the first dimension of the tensor data according to the Advance command sent by the main engine, so as to continuously obtain the source operand pointed to by the current addressing pointer.
  15. 根据权利要求11至14中的任一项所述的AI芯片,其中,所述张量寻址模式包括嵌套的双重张量寻址模式,所述双重张量寻址模式包括:外迭代寻址模式和内迭代寻址模式,其中,所述内迭代寻址模式在所述外迭代寻址模式寻址得到的张量数据上进行寻址。The AI chip according to any one of claims 11 to 14, wherein the tensor addressing mode includes a nested dual tensor addressing mode, and the dual tensor addressing mode includes: an external iterative addressing mode and an internal iterative addressing mode, wherein the internal iterative addressing mode addresses the tensor data addressed by the external iterative addressing mode.
  16. 根据权利要求15所述的AI芯片,其中,所述外迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸、表征寻址所得数据形状为非完整形状的保留情况的特征参数。The AI chip according to claim 15, wherein the parameters of the external iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, a size representing the shape of the data obtained by addressing, and a characteristic parameter representing the retention of the incomplete shape of the data obtained by addressing.
  17. 根据权利要求15或16所述的AI芯片,其中,所述内迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。The AI chip according to claim 15 or 16, wherein the parameters of the inner iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step size representing the offset amplitude of the addressing pointer, and a size representing the shape of the data obtained by addressing.
  18. 根据权利要求15至17中任一项所述的AI芯片,其中,所述外迭代寻址模式,用于从所述张量数据中选定至少一个候选区域,所述候选区域包含多个张量元素;The AI chip according to any one of claims 15 to 17, wherein the outer iterative addressing mode is used to select at least one candidate region from the tensor data, and the candidate region includes a plurality of tensor elements;
    所述内迭代寻址模式,用于从所述至少一个候选区域中获取向量运算指令所需的源操作数。The inner iteration addressing mode is used to obtain source operands required by the vector operation instruction from the at least one candidate area.
  19. 根据权利要求15至18中任一项所述的AI芯片，其中，所述外迭代寻址模式、内迭代寻址模式均为新切片寻址模式、切片寻址模式、索引寻址模式中的一种，所述新切片寻址模式相比于所述切片寻址模式包含更多寻址参数，所述更多寻址参数包括：表征寻址所得数据形状的尺寸。The AI chip according to any one of claims 15 to 18, wherein each of the outer iterative addressing mode and the inner iterative addressing mode is one of a new slice addressing mode, a slice addressing mode, and an index addressing mode, the new slice addressing mode containing more addressing parameters than the slice addressing mode, the additional addressing parameters including: a size representing the shape of the data obtained by addressing.
  20. 一种张量处理方法,其中,包括: A tensor processing method, comprising:
    根据张量寻址模式从张量数据中获取向量运算指令所需的源操作数,其中,根据所述张量寻址模式一次寻址获取的源操作数个数大于或等于1;Obtaining source operands required for the vector operation instruction from the tensor data according to the tensor addressing mode, wherein the number of source operands obtained by one addressing according to the tensor addressing mode is greater than or equal to 1;
    根据所述向量运算指令,对获取的源操作数进行运算,得到运算结果。According to the vector operation instruction, the obtained source operand is operated to obtain an operation result.
  21. 根据权利要求20所述的方法,其中,所述张量寻址模式包括嵌套的双重张量寻址模式,所述双重张量寻址模式包括外迭代寻址模式和内迭代寻址模式;根据张量寻址模式从张量数据中获取向量运算指令所需的源操作数,包括:The method according to claim 20, wherein the tensor addressing mode comprises a nested dual tensor addressing mode, the dual tensor addressing mode comprises an outer iteration addressing mode and an inner iteration addressing mode; and obtaining a source operand required for a vector operation instruction from tensor data according to the tensor addressing mode comprises:
    利用所述外迭代寻址模式从所述张量数据中选定至少一个候选区域,所述候选区域包含多个张量元素;Selecting at least one candidate region from the tensor data using the outer iterative addressing mode, the candidate region comprising a plurality of tensor elements;
    利用所述内迭代寻址模式从所述至少一个候选区域中获取向量运算指令所需的源操作数。The inner iteration addressing mode is used to obtain source operands required by the vector operation instruction from the at least one candidate area.
  22. 根据权利要求21所述的方法,其中,所述外迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸、表征寻址所得数据形状为非完整形状的保留情况的特征参数;和/或,The method according to claim 21, wherein the parameters of the external iterative addressing mode include: a starting address representing the starting point of addressing, an end address representing the end point of addressing, a step length representing the offset amplitude of the addressing pointer, a size representing the shape of the data obtained by addressing, and a characteristic parameter representing the retention of the shape of the data obtained by addressing as an incomplete shape; and/or,
    所述内迭代寻址模式的参数包括:表征寻址起点的起始地址、表征寻址终点的终点地址、表征寻址指针偏移幅度的步长、表征寻址所得数据形状的尺寸。The parameters of the inner iterative addressing mode include: a starting address representing the addressing start point, an end address representing the addressing end point, a step length representing the addressing pointer offset amplitude, and a size representing the shape of the data obtained by addressing.
  23. An electronic device, wherein the electronic device comprises a memory for storing tensor data required for operations, the electronic device further comprising:
    the AI chip according to any one of claims 1 to 6, the AI chip being connected to the memory and configured to obtain, according to a tensor addressing mode, source operands required by a vector operation instruction from original tensor data stored in the memory, and to perform an operation on the obtained source operands according to the vector operation instruction to obtain an operation result; or
    the AI chip according to any one of claims 11 to 19, the AI chip being connected to the memory and configured to write the tensor data stored in the memory into a vector register in the AI chip.
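The two-level addressing described in claims 21 and 22 can be sketched in ordinary code. The sketch below is illustrative only: the function and parameter names are hypothetical, a flat Python list stands in for tensor data in memory, and the claims do not prescribe this particular implementation. The outer level walks candidate regions using a start address, end address, step size, and region size; the inner level gathers source operands within each region.

```python
def outer_iterate(tensor, start, end, step, size):
    """Outer addressing: yield candidate regions of `size` elements,
    advancing the region pointer by `step` from `start` toward `end`."""
    addr = start
    while addr < end:
        region = tensor[addr:addr + size]
        if len(region) == size:  # complete region
            yield region
        # An incomplete tail region would be kept or dropped according to
        # the "retention" characteristic parameter of claim 22 (not modeled here).
        addr += step

def inner_iterate(region, start, end, step):
    """Inner addressing: gather source operands within one candidate region."""
    return [region[i] for i in range(start, min(end, len(region)), step)]

tensor = list(range(16))
operands = [inner_iterate(r, 0, 4, 2)
            for r in outer_iterate(tensor, 0, 16, 4, 4)]
# each 4-element region contributes the elements at offsets 0 and 2:
# [[0, 2], [4, 6], [8, 10], [12, 14]]
```

Because the addressing parameters are held separately from the operand fetch, the same region selection can be replayed without recomputing addresses, which is the efficiency point the description raises against purely real-time addressing.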
PCT/CN2023/096078 2022-12-14 2023-05-24 Ai chip, tensor processing method, and electronic device WO2024124807A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202211597989.XA CN115658146B (en) 2022-12-14 2022-12-14 AI chip, tensor processing method and electronic equipment
CN202211597988.5 2022-12-14
CN202211597988.5A CN115599442B (en) 2022-12-14 2022-12-14 AI chip, electronic equipment and tensor processing method
CN202211597989.X 2022-12-14

Publications (1)

Publication Number Publication Date
WO2024124807A1 true WO2024124807A1 (en) 2024-06-20

Family

ID=91484338
