WO2022001498A1 - 计算装置、集成电路芯片、板卡、电子设备和计算方法 - Google Patents

计算装置、集成电路芯片、板卡、电子设备和计算方法 Download PDF

Info

Publication number
WO2022001498A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
array
processing circuits
processing
dimensional
Prior art date
Application number
PCT/CN2021/095703
Other languages
English (en)
French (fr)
Inventor
刘少礼
陶劲桦
刘道福
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Publication of WO2022001498A1 publication Critical patent/WO2022001498A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Definitions

  • the present disclosure relates generally to the field of computing. More particularly, the present disclosure relates to a computing device, an integrated circuit chip, a board, an electronic device, and a computing method.
  • an instruction set is a set of instructions for performing computations and controlling the computing system, and plays a key role in improving the performance of computing chips (eg, processors) in the computing system.
  • Various current computing chips can complete various general or specific control operations and data processing operations by using associated instruction sets.
  • the current instruction set still has many defects.
  • the existing instruction set is limited by the hardware architecture and offers little flexibility.
  • many instructions can only complete a single operation, and the execution of multiple operations usually requires multiple instructions, potentially resulting in increased on-chip I/O data throughput.
  • the current instructions still have room for improvement in execution speed, execution efficiency, and on-chip power consumption.
  • the arithmetic instructions of conventional CPUs are designed to perform basic single-data scalar arithmetic operations.
  • a single-data operation means that each operand of the instruction is scalar data.
  • in many computing tasks, the operands involved are often multi-dimensional vector (ie, tensor) data types, and scalar operations alone cannot enable hardware to complete computing tasks efficiently. Therefore, how to efficiently perform multi-dimensional tensor operations is also an urgent problem to be solved in the current computing field.
  • the present disclosure provides a hardware architecture with a processing circuit array.
  • the solution of the present disclosure can obtain technical advantages in various aspects including enhancing the processing performance of the hardware, reducing power consumption, improving the execution efficiency of computing operations, and avoiding computing overhead.
  • the solution of the present disclosure supports efficient memory access and processing of tensor data on the basis of the aforementioned hardware architecture, thereby accelerating tensor operations and reducing computational overhead when multi-dimensional vector operands are included in the calculation instructions.
  • the present disclosure provides a computing device, comprising: a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured into a plurality of processing circuit sub-arrays and performs a multi-threaded operation in response to receiving a plurality of operation instructions obtained by parsing a calculation instruction received by the computing device, wherein the operands of the calculation instruction include a descriptor for indicating the shape of a tensor, the descriptor being used to determine the storage address of the data corresponding to the operand, and at least one processing circuit sub-array is configured to execute at least one of the plurality of operation instructions.
  • the present disclosure provides an integrated circuit chip including a computing device as described above and described in various embodiments below.
  • the present disclosure provides a board including an integrated circuit chip as described above and described in various embodiments below.
  • the present disclosure provides an electronic device comprising an integrated circuit chip as described above and described in various embodiments below.
  • the present disclosure provides a method of performing computation using the aforementioned computing device, wherein the computing device includes a processing circuit array formed by connecting a plurality of processing circuits in a one-dimensional or multi-dimensional array structure, and the processing circuit array is configured into a plurality of processing circuit sub-arrays. The method includes: receiving a calculation instruction at the computing device and parsing it to obtain a plurality of operation instructions, wherein the operands of the calculation instruction include a descriptor for indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand; and, in response to receiving the plurality of operation instructions, performing multi-threaded operations using the plurality of processing circuit sub-arrays, wherein at least one processing circuit sub-array of the plurality of processing circuit sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.
  • an appropriate processing circuit array can be constructed according to computing requirements, so that computing instructions can be executed efficiently, computing overhead can be reduced, and I/O data throughput can be reduced.
  • the processing circuits of the present disclosure can be configured to support corresponding operations according to operation requirements, the number of operands of the calculation instructions of the present disclosure can be increased or decreased according to operation requirements, and the operation code can specify any selection and combination of the operation types supported by the processing circuit array, thereby expanding the application scenarios and adaptability of the hardware architecture.
  • Figure 1a is a block diagram illustrating a computing device according to one embodiment of the present disclosure
  • Figure 1b is a schematic diagram illustrating a data storage space according to one embodiment of the present disclosure
  • Figure 2a is a block diagram illustrating a computing device according to another embodiment of the present disclosure.
  • Figure 2b is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure.
  • FIG. 3 is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure.
  • FIG. 4 is an example block diagram illustrating various types of processing circuit arrays of a computing device according to an embodiment of the present disclosure
  • Figures 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure
  • Figures 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure
  • Figures 7a, 7b, 7c and 7d are schematic diagrams illustrating various looped configurations of processing circuits according to embodiments of the present disclosure
  • FIGS. 8a, 8b and 8c are schematic diagrams illustrating further various looped configurations of processing circuits according to embodiments of the present disclosure
  • Figures 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-operation circuit according to an embodiment of the present disclosure
  • Figures 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by post-operation circuits according to embodiments of the present disclosure
  • FIG. 11 is a simplified flowchart illustrating a method of using a computing device to perform an arithmetic operation according to an embodiment of the present disclosure
  • FIG. 12 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • the disclosed solution provides a hardware architecture that supports multi-threaded operations.
  • the computing device includes at least a plurality of processing circuits, wherein the plurality of processing circuits are connected according to different configurations to form a one-dimensional or multi-dimensional array structure.
  • the processing circuit array may be configured into a plurality of processing circuit sub-arrays, and each processing circuit sub-array may be configured to execute at least one operational instruction among the plurality of operational instructions.
  • the operands of the computing instructions of the present disclosure may include a descriptor for indicating the shape of a tensor, and the descriptor may be used to determine the storage address of the data (eg, a tensor) corresponding to the operand, so that the processing circuit sub-array can read and save tensor data according to this storage address to perform tensor operations.
  • computing operations including tensor operations can be efficiently performed, which expands the application scenarios of computing and reduces computing overhead.
  • FIG. 1a is a block diagram illustrating a computing device 80 according to one embodiment of the present disclosure.
  • the computing device 80 includes an array of processing circuits formed by a plurality of processing circuits 104 .
  • the plurality of processing circuits are connected in a two-dimensional array structure to form a processing circuit array, and include a plurality of processing circuit sub-arrays, such as the plurality of one-dimensional processing circuit sub-arrays M1, M2, ..., Mn shown in the figure.
  • the two-dimensionally structured processing circuit array and the plurality of one-dimensional processing circuit sub-arrays shown here are merely exemplary and non-limiting; the processing circuit array of the present disclosure can be configured as array structures of different dimensions, and one or more closed loops may be formed within a processing circuit sub-array or between processing circuit sub-arrays, as in the example connections shown in Figures 5-8 described later.
  • the processing circuit array of the present disclosure may be configured to perform multi-threaded operations, such as single-instruction multi-threading ("SIMT") instructions. Further, each processing circuit sub-array may be configured to execute at least one operation instruction among the aforementioned plurality of operation instructions.
  • the aforementioned plurality of operation instructions may be micro-instructions or control signals running inside the computing device (or processing circuit, processor), and may include (or instruct) one or more operations to be performed by the computing device.
  • such operations may include, but are not limited to, addition operations, multiplication operations, convolution operations, pooling operations, and other operations, and these operations may also involve tensor operations.
  • operands of the compute instructions of the present disclosure may include descriptors for indicating the shape of the tensors.
  • the above-mentioned multiple operation instructions may include at least one multi-stage pipeline operation.
  • the aforementioned multi-stage pipeline operation may include at least two operation instructions.
  • the operation instructions of the present disclosure may include predicates, and each of the processing circuits determines whether to execute the operation instructions associated therewith according to the predicates.
  • the processing circuit of the present disclosure can flexibly perform various arithmetic operations, including but not limited to arithmetic operations, logical operations, comparison operations, and table look-up operations, according to the configuration.
  • the processing circuit sub-array M1 can serve as the first-stage pipeline operation unit in the pipeline operation
  • the processing circuit sub-array M2 can serve as the second-stage pipeline operation unit in the pipeline operation
  • the processing circuit sub-array Mn can serve as the nth-stage pipeline operation unit in the pipeline operation
  • starting from the first-stage pipeline operation unit, the operations of each stage are performed from top to bottom until the nth-stage pipeline operation is completed.
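The staged flow described above, where each sub-array M1..Mn acts as one pipeline stage and data passes from the first stage down to the nth, can be sketched as follows. This is an illustrative software model, not the patented hardware; the example stages (multiply, bias, activation) are assumptions for demonstration.

```python
# Model each processing circuit sub-array M1..Mn as one stage of an
# n-stage pipeline: the output of stage k feeds the input of stage k+1.

def run_pipeline(stages, data):
    """Pass `data` through each pipeline stage (sub-array) in sequence."""
    for stage in stages:
        data = stage(data)
    return data

# Example: a hypothetical 3-stage pipeline.
stages = [
    lambda x: [v * 2 for v in x],        # stage M1: elementwise multiply
    lambda x: [v + 1 for v in x],        # stage M2: add a bias
    lambda x: [max(v, 0) for v in x],    # stage M3: ReLU-style activation
]
print(run_pipeline(stages, [-3, 0, 4]))  # [0, 1, 9]
```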
  • the processing circuit array of the present disclosure may be a one-dimensional array in some scenarios, and one or more processing circuits in the processing circuit array may be configured as a processing circuit sub-array.
  • the processing circuit array of the present disclosure may be a two-dimensional array, wherein one or more rows of processing circuits in the processing circuit array are configured as one of the processing circuit sub-arrays; or one or more columns of processing circuits in the processing circuit array are configured as one of the processing circuit sub-arrays; or one or more lines of processing circuits along a diagonal direction in the processing circuit array are configured as one of the processing circuit sub-arrays.
  • the present disclosure can also provide corresponding calculation instructions, and configure and construct a processing circuit array based on the calculation instructions, so as to realize multi-stage pipeline operation.
  • src0~src4 are source operands (in some computing scenarios they can be, for example, tensors represented by the descriptors of the present disclosure)
  • op0~op3 are opcodes
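A calculation instruction carrying operands src0~src4 and opcodes op0~op3 might be decomposed as follows. The patent does not fix a textual encoding, so this parser and its field conventions are assumptions for illustration only.

```python
# Hypothetical decoder: split a calculation instruction into its
# opcode fields (op*) and source operand fields (src*).

def parse_calc_instruction(text):
    """Return (opcodes, source operands) from a space/comma separated form."""
    fields = text.replace(",", " ").split()
    ops = [f for f in fields if f.startswith("op")]
    srcs = [f for f in fields if f.startswith("src")]
    return ops, srcs

ops, srcs = parse_calc_instruction("op0 op1, src0, src1, src2")
print(ops, srcs)  # ['op0', 'op1'] ['src0', 'src1', 'src2']
```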
  • the aforementioned data conversion operations may be performed by processing circuits in an array of processing circuits, or performed by additional operational circuits, such as post-operation circuits described in detail later in conjunction with FIG. 3 .
  • the processing circuits can be configured to support corresponding operations according to operation requirements, the number of operands of the calculation instructions of the present disclosure can be increased or decreased according to operation requirements, and the operation code can specify any selection and combination of the operation types supported by the processing circuit array.
  • the connection between the multiple processing circuits disclosed in the present disclosure can be either a hardware-based configured connection (or "hard connection"), or a logical connection configured by software through configuration instructions (for example, a software configuration based on a specific hardware connection) (or "soft connection").
  • the array of processing circuits may form a closed loop in at least one of one or more dimensions, which is referred to as a "looped structure" in the context of this disclosure.
  • the computing operation of the present disclosure further includes using the descriptor to obtain information about the shape of the tensor, so as to determine the storage address of the tensor data, and then to obtain and save the tensor data through that storage address.
  • tensors can contain various forms of data composition.
  • Tensors can be of different dimensions. For example, a scalar can be regarded as a 0-dimensional tensor, a vector can be regarded as a 1-dimensional tensor, and a matrix can be regarded as a tensor of 2 or more dimensions.
  • the shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, for a tensor:
  • the shape of the tensor can be described by the descriptor as (2, 4); that is, the two parameters indicate that the tensor is a two-dimensional tensor, the size of the first dimension (columns) of the tensor is 2, and the size of the second dimension (rows) is 4. It should be noted that the present application does not limit the manner in which the descriptor indicates the shape of the tensor.
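The (2, 4) example above can be sketched as a small descriptor record holding an identifier and the per-dimension sizes. The class name and fields are illustrative choices, not part of the patent.

```python
# Minimal descriptor sketch: identifier plus the shape parameters
# (one size per dimension) of the tensor data it indicates.
from dataclasses import dataclass

@dataclass
class TensorDescriptor:
    ident: int     # descriptor identifier, used to distinguish descriptors
    shape: tuple   # size of each dimension, e.g. (2, 4)

    @property
    def ndim(self):
        """Number of dimensions of the described tensor."""
        return len(self.shape)

d = TensorDescriptor(ident=0, shape=(2, 4))
print(d.ndim, d.shape)  # 2 (2, 4)
```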
  • the value of N may be determined according to the dimension (order) of the tensor data, and may also be set according to the usage requirements of the tensor data. For example, when the value of N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (eg offset, size, etc.) of the three-dimensional tensor data in the three-dimensional direction. It should be understood that those skilled in the art can set the value of N according to actual needs, which is not limited in the present disclosure.
  • the descriptor may include the identifier of the descriptor and/or the content of the descriptor.
  • the identifier of the descriptor is used to distinguish the descriptor, for example, the identifier of the descriptor may be numbered; the content of the descriptor may include at least one shape parameter representing the shape of the tensor data.
  • for example, when the tensor data is 3-dimensional and the shape parameters of two of the three dimensions are fixed, the content of the descriptor may include only a shape parameter representing the remaining dimension of the tensor data.
  • the identifier and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, on-chip SRAM or other medium caches, and the like.
  • the tensor data indicated by the descriptor can be stored in data storage space (internal memory or external memory), such as on-chip cache or off-chip memory, etc.
  • the present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.
  • the identifier, the content of the descriptor, and the tensor data indicated by the descriptor can be stored in the same area of the internal memory; for example, a continuous area of the on-chip cache, with addresses ADDR0-ADDR1023, can be used to store this related information.
  • the addresses ADDR0-ADDR63 can be used as the descriptor storage space to store the identifier and content of the descriptor
  • the addresses ADDR64-ADDR1023 can be used as the data storage space to store the tensor data indicated by the descriptor.
  • addresses ADDR0-ADDR31 can be used to store the identifier of the descriptor
  • addresses ADDR32-ADDR63 can be used to store the content of the descriptor.
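The address partition described above (ADDR0-ADDR31 for descriptor identifiers, ADDR32-ADDR63 for descriptor content, ADDR64-ADDR1023 for tensor data) can be sketched as a simple region lookup. The constants mirror the example in the text; the function name is an illustrative assumption.

```python
# Sketch of the agreed address partition; "ADDR" is an address unit,
# not a fixed bit/byte width.
DESC_ID_BASE, DESC_ID_END = 0, 31          # descriptor identifiers
DESC_CONTENT_BASE, DESC_CONTENT_END = 32, 63   # descriptor contents
DATA_BASE, DATA_END = 64, 1023             # tensor data

def region_of(addr):
    """Classify an address unit into its storage region."""
    if DESC_ID_BASE <= addr <= DESC_ID_END:
        return "descriptor identifier"
    if DESC_CONTENT_BASE <= addr <= DESC_CONTENT_END:
        return "descriptor content"
    if DATA_BASE <= addr <= DATA_END:
        return "tensor data"
    return "outside the agreed area"

print(region_of(40))   # descriptor content
print(region_of(100))  # tensor data
```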
  • the address ADDR here does not refer to 1 bit or one byte; it is an address unit used to represent an address.
  • Those skilled in the art can determine the descriptor storage space, the data storage space and their specific addresses according to actual conditions, which are not limited in this disclosure.
  • the identifier, content of the descriptor, and tensor data indicated by the descriptor may be stored in different areas of the internal memory.
  • a register can be used as a descriptor storage space to store the identifier and content of the descriptor in the register
  • an on-chip cache can be used as a data storage space to store the tensor data indicated by the descriptor.
  • the number of the register may be used to represent the identifier of the descriptor. For example, when the number of the register is 0, the identifier of the descriptor it stores is set to 0. When the descriptor in the register is valid, an area can be allocated in the cache space for storing the tensor data according to the size of the tensor data indicated by the descriptor.
  • the identifier and content of the descriptor may be stored in an internal memory, and the tensor data indicated by the descriptor may be stored in an external memory.
  • the identifier and content of the descriptor can be stored on-chip, and the tensor data indicated by the descriptor can be stored off-chip.
  • the data address of the data storage space corresponding to each descriptor may be a fixed address.
  • a separate data storage space can be divided for tensor data, and the starting address of each tensor data in the data storage space corresponds to a descriptor one-to-one.
  • in this case, the circuit or module responsible for parsing the calculation instruction (eg, an entity external to the computing device of the present disclosure, or the control circuit 102 shown in FIGS. 2-3) can determine the data address of the data corresponding to the operand according to the descriptor.
  • when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may also be used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor may also include at least one address parameter representing the address of the tensor data.
  • the content of the descriptor may include an address parameter indicating the address of the tensor data, such as the starting physical address of the tensor data, It may also include multiple address parameters of the address of the tensor data, such as the start address + address offset of the tensor data, or address parameters of the tensor data based on each dimension.
  • the address parameter of the tensor data may include a reference address of the data reference point of the descriptor in the data storage space of the tensor data.
  • the reference address can differ according to the choice of the data reference point. The present disclosure does not limit the selection of the data reference point.
  • the reference address may include a start address of the data storage space.
  • the reference address of the descriptor is the starting address of the data storage space.
  • the reference address of the descriptor is the address of the data block in the data storage space.
  • the shape parameters of the tensor data include at least one of the following: the size of the data storage space in at least one of the N dimension directions; the size of the storage area in at least one of the N dimension directions; the offset of the storage area in at least one of the N dimension directions; the positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimensions; and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
  • the data description position is the mapping position of the point or area in the tensor data indicated by the descriptor.
  • for example, when the tensor data indicated by the descriptor is three-dimensional, three-dimensional space coordinates (x, y, z) can be used to represent the shape of the tensor data, and the data description position of the tensor data may be the position of a point or area in the three-dimensional space to which the tensor data is mapped, represented by the three-dimensional space coordinates (x, y, z).
  • in a possible implementation, the content of the descriptor of the tensor data may be determined by using the reference address of the data reference point of the descriptor in the data storage space of the tensor data, the size of the data storage space in at least one of the N dimension directions, the size of the storage area in at least one of the N dimension directions, and/or the offset of the storage area in at least one of the N dimension directions.
  • Figure 1b shows a schematic diagram of a data storage space according to an embodiment of the present disclosure.
  • the data storage space 21 stores two-dimensional data in a row-first manner, addressable by (x, y) (where the X axis extends horizontally to the right and the Y axis extends vertically downward); the size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), and the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure)
  • the starting address PA_start (the reference address) of the data storage space 21 is the physical address of the first data block 22
  • the data block 23 is part of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y
  • in a possible implementation, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21
  • the content of the descriptor of the data block 23 can then be determined by combining the size ori_x of the data storage space 21 on the X axis, the size ori_y on the Y axis, the offsets offset_x and offset_y of the data block 23 in the X-axis and Y-axis directions, and the sizes size_x and size_y in the X-axis and Y-axis directions
  • the content of the descriptor represents a two-dimensional space
  • those skilled in the art can set the specific dimension of the content of the descriptor according to the actual situation, which is not limited in the present disclosure.
  • a reference address of the data reference point of the descriptor in the data storage space may be agreed, and, on the basis of the reference address, the content of the descriptor of the tensor data is determined according to the positions, relative to the data reference point, of at least two vertices at diagonal positions in the N dimension directions.
  • the base address PA_base of the data base point of the descriptor in the data storage space may be agreed.
  • the physical address in the data storage space of one piece of data (for example, the data at position (2, 2)) may be used as the reference address PA_base.
  • the content of the descriptor of the data block 23 in FIG. 1b can be determined according to the positions of the two diagonal vertices relative to the data reference point.
  • the positions of at least two vertices at diagonal positions of the data block 23 relative to the data reference point are determined; for example, using the diagonal vertices in the upper-left to lower-right direction, the relative position of the upper-left vertex is (x_min, y_min) and the relative position of the lower-right vertex is (x_max, y_max); the content of the descriptor of the data block 23 is then determined from the relative position (x_min, y_min) of the upper-left vertex and the relative position (x_max, y_max) of the lower-right vertex.
  • the following formula (3) can be used to represent the content of the descriptor (the base address is PA_base):
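The diagonal-vertex representation above can be sketched as follows. The exact field layout of formula (3) is not reproduced in the text, so this record format, and the assumption that the vertex coordinates are inclusive, are illustrative only.

```python
# Sketch: describe a 2-D block by its upper-left vertex (x_min, y_min) and
# lower-right vertex (x_max, y_max) relative to the data reference point,
# and derive the block size from the two vertices.

def vertex_descriptor(x_min, y_min, x_max, y_max):
    """Build a diagonal-vertex descriptor and derive size_x/size_y."""
    size_x = x_max - x_min + 1   # assumes inclusive vertex coordinates
    size_y = y_max - y_min + 1
    return {"x_min": x_min, "y_min": y_min,
            "x_max": x_max, "y_max": y_max,
            "size_x": size_x, "size_y": size_y}

d23 = vertex_descriptor(2, 1, 5, 3)  # hypothetical block, 4 wide x 3 tall
print(d23)
```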
  • in a possible implementation, the content of the descriptor of the tensor data may be determined by using the reference address of the data reference point of the descriptor in the data storage space and the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor.
  • the mapping relationship between the data description position and the data address can be set according to actual needs; for example, when the tensor data indicated by the descriptor is three-dimensional space data, the function f(x, y, z) can be used to define the mapping relationship between the data description position and the data address.
  • in a possible implementation, the descriptor is further used to indicate the address of N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter indicating the address of the tensor data; for example, the content of the descriptor can be:
  • PA is the address parameter.
  • the address parameter can be a logical address or a physical address.
  • the descriptor parsing circuit can take PA as any one of a vertex, the middle point, or a preset point of the tensor shape, and obtain the corresponding data address in combination with the shape parameters in the X and Y directions.
  • the address parameter of the tensor data includes a reference address of the data reference point of the descriptor in the data storage space of the tensor data, and the reference address includes the data storage The starting address of the space.
  • the descriptor may further include at least one address parameter representing the address of the tensor data, for example, the content of the descriptor may be:
  • PA_start is a reference address parameter, which is not repeated here.
  • the mapping relationship between the data description position and the data address can be set according to the actual situation, which is not limited in the present disclosure.
  • a predetermined reference address may be set for a task; the descriptors in the instructions under this task all use this reference address, and the content of the descriptors may include shape parameters based on this reference address.
• the reference address can be determined by setting environment parameters for this task. For the relevant description and usage of the reference address, reference may be made to the foregoing embodiments.
  • the content of the descriptor can be mapped to the data address more quickly.
  • a reference address may be included in the content of each descriptor, and the reference address of each descriptor may be different. Compared with the method of setting a common reference address by using environmental parameters, each descriptor in this method can describe data more flexibly and use a larger data address space.
  • the data address in the data storage space of the data corresponding to the operand of the processing instruction may be determined according to the content of the descriptor.
  • the calculation of the data address is automatically completed by the hardware, and when the representation of the content of the descriptor is different, the calculation method of the data address is also different. This disclosure does not limit the specific calculation method of the data address.
• when the content of the descriptor in the operand is represented by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y, the starting data address PA1(x, y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (5):
• PA1(x, y) = PA_start + (offset_y - 1) * ori_x + offset_x    (5)
  • the data address in the data storage space of the data corresponding to the operand can be determined according to the content of the descriptor and the data description location. In this way, part of the data (eg, one or more data) in the tensor data indicated by the descriptor can be processed.
  • the content of the descriptor in the operand is represented by formula (2).
  • the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and the size is size_x*size_y.
• the operand includes the data description position (x_q, y_q) for the descriptor; then the data address PA2(x, y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (6):
• PA2(x, y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (6)
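As a hedged illustration of the address arithmetic in formulas (5) and (6), the computation can be sketched in plain Python. The parameter names (PA_start, ori_x, offset_x/offset_y, x_q/y_q) follow the text, and the 1-based row convention implied by "(offset_y - 1)" is kept as-is; the numeric values are illustrative only.

```python
def pa1_start(PA_start, ori_x, offset_x, offset_y):
    """Formula (5): start address of the tensor block in row-major storage."""
    return PA_start + (offset_y - 1) * ori_x + offset_x

def pa2_element(PA_start, ori_x, offset_x, offset_y, x_q, y_q):
    """Formula (6): address of element (x_q, y_q) inside the block."""
    return PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)

# Example: a tensor block at offset (2, 3) inside rows of ori_x = 16 words.
base = pa1_start(PA_start=1000, ori_x=16, offset_x=2, offset_y=3)
elem = pa2_element(PA_start=1000, ori_x=16, offset_x=2, offset_y=3, x_q=1, y_q=0)
print(base, elem)  # elem is base + 1: one step along x from the start address
```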
• the computing device of the present disclosure has been described above in conjunction with FIGS. 1a and 1b. By utilizing one or more arrays of processing circuits in the computing device, and based on the operational functions of the processing circuits, the computing instructions of the present disclosure can be efficiently executed on the computing device to complete multi-threaded operations, thereby improving the execution efficiency of parallel operations and reducing computational overhead. In addition, by using descriptors to perform operations on tensors, the solution of the present disclosure also significantly improves the access and processing efficiency of tensor data, and reduces the overhead of operations on tensors.
  • FIG. 2a is a block diagram illustrating a computing device 100 according to another embodiment of the present disclosure.
• in addition to having the same processing circuits 104 as computing device 80, computing device 100 also includes a control circuit 102.
  • the control circuit 102 may be configured to obtain the calculation instructions described above and parse the calculation instructions to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the operation codes, such as represented by the formula (1).
• the control circuit configures the processing circuit array according to the plurality of operation instructions to obtain the plurality of processing circuit sub-arrays, such as the processing circuit sub-arrays M1, M2...Mn shown in FIG. 1a.
  • control circuit may include a register for storing configuration information, and the control circuit may extract corresponding configuration information according to the plurality of operation instructions, and configure the processing circuit array according to the configuration information to obtain the plurality of processing circuit sub-arrays.
• the aforementioned registers or other registers of the control circuit may be configured to store information about the descriptors of the present disclosure, such as the identifiers of the descriptors and/or the content of the descriptors, so that the descriptors can be used to determine the storage address of the data.
• the control circuit may include one or more registers that store configuration information about the processing circuit array, and the control circuit may be configured to read the configuration information from the registers in accordance with a configuration instruction and send it to the processing circuits, so that the processing circuits can be connected according to the configuration information.
  • configuration information may include preset location information of the processing circuits constituting the one or more processing circuit arrays, and the location information may include, for example, coordinate information or label information of the processing circuits.
  • the configuration information may further include loop-forming configuration information about the processing circuit array forming a closed loop.
  • the above-mentioned configuration information can also be directly carried through a configuration instruction instead of being read from the register.
  • the processing circuit can be directly configured according to the position information in the received configuration instruction, so as to form an array without a closed loop with other processing circuits or further form an array with a closed loop.
• the processing circuits located in the two-dimensional array may be connected, in at least one of their row, column or diagonal directions, with the remaining one or more processing circuits in the same row, column or diagonal in a predetermined two-dimensional spacing pattern, so as to form one or more closed loops.
• the aforementioned predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
• the processing circuit array may be connected in a looped manner into a three-dimensional array composed of a plurality of layers, wherein each layer includes a two-dimensional array of a plurality of said processing circuits arranged in a row direction, a column direction and a diagonal direction, and wherein the processing circuits located in the three-dimensional array may be connected, in at least one of their row, column, diagonal and layer directions, with the remaining one or more processing circuits in the same row, column, diagonal or in different layers in a predetermined three-dimensional spacing pattern, so as to form one or more closed loops.
  • the predetermined three-dimensional spacing pattern is associated with the number of spacings and the number of spacing layers between processing circuits to be connected.
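One possible reading of the "predetermined spacing pattern" can be sketched in software: a number of intervals k means k circuits lie between the two connected endpoints, so circuit i links to circuit i + k + 1 (modulo the row length when the row closes into a loop). The function name and this interpretation are illustrative assumptions, not the patented hardware.

```python
def loop_edges(n, interval):
    """Edges of a closed loop over n circuits, skipping `interval` circuits
    between connected endpoints (interval 0 = adjacent neighbours)."""
    step = interval + 1
    return [(i, (i + step) % n) for i in range(n)]

print(loop_edges(4, 0))  # adjacent ring: 0-1, 1-2, 2-3, 3-0
print(loop_edges(8, 1))  # every connection skips one circuit
```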
• FIG. 2b is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen from the figure, in addition to including the same control circuit 102 and plurality of processing circuits 104 as computing device 100, the computing device 200 in FIG. 2b also includes a storage circuit 106.
• the above-mentioned storage circuit may be configured with interfaces for data transmission in multiple directions, so as to be connected with the multiple processing circuits 104, so that the data to be operated on by the processing circuits, the intermediate results obtained during execution of the operations, and the operation results obtained after the operations are executed can be stored accordingly.
• the storage circuit of the present disclosure may include a main storage module and/or a main cache module, wherein the main storage module is configured to store data for performing operations in the processing circuit array and the operation results after the operations are performed, and the main cache module is configured to cache the intermediate operation results after operations are performed in the processing circuit array.
  • the aforementioned operation result and intermediate operation result may be tensors, which may be stored in the storage circuit according to the storage address determined by the descriptor of the present disclosure.
  • the storage circuit may also have an interface for data transmission with an off-chip storage medium, so that data transfer between on-chip and off-chip systems can be realized.
  • FIG. 3 is a block diagram illustrating a computing device 300 according to yet another embodiment of the present disclosure.
• the pre-operation circuit 110 is configured to perform preprocessing of the input data (e.g., tensor-type data) of at least one operation instruction, while the post-operation circuit 112 is configured to perform post-processing of the output data (e.g., tensor-type data) of at least one operation instruction.
  • the preprocessing performed by the pre-operation circuit may include data placement and/or table lookup operations
  • the post-processing performed by the post-operation circuit may include data type conversion and/or compression operations.
• in performing a table lookup operation, the pre-operation circuit is configured to look up one or more tables by index values, so as to obtain from the one or more tables one or more constant terms associated with the operand. Additionally or alternatively, the pre-operation circuit is configured to determine an associated index value from the operand, and to look up the one or more tables by the index value, so as to obtain from the one or more tables one or more constant terms associated with the operand.
• the pre-operation circuit may split the operation data correspondingly according to the type of the operation data and the logical address of each processing circuit, and transmit the multiple sub-data obtained after the splitting to the corresponding processing circuits in the array for operation.
  • the pre-operation circuit may select a data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on the two input data.
• the post-operation circuit may be configured to perform a compression operation on the data, where the compression operation includes filtering the data using a mask, or filtering by comparing the data with a given threshold, so as to realize compression of the data.
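The two compression modes named above can be sketched as follows; plain Python stands in for the post-operation circuit, and all names are illustrative.

```python
def compress_by_mask(data, mask):
    """Keep only the elements whose mask bit is set."""
    return [d for d, keep in zip(data, mask) if keep]

def compress_by_threshold(data, threshold):
    """Keep only the elements larger than a given threshold."""
    return [d for d in data if d > threshold]

print(compress_by_mask([5, 0, 7, 2], [1, 0, 1, 0]))  # [5, 7]
print(compress_by_threshold([5, 0, 7, 2], 3))        # [5, 7]
```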
  • the computing device of the present disclosure can execute computing instructions including the aforementioned preprocessing and postprocessing. Based on this, the data conversion operation of the calculation instruction as expressed in the preceding formula (1) can be performed by the above-mentioned post-operation circuit.
  • Two illustrative examples of computing instructions in accordance with the present disclosure will be given below:
• TMUADCO = MULT + ADD + RELU(N) + CONVERTFP2FIX    (7)
• the instruction expressed in formula (7) above is a computation instruction that inputs three operands and outputs one operand, and it can be completed by a processing circuit array according to the present disclosure that performs a three-stage pipeline operation (i.e., multiply + add + activate).
• the ternary operation is A*B+C, wherein the microinstruction MULT completes the multiplication between operands A and B to obtain the product value, that is, the first-stage pipeline operation.
  • the microinstruction of ADD is executed to complete the addition operation of the product value and C to obtain the summation result "N", that is, the second-stage pipeline operation.
  • the activation operation RELU is performed on the result, that is, the third-stage pipeline operation.
• the microinstruction CONVERTFP2FIX can be executed by the post-operation circuit above, so as to convert the type of the result data after the activation operation from floating-point to fixed-point, so that it can be output as the final result or input, as an intermediate result, to a fixed-point operator for further computational operations.
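The pipeline of instruction (7) can be modelled in software as a hedged sketch: three pipeline stages (multiply, add, ReLU) followed by the post-operation float-to-fixed conversion. The Q8 fixed-point format chosen here is an assumption, not something the text specifies.

```python
def tmuadco(a, b, c, frac_bits=8):
    """Software model of TMUADCO = MULT + ADD + RELU(N) + CONVERTFP2FIX."""
    product = a * b          # stage 1: MULT
    n = product + c          # stage 2: ADD, summation result "N"
    activated = max(0.0, n)  # stage 3: RELU(N)
    # post-operation: convert floating-point to fixed-point (assumed Q8)
    return int(round(activated * (1 << frac_bits)))

print(tmuadco(1.5, 2.0, -1.0))  # (1.5*2.0 - 1.0) = 2.0 -> 512 in Q8
```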
• the instruction expressed in formula (8) above is a calculation instruction that inputs three operands and outputs one operand, and which includes a two-stage pipeline operation (i.e., multiply + add) that can be performed by a processing circuit array according to the present disclosure.
  • the multiplication operation between operands A and B is performed by the first stage pipeline operation to obtain the product value.
  • operands A, B and product values may be tensors read and saved according to the descriptors of the present disclosure.
  • the microinstruction of ADD is executed to complete the addition operation of the aforementioned product value and C, so as to obtain the summation result "N", that is, the second-stage pipeline operation.
• when the aforementioned summation result is a tensor, it can also be stored according to the storage address determined by the descriptor of the present disclosure.
• the calculation instructions of the present disclosure can be flexibly designed and determined according to the requirements of the calculation, so that the hardware architecture of the present disclosure, which includes a plurality of processing circuit sub-arrays, can be designed and connected according to the calculation instructions and the specific operations they complete, thereby improving the execution efficiency of the instructions and reducing computational overhead.
  • FIG. 4 is an example block diagram illustrating various types of processing circuit arrays of a computing device 400 according to an embodiment of the present disclosure.
• the computing device 400 shown in FIG. 4 has a similar architecture to the computing device 300 shown in FIG. 3, so the description about the computing device 300 in FIG. 3 also applies to the computing device 400, and the same details are not repeated in the following paragraphs.
  • the plurality of processing circuits may include, for example, a plurality of first type processing circuits 104-1 and a plurality of second type processing circuits 104-2 (distinguished by different background colors in the figure).
  • the plurality of processing circuits may be arranged through physical connections to form a two-dimensional array. For example, as shown in the figure, there are M rows and N columns (denoted as M*N) of first type processing circuits in the two-dimensional array, where M and N are positive integers greater than zero.
• the first type of processing circuit can be used to perform arithmetic and logical operations, which can include, for example, linear operations such as addition, subtraction and multiplication, comparison operations, non-linear operations, or any combination of the aforementioned types of operations.
• on the left and right sides of the periphery of the M*N array of first-type processing circuits there are two columns each, a total of (M*2+M*2) second-type processing circuits, and on the lower side of the periphery there are two rows, a total of (N*2+8) second-type processing circuits; that is, the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits.
  • the second type of processing circuit may be used to perform non-linear operations such as comparison operations, table lookup operations or shift operations on the received data.
  • a first type of processing circuit may form a first sub-array of processing circuits of the present disclosure
  • a second type of processing circuit may form a second sub-array of processing circuits of the present disclosure for performing multi-threaded operations.
• the first processing circuit sub-array can perform several stages of the multi-stage pipeline operation, and the second processing circuit sub-array can perform several additional stages of the pipeline operation.
• the first processing circuit sub-array can execute a first multi-stage pipeline operation, and the second processing circuit sub-array can execute a second multi-stage pipeline operation.
• the memory circuits used in the first type of processing circuit and the second type of processing circuit may have different storage scales and storage modes.
  • the predicate storage circuit in the first type of processing circuit may utilize a plurality of numbered registers to store predicate information.
  • the first-type processing circuit can access the predicate information in the register of the corresponding number according to the register number specified in the received parsed instruction.
  • the second type of processing circuit may store the predicate information in a static random access memory ("SRAM").
• the second type of processing circuit can determine the storage address of the predicate information in the SRAM according to the offset of the location of the predicate information specified in the received parsed instruction, and can perform a predetermined read or write operation on the predicate information at that storage address.
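The two predicate-storage schemes above can be contrasted in a short sketch: the first type indexes a numbered register file directly, while the second type adds an instruction-supplied offset to a base address in SRAM. The class names, sizes, and base address are illustrative assumptions.

```python
class PredicateRegisters:
    """First-type circuit: predicate bits in numbered registers."""
    def __init__(self, n=16):
        self.regs = [0] * n
    def write(self, reg_no, bit):
        self.regs[reg_no] = bit
    def read(self, reg_no):
        return self.regs[reg_no]

class PredicateSram:
    """Second-type circuit: predicate bits in SRAM, addressed by offset."""
    def __init__(self, size=256, base=0):
        self.mem = [0] * size
        self.base = base
    def write(self, offset, bit):
        self.mem[self.base + offset] = bit  # address = base + offset
    def read(self, offset):
        return self.mem[self.base + offset]
```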
  • 5a, 5b, 5c and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure.
  • the plurality of processing circuits of the present disclosure may be connected in a hardwired manner or in a logically connected manner according to configuration instructions, thereby forming a topology of a one-dimensional or multi-dimensional array of connections.
  • the multi-dimensional array may be a two-dimensional array, and the processing circuits located in the two-dimensional array may be arranged in a row direction, a column direction or a diagonal direction thereof.
  • FIG. 5a to 5c exemplarily show the topology of various forms of two-dimensional arrays between a plurality of processing circuits.
• the processing circuits are connected to form a simple two-dimensional array. Specifically, one processing circuit serves as the center of the two-dimensional array, and one processing circuit is connected in each of the four horizontal and vertical directions relative to it, thereby forming a two-dimensional array with three rows and three columns. Further, since the processing circuit located at the center of the two-dimensional array is directly connected with the adjacent processing circuits in the previous and next columns of the same row, and with the adjacent processing circuits in the previous and next rows of the same column, the number of spaced processing circuits (abbreviated as the "number of intervals") is 0.
• each processing circuit is connected to its adjacent processing circuits in the preceding and following rows and the preceding and following columns, that is, the number of intervals to adjacent processing circuits is all 0.
• the first processing circuit located in each row or column of the two-dimensional Torus array is also connected to the last processing circuit of that row or column, and the number of intervals between the processing circuits connected end to end in each row or column is 2.
  • the processing circuits with four rows and four columns can also be connected to form a two-dimensional array in which the number of intervals between adjacent processing circuits is 0, and the number of intervals between non-adjacent processing circuits is 1.
• adjacent processing circuits in the same row or the same column are directly connected, that is, the number of intervals is 0, while non-adjacent processing circuits in the same row or the same column are connected with a number of intervals of 1.
• in other examples, there may be numbers of intervals between the processing circuits in the same row or in the same column different from those shown in FIG. 5b and FIG. 5c.
• the processing circuits in the diagonal direction may also be connected with different numbers of intervals.
  • a three-dimensional Torus array is based on the two-dimensional Torus array, and uses a spacing pattern similar to that between rows and columns for interlayer connection. For example, firstly, the processing circuits in the same row and the same column of adjacent layers are directly connected, that is, the number of intervals is 0. Next, connect the processing circuits of the first layer and the last layer in the same column, that is, the number of intervals is 2. Finally, a three-dimensional Torus array with four layers, four rows and four columns can be formed.
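The layered Torus connectivity described above can be sketched as follows: within a layer each circuit links to its row and column neighbours with wrap-around, and the same wrap-around pattern is repeated along the layer axis. Coordinates are (layer, row, column) as in the text; the 4x4x4 size and function name are illustrative.

```python
def torus_neighbors(l, r, c, L=4, R=4, C=4):
    """Neighbours of circuit (l, r, c) in an L x R x C three-dimensional
    Torus; the modulo wrap-around closes each row, column and layer axis
    into a loop (first and last elements of an axis are connected)."""
    return [
        (l, r, (c - 1) % C), (l, r, (c + 1) % C),  # same layer, same row
        (l, (r - 1) % R, c), (l, (r + 1) % R, c),  # same layer, same column
        ((l - 1) % L, r, c), ((l + 1) % L, r, c),  # adjacent layers
    ]

# A corner circuit wraps to the opposite end of each axis:
print(torus_neighbors(0, 0, 0))
```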
  • connection relationship of other multi-dimensional arrays of processing circuits can be formed on the basis of two-dimensional arrays by adding new dimensions and increasing the number of processing circuits.
  • the solutions of the present disclosure may also configure logical connections to processing circuits by using configuration instructions.
• the disclosed solution may selectively connect some processing circuits or selectively bypass some processing circuits through configuration instructions, so as to form one or more processing circuit arrays.
  • a logical connection can also be adjusted according to actual operation requirements (eg, data type conversion).
  • the solutions of the present disclosure can configure the connection of the processing circuits, including, for example, configuring into a matrix or configuring into one or more closed computing loops.
  • FIGS. 6a, 6b, 6c and 6d are schematic diagrams illustrating further various connection relationships of a plurality of processing circuits according to an embodiment of the present disclosure.
• FIGS. 6a to 6d show still other exemplary connection relationships of multi-dimensional arrays formed by the plurality of processing circuits shown in FIGS. 5a to 5d.
  • the technical details described in conjunction with Figs. 5a to 5d also apply to the content shown in Figs. 6a to 6d.
• the processing circuits of the two-dimensional array include a central processing circuit located at the center of the two-dimensional array and three processing circuits connected to the central processing circuit in each of the four directions of the same row and the same column. Therefore, the numbers of intervals between the central processing circuit and the remaining processing circuits are 0, 1 and 2, respectively.
• the processing circuits of the two-dimensional array include a central processing circuit located at the center of the two-dimensional array, three processing circuits in each of the two opposite directions of the same row as the central processing circuit, and one processing circuit in each of the two opposite directions of the same column as the central processing circuit. Therefore, the numbers of intervals between the central processing circuit and the processing circuits in the same row are 0 and 2 respectively, and the numbers of intervals between the central processing circuit and the processing circuits in the same column are all 0.
• a multi-dimensional array formed by a plurality of processing circuits may be a three-dimensional array formed by a plurality of layers, wherein each layer of the three-dimensional array may comprise a two-dimensional array of a plurality of the processing circuits arranged along its row and column directions. Further, the processing circuits located in the three-dimensional array may be connected, in a predetermined three-dimensional spacing pattern, with the remaining one or more processing circuits in the same row, column, diagonal or in different layers. Further, the predetermined three-dimensional spacing pattern may be related to the number of mutually spaced processing circuits in the connection and the number of spaced layers. The connection mode of the three-dimensional array will be further described below with reference to FIG. 6c and FIG. 6d.
  • Figure 6c shows a multi-layer, multi-row and multi-column three-dimensional array formed by connecting a plurality of processing circuits.
• taking the processing circuit located at the lth layer, the rth row and the cth column (represented as (l, r, c)) as an example, it is located at the center of the array and is connected with the processing circuits at the previous column (l, r, c-1) and the next column (l, r, c+1) of the same layer and the same row, with the processing circuits at the previous row (l, r-1, c) and the next row (l, r+1, c) of the same layer and the same column, and with the processing circuits at the previous layer (l-1, r, c) and the next layer (l+1, r, c) of different layers in the same row and the same column.
  • FIG. 6d shows a three-dimensional array when the number of spaces connected between a plurality of processing circuits in the row direction, the column direction and the layer direction is all one.
• taking the processing circuit located at the center of the array (l, r, c) as an example, it is connected with the processing circuits one column apart before and after it in the same layer and the same row, at (l, r, c-2) and (l, r, c+2), with the processing circuits one row apart in the same layer and the same column, at (l, r-2, c) and (l, r+2, c), and with the processing circuits one layer apart in the same row and the same column, at (l-2, r, c) and (l+2, r, c).
• the processing circuits at (l, r, c-3) and (l, r, c-1) in the same layer and one column apart are connected to each other, and the processing circuits at (l, r, c+1) and (l, r, c+3) are connected to each other.
• the processing circuits at (l, r-3, c) and (l, r-1, c) in the same layer and the same column are connected to each other, and the processing circuits at (l, r+1, c) and (l, r+3, c) are connected to each other.
• the processing circuits at (l-3, r, c) and (l-1, r, c) in the same row and one layer apart are connected to each other, and the processing circuits at (l+1, r, c) and (l+3, r, c) are connected to each other.
  • connection relationship of the multi-dimensional array formed by a plurality of processing circuits has been exemplarily described above, and different loop structures formed by a plurality of processing circuits will be further exemplarily described below with reference to FIGS. 7-8 .
  • FIGS. 7a, 7b, 7c and 7d are schematic diagrams respectively illustrating various loop structures of processing circuits according to embodiments of the present disclosure.
  • a plurality of processing circuits can not only be connected in a physical connection relationship, but also can be configured to be connected in a logical relationship according to the received parsed instruction.
  • the plurality of processing circuits may be configured to be connected using the logical connection relationship to form a closed loop.
  • the four adjacent processing circuits are sequentially numbered "0, 1, 2 and 3".
  • the four processing circuits are sequentially connected in a clockwise direction from processing circuit 0, and processing circuit 3 is connected with processing circuit 0, so that the four processing circuits are connected in series to form a closed loop (referred to as "looping" for short).
• the number of intervals between processing circuits is 0 or 2, e.g., the number of intervals between processing circuits 0 and 1 is 0, and the number of intervals between processing circuits 3 and 0 is 2.
• the physical addresses (which may also be referred to as physical coordinates in the context of this disclosure) of the four processing circuits in the loop shown can be represented as 0-1-2-3, while their logical addresses (which may also be called logical coordinates in the context of this disclosure) can likewise be expressed as 0-1-2-3.
• the connection sequence shown in FIG. 7a is only exemplary and non-limiting, and those skilled in the art can also connect the four processing circuits in series in a counterclockwise direction to form a closed loop according to actual calculation needs.
  • a plurality of processing circuits may be combined into a processing circuit group to represent one data. For example, suppose a processing circuit can handle 8-bit data. When 32-bit data needs to be processed, four processing circuits can be combined into a processing circuit group, so that four 8-bit data can be connected to form a 32-bit data. Further, one processing circuit group formed by the aforementioned four 8-bit processing circuits can serve as one processing circuit 104 shown in FIG. 7b, so that higher bit-width arithmetic operations can be supported.
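The grouping idea above can be sketched as a software analogy: four 8-bit lanes combined into one 32-bit value. Lane 0 is taken as least significant here; the text does not fix the byte order, so this choice is an assumption.

```python
def combine_lanes(lanes):
    """Combine four 8-bit values (lane 0 = least significant byte)
    into a single 32-bit value, modelling a four-circuit group."""
    assert len(lanes) == 4 and all(0 <= v < 256 for v in lanes)
    return sum(v << (8 * i) for i, v in enumerate(lanes))

print(hex(combine_lanes([0x78, 0x56, 0x34, 0x12])))  # 0x12345678
```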
• the layout of the processing circuits shown in FIG. 7b is similar to that shown in FIG. 7a, but the number of intervals of the connections between the processing circuits in FIG. 7b is different from that of FIG. 7a.
• FIG. 7b shows four processing circuits numbered sequentially 0, 1, 2 and 3 in a clockwise direction. Starting from processing circuit 0, processing circuit 1, processing circuit 3 and processing circuit 2 are connected in sequence, and processing circuit 2 is connected to processing circuit 0, thus forming a closed loop in series.
• the numbers of intervals of the processing circuits shown in FIG. 7b are 0 or 1, e.g., the number of intervals between processing circuits 0 and 1 is 0, and the number of intervals between processing circuits 1 and 3 is 1.
  • the physical addresses of the four processing circuits in the illustrated closed loop may be 0-1-2-3, and the logical addresses may be represented as 0-1-3-2 according to the illustrated looping manner. Therefore, when data of high bit width needs to be split to be allocated to different processing circuits, the data sequence can be rearranged and allocated according to the logical addresses of the processing circuits.
• the pre-operation circuit can rearrange the input data according to the physical addresses and logical addresses of the plurality of processing circuits, so as to meet the requirements of the data operation. Assuming that four sequentially arranged processing circuits 0 to 3 are connected as shown in FIG. 7a, since the physical and logical addresses of the connections are both 0-1-2-3, the pre-operation circuit can transmit the input data (for example, pixel data) aa0, aa1, aa2 and aa3 sequentially to the corresponding processing circuits.
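The rearrangement step above can be sketched as follows: input items are distributed so that item i lands on the circuit whose *logical* address is i, whatever its physical position. The 0-1-3-2 logical order matches the FIG. 7b example; the function and variable names are illustrative.

```python
def distribute(data, logical_of_physical):
    """Give each physical slot p the item indexed by its logical address."""
    return [data[logical_of_physical[p]] for p in range(len(data))]

# FIG. 7a: logical order equals physical order, so data passes through as-is
print(distribute(["aa0", "aa1", "aa2", "aa3"], [0, 1, 2, 3]))
# FIG. 7b: physical slots 0-1-2-3 carry logical addresses 0-1-3-2
print(distribute(["aa0", "aa1", "aa2", "aa3"], [0, 1, 3, 2]))
```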
  • FIG. 7c shows that more processing circuits are arranged and connected in different ways, respectively, to form a closed loop.
• the 16 processing circuits 104, numbered in the order of 0, 1...15, are, starting from processing circuit 0, sequentially connected and combined every two processing circuits to form processing circuit groups (that is, the processing circuit sub-arrays of the present disclosure). For example, as shown in the figure, processing circuit 0 is connected with processing circuit 1 to form one processing circuit group ... and processing circuit 14 is connected with processing circuit 15 to form one processing circuit group, finally forming eight processing circuit groups. Further, the eight processing circuit groups can also be connected in a manner similar to the aforementioned processing circuits, including connection according to, for example, predetermined logical addresses, so as to form a closed loop of processing circuit groups.
  • a plurality of processing circuits 104 are connected in an irregular or non-uniform manner to form a processing circuit matrix having a closed loop.
  • the number of intervals between the connected processing circuits can be 0 or 3 to form a closed loop; for example, processing circuit 0 can be connected with processing circuit 1 (the interval number is 0) and with processing circuit 4 (the interval number is 3).
  • the processing circuit of the present disclosure may be spaced by different numbers of processing circuits so as to be connected in a closed loop.
  • any number of intermediate intervals can also be selected for dynamic configuration, thereby connecting into a closed loop.
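As an illustrative sketch (not the disclosure's implementation; the helper name `ring_edges` and its interval convention are assumptions), the interval-based connections described above can be modeled by linking each circuit to the circuit `interval + 1` positions ahead, modulo the array size:

```python
def ring_edges(n, intervals):
    """Connect n processing circuits into closed loops with configurable
    interval counts: an interval of 0 links adjacent circuits, while an
    interval of k links each circuit to the one k + 1 positions ahead."""
    edges = set()
    for interval in intervals:
        for i in range(n):
            edges.add((i, (i + interval + 1) % n))
    return sorted(edges)

# Interval 0 joins neighbors (0-1, 1-2, ...); interval 3 joins 0-4, 1-5, ...
print(ring_edges(8, [0, 3]))
```

With intervals [0, 3], circuit 0 is connected both to circuit 1 and to circuit 4, matching the example above; dynamically changing the interval list models the soft-configured connections mentioned below.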
  • the connection of the plurality of processing circuits may be a hard connection formed by hardware, or may be a soft connection configured by software.
  • FIGS. 8a, 8b and 8c are schematic diagrams illustrating further various loop structures of processing circuits according to embodiments of the present disclosure.
  • multiple processing circuits may form a closed loop, and each processing circuit in the closed loop may be configured with a respective logical address.
  • the pre-operation circuit described in conjunction with FIG. 2 can be configured to split the operation data according to the type of the operation data (such as 32-bit data, 16-bit data or 8-bit data) and the logical addresses, and to transfer the multiple sub-data obtained after the splitting to the corresponding processing circuits in the loop for subsequent operations.
  • the upper diagram of FIG. 8a shows that four processing circuits are connected to form a closed loop, and the physical addresses of the four processing circuits in the order from right to left can be represented as 0-1-2-3.
  • the lower diagram of Figure 8a shows that the logical addresses of the four processing circuits in the aforementioned loop are represented as 0-3-1-2 in order from right to left.
  • the processing circuit with the logical address "3" shown in the lower diagram of Fig. 8a has the physical address "1" shown in the upper diagram of Fig. 8a.
  • the granularity of the operation data is the lower 128 bits of the input data, such as the original sequence "15, 14, ... 2, 1, 0" in the figure (each number corresponds to 8 bits of data), and the logical addresses of these 16 pieces of 8-bit data are numbered 0 to 15 in order from the low end to the high end.
  • the pre-operation circuit can encode or arrange the data using different logical addresses according to different data types.
  • the four groups of logical addresses (3,2,1,0), (7,6,5,4), (11,10,9,8) and (15,14,13,12) can represent the 0th to 3rd 32-bit data respectively.
  • the pre-operation circuit can transmit the 0th 32-bit data to the processing circuit whose logical address is "0" (the corresponding physical address is "0"), transmit the 1st 32-bit data to the processing circuit whose logical address is "1" (the corresponding physical address is "2"), transfer the 2nd 32-bit data to the processing circuit whose logical address is "2" (the corresponding physical address is "3"), and send the 3rd 32-bit data to the processing circuit whose logical address is "3" (the corresponding physical address is "1").
  • the mapping relationship between the logical address and the physical address of the final data is (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0).
  • the eight groups of logical addresses (1,0), (3,2), (5,4), (7,6), (9,8), (11,10), (13,12) and (15,14) can represent the 0th to 7th 16-bit data respectively.
  • the pre-operation circuit can transmit the 0th and 4th 16-bit data to the processing circuit whose logical address is "0" (the corresponding physical address is "0"), can transfer the 1st and 5th 16-bit data to the processing circuit whose logical address is "1" (the corresponding physical address is "2"), and so on.
  • the mapping relationship between the logical address and the physical address of the final data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0).
  • the pre-operation circuit can transmit the 0th, 4th, 8th and 12th 8-bit data to the processing circuit whose logical address is "0" (the corresponding physical address is "0"); the 1st, 5th, 9th and 13th 8-bit data to the processing circuit whose logical address is "1" (the corresponding physical address is "2"); the 2nd, 6th, 10th and 14th 8-bit data to the processing circuit whose logical address is "2" (the corresponding physical address is "3"); and the 3rd, 7th, 11th and 15th 8-bit data to the processing circuit whose logical address is "3" (the corresponding physical address is "1"). Therefore, the mapping relationship between the logical address and the physical address of the final data is: (15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0) -> (14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0).
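Under the mapping just described, the rearrangement can be sketched as follows (a hedged Python illustration; `distribute` and its argument names are ours, not the disclosure's). Bytes are listed lowest-first, chunk i of the chosen width goes to the circuit whose logical address is i mod 4, and the final layout is read back in physical-address order:

```python
def distribute(data_bytes, logical_of_physical, width_bytes):
    """Split data into width_bytes chunks, send chunk i to the processing
    circuit with logical address i mod n, then read the resulting layout
    back in physical-address order (lowest unit first)."""
    n = len(logical_of_physical)
    chunks = [data_bytes[i:i + width_bytes]
              for i in range(0, len(data_bytes), width_bytes)]
    per_logical = {addr: [] for addr in range(n)}
    for i, chunk in enumerate(chunks):
        per_logical[i % n].extend(chunk)   # earlier chunks sit lower
    layout = []
    for phys in range(n):
        layout.extend(per_logical[logical_of_physical[phys]])
    return layout

ring = [0, 3, 1, 2]     # physical addresses 0..3 hold logical addresses 0,3,1,2
data = list(range(16))  # 16 bytes; byte k carries the value k
print(distribute(data, ring, 4))  # the 32-bit case
```

Reversing the returned list (to read high bits first) reproduces the 32-bit mapping (11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0) given above; widths of 2 and 1 bytes give the 16-bit and 8-bit distributions in the same way.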
  • Figure 8b shows that eight sequentially numbered processing circuits 0 to 7 are connected to form a closed loop, and the physical addresses of the eight processing circuits are 0-1-2-3-4-5-6-7.
  • the lower diagram of Fig. 8b shows that the logical addresses of the aforementioned eight processing circuits are 0-7-1-6-2-5-3-4.
  • a processing circuit with a physical address of "6" shown in the upper diagram of Figure 8b corresponds to a logical address of "3" shown in the lower diagram of Figure 8b.
  • the operations of the pre-operation circuit for rearranging the data and then transmitting the data to the corresponding processing circuits are similar to those in FIG. 8a, so the technical solutions described in conjunction with FIG. 8a are also applicable to FIG. 8b, and the above data rearrangement process will not be repeated here.
  • the connection relationship of the processing circuits shown in FIG. 8b is similar to that shown in FIG. 8a, but the eight processing circuits shown in FIG. 8b are twice the number of processing circuits shown in FIG. 8a.
  • the granularity of the operational data described in conjunction with FIG. 8b may be twice that of the operational data described in conjunction with FIG. 8a.
  • the granularity of the operation data in this example can be the lower 256 bits of the input data, for example, the original data sequence "31, 30, ..., 1, 0", in which each number corresponds to 8 bits of data.
  • the figures also show the arrangement results of the data in the looped processing circuits.
  • when the data bit width of the operation is 32 bits, one 32-bit data in the processing circuit whose logical address is "1" is (7, 6, 5, 4), and the corresponding physical address of this processing circuit is "2".
  • when the data bit width of the operation is 16 bits, the two 16-bit data in the processing circuit whose logical address is "3" are (23, 22, 7, 6), and the corresponding physical address of the processing circuit is "6".
  • when the data bit width of the operation is 8 bits, the four 8-bit data in the processing circuit whose logical address is "6" are (30, 22, 14, 6), and the corresponding physical address of the processing circuit is "3".
  • FIG. 8c shows that twenty multi-type processing circuits numbered in the order of 0, 1 . . . 19 are connected to form a closed loop (the numbers shown in the figure are the physical addresses of the processing circuits). The sixteen processing circuits numbered 0 to 15 are processing circuits of the first type (that is, forming one processing circuit sub-array of the present disclosure), and the four processing circuits numbered 16 to 19 are processing circuits of the second type (that is, forming another processing circuit sub-array of the present disclosure). Similarly, the physical address of each of the twenty processing circuits has a mapping relationship with the logical address of the corresponding processing circuit shown in the lower figure of FIG. 8c.
  • FIG. 8c also shows the result of operating the aforementioned original data for different data types supported by the processing circuit.
  • when the data bit width of the operation is 32 bits, one 32-bit data in the processing circuit whose logical address is "1" is (7, 6, 5, 4), and the corresponding physical address of this processing circuit is "2".
  • when the data bit width of the operation is 16 bits, the two 16-bit data in the processing circuit whose logical address is "11" are (63, 62, 23, 22), and the corresponding physical address of the processing circuit is "9".
  • when the data bit width of the operation is 8 bits, the four 8-bit data in the processing circuit whose logical address is "17" are (77, 57, 37, 17), and the corresponding physical address of the processing circuit is "18".
  • FIGS. 9a, 9b, 9c and 9d are schematic diagrams illustrating data splicing operations performed by a pre-processing circuit according to an embodiment of the present disclosure.
  • the pre-processing circuit described in the present disclosure in conjunction with FIG. 2 can also be configured to select a data splicing mode from a plurality of data splicing modes according to the parsed instruction, so as to perform a splicing operation on the two input data.
  • the solution of the present disclosure divides and numbers the two data to be spliced according to the minimum data unit, and then extracts different minimum data units of the data based on specified rules to form different data splicing modes.
  • the minimum data unit here can simply be 1-bit data, or data of 2, 4, 8, 16 or 32 bits in length.
  • the scheme of the present disclosure can either extract alternately one minimum data unit at a time, or extract in multiples of the minimum data unit, for example, alternately extracting from the two data a group of two (or one or three) minimum data units at a time.
  • the input data are In1 and In2, and when each square in the figure represents a minimum data unit, both input data have a bit width length of 8 minimum data units.
  • the minimum data unit may represent different numbers of bits. For example, for data with a bit width of 8 bits, the minimum data unit represents 1-bit data, and for data with a bit width of 16 bits, the minimum data unit represents 2-bit data. For another example, for data with a bit width of 32 bits, the minimum data unit represents 4-bit data.
  • the two input data In1 and In2 to be spliced are each composed of eight minimum data units sequentially numbered 1, 2, . . . , 8 from right to left.
  • Data splicing is performed according to the principle of parity interleaving: numbers from small to large, In1 before In2, and odd numbers before even numbers.
  • the data bit width of the operation is 8 bits
  • the data In1 and In2 each represent one 8-bit data
  • each minimum data unit represents 1-bit data (ie, one square represents 1-bit data).
  • the minimum data units numbered 1, 3, 5 and 7 of the data In1 are first extracted and arranged in the lower order.
  • the data In1 and In2 each represent a 16-bit data, and each minimum data unit at this time represents 2-bit data (ie, a square represents a 2-bit data).
  • the minimum data units numbered 1, 2, 5 and 6 of the data In1 can be extracted first and arranged at the lower-order positions. Then, the minimum data units numbered 1, 2, 5 and 6 of the data In2 are arranged in sequence. Similarly, the minimum data units numbered 3, 4, 7 and 8 of the data In1 and then those of the data In2 are arranged in sequence, forming one 32-bit or two 16-bit new data composed of the final 16 minimum data units, as shown in the second row of squares in Figure 9b.
  • the data In1 and In2 each represent a 32-bit data
  • each minimum data unit represents 4-bit data (ie, a square represents a 4-bit data).
  • according to the data bit width and the aforementioned interleaving principle, the minimum data units numbered 1, 2, 3 and 4 of the data In1 and then those of the data In2 can be extracted and arranged at the lower-order positions. Then, the minimum data units numbered 5, 6, 7 and 8 of the data In1 and then those of the data In2 are extracted and arranged in sequence, thereby splicing to form one 64-bit or two 32-bit new data consisting of the final 16 minimum data units.
  • Exemplary data splicing manners of the present disclosure are described above in conjunction with FIGS. 9a-9c. However, it can be understood that in some computing scenarios, data splicing does not involve the above-mentioned staggered arrangement, but is only a simple arrangement of two data with their original data positions kept unchanged, as shown in Figure 9d. It can be seen from Figure 9d that the two data In1 and In2 are not staggered as in Figures 9a-9c; instead, only the last minimum data unit of the data In1 and the first minimum data unit of In2 are concatenated to obtain a new data type with an increased (e.g., doubled) bit width.
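The interleaved splicing of FIGS. 9a-9c can be captured by one parameterized routine (an illustrative Python sketch; the function and parameter names are assumptions). Each input is split into elements of `elem_units` minimum data units (1, 2 or 4 for the 8-, 16- and 32-bit cases respectively), and the output takes the odd-numbered elements of In1, then those of In2, followed by the even-numbered elements of each:

```python
def interleave_splice(in1, in2, elem_units):
    """Parity-interleaved splice of two inputs given as lists of minimum
    data units (lowest unit first). Elements are runs of elem_units units;
    order: odd elements of in1, odd of in2, even of in1, even of in2."""
    def elements(data):
        return [data[i:i + elem_units] for i in range(0, len(data), elem_units)]
    e1, e2 = elements(in1), elements(in2)
    out = []
    for group in (e1[0::2], e2[0::2], e1[1::2], e2[1::2]):
        for element in group:
            out.extend(element)
    return out

in1 = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]
in2 = ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"]
print(interleave_splice(in1, in2, 1))  # the 8-bit case of FIG. 9a
```

With elem_units=1 the output starts a1, a3, a5, a7, b1, b3, b5, b7, matching FIG. 9a; elem_units=2 and 4 give the FIG. 9b and FIG. 9c patterns.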
  • the solution of the present disclosure can also perform group stitching based on data attributes. For example, neuron data or weight data with the same feature map can be formed into a group and then arranged to form a continuous part of the spliced data.
  • FIGS. 10a, 10b and 10c are schematic diagrams illustrating data compression operations performed by post-processing circuits according to embodiments of the present disclosure.
  • the compressing operation may include filtering the data with a mask or compressing by comparing a given threshold with the size of the data.
  • the data can be divided and numbered in minimum data units as previously described. Similar to that described in connection with Figures 9a-9d, the minimum data unit may be, for example, 1-bit data, or data of 2, 4, 8, 16 or 32 bits in length. Exemplary descriptions of different data compression modes are given below in conjunction with Figures 10a to 10c.
  • the original data consists of eight squares (ie, eight minimum data units) sequentially numbered 1, 2..., 8 from right to left, assuming that each minimum data unit can represent 1 bit data.
  • the post-processing circuit may filter the original data by using the mask to perform the data compression operation.
  • the bit width of the mask corresponds to the number of minimum data units of the original data. For example, if the aforementioned original data has 8 minimum data units, the bit width of the mask is 8 bits, where the minimum data unit numbered 1 corresponds to the lowest bit of the mask, the minimum data unit numbered 2 corresponds to the next lowest bit, and so on, with the minimum data unit numbered 8 corresponding to the most significant bit of the mask.
  • the compression principle may be set to extract the smallest data unit in the original data corresponding to the data bit whose mask is "1".
  • the numbers of the smallest data units corresponding to the mask value "1" are 1, 2, 5, and 8.
  • the minimum data units numbered 1, 2, 5 and 8 can be extracted and arranged in order from low to high to form new compressed data, as shown in the second row of Figure 10a.
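A minimal sketch of the mask-filtering mode (illustrative Python; `mask_compress` is our name, not the disclosure's): bit 0 of the mask gates the unit numbered 1, bit 7 gates the unit numbered 8, and only units whose mask bit is 1 are kept, packed from the low end:

```python
def mask_compress(units, mask):
    """Keep the minimum data units whose corresponding mask bit is 1.
    units[0] is the lowest unit (numbered 1) and maps to mask bit 0."""
    return [u for i, u in enumerate(units) if (mask >> i) & 1]

units = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"]
# Mask 0b10010011 selects the units numbered 1, 2, 5 and 8, as in FIG. 10a.
print(mask_compress(units, 0b10010011))
```

A mask of all ones reproduces the non-compressed (pass-through) mode of FIG. 10b.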
  • Fig. 10b shows the original data similar to Fig. 10a, and it can be seen from the second row of Fig. 10b that the data sequence passed through the post-processing circuit maintains the original data arrangement order and content. It will thus be appreciated that the data compression of the present disclosure may also include a disabled mode or a non-compressed mode so that no compression operation is performed when the data passes through the post-processing circuit.
  • the original data consists of eight squares arranged in sequence, the number above each square indicates its number, in the order 1, 2...8 from right to left, and it is assumed that each minimum data unit is 8-bit data. Further, the number in each square represents the decimal value of that minimum data unit. Taking the minimum data unit numbered 1 as an example, its decimal value is "8", and the corresponding 8-bit data is "00001000".
  • assuming that the threshold is the decimal value "8", the compression principle can be set to extract all minimum data units in the original data whose values are greater than or equal to the threshold "8".
  • the smallest data units numbered 1, 4, 7 and 8 can be extracted.
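The threshold mode of FIG. 10c can be sketched the same way (illustrative Python; the helper name is an assumption): every minimum data unit whose value is greater than or equal to the threshold is extracted, preserving the low-to-high order:

```python
def threshold_compress(units, threshold):
    """Keep the minimum data units whose values are >= threshold,
    preserving their original low-to-high order."""
    return [u for u in units if u >= threshold]

# With threshold 8, only the units holding values >= 8 survive.
print(threshold_compress([8, 3, 5, 9, 2, 7, 10, 12], 8))
```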
  • FIG. 11 is a simplified flowchart illustrating a method 1100 of using a computing device to perform arithmetic operations in accordance with an embodiment of the present disclosure.
  • the computing device here may be the computing device described in conjunction with FIGS. 1 (including FIGS. 1a and 1b) to 4, which has the processing circuit connection relationships shown in FIGS. 5 to 10 and supports the operations of the additional types described above.
  • the method 1100 receives a calculation instruction at the computing device, and parses it to obtain a plurality of operation instructions.
  • the operand of the calculation instruction includes a descriptor for indicating the shape of the tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand.
  • the method 1100 utilizes the plurality of processing circuit sub-arrays to perform multi-threaded operations, wherein at least one processing circuit sub-array of the plurality of processing circuit sub-arrays is configured to execute at least one operation instruction of the plurality of operation instructions according to the storage address.
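As a hedged illustration of how a descriptor's parameters could resolve an operand's storage address (the disclosure does not fix this formula here; `base`, `offsets`, `strides` and `elem_size` are assumed names), a row-major-style calculation might look like:

```python
def tensor_element_address(base, offsets, strides, elem_size):
    """Resolve a storage address from descriptor-style parameters:
    the base address of the data reference point plus, per dimension,
    offset * stride scaled by the element size in bytes."""
    addr = base
    for offset, stride in zip(offsets, strides):
        addr += offset * stride * elem_size
    return addr

# Element (2, 3) of a 4-byte-element tensor whose rows are 16 elements wide.
print(hex(tensor_element_address(0x1000, (2, 3), (16, 1), 4)))
```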
  • FIG. 12 is a structural diagram illustrating a combined processing apparatus 1200 according to an embodiment of the present disclosure.
  • the combined processing device 1200 includes a computing processing device 1202 , an interface device 1204 , other processing devices 1206 and a storage device 1208 .
  • one or more computing devices 1210 may be included in the computing processing device, and the computing devices may be configured to perform the operations described herein in conjunction with FIG. 1 to FIG. 11 .
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • multiple computing devices are implemented as an artificial intelligence processor core or a part of the hardware structure of an artificial intelligence processor core, for the computing processing device of the present disclosure, it can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including but not limited to data movement and starting and/or stopping of the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 1302 shown in FIG. 13 ).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 12 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 1306 shown in FIG. 13 ).
  • the relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card or a wifi interface.
  • the chip may also include other processing units (such as video codecs) and interface modules (such as DRAM interfaces).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 13 .
  • FIG. 13 is a schematic structural diagram illustrating a board 1300 according to an embodiment of the present disclosure.
  • the board includes a storage device 1304 for storing data, which includes one or more storage units 1310 .
  • the storage device can be connected and data transferred with the control device 1308 and the chip 1302 described above through, for example, a bus.
  • the board also includes an external interface device 1306, which is configured for data relay or transfer function between the chip (or a chip in a chip package structure) and an external device 1312 (such as a server or a computer, etc.).
  • the data to be processed can be transmitted to the chip by an external device through an external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or device which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or a plurality of the above-mentioned combined processing devices.
  • the electronic devices or devices of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or device of the present disclosure can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or device with high computing power according to the solution of the present disclosure can be applied to a cloud device (eg, a cloud server), while the electronic device or device with low power consumption can be applied to a terminal device and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, thereby completing unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also places different emphases in the description of some embodiments. In view of this, those skilled in the art can understand that for the parts not described in detail in a certain embodiment of the present disclosure, reference may also be made to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (eg, a computer-readable storage medium), the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a U disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, a CD, or another medium that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), ROM, RAM, etc.
  • a computing device comprising:
  • a processing circuit array consisting of a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, wherein the processing circuit array is configured into a plurality of processing circuit sub-arrays and, in response to receiving a plurality of operation instructions, performs multi-threaded operations,
  • the plurality of operation instructions are obtained by parsing a calculation instruction received by the computing device, and wherein an operand of the calculation instruction includes a descriptor for indicating a shape of a tensor, the descriptor being used to determine the storage address of the data corresponding to the operand,
  • wherein at least one processing circuit sub-array is configured to execute at least one operation instruction of the plurality of operation instructions according to the storage address.
  • Clause 2 The computing device of clause 1, wherein the computing instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
  • Clause 3 The computing device of clause 2, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
  • Clause 4 The computing device of Clause 3, wherein the address parameter of the tensor data comprises a base address of a data base point of the descriptor in the data storage space of the tensor data.
  • Clause 5 The computing device of clause 4, wherein the shape parameter of the tensor data comprises at least one of the following: a size of the data storage space in at least one of N dimension directions, a size of the storage area of the tensor data in at least one of the N dimension directions, an offset of the storage area in at least one of the N dimension directions, positions of at least two vertices at diagonal positions in the N dimension directions relative to the data base point, and a mapping relationship between a data description position of the tensor data indicated by the descriptor and a data address.
  • Clause 6 The computing device of clause 1, wherein opcodes of the calculation instruction represent a plurality of operations to be performed by the processing circuit array, the computing device further comprising a control circuit configured to obtain the calculation instruction and parse it to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcodes, and wherein, when the operand of the calculation instruction includes the descriptor, the control circuit is configured to determine the storage address of the data corresponding to the operand according to the descriptor.
  • Clause 7 The computing device of Clause 6, wherein the control circuit configures the processing circuit array according to the plurality of operation instructions to obtain the plurality of processing circuit sub-arrays.
  • Clause 8 The computing device of clause 7, wherein the control circuit includes a register for storing configuration information, and the control circuit extracts corresponding configuration information according to the plurality of operation instructions and configures the processing circuit array according to the configuration information to obtain the plurality of processing circuit sub-arrays.
  • Clause 9 The computing device of clause 1, wherein the plurality of operation instructions include at least one multi-stage pipeline operation, and one multi-stage pipeline operation includes at least two operation instructions.
  • Clause 10 The computing device of Clause 1, wherein the operation instruction includes a predicate, and each of the processing circuits determines whether to execute the operation instruction associated therewith according to the predicate.
  • Clause 11 The computing device of clause 1, wherein the processing circuit array is a one-dimensional array, and one or more processing circuits in the processing circuit array are configured as one of the processing circuit sub-arrays.
  • Clause 12 The computing device of clause 1, wherein the processing circuit array is a two-dimensional array, and wherein: one or more rows of processing circuits in the processing circuit array are configured as one of the processing circuit sub-arrays; or one or more columns of processing circuits in the processing circuit array are configured as one of the processing circuit sub-arrays; or one or more lines of processing circuits along a diagonal direction in the processing circuit array are configured as one of the processing circuit sub-arrays.
  • Clause 13 The computing device of clause 12, wherein the processing circuits located in the two-dimensional array are configured to be connected, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, or the same diagonal.
  • Clause 14 The computing device of clause 13, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
  • Clause 15 The computing device of clause 1, wherein the array of processing circuits is a three-dimensional array, and a three-dimensional sub-array or sub-arrays in the array of processing circuits are configured as one of the sub-arrays of processing circuits.
  • Clause 16 The computing device of clause 15, wherein the three-dimensional array is a three-dimensional array composed of a plurality of layers, each layer including a two-dimensional array of a plurality of the processing circuits arranged in row, column, and diagonal directions, and wherein the processing circuits located in the three-dimensional array are configured to be connected, in at least one of their row, column, diagonal, and layer directions and in a predetermined three-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, the same diagonal, or on different layers.
  • Clause 17 The computing device of clause 16, wherein the predetermined three-dimensional spacing pattern is associated with the number of spaced processing circuits and the number of spaced layers between the processing circuits to be connected.
  • Clause 18 The computing device of any of clauses 11-17, wherein a plurality of processing circuits in the sub-array of processing circuits form one or more closed loops.
  • Clause 19 The computing device of clause 1, wherein each of the processing circuit sub-arrays is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
  • Clause 20 The computing device of clause 1, further comprising a data manipulation circuit including a pre-operation circuit and/or a post-operation circuit, wherein the pre-operation circuit is configured to perform preprocessing of input data of at least one of the operation instructions, and the post-operation circuit is configured to perform post-processing of output data of the at least one operation instruction.
  • Clause 21 The computing device of clause 20, wherein the preprocessing includes data placement and/or table lookup operations, and the postprocessing includes data type conversion and/or compression operations.
  • Clause 22 The computing device of clause 21, wherein the data placement comprises correspondingly splitting or merging the input data and/or the output data according to the data type of the input data and/or the output data of the operation instruction, and passing the result to the corresponding processing circuits for operation.
  • Clause 23 An integrated circuit chip comprising the computing device of any of clauses 1-22.
  • Clause 24 A board card comprising the integrated circuit chip of clause 23.
  • Clause 25 An electronic device comprising the integrated circuit chip of clause 23.
  • Clause 26 A method of performing computations using a computing device, wherein the computing device comprises a processing circuit array formed by a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, and the processing circuit array is configured into a plurality of processing circuit sub-arrays, the method comprising: receiving a calculation instruction at the computing device and parsing it to obtain a plurality of operation instructions, wherein an operand of the calculation instruction includes a descriptor for indicating a shape of a tensor, the descriptor being used to determine a storage address of data corresponding to the operand; and in response to receiving the plurality of operation instructions, performing a multi-threaded operation using the plurality of processing circuit sub-arrays, wherein at least one processing circuit sub-array of the plurality of processing circuit sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.
  • Clause 27 The method of clause 26, wherein the calculation instruction includes an identification of a descriptor and/or content of the descriptor, the content of the descriptor including at least one shape parameter representing the shape of the tensor data.
  • Clause 28 The method of clause 27, wherein the content of the descriptor further comprises at least one address parameter representing an address of tensor data.
  • Clause 29 The method of clause 28, wherein the address parameter of the tensor data comprises a base address of a data base point of the descriptor in the data storage space of the tensor data.
  • Clause 30 The method of clause 29, wherein the shape parameter of the tensor data comprises at least one of: a size of the data storage space in at least one of N dimension directions, a size of the storage area of the tensor data in at least one of the N dimension directions, an offset of the storage area in at least one of the N dimension directions, positions of at least two vertices at diagonal positions in the N dimension directions relative to the data base point, and a mapping relationship between a data description position of the tensor data indicated by the descriptor and a data address, where N is an integer greater than or equal to zero.
  • Clause 31 The method of clause 26, wherein the opcodes of the computing instructions represent a plurality of operations performed by the array of processing circuits, the computing device further comprising a control circuit, the method comprising utilizing the control circuit to obtain the calculation instruction and parse the calculation instruction to obtain the multiple operation instructions corresponding to the multiple operations indicated by the operation code.
  • Clause 32 The method of clause 31, wherein the array of processing circuits is configured with the control circuit according to the plurality of operational instructions to obtain the plurality of sub-arrays of processing circuits.
  • Clause 33 The method of clause 32, wherein the control circuit includes a register for storing configuration information, and the method includes utilizing the control circuit to extract corresponding configuration information according to the plurality of operation instructions, and according to the The configuration information is used to configure the processing circuit array to obtain the plurality of processing circuit sub-arrays.
  • Clause 34 The method of Clause 26, wherein the plurality of operation instructions includes at least one multi-stage pipeline operation, and the one multi-stage pipeline operation includes at least two operation instructions.
  • Clause 35 The method of clause 26, wherein the operation instruction includes a predicate, and the method further comprises utilizing each of the processing circuits to determine whether to execute the operation instruction associated therewith based on the predicate.
  • Clause 36 The method of clause 26, wherein the processing circuit array is a one-dimensional array, and the method comprises configuring one or more processing circuits in the processing circuit array as one of the processing circuit sub-arrays.
  • Clause 37 The method of clause 26, wherein the processing circuit array is a two-dimensional array, and the method further comprises: configuring one or more rows of processing circuits in the processing circuit array as one of the processing circuit sub-arrays; or configuring one or more columns of processing circuits in the processing circuit array as one of the processing circuit sub-arrays; or configuring one or more lines of processing circuits along a diagonal direction of the processing circuit array as one of the processing circuit sub-arrays.
  • Clause 38 The method of clause 37, wherein the processing circuits located in the two-dimensional array are configured to be connected, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, or the same diagonal.
  • Clause 39 The method of clause 38, wherein the predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.
  • Clause 40 The method of clause 26, wherein the processing circuit array is a three-dimensional array, and the method comprises configuring one or more three-dimensional sub-arrays in the processing circuit array as one of the processing circuit sub-arrays.
  • Clause 41 The method of clause 40, wherein the three-dimensional array comprises a three-dimensional array composed of a plurality of layers, each layer including a two-dimensional array of a plurality of the processing circuits arranged in row, column, and diagonal directions, and the method comprises: configuring the processing circuits located in the three-dimensional array to be connected, in at least one of their row, column, diagonal, and layer directions and in a predetermined three-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, the same diagonal, or on different layers.
  • Clause 42 The method of clause 41, wherein the predetermined three-dimensional spacing pattern is associated with the number of spaced processing circuits and the number of spaced layers between the processing circuits to be connected.
  • Clause 43 The method of any of clauses 36-42, wherein the plurality of processing circuits in the sub-array of processing circuits form one or more closed loops.
  • Clause 44 The method of clause 26, wherein each of the processing circuit sub-arrays is adapted to perform at least one of the following operations: arithmetic operations, logical operations, comparison operations, and table lookup operations.
  • Clause 45 The method of clause 26, wherein the computing device further comprises a data manipulation circuit including a pre-operation circuit and/or a post-operation circuit, and the method comprises performing preprocessing of input data of at least one of the operation instructions using the pre-operation circuit, and/or performing post-processing of output data of the at least one operation instruction using the post-operation circuit.
  • Clause 46 The method of clause 45, wherein the preprocessing comprises operations for data placement and/or table lookup, and the postprocessing comprises data type conversion and/or compression operations.
  • Clause 47 The method of clause 46, wherein the data placement comprises correspondingly splitting or merging the input data and/or the output data according to the data type of the input data and/or the output data of the operation instruction, and passing the result to the corresponding processing circuits for operation.


Abstract

A computing device, an integrated circuit chip, a board card, and a method of using the computing device to perform arithmetic operations are provided. The computing device (1210) is included in a combined processing device, which also includes a universal interconnect interface and other processing devices (1206). The computing device (1210) interacts with the other processing devices (1206) to jointly complete a computing operation specified by a user. The combined processing device also includes a storage device (1208), which is connected to the computing device (1210) and the other processing devices (1206) respectively, and is used to store data of the computing device (1210) and the other processing devices (1206).

Description

Computing device, integrated circuit chip, board card, electronic device, and computing method

Cross-Reference to Related Applications

This application claims priority to Chinese Patent Application No. 2020106194580, filed on June 30, 2020, and entitled "Computing Device, Integrated Circuit Chip, Board Card, Electronic Device, and Computing Method", the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates generally to the field of computing. More specifically, the present disclosure relates to a computing device, an integrated circuit chip, a board card, an electronic device, and a computing method.
Background

In a computing system, an instruction set is a set of instructions used to perform computations and to control the computing system, and plays a key role in improving the performance of computing chips (e.g., processors) in the computing system. Various current computing chips (especially chips in the field of artificial intelligence) can use associated instruction sets to complete various general-purpose or special-purpose control operations and data processing operations. However, current instruction sets still suffer from drawbacks in many respects. For example, existing instruction sets are limited by the hardware architecture and perform poorly in terms of flexibility. Further, many instructions can only complete a single operation, while executing multiple operations usually requires multiple instructions, which potentially increases on-chip I/O data throughput. In addition, current instructions still have room for improvement in execution speed, execution efficiency, and the power consumption they impose on the chip.

In addition, the arithmetic instructions of a traditional CPU are designed to perform basic single-data scalar operations. Here, a single-data operation means that each operand of an instruction is a scalar datum. However, in tasks such as image processing and pattern recognition, the operands are often data types of multi-dimensional vectors (i.e., tensor data), and using only scalar operations cannot make the hardware complete computing tasks efficiently. Therefore, how to efficiently perform multi-dimensional tensor operations is also an urgent problem to be solved in the current computing field.
Summary

In order to at least solve the above problems in the prior art, the present disclosure provides a hardware architecture having a processing circuit array. By using this hardware architecture to execute calculation instructions, the solution of the present disclosure can obtain technical advantages in multiple aspects, including enhancing the processing performance of the hardware, reducing power consumption, improving the execution efficiency of computing operations, and avoiding computational overhead. Further, on the basis of the aforementioned hardware architecture, the solution of the present disclosure supports efficient memory access and processing of tensor data, thereby accelerating tensor operations and reducing the computational overhead brought by tensor operations when a calculation instruction includes multi-dimensional vector operands.

In a first aspect, the present disclosure provides a computing device, including: a processing circuit array formed by a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, where the processing circuit array is configured into a plurality of processing circuit sub-arrays and, in response to receiving a plurality of operation instructions, performs multi-threaded operations, where the plurality of operation instructions are obtained by parsing a calculation instruction received by the computing device, an operand of the calculation instruction includes a descriptor for indicating the shape of a tensor, the descriptor is used to determine the storage address of the data corresponding to the operand, and at least one processing circuit sub-array is configured to execute at least one of the plurality of operation instructions according to the storage address.

In a second aspect, the present disclosure provides an integrated circuit chip, which includes the computing device described above and to be described in the multiple embodiments below.

In a third aspect, the present disclosure provides a board card, which includes the integrated circuit chip described above and to be described in the multiple embodiments below.

In a fourth aspect, the present disclosure provides an electronic device, which includes the integrated circuit chip described above and to be described in the multiple embodiments below.

In a fifth aspect, the present disclosure provides a method of performing computations using the aforementioned computing device, where the computing device includes a processing circuit array formed by a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, and the processing circuit array is configured into a plurality of processing circuit sub-arrays, the method including: receiving a calculation instruction at the computing device and parsing it to obtain a plurality of operation instructions, where an operand of the calculation instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand; and in response to receiving the plurality of operation instructions, performing multi-threaded operations using the plurality of processing circuit sub-arrays, where at least one processing circuit sub-array of the plurality of processing circuit sub-arrays is configured to execute at least one of the plurality of operation instructions according to the storage address.

By using the above computing device, integrated circuit chip, board card, electronic device, and method of the present disclosure, an appropriate processing circuit array can be built according to computing requirements, so that calculation instructions can be executed efficiently, computational overhead can be reduced, and I/O data throughput can be decreased. In addition, since the processing circuits of the present disclosure can be configured to support corresponding operations according to operation requirements, the number of operands of the calculation instruction of the present disclosure can be increased or decreased according to operation requirements, and the types of opcodes can also be arbitrarily selected and combined from the operation types supported by the processing circuit matrix, thereby expanding the application scenarios and adaptability of the hardware architecture.
Brief Description of the Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and identical or corresponding reference numerals indicate identical or corresponding parts, in which:

Fig. 1a is a block diagram illustrating a computing device according to an embodiment of the present disclosure;

Fig. 1b is a schematic diagram illustrating a data storage space according to an embodiment of the present disclosure;

Fig. 2a is a block diagram illustrating a computing device according to another embodiment of the present disclosure;

Fig. 2b is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;

Fig. 3 is a block diagram illustrating a computing device according to yet another embodiment of the present disclosure;

Fig. 4 is an example structural diagram illustrating multiple types of processing circuit arrays of a computing device according to an embodiment of the present disclosure;

Figs. 5a, 5b, 5c, and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to embodiments of the present disclosure;

Figs. 6a, 6b, 6c, and 6d are schematic diagrams illustrating further connection relationships of a plurality of processing circuits according to embodiments of the present disclosure;

Figs. 7a, 7b, 7c, and 7d are schematic diagrams illustrating various loop structures of processing circuits according to embodiments of the present disclosure;

Figs. 8a, 8b, and 8c are schematic diagrams illustrating further loop structures of processing circuits according to embodiments of the present disclosure;

Figs. 9a, 9b, 9c, and 9d are schematic diagrams illustrating data splicing operations performed by a pre-operation circuit according to embodiments of the present disclosure;

Figs. 10a, 10b, and 10c are schematic diagrams illustrating data compression operations performed by a post-operation circuit according to embodiments of the present disclosure;

Fig. 11 is a simplified flowchart illustrating a method of performing operations using a computing device according to an embodiment of the present disclosure;

Fig. 12 is a structural diagram illustrating a combined processing device according to an embodiment of the present disclosure; and

Fig. 13 is a schematic structural diagram illustrating a board card according to an embodiment of the present disclosure.
Detailed Description

The solution of the present disclosure provides a hardware architecture that supports multi-threaded operations. When this hardware architecture is implemented in a computing device, the computing device includes at least a plurality of processing circuits, where the plurality of processing circuits are connected according to different configurations to form a one-dimensional or multi-dimensional array structure. Depending on the implementation, the processing circuit array can be configured into a plurality of processing circuit sub-arrays, and each processing circuit sub-array can be configured to execute at least one of a plurality of operation instructions. When tensor operations are involved, an operand of the calculation instruction of the present disclosure may include a descriptor for indicating the shape of a tensor, and the descriptor may be used to determine the storage address of the data (e.g., a tensor) corresponding to the operand, so that a processing circuit sub-array can read and save tensor data according to the storage address in order to perform tensor-related operations. With the aid of the hardware architecture and operation instructions of the present disclosure, computing operations including tensor operations can be executed efficiently, expanding the application scenarios of computing and reducing computational overhead.

The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
Fig. 1a is a block diagram illustrating a computing device 80 according to an embodiment of the present disclosure. As shown in Fig. 1a, the computing device 80 includes a processing circuit array formed by a plurality of processing circuits 104. Specifically, the plurality of processing circuits are connected in a two-dimensional array structure to form the processing circuit array, which includes a plurality of processing circuit sub-arrays, for example the plurality of one-dimensional processing circuit sub-arrays M1, M2, ..., Mn shown in the figure. It should be understood that the two-dimensional processing circuit array here and the plurality of one-dimensional processing circuit sub-arrays it includes are merely exemplary and not limiting; depending on the computing scenario, the processing circuit array of the present disclosure can be configured into array structures of different dimensions, and one or more closed loops can be formed within a processing circuit sub-array or among multiple processing circuit sub-arrays, as in the exemplary connections shown in Figs. 5-8 described later.

In one embodiment, in response to receiving a plurality of operation instructions, the processing circuit array of the present disclosure can be configured to perform multi-threaded operations, for example to execute single-instruction multiple-thread ("SIMT") instructions. Further, each processing circuit sub-array can be configured to execute at least one of the aforementioned plurality of operation instructions. In the context of the present disclosure, the aforementioned plurality of operation instructions may be micro-instructions or control signals running inside the computing device (or processing circuit, or processor), and may include (or indicate) one or more arithmetic operations to be performed by the computing device. Depending on the computing scenario, the arithmetic operations may include, but are not limited to, addition, multiplication, convolution, pooling, and various other operations, and these operations may also involve tensor operations. To this end, an operand of the calculation instruction of the present disclosure may include a descriptor for indicating the shape of a tensor. By using the storage address determined by the descriptor, the one or more processing circuit sub-arrays executing the operation instructions can quickly access the one or more tensors (or tensor data) to be used in the operation.
In one embodiment, the above plurality of operation instructions may include at least one multi-stage pipeline operation. In one scenario, one such multi-stage pipeline operation may include at least two operation instructions. Depending on the execution requirements, an operation instruction of the present disclosure may include a predicate, and each processing circuit determines, according to the predicate, whether to execute the operation instruction associated with it. Depending on its configuration, a processing circuit of the present disclosure can flexibly execute various kinds of operations, including but not limited to arithmetic operations, logical operations, comparison operations, and table lookup operations.
Take as an example the processing circuit matrix shown in Fig. 1a, with its M1-Mn processing circuit sub-matrices, executing one n-stage pipeline operation: the processing circuit sub-matrix M1 can act as the first-stage pipeline operation unit of the pipeline, and the processing circuit sub-matrix M2 can act as the second-stage pipeline operation unit. By analogy, the processing circuit sub-matrix Mn can act as the n-th stage pipeline operation unit. In the process of executing the n-stage pipeline operation, the stages can be executed top-down starting from the first-stage pipeline operation unit until the n-stage pipeline operation is completed.
Through the above exemplary description of processing circuit sub-arrays, it can be understood that the processing circuit array of the present disclosure can in some scenarios be a one-dimensional array, with one or more processing circuits in the processing circuit array configured as one processing circuit sub-array. In other scenarios, the processing circuit array of the present disclosure is a two-dimensional array, where one or more rows of processing circuits in the processing circuit array are configured as one processing circuit sub-array; or one or more columns of processing circuits in the processing circuit array are configured as one processing circuit sub-array; or one or more lines of processing circuits along the diagonal direction in the processing circuit array are configured as one processing circuit sub-array.
To implement multi-stage pipeline operations, the present disclosure can also provide corresponding calculation instructions, and configure and build the processing circuit array based on the calculation instruction so as to implement the multi-stage pipeline operation. Depending on the computing scenario, a calculation instruction of the present disclosure may include multiple opcodes, which may represent multiple operations to be executed by the processing circuit array. For example, when n=4 in Fig. 1a (i.e., a 4-stage pipeline operation is executed), a calculation instruction according to the solution of the present disclosure can be expressed as the following formula (1):

Result = convert((((src0 op0 src1) op1 src2) op2 src3) op3 src4)    (1)

Here, src0-src4 are source operands (in some computing scenarios they may, for example, be tensors represented by the descriptors of the present disclosure), op0-op3 are opcodes, and convert denotes a data conversion operation performed on the data obtained after executing opcode op3. Depending on the implementation, the aforementioned data conversion operation may be completed by a processing circuit in the processing circuit array, or be executed by a further operation circuit, for example by the post-operation circuit described later in detail with reference to Fig. 3. According to the solution of the present disclosure, since the processing circuits can be configured to support corresponding operations according to the operation requirements, the number of operands of the calculation instruction of the present disclosure can be increased or decreased according to the operation requirements, and the types of opcodes can also be arbitrarily selected and combined from the operation types supported by the processing circuit matrix.
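As a rough functional sketch (not the hardware implementation), the four-stage pipelined calculation instruction of formula (1) can be modeled as folding five source operands through four opcodes followed by a final conversion. The operator choices and the truncating `convert` below are illustrative assumptions:

```python
# Hypothetical model of formula (1):
# Result = convert((((src0 op0 src1) op1 src2) op2 src3) op3 src4)

def pipeline_compute(src, ops, convert=lambda x: x):
    """Fold the source operands through one pipeline stage per opcode."""
    acc = src[0]
    for op, operand in zip(ops, src[1:]):
        acc = op(acc, operand)  # one pipeline stage (op0 .. op3)
    return convert(acc)         # final data conversion step

# Example: a multiply/add chain ending with truncation to an integer.
result = pipeline_compute(
    [2, 3, 4, 5, 6],
    [lambda a, b: a * b,   # op0: MULT
     lambda a, b: a + b,   # op1: ADD
     lambda a, b: a * b,   # op2: MULT
     lambda a, b: a + b],  # op3: ADD
    convert=int)
# ((2*3 + 4) * 5) + 6 = 56
```

In the hardware described above, each stage would be carried out by one processing circuit sub-array (M1 through M4) rather than by a sequential loop.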
Depending on the application scenario, the connections between the plurality of processing circuits of the present disclosure can either be hardware-based configured connections ("hard connections"), or logically configured connections ("soft connections") established on top of a particular hardware connection through software configuration (for example through a configuration instruction). In one embodiment, the processing circuit array can form a closed loop in at least one dimension direction of the one-dimensional or multi-dimensional directions, i.e., the "looped structure" in the context of the present disclosure.

As mentioned above, the arithmetic operations of the present disclosure also include using descriptors to obtain information about the shape of a tensor so as to determine the storage address of tensor data, and thereby fetching and saving the tensor data through the aforementioned storage address.
In a possible implementation, a descriptor can be used to indicate the shape of N-dimensional tensor data, where N is a positive integer, for example N=1, 2, or 3, or zero. A tensor can contain various forms of data organization, and tensors can have different dimensions: for example, a scalar can be regarded as a 0-dimensional tensor, a vector can be regarded as a 1-dimensional tensor, and a matrix can be a tensor of 2 or more dimensions. The shape of a tensor includes information such as the dimensions of the tensor and the size of each dimension of the tensor. For example, for the tensor:

Figure PCTCN2021095703-appb-000001

the shape of this tensor can be described by the descriptor as (2, 4), i.e., the two parameters indicate that the tensor is a two-dimensional tensor, the size of its first dimension (columns) is 2, and the size of its second dimension (rows) is 4. It should be noted that the present application does not limit the manner in which a descriptor indicates the shape of a tensor.
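As a minimal illustration of the shape description above (the concrete element values are made up here, since the original matrix image is not reproduced), a 2x4 tensor's shape can be captured by the two parameters (2, 4):

```python
# Assumed element values; only the shape (2, 4) matters for the descriptor.
tensor = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
shape = (len(tensor), len(tensor[0]))  # (first-dimension size, second-dimension size)
assert shape == (2, 4)
```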
In a possible implementation, the value of N can be determined according to the number of dimensions (order) of the tensor data, or be set according to the usage needs of the tensor data. For example, when N is 3, the tensor data is three-dimensional tensor data, and the descriptor can be used to indicate the shape (e.g., offsets, sizes, etc.) of the three-dimensional tensor data in its three dimension directions. It should be understood that those skilled in the art can set the value of N according to actual needs, and the present disclosure does not limit this.

In a possible implementation, the descriptor may include an identification of the descriptor and/or the content of the descriptor. The identification of the descriptor is used to distinguish descriptors; for example, the identification may be its number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is 3-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension of the tensor data.
In a possible implementation, the identification and/or content of a descriptor can be stored in a descriptor storage space (internal memory), such as a register, on-chip SRAM, or another media cache. The tensor data indicated by the descriptor can be stored in a data storage space (internal memory or external memory), such as an on-chip cache or off-chip memory. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.

In a possible implementation, the identification and content of a descriptor and the tensor data indicated by the descriptor can be stored in the same region of internal memory. For example, a contiguous region of an on-chip cache with addresses ADDR0-ADDR1023 can be used to store the relevant content of the descriptor. Addresses ADDR0-ADDR63 can serve as the descriptor storage space, storing the identification and content of the descriptor, while addresses ADDR64-ADDR1023 serve as the data storage space, storing the tensor data indicated by the descriptor. Within the descriptor storage space, addresses ADDR0-ADDR31 can store the identification of the descriptor and addresses ADDR32-ADDR63 can store its content. It should be understood that ADDR here is not limited to one bit or one byte; it is used to represent an address and is an address unit. Those skilled in the art can determine the descriptor storage space, the data storage space, and their specific addresses according to the actual situation, and the present disclosure does not limit this.

In a possible implementation, the identification and content of a descriptor and the tensor data indicated by the descriptor can be stored in different regions of internal memory. For example, a register can serve as the descriptor storage space, storing the identification and content of the descriptor, and an on-chip cache can serve as the data storage space, storing the tensor data indicated by the descriptor.

In a possible implementation, when a register is used to store the identification and content of a descriptor, the number of the register can be used to represent the identification of the descriptor. For example, when the number of the register is 0, the identification of the descriptor it stores is set to 0. When the descriptor in the register is valid, a region can be allocated in the cache space for storing the tensor data according to the size of the tensor data indicated by the descriptor.

In a possible implementation, the identification and content of a descriptor can be stored in internal memory, while the tensor data indicated by the descriptor is stored in external memory. For example, the identification and content of the descriptor can be stored on-chip while the tensor data indicated by the descriptor is stored off-chip.
In a possible implementation, the data address of the data storage space corresponding to each descriptor can be a fixed address. For example, a separate data storage space can be partitioned for tensor data, with the start address of each piece of tensor data in the data storage space corresponding one-to-one to a descriptor. In this case, the circuit or module responsible for parsing the calculation instruction (for example, an entity external to the computing device of the present disclosure, or the control circuit 102 shown in Figs. 2-3) can determine, according to the descriptor, the data address in the data storage space of the data corresponding to the operand.

In a possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor can also be used to indicate the address of N-dimensional tensor data, where the content of the descriptor may further include at least one address parameter representing the address of the tensor data. For example, if the tensor data is 3-dimensional and the descriptor points to the address of the tensor data, the content of the descriptor may include one address parameter representing the address of the tensor data, such as the start physical address of the tensor data, or it may include multiple address parameters of the address of the tensor data, such as the start address of the tensor data plus an address offset, or address parameters of the tensor data based on each dimension. Those skilled in the art can set the address parameters according to actual needs, and the present disclosure does not limit this.
In a possible implementation, the address parameter of the tensor data may include a base address of a data base point of the descriptor in the data storage space of the tensor data, where the base address varies with the data base point. The present disclosure does not limit the selection of the data base point.

In a possible implementation, the base address may include the start address of the data storage space. When the data base point of the descriptor is the first data block of the data storage space, the base address of the descriptor is the start address of the data storage space. When the data base point of the descriptor is data other than the first data block in the data storage space, the base address of the descriptor is the address of that data block in the data storage space.

In a possible implementation, the shape parameter of the tensor data includes at least one of the following: the size of the data storage space in at least one of the N dimension directions, the size of the storage region in at least one of the N dimension directions, the offset of the storage region in at least one of the N dimension directions, the positions of at least two vertices at diagonal positions in the N dimension directions relative to the data base point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address. The data description position is the mapped position of a point or region in the tensor data indicated by the descriptor. For example, when the tensor data is 3-dimensional, the descriptor can use three-dimensional space coordinates (x, y, z) to represent the shape of the tensor data, and the data description position of the tensor data can be the position, represented using three-dimensional space coordinates (x, y, z), of a point or region of the tensor data mapped in three-dimensional space.

It should be understood that those skilled in the art can select the shape parameters representing the tensor data according to the actual situation, and the present disclosure does not limit this. By using descriptors in the data access process, associations between data can be established, thereby reducing the complexity of data access and improving instruction processing efficiency.

In a possible implementation, the content of the descriptor of the tensor data can be determined according to the base address of the data base point of the descriptor in the data storage space of the tensor data, the size of the data storage space in at least one of the N dimension directions, the size of the storage region in at least one of the N dimension directions, and/or the offset of the storage region in at least one of the N dimension directions.
Fig. 1b illustrates a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in Fig. 1b, the data storage space 21 stores a piece of two-dimensional data in a row-major manner, which can be represented by (x, y) (where the X axis points horizontally to the right and the Y axis points vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address PA_start (base address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is a portion of the data in the data storage space 21, with its offset 25 in the X-axis direction denoted as offset_x, its offset 24 in the Y-axis direction denoted as offset_y, its size in the X-axis direction denoted as size_x, and its size in the Y-axis direction denoted as size_y.

In a possible implementation, when a descriptor is used to define the data block 23, the data base point of the descriptor can be the first data block of the data storage space 21, and the base address of the descriptor can be agreed to be the start address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can then be determined by combining the size ori_x of the data storage space 21 on the X axis, its size ori_y on the Y axis, the offset offset_y of the data block 23 in the Y-axis direction, the offset offset_x in the X-axis direction, the size size_x in the X-axis direction, and the size size_y in the Y-axis direction.
In a possible implementation, the following formula (2) can be used to represent the content of the descriptor:

Figure PCTCN2021095703-appb-000002

It should be understood that although the content of the descriptor represents a two-dimensional space in the above example, those skilled in the art can set the specific dimensions represented by the content of the descriptor according to the actual situation, and the present disclosure does not limit this.
In a possible implementation, the base address of the data base point of the descriptor in the data storage space can be agreed upon, and on the basis of the base address, the content of the descriptor of the tensor data can be determined according to the positions, relative to the data base point, of at least two vertices at diagonal positions in the N dimension directions.

For example, the base address PA_base of the data base point of the descriptor in the data storage space can be agreed upon. One datum (for example, the datum at position (2, 2)) can be selected in the data storage space 21 as the data base point, and its physical address in the data storage space is used as the base address PA_base. The content of the descriptor of the data block 23 in Fig. 1b can then be determined from the positions of two diagonal vertices relative to the data base point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data base point are determined, for example using the positions of the diagonal vertices in the top-left-to-bottom-right direction relative to the data base point, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the base address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In a possible implementation, the following formula (3) can be used to represent the content of the descriptor (with base address PA_base):

Figure PCTCN2021095703-appb-000003

It should be understood that although the vertices at the two diagonal positions of the top-left and bottom-right corners are used to determine the content of the descriptor in the above example, those skilled in the art can set the specific vertices of the at least two diagonal vertices according to actual needs, and the present disclosure does not limit this.
In a possible implementation, the content of the descriptor of the tensor data can be determined according to the base address of the data base point of the descriptor in the data storage space and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address. The mapping relationship between the data description position and the data address can be set according to actual needs; for example, when the tensor data indicated by the descriptor is three-dimensional spatial data, a function f(x, y, z) can be used to define the mapping relationship between the data description position and the data address.

In a possible implementation, the following formula (4) can be used to represent the content of the descriptor:

Figure PCTCN2021095703-appb-000004
In a possible implementation, the descriptor is further used to indicate the address of N-dimensional tensor data, where the content of the descriptor further includes at least one address parameter representing the address of the tensor data; for example, the content of the descriptor can be:

D:

Figure PCTCN2021095703-appb-000005

where PA is an address parameter. The address parameter can be a logical address or a physical address. The descriptor parsing circuit can take PA as any one of a vertex, a midpoint, or a preset point of the vector shape, and obtain the corresponding data address in combination with the shape parameters in the X and Y directions.
In a possible implementation, the address parameter of the tensor data includes a base address of a data base point of the descriptor in the data storage space of the tensor data, and the base address includes the start address of the data storage space.

In a possible implementation, the descriptor may further include at least one address parameter representing the address of the tensor data; for example, the content of the descriptor can be:

D:

Figure PCTCN2021095703-appb-000006

where PA_start is the base address parameter, which will not be described again here.

It should be understood that those skilled in the art can set the mapping relationship between the data description position and the data address according to the actual situation, and the present disclosure does not limit this.
In a possible implementation, an agreed base address can be set within one task, and the descriptors in the instructions under this task all use this base address; the content of a descriptor can include shape parameters based on this base address. This base address can be determined by setting environment parameters for the task. For the relevant description and usage of the base address, refer to the above embodiments. In this implementation, the content of descriptors can be mapped to data addresses more quickly.

In a possible implementation, the base address can be included in the content of each descriptor, so that the base address of each descriptor can differ. Compared with setting a common base address through environment parameters, each descriptor in this manner can describe data more flexibly and use a larger data address space.

In a possible implementation, the data address in the data storage space of the data corresponding to an operand of the processing instruction can be determined according to the content of the descriptor. The computation of the data address is completed automatically by hardware, and when the representation of the content of the descriptor differs, the computation method of the data address also differs. The present disclosure does not limit the specific computation method of the data address.
For example, if the content of the descriptor in the operand is represented using formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y respectively, and its size is size_x * size_y, then the start data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (5):

PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (5)

With the data start address PA1(x,y) determined by the above formula (5), in combination with the offsets offset_x and offset_y and the sizes size_x and size_y of the storage region, the storage region of the tensor data indicated by the descriptor in the data storage space can be determined.

In a possible implementation, when the operand further includes a data description position for the descriptor, the data address in the data storage space of the data corresponding to the operand can be determined according to the content of the descriptor together with the data description position. In this way, a portion of the data (e.g., one or more data) in the tensor data indicated by the descriptor can be processed.

For example, if the content of the descriptor in the operand is represented using formula (2), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y with size size_x * size_y, and the data description position for the descriptor included in the operand is (x_q, y_q), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (6):

PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (6)
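The address computations of formulas (5) and (6) can be sketched directly, assuming the row-major layout of Fig. 1b; the parameter names follow the text, while the concrete numbers below are made up for illustration:

```python
def start_address(pa_start, ori_x, offset_x, offset_y):
    # Formula (5): PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x
    return pa_start + (offset_y - 1) * ori_x + offset_x

def element_address(pa_start, ori_x, offset_x, offset_y, x_q, y_q):
    # Formula (6): PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)
    return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)

PA_start, ori_x = 1000, 16  # assumed base address and row size
print(start_address(PA_start, ori_x, offset_x=4, offset_y=2))   # 1020
print(element_address(PA_start, ori_x, 4, 2, x_q=1, y_q=1))     # 1037
```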
The computing device of the present disclosure has been described above with reference to Figs. 1a and 1b. By using one or more processing circuit arrays in the computing device and relying on the operational functions of the processing circuits, the calculation instructions of the present disclosure can be executed efficiently on the computing device to complete multi-threaded operations, thereby improving the execution efficiency of parallel operations and reducing computational overhead. In addition, by using descriptors to perform tensor-oriented operations, the solution of the present disclosure also significantly improves the access and processing efficiency of tensor data and reduces the overhead of tensor operations.
Fig. 2a is a block diagram illustrating a computing device 100 according to another embodiment of the present disclosure. As can be seen from the figure, in addition to having processing circuits 104 identical to those of the computing device 80, the computing device 100 further includes a control circuit 102. In one embodiment, the control circuit 102 can be configured to obtain the calculation instruction described above and parse it to obtain the plurality of operation instructions corresponding to the plurality of operations represented by the opcodes, for example as expressed in formula (1). In another embodiment, the control circuit configures the processing circuit array according to the plurality of operation instructions to obtain the plurality of processing circuit sub-arrays, such as the processing circuit sub-arrays M1, M2, ..., Mn shown in Fig. 1a.

In one application scenario, the control circuit may include a register for storing configuration information, and the control circuit may extract corresponding configuration information according to the plurality of operation instructions and configure the processing circuit array according to the configuration information to obtain the plurality of processing circuit sub-arrays. In another application scenario, the aforementioned register, or other registers of the control circuit, can be configured to store information about the descriptors of the present disclosure, such as the identification and/or content of a descriptor, so that the descriptor can be used to determine the storage address of tensor data.
In one embodiment, the control circuit may include one or more registers storing configuration information about the processing circuit array, and the control circuit is configured to read the configuration information from the registers according to a configuration instruction and send it to the processing circuits, so that the processing circuits connect according to the configuration information. In one application scenario, the configuration information may include preset position information of the processing circuits forming the one or more processing circuit arrays; the position information may, for example, include coordinate information or label information of the processing circuits.

When the processing circuit array is configured to form a closed loop, the configuration information may further include loop-forming configuration information about the processing circuit array forming the closed loop. Alternatively, in one embodiment, the above configuration information can also be carried directly by the configuration instruction rather than read from the registers. In this case, a processing circuit can be configured directly according to the position information in the received configuration instruction, so as to form with other processing circuits an array without a closed loop, or further form an array with a closed loop.

When connections are configured through a configuration instruction, or through configuration information obtained from the registers, to form a two-dimensional array, the processing circuits located in the two-dimensional array are configured to be connected, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, or the same diagonal, so as to form one or more closed loops. Here, the aforementioned predetermined two-dimensional spacing pattern is associated with the number of processing circuits spaced apart in the connection.

Further, when connections are configured according to the aforementioned configuration instruction or configuration information to form a three-dimensional array, the processing circuit array is connected in a looped manner as a three-dimensional array composed of a plurality of layers, where each layer includes a two-dimensional array of a plurality of the processing circuits arranged along the row, column, and diagonal directions, and where the processing circuits located in the three-dimensional array are configured to be connected, in at least one of their row, column, diagonal, and layer directions and in a predetermined three-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, the same diagonal, or on different layers, so as to form one or more closed loops. Here, the predetermined three-dimensional spacing pattern is associated with the number of spaced processing circuits and the number of spaced layers between the processing circuits to be connected.
Fig. 2b is a block diagram illustrating a computing device 200 according to another embodiment of the present disclosure. As can be seen from the figure, in addition to including the control circuit 102 and the plurality of processing circuits 104 identical to those of the computing device 100, the computing device 200 in Fig. 2b further includes a storage circuit 106.

In one application scenario, the above storage circuit can be provided with interfaces for data transmission in multiple directions so as to connect with the plurality of processing circuits 104, so that the data to be operated on by the processing circuits, the intermediate results obtained during operation, and the operation results obtained after operation can be stored accordingly. In view of the foregoing, in one application scenario the storage circuit of the present disclosure may include a main storage module and/or a main cache module, where the main storage module is configured to store the data used for operations in the processing circuit array and the operation results after the operations, and the main cache module is configured to cache the intermediate operation results produced during operations in the processing circuit array. In one application scenario, the aforementioned operation results and intermediate operation results can be tensors, which can be stored in the storage circuit according to the storage addresses determined by the descriptors of the present disclosure. Further, the storage circuit may also have an interface for data transmission with an off-chip storage medium, so that data movement between the on-chip system and the off-chip system can be achieved.
Fig. 3 is a block diagram illustrating a computing device 300 according to yet another embodiment of the present disclosure. As can be seen from the figure, in addition to including the control circuit 102, the plurality of processing circuits 104, and the storage circuit 106 identical to those of the computing device 200, the computing device 300 in Fig. 3 further includes a data manipulation circuit 108, which includes a pre-operation circuit 110 and a post-operation circuit 112. Based on such a hardware architecture, the pre-operation circuit 110 is configured to perform preprocessing of input data (e.g., data of a tensor type) of at least one operation instruction, while the post-operation circuit 112 is configured to perform post-processing of output data (e.g., data of a tensor type) of at least one operation instruction. In one embodiment, the preprocessing performed by the pre-operation circuit may include data placement and/or table lookup operations, while the post-processing performed by the post-operation circuit may include data type conversion and/or compression operations.

In one application scenario, in performing a table lookup operation, the pre-operation circuit is configured to look up one or more tables by an index value, so as to obtain from the one or more tables one or more constant terms associated with the operand. Additionally or alternatively, the pre-operation circuit is configured to determine an associated index value from the operand, and to look up one or more tables by the index value so as to obtain from the one or more tables one or more constant terms associated with the operand.

In one application scenario, the pre-operation circuit can, according to the type of the operation data and the logical address of each processing circuit, split the operation data correspondingly and pass the plurality of sub-data obtained after splitting to the corresponding processing circuits in the array for operation. In another application scenario, the pre-operation circuit can select one data splicing mode from multiple data splicing modes according to the parsed instruction, so as to perform a splicing operation on two pieces of input data. In one application scenario, the post-operation circuit can be configured to perform a compression operation on data, where the compression operation includes filtering data using a mask, or filtering through comparison of data magnitudes with a given threshold, thereby achieving data compression.
Based on the above hardware architecture of Fig. 3, the computing device of the present disclosure can execute calculation instructions that include the aforementioned preprocessing and post-processing. On this basis, the data conversion operation of the calculation instruction expressed in formula (1) above can be executed by the above post-operation circuit. Two illustrative examples of calculation instructions according to the solution of the present disclosure are given below:

Example 1: TMUADCO = MULT + ADD + RELU(N) + CONVERTFP2FIX    (7)

The instruction expressed in the above formula (7) is a calculation instruction that takes a ternary operand as input and outputs a unary operand, and it can be completed by a processing circuit matrix according to the present disclosure that includes a three-stage pipeline operation (i.e., multiply + add + activate). Specifically, the ternary operation is A*B+C, where the MULT micro-instruction completes the multiplication between operands A and B to obtain a product value, i.e., the first pipeline stage. Next, the ADD micro-instruction is executed to complete the addition of the aforementioned product value and C to obtain the sum result "N", i.e., the second pipeline stage. Then, the activation operation RELU is performed on this result, i.e., the third pipeline stage. After this three-stage pipeline operation, the micro-instruction CONVERTFP2FIX can finally be executed by the post-operation circuit described above, so as to convert the type of the result data after the activation operation from floating point to fixed point, to be output as the final result or input as an intermediate result to a fixed-point operator for further computing operations.
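The TMUADCO instruction of formula (7) can be sketched functionally as follows; the Q-format with 8 fraction bits used for the fixed-point conversion is an assumption for illustration, not a detail given in the text:

```python
# Hypothetical model of TMUADCO = MULT + ADD + RELU(N) + CONVERTFP2FIX.

def tmuadco(a, b, c, frac_bits=8):
    n = a * b + c                  # stage 1 (MULT) and stage 2 (ADD) -> "N"
    activated = max(0.0, n)        # stage 3 (RELU)
    # CONVERTFP2FIX: float -> fixed point (assumed Q-format, frac_bits fraction bits)
    return int(round(activated * (1 << frac_bits)))

print(tmuadco(1.5, 2.0, -1.0))     # relu(1.5*2.0 - 1.0) = 2.0 -> 512 in Q8
print(tmuadco(1.0, -3.0, 1.0))     # relu(-2.0) = 0.0 -> 0
```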
Example 2: TSEADMUAD = SEARCHADD + MULT + ADD    (8)

The instruction expressed in the above formula (8) is a calculation instruction that takes a ternary operand as input and outputs a unary operand, and it includes micro-instructions that can be completed by a processing circuit matrix according to the present disclosure that includes a two-stage pipeline operation (i.e., multiply + add). Specifically, the ternary operation is A*B+C, where the SEARCHADD micro-instruction can be completed by the pre-operation circuit to obtain the table lookup result A. Next, the multiplication between operands A and B is completed by the first pipeline stage to obtain a product value. Here, operands A and B and the product value can be tensors read and saved according to the descriptors of the present disclosure. Then, the ADD micro-instruction is executed to complete the addition of the aforementioned product value and C to obtain the sum result "N", i.e., the second pipeline stage. Likewise, when the aforementioned sum result is a tensor, it can also be saved according to the storage address determined by the descriptor of the present disclosure.

As mentioned above, the calculation instructions of the present disclosure can be flexibly designed and determined according to computing requirements, so that the hardware architecture of the present disclosure including multiple processing circuit sub-matrices can be designed and connected according to the calculation instructions and the specific operations they complete, thereby improving instruction execution efficiency and reducing computational overhead.
Fig. 4 is an example structural diagram illustrating multiple types of processing circuit arrays of a computing device 400 according to an embodiment of the present disclosure. As shown in the figure, the computing device 400 shown in Fig. 4 has an architecture similar to that of the computing device 300 shown in Fig. 3, so the description of the computing device 300 in Fig. 3 also applies to the same details shown in Fig. 4 and will not be repeated below.

As can be seen from Fig. 4, the plurality of processing circuits may include, for example, a plurality of first-type processing circuits 104-1 and a plurality of second-type processing circuits 104-2 (distinguished by different background colors in the figure). The plurality of processing circuits can be arranged through physical connections to form a two-dimensional array. For example, as shown in the figure, the two-dimensional array has M rows and N columns (denoted M*N) of first-type processing circuits, where M and N are positive integers greater than 0. The first-type processing circuits can be used to perform arithmetic and logical operations, which may include, for example, linear operations such as addition, subtraction, and multiplication, comparison operations, and nonlinear operations such as AND, OR, and NOT, or any combination of the foregoing operations. Further, there are two columns of second-type processing circuits on each of the left and right sides of the periphery of the M*N first-type processing circuit array, for a total of (M*2+M*2) second-type processing circuits, and two rows on the lower side of its periphery, for a total of (N*2+8) second-type processing circuits; i.e., the processing circuit array has a total of (M*2+M*2+N*2+8) second-type processing circuits. In one embodiment, the second-type processing circuits can be used to perform nonlinear operations on received data, such as comparison operations, table lookup operations, or shift operations. In one or more embodiments, the first-type processing circuits can form a first processing circuit sub-array of the present disclosure, while the second-type processing circuits can form a second processing circuit sub-array of the present disclosure, so as to perform multi-threaded operations. In one scenario, when the multi-threaded operation involves multiple operation instructions and the multiple operation instructions constitute one multi-stage pipeline operation, the first processing circuit sub-array can execute several stages of the multi-stage pipeline operation, while the second processing circuit sub-array can execute several other stages. In another scenario, when the multi-threaded operation involves multiple operation instructions and the multiple operation instructions constitute two multi-stage pipeline operations, the first processing circuit sub-array can execute the first multi-stage pipeline operation, while the second processing circuit sub-array can execute the second multi-stage pipeline operation.

In some application scenarios, the storage circuits used by the first-type and second-type processing circuits can have different storage scales and storage manners. For example, the predicate storage circuit in a first-type processing circuit can store predicate information using a plurality of numbered registers. Further, the first-type processing circuit can access the predicate information in the register of the corresponding number according to the register number specified in the received parsed instruction. As another example, the second-type processing circuit can store predicate information in a static random access memory ("SRAM") manner. Specifically, the second-type processing circuit can determine the storage address of the predicate information in the SRAM according to the offset of the location of the predicate information specified in the received parsed instruction, and can perform predetermined read or write operations on the predicate information at that storage address.
Figs. 5a, 5b, 5c, and 5d are schematic diagrams illustrating various connection relationships of a plurality of processing circuits according to embodiments of the present disclosure. As mentioned above, the plurality of processing circuits of the present disclosure can be connected in a hard-wired manner or in a logical connection manner according to configuration instructions, thereby forming the topology of a connected one-dimensional or multi-dimensional array. When the plurality of processing circuits are connected as a multi-dimensional array, the multi-dimensional array can be a two-dimensional array, and the processing circuits located in the two-dimensional array can be connected, in at least one of their row, column, or diagonal directions and in a predetermined two-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, or the same diagonal, where the predetermined two-dimensional spacing pattern can be associated with the number of processing circuits spaced apart in the connection. Figs. 5a to 5c exemplarily illustrate various forms of two-dimensional array topologies among a plurality of processing circuits.

As shown in Fig. 5a, five processing circuits (each represented by a box) are connected to form a simple two-dimensional array. Specifically, with one processing circuit as the center of the two-dimensional array, one processing circuit is connected in each of the four horizontal and vertical directions relative to that processing circuit, thereby forming a two-dimensional array with a size of three rows and three columns. Further, since the processing circuit at the center of the two-dimensional array is directly connected to the adjacent processing circuits in the preceding and following columns of the same row and the preceding and following rows of the same column, the number of spaced processing circuits ("spacing number" for short) is 0.

As shown in Fig. 5b, four rows and four columns of processing circuits can be connected to form a two-dimensional Torus array, where each processing circuit is connected to its adjacent processing circuits in the preceding and following rows and the preceding and following columns, i.e., the spacing number for connections between adjacent processing circuits is 0. Further, the first processing circuit in each row or column of the two-dimensional Torus array is also connected to the last processing circuit in that row or column, and the spacing number between the head and tail connected processing circuits in each row or column is 2.
As shown in Fig. 5c, four rows and four columns of processing circuits can also be connected to form a two-dimensional array in which the spacing number between adjacent processing circuits is 0 and the spacing number between non-adjacent processing circuits is 1. Specifically, adjacent processing circuits in the same row or column of the two-dimensional array are directly connected, i.e., the spacing number is 0, while non-adjacent processing circuits in the same row or column are connected with a spacing number of 1. It can be seen that when a plurality of processing circuits are connected to form a two-dimensional array, the processing circuits in the same row or column shown in Figs. 5b and 5c can have different spacing numbers. Similarly, in some scenarios, connections with the processing circuits in the diagonal direction can also be made with different spacing numbers.
As shown in Fig. 5d, four two-dimensional Torus arrays as shown in Fig. 5b can be arranged at a predetermined interval into four layers of two-dimensional Torus arrays and connected to form a three-dimensional Torus array. On the basis of the two-dimensional Torus array, the three-dimensional Torus array makes inter-layer connections using a spacing pattern similar to that between rows and between columns. For example, the processing circuits in the same row and column of adjacent layers are first directly connected, i.e., with a spacing number of 0. Then, the processing circuits in the same row and column of the first layer and the last layer are connected, i.e., with a spacing number of 2. A three-dimensional Torus array of four layers, four rows, and four columns can finally be formed.
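The 2D Torus wiring described above can be sketched as a neighbor function: every processing circuit connects to its row and column neighbors, and the first and last circuit of each row or column are also connected, which is equivalent to wrap-around (modular) indexing. This is a geometric sketch, not the hardware configuration mechanism:

```python
def torus_neighbors(r, c, rows=4, cols=4):
    """Return the four neighbours of circuit (r, c) on a 2D Torus array.

    Wrap-around indexing models the head-to-tail connection of each row/column.
    """
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]

# A corner circuit still has four neighbours thanks to the wrap-around links.
print(torus_neighbors(0, 0))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```

Extending the same modular rule to a layer index gives the 3D Torus of Fig. 5d.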
Through the above examples, those skilled in the art can understand that connection relationships of other multi-dimensional arrays of processing circuits can be formed on the basis of the two-dimensional array by adding new dimensions and increasing the number of processing circuits. In some application scenarios, the solution of the present disclosure can also configure logical connections for the processing circuits by using configuration instructions. In other words, although hard-wired connections may exist between processing circuits, the solution of the present disclosure can also use configuration instructions to selectively connect some processing circuits, or selectively bypass some processing circuits, so as to form one or more logical connections. In some embodiments, the aforementioned logical connections can also be adjusted according to the needs of the actual operation (for example, data type conversion). Further, for different computing scenarios, the solution of the present disclosure can configure the connections of the processing circuits, including, for example, configuring them into a matrix or into one or more closed computing loops.
Figs. 6a, 6b, 6c, and 6d are schematic diagrams illustrating further connection relationships of a plurality of processing circuits according to embodiments of the present disclosure. As can be seen from the figures, Figs. 6a to 6d show yet another exemplary connection relationship of the multi-dimensional arrays formed by the plurality of processing circuits shown in Figs. 5a to 5d. In view of this, the technical details described with reference to Figs. 5a to 5d also apply to the content shown in Figs. 6a to 6d.

As shown in Fig. 6a, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the two-dimensional array and three processing circuits connected in each of the four directions in the same row and the same column as the central processing circuit. The spacing numbers of the connections between the central processing circuit and the remaining processing circuits are therefore 0, 1, and 2, respectively. As shown in Fig. 6b, the processing circuits of the two-dimensional array include a central processing circuit located at the center of the two-dimensional array, three processing circuits in the two opposite directions in the same row as that processing circuit, and one processing circuit in each of the two opposite directions in the same column as that processing circuit. Therefore, the spacing numbers of the connections between the central processing circuit and the processing circuits in the same row are 0 and 2, respectively, while the spacing numbers of the connections with the processing circuits in the same column are all 0.

As shown above with reference to Fig. 5d, the multi-dimensional array formed by a plurality of processing circuits can be a three-dimensional array composed of a plurality of layers, where each layer of the three-dimensional array can include a two-dimensional array of a plurality of the processing circuits arranged along its row and column directions. Further, the processing circuits located in the three-dimensional array can be connected, in at least one of their row, column, diagonal, and layer directions and in a predetermined three-dimensional spacing pattern, to one or more remaining processing circuits in the same row, the same column, the same diagonal, or on different layers. Further, the predetermined three-dimensional spacing pattern can be related to the number of mutually spaced processing circuits in the connection and to the number of spaced layers. The connection manner of the three-dimensional array is further described below with reference to Figs. 6c and 6d.

Fig. 6c shows a three-dimensional array of multiple layers, rows, and columns formed by connecting a plurality of processing circuits. Taking the processing circuit located at layer l, row r, column c (denoted (l, r, c)) as an example, it is located at the center of the array and is connected respectively to the processing circuits at the preceding column (l, r, c-1) and the following column (l, r, c+1) in the same layer and row, the processing circuits at the preceding row (l, r-1, c) and the following row (l, r+1, c) in the same layer and column, and the processing circuits at the preceding layer (l-1, r, c) and the following layer (l+1, r, c) in the same row and column but different layers. Further, the spacing numbers of the connections between the processing circuit at (l, r, c) and the other processing circuits in the row, column, and layer directions are all 0.

Fig. 6d shows a three-dimensional array in which the spacing numbers of the connections between the plurality of processing circuits in the row, column, and layer directions are all 1. Taking the processing circuit located at the array center position (l, r, c) as an example, it is connected respectively to the processing circuits at (l, r, c-2) and (l, r, c+2), which are one column apart on each side in the same layer and row but different columns, and to the processing circuits at (l, r-2, c) and (l, r+2, c), which are one row apart on each side in the same layer and column but different rows. Further, it is connected to the processing circuits at (l-2, r, c) and (l+2, r, c), which are one layer apart on each side in the same row and column but different layers. Similarly, among the remaining processing circuits, those at (l, r, c-3) and (l, r, c-1), one column apart in the same layer and row, are connected to each other, and those at (l, r, c+1) and (l, r, c+3) are connected to each other. Next, the processing circuits at (l, r-3, c) and (l, r-1, c), one row apart in the same layer and column, are connected to each other, and those at (l, r+1, c) and (l, r+3, c) are connected to each other. In addition, the processing circuits at (l-3, r, c) and (l-1, r, c), one layer apart in the same row and column, are connected to each other, while those at (l+1, r, c) and (l+3, r, c) are connected to each other.

The connection relationships of the multi-dimensional arrays formed by a plurality of processing circuits have been exemplarily described above. Different loop structures formed by a plurality of processing circuits will be further exemplarily described below with reference to Figs. 7-8.
图7a、7b、7c和7d是分别示出根据本披露实施例的处理电路的多种环路结构的示意图。根据不同的应用场景，多个处理电路不仅可以以物理连接关系来进行连接，也可以根据接收到的解析后的指令配置成以逻辑关系来进行连接。所述多个处理电路可以配置成利用所述逻辑连接关系进行连接以形成闭合的环路。
如图7a所示,四个相邻的处理电路顺序编号为“0、1、2和3”。接着,从处理电路0开始按照顺时针方向将该四个处理电路顺序相连,并且处理电路3与处理电路0进行连接,以使四个处理电路串联形成一个闭合的环路(简称“成环”)。在该环路中,处理电路的间隔数目为0或2,例如处理电路0与1之间间隔数目为0,而处理电路3与0之间间隔数目为2。进一步,所示环路中的四个处理电路的物理地址(在本披露的上下文中也可以称为物理坐标)可以表示为0-1-2-3,而其逻辑地址(在本披露的上下文中也可以称为逻辑坐标)同样可以表示为0-1-2-3。需要注意的是,图7a所示出的连接顺序仅仅是示例性的而非限制性的,本领域技术人员根据实际计算需要,也可以以逆时针方向对四个处理电路进行串联连接以形成闭合的环路。
在一些实际场景中,当一个处理电路支持的数据位宽不能满足运算数据的位宽要求时,可以利用多个处理电路组合成一个处理电路组以表示一个数据。例如,假设一个处理电路可以处理8位数据。当需要处理32位的数据时,则可以将4个处理电路进行组合成为一个处理电路组,以便对4个8位数据进行连接以形成一个32位数据。进一步,前述4个8位处理电路形成的一个处理电路组可以充当图7b中示出的一个处理电路104,从而可以支持更高位宽的运算操作。
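上述由4个8位处理电路组合出一个32位数据的方式，可以用如下最小示意来说明（函数命名为本文之外的假设，并假设按低位在前的小端顺序拼接）：

```python
def combine_8bit_group(parts):
    """将处理电路组内 4 个 8 位数据拼接为一个 32 位数据。

    parts[0] 为最低字节, parts[3] 为最高字节(小端式的示意性假设)。
    """
    assert len(parts) == 4 and all(0 <= p < 256 for p in parts)
    value = 0
    for i, p in enumerate(parts):
        value |= p << (8 * i)  # 第 i 个 8 位数据放在第 i 个字节位置
    return value
```

例如，四个字节 0x78、0x56、0x34、0x12 依次拼接后即得到 32 位数据 0x12345678。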
从图7b中可以看出,其所示出的处理电路的布局与图7a示出的类似,但图7b中处理电路之间连接的间隔数目与图7a不同。图7b示出以0、1、2和3顺序编号的四个处理电路按顺时针方向从处理电路0开始,顺序连接处理电路1、处理电路3和处理电路2,并且处理电路2连接至处理电路0,从而串联形成一个闭合的环路。从 该环路中可以看出,图7b中示出的处理电路的间隔数目为0或1,例如处理电路0与1之间间隔为0,而处理电路1与3之间间隔为1。进一步,所示闭合环路中的四个处理电路的物理地址可以为0-1-2-3,而逻辑地址依所示的成环方式可以表示为0-1-3-2。因此,当需要对高比特位宽的数据进行拆分以分配给不同的处理电路时,可以根据处理电路的逻辑地址对数据顺序进行重新排列和分配。
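图7a与图7b两种成环方式中连接的间隔数目，可以按物理位置之差简单计算出来。以下仅为一个示意性草图（函数命名为假设，间隔数目按物理位置之差的绝对值减一计算）：

```python
def ring_intervals(ring):
    """对按连接顺序给出的物理地址序列 ring, 计算闭合环路中
    每条连接所间隔的处理电路数目(物理位置之差的绝对值减一)。"""
    n = len(ring)
    return [abs(ring[i] - ring[(i + 1) % n]) - 1 for i in range(n)]
```

对图7a的成环顺序0-1-2-3计算得到间隔数目[0,0,0,2]，对图7b的成环顺序0-1-3-2得到[0,1,0,1]，与正文的描述一致。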
上述的拆分和重新排列的操作可以由结合图3描述的前操作电路来执行。特别地,该前操作电路可以根据多个处理电路的物理地址和逻辑地址来对输入数据进行重新排列,以用于满足数据运算的要求。假设四个顺序排列的处理电路0至处理电路3如图7a中所示出的连接,由于连接的物理地址和逻辑地址都为0-1-2-3,因此前操作电路可以将输入数据(例如像素数据)aa0、aa1、aa2和aa3依次传送到对应的处理电路中。然而,当前述的四个处理电路按图7b所示出的连接时,其物理地址保持0-1-2-3不变,而逻辑地址变为0-1-3-2,此时前操作电路需要将输入数据aa0、aa1、aa2和aa3重新排列为aa0-aa1-aa3-aa2,以传送到对应的处理电路中。基于上述的输入数据重排列,本披露的方案可以保证数据运算顺序的正确性。类似地,如果前述获得的四个运算输出结果(例如是像素数据)的顺序是bb0-bb1-bb3-bb2,可以利用结合图2描述的后操作电路将运算输出结果的顺序还原调整为bb0-bb1-bb2-bb3,以用于保证输入数据和输出结果数据之间的排列一致性。
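前操作电路的重排与后操作电路的还原可以用下面的最小示意来表达（其中函数命名与逻辑地址的表示方式均为本文之外的假设；logical[j]表示物理位置j上处理电路的逻辑地址）：

```python
def pre_rearrange(data, logical):
    """前操作电路示意: 物理位置 j 上的处理电路接收
    逻辑地址为 logical[j] 的输入数据。"""
    return [data[logical[j]] for j in range(len(data))]

def post_restore(result, logical):
    """后操作电路示意: 把按物理顺序得到的输出结果
    还原为按逻辑顺序排列。"""
    restored = [None] * len(result)
    for j, lg in enumerate(logical):
        restored[lg] = result[j]
    return restored
```

按图7b的逻辑地址0-1-3-2重排输入aa0~aa3即得到aa0-aa1-aa3-aa2，而对输出bb0-bb1-bb3-bb2调用还原函数即可恢复为bb0-bb1-bb2-bb3。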
图7c和图7d示出更多的处理电路分别以不同方式进行排列和连接,以形成闭合的环路。如图7c所示,以0,1…15顺序编号的16个处理电路104从处理电路0开始,顺序地每两个处理电路进行连接和组合,以形成一个处理电路组(也即本披露的处理电路子阵列)。例如,如图中所示,处理电路0与处理电路1连接形成一个处理电路组……。以此类推,处理电路14与处理电路15连接以形成一个处理电路组,最终形成八个处理电路组。进一步,该八个处理电路组也可以类似于前述的处理电路的连接方式进行连接,包括按照例如预定的逻辑地址来进行连接,以形成一个处理电路组的闭合的环路。
如图7d所示,多个处理电路104以不规则或者说不统一的方式来连接,以形成具有闭合环路的处理电路矩阵。具体来说,在图7d中示出处理电路之间可以间隔数目为0或3来形成闭合的环路,例如处理电路0可以分别与处理电路1(间隔数目为0)和处理电路4(间隔数目为3)相连。
由上述结合图7a、7b、7c和7d的描述可知，本披露的处理电路之间可以间隔有不同数目的处理电路，以便连接成闭合的环路。当处理电路总数变化时，也可以选择任意的中间间隔数目进行动态配置，从而连接成闭合的环路。还可以将多个处理电路组合成为处理电路组，并连接成处理电路组的闭合的环路。另外，多个处理电路的连接可以是硬件构成的硬连接方式，或者可以是软件配置的软连接方式。
图8a,8b和8c是示出根据本披露实施例的处理电路的另外多种环路结构的示意图。正如结合图7所示出的，多个处理电路可以形成一个闭合的环路，并且所述闭合的环路中的每个处理电路可以配置有各自的逻辑地址。进一步，结合图2描述的前操作电路可以配置成根据运算数据的类型（例如32bit数据、16bit数据或8bit数据）和逻辑地址，将所述运算数据进行相应的拆分并将拆分后获得的多个子数据分别传递至环路中对应的各个处理电路中以用于后续运算。
图8a上图示出四个处理电路连接形成一个闭合环路，并且该四个处理电路按从右到左顺序的物理地址可以表示为0-1-2-3。图8a下图示出前述环路中的四个处理电路从右到左顺序的逻辑地址表示为0-3-1-2。例如，图8a下图所示出的逻辑地址为“3”的处理电路具有图8a上图示出的物理地址“1”。
在一些应用场景中,假设操作数据的粒度是输入数据的低128bit,例如图中的原始序列“15,14,……2,1,0”(每个数字对应8bit数据),并且设定该16个8bit数据的逻辑地址从低到高编号依次是0~15。进一步,按照如图8a下图所示出的逻辑地址,所述前操作电路可以根据不同的数据类型,对数据采用不同的逻辑地址进行编码或排列。
当处理电路操作的数据位宽为32bit时，逻辑地址分别为(3,2,1,0),(7,6,5,4),(11,10,9,8)和(15,14,13,12)的4个数可以分别表示第0~3个32bit数据。所述前操作电路可以将第0个32bit数据传送至逻辑地址为“0”的处理电路中（对应的物理地址为“0”），可以将第1个32bit数据传送至逻辑地址为“1”的处理电路中（对应的物理地址为“2”），可以将第2个32bit数据传送至逻辑地址为“2”的处理电路中（对应的物理地址为“3”），可以将第3个32bit数据传送至逻辑地址为“3”的处理电路中（对应的物理地址为“1”）。通过这样的数据重新排列，可以满足处理电路的后续运算需求。因此最终数据的逻辑地址与物理地址之间的映射关系为(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)->(11,10,9,8,7,6,5,4,15,14,13,12,3,2,1,0)。
当处理电路操作的数据位宽为16bit时,逻辑地址分别为(1,0),(3,2),(5,4),(7,6),(9,8),(11,10),(13,12)和(15,14)的8个数可以分别表示第0~7个16bit数据。所述前操作电路可以将第0个和第4个16bit数据传送至逻辑地址为“0”的处理电路中(对应的物理地址为“0”),可以将第1个和第5个16bit数据传送至逻辑地址为“1”的处理电路中(对应的物理地址为“2”),可以将第2个和第6个16bit数据传送至逻辑地址为“2”的处理电路中(对应的物理地址为“3”),可以将第3个和第7个16bit数据传送至逻辑地址为“3”的处理电路中(对应的物理地址为“1”)。因此最终数据的逻辑地址与物理地址之间的映射关系为:(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)->(13,12,5,4,11,10,3,2,15,14,7,6,9,8,1,0)。
当处理电路操作的数据位宽为8bit时,逻辑地址为0~15的16个数可以分别表示第0~15个8bit数据。根据图8a所示出的连接,所述前操作电路可以将第0个、第4个、第8个和第12个8bit数据传送至逻辑地址为“0”的处理电路中(对应的物理地址为“0”);可以将第1个、第5个、第9个和第13个8bit数据传送至逻辑地址为“1”的处理电路中(对应的物理地址为“2”);可以将第2个、第6个、第10个和第14个8bit数据传送至逻辑地址为“2”的处理电路中(对应的物理地址为“3”);可以将第3个、第7个、第11个和第15个8bit数据传送至逻辑地址为“3”的处理电路中(对应的物理地址为“1”)。因此最终数据的逻辑地址与物理地址之间的映射关系为:
(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0)->(14,10,6,2,13,9,5,1,15,11,7,3,12,8,4,0)。
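上述32bit、16bit与8bit三种位宽下的重排规则，可以统一用下面的示意代码计算出最终的字节排布（函数命名与数据组织方式均为示意性假设；logical_of_phys[p]表示物理地址p上处理电路的逻辑地址）：

```python
def byte_layout(width_bytes, logical_of_phys):
    """计算图 8a 成环方式下 16 个字节(逻辑编号 0~15)重排后的排布。

    每个数据占 width_bytes 个字节, 第 k 个数据送往逻辑地址 k % 4
    的处理电路; 各处理电路内部按接收顺序从低位填充。
    返回从低字节到高字节排列的逻辑编号序列。
    """
    count = 16 // width_bytes
    slots = {lg: [] for lg in range(4)}
    for k in range(count):
        # 第 k 个数据的各个字节, 按逻辑地址 k % 4 分发
        slots[k % 4].extend(range(k * width_bytes, (k + 1) * width_bytes))
    layout = []
    for phys in range(4):  # 按物理地址从低到高拼出最终序列
        layout.extend(slots[logical_of_phys[phys]])
    return layout
```

以图8a的映射（物理地址0、1、2、3分别对应逻辑地址0、3、1、2）为例，分别取每个数据为4、2、1个字节进行计算，从高位到低位读取的结果即为正文给出的32bit、16bit和8bit三种映射关系。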
图8b上图示出八个顺序编号的处理电路0至处理电路7连接形成一个闭合的环路，并且该八个处理电路的物理地址为0-1-2-3-4-5-6-7。图8b下图示出前述八个处理电路的逻辑地址为0-7-1-6-2-5-3-4。例如，图8b上图示出物理地址为“6”的处理电路对应于图8b下图示出的逻辑地址为“3”。
针对不同数据类型，图8b所示出的所述前操作电路对数据进行重新排列后传送至对应处理电路的操作与图8a类似，因此结合图8a所描述的技术方案也同样适用于图8b，此处不再对上述的数据重新排列操作过程进行赘述。进一步，图8b所示出的处理电路的连接关系与图8a所示出的类似，但图8b示出的八个处理电路为图8a所示出的处理电路个数的两倍。由此，在根据不同数据类型进行操作的应用场景中，结合图8b所描述操作数据的粒度可以为结合图8a所描述操作数据的粒度的两倍。因此，相对于前面例子中输入数据的粒度为低128bit，本例中操作数据的粒度可以为输入数据的低256bit，例如图中示出的原始数据序列“31,30,……,2,1,0”，每个数字对应于8比特（“bit”）长度。
针对于上述原始数据序列,当处理电路操作的数据位宽分别是32bit、16bit和8bit时,图中还分别示出成环的处理电路中的数据的排列结果。例如,当操作的数据位宽是32bit时,逻辑地址为“1”的处理电路中的1个32bit数据为(7,6,5,4),该处理电路对应的物理地址为“2”。而当操作的数据位宽是16bit时,逻辑地址为“3”的处理电路中的2个16bit数据为(23,22,7,6),该处理电路对应的物理地址为“6”。当操作的数据位宽是8bit时,逻辑地址为“6”的处理电路中的4个8bit数据为(30,22,14,6),该处理电路对应的物理地址为“3”。
上文结合图8a和图8b，针对多个单个类型处理电路（如图3示出的第一类型处理电路）连接形成闭合环路的情形，描述了不同数据类型的数据操作。下文将结合图8c所示出的多个不同类型处理电路（如图4示出的第一类型处理电路和第二类型处理电路）连接形成闭合环路的情形，针对不同数据类型的数据操作做出进一步描述。
图8c上图示出,以0,1……19顺序编号的二十个多类型处理电路进行连接,以形成一个闭合的环路(图中示出的编号为处理电路的物理地址)。编号从0至15的十六个处理电路为第一类型处理电路(也即形成本披露的处理电路子阵列),编号从16至19的四个处理电路为第二类型处理电路(也即形成本披露的处理电路子阵列)。类似地,该二十个处理电路中每个的物理地址,与图8c下图示出的对应处理电路的逻辑地址具有映射关系。
进一步,在对不同数据类型进行操作时,例如对于图中示出的80个8bit的原始序列,图8c还示出针对于处理电路支持的不同数据类型,对前述原始数据进行操作后的结果。例如,当操作的数据位宽是32bit时,逻辑地址为“1”的处理电路中的1个32bit数据为(7,6,5,4),该处理电路对应的物理地址为“2”。而当操作的数据位宽是16bit时,逻辑地址为“11”的处理电路中的2个16bit数据为(63,62,23,22),该处理电路对应的物理地址为“9”。而当操作的数据位宽是8bit时,逻辑地址为“17”的处理电路中的4个8bit数据为(77,57,37,17),该处理电路对应的物理地址为“18”。
图9a,9b,9c和9d是示出根据本披露实施例的前处置电路所执行的数据拼接操作示意图。如前所述，本披露结合图2所描述的前处置电路还可以配置成根据解析后的指令从多种数据拼接模式中选择一种数据拼接模式，以对输入的两个数据执行拼接操作。关于多种数据拼接模式，在一个实施例中，本披露的方案通过对待拼接的两个数据按最小数据单元划分和编号，然后基于指定的规则来抽取数据的不同最小数据单元以形成不同的数据拼接模式。例如，可以基于编号的奇偶性或编号是否是指定数字的整数倍来进行交替式地抽取和摆放，从而形成不同的数据拼接模式。根据不同的计算场景（例如数据位宽的不同），这里的最小数据单元可以简单地就是1位或1比特数据，或者是2位、4位、8位、16位或32位的长度。进一步，在抽取两个数据的不同编号部分时，本披露的方案既可以以最小数据单元来交替地抽取，也可以以最小数据单元的倍数来抽取，例如从两个数据中交替地一次抽取两个或三个最小数据单元的部分数据作为一组来按组进行拼接。
基于上述数据拼接模式的描述,下面将结合图9a至图9c来以具体的例子示例性阐述本披露的数据拼接模式。在所示的图中,输入数据为In1和In2,当图中的每个方格代表一个最小数据单元时,两个输入数据都具有8个最小数据单元的位宽长度。如前所述,对于不同位宽长度的数据,该最小数据单元可以代表不同的位数(或比特数)。例如,对于位宽为8位的数据,最小数据单元代表1位数据,而对于位宽为16位的数据,最小数据单元代表2位数据。又例如,对于位宽为32位的数据,最小数据单元代表4位数据。
如图9a所示,待拼接的两个输入数据In1和In2各由从右至左顺序编号为1,2,……,8的八个最小数据单元构成。按照编号由小到大、先In1后In2、先奇数编号后偶数编号的奇偶交错原则进行数据拼接。具体而言,当操作的数据位宽为8bit时,数据In1和In2各表示一个8位数据,而每个最小数据单元代表1位数据(即一个方格代表1比特数据)。根据数据的位宽和前述的拼接原则,首先抽取数据In1编号为1、3、5和7的最小数据单元顺序布置于低位。接着,顺序布置数据In2的四个奇数编号的最小数据单元。类似地,顺序布置数据In1编号为2、4、6和8的最小数据单元和数据In2的四个偶数编号的最小数据单元。最终,由16个最小数据单元拼接形成1个16位或2个8位的新数据,如图9a中第二行方格所示出的。
如图9b所示,在数据位宽为16bit时,数据In1和In2各表示一个16位数据,此时每个最小数据单元代表2位数据(即一个方格代表一个2比特数据)。根据数据的位宽和前述的交错拼接原则,可以先抽取数据In1编号为1、2、5和6的最小数据单元顺序布置于低位。然后,顺序布置数据In2编号为1、2、5和6的最小数据单元。类似地,顺序布置数据In1编号为3、4、7和8和数据In2相同编号的最小数据单元,以拼接形成最终的16个最小数据单元组成的1个32位或2个16位的新数据,如图9b中第二行方格所示出的。
如图9c所示,在数据位宽为32bit时,数据In1和In2各表示一个32位数据,而每个最小数据单元代表4位数据(即一个方格代表一个4比特数据)。根据数据的位宽和前述的交错拼接原则,可以先抽取数据In1编号为1、2、3和4和数据In2相同编号的最小数据单元顺序布置于低位。然后,抽取数据In1编号为5、6、7和8与数据In2相同编号的最小数据单元顺序布置,从而拼接形成最终的16个最小数据单元组成的1个64位或2个32位的新数据。
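图9a至图9c所示的三种交错拼接可以归纳为同一条规则：把8个最小数据单元按每若干个为一组划分，先取两个输入中偶数组号的单元置于低位，再取奇数组号的单元。下面的示意代码（函数与参数命名为本文之外的假设）统一表达了这一规则：

```python
def interleave_concat(in1, in2, group):
    """按奇偶组交错原则拼接两个各含 8 个最小数据单元的输入。

    group 为每组的最小数据单元数: 图9a 对应 1, 图9b 对应 2,
    图9c 对应 4。返回从低位到高位排列的 16 个最小数据单元。
    """
    assert len(in1) == len(in2) == 8 and 8 % group == 0

    def pick(seq, start):
        # 抽取组号为 start, start+2, ... 的各组最小数据单元
        return [u for g in range(start, 8 // group, 2)
                for u in seq[g * group:(g + 1) * group]]

    return pick(in1, 0) + pick(in2, 0) + pick(in1, 1) + pick(in2, 1)
```

当group为1时，In1中编号为1、3、5、7的最小数据单元被依次布置于低位，与图9a的描述一致；group为2和4时则分别对应图9b和图9c的拼接结果。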
上面结合图9a-图9c描述了本披露的示例性数据拼接方式。然而，可以理解的是在一些计算场景中，数据拼接并不涉及上述的交错排布，而仅仅是两个数据在保持各自原有数据位置不变情况下的简单排布，例如图9d中所示出的。从图9d中可看出，两个数据In1和In2并不执行如图9a-图9c中示出的交错排布，而仅仅是将数据In1的最后一个最小数据单元和In2的第一个最小数据单元进行串联，从而获得位宽增大（例如加倍）的新数据类型。在一些场景中，本披露的方案还可以基于数据属性进行成组的拼接。例如，可以将具有同一特征图的神经元数据或权值数据形成一组，然后进行排布，以构成拼接后数据的连续部分。
图10a,10b和10c是示出根据本披露实施例的后处置电路所执行的数据压缩操作示意图。所述压缩操作可以包括利用掩码对数据进行筛选或通过给定阈值与数据大小的比较来进行压缩。关于数据压缩操作，可以对其按如前所述的最小数据单元进行划分和编号。与结合图9a-图9d所述的类似，最小数据单元可以例如是1位或1比特数据，或者是2位、4位、8位、16位或32位的长度。下面将结合图10a至图10c针对不同的数据压缩模式做出示例性描述。
如图10a所示,原始数据由从右至左顺序编号为1,2……,8的八个方格(即八个最小数据单元)依次排列组成,假设每个最小数据单元可以表示1比特数据。当根据掩码进行数据压缩操作时,所述后处置电路可以利用掩码对原始数据进行筛选以执行数据压缩操作。在一个实施例中,掩码的位宽与原始数据的最小数据单元的个数对应。例如,前述的原始数据具有8个最小数据单元,则掩码位宽为8位,并且编号为1的最小数据单元对应于掩码的最低位,编号为2的最小数据单元对应于掩码的次低位。以此类推,编号为8的最小数据单元对应于掩码的最高位。在一个应用场景中,当8位掩码为“10010011”时,压缩原则可以设置为抽取与该掩码为“1”的数据位对应的原始数据中的最小数据单元。例如,对应掩码数值为“1”的最小数据单元的编号为1、2、5和8。由此,可以抽取编号为1、2、5和8的最小数据单元,并且按照编号从低到高的顺序依次排列,以形成压缩后的新数据,如图10a第二行所示。
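图10a所示的掩码筛选模式以及后文图10c所述的阈值比较模式，均可以用如下最小示意实现（函数命名为本文之外的假设；掩码的最低位对应编号为1的最小数据单元）：

```python
def mask_compress(units, mask):
    """掩码压缩示意: 保留掩码中对应位为 1 的最小数据单元。

    units[i] 对应编号 i+1, 即掩码的第 i 位(最低位为第 0 位)。
    """
    return [u for i, u in enumerate(units) if (mask >> i) & 1]

def threshold_compress(values, threshold):
    """阈值压缩示意: 保留所有大于或等于阈值的最小数据单元,
    并维持原有的编号顺序。"""
    return [v for v in values if v >= threshold]
```

例如，对编号1~8的最小数据单元施加掩码“10010011”，保留下来的正是编号为1、2、5和8的单元，与正文描述一致。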
图10b示出与图10a类似的原始数据,并且从图10b的第二行中可以看出,经过后处置电路的数据序列维持原有的数据排列顺序和内容。由此可以理解,本披露的数据压缩也可以包括禁用模式或非压缩模式,以便在数据经过后处置电路时不执行压缩操作。
如图10c所示，原始数据由八个方格依次排列组成，每个方格上方的数字表示其编号，从右至左顺序编号为1,2……,8，并且假设每个最小数据单元可以为8比特数据。进一步，每个方格中的数字表示该最小数据单元的十进制数值。以编号为1的最小数据单元为例，其十进制数值为“8”，对应的8比特数据为“00001000”。当根据阈值进行数据压缩操作时，假设阈值为十进制数据“8”，压缩原则可以设置为抽取原始数据中所有大于或等于该阈值“8”的最小数据单元。由此，可以抽取编号为1、4、7和8的最小数据单元。然后，将抽取得到的所有最小数据单元按照编号从低到高的顺序进行排列，以获得最终的数据结果，如图10c中的第二行所示。

图11是示出根据本披露实施例的使用计算装置来执行运算操作的方法1100的简化流程图。根据前文的描述，可以理解这里的计算装置可以是结合图1（包括图1a和图1b）-图4所描述的计算装置，其具有如图5-图10所示出的处理电路连接关系并且支持附加的各类操作。
如图11所示，在步骤1110处，方法1100在所述计算装置处接收计算指令，并且对其进行解析而获得多个运算指令。在一个实施例中，所述计算指令的操作数包括用于指示张量的形状的描述符，所述描述符用于确定所述操作数对应数据的存储地址。接着，在步骤1120，方法1100响应于接收到所述多个运算指令，利用所述多个处理电路子阵列来执行多线程运算，其中所述多个处理电路子阵列中的至少一个处理电路子阵列配置成根据所述存储地址来执行多个运算指令中的至少一个运算指令。
以上为了简明的目的,仅结合图11描述了本披露的计算方法。本领域技术人员根据本披露的公开内容也可以想到本方法可以包括更多的步骤,并且这些步骤的执行可以实现前文结合图1-图10所描述的本披露的各类操作,此处不再赘述。
图12是示出根据本披露实施例的一种组合处理装置1200的结构图。如图12中所示,该组合处理装置1200包括计算处理装置1202、接口装置1204、其他处理装置1206和存储装置1208。根据不同的应用场景,计算处理装置中可以包括一个或多个计算装置1210,该计算装置可以配置用于执行本文结合图1-图11所描述的操作。
在不同的实施例中,本披露的计算处理装置可以配置成执行用户指定的操作。在示例性的应用中,该计算处理装置可以实现为单核人工智能处理器或者多核人工智能处理器。类似地,包括在计算处理装置内的一个或多个计算装置可以实现为人工智能处理器核或者人工智能处理器核的部分硬件结构。当多个计算装置实现为人工智能处理器核或人工智能处理器核的部分硬件结构时,就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。
在示例性的操作中,本披露的计算处理装置可以通过接口装置与其他处理装置进行交互,以共同完成用户指定的操作。根据实现方式的不同,本披露的其他处理装置可以包括中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器。这些处理器可以包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算处理装置而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算处理装置和其他处理装置共同考虑时,二者可以视为形成异构多核结构。
在一个或多个实施例中,该其他处理装置可以作为本披露的计算处理装置(其可以具体化为人工智能例如神经网络运算的相关运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运、对计算装置的开启和/或停止等基本控制。在另外的实施例中,其他处理装置也可以和该计算处理装置协作以共同完成运算任务。
在一个或多个实施例中,该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如,该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据,写入该计算处理装置片上的存储装置(或称存储器)。进一步,该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令,写入计算处理装置片上的控制缓存中。替代地或可选地,接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。
附加地或可选地，本披露的组合处理装置还可以包括存储装置。如图中所示，该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中，存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据。例如，该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。
在一些实施例里，本披露还公开了一种芯片（例如图13中示出的芯片1302）。在一种实现中，该芯片是一种系统级芯片（System on Chip，SoC），并且集成有一个或多个如图12中所示的组合处理装置。该芯片可以通过对外接口装置（如图13中示出的对外接口装置1306）与其他相关部件相连接。该相关部件可以例如是摄像头、显示器、鼠标、键盘、网卡或Wi-Fi接口。在一些应用场景中，该芯片上可以集成有其他处理单元（例如视频编解码器）和/或接口模块（例如DRAM接口）等。在一些实施例中，本披露还公开了一种芯片封装结构，其包括了上述芯片。在一些实施例里，本披露还公开了一种板卡，其包括上述的芯片封装结构。下面将结合图13对该板卡进行详细地描述。
图13是示出根据本披露实施例的一种板卡1300的结构示意图。如图13中所示,该板卡包括用于存储数据的存储器件1304,其包括一个或多个存储单元1310。该存储器件可以通过例如总线等方式与控制器件1308和上文所述的芯片1302进行连接和数据传输。进一步,该板卡还包括对外接口装置1306,其配置用于芯片(或芯片封装结构中的芯片)与外部设备1312(例如服务器或计算机等)之间的数据中继或转接功能。例如,待处理的数据可以由外部设备通过对外接口装置传递至芯片。又例如,所述芯片的计算结果可以经由所述对外接口装置传送回外部设备。根据不同的应用场景,所述对外接口装置可以具有不同的接口形式,例如其可以采用标准PCIE接口等。
在一个或多个实施例中,本披露板卡中的控制器件可以配置用于对所述芯片的状态进行调控。为此,在一个应用场景中,该控制器件可以包括单片机(Micro Controller Unit,MCU),以用于对所述芯片的工作状态进行调控。
根据上述结合图12和图13的描述,本领域技术人员可以理解本披露也公开了一种电子设备或装置,其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。
根据不同的应用场景，本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆；所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机；所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步，本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中，根据本披露方案的算力高的电子设备或装置可以应用于云端设备（例如云端服务器），而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备（例如智能手机或摄像头）。在一个或多个实施例中，云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容，从而可以根据终端设备和/或边缘端设备的硬件信息，从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源，以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个***,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中，上述集成的单元也可以采用硬件的形式实现，即为具体的硬件电路，其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件，而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此，本文所述的各类装置（例如计算装置或其他处理装置）可以通过适当的硬件处理器来实现，例如CPU、GPU、FPGA、DSP和ASIC等。进一步，前述的存储单元或存储装置可以是任意适当的存储介质（包括磁存储介质或磁光存储介质等），其例如可以是可变电阻式存储器（Resistive Random Access Memory，RRAM）、动态随机存取存储器（Dynamic Random Access Memory，DRAM）、静态随机存取存储器（Static Random Access Memory，SRAM）、增强动态随机存取存储器（Enhanced Dynamic Random Access Memory，EDRAM）、高带宽存储器（High Bandwidth Memory，HBM）、混合存储器立方体（Hybrid Memory Cube，HMC）、ROM和RAM等。
依据以下条款可更好地理解前述内容:
条款1、一种计算装置,包括:
处理电路阵列,其由多个处理电路以一维或多维阵列的结构连接而成,其中所述处理电路阵列配置成多个处理电路子阵列,并且响应于接收到多个运算指令来执行多线程运算,
其中所述多个运算指令由对所述计算装置接收到的计算指令进行解析而获得,并且其中所述计算指令的操作数包括用于指示张量的形状的描述符,所述描述符用于确定所述操作数对应数据的存储地址,
其中所述多个处理电路子阵列中的至少一个处理电路子阵列配置成根据所述存储地址来执行所述多个运算指令中的至少一个运算指令。
条款2、根据条款1所述的计算装置,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。
条款3、根据条款2所述的计算装置,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。
条款4、根据条款3所述的计算装置,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。
条款5、根据条款4所述的计算装置,其中所述张量数据的形状参数包括以下至少一种:
所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系,其中N为大于或等于零的整数。
条款6、根据条款1所述的计算装置,所述计算指令的操作码表示由所述处理电路阵列执行的多个操作,所述计算装置还包括控制电路,其配置成获取所述计算指令并对所述计算指令进行解析,以得到与所述操作码表示的多个操作相对应的所述多个运算指令,并且在所述计算指令的操作数包括所述描述符时,所述控制电路配置成根据所述描述符来确定所述操作数对应数据的存储地址。
条款7、根据条款6所述的计算装置,其中所述控制电路根据所述多个运算指令配置所述处理电路阵列,以得到所述多个处理电路子阵列。
条款8、根据条款7所述的计算装置,其中所述控制电路包括用于存储配置信息的寄存器,并且控制电路根据所述多个运算指令提取对应的配置信息,并根据所述配置信息来配置所述处理电路阵列以得到所述多个处理电路子阵列。
条款9、根据条款1所述的计算装置,所述多个运算指令包括至少一条多级流水运算,所述一条多级流水运算中包括至少两个运算指令。
条款10、根据条款1所述的计算装置,其中所述运算指令包括谓词,并且每个所述处理电路根据所述谓词判断是否执行与其关联的所述运算指令。
条款11、根据条款1所述的计算装置,其中所述处理电路阵列是一维阵列,并 且所述处理电路阵列中的一个或多个处理电路配置成作为一个所述处理电路子阵列。
条款12、根据条款1所述的计算装置,其中所述处理电路阵列是二维阵列,并且其中:所述处理电路阵列中的一行或多行处理电路配置成作为一个所述处理电路子阵列;或者所述处理电路阵列中的一列或多列处理电路配置成作为一个所述处理电路子阵列;或者所述处理电路阵列中沿对角线方向上的一排或多排处理电路配置成作为一个所述处理电路子阵列。
条款13、根据条款12所述的计算装置,其中位于所述二维阵列中的所述多个处理电路配置成在其行方向、列方向或对角线方向的至少一个上以预定的二维间隔模式与同行、同列或同对角线的其余一个或多个所述处理电路连接。
条款14、根据条款13所述的计算装置,其中所述预定的二维间隔模式与所述连接中间隔的处理电路的数目相关联。
条款15、根据条款1所述的计算装置,其中所述处理电路阵列是三维阵列,并且所述处理电路阵列中的三维子阵列或多个三维子阵列配置成作为一个所述处理电路子阵列。
条款16、根据条款15所述的计算装置，其中所述三维阵列是由多个层构成的三维阵列，其中每个层包括沿行方向、列方向和对角线方向排列的多个所述处理电路的二维阵列，其中：位于所述三维阵列中的所述处理电路配置成在其行方向、列方向、对角线方向和层方向的至少一个上以预定的三维间隔模式与同行、同列、同对角线或不同层上的其余一个或多个处理电路连接。
条款17、根据条款16所述的计算装置,其中所述预定的三维间隔模式与待连接的处理电路之间的间隔数目和间隔层数相关联。
条款18、根据条款11-17的任意一项所述的计算装置,其中所述处理电路子阵列中的多个处理电路形成一个或多个闭合的环路。
条款19、根据条款1所述的计算装置,其中各所述处理电路子阵列适于执行以下运算中的至少一种:算术运算、逻辑运算、比较运算和查表运算。
条款20、根据条款1所述的计算装置,还包括数据操作电路,该数据操作电路包括前操作电路和/或后操作电路,其中所述前操作电路配置成执行至少一个所述运算指令的输入数据的预处理,而所述后操作电路配置成执行至少一个运算指令的输出数据的后处理。
条款21、根据条款20所述的计算装置,其中所述预处理包括数据摆放和/或查表操作,所述后处理包括数据类型转换和/或压缩操作。
条款22、根据条款21所述的计算装置,其中所述数据摆放包括根据所述运算指令的输入数据和/或输出数据的数据类型,将所述输入数据和/或输出数据进行相应的拆分或合并后,传递至对应的处理电路中以便运算。
条款23、一种集成电路芯片,包括根据条款1-22的任意一项所述的计算装置。
条款24、一种板卡,包括根据条款23所述的集成电路芯片。
条款25、一种电子设备,包括根据条款23所述的集成电路芯片。
条款26、一种使用计算装置来执行计算的方法,其中所述计算装置包括处理电路阵列,该处理电路阵列由多个处理电路以一维或多维阵列的结构连接而成,并且所述处理电路阵列配置成多个处理电路子阵列,所述方法包括:在所述计算装置处接收 计算指令,并且对其进行解析而获得多个运算指令,其中所述计算指令的操作数包括用于指示张量的形状的描述符,所述描述符用于确定所述操作数对应数据的存储地址;
响应于接收到所述多个运算指令,利用所述多个处理电路子阵列来执行多线程运算,其中所述多个处理电路子阵列中的至少一个处理电路子阵列配置成根据所述存储地址来执行多个运算指令中的至少一个运算指令。
条款27、根据条款26所述的方法,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。
条款28、根据条款27所述的方法,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。
条款29、根据条款28所述的方法,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。
条款30、根据条款29所述的方法,其中所述张量数据的形状参数包括以下至少一种:所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系,其中N为大于或等于零的整数。
条款31、根据条款26所述的方法,其中所述计算指令的操作码表示由所述处理电路阵列执行的多个操作,所述计算装置还包括控制电路,所述方法包括利用所述控制电路来获取所述计算指令并对所述计算指令进行解析,以得到与所述操作码表示的多个操作相对应的所述多个运算指令。
条款32、根据条款31所述的方法,其中利用所述控制电路来根据所述多个运算指令配置所述处理电路阵列,以得到所述多个处理电路子阵列。
条款33、根据条款32所述的方法,其中所述控制电路包括用于存储配置信息的寄存器,并且所述方法包括利用控制电路来根据所述多个运算指令提取对应的配置信息,并根据所述配置信息来配置所述处理电路阵列,以得到所述多个处理电路子阵列。
条款34、根据条款26所述的方法,其中所述多个运算指令包括至少一条多级流水运算,所述一条多级流水运算中包括至少两个运算指令。
条款35、根据条款26所述的方法,其中所述运算指令包括谓词,并且所述方法还包括利用每个所述处理电路来根据所述谓词判断是否执行与其关联的所述运算指令。
条款36、根据条款26所述的方法,其中所述处理电路阵列是一维阵列,并且所述方法包括将所述处理电路阵列中的一个或多个处理电路配置成作为一个所述处理电路子阵列。
条款37、根据条款26所述的方法,其中所述处理电路阵列是二维阵列,并且所述方法还包括:将所述处理电路阵列中的一行或多行处理电路配置成作为一个所述处理电路子阵列;或者将所述处理电路阵列中的一列或多列处理电路配置成作为一个所述处理电路子阵列;或者将所述处理电路阵列中沿对角线方向上的一排或多排处理电路配置成作为一个所述处理电路子阵列。
条款38、根据条款37所述的方法,其中将位于所述二维阵列中的所述多个处理 电路配置成在其行方向、列方向或对角线方向的至少一个上以预定的二维间隔模式与同行、同列或同对角线的其余一个或多个所述处理电路连接。
条款39、根据条款38所述的方法,其中所述预定的二维间隔模式与所述连接中间隔的处理电路的数目相关联。
条款40、根据条款26所述的方法,其中所述处理电路阵列是三维阵列,并且所述方法包括将所述处理电路阵列中的三维子阵列或多个三维子阵列配置成作为一个所述处理电路子阵列。
条款41、根据条款40所述的方法，其中所述三维阵列是由多个层构成的三维阵列，其中每个层包括沿行方向、列方向和对角线方向排列的多个所述处理电路的二维阵列，所述方法包括：将位于所述三维阵列中的所述处理电路配置成在其行方向、列方向、对角线方向和层方向的至少一个上以预定的三维间隔模式与同行、同列、同对角线或不同层上的其余一个或多个处理电路连接。
条款42、根据条款41所述的方法,其中所述预定的三维间隔模式与待连接的处理电路之间的间隔数目和间隔层数相关联。
条款43、根据条款36-42的任意一项所述的方法,其中所述处理电路子阵列中的多个处理电路形成一个或多个闭合的环路。
条款44、根据条款26所述的方法,其中各所述处理电路子阵列适于执行以下运算中的至少一种:算术运算、逻辑运算、比较运算和查表运算。
条款45、根据条款26所述的方法,还包括数据操作电路,该数据操作电路包括前操作电路和/或后操作电路,所述方法包括利用所述前操作电路来执行至少一个所述运算指令的输入数据的预处理和/或利用所述后操作电路来执行至少一个运算指令的输出数据的后处理。
条款46、根据条款45所述的方法,其中所述预处理包括针对于数据摆放和/或查表操作,所述后处理包括数据类型转换和/或压缩操作。
条款47、根据条款46所述的方法，其中所述数据摆放包括根据所述运算指令的输入数据和/或输出数据的数据类型，将所述输入数据和/或输出数据进行相应的拆分或合并后，传递至对应的处理电路中以便运算。
虽然本文已经示出和描述了本披露的多个实施例,但对于本领域技术人员显而易见的是,这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中,可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围,并因此覆盖这些权利要求范围内的等同或替代方案。

Claims (47)

  1. 一种计算装置,包括:
    处理电路阵列,其由多个处理电路以一维或多维阵列的结构连接而成,其中所述处理电路阵列配置成多个处理电路子阵列,并且响应于接收到多个运算指令来执行多线程运算,
    其中所述多个运算指令由对所述计算装置接收到的计算指令进行解析而获得,并且其中所述计算指令的操作数包括用于指示张量的形状的描述符,所述描述符用于确定所述操作数对应数据的存储地址,
    其中所述多个处理电路子阵列中的至少一个处理电路子阵列配置成根据所述存储地址来执行所述多个运算指令中的至少一个运算指令。
  2. 根据权利要求1所述的计算装置,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。
  3. 根据权利要求2所述的计算装置,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。
  4. 根据权利要求3所述的计算装置,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。
  5. 根据权利要求4所述的计算装置,其中所述张量数据的形状参数包括以下至少一种:
    所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系,其中N为大于或等于零的整数。
  6. 根据权利要求1所述的计算装置,所述计算指令的操作码表示由所述处理电路阵列执行的多个操作,所述计算装置还包括控制电路,其配置成获取所述计算指令并对所述计算指令进行解析,以得到与所述操作码表示的多个操作相对应的所述多个运算指令,并且在所述计算指令的操作数包括所述描述符时,所述控制电路配置成根据所述描述符来确定所述操作数对应数据的存储地址。
  7. 根据权利要求6所述的计算装置,其中所述控制电路根据所述多个运算指令配置所述处理电路阵列,以得到所述多个处理电路子阵列。
  8. 根据权利要求7所述的计算装置,其中所述控制电路包括用于存储配置信息的寄存器,并且控制电路根据所述多个运算指令提取对应的配置信息,并根据所述配置信息来配置所述处理电路阵列以得到所述多个处理电路子阵列。
  9. 根据权利要求1所述的计算装置,所述多个运算指令包括至少一条多级流水运算,所述一条多级流水运算中包括至少两个运算指令。
  10. 根据权利要求1所述的计算装置,其中所述运算指令包括谓词,并且每个所述处理电路根据所述谓词判断是否执行与其关联的所述运算指令。
  11. 根据权利要求1所述的计算装置,其中所述处理电路阵列是一维阵列,并且所述处理电路阵列中的一个或多个处理电路配置成作为一个所述处理电路子阵列。
  12. 根据权利要求1所述的计算装置,其中所述处理电路阵列是二维阵列,并且其中:
    所述处理电路阵列中的一行或多行处理电路配置成作为一个所述处理电路子阵列;或者
    所述处理电路阵列中的一列或多列处理电路配置成作为一个所述处理电路子阵列;或者
    所述处理电路阵列中沿对角线方向上的一排或多排处理电路配置成作为一个所述处理电路子阵列。
  13. 根据权利要求12所述的计算装置,其中位于所述二维阵列中的所述多个处理电路配置成在其行方向、列方向或对角线方向的至少一个上以预定的二维间隔模式与同行、同列或同对角线的其余一个或多个所述处理电路连接。
  14. 根据权利要求13所述的计算装置,其中所述预定的二维间隔模式与所述连接中间隔的处理电路的数目相关联。
  15. 根据权利要求1所述的计算装置,其中所述处理电路阵列是三维阵列,并且所述处理电路阵列中的三维子阵列或多个三维子阵列配置成作为一个所述处理电路子阵列。
  16. 根据权利要求15所述的计算装置，其中所述三维阵列是由多个层构成的三维阵列，其中每个层包括沿行方向、列方向和对角线方向排列的多个所述处理电路的二维阵列，其中：
    位于所述三维阵列中的所述处理电路配置成在其行方向、列方向、对角线方向和层方向的至少一个上以预定的三维间隔模式与同行、同列、同对角线或不同层上的其余一个或多个处理电路连接。
  17. 根据权利要求16所述的计算装置,其中所述预定的三维间隔模式与待连接的处理电路之间的间隔数目和间隔层数相关联。
  18. 根据权利要求11-17的任意一项所述的计算装置,其中所述处理电路子阵列中的多个处理电路形成一个或多个闭合的环路。
  19. 根据权利要求1所述的计算装置,其中各所述处理电路子阵列适于执行以下运算中的至少一种:算术运算、逻辑运算、比较运算和查表运算。
  20. 根据权利要求1所述的计算装置,还包括数据操作电路,该数据操作电路包括前操作电路和/或后操作电路,其中所述前操作电路配置成执行至少一个所述运算指令的输入数据的预处理,而所述后操作电路配置成执行至少一个运算指令的输出数据的后处理。
  21. 根据权利要求20所述的计算装置,其中所述预处理包括数据摆放和/或查表操作,所述后处理包括数据类型转换和/或压缩操作。
  22. 根据权利要求21所述的计算装置,其中所述数据摆放包括根据所述运算指令的输入数据和/或输出数据的数据类型,将所述输入数据和/或输出数据进行相应的拆分或合并后,传递至对应的处理电路中以便运算。
  23. 一种集成电路芯片,包括根据权利要求1-22的任意一项所述的计算装置。
  24. 一种板卡,包括根据权利要求23所述的集成电路芯片。
  25. 一种电子设备,包括根据权利要求23所述的集成电路芯片。
  26. 一种使用计算装置来执行计算的方法,其中所述计算装置包括处理电路阵列,该处理电路阵列由多个处理电路以一维或多维阵列的结构连接而成,并且所述处理电路阵列配置成多个处理电路子阵列,所述方法包括:
    在所述计算装置处接收计算指令，并且对其进行解析而获得多个运算指令，其中所述计算指令的操作数包括用于指示张量的形状的描述符，所述描述符用于确定所述操作数对应数据的存储地址；
    响应于接收到所述多个运算指令,利用所述多个处理电路子阵列来执行多线程运算,其中所述多个处理电路子阵列中的至少一个处理电路子阵列配置成根据所述存储地址来执行多个运算指令中的至少一个运算指令。
  27. 根据权利要求26所述的方法,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。
  28. 根据权利要求27所述的方法,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。
  29. 根据权利要求28所述的方法,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。
  30. 根据权利要求29所述的方法,其中所述张量数据的形状参数包括以下至少一种:
    所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系,其中N为大于或等于零的整数。
  31. 根据权利要求26所述的方法,其中所述计算指令的操作码表示由所述处理电路阵列执行的多个操作,所述计算装置还包括控制电路,所述方法包括利用所述控制电路来获取所述计算指令并对所述计算指令进行解析,以得到与所述操作码表示的多个操作相对应的所述多个运算指令。
  32. 根据权利要求31所述的方法,其中利用所述控制电路来根据所述多个运算指令配置所述处理电路阵列,以得到所述多个处理电路子阵列。
  33. 根据权利要求32所述的方法,其中所述控制电路包括用于存储配置信息的寄存器,并且所述方法包括利用控制电路来根据所述多个运算指令提取对应的配置信息,并根据所述配置信息来配置所述处理电路阵列,以得到所述多个处理电路子阵列。
  34. 根据权利要求26所述的方法,其中所述多个运算指令包括至少一条多级流水运算,所述一条多级流水运算中包括至少两个运算指令。
  35. 根据权利要求26所述的方法,其中所述运算指令包括谓词,并且所述方法还包括利用每个所述处理电路来根据所述谓词判断是否执行与其关联的所述运算指令。
  36. 根据权利要求26所述的方法,其中所述处理电路阵列是一维阵列,并且所述方法包括将所述处理电路阵列中的一个或多个处理电路配置成作为一个所述处理电路子阵列。
  37. 根据权利要求26所述的方法,其中所述处理电路阵列是二维阵列,并且所述方法还包括:
    将所述处理电路阵列中的一行或多行处理电路配置成作为一个所述处理电路子阵列;或者
    将所述处理电路阵列中的一列或多列处理电路配置成作为一个所述处理电路子阵列;或者
    将所述处理电路阵列中沿对角线方向上的一排或多排处理电路配置成作为一个所述处理电路子阵列。
  38. 根据权利要求37所述的方法,其中将位于所述二维阵列中的所述多个处理电路配置成在其行方向、列方向或对角线方向的至少一个上以预定的二维间隔模式与同行、同列或同对角线的其余一个或多个所述处理电路连接。
  39. 根据权利要求38所述的方法,其中所述预定的二维间隔模式与所述连接中间隔的处理电路的数目相关联。
  40. 根据权利要求26所述的方法,其中所述处理电路阵列是三维阵列,并且所述方法包括将所述处理电路阵列中的三维子阵列或多个三维子阵列配置成作为一个所述处理电路子阵列。
  41. 根据权利要求40所述的方法，其中所述三维阵列是由多个层构成的三维阵列，其中每个层包括沿行方向、列方向和对角线方向排列的多个所述处理电路的二维阵列，所述方法包括：
    将位于所述三维阵列中的所述处理电路配置成在其行方向、列方向、对角线方向和层方向的至少一个上以预定的三维间隔模式与同行、同列、同对角线或不同层上的其余一个或多个处理电路连接。
  42. 根据权利要求41所述的方法,其中所述预定的三维间隔模式与待连接的处理电路之间的间隔数目和间隔层数相关联。
  43. 根据权利要求36-42的任意一项所述的方法,其中所述处理电路子阵列中的多个处理电路形成一个或多个闭合的环路。
  44. 根据权利要求26所述的方法,其中各所述处理电路子阵列适于执行以下运算中的至少一种:算术运算、逻辑运算、比较运算和查表运算。
  45. 根据权利要求26所述的方法,还包括数据操作电路,该数据操作电路包括前操作电路和/或后操作电路,所述方法包括利用所述前操作电路来执行至少一个所述运算指令的输入数据的预处理和/或利用所述后操作电路来执行至少一个运算指令的输出数据的后处理。
  46. 根据权利要求45所述的方法,其中所述预处理包括针对于数据摆放和/或查表操作,所述后处理包括数据类型转换和/或压缩操作。
  47. 根据权利要求46所述的方法,其中所述数据摆放包括根据所述运算指令的输入数据和/或输出数据的数据类型,将所述输入数据和/或输出数据进行相应的拆分或合并后,传递至对应的处理电路中以便运算。
PCT/CN2021/095703 2020-06-30 2021-05-25 计算装置、集成电路芯片、板卡、电子设备和计算方法 WO2022001498A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010619458.0 2020-06-30
CN202010619458.0A CN113867792A (zh) 2020-06-30 2020-06-30 计算装置、集成电路芯片、板卡、电子设备和计算方法
