WO2023045445A1 - Data processing device, data processing method and related products - Google Patents

Data processing device, data processing method and related products

Info

Publication number
WO2023045445A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
data
size
input
output
Prior art date
Application number
PCT/CN2022/100302
Other languages
English (en)
French (fr)
Inventor
肖麟慧
郑鎏韬
王楠
喻歆
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2023045445A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means (under G06N 3/06 Physical realisation)
    • G06N 3/045 Combinations of networks (under G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08 Learning methods

Definitions

  • the present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a data processing device, a data processing method for executing block instructions on data by using the data processing device, a chip and a board.
  • Neural networks are one of the most critical technologies in artificial intelligence and deep learning, among which the Convolutional Neural Network (CNN) is the most important network type.
  • the most critical calculation in the convolutional neural network is the convolution operation (Convolution Operation) of the convolution layer (Conv layer).
  • the function of the convolutional layer is to extract features from the input data. Through multi-layer convolution, complex features can be extracted to ensure that the network has sufficient expressive ability and generalization ability.
  • the neural network model contains a large number of various types of convolution operations, and the computing performance of the convolution operation greatly affects the computing performance of the entire neural network model.
  • for these various convolution operations, the corresponding input feature maps and weights may have different dimension sizes.
  • the present disclosure proposes, in various aspects, a data processing device that can adapt data of various dimension sizes to the hardware performing the convolution operation, thereby improving the computational efficiency of the convolution operation.
  • the convolution operation in the embodiments of the present disclosure can be an operation in various neural network models, and these neural network models can be applied in various fields such as image processing, speech processing and text processing; such processing may include, but is not limited to, recognition and classification.
  • in a first aspect, an embodiment of the present disclosure provides a data processing device, including a control circuit, a first storage circuit, and a second storage circuit, wherein: the first storage circuit is used to store data before processing; the second storage circuit is used to store processed data; and the control circuit is used to configure and execute a block instruction, so that the input data stored on the first storage circuit in a first dimension storage order is split in units of split units and stored as output data on the second storage circuit, wherein on the second storage circuit the data within each split unit is stored in a second dimension storage order, and the split units themselves are stored in a third dimension storage order.
  • an embodiment of the present disclosure provides a chip, which includes the data processing device in the aforementioned first aspect.
  • an embodiment of the present disclosure provides a board, which includes the aforementioned chip in the second aspect.
  • an embodiment of the present disclosure provides a data processing method for executing a block instruction on input data by using the data processing apparatus in the aforementioned first aspect.
  • the solution of the embodiments of the present disclosure performs block processing on data under various convolution splitting schemes so as to adapt to the processing capability of the hardware computing device; in this way, the parallel processing capability of multiple slave processing circuits can be fully utilized, and the computing efficiency of the convolution operation can be effectively improved.
  • Fig. 1 shows a structural diagram of a board card according to an embodiment of the present disclosure
  • FIG. 2 shows a structural diagram of a combination processing device according to an embodiment of the present disclosure
  • FIG. 3a shows a schematic diagram of the internal structure of a processor core of a single-core computing device according to an embodiment of the disclosure
  • Fig. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure
  • FIG. 4 shows an example of an exemplary convolution operation principle to which embodiments of the present disclosure can be applied
  • Fig. 5 shows a schematic structural block diagram of a computing device according to an embodiment of the disclosure
  • FIG. 6 shows an exemplary data storage sequence according to an embodiment of the present disclosure
  • Figures 7a-7c illustrate several exemplary grouping modes according to embodiments of the present disclosure
  • Fig. 8 shows an exemplary split schematic diagram of an input feature map according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of splitting and storage of the Forward4 scheme according to an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of division of output points of the computing circuit in the Forward4 scheme according to an embodiment of the present disclosure
  • FIG. 11 shows a schematic diagram of a single operation in the Forward4 scheme according to an embodiment of the disclosure
  • Fig. 12 shows a schematic diagram of sliding convolution in the Forward4 scheme according to an embodiment of the present disclosure
  • Fig. 13 shows a schematic diagram of the output data format of the Forward4 scheme according to an embodiment of the present disclosure
  • FIG. 14 shows an overall data handling process according to an embodiment of the present disclosure
  • Figure 15 shows a schematic conceptual diagram of Trans Tiling according to an embodiment of the disclosure
  • Figure 16 shows a schematic diagram of the front and back tables
  • Fig. 17 shows a schematic diagram of executing a block instruction on neuron data according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board card 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrated with one or more combined processing devices; the combined processing device is an artificial intelligence computing unit.
  • the intelligent computing unit is used to support various deep learning and machine learning algorithms to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board card 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage, and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102.
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing the combined processing means in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor for performing deep learning or machine learning calculations, which can interact with the processing device 203 through the interface device 202 to Work together to complete user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU) or other general-purpose and/or special-purpose processors.
  • these processors include, but are not limited to, digital signal processors (DSP), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • when only the computing device 201 of the present disclosure is considered, it can be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 201 and the processing device 203 are considered together in an integrated manner, the two are considered to form a heterogeneous multi-core structure.
  • the storage device 204 is used to store data to be processed; it may be a DRAM used as DDR memory, typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.
  • Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device.
  • the computing device 301 is used for processing input data such as computer vision, speech, natural language, data mining, etc.
  • the computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (IFU) 311 and an instruction decoding unit (instruction decode unit, IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • DMA 333 is connected to DRAM 204 through a bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
  • Fig. 3b shows a simplified schematic diagram of the multi-core internal structure of the computing device 201 .
  • Multi-core computing devices can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device can be abstracted into four levels, namely card level (Card) 350 , chip level (Chip) 360 , processor cluster level (Cluster) 370 and processor core level (Core) 380 .
  • the embodiments of the present disclosure mainly involve the data transmission of the storage unit and the calculation unit, so the drawings and description briefly show and introduce the relevant calculation structure, and other parts are omitted.
  • each board contains local DDR storage, and each processor chip acts as a computing and control unit.
  • each processor chip contains multiple multiprocessors as computing units.
  • each multiprocessor includes multiple accelerator cores as control and computing units, and a shared storage SRAM as a storage unit.
  • each accelerator core contains local storage and an array of local processing units.
  • NFU refers to the Neuron Function Unit, which is used for convolution calculations.
  • the storage model includes board global memory, SRAM (shared memory) on the Cluster, NRAM, WRAM and registers on the Core, and the like.
  • SRAM is included in the storage processing unit MPU (Memory Process Unit Core, referred to as MPU, or Mem Core).
  • Core refers to an intelligent processing core (Intelligent Process Unit Core, IPU Core or Core for short) in a multi-core computing device.
  • IPU Core contains NRAM, WRAM, NFU and so on.
  • Cluster refers to a processor cluster or a computing cluster.
  • a multi-core computing device includes several Clusters, and a Cluster includes 1 Mem Core+N IPU Cores.
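  • as a rough illustration (counts and capacities below are assumptions; only the hierarchy itself and the "1 Mem Core + N IPU Cores" relation come from the description above), the storage and compute hierarchy can be modeled as:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IPUCore:                 # Core level: contains NRAM, WRAM, NFU, registers
    nram_kb: int = 512         # assumed capacity, for illustration only
    wram_kb: int = 512

@dataclass
class Cluster:                 # Cluster level: 1 Mem Core (shared SRAM) + N IPU Cores
    sram_kb: int = 2048        # assumed capacity
    ipu_cores: List[IPUCore] = field(default_factory=list)

@dataclass
class Chip:                    # Chip level: several Clusters
    clusters: List[Cluster] = field(default_factory=list)

@dataclass
class Card:                    # Card level: local DDR (global memory) + processor chips
    ddr_gb: int = 16           # "usually 16G or larger", per the description of DRAM 204
    chips: List[Chip] = field(default_factory=list)

# Example instance: 1 chip with 4 clusters of 4 IPU cores each (all counts assumed).
card = Card(chips=[Chip(clusters=[Cluster(ipu_cores=[IPUCore() for _ in range(4)])
                                  for _ in range(4)])])
```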
  • the convolutional layer in a neural network model can extract features from the input feature maps (also called input data, neurons, or input neurons) by applying convolution kernels (also called filters or weights) to them in a convolution operation.
  • the convolution layer can contain multiple convolution kernels, and each element that makes up the convolution kernel corresponds to a weight coefficient and a bias.
  • Embodiments of the present disclosure can be applied to data splitting of various convolution operations.
  • here, X is the input data, Y is the output data, K is the convolution kernel, Kh and Kw are the height and width of K, and sh and sw are the strides in the height and width directions, respectively.
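  • as an illustrative reconstruction (an assumed standard form rather than the formula as it appears in the original), the convolution described by these symbols can be written for a single output point, omitting the N and C dimensions and the bias, as:

$$
Y(h_o, w_o) = \sum_{k_h=0}^{K_h-1} \sum_{k_w=0}^{K_w-1} X\big(h_o \cdot s_h + k_h,\; w_o \cdot s_w + k_w\big) \cdot K\big(k_h, k_w\big)
$$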
  • the formula ignores the bias, the padding (pad) and the dilation, and assumes that the input data X has already been padded and the convolution kernel has already been dilated.
  • the formula ignores the N dimension and the C dimension.
  • the forward calculation of the neural network model is independent in the N dimension and fully connected in the C dimension.
  • Fig. 4 shows an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure can be applied.
  • the figure exemplarily shows four-dimensional input data X with a size of [N Hi Wi Ci], which can be expressed as N three-dimensional rectangles 410 of size Hi ⁇ Wi ⁇ Ci.
  • the figure also exemplarily shows a four-dimensional convolution kernel K with a size of [Co Kh Kw Ci], which can be expressed as Co three-dimensional convolution kernels 420 of size Kh ⁇ Kw ⁇ Ci.
  • the convolution result of the input data X and the convolution kernel K obtains the output data Y, which is four-dimensional data of the size [N Ho Wo Co], which can be expressed as N three-dimensional rectangles 430 of the size Ho ⁇ Wo ⁇ Co.
  • the figure also specifically shows an example of the convolution operation, in which the input data is an input feature map 440 with a size of 6×6×3 (the N dimension is omitted); the convolution kernel is a three-dimensional convolution kernel 450 with a size of 3×3×3, corresponding to a single Co; and the output data is a 4×4 output feature map 460.
  • the specific operation process is as follows:
  • the convolution kernel 450 scans the input feature map 440 with a certain step size, performs element-wise multiplication and summation on the input features within the convolution window 470, and adds the offset. That is, the value at each position of the output feature map 460 is obtained by performing a two-dimensional convolution of the corresponding block of each input feature map with the corresponding convolution kernel and then summing the results. For example, the figure shows how the value at the (0,0) position of the output feature map 460 (that is, a convolution output point) is obtained: a two-dimensional convolution of the convolution window 470 framed by the black cube in the input feature map with the three-dimensional convolution kernel 450 yields 3 values, which are summed to obtain the final value.
  • the position of the convolution kernel 450 can be moved on the input feature map 440 , that is, the convolution window of the convolution output point can be moved.
  • the convolution step size (Sx, Sy) is (1,1).
  • by moving the convolution window, the convolution operation can respectively obtain the values at the (0,1) or (1,0) positions of the output feature map 460.
  • in a convolutional layer of a neural network, there are N groups of input feature maps, and each group contains Hi×Wi×Ci pieces of information, where Hi and Wi are the height and width of the input feature map respectively, and Ci is the number of input feature maps, also known as the number of input channels.
  • the convolutional layer has Ci ⁇ Co convolution kernels of Kh ⁇ Kw size, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), Kh and Kw are the height and width.
  • the output feature map contains Ho ⁇ Wo ⁇ Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels.
  • the convolution operation also has a convolution stride (Sx, Sy), and the size of the convolution stride affects the size of the output feature map.
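  • as a hedged illustration (a standard relation assumed here, consistent with the later 9×9 input / 5×5 kernel / stride 1 example that yields a 5×5 output, rather than a formula quoted from the original), with no additional padding the output size follows:

$$
H_o = \left\lfloor \frac{H_i - K_h}{S_y} \right\rfloor + 1, \qquad W_o = \left\lfloor \frac{W_i - K_w}{S_x} \right\rfloor + 1
$$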
  • input feature map (Feature map), input data, neuron or input neuron are used interchangeably; convolution kernel, filter or weight are used interchangeably.
  • H (height) and Y dimensions are used interchangeably, and the W (width) and X dimensions are used interchangeably.
  • the H dimension of the input feature map can be expressed as Hi or Yi
  • the H dimension of the output feature map can be expressed as Ho or Yo
  • the W dimension can be expressed similarly.
  • each convolution output point has a corresponding convolution window, and the shape of the convolution window is equal to the shape of the convolution kernel. The value of each convolution output point corresponds to the result of the multiplication and accumulation of the input feature map and the weight in its convolution window.
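  • to make the correspondence between convolution windows and output points concrete, the following minimal NumPy sketch (an illustration added here, not code from the original disclosure) computes one output channel of the 6×6×3 input / 3×3×3 kernel example above with stride 1:

```python
import numpy as np

Hi, Wi, Ci = 6, 6, 3          # input feature map, HWC layout
Kh, Kw = 3, 3                 # convolution kernel for a single Co
sh, sw = 1, 1                 # strides in the H and W directions

x = np.random.rand(Hi, Wi, Ci).astype(np.float32)
k = np.random.rand(Kh, Kw, Ci).astype(np.float32)

Ho = (Hi - Kh) // sh + 1      # 4
Wo = (Wi - Kw) // sw + 1      # 4
y = np.zeros((Ho, Wo), dtype=np.float32)

for ho in range(Ho):
    for wo in range(Wo):
        # convolution window of this output point (same shape as the kernel)
        window = x[ho * sh : ho * sh + Kh, wo * sw : wo * sw + Kw, :]
        # multiply-accumulate over the whole window, summing over Ci as well
        y[ho, wo] = np.sum(window * k)
```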
  • a computing device with a master-slave structure may be used to implement the above convolution operation.
  • different data paths can be configured for input feature maps and convolution kernels, thereby improving memory access efficiency.
  • FIG. 5 shows a schematic structural block diagram of a computing device 500 according to an embodiment of the disclosure. It can be understood that this structure can be regarded as a refinement of the internal structure of the operation module of a single processing core in Fig. 3, or as a functional division block diagram obtained by combining the operation modules of multiple processing cores shown in Fig. 3.
  • a computing device 500 in an embodiment of the present disclosure may be configured to perform various types of convolution operations, and may include a master processing circuit (MA) 510 and a plurality of slave processing circuits (SL) 520; the figure shows 16 slave processing circuits SL0 to SL15.
  • the master processing circuit and the slave processing circuits, as well as multiple slave processing circuits, can communicate with each other through various connections.
  • the connections between the multiple slave processing circuits can be hard-wired, or can be logically configured according to, for example, micro-instructions, so as to form a variety of slave processing circuit array topologies; embodiments of the present disclosure are not limited in this regard.
  • the main processing circuit and the slave processing circuit can cooperate with each other, thereby realizing parallel operation processing.
  • the main processing circuit and the slave processing circuit may include various calculation circuits, for example, may include a vector operation unit and a matrix operation unit.
  • the vector operation unit is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit is responsible for the core calculations of deep learning algorithms, such as matrix multiplication and convolution.
  • the slave processing circuit can be used to perform intermediate operations on corresponding data in parallel according to the operation instruction to obtain multiple intermediate results, and transmit the multiple intermediate results back to the main processing circuit.
  • by setting the computing device 500 into a master-slave structure (for example, a one-master-multiple-slaves structure, or a multiple-masters-multiple-slaves structure; the present disclosure is not limited in this respect), the data can be split according to the calculation instruction of the forward operation, so that multiple slave processing circuits perform parallel calculation on the part with a large amount of calculation, thereby improving calculation speed, saving calculation time, and reducing power consumption.
  • multiple multiplexing methods of input feature maps and weights can be supported, thereby reducing the amount of data access during operations and improving processing efficiency .
  • the computing device 500 may further include a first storage circuit 530 and a second storage circuit 540 for respectively storing data transmitted via different data channels.
  • the first storage circuit 530 can be used to store multicast data, that is, the data in the first storage circuit will be transmitted to multiple slave processing circuits through the broadcast bus, and these slave processing circuits receive the same data. It can be understood that broadcasting and multicasting can be implemented through the broadcasting bus. Multicast refers to a communication method that transmits a piece of data to multiple slave processing circuits; broadcasting is a communication method that transmits a piece of data to all slave processing circuits, which is a special case of multicast. Since both multicast and broadcast correspond to a one-to-many transmission mode, there is no special distinction between the two in this document. Broadcast and multicast can be collectively referred to as multicast, and those skilled in the art can clarify their meanings according to the context.
  • the second storage circuit 540 may be used to store and distribute data, that is, the data in the second storage circuit will be transmitted to different slave processing circuits respectively, and each slave processing circuit receives different data.
  • in some embodiments, the main processing circuit may determine one of the input feature map and the convolution kernel as multicast data and store it in the first storage circuit, so as to transmit the data to the multiple scheduled slave processing circuits by broadcasting.
  • the main processing circuit may determine the other of the input feature map and the convolution kernel as distribution data and store it in the second storage circuit. These distributed data can be distributed to corresponding slave processing circuits before operation.
  • FIG. 5 also shows a schematic diagram of the internal structure of the slave processing circuit SL according to an embodiment of the present disclosure.
  • each slave processing circuit 520 may include a plurality of operation circuits CU 521, a first buffer circuit 522 and a second buffer circuit 523.
  • four arithmetic circuits CU0 to CU3 are shown.
  • the number of computing circuits may be more or less depending on specific hardware configurations, and the embodiments of the present disclosure are not limited in this regard.
  • the first buffer circuit 522 may be used for buffering weights or input feature maps assigned to the slave processing circuit.
  • the second buffer circuit 523 may be used for buffering the input feature map or the weight assigned to the slave processing circuit. These two buffer circuits are used to select the data involved in the operation.
  • the data in the first buffer circuit 522 may be a plurality of data rows from, for example, the first storage circuit 530 or the second storage circuit 540; correspondingly, the data in the second buffer circuit 523 may be a plurality of data rows from, for example, the second storage circuit 540 or the first storage circuit 530. Depending on the specific multiplexing method, these data rows can be distributed to the corresponding operation circuits CU 521 or broadcast to all CUs 521 in the slave processing circuit 520 during the operation.
  • each operation circuit CU 521 is used for performing, during each calculation, a bitwise multiply-accumulate operation on data rows selected from the first buffer circuit and data rows selected from the second buffer circuit.
  • the slave processing circuit 520 may also include a third buffer circuit 524 for buffering the calculation results of each calculation circuit CU 521.
  • each processing circuit and storage circuit are shown as separate modules in FIG. 5 , according to different configurations, the storage circuit and the processing circuit may also be combined into one module.
  • the first storage circuit 530 can be combined with the main processing circuit 510
  • the second storage circuit 540 can be shared by multiple slave processing circuits 520, and an independent storage area is assigned to each slave processing circuit to speed up access.
  • Embodiments of the present disclosure are not limited in this regard.
  • the main processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, and the present disclosure is not limited in this respect.
  • the dimensions of the involved multidimensional data are represented by (N, H, W, C) or (Co, H, W, Ci), which represent the storage order of the data in the memory. It can be understood that although the multidimensional data has multiple dimensions, since the layout of the memory is always one-dimensional, there is a corresponding relationship between the multidimensional data and the storage order on the memory. Multidimensional data is usually allocated in continuous storage space, that is, multidimensional data can be expanded in one dimension and stored in the memory in sequence.
  • the initial input feature map can be stored sequentially with low-dimension priority (where C/Ci is the lowest dimension); in order to optimize the convolution operation, the dimension storage order of the input feature map can be adjusted during the operation, as described below.
  • Adjacent dimensions refer to dimensions that are next to each other in the dimension information representation of multidimensional data, for example, W and Ci are adjacent, and adjacent dimensions may also be called continuous dimensions.
  • the main computing unit of the hardware is a vector multiply-accumulate operator.
  • implementing support for various convolution algorithms in hardware design essentially amounts to extracting the multiply and add operations in the algorithm to the maximum extent, and using the data path between the on-chip RAM (such as NRAM and WRAM in Fig. 3) and the arithmetic units to efficiently exchange the input and output data of the multiply-accumulate operations.
  • hardware storage is organized line by line (cache line), and read, write and calculation operations are most efficient when they are aligned to whole lines. Therefore, in order to make full use of the bandwidth and adapt to the memory access requirements of the arithmetic unit array, it is usually necessary to vectorize and align the data.
  • the design of artificial intelligence chips usually takes the Ci dimension as the lowest dimension, that is, the above-mentioned NHWC arrangement order, in which the data in the Ci dimension is contiguous. Therefore, vectorization alignment requires the size of the Ci dimension to be aligned to a specified value, for example an alignment value M, so that memory accesses are performed in units of the alignment value M; M can also be called the maximum single operation amount of the hardware.
  • M can have different values, such as 64bit, 128bit, 256bit, 512bit, etc.
  • the size of the input port of the operator array is also related to M.
  • the input port size of the operator array is usually twice the size of M, that is, it processes feature map data and weight data of alignment value M scale at one time.
  • when the Ci dimension of the input feature map is large, it is easier to meet the above alignment requirements.
  • when the Ci dimension of the input feature map is small, for example smaller than one cache line, the Ci dimension needs to be padded up to one line of data (for example, 512 bits), that is, padded with invalid 0s. Such padding causes a large number of redundant calculations, wasting resources and reducing the efficiency of the operation.
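  • the padding overhead can be sketched as follows (an illustrative calculation; the 512-bit line width matches the example above, while the Int8 element size and the sample Ci values are assumptions):

```python
# Illustrative calculation of padding waste when aligning Ci up to a cache line.
LINE_BYTES = 64            # 512-bit cache line, as in the example above
ELEM_BYTES = 1             # Int8 element, assumed for illustration

def padded_ci(ci, line_bytes=LINE_BYTES, elem_bytes=ELEM_BYTES):
    """Number of Ci elements actually occupied after aligning up to a full line."""
    elems_per_line = line_bytes // elem_bytes
    return ((ci + elems_per_line - 1) // elems_per_line) * elems_per_line

for ci in (3, 4, 16, 64):
    total = padded_ci(ci)
    waste = 1 - ci / total
    print(f"Ci={ci:3d}: padded to {total}, {waste:.0%} of the line is invalid zeros")
```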
  • in view of this, embodiments of the present disclosure provide a convolution operation scheme which can determine the corresponding convolution splitting scheme according to the size of the lowest storage dimension (such as Ci) of the input feature map, wherein the convolution splitting scheme at least indicates the shape of the split unit of the data to be operated on.
  • the amount of data contained in a split unit does not exceed the maximum single operation amount of the hardware.
  • for example, the amount of data contained in one split unit can be set to the one-time processing alignment value M of the hardware, so that calculation and processing can be performed in units of split units, which can fully utilize the computing power of the hardware and avoid or reduce invalid calculations.
  • the data type can be Int8, Int16, Float16 or Float32
  • the split scheme of 64B ⁇ 1 ⁇ 1 shape is called Forward64
  • the split scheme of 16B ⁇ 2 ⁇ 2 shape is called Forward16
  • the split scheme of 4B ⁇ 4 ⁇ 4 shape is called Forward4
  • the split scheme of 4B ⁇ 4 ⁇ 4 shape applied to depth convolution operation is called Forward1
  • the split scheme of 4B ⁇ 4 ⁇ 4 shape applied to reverse depth convolution operation is called Update1
  • the 4-shape split scheme applied to the cross-product convolution operation is called Update4.
  • these splitting schemes are suitable for scenarios where channel C is relatively small in convolution calculations, so they can also be collectively referred to as small convolutions.
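  • since the first dimension of each scheme above is given in bytes, the number of C elements in a split unit depends on the data type; the short sketch below (an illustrative derivation, not the Table 2 referenced later) computes the per-data-type block shape, each block totaling 64 bytes:

```python
# Derive the (C, H, W) element shape of a split unit from its byte-denominated scheme.
DTYPE_BYTES = {"Int8": 1, "Int16": 2, "Float16": 2, "Float32": 4}

SCHEMES = {                  # (C bytes, H, W) as named in the text
    "Forward64": (64, 1, 1),
    "Forward16": (16, 2, 2),
    "Forward4": (4, 4, 4),
}

def block_shape(scheme, dtype):
    """Element shape (C, H, W) of one 64-byte split unit for the given data type."""
    c_bytes, h, w = SCHEMES[scheme]
    return (c_bytes // DTYPE_BYTES[dtype], h, w)

print(block_shape("Forward4", "Int8"))     # (4, 4, 4)
print(block_shape("Forward4", "Float16"))  # (2, 4, 4)
print(block_shape("Forward4", "Float32"))  # (1, 4, 4)
```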
  • a split unit includes data of the lowest storage dimension and at least one other storage dimension, and the total data volume of a split unit does not exceed the maximum single operation of the hardware.
  • the input feature map and the convolution kernel can be split into multiple corresponding split units according to the determined convolution splitting scheme, and the dimension storage order can be converted, so that the data within one split unit is stored contiguously as one data row, which facilitates subsequent reading and processing in units of split units (data rows).
  • one or more split units may be read, in units of split units and in a first reading order, from the data to be operated on that is stored in the first dimension storage order, and the read split units may be stored on the corresponding storage circuit, wherein the data within each split unit is stored according to the second dimension storage order, and the data between split units is stored according to the third dimension storage order.
  • FIG. 6 shows an exemplary data storage sequence according to an embodiment of the present disclosure.
  • the diagram 610 on the left represents the storage manner of the four-dimensional tensor to be operated on, which includes N three-dimensional sub-tensors, with N in the highest dimension; that is, the first dimension storage order of the four-dimensional tensor is NHWC.
  • H and Y, W and X are used interchangeably herein.
  • Each subtensor is divided into smaller data blocks or split units, and the number of data blocks in each dimension is C/Y/X respectively.
  • the diagram 620 in the middle shows the storage method of each sub-tensor, and each data block is stored as a continuous 64Byte, that is, one row.
  • the order between rows changes accordingly.
  • the data blocks are read in the direction of C first, then X, and finally Y, that is, the first reading order is YXC, and the rows are stored in the order Y*X*C, that is, the third dimension storage order is YXC (or HWC).
  • the third dimension is stored in the same order as the first dimension. It can be understood that other reading orders may also be used, resulting in the storage order of the third dimension being different from that of the first dimension, which will not be listed here.
  • the diagram 630 on the right shows the order in each row, that is, the order of data in each data block, and its shape is blockC*blockY*blockX. At this time, the storage order of the second dimension is CYX or CHW.
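  • the following NumPy sketch (an illustration under assumed sizes, not code from the original) reproduces this relayout for the Forward4/Int8 case: an HWC-ordered (first dimension storage order) feature map is split into 4×4×4 blocks, the blocks are laid out in YXC order (third dimension storage order), and the data inside each block is laid out in CHW order (second dimension storage order), so that each split unit occupies one contiguous 64-byte data row:

```python
import numpy as np

# Assumed sizes: one N slice, Int8 data, dimensions already aligned to the block shape.
Hi, Wi, Ci = 8, 8, 8
blockY, blockX, blockC = 4, 4, 4

x = np.arange(Hi * Wi * Ci, dtype=np.int8).reshape(Hi, Wi, Ci)   # HWC storage order

# Split every dimension into (number of blocks, elements inside one block).
x6 = x.reshape(Hi // blockY, blockY,
               Wi // blockX, blockX,
               Ci // blockC, blockC)

# Block indices first, in Y, X, C order (third dimension order); inside a block,
# the data follows C, H(Y), W(X) order (second dimension order).
blocked = x6.transpose(0, 2, 4, 5, 1, 3)    # (nY, nX, nC, blockC, blockY, blockX)

# Each split unit becomes one contiguous 64-byte data row.
rows = blocked.reshape(-1, blockC * blockY * blockX)
print(rows.shape)                            # (8, 64): 2*2*2 blocks, 64 bytes per row
```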
  • the small convolution adopts the block form. Compared with the traditional convolution, the advantage is that the alignment in the Ci direction only needs to satisfy the alignment of the block in the Ci direction.
  • the weight (co*Kh*kw*ci) is generally small, Kh and Kw are usually single digits, and co and ci are similar.
  • the storage space of the second storage circuit (such as the WRAM 332 in FIG. 3) is larger than that of the first storage circuit (such as the NRAM 331 in FIG. 3).
  • the convolution operation principle described above it can be known that the operation results on the Co dimension (the depth convolution is the C dimension) do not need to be accumulated, so the operation allocation on different Co can be carried out relatively independently on different operation circuits.
  • the size of the Co dimension of the output channel of the convolution kernel in a single round of operation does not exceed the number of scheduled slave processing circuits, so the operation of a single Co needs to be completed by one or more slave processing circuits.
  • when the Co dimension is large, the operation can be realized by splitting it into multiple rounds of operations, wherein the size of Co processed by each round of operations does not exceed the number of scheduled slave processing circuits.
  • accordingly, the number of calculation rounds required to complete the convolution operation and the amount of Co processed in each round of operations, or the corresponding grouping mode, can be determined.
  • the convolution kernel is multiplexed on Rs SLs in the same SLB, and Rs represents the number of times the convolution kernel is multiplexed between slave processing circuits.
  • factors such as the limitation of hardware buffer space (for example, the sizes of the first buffer circuit and the second buffer circuit in Fig. 5) can be considered in order to determine the maximum number of times rs that the convolution kernel is multiplexed between slave processing circuits and the maximum number of times rn that the input feature map is multiplexed within a single slave processing circuit.
  • for simplicity, the situation in which one slave processing circuit processes multiple Co values in a single round of operation is not considered for the time being; only the case in which one or more slave processing circuits process a single Co value in a single round of operation is considered.
  • different grouping modes can be used according to the number of slave processing circuits SL that process the same Co value in a single round of operation. It can be understood that it is preferable to evenly distribute the schedulable slave processing circuits SL so as to balance the computing power; for example, one group per 2 SLs, so that 16 SLs can process 8 Co values at the same time, or one group per 4 SLs, so that 16 SLs can process 4 Co values simultaneously, and so on.
  • the second storage circuit WRAM has 16 storage areas, which are allocated to the 16 slave processing circuits SL respectively. Further, every 4 blocks can be combined into a storage block, which is assigned to the corresponding slave processing circuit group SLB.
  • the following grouping modes can be selected: Group1 mode, Group4 mode and Group16 mode.
  • grouping modes can refer to the above three representative grouping modes given herein for corresponding processing.
  • the above grouping mode can be uniformly expressed as GroupN, representing that all slave processing circuits SL scheduled in the current round of operations are divided into N groups, each slave processing circuit group SLB processes the same Co value, and different slave processing circuit groups SLB handles different Co values.
  • N can be 1, 4, or 16, corresponding to Group1, Group4, and Group16 above.
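  • the GroupN assignment can be sketched as follows (an illustrative mapping that assumes 16 schedulable SLs grouped contiguously, matching the Group1/Group4/Group16 examples described below):

```python
# Map each slave processing circuit (SL) index to its group (SLB) for GroupN mode.
NUM_SL = 16

def group_of(sl, n_groups):
    """Contiguous grouping: Group1 puts all SLs in group 0, Group4 has 4 SLs per group,
    Group16 has one SL per group; each group processes one Co value per round."""
    sls_per_group = NUM_SL // n_groups
    return sl // sls_per_group

for n in (1, 4, 16):
    groups = [group_of(sl, n) for sl in range(NUM_SL)]
    print(f"Group{n}: {groups}")
```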
  • Figures 7a-7d illustrate several exemplary grouping schemes according to embodiments of the present disclosure.
  • Figure 7a shows a Group1 mode
  • Figure 7b shows a Group16 mode
  • Figure 7c shows a Group4 mode
  • Figure 7d shows another Group4 mode.
  • the Group1 mode means that all 16 schedulable SLs belong to one group and jointly process one Co value, for example, SL0-SL15 belong to group G0. Thus, operations for this one output channel are distributed over 16 SLs.
  • priority can be given to broadcasting the convolution kernel 720 of the output channel to each SL, and the input feature map 710 is split and distributed to each SL, thereby improving memory access efficiency.
  • the convolution kernel can be stored in the first storage circuit 530 in FIG. 5 for transmission using a broadcast channel.
  • the input feature map can be divided according to the XY direction of the output feature map and stored in the second storage circuit 540 to be allocated to different SLs.
  • all SLs jointly compute an output feature map of Co.
  • the Group16 mode means that all 16 schedulable SLs are divided into 16 groups, that is, each group has one SL, and each SL handles a different Co value.
  • SL0 belongs to group G0
  • SL1 belongs to group G1
  • SL15 belongs to group G15.
  • the same input feature map 730 can be reused among the 16 SLs, so it can be prioritized to broadcast the input feature map 730 to each SL, while the convolution kernels 740 corresponding to different Co are distributed to the corresponding SLs.
  • 16 copies of the input feature map may be copied and stored in 16 storage areas allocated to the 16 slave processing circuits on the second storage circuit.
  • the convolution kernel is divided according to Co, one SL corresponds to one Co, and 16 Cos are processed at a time, stored in the first storage circuit, and distributed to different SLs in a unicast manner.
  • all SLs compute output feature maps of different Co for the same input feature map.
  • the Group4 mode means that all 16 schedulable SLs are divided into 4 groups, and each group processes a Co value.
  • SL0-SL3 belong to group G0
  • SL4-SL7 belong to group G1
  • SL8-SL11 belong to group G2
  • SL12-SL15 belong to group G3.
  • This mode is between Group1 and Group16, so either the convolution kernel or the input feature map can be determined as multicast data, while the other can be determined as distribution data.
  • the convolution kernels can be divided into 4 groups according to Co, and stored in the first storage circuit 530 in FIG. 5 , so as to be transmitted through a broadcast channel.
  • the input feature map can be divided into 4 parts according to the XY direction of the output feature map, copied into 4 parts, stored in the second storage circuit 540, and distributed to the 4 SLBs.
  • Each SLB obtains the same input feature map, and then distributes it to the 4 SLs in the SLB according to the 4 divided parts.
  • all SLs in each SLB jointly compute the output feature map of a Co, and the 4 SLBs process a different Co respectively.
  • the convolution kernels are divided into 4 groups, and the Co values are assigned to the groups at an interval of 1 (round-robin) according to Co.
  • for example, when Co is 12, the four groups of Co are {0, 4, 8}, {1, 5, 9}, {2, 6, 10} and {3, 7, 11} respectively.
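  • this interval-1 assignment can be written compactly (an illustrative sketch):

```python
# Round-robin assignment of Co values to 4 groups, as in the Co=12 example above.
Co, n_groups = 12, 4
co_groups = [[co for co in range(Co) if co % n_groups == g] for g in range(n_groups)]
print(co_groups)   # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```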
  • the neurons can be stored in the second storage circuit WRAM, and the weights can be stored in the first storage circuit NRAM.
  • the input feature map needs to be split between these multiple SLs.
  • the Group1 grouping mode needs to split the input feature map into 16 parts.
  • the Group4 grouping mode needs to split the input feature map into 4 parts.
  • in some embodiments, the input feature map may be divided among the Rs slave processing circuits SL included in each slave processing circuit group as follows: according to the size of the corresponding output feature map, the output feature map is evenly divided in the XY dimension (that is, the Ho/Wo dimension) into Rs output feature blocks of the same shape; and according to the input feature map region required for calculating each output feature block, the input feature map is divided in the XY dimension (that is, the Hi/Wi dimension) into Rs input feature blocks to be distributed to the Rs slave processing circuits. It can be understood that, depending on the size of the convolution kernel and the convolution stride, the input feature map regions corresponding to adjacent output points on the output feature map may overlap.
  • Fig. 8 shows an exemplary split diagram of an input feature map according to an embodiment of the present disclosure.
  • the input feature map is divided into 16 parts and distributed on 16 SLs, corresponding to the Group1 mode.
  • the 16 output feature blocks can be mapped to the input feature map 820 to obtain the 16 input feature map regions required to calculate the 16 output feature blocks respectively, which also divides the input feature map in the XY direction.
  • These 16 input feature map regions can be assigned to 16 slave processing circuits SL accordingly.
  • as mentioned above, the input feature map will be split in units of split units according to the determined convolution splitting scheme. Therefore, in the above embodiment, the division of the input feature map should make each divided input feature block a multiple of the split unit dimensions in the XY direction, that is, each block can be aligned to the split unit in the XY direction. For example, when a 4×4×4 convolution splitting scheme is chosen, each input feature block is aligned to 4×4; when a 16×2×2 convolution splitting scheme is chosen, each input feature block is aligned to 2×2.
  • if the output feature map is not aligned to the split unit (such as 4×4 or 2×2), it is necessary to pad the input feature map accordingly (for example, with 0), so that the actually calculated output XY is aligned to the split unit (e.g. 4×4 or 2×2) and the input XY is also aligned to the split unit (e.g. 4×4 or 2×2).
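  • a minimal sketch of this alignment bookkeeping (an illustration that assumes stride 1 and no dilation; the printed example matches the 9×9 input / 5×5 kernel case discussed later):

```python
import math

def align_up(v, unit):
    return math.ceil(v / unit) * unit

def aligned_sizes(hi, wi, kh, kw, unit=4):
    """Pad output and input XY so both are multiples of the split unit (stride 1 assumed)."""
    ho, wo = hi - kh + 1, wi - kw + 1                        # raw output size
    ho_a, wo_a = align_up(ho, unit), align_up(wo, unit)      # output aligned to split unit
    hi_a = align_up(ho_a + kh - 1, unit)                     # input needed for aligned output
    wi_a = align_up(wo_a + kw - 1, unit)
    return (ho_a, wo_a), (hi_a, wi_a)

# 9x9 input with a 5x5 kernel: raw output 5x5 -> aligned 8x8,
# input padded to 12x12, i.e. 3x3 blocks of 4x4 in the XY plane.
print(aligned_sizes(9, 9, 5, 5))   # ((8, 8), (12, 12))
```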
  • the output feature map can also be divided according to other rules in the XY direction, for example, divided into 16 output feature blocks with the same shape according to 1 ⁇ 16, and assigned to SL0-SL15 respectively.
  • Embodiments of the present disclosure are not limited in this regard.
  • this splitting method can also be applied to splitting in other scenarios, for example splitting between the operation circuits CU within a single slave processing circuit SL; the embodiments of the present disclosure are not limited in this respect.
  • in some embodiments, multiple slave processing circuits can be scheduled to perform convolution operations on corresponding data rows of the input feature map and of the convolution kernel, and then, according to the convolution splitting scheme, the plurality of operation results returned by the slave processing circuits are spliced to obtain the output feature map of the convolution operation of the input feature map and the convolution kernel.
  • a plurality of operation circuits CU and each buffer circuit (see FIG. 5 ) in the slave processing circuit can be used to perform a specific convolution operation process.
  • multiple computing cycles are generally required to complete the required computing in each round of computing.
  • each output feature block corresponds to the single-calculation capability of all N CU schedulable operation circuits in a single SL (N CU *Nop output points).
  • the output feature map can be divided into output feature blocks according to the alignment of 16 output points in the XoYo dimension, and each output feature block can be calculated one by one. It can be understood that the 16 output points may be in a 4*4 format, or may be in a 1*16 format, which is not limited in the embodiment of the present disclosure.
  • the output points of the output feature block can be further divided among the N CU operation circuits, so as to determine the processing object of each operation circuit. Then, according to the division of output points and using the split unit as a sliding window, N CU input feature data rows are selected from the first buffer circuit and distributed to the N CU operation circuits, and the corresponding weight data is selected from the second buffer circuit and broadcast to the N CU operation circuits, so that the output points corresponding to multiple sliding windows can be calculated in parallel by multiplexing the weight data. Nk sliding selections are performed, wherein Nk is determined according to the smaller of the size of the convolution kernel in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution splitting mode.
  • when performing a conventional three-dimensional convolution operation, the corresponding weight data can be selected as follows: 1/Nop of a weight row is selected from the second buffer circuit in a sliding manner corresponding to that in the first buffer circuit, copied Nop-1 times and expanded into an extended weight row, and broadcast to the N CU operation circuits in the slave processing circuit.
  • thus, during each sliding calculation, each operation circuit can perform a bitwise multiply-accumulate, in units of 1/Nop of a data row, between one input feature row from the first buffer circuit and one extended weight row from the second buffer circuit, obtaining Nop partial sums; the Nk*Nop partial sums obtained over the Nk sliding selections are then accumulated according to the convolution output points to which they belong, to obtain and output Nop operation results.
  • when the slave processing circuit outputs the output points of its internal operation circuits, it can output the output points calculated by the multiple operation circuits in a specific order according to the division manner of the output points, so that the successively output points are contiguous in the X and/or Y dimension, which is convenient for subsequent processing.
  • the main processing circuit may further store the operation results returned from each slave processing circuit in a fourth dimension storage order. According to the situation, the main processing circuit can also convert the operation result into a desired dimension storage sequence for storage.
  • in the Forward4 scheme, the shape of the split unit block is 4B×4×4. Depending on the data type, the element shape of the block differs slightly. Table 2 shows the block shapes of Forward4 under different data types.
  • Fig. 9 shows a schematic diagram of splitting and storage of the Forward4 scheme according to an embodiment of the present disclosure.
  • the example in the figure assumes the data type is Int8.
  • the figure shows the original data to be operated (which may be neurons or weights), and its storage order is HWC.
  • the Forward4 scheme can support multiple grouping modes.
  • in the Group1 grouping mode, the input neuron data arrangement varies according to the HoWo splitting method:
  • 16 means 16 slave processing circuits SL
  • the trailing 4*4*4 (CHW) denotes the CHW block split from the three dimensions; hi and wi are each divided by 4 twice: the first division by 4 means splitting hi*wi into 16 parts and distributing them to the 16 SLs, and the second division by 4 means folding hi and wi into the ci direction.
  • the meaning of 1*16 splitting is the same.
  • in the Group4 grouping mode, the input neuron data arrangement likewise varies according to the HoWo splitting method:
  • the first 4 in the front means 4 SLBs
  • the neuron has been copied 4 copies
  • the second 4 means that the neuron is split on 4 SLs of an SLB
  • the last 4*4*4 means the CHW block split from the three dimensions.
  • in the Group16 grouping mode, the input neuron does not need to be split, and its data arrangement is as follows:
  • the leading 16 means that the neurons are replicated on the 16 SLs, and the last 4*4*4 means the CHW block split from the three dimensions; hi and wi are each divided by 4, which means folding hi and wi into the ci direction.
  • Fig. 10 shows a schematic diagram of assigning interval output points to each operation circuit in the Forward4 scheme according to some embodiments of the present disclosure.
  • the output feature block can be equally divided among the N CU operation circuits into Nop output feature sub-blocks of the same shape, each output feature sub-block containing N CU output points, which are respectively allocated to the N CU operation circuits.
  • in each output feature sub-block, the 2*2 output points are allocated to the 4 operation circuits.
  • each arithmetic circuit calculates one output point in each of the four output feature sub-blocks.
  • different backgrounds are used to show the output points assigned to four different arithmetic circuits CU0-CU3. It can be seen from the figure that each calculation circuit calculates a plurality of output points spaced in X and/or Y dimensions on the output feature map during each calculation.
  • according to the output point positions of each output feature sub-block, and according to the data required for calculating that output feature sub-block, N CU data rows can be correspondingly selected from the first buffer circuit for operation. For example, when selecting input feature data for the first time, according to the 4 input feature blocks required to calculate the 4 output points in the first output feature sub-block 1011, 4 input data rows are selected from the corresponding input feature blocks and distributed to the 4 operation circuits. It can be understood that, since the four output points are contiguous in the X and/or Y direction, the interval or step size between the four input data rows selected at the same time is 1 in the X and/or Y direction.
  • the corresponding weight data can be selected from the second buffer circuit and broadcast to NCU computing circuits, so as to achieve parallel calculation of output points corresponding to multiple computing circuits by multiplexing the weight data .
  • weight multiplexing can be performed in a single input data row , thus computing Nop output points or partial sums simultaneously.
  • the extended weight value row can also be broadcast to N CU computing circuits, so that while multiplexing the weights among multiple computing circuits, a smaller granularity (for example, 1 /Nop line) to reuse weights.
  • N CU *Nop output points or partial sums can be calculated each time by correspondingly taking N CU input feature data rows and taking 1/Nop weight value rows to copy and expand into 1 weight value row.
  • when the result of a single calculation is only a partial sum, the partial sums can be calculated over multiple slidings, and the partial sums obtained each time are accumulated according to the output points to which they belong to obtain the final result.
  • accordingly, the number of slidings and the sliding step size of the convolution operation can be determined.
  • the maximum convolution kernel size supported by a single operation of the processing circuit is at least determined by the space sizes of the first buffer circuit and the second buffer circuit. It can be understood that when the convolution kernel exceeds the maximum convolution kernel size, it needs to be split in the Kx and Ky directions according to the maximum convolution kernel size.
  • Fig. 11 shows a schematic diagram of a single operation process in the Forward4 scheme according to an embodiment of the present disclosure.
  • the size of the first buffer circuit 1110 is 3 ⁇ 3 ⁇ 64B, that is, a maximum of 9 rows of data can be buffered
  • the size of the second buffer circuit 1120 is 2 ⁇ 2 ⁇ 64B, that is, a maximum of 4 rows of data can be buffered .
  • the storage in the buffer circuit in the figure is also shown in the split unit.
  • the figure shows the operation process of the first sliding fetch.
  • using the split unit as a sliding window, N CU input feature rows are slidingly selected from the first buffer circuit and sent to the N CU operation circuits for calculation; 1/Nop of a weight row is selected from the second buffer circuit according to the sliding manner corresponding to that in the first buffer circuit, where Nop is the maximum number of convolution output points that each operation circuit can calculate at one time; the selected weights are copied Nop-1 times and expanded into an extended weight row, which is broadcast to the N CU operation circuits in the slave processing circuit.
  • each operation circuit calculates 2 ⁇ 2 output points with an interval of 1 in the X and Y dimensions for each calculation.
  • one input feature data row is selected from the first buffer circuit 1110 at the starting position and at the positions shifted by 1 in the X and/or Y directions, for a total of four input feature data rows, which are correspondingly sent to the four arithmetic circuits 1140 in the slave processing circuit SL.
  • 1/4 of a weight data row, that is, data of 2×2 size, is selected at the starting position from the second buffer circuit 1120, copied 3 times and expanded into an extended weight data row 1130, which is broadcast to the 4 arithmetic circuits 1140 in the SL.
  • each operation circuit performs an element-wise multiply-accumulate, in units of 1/Nop of a data row, on one input feature row from the first buffer circuit and one extended weight value row from the second buffer circuit, to obtain Nop partial sums.
  • the four computing circuits 1140 perform an element-wise multiply-accumulate operation on the distributed input feature data rows and the broadcast extended weight data row to obtain the computing result 1150.
  • the results with different background colors in 1150 represent the results obtained by the different computing circuits 1140. It can be seen that, in each operation, one CU calculates the partial sums of 4 output points, and the 4 CUs obtain a total of 4×4 partial sums. It can also be seen that the output points calculated by each CU are not adjacent in the XoYo dimensions of the output feature map.
  • next, data are slidingly fetched synchronously from the first buffer circuit and the second buffer circuit, and the next calculation is performed. Nk sliding selections are performed in total, where
  • Nk = ceil(Kx/2)*ceil(Ky/2)
  • and Kx and Ky are, respectively, the smaller of the convolution kernel size in the X or Y dimension and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution split mode.
  • the operation circuit accumulates the Nk*Nop partial sums calculated in the Nk sliding calculations according to the corresponding convolution output points to obtain Nop operation results.
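  • A brief sketch of this control flow (the function and variable names are illustrative and not part of the disclosed instruction set):

```python
import math

def num_slides(Kx, Ky, max_kernel=8):
    # Nk = ceil(Kx/2) * ceil(Ky/2), with Kx/Ky clipped to the maximum kernel size
    # supported by a single operation of the slave processing circuit.
    kx, ky = min(Kx, max_kernel), min(Ky, max_kernel)
    return math.ceil(kx / 2) * math.ceil(ky / 2)

def accumulate_partial_sums(partials_per_slide):
    # partials_per_slide: Nk lists, each holding Nop partial sums ordered by output point;
    # partial sums belonging to the same output point are summed across the Nk slides.
    results = [0] * len(partials_per_slide[0])
    for partials in partials_per_slide:
        for i, p in enumerate(partials):
            results[i] += p
    return results
```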
  • the maximum convolution kernel size supported by a single operation of the processing circuit is 8×8.
  • Fig. 12 shows a schematic diagram of a sliding convolution process in the Forward4 scheme according to an embodiment of the present disclosure.
  • This example takes a 9×9 input feature map and a 5×5 convolution kernel as an example. With a convolution stride of 1, the output feature map size is 5×5.
  • the input feature map needs to be aligned to 12×12, divided into 9 blocks of 4×4×4 (C×H×W) size, and stored in the first buffer circuit, shown as 1210 in the figure, where the C dimension is omitted.
  • the 5×5 convolution kernel needs to be aligned to 8×8, with the aligned portion padded with 0, and is stored in the second buffer circuit, shown as 1220 in the figure, where the C dimension is also omitted.
  • the copy operation can be realized by hardware.
  • the selection ranges of the input feature map in the first buffer circuit and of the convolution kernel in the second buffer circuit for each slide are shown in Figure 12, with a total of 9 sub-figures representing 9 slides in total.
  • block 1210 represents the input feature map in the first buffer circuit, and the four dotted-line boxes represent the areas selected to be sent to the four CUs;
  • block 1220 represents the convolution kernel in the second buffer circuit, and the dotted-line box represents the selected 1/4 row, which is copied 3 times and expanded into one row and then broadcast to the 4 CUs.
  • in each calculation, each CU performs an element-wise multiply-accumulate, in units of 1/4 of a data row, on one input feature data row from the first buffer circuit and one extended weight data row from the second buffer circuit, to obtain 4 partial sums; and the Nk partial sums obtained in the Nk calculations of the current operation round that correspond to the same convolution output point are accumulated, to obtain and output 4 operation results.
  • each output point is a standard convolution of 4×2×2 (Ci×Y×X).
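  • For Int8 data, 1/4 of a data row is 16 bytes, i.e. one 4×2×2 (Ci×Y×X) patch, so a single partial sum is simply an element-wise multiply-accumulate over such a patch; a minimal sketch (illustrative only):

```python
def partial_sum(in_quarter_row, w_quarter_row):
    # Element-wise multiply-accumulate over 1/4 of a data row
    # (4x2x2 = 16 Int8 values), producing one partial sum per slide.
    assert len(in_quarter_row) == len(w_quarter_row) == 16
    return sum(a * b for a, b in zip(in_quarter_row, w_quarter_row))
```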
  • after the Nk = 9 slides, the accumulation in the Y×X direction is completed, and a complete 4×4 (Y×X) output is obtained in one SL (as shown in Fig. 10b).
  • in this mode, a single calculation only supports the case where the convolution kernel is not larger than 8×8.
  • each slave processing circuit SL can convert the operation result of its internal operation circuit CU into a specified format, for example, the format of Nco×Uy×Ux.
  • each slave processing circuit may output, each time, partial operation results of some of its internal operation circuits, and these partial operation results are contiguous in the X and/or Y dimension of the output feature map.
  • the main processing circuit may further store the operation results returned from each slave processing circuit in a fourth dimension storage order. According to the situation, the main processing circuit can also convert the operation result into a desired dimension storage sequence for storage.
  • when the grouping mode and/or the way the input feature map is split within a single SLB (that is, how the output feature map is split along Ho and Wo) differs, the output data format is slightly different.
  • Fig. 13 shows a schematic diagram of an output data format of the Forward4 scheme according to an embodiment of the present disclosure.
  • in this embodiment, the grouping mode is Group1, and the input feature map within a single SLB is split according to Ho×Wo = 1×16.
  • each SL outputs a 1×1×4 (Co×Y×X) area each time, that is, it outputs part of the operation results of its internal operation circuits each time, for example 2 operation results from each of 2 CUs (see FIG. 10); this part of the operation results is contiguous in the X and/or Y dimension of the output feature map, such as the same row (as shown in FIG. 13) or the same column.
  • a 1×4×4 (Co×Y×X) area is returned over 4 consecutive outputs, that is, the 4 operation results of each of the 4 CUs.
  • different SLs output different regions of the output feature map of the same Co. After the 4×4 areas of all Co values have been output, continuing to output switches to different output points.
  • 1320 in the figure shows the write-out data structure of the 16 SLs.
  • after being written into the storage circuit (for example, the first storage circuit), the final output data takes the format Yo*Xo*Co*4*16*4, where Yo and Xo are the numbers of output feature map blocks allocated to each SL, and 16 is the division across the 16 SLs.
  • if needed, a further data re-arrangement (placement) operation can be performed to convert the data into other desired formats.
  • when the grouping mode and/or the way the input feature map is split among the SLs within a single SLB differs, the output data format is also slightly different. Assuming the original output size is 1*ho*wo*co, the Group1 output data shape with a Ho*Wo 4*4 split is ho/(4*4)*wo/(4*4)*co/group*(4*16*4), and with a Ho*Wo 1*16 split it is ho/(4)*wo/(4*16)*co/group*(4*16*4), where group = 1.
  • (4*16*4) is the basic output block of forward4, and its directions correspond to h*c*w respectively, where 16 represents the division of ho and wo of the same co over the 16 SLs; ho and wo are each divided by 4 twice, where the first 4 indicates the 4×4 splitting used when storing data in an SL, and the second 4 indicates the data block folding in the h and w directions.
  • the latter shape is also the shape of the schematic diagram in FIG. 19.
  • the Group4 output data shape is ho/(2*4)*wo/(2*4)*co/group*(4*16*4), with group = 4.
  • (4*16*4) has the same meaning as above, except that 16 represents the wo output division of 4 Co values over 4 SLs each.
  • the Group16 output data shape is ho/4*wo/4*co/group*(4*16*4), with group = 16.
  • (4*16*4) has the same meaning as above, except that 16 represents the output division of 16 Co values over the 16 SLs.
  • when outputting, the hardware can automatically output neurons according to the 4*16*4 (Y*SL*X) dimension order within a row and the Y*X*C dimension order between rows. The same applies to larger convolution kernels.
  • the bias (Bias) is the bias applied after the convolution calculation is finished.
  • the original format of the bias is [1 1 co].
  • in the Group1 grouping mode, the bias placement is [1 1 co*64], where 64 means that a single bias value is copied 64 times and placed contiguously.
  • in the Group4 grouping mode, the bias placement is [1 1 co*16], where 16 means that a single bias value is copied 16 times and placed contiguously.
  • in the Group16 grouping mode, the bias placement is [1 1 co*4], where 4 means that a single bias value is copied 4 times and placed contiguously.
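  • A minimal sketch of this bias placement for the three grouping modes, assuming a 1-D bias of length co and using NumPy purely for illustration:

```python
import numpy as np

def place_bias(bias, group_mode):
    # Each single bias value is copied `copies` times and placed contiguously,
    # matching [1 1 co*64], [1 1 co*16] and [1 1 co*4] for Group1/Group4/Group16.
    copies = {"Group1": 64, "Group4": 16, "Group16": 4}[group_mode]
    return np.repeat(np.asarray(bias), copies)
```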
  • FIG. 14 shows an overall data moving process according to an embodiment of the present disclosure.
  • weights are read from off-chip storage, such as DDR, into SRAM via a global direct memory access module (GDMA).
  • the transfer process of neurons is similar to that of weights, except that after transferring to NRAM through block instructions, it also needs to be transferred to WRAM.
  • when neurons are used in computation, most of the data overlaps as the convolution kernel slides, which greatly reduces the efficiency of data transfer.
  • the img2col instruction is used to distribute data, and details are described later.
  • the output data can be stored back to NRAM, and the data dimension change can also be completed through block instructions and transferred to SRAM. Then, it can be stored back to the off-chip storage DDR via GDMA.
  • data dimension change and re-arrangement (placement) refers to the process of arranging tensor data of a given shape into the required specific shape.
  • Data movement refers to the read and write operations of data in different memory spaces.
  • the Forward4 convolution operation scheme requires that the neurons and weights used for convolution operations be placed and aligned according to a specific block pattern.
  • the output data is also output in accordance with the specific output format of Forward4, which requires that the tensor data be arranged in block form before calculation, and it is also required to return to the normal tensor shape as required after the calculation is completed.
  • the Deform instruction series provides data shape transformation and data type conversion capabilities for the IO data path, mainly including functions such as TRANS (transposition), MOVE (transportation), and ROTATE (rotation).
  • the mode that implements the transpose function in this instruction series is named Trans Tiling, which mainly provides performance support for various shape transformations of small convolutions.
  • Deform divides a 3-dimensional data block into inner and outer layers.
  • the inner layer has three dimensions (corresponding to the parameters n0-n2 in the instruction).
  • the unit of the lowest dimension is bytes, while the second-lowest dimension and the highest dimension are unitless and represent counts of the next-lower layer.
  • the outer layer also has three dimensions (corresponding to the parameters n3-n5 in the instruction), all of which represent multiples of the corresponding inner layer dimensions.
  • the input data stored in the first dimension storage order (such as HWC) needs to be split, dimension-converted and stored in units of split units, where each split unit is stored according to the second dimension storage order (such as CHW), and the split units are stored among themselves according to the third dimension storage order (such as HWC).
  • Figure 15 shows a schematic conceptual diagram of Trans Tiling according to an embodiment of the present disclosure.
  • the left panel in the figure shows the input data before deformation.
  • the three-dimensional input data are described by six dimensions: n0 and n3 correspond to the first dimension (for example, the lowest dimension) of the original three-dimensional data, n1 and n4 correspond to the second dimension (for example, the second-lowest dimension), and n2 and n5 correspond to the third dimension (for example, the highest dimension) of the data block.
  • the right panel in the figure shows the output data after deformation. Three-dimensional output data is also described using six dimensions.
  • the inner layer of the output data corresponds to the deformed splitting unit.
  • Trans Tiling also has the function of inline shuffle, including inline shuffle before Tiling based on Pretable and inline shuffle after Tiling based on Posttable.
  • the pre-allocation table (Pretable) provides the function of rearranging the n0 data input to Tiling,
  • and the post-allocation table (Posttable) provides the function of rearranging the n0 data output by Tiling. Without considering the flag bits of the table, both the pre-allocation table and the post-allocation table are essentially an array describing the positions of 64 bytes of data.
  • Fig. 16 shows a schematic diagram of front and back tables.
  • the pre-allocation and post-allocation tables respectively indicate the rearranged positions of one row of data in the n0 dimension of the input or output, which comprises 64B.
  • the 8 bits of each byte comprise a 6-bit Index field, which records which of bytes 0 to 63 of the original data is stored at this data position; a 1-bit zero_en field, which indicates whether to set the byte to 0 (if this bit is 1, a 0 is forcibly written and bits [5:0] are invalid); and a 1-bit mask field, which indicates whether the data at this position is valid.
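  • A sketch of building one table entry; the text only states that the 8 bits comprise a 6-bit Index field, a 1-bit zero_en field and a 1-bit mask field, so the exact bit positions used below (index in bits [5:0], zero_en in bit 6, mask in bit 7) are an assumption for illustration:

```python
def make_table_entry(index, zero_en=False, mask=True):
    # index: which of the 64 source bytes (0-63) is placed at this destination position;
    # zero_en: force this byte to 0 (index ignored); mask: whether this byte is valid.
    assert 0 <= index < 64
    return (index & 0x3F) | (int(zero_en) << 6) | (int(mask) << 7)
```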
  • the data of n0 of the input data of the block instruction can be rearranged when needed, and/or the data of n0 of the output data of the block instruction can be rearranged.
  • Table 4 shows the meaning of each parameter of the block instruction. Assume that the bit width of the data to be divided into blocks is dwidth, in units of B (bytes), and that the data size of one atomic operation of the block instruction is called the block bit width T, also in units of B (bytes).
  • among the parameters of the block instruction, a total of 11 parameters, n0~n5 and s1~s5, are required to describe the tensor shapes of the inner-layer data and the outer-layer data, among which n0~n2 and s1~s2 describe the inner layer, and n3~n5 and s3~s5 describe the outer layer.
  • the input tensor and the output tensor each need a set of parameters, so a total of 22 parameters, in0~in5, is1~is5, on0~on5 and os1~os5, are used.
  • the block instruction can support various block bit widths T, such as 1B, 2B, 4B, 6B, 8B, 16B, 32B, etc., and the corresponding value can be set based on different block tasks. Therefore, the block instruction also includes the parameter of block bit width T.
  • a data processing device including a control circuit, a first storage circuit, and a second storage circuit.
  • the first storage circuit is used for storing data before executing the block instruction; the second storage circuit is used for storing data after executing the block instruction.
  • the control circuit is used to configure and execute the block instruction.
  • the data processing device may be, for example, a processor cluster in the multi-core computing device shown in FIG. 3b, and the control circuit is, for example, a processor core within the processor cluster.
  • the first storage circuit is, for example, the shared storage SRAM within the processor cluster, while the second storage circuit is, for example, the NRAM within a processor core.
  • the function of the block instruction is, during the transfer of input neurons from, for example, SRAM to NRAM, to split the input neurons stored in the first dimension storage order (such as HWC), convert their dimensions, and store them in units of split units.
  • each split unit is stored in the second dimension storage order (such as CHW), and the split units are stored among themselves in the third dimension storage order (such as HWC).
  • the alignment value M required by the block instruction is a multiple of U Ci .
  • the data needs to be arranged from [1*hi*wi*ci] to [1*hi/4*wi/4*ci/4*(4*4*4)].
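  • The effect of this arrangement can be sketched functionally with NumPy (a model only, assuming Int8 data and hi, wi, ci already aligned to multiples of 4; on the device this is performed by the block instruction during the SRAM-to-NRAM transfer):

```python
import numpy as np

def forward4_block_neurons(x):
    # x: [hi, wi, ci] neuron data stored in HWC order (Int8), hi, wi, ci multiples of 4.
    hi, wi, ci = x.shape
    x = x.reshape(hi // 4, 4, wi // 4, 4, ci // 4, 4)   # split into 4x4x4 units
    x = x.transpose(0, 2, 4, 5, 1, 3)                   # unit-inner order becomes C, H, W
    # result: [hi/4, wi/4, ci/4 * (4*4*4)]; each 64B row holds one split unit in CHW order
    return np.ascontiguousarray(x).reshape(hi // 4, wi // 4, (ci // 4) * 64)
```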
  • Fig. 17 shows a schematic diagram of executing a block instruction on neuron data according to an embodiment of the present disclosure.
  • the left part of the figure shows the neuron data before block processing (that is, the input tensor of the block instruction).
  • the three-dimensional neuron data [hi*wi*ci] (the N dimension is omitted here) are divided into inner and outer layers, each described by three dimensions.
  • the in1 dimension can be set to U W according to the shape of the split unit, which is 4 in this example;
  • the in2 dimension can likewise be set to U H according to the shape of the split unit, which is also 4 in this example.
  • the sizes in3, in4 and in5 of the three outer dimensions can then be determined accordingly; their sizes are respectively equal to the number of inner data blocks contained by the input tensor in the corresponding dimensions.
  • the right part of the figure shows the neuron data after block processing (that is, the output tensor of the block instruction). It can be seen that the shape of the neuron data at this point becomes [hi/4*wi/4*(ci*16)], which is likewise divided into inner and outer layers, each described by three dimensions. Since the neuron data needs to be split according to the split unit, and taking the constraints of the block instruction into account, the block bit width T can be set to U Ci, that is, the data volume of one atomic operation is U Ci, so that the storage order can conveniently be adjusted in units of split units.
  • the inner-layer data block 1702 corresponds to the inner-layer data block 1701 of the input tensor, but its shape changes from M×U H×U W to (M*U H*U W)×1×1; as shown in the figure, it is one large strip composed of 16 thin strips.
  • the sizes on3, on4 and on5 of the three outer dimensions can also be determined accordingly; their sizes are respectively equal to the number of inner data blocks contained by the output tensor in the corresponding dimensions.
  • the control circuit in the data processing device can be further configured to configure the block instruction as follows: set the post-allocation table of the block instruction so that the inner lowest-dimension data of the output tensor of the block instruction are rearranged according to the indications in the post-allocation table.
  • the control circuit can be further used to set the post-allocation table as follows: convert the inner lowest-dimension (on0) data of the output tensor from an arrangement in the first dimension storage order (for example, HWC) into an arrangement in the second dimension storage order (for example, CHW).
  • the writing sequence of the 64B data is related to the data bit width dwidth of the data.
  • the post-allocation table can be configured according to the logic shown in the pseudo code in Table 5 below.
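  • A sketch of such logic (not a reproduction of the Table 5 pseudo code, which is not given here): it produces the 64 source-byte indices that rearrange a 64B output row from HWC element order into CHW element order, taking the data bit width dwidth into account; the zero_en and mask flag bits of each entry are omitted for brevity:

```python
def posttable_indices(Uh=4, Uw=4, Uc=4, dwidth=1):
    # Destination byte d receives source byte table[d]; destination element order is
    # (c, h, w), source element order is (h, w, c), each element occupying dwidth bytes.
    table = []
    for c in range(Uc):
        for h in range(Uh):
            for w in range(Uw):
                src_elem = (h * Uw + w) * Uc + c
                for b in range(dwidth):
                    table.append(src_elem * dwidth + b)
    return table  # 64 entries when Uh * Uw * Uc * dwidth == 64 (Int8 Forward4 case)
```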
  • the control circuit in the data processing device can divide the input data (for example, neurons) into an integer segment and a remainder segment according to the input channel Ci dimension, wherein the Ci dimension of the integer segment is aligned to the alignment value M and the Ci dimension of the remainder segment is smaller than M. Then, a first block instruction may be configured and executed for the integer segment, and a second block instruction may be configured and executed for the remainder segment.
  • depending on the size of ci, there may be only the integer segment, only the remainder segment, or both the integer segment and the remainder segment.
  • the length of the 64B-aligned integer segment of ci is ci_full,
  • and the length of the remainder segment that is not aligned to 64B is ci_rem.
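  • A trivial sketch of this split (assuming 1-byte data so that the alignment value M corresponds to 64 Ci elements; names are illustrative):

```python
def split_ci(ci, M=64):
    ci_full = (ci // M) * M   # integer segment: Ci dimension aligned down to M
    ci_rem = ci - ci_full     # remainder segment: Ci dimension smaller than M
    return ci_full, ci_rem
```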
  • Table 6 shows the shape change of neuron data before and after executing the chunking instruction.
  • the first block instruction can be configured with reference to the content described above in conjunction with FIG. 17 .
  • the parameters of the input tensor of the first block instruction can be configured as follows: set the inner lowest-dimension size in0 of the input tensor in the first block instruction to M, set the inner second-lowest-dimension size in1 to U W, and set the inner highest-dimension size in2 to U H; and set the size values in3, in4 and in5 of the three outer dimensions of the input tensor in the first block instruction according to the size of each dimension of the integer segment of the input data, where the size values of the three outer dimensions respectively represent the number of inner data blocks contained by the input tensor in the corresponding dimension.
  • the parameters of the output tensor of the first block instruction can be configured as follows: set the inner lowest-dimension size on0 of the output tensor in the first block instruction to in1*in2*T, set the inner second-lowest-dimension size on1 to M/T, and set the inner highest-dimension size on2 to 1; and set the size values on3, on4 and on5 of the three outer dimensions of the output tensor in the first block instruction according to the size of each dimension of the integer segment of the input data, where the size values of the three outer dimensions respectively represent the number of inner data blocks contained by the output tensor in the corresponding dimension.
  • dimension step sizes (strides) also need to be set.
  • the step sizes between adjacent data in the other five dimensions, apart from the inner lowest dimension, can be set based on the six dimensions of the input tensor and the output tensor and the dimensions of the input data before processing.
  • the bit width T of the block can be set as U Ci according to the constraints of the block instruction and the shape transformation of the split unit before and after processing.
  • the first block instruction for the integer segment can be configured according to the following Table 7; a configuration sketch is given after the parameter legend below.
  • in Table 7, ci, hi and wi respectively represent the number of data elements in the Ci, H and W dimensions of the input data; dwidth represents the data bit width; ci_full represents the number of data elements in the Ci dimension of the integer segment; B represents bytes; T represents the block bit width; is1~is5 represent the step sizes of the five dimensions of the input tensor; and os1~os5 represent the step sizes of the five dimensions of the output tensor.
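  • The parameter rules above can be summarized in a small configuration sketch; the helper itself and the mapping of the outer dimensions in3/in4/in5 to the Ci, W and H directions are assumptions for illustration, and the authoritative values are those of Table 7:

```python
def config_first_block_instruction(ci_full, hi, wi, dwidth, M=64, U_W=4, U_H=4):
    T = 4                                   # block bit width T set to U_Ci (4B)
    in0, in1, in2 = M, U_W, U_H             # inner input dims: M bytes x U_W x U_H
    in3 = ci_full * dwidth // M             # assumed: inner block count along Ci
    in4, in5 = wi // U_W, hi // U_H         # assumed: inner block counts along W and H
    on0, on1, on2 = in1 * in2 * T, M // T, 1
    on3, on4, on5 = in3, in4, in5           # outer block counts unchanged by the re-tiling
    return dict(T=T,
                in_dims=(in0, in1, in2, in3, in4, in5),
                on_dims=(on0, on1, on2, on3, on4, on5))
```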
  • the second block instruction can be configured with slight adjustments relative to the integer segment part.
  • the second block instruction can be configured as follows: according to the Ci dimension size of the integer segment, set the input tensor offset and the output tensor offset of the second block instruction executed for the remainder segment, where the input tensor offset indicates the offset of the remainder segment before processing relative to the initial storage address of the input data, and the output tensor offset indicates the offset of the processed remainder segment relative to the initial storage address of the output data.
  • in other words, the input tensor address and the output tensor address of the second block instruction can be adjusted after taking the memory storage space of the integer segment into account.
  • the parameters of the input tensor of the second block instruction can be configured as follows: set the inner lowest-dimension size in0 of the input tensor in the second block instruction to R, where R is the Ci dimension size of the remainder segment, set the inner second-lowest-dimension size in1 to U W, and set the inner highest-dimension size in2 to U H; and set the size values in3, in4 and in5 of the three outer dimensions of the input tensor in the second block instruction according to the size of each dimension of the remainder segment of the input data, where the size values of the three outer dimensions respectively represent the number of inner data blocks contained by the input tensor in the corresponding dimension.
  • the parameters of the output tensor of the second block instruction can be configured as follows: set the inner lowest-dimension size on0 of the output tensor in the second block instruction to in1*in2*T, set the inner second-lowest-dimension size on1 to R/T, and set the inner highest-dimension size on2 to 1; and set the size values on3, on4 and on5 of the three outer dimensions of the output tensor in the second block instruction according to the size of each dimension of the remainder segment of the input data, where the size values of the three outer dimensions respectively represent the number of inner data blocks contained by the output tensor in the corresponding dimension.
  • the control circuit can set the step sizes between adjacent data in the other five dimensions, apart from the inner lowest dimension, based on the six dimensions of the input tensor and the output tensor and the dimensions of the input data before processing.
  • the second block instruction for the remainder segment can be configured according to the following Table 8.
  • in Table 8, ci, hi and wi respectively represent the number of data elements in the Ci, H and W dimensions of the input data; dwidth represents the data bit width; ci_rem represents the number of data elements in the Ci dimension of the remainder segment; B represents bytes; T represents the block bit width; is1~is5 represent the step sizes of the five dimensions of the input tensor; and os1~os5 represent the step sizes of the five dimensions of the output tensor.
  • the embodiment of the present disclosure provides a block processing solution for neuron data.
  • the neuron data of any shape can be arranged from [1*hi*wi*ci] to [1*hi/4*wi/4*ci/4*(4*4*4)] through the two-stage block processing.
  • the block processing of weight data is similar to that of neuron data. Specifically, for the weight data in the Forward4 scheme, the data needs to be placed from [co*kh*kw*ci] to [co*kh/4*kw/4*ci/4*(4*4*4)].
  • weight data has an additional co dimension. Since the co dimension and the kh dimension are continuous, the co dimension can be incorporated into the kh dimension.
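  • The corresponding functional sketch for weights mirrors the neuron sketch given earlier (again assuming Int8 data and kh, kw, ci already padded to multiples of 4; the extra co dimension is simply carried along):

```python
import numpy as np

def forward4_block_weights(w):
    # w: [co, kh, kw, ci] weight data stored with Ci lowest (Int8), kh, kw, ci multiples of 4.
    co, kh, kw, ci = w.shape
    w = w.reshape(co, kh // 4, 4, kw // 4, 4, ci // 4, 4)   # split into 4x4x4 units
    w = w.transpose(0, 1, 3, 5, 6, 2, 4)                    # unit-inner order becomes C, H, W
    # result: [co, kh/4, kw/4, ci/4 * (4*4*4)]
    return np.ascontiguousarray(w).reshape(co, kh // 4, kw // 4, (ci // 4) * 64)
```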
  • Table 9 shows the shape change of the weight data before and after executing the block instruction.
  • the weight data can still be divided into blocks using the scheme described above for the neuron data.
  • two-stage processing can be adopted for weight data of any scale, that is, integer segment block processing and remainder segment block processing.
  • the first block instruction for the integer segment of the weight data can be configured according to the following Table 10.
  • the second block instruction for the remainder segment of the weight data can be configured according to the following Table 11.
  • the embodiment of the present disclosure also provides a data processing method for executing a block instruction by using the aforementioned data processing device.
  • An embodiment of the present disclosure also provides a chip, which may include the data processing device in any embodiment described above with reference to the accompanying drawings. Further, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or devices disclosed herein may include servers, cloud servers, server computing clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, webcams, still cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • Said vehicles include airplanes, ships and/or vehicles;
  • said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods;
  • said medical equipment includes nuclear magnetic resonance instruments, ultrasound instruments and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that according to the hardware information of the terminal device and/or the edge device, the hardware resources of the cloud device can be Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (Enhanced Dynamic Random Access Memory, EDRAM), High Bandwidth Memory (High Bandwidth Memory , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The present disclosure discloses a data processing device, a data processing method for executing block instructions by using the data processing device, and related products. The data processing device may be included, as a computing device, in a combined processing device, and the combined processing device may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device, which is connected to the computing device and the other processing devices respectively and is used for storing data of the computing device and the other processing devices. The solution of the present disclosure realizes split storage of data in small convolution operations and improves operation processing efficiency.

Description

数据处理装置、数据处理方法及相关产品
相关申请的交叉引用
本公开要求于2021年9月26日申请的、申请号为202111129610.8、发明名称为“数据处理装置、数据处理方法及相关产品”的中国专利申请的优先权。
技术领域
本披露一般地涉及数据处理领域。更具体地,本披露涉及一种数据处理装置、利用数据处理装置对数据执行分块指令的数据处理方法、芯片和板卡。
背景技术
目前,深度学习(Deep Learning)已经成为机器学习中的重要分支,也大力助推着人工智能(AI)的发展。深度学习的核心技术——深度神经网络(DNN)已在诸多行业有着广泛的应用。
神经网络是人工智能、深度学习中最为关键的技术之一,其中卷积神经网络(Convolution Neural Network,CNN)是最为重要的一种网络类型。卷积神经网络中最为关键的计算即为卷积层(Conv layer)的卷积运算(Convolution Operation)。卷积层的功能是对输入数据进行特征提取,通过多层卷积,能够抽取复杂特征,以保证网络具有足够的表达能力和泛化能力。神经网络模型中包含了大量的、各种类型的卷积运算,卷积运算的计算性能极大地影响整个神经网络模型的计算性能。当神经网络模型应用于不同领域时,例如语音识别、机器翻译、图像处理等等,其对应的输入特征图和权值的各个维度大小可能各有不同。为了充分利用深度学习处理器的硬件优势,需要针对不同规模的、不同类型的卷积运算进行优化,以提高执行神经网络模型的计算性能。
发明内容
为了至少解决如上所提到的一个或多个技术问题,本披露在多个方面中提出了一种数据处理装置,其通过对数据执行分块指令,可以使得各种维度尺寸的数据能够适配卷积运算的硬件,从而提高卷积运算的计算效率。本披露实施例的卷积运算可以是各种神经网络模型中的运算,这些神经网络模型可以应用于各种领域,诸如图像处理、语音处理、文本处理等等,这些处理例如可以包括但不限于识别和分类。
在第一方面中,本披露实施例提供了一种数据处理装置,包括控制电路、第一存储电路和第二存储电路,其中:所述第一存储电路用于存储处理前的数据;所述第二存储电路用于存储处理后的数据;以及所述控制电路用于配置并执行分块指令,以将按照第一维度存储顺序存储在第一存储电路上的输入数据以拆分单元为单位进行拆分并存储为第二存储电路上的输出数据,其中在所述第二存储电路上,各个拆分单元内按照第二维度存储顺序存储,拆分单元之间按照第三维度存储顺序存储。
在第二方面中,本披露实施例提供了一种芯片,其包括前述第一方面的数据处理装置。
在第三方面中,本披露实施例提供了一种板卡,其包括前述第二方面的芯片。
在第四方面中,本披露实施例提供了一种利用前述第一方面的数据处理装置对输入数据执行分块指令的数据处理方法。
通过如上所提供的数据处理装置、芯片、板卡以及由数据处理装置执行分块指令的数据处理方法,本披露实施例的方案针对各种卷积拆分方案中的数据进行分块处理,以适应硬件运算装置的处理能力,从而充分利用多个从处理电路的并行处理能力,可以有效提高卷积运算的运算效率。
附图说明
通过参考附图阅读下文的详细描述,本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本公开的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1示出本披露实施例的板卡的结构图;
图2示出本披露实施例的组合处理装置的结构图;
图3a示出本披露实施例的单核计算装置的处理器核的内部结构示意图;
图3b示出本披露实施例的多核计算装置的内部结构简化示意图;
图4示出可以应用本披露实施例的示例性卷积运算原理示例;
图5示出了根据本披露实施例的计算装置的示意性结构框图;
图6示出了根据本披露实施例的一种示例性数据存储顺序;
图7a-7c示出了根据本披露实施例的几种示例性分组模式;
图8示出了根据本披露实施例的输入特征图的示例性拆分示意图;
图9示出根据本披露实施例的Forward4方案的拆分和存储示意图;
图10示出了根据本披露实施例的Forward4方案中运算电路的输出点划分示意图;
图11示出根据本披露实施例的Forward4方案中的单次运算示意图;
图12示出根据本披露实施例的Forward4方案中的滑动卷积示意图;
图13示出根据本披露实施例Forward4方案的输出数据格式示意图;
图14示出根据本披露实施例的整体的数据搬运过程;
图15示出根据本披露实施例的Trans Tiling的示意性概念图;
图16示出前后配表的示意图;以及
图17示出根据本披露实施例的对神经元数据执行分块指令的示意图。
具体实施方式
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中可能出现的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。
示例性硬件环境
图1示出本披露实施例的一种板卡10的结构示意图。如图1所示,板卡10包括芯片101,其是一种***级芯片(System on Chip,SoC),或称片上***,集成有一个或多个组合处理装置,组合处理装置是一种人工智能运算单元,用以支持各类深度学***台的存储能力和计算能力有很高的要求,此实施例的板卡10适用在云端智能应用,具有庞大的片外存储、片上存储和强大的计算能力。
芯片101通过对外接口装置102与外部设备103相连接。外部设备103例如是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或wifi接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部 设备103。根据不同的应用场景,对外接口装置102可以具有不同的接口形式,例如PCIe接口等。
板卡10还包括用于存储数据的存储器件104,其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106配置用于对芯片101的状态进行调控。为此,在一个应用场景中,控制器件106可以包括单片机(Micro Controller Unit,MCU)。
图2是示出此实施例的芯片101中的组合处理装置的结构图。如图2中所示,组合处理装置20包括计算装置201、接口装置202、处理装置203和存储装置204。
计算装置201配置成执行用户指定的操作,主要实现为单核智能处理器或者多核智能处理器,用以执行深度学习或机器学习的计算,其可以通过接口装置202与处理装置203进行交互,以共同完成用户指定的操作。
接口装置202用于在计算装置201与处理装置203间传输数据和控制指令。例如,计算装置201可以经由接口装置202从处理装置203中获取输入数据,写入计算装置201片上的存储装置。进一步,计算装置201可以经由接口装置202从处理装置203中获取控制指令,写入计算装置201片上的控制缓存中。替代地或可选地,接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。
处理装置203作为通用的处理装置,执行包括但不限于数据搬运、对计算装置201的开启和/或停止等基本控制。根据实现方式的不同,处理装置203可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器,这些处理器包括但不限于数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本披露的计算装置201而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算装置201和处理装置203整合共同考虑时,二者视为形成异构多核结构。
存储装置204用以存储待处理的数据,其可以是DRAM,为DDR内存,大小通常为16G或更大,用于保存计算装置201和/或处理装置203的数据。
图3a示出了计算装置201为单核装置时处理核的内部结构示意图。计算装置301用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据,计算装置301包括三大模块:控制模块31、运算模块32及存储模块33。
控制模块31用以协调并控制运算模块32和存储模块33的工作,以完成深度学习的任务,其包括取指单元(instruction fetch unit,IFU)311及指令译码单元(instruction decode unit,IDU)312。取指单元311用以获取来自处理装置203的指令,指令译码单元312则将获取的指令进行译码,并将译码结果作为控制信息发送给运算模块32和存储模块33。
运算模块32包括向量运算单元321及矩阵运算单元322。向量运算单元321用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元322负责深度学习算法的核心计算,即矩阵乘及卷积。
存储模块33用来存储或搬运相关数据,包括神经元存储单元(neuron RAM,NRAM)331、权值存储单元(weight RAM,WRAM)332、直接内存访问模块(direct memory access,DMA)333。NRAM 331用以存储输入神经元、输出神经元和计算后的中间结果;WRAM 332则用以存储深度学习网络的卷积核,即权值;DMA 333通过总线34连接DRAM 204,负责计算装置301与DRAM 204间的数据搬运。
图3b示出了计算装置201为多核的内部结构简化示意图。多核计算装置可以用层次化硬件模型来进行抽象。如图所示,多核计算装置可以抽象为四个层级,即板卡级(Card)350、芯片级(Chip)360、处理器簇级(Cluster)370和处理器核级(Core)380。本披露实施例中主要涉及存储单元的数据传输和计算单元部分,因此附图和描述简要示出和介绍相关的计算结构,省略其他部分。
在板卡级,每块板卡上包含本地DDR存储,每个处理器芯片作为计算和控制单元。
在芯片级,每个处理器芯片包含多个多处理器作为计算单元。
在计算簇级,每个多处理器包括多个加速器核作为控制和计算单元,另外还有共享存储SRAM作为存储单元。
在处理器核级,每个加速器核包含本地存储及本地处理单元阵列。NFU指神经运算单元(Neuron Function Unit),用于进行卷积计算。
在该多核计算装置中,存储模型包括板卡全局内存、Cluster上的SRAM(共享存储器)、Core上的NRAM、WRAM和寄存器等。为了获得更好的性能,可以显式地控制Card以下各存储层次之间的数据搬移以及访存/计算间的平衡。SRAM包含在存储处理单元MPU(Memory Process Unit Core,简称MPU,或者Mem Core)中。Core指多核计算装置中的智能处理核(Intelligent Process Unit Core,简称IPU Core或者Core)。1个IPU Core包含NRAM,WRAM,NFU等等。Cluster指处理器簇或称计算簇,通常多核计算装置包含若干个Cluster,一个Cluster包含1个Mem Core+N个IPU Core。
示例性卷积运算类型
神经网络模型中的卷积层可以执行卷积运算,通过对输入特征图(也称为输入数据、神经元或输入神经元)应用卷积核(也称为过滤器、权值等)做卷积处理,从而进行特征提取。卷积层内部可以包含多个卷积核,组成卷积核的每个元素对应一个权值系数和一个偏置bias。本披露实施例可以应用于各种卷积运算的数据拆分中。
在常规3D卷积运算中,假设卷积层中输入特征图(Feature map)张量形状表示为X[N Hi Wi Ci],卷积核(kernel)的张量形状表示为K[Co Kh Kw Ci],输出的结果为Y[N Ho Wo Co],那么,简化的卷积运算的数学计算公式可以表示如下:
Y in,jc,jh,jw=∑ 0≤ic≤ci,0≤ih≤kh,0≤iw≤kwX in,ic,jh×sh+ih,jw×sw+iw×K jc,ic,ih,iw    (1)
上式中,X是输入数据,Y是输出数据,K是卷积核,Kh和Kw是K的长和宽,sh和sw是在长和宽方向上的步长(stride),公式忽略了偏置bias,填充pad和膨胀dilation,并且假设输入数据X已经做了填充,卷积核已经做了膨胀。公式忽略了N维度和C维度,神经网络模型的正向计算在N维度上的计算都是独立的,在C维度上是全连接的。卷积核在工作时,会按照一定的步长扫过输入特征,在卷积窗口内对输入特征做矩阵元素乘法求和并叠加偏差量。
图4示出了可以应用本披露实施例的示例性常规3D卷积运算原理示例。
图中示例性示出了大小为[N Hi Wi Ci]的四维输入数据X,其可以表示成N个Hi×Wi×Ci大小的立体矩形410。图中还示例性示出了大小为[Co Kh Kw Ci]的四维卷积核K,其可以表示成Co个Kh×Kw×Ci大小的立体卷积核420。输入数据X与卷积核K的卷积结果得到输出数据Y,其为[N Ho Wo Co]大小的四维数据,可以表示成N个Ho×Wo×Co大小的立体矩形430。
图中还具体示出了一个卷积运算示例,其中输入数据为6×6×3大小的输入特征图440,省去N维度;卷积核为3×3×3大小的立体卷积核450,针对单个Co;输出数据为4×4的输出特征图460。具体运算过程如下:
卷积核450按照一定的步长扫过输入特征图440,在卷积窗口470内对输入特征做矩阵元素乘法求和并叠加偏置。也即,输出特征图460中每个位置上的值由每个输入特征图的对应区块和对应卷积核做二维卷积运算之后再加和得到。例如,图中示出了输出特征图460上(0,0)位置的值(也即卷积输出点)由输入特征图中黑色立方体框出的卷积窗口470与立体卷积核450进行二维卷积运算得到3个值,再加和得到最终值。
为了得到其他位置的输出,可以在输入特征图440上移动卷积核450的位置,也即移动卷积输出点的卷积窗口。在图中示例中,卷积步长(Sx,Sy)为(1,1),当横向(宽度方向)向右或纵向(高度方向)向下移动一格后做卷积运算,可以分别得到输出特征图460上(0,1)或(1,0)位置的值。
从上面的描述可知,在神经网络的一个卷积层中,有N组输入特征图,每组包含Hi×Wi× Ci个信息,其中Hi和Wi分别是输入特征图的高度和宽度,Ci是输入特征图的个数,也称为输入通道数。卷积层有Ci×Co个Kh×Kw大小的卷积核,其中Ci是输入通道数,Co是输出特征图的个数(或输出通道数),Kh和Kw分别是卷积核的高度和宽度。输出特征图包含Ho×Wo×Co个信息,其中Ho和Wo分别是输出特征图的高度和宽度,Co是输出通道数。此外,在卷积运算中,还会涉及到卷积步长(Sx,Sy),卷积步长的大小会影响输出特征图的尺寸。
在本文中,输入特征图(Feature map)、输入数据、神经元或输入神经元可互换使用;卷积核、过滤器或权值可互换使用。此外,H(高度)和Y维度可互换使用,W(宽度)和X维度可互换使用。相应地,输入特征图的H维度可以表示为Hi或Yi,输出特征图的H维度可以表示为Ho或Yo,W维度类似表示。在本披露实施例中,每个卷积输出点具有对应的卷积窗口,卷积窗口的形状等于卷积核的形状。每个卷积输出点的值对应于其卷积窗口内的输入特征图与权值的对位乘累加结果。
示例性计算装置/数据处理装置
在本披露实施例中,可以采用主从结构的计算装置来实施上述卷积运算。进一步地,可以为输入特征图和卷积核配置不同的数据通路,从而提高访存效率。
图5示出了根据本披露实施例的计算装置500的示意性结构框图。可以理解,该结构可以视为图3中单个处理核的运算模块的内部结构细化,也可以视为在多个图3所示处理核的运算模块基础上联合的功能划分框图。如图5所示,本披露实施例的计算装置500可以配置用于执行各种类型的卷积运算,其可以包括主处理电路(MA)510和多个从处理电路(SL)520,图中示出了16个从处理电路SL0~SL15。本领域技术人员可以理解,从处理电路的数量可以更多或更少,取决于具体的硬件配置,本披露实施例在此方面没有限制。
主处理电路和从处理电路之间以及多个从处理电路之间可以通过各种连接相互通信。在不同的应用场景中,多个从处理电路之间的连接方式既可以是通过硬线布置的硬连接方式,也可以是根据例如微指令进行配置的逻辑连接方式,以形成多种从处理电路阵列的拓扑结构。本披露实施例在此方面没有限制。主处理电路和从处理电路可以相互配合,由此实现并行运算处理。
为了支持运算功能,主处理电路和从处理电路可以包括各种计算电路,例如可以包括向量运算单元及矩阵运算单元。向量运算单元用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元负责深度学习算法的核心计算,例如矩阵乘和卷积。
从处理电路例如可以用于根据运算指令,对相应的数据并行执行中间运算得到多个中间结果,并将多个中间结果传输回主处理电路。
通过将计算装置500设置成主从结构(例如一主多从结构,或者多主多从结构,本披露在此方面没有限制),对于正向运算的计算指令,可以根据计算指令将数据进行拆分,从而通过多个从处理电路对计算量较大的部分进行并行运算以提高运算速度,节省运算时间,进而降低功耗。
在本披露一些实施例中,通过利用不同的数据通路传输输入特征图和权值,可以支持输入特征图和权值的多种复用方式,从而减小运算期间的数据访存量,提升处理效率。
具体地,计算装置500中还可以包括第一存储装置530和第二存储装置540,用于分别存储经由不同数据通道传输的数据。
第一存储电路530可以用于存储多播数据,也即第一存储电路中的数据将通过广播总线传输给多个从处理电路,这些从处理电路接收到相同的数据。可以理解,通过广播总线可以实现广播和多播。多播是指将一份数据传输到多个从处理电路的通信方式;而广播是将一份数据传输到所有从处理电路的通信方式,是多播的一个特例。由于多播和广播都对应一对多的传输方式,本文中未对二者特意区分,广播和多播可以统称为多播,本领域技术人员根据上下文可以明确其含义。
第二存储电路540可以用于存储分发数据,也即第二存储电路中的数据将分别传输给不同的从处理电路,每个从处理电路接收到不同的数据。
通过分别提供第一存储电路和第二存储电路,可以支持针对待运算的数据以不同传输方式进行传输,从而通过在多个从处理电路之间复用多播数据来降低数据访存量。
在一些实施例中,主处理电路可以将输入特征图和卷积核中之一确定为多播数据并存储在第一存储电路中,以在运算期间通过广播方式将数据传输给调度的多个从处理电路。对应地,主处理电路可以将输入特征图和卷积核中另一确定为分发数据并存储在第二存储电路中。这些分发数据可以在运算前分发给对应的从处理电路。
图5还示出了根据本披露实施例的从处理电路SL的内部结构示意图。如图所示,每个从处理电路520可以包括多个运算电路CU 521、第一缓冲电路522和第二缓冲电路523。图中示出了4个运算电路CU0~CU3。本领域技术人员可以理解,运算电路的数量可以更多或更少,取决于具体的硬件配置,本披露实施例在此方面没有限制。
在一些实施例中,第一缓冲电路522可以用于缓存分配给该从处理电路的权值或输入特征图。相应地,第二缓冲电路523则可以用于缓存分配给该从处理电路的输入特征图或权值。这两个缓冲电路均用于选取参与运算的数据。第一缓冲电路522的数据可以是来自例如第一存储电路530或第二存储电路540的多个数据行,对应地,第二缓冲电路523的数据可以来自例如第二存储电路540或第一存储电路530的多个数据行。取决于具体的复用方式,这些数据行可以在运算期间被分发给对应的运算电路CU 521或广播给该从处理电路520内的所有CU 521。
每个运算电路CU 521用于在每次计算时,针对分别从第一缓冲电路中选取的数据行和从第二缓冲电路中选取的数据行执行对位乘累加运算。
通过分别提供第一缓冲电路和第二缓冲电路,可以支持针对待运算的数据以不同传输方式进行传输,从而通过在单个从处理电路内的多个运算电路之间尽可能复用数据来降低数据访存量。
从处理电路520中还可以包括第三缓冲电路524,用于缓存各个运算电路CU 521的运算结果。
可以理解,虽然在图5中将各个处理电路与存储电路示出为分立的模块,但是根据不同的配置,存储电路与处理电路也可以合并成一个模块。例如,第一存储电路530可以与主处理电路510合并在一起,第二存储电路540则可以由多个从处理电路520共享,并为每个从处理电路分配独立的存储区域,加速访问。本披露实施例在此方面没有限制。此外,在该计算装置中,主处理电路和从处理电路可以属于同一处理器或芯片的不同模块,也可以属于不同处理器,本披露在此方面也没有限制。
示例性数据拆分和存储
在本披露实施例中,所涉及的多维数据的维度表征为(N,H,W,C)或(Co,H,W,Ci),其代表了数据在存储器中的存储顺序。可以理解,虽然多维数据具有多个维度,但是因为存储器的布局始终是一维的,因此多维数据与存储器上的存储顺序之间存在对应关系。多维数据通常被分配在连续的存储空间中,也即可以将多维数据进行一维展开,按顺序存储在存储器上。例如,在本披露实施例中,初始的输入特征图可以按照低维度(此处C/Ci为最低维度)优先方式,进行顺序存储;而为了优化卷积运算,在运算过程中可以调整输入特征图的存储顺序,如后面将详细描述的。相邻的维度是指多维数据的维度信息表示中相互紧挨着的维度,例如,W和Ci相邻,相邻的维度也可以称为连续的维度。
在智能处理器中,出于算力的需要和面积功耗开销的考虑,硬件的主要运算单元是向量的乘加运算器。在硬件设计中实现各类卷积算法的支持,本质上是最大化地提取算法中的乘加运算,并且通过数据通路实现在片上RAM(诸如图3中的NRAM、WRAM等)和运算器之间高效地交换乘加运算的输入和输出数据。
硬件在存储上是以一行一行(缓存行)进行存储的,读、写、计算操作在整行对齐时效率最高,因此为了充分利用带宽,适配运算器阵列的访存量等需求,通常需要将数据进行向量化对齐。人工智能芯片的设计通常以Ci维度为最低维度,也即上述NHWC摆放顺序,Ci维度上的数据是连续的。因此,向量化对齐要求需要Ci维度的大小对齐到指定数值,例如对齐值M,从而以该对齐值M为单位进行存取数,M也可以称为硬件单次最大运算量。基于不同的硬件设计,M可以有不同的数值,例如64bit、128bit、256bit、512bit等。通常,运算器阵列的输入端口大小也与M相关,例如在输入数据位宽对称的情形下,运算器阵列的输入端口大小通常为M的2倍,也 即一次性处理对齐值M规模的输入特征图数据和权值数据。当输入特征图的Ci维度较大时,比较容易满足上述对齐要求。
当输入特征图的Ci维度较小时,例如小于一个缓存行的大小,则需将Ci维度补齐到一行数据(例如,512比特),即填充无效数据0。这种填充会造成大量的冗余计算,导致资源浪费,降低了运算的效率。
在本披露实施例中,提出了一种卷积运算方案,其可以根据输入特征图的最低存储维度(例如Ci)的大小,确定对应的卷积拆分方案,其中卷积拆分方案至少指示待运算数据的拆分单元的形状。一个拆分单元包含的数据量不超过硬件单次最大运算量。
在一些实施例中,一个拆分单元包含的数据量可以设置成硬件的一次性处理对齐值M,从而以拆分单元为单位进行运算处理,可以充分发挥硬件的算力,避免或减少无效计算。
在本披露的示例性描述中,不防假设M=512bit=64Byte,数据类型可以是Int8、Int16、Float16或Float32,并且输入特征图与卷积核的数据类型一致。由于数据类型至少需要1字节的宽度,并且运算处理的最小单位是一个数据,因此在下面的示例中均以字节为单位进行各种计算,例如M=64B,Ci=28B等等,其中有时候为了简洁起见省略单位。
当拆分单元的数据量等于M时,每个拆分单元的数据块形状为blockC*blockY*blockX,其可能存在多种情形,表1列出了其中的几种:
Figure PCTCN2022100302-appb-000001
表1、数据块形状
从表1可以看出,有些数据块形状的X和Y维度尺寸相等(如深色行所示),这种形状可以简化后续的运算。因此在本披露实施例中,可以优选使用这种数据块形状对待运算数据进行拆分。
为了简便起见,将64B×1×1形状的拆分方案称为Forward64,将16B×2×2形状的拆分方案称为Forward16,将4B×4×4形状的拆分方案称为Forward4,将4B×4×4形状的应用于深度卷积运算的拆分方案称为Forward1,将4B×4×4形状的应用于反向深度卷积运算的拆分方案称为Update1,将4B×4×4形状的应用于叉乘卷积运算的拆分方案称为Update4。除了Forward64之外,这些拆分方案适合卷积计算中通道C比较小的场景,因此也可以统称为小卷积。在这些小卷积拆分方案中,一个拆分单元包括最低存储维度和至少一个其他存储维度的数据,并且一个拆分单元的总数据量不超过硬件单次最大运算量。
不同的卷积拆分方案可以适用于不同的运算场景,从而获得不同程度的性能优化。
在确定了拆分方案之后,接着可以按照所确定的卷积拆分方案,将输入特征图和卷积核拆分成多个对应的拆分单元并转换其维度存储顺序,以使得一个拆分单元内的数据连续存储为一个数据行,从而方便后续以拆分单元(数据行)为单位进行读取处理。
在一些实施例中,对于三维或者四维的神经元或者权值的数据,将其全部划分为大小为blockC*blockY*blockX(Uc×Uy×Ux)大小的数据块,每一个数据块连续存储在例如M=64B的一行上,由此在读取一行数据时,实际取出一个数据块的数据。
具体地,可以从以第一维度存储顺序存储的待运算数据中,以拆分单元为单位,按第一读取顺序读取一个或多个拆分单元,将读取的拆分单元存储到对应的存储电路上,其中每个拆分单元内的数据按照第二维度存储顺序存储,拆分单元之间按照第三维度存储顺序存储。
图6示出了根据本披露实施例的一种示例性数据存储顺序。
如图所示,610表示待运算的四维张量的存储方式,包含N个3维的子张量,N在最高维度,也即四维张量的第一维度存储顺序为NHWC。注意,本文中H和Y、W和X可互换使用。每一个子张量被划分为更小的数据块或拆分单元,每一维的数据块的个数分别为C/Y/X。
中间的图620表示每一个子张量的存储方式,每个数据块被存储为连续的64Byte,也即一行。当读取数据块的顺序不同时,行之间的顺序也会相应地变化。在图中示例中,按照先C、然后X、最后Y的方向读取数据块,也即第一读取顺序为YXC,则各行之间按照Y*X*C的顺序存储,也即第三维度存储顺序为YXC或HWC。在此示例中,第三维度存储顺序与第一维度存储顺序相同。可以理解,也可以使用其他读取顺序,从而导致第三维度存储顺序与第一维度存储顺序不同,此处不再一一列举。
右侧的图630表示每一行内的顺序,也即每个数据块内的数据顺序,其形状为blockC*blockY*blockX,此时第二维度存储顺序为CYX或CHW。
示例性分组运算
小卷积采用block形式,较传统卷积的优势在于Ci方向对齐只需要满足block在Ci方向对齐即可。在这一小通道的场景下,权值(co*Kh*kw*ci)普遍较小,Kh和Kw通常是个位数,co和ci差不多。在前面结合图5描述的计算装置/数据处理装置中,通常第二存储电路(例如图3的WRAM 332)的存储空间比第一存储电路(例如,图3的NRAM 331)要大。因此,为了充分利用片上的计算空间,在大部分小卷积方案中,例如Forward4、Forward1等,采用与正常卷积的神经元和权值存储位置互换的方案,也即将神经元存储在第二存储电路WRAM上,权值存储在第一存储电路NRAM上。
卷积的计算是每一个输入特征图都需要和每一个Co的卷积核进行乘加运算,从而输出Co个输出特征图。然而,并不是片上空间一定能同时存储下所有规模的卷积核和输入特征图,因此,对于硬件而言存在一系列重复加载输入特征数据或者权值数据的操作,如何平衡重复加载输入特征数据还是权值数据对计算的效率会有一定影响。在实际运算中,为了减少频繁的片外访存,存在对神经元和权值的拆分策略问题。在一些实施例中,根据参与运算的数据的规模特性,可以采取不同的拆分方式。
根据前面描述的卷积运算原理可知,Co维度(深度卷积为C维度)上的运算结果无需累加,因此不同Co上的运算分配在不同的运算电路上可以相对独立地进行。在小卷积场景中,通常单轮运算中卷积核的输出通道Co维度的尺寸不超过所调度的从处理电路的数量,因此单个Co的运算需要由一个或多个从处理电路来完成。更一般地,即使Co维度较大时,也可以通过拆分成多轮运算来实现,其中每轮运算处理的Co尺寸不超过所调度的从处理电路的数量。由此,在一个示例中,可以首先基于卷积核的输出通道Co维度尺寸和可调度的从处理电路数量Ns,确定完成卷积运算所需的运算轮次以及各轮次运算中处理的Co数量或相应的分组模式。
不管哪种分配方式,在单轮运算中,Co可能存在两种分配情况:多个从处理电路处理一个Co值,或者单个从处理电路处理一个或多个Co值。具体地,在处理Nco个输出通道的单个运算轮次中,每Rs个SL构成一个从处理电路组SLB,处理对应同一输出Co值的卷积核,Rs=[Ns/Nco],也即同一卷积核在同一SLB内的Rs个SL上复用,Rs表示卷积核在从处理电路之间的复用次数。与之相应地,输入特征图可以在各个从处理电路组SLB之间复用,Rn=[Ns/Rs],表示输入特征图在从处理电路之间的复用次数。
可选地或附加地,当每个从处理电路处理对应rn个Co值的卷积核,rn=[Nco/Ns],此时每个从处理电路处理的输入特征图可以重复用于rn个卷积核,rn表示输入特征图在单个从处理电路内的复用次数。可以考虑硬件缓冲空间限制等因素(例如图5中的第一缓冲电路和第二缓冲电路的大小)来确定单个从处理电路内可应用的最大卷积核复用次数rs和最大输入特征图复用次数rn。
考虑到硬件电路中的缓存大小限制和复用收益,在本披露一些实施例中暂时不考虑一个从处理电路在单轮运算中处理多个Co值的情况,而只考虑一个或多个从处理电路在单轮运算中只处 理一个Co值的情况。
根据在单轮运算中处理同一Co值的从处理电路SL的个数,可以采用不同的分组模式。可以理解,优选对可调用的从处理电路SL平均分配,从而均衡算力,例如,每2个SL一组,从而16个SL可以同时处理8个Co值;或者每4个SL一组,从而16个SL可以同时处理4个Co值;等等。在前面结合图5描述的计算装置中,第二存储电路WRAM具有16块存储区域,分别分配给16个从处理电路SL。进一步地,每4块又可以组合成一个存储区块,分给对应的从处理电路组SLB。因此,在一些实施例中,对于图5所示的包括Ns=16个SL的计算装置,可以选择如下几种分组模式:Group1模式、Group4模式和Group16模式。本领域技术人员可以理解,根据Ns的数值不同,可以有不同的分组模式,每种分组模式均可以参考本文给出的以上三种代表性分组模式进行对应的处理。
在一些实施例中,上述分组模式可以统一表示为GroupN,代表当前轮次运算中调度的所有从处理电路SL分为N组,每个从处理电路组SLB处理同一Co值,不同从处理电路组SLB处理不同Co值。对于总计16个SL可调度的场合下,N可以取1,4,16,分别对应上面的Group1、Group4和Group16。
图7a-7d示出了根据本披露实施例的几种示例性分组模式。图7a示出了Group1模式,图7b示出了Group16模式,图7c示出了一种Group4模式,以及图7d示出了另一种Group4模式。
如图7a所示,Group1模式是指所有可调度的16个SL属于一个组,共同处理一个Co值,例如SL0~SL15属于组G0。从而,针对该一个输出通道的运算被分配在16个SL上。在这种模式下,可以优先考虑将该输出通道的卷积核720以广播方式传输到各个SL,输入特征图710则进行拆分分配给各个SL,从而提高访存效率。
在一个实施例中,可以将卷积核存储在图5的第一存储电路530上,以利用广播通道进行传输。输入特征图则可以按照输出特征图的XY方向划分,存储在第二存储电路540上,以分配给不同的SL。由此,所有SL共同计算一个Co的输出特征图。后面将结合附图详细描述输入特征图的划分和存储。
如图7b所示,Group16模式是指所有可调度的16个SL分成16个组,也即每组一个SL,每个SL处理一个不同的Co值。例如SL0属于组G0,SL1属于组G1,以此类推,直至SL15属于组G15。在这种模式下,同一块输入特征图730可以在16个SL之间重复使用,因此可以优先考虑将输入特征图730以广播方式传输到各个SL,而对应不同Co的卷积核740则分发给对应的SL。
在一个实施例中,可以将输入特征图复制16份,存储在第二存储电路上为16个从处理电路分配的16个存储区域上。卷积核则根据Co划分,一个SL对应一个Co,一次处理16个Co,存储在第一存储电路上,以单播方式分配给不同的SL。由此,所有SL针对同一输入特征图计算不同Co的输出特征图。
如图7c所示,Group4模式是指所有可调度的16个SL分成4个组,每组处理一个Co值。每个SL组(简称SLB)包括的SL数量等于Rs=Ns/4=4。例如SL0~SL3属于组G0,SL4~SL7属于组G1,SL8~SL11属于组G2,以及SL12~SL15属于组G3。这种模式介于Group1和Group16之间,因此可以将卷积核或输入特征图任一确定为多播数据,而将另一确定为分发数据。
在一个实施例中,可以将卷积核按照Co划分成4组,存储在图5的第一存储电路530上,以利用广播通道进行传输。输入特征图则可以按照输出特征图的XY方向划分为4份并复制4份,存储在第二存储电路540上,以分发给4个SLB。每个SLB获得相同的输入特征图,在SLB内再按照所划分的4份分发给其内的4个SL。由此,每个SLB中的所有SL共同计算一个Co的输出特征图,4个SLB则分别处理一个不同的Co。
如图7c所示,将卷积核分成4组,按照Co以间隔1为单位划分至各组。例如,当Co=12时,分成的4组Co分别为{0,4,8}、{1,5,9}、{2,6,10}和{3,7,11}。每一次发送各组的一个Co,例如第一次发送Co=0~3,一个Co对应一个SLB,在一个SLB内的4个SL共用相同权值;第二次发送Co=4~7,依次类推。由此,每轮运算完成后,各个SLB输出的运算结果的Co维度是连续的。
当采用Forward4这种小卷积拆分运算方案时,为了同时支持以上三种模式,可以统一将神经元存储在第二存储电路WRAM上,将权值存储在第一存储电路NRAM上。
输入特征图的示例性拆分
从前面的描述可以看出,当多个SL共同处理一个Co值时,需要在这多个SL之间对输入特征图进行拆分,例如Group1分组模式需要将输入特征图拆分成16份,而Group4分组模式需要将输入特征图拆分成4份。
为了保证拆分的输入特征图可以共用卷积核,可以根据输出特征图的Ho/Wo方向来划分,从而映射回到输入特征图的划分。在一些实施例中,在每个从处理电路组内包括的Rs个从处理电路SL之间可以按如下划分输入特征图:根据对应的输出特征图的尺寸,将输出特征图在XY维度(也即Ho/Wo维度)上平均划分为Rs个形状相同的输出特征块;以及根据计算每个输出特征块所需的输入特征图区域,将输入特征图在XY维度(也即Hi/Wi维度)上划分为Rs个输入特征块,以分配给Rs个从处理电路。可以理解,取决于卷积核尺寸和卷积步长,输出特征图上相邻的输出点所对应的输入特征图可能会存在重叠。
图8示出了根据本披露实施例的输入特征图的示例性拆分示意图。在此示例中,将输入特征图划分成16份分配在16个SL上,对应Group1模式。
图中810代表单个Co的输出特征图,其在XY方向上按照4×4方式划分成16个形状相同的输出特征块,分别分配给SL0~SL15。继而,由这16个输出特征块可以映射到输入特征图820上,获得分别计算这16个输出特征块所需的16个输入特征图区域,其同样是将输入特征图在XY方向上划分。这16个输入特征图区域可以相应地分配给16个从处理电路SL。
根据前文描述,会按照确定的卷积拆分方案,以拆分单元为单位对输入特征图进行拆分,因此,上述实施例中对输入特征图的分块要使得划分的每个输入特征图块在XY方向上是拆分单元XY方向维度的倍数,也即在XY方向上可以按照拆分单元对齐。例如,在选择4×4×4的卷积拆分方案时,每个输入特征图块按4×4对齐;而在选择16×2×2的卷积拆分方案时,每个输入特征图块按2×2对齐。
对于输出特征图不按拆分单元(例如4×4或2×2)对齐的情况,需要相应的在输入特征图上填补(例如补0),使得实际计算的输出XY是按拆分单元(例如4×4或2×2)对齐的并且输入XY也是按拆分单元(例如4×4或2×2)对齐的。
本领域技术人员可以理解,也可以在输出特征图的XY方向按照其他规则进行拆分,例如按照1×16方式拆分成16个形状相同的输出特征块,分别分配给SL0~SL15。本披露实施例在此方面没有限制。此外,还可以理解,虽然前面结合从处理电路之间的拆分进行描述,但是这种拆分方式也可以应用于其他场景下的拆分,例如单个从处理电路SL内的运算电路CU之间的拆分,本披露实施例在此方面没有限制。
单个从处理电路内的示例性卷积运算过程
在拆分好待运算数据并进行相应的摆放存储之后,就可以调度多个从处理电路对输入特征图和卷积核的对应数据行执行卷积运算,继而可以根据卷积拆分方案,对多个从处理电路返回的运算结果进行拼接处理,以得到输入特征图和卷积核的卷积运算的输出特征图。具体地,可以利用从处理电路中的多个运算电路CU以及各个缓冲电路(参见图5)来执行具体的卷积运算过程。取决于从处理电路内部缓冲电路的空间大小以及运算电路的算力限制,在每轮运算中通常需要执行多个运算周期来完成所需运算。
从前面描述可知,在针对常规3D卷积运算场景下,单个从处理电路内的所有运算电路计算对应同一输出通道Co的一个输出特征图或部分输出特征图。取决于从处理电路SL内第一缓冲电路和第二缓冲电路的缓冲空间大小、运算电路CU的处理能力(例如内部寄存器等),从处理电路可能无法一次计算完分配给其的输出特征图。因此,可以以运算电路单次运算能力(例如,单次计算Nop个输出点或部分和)为单位,划分输出特征块,每个输出特征块对应单个SL内所有 可调度的N CU个运算电路的单次运算能力(N CU*Nop个输出点)。例如,以前文图5中每个SL包括4个CU为例,假设每个CU单次可以计算Nop=4个输出点或输出点的部分和,则单个SL单次可以计算4*4=16个输出点(或部分和)。因此,可以将输出特征图在XoYo维度上按照16个输出点对齐划分输出特征块,逐个计算各个输出特征块。可以理解,这16个输出点可以按照4*4形式,也可以按照1*16形式,本披露实施例在此方面没有限制。
在计算每个划分的输出特征块时,又可以进一步在这N CU个运算电路之间划分该输出特征块的输出点,以确定各个运算电路的处理对象。继而,可以根据输出点的划分,以拆分单元为滑动窗口,从第一缓冲电路中选取N CU个输入特征数据行分发给N CU个运算电路,从第二缓冲电路中选取对应的权值数据,广播给N CU个运算电路,从而通过复用权值数据来实现多个滑动窗口对应的输出点的并行计算。执行Nk次滑动选取,其中Nk根据卷积核在X和Y维度的尺寸和从处理电路在当前卷积拆分模式下单次运算所支持的最大卷积核尺寸中的较小值来确定。
在一些实施例中,当执行常规三维卷积运算时,可以按如下选取对应的权值数据:从第二缓冲电路中按照与在第一缓冲电路中对应的滑动方式选取1/Nop个权值行,将其复制Nop-1份扩展为一个扩展权值行,广播给从处理电路内的N CU个运算电路。
此时,每个运算电路可以在每次滑动选数计算时,针对来自第一缓冲电路的一个输入特征行和来自第二缓冲电路的一个扩展权值数据行,以1/Nop个数据行为单位进行对位乘累加,得到Nop个部分和;以及将Nk个滑动选数计算得到的Nk*Nop个部分和按照对应的卷积输出点进行累加,得到并输出Nop个运算结果。
从处理电路在输出其内运算电路的输出点时,可以根据输出点的划分方式,按特定顺序输出其内多个运算电路计算的输出点,以使得连续输出的输出点在X和/或Y维度上连续,方便后续处理。在一些实施例中,主处理电路可以进一步将从各个从处理电路返回的运算结果以第四维度存储顺序存储。根据情况,主处理电路还可以将运算结果转换为期望的维度存储顺序存储。
运算电路之间输出点的划分可以有多种方式,相应地滑动选数卷积过程以及输出点的输出顺序也有所不同。
以下结合Forward4方案来详细描述其整个数据拆分、存储、卷积滑动和计算输出过程。
Forward4方案的输入神经元、权值的形状描述
在Forward4中,拆分单元block的形状是4B×4×4。依据数据类型不同,block的形状略有差别。表2示出了Forward4在不同数据类型下的block形状。
Figure PCTCN2022100302-appb-000002
表2、Forward4在不同数据类型下的数据块形状
图9示出了根据本披露一个实施例的Forward4方案的拆分和存储示意图。为了简便起见,图中示例假设数据类型为Int8。
图中910示出了原始待运算数据(其可以是神经元或权值),其存储顺序是HWC。图中还示出了原始待运算数据按拆分单元进行拆分的4个数据块911-914,每个数据块包括4×4×4=64个数据。
图中920示出了拆分后的数据摆放格式,以方便读取。可以看出,原始的数据块(例如911-914)被摆放成C维度上的一行(例如921-924)。在每行内,数据按照CHW的顺序存储,例如对于数据行921,先存储C=0的16个数据,接着存储C=1的16个,然后是C=2的16个,最后是C=3的16个。
具体而言,对于神经元来说,需要将数据从[1 Hi Wi Ci]摆放为:
[1*Hi/4*Wi/4*Ci/4*(4×4×4)],这种七维张量的形状。
对于权值而言,需要将数据从[Co Kh Kw Ci]摆放为:
[Co*Kh/4*Kw/4*Ci/4*(4×4×4)],这种七维张量的形状。
从前文描述可知,Forward4方案可以支持多种分组模式。对于神经元而言,针对不同的分组模式和组内的HoWo拆分方式,上述block格式的七维形状最终拆分到第二存储电路的每个存储区域中还有细微差别。
假设原始输入神经元大小为:[1*hi*wi*ci]
Group1分组模式下,输入神经元摆数根据HoWo拆分方式而不同:
Ho*Wo 4*4拆分:16[hi/(4*4),wi/(4*4),ci/4*(4*4*4)]
Ho*Wo 1*16拆分:16[hi/(4),wi/(4*4*4),ci/4*(4*4*4)]
上述4*4拆分中,16表示16个从处理电路SL,末尾4*4*4(CHW)表示由三个维度拆分出来的CHW的BLOCK,hi,wi除的两次4中,第一次4表示将hi*wi拆成16份分发至16个SL,第二次4表示折叠hi、wi到ci方向。1*16拆分的含义同理。
Group4分组模式下,输入神经元摆数根据HoWo拆分方式而不同:
Ho*Wo 1*4拆分:4*4*[hi/(1*4),wi/(4*4),ci/4*(4*4*4)]
对于一个从处理电路SL:hi/(1*4),wi/(4*4),ci/4*(4*4*4)
上述表示中,前面第一个4表示4个SLB,神经元被复制了4份,第二个4表示神经元被拆分在一个SLB的4个SL上,末尾4*4*4表示由三个维度拆分出来的CHW的BLOCK。
Group16分组模式下,输入神经元无需拆分,其摆数如下:
16*[hi/4,wi/4,ci/4*(4*4*4)]
上述16表示神经元复制在16个SL上,末尾4*4*4表示由三个维度拆分出来的CHW的BLOCK,hi、wi都除4,表示折叠hi、wi到ci方向。
Forward4方案中运算电路之间的输出点拆分
当单个从处理电路SL内的多个运算电路CU共同处理一个Co值时,需要在这多个CU之间对输出点进行拆分。
图10示出了根据本披露一些实施例的Forward4方案中为每个运算电路分配间隔输出点的示意图。在这些实施例中,可以在N CU个运算电路之间将该输出特征块平均划分为Nop个形状相同的输出特征子块,每个输出特征子块包括N CU个输出点,分别划分给N CU个运算电路。例如,图中以每个SL包括4个CU,每个CU单次可以计算Nop=4个输出点或部分和为例,示出了输出特征块1010包括4*4个输出点,平均划分的每个输出特征子块1011~1014均包括2*2个输出点。在每个输出特征子块中,这2*2个输出点分配给4个运算电路。由此,每个运算电路计算4个输出特征子块中各一个输出点。图中用不同的背景示出分配给4个不同运算电路CU0~CU3的输出点。从图中可以看出,在每次计算时,每个运算电路计算输出特征图上在X和/或Y维度间隔的多个输出点。
基于上述输出点划分,当通过滑动选数执行卷积运算时,可以根据计算输出特征子块所需的数据,从每个输出特征子块的输出点位置相对应地,从第一缓冲电路中选取N CU个数据行进行运算。例如,在输入特征数据的首次选数时,可以根据计算首个输出特征子块1011内的4个输出点所需的4个输入特征块,从对应的输入特征块中选取4个输入数据行,分发给4个运算电路。可以理解,由于这4个输出点在X和/或Y方向是连续的,因此同时选取的4个输入数据行在X和/或Y方向的间隔或步长是1。
在进行权值数据选取时,可以从第二缓冲电路中选取对应的权值数据,广播给N CU个运算电路,从而通过复用权值数据来实现多个运算电路对应的输出点的并行计算。进一步地,在一些实施例中,为了充分发挥运算装置CU内部的算力(例如乘加运算器),例如单次计算Nop个输出点或部分和,可以在单个输入数据行内进行权值复用,从而同时计算Nop个输出点或部分和。
例如,在权值数据的选数时,可以只取1/Nop个权值行,将其复制Nop-1份以扩展成1个权值行,此扩展权值行中包括Nop个相同的1/Nop权值行。扩展权值行同样可以广播给N CU个运算电路,从而在多个运算电路之间复用权值的同时,在单个运算电路的Nop个输出点的计算之间以更小的粒度(例如1/Nop行)复用权值。
由此,通过每次对应地取N CU个输入特征数据行、取1/Nop个权值行复制扩展成1个权值行,每次可以计算N CU*Nop个输出点或部分和。当计算结果是部分和时,通过多次滑动,可以多次计算部分和,各次的部分和根据所属的输出点进行累加,可以得到最终结果。
根据输出点的划分方式,可以确定卷积运算的滑动次数和滑动步长。按照图10的划分方式,滑动次数Nk=ceil(Kx/2)*ceil(Ky/2),其中Kx、Ky分别是卷积核在X和Y维度的尺寸和从处理电路在当前卷积拆分模式下单次运算所支持的最大卷积核尺寸中的较小值,滑动步长=2。从处理电路单次运算所支持的最大卷积核尺寸例如至少由第一缓冲电路和第二缓冲电路的空间大小决定。可以理解,当卷积核超过最大卷积核尺寸时,需要在Kx和Ky方向按照该最大卷积核尺寸进行拆分。
Forward4方案中卷积滑动过程
图11示出了根据本披露一个实施例的Forward4方案中的单次运算过程示意图。在该示例中,第一缓冲电路1110的大小为3×3×64B,也即最多可以缓存9行数据,第二缓冲电路1120的大小为2×2×64B,也即最多可以缓存4行数据。为了与拆分单元一致,图中的缓冲电路内的存储同样以拆分单元为单位示出。
图中示出了第一次滑动取数的运算过程。按照与输出点的划分方式对应的方式,以拆分单元为滑动窗口,从第一缓冲电路中滑动选取N CU个输入特征行,分别发送给N CU个运算电路以供计算;从第二缓冲电路中按照与在第一缓冲电路中对应的滑动方式选取1/Nop个权值行,其中Nop是每个运算电路单次最大可计算卷积输出点数量,将其复制Nop-1份扩展为一个扩展权值行,广播给从处理电路内的N CU个运算电路。
具体地,在图5所示的计算装置中,N CU=4,Nop=4。在划分输出点时,按照每次计算时,每个运算电路计算X和Y维度上均间隔1的2×2个输出点进行划分。
如图所示,从第一缓冲电路1110中在起始位置以及X和/或Y方向各移动1的位置选取一个输入特征数据行,总计选取4个输入特征数据行,对应地发送给该从处理电路SL内的4个运算电路1140。从第二缓冲电路1120中在起始位置选取1/4个权值数据行,也即选取2×2大小的数据,将其复制3份扩展为一个扩展权值数据行1130,广播给该SL内的4个运算电路1140。
在每次计算时,每个运算电路针对来自第一缓冲电路的一个输入特征行和来自第二缓冲电路的一个扩展权值行,以1/Nop个数据行为单位进行对位乘累加,得到Nop个部分和。
如图所示,4个运算电路1140对分发的输入特征数据行和广播的扩展权值数据行执行对位乘累加运算,得到运算结果1150。1150中不同背景颜色的结果代表由不同运算电路1140得到的。可以看出,每次运算,一个CU会计算4个输出点的部分和,4个CU总计获得4×4的部分和。可以看出,每个CU计算的输出点在输出特征图的XoYo维度上并没有相邻。
接着,在第一缓冲电路和第二缓冲电路中同步滑动取数,进行下一计算。执行Nk次滑动选数,其中Nk=ceil(Kx/2)*ceil(Ky/2),Kx和Ky分别是卷积核在X和Y维度的尺寸或从处理电路在当前卷积拆分模式下单次运算所支持的最大卷积核尺寸中的较小值。相应地,运算电路将Nk次滑动计算中计算得到的Nk*Nop个部分和按照对应的卷积输出点进行累加,得到Nop个运算结果。
在一些实施例中,在Forward4模式下,从处理电路单次运算所支持的最大卷积核尺寸为8×8。
图12示出了根据本披露一个实施例的Forward4方案中的滑动卷积过程示意图。该示例以9×9的输入特征图,5×5的卷积核为例,卷积步长为1,则输出特征图大小为5×5。输入特征图需要对齐到12×12,分成9块4×4×4(C×H×W)大小的块,存储在第一缓冲电路中,图中示出为1210,其中省去了C维度。卷积核5×5则需要对齐到8×8,对齐部分补0,存储在第二缓冲电路中,图中示出为1220,同样省去了C维度。每次计算时,选取卷积核中2×2大小的块,复制4次,刚好对应上输入特征图的4×4的块,复制操作可以由硬件实现。
每次滑动时的输入特征图和卷积核在第一缓冲电路和第二缓冲电路中的选取范围如图12所 示,共9幅图,代表共滑动9次。图中方块1210代表第一缓冲电路中的输入特征图,四个虚线框表示选择发给四个CU的区域;方块1220代表第二缓冲电路中的卷积核,虚线框代表选出的1/4行,其被复制3份扩展成一行后广播给4个CU。滑动次数Nk=ceil(Kx/2)*ceil(Ky/2)=9。
在每次计算时,每个CU针对来自第一缓冲电路的一个输入特征数据行和来自第二缓冲电路的一个扩展权值数据行,以1/4个数据行为单位进行对位乘累加,得到4个部分和;以及在当前运算轮次中将Nk次计算中得到的对应同一卷积输出点的Nk个部分和进行累加,得到并输出4个运算结果。
具体地,对于图12中的每幅图,CU的个数Ncu=4,每个CU单次计算Nop=4个输出点或部分和,该部分和是1/4个数据行的对位乘累加结果,也即每个输出点为一个4×2×2(Ci×Y×X)的标准卷积。滑动Nk=ceil(Kx/2)*ceil(Ky/2)=9次之后,Y×X方向完成累加,最终1个SL中得到完整的4×4(Y×X)的输出(如图10b所示)。这种模式下单次计算仅支持卷积核不大于8×8的情形,对于更大的卷积核,需要在Kx和Ky方向按照8×8进行拆分,可以按照上面同样的原理进行拆分运算。
可以理解,当Ci>4时,需要在Ci方向遍历,同时切换输入和权值,直到计算出完整的输出。当每个CU计算的Xo/Yo大于4时,需要沿着Xo/Yo方向滑动,读取不同的输入神经元和权值。本领域技术人员根据前述描述可以类似地推导出其计算过程,此处不再赘述。
Forward4方案中的输出形状描述
从前面的输出点划分方式和滑动卷积过程可以看出,滑动模式输出的结果并不是传统卷积输出数据的正常排列顺序。因此,在输出过程中,各个从处理电路SL可以将其内运算电路CU的运算结果转换为指定的格式,例如Nco×Uy×Ux的格式。在一些实施例中,每个从处理电路可以每次输出其内部分运算电路的部分运算结果,该部分运算结果在输出特征图的X和/或Y维度上连续。主处理电路可以进一步将从各个从处理电路返回的运算结果以第四维度存储顺序存储。根据情况,主处理电路还可以将运算结果转换为期望的维度存储顺序存储。
当分组模式和/或单个SLB内输入特征图的拆分方式(也即根据输出特征图的HoWo拆分方式)不同时,输出的数据格式略有不同。
图13示出了根据本披露一个实施例Forward4方案的输出数据格式示意图。在此实施例中,分组模式为Group1,单个SLB(包括16个SL)内输入特征图的拆分方式按照Ho×Wo=1×16拆分。
图中1310示出了1个SL的原始输出。从图中可以看出,每个SL每次输出1×1×4(Co×Y×X)的区域,也即每次输出其内部分运算电路的部分运算结果,例如2个CU中各2个运算结果(参见图10),这一部分运算结果在输出特征图的X和/或Y维度上连续,例如为同一行(图13所示)或同一列。连续4次返回1×4×4(Co×Y×X)的区域,也即4个CU中各个的4个运算结果。不同的SL输出同一Co的输出特征图的不同区域。当输出所有Co的4×4区域之后,继续输出会切换不同的输出点。
图中1320示出了16个SL的存出数据结构。如图所示,最终输出数据在写入存储电路(例如第一存储电路)后变为Yo*Xo*Co*4*16*4的格式,其中的Yo和Xo为每一个SL划分到的输出特征图的块的个数,16为在16个SL上的划分。根据需要,在一些实现中,可以再次进行摆数操作以转化为其他期望的数据格式。
如前面所提到,分组模式和/或单个SLB内多个SL之间输入特征图的拆分方式不同时,输出的数据格式还有细微的差别。假设原始输出大小为:
1*ho*wo*co
那么,Group1在Ho*Wo按照4*4拆分时的输出数据形状为:
ho/(4*4)*wo/(4*4)*co/group*(4*16*4)
上式中,(4*16*4)是forward4的基本输出块,方向分别对应h*c*w,其中16表示16个SL上的相同co的ho、wo的划分;ho,wo除了两次4,其中,第一个4表示在SL存储数据时进行4×4拆分,第二个4表示h、w方向的数据块折叠。Group1模式下,上面的group=1。
Group1在Ho*Wo按照1*16拆分时的输出数据形状为:
ho/(4)*wo/(4*16)*co/group*(4*16*4)
上式中,(4*16*4)是forward4的基本输出块,方向分别对应h*c*w,其中16表示16个SL上的相同co的ho、wo的划分;Group1模式下,上面的group=1。该形状也即图19示意图的形状。
由此可见,Group1的情况下,16个SL平分一个输出特征图的Yo*Xo维度。输出时行内维度SL中的数据,与16个SL在Yo*Xo方向上平分输出神经元的方式一一对应。此场景适合输入神经元Y*X方向数值大,Co数值小。
Group4输出数据形状为:
ho/(2*4)*wo/(2*4)*co/group*(4*16*4)
上式中,(4*16*4)含义同上,不同的是16表示4个co在4个SL上的wo输出划分。Group4模式下,上面的group=4。
Group16输出数据形状为:
ho/4*wo/4*co/group*(4*16*4)
上述中,(4*16*4)含义同上,不同的是16表示16个co在16个SL上的输出划分。Group16模式下,上面的group=16。
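为便于核对上述各Group模式下的输出形状，下面给出一个示意性的形状推算函数。其中函数名与hw_split参数均为说明性假设，示例中的拆分取值仅为推算演示，具体的拆分方式与对齐要求以正文及表3为准：

```python
def forward4_output_shape(ho, wo, co, group, hw_split):
    """按Group模式与Ho*Wo拆分方式推算Forward4输出数据形状（示意）。

    hw_split为(h_split, w_split)，例如Group1下可取(4, 4)或(1, 16)；
    返回的最后一维4*16*4为Forward4的基本输出块，方向对应h*c*w。
    """
    h_split, w_split = hw_split
    return (ho // (4 * h_split), wo // (4 * w_split), co // group, 4 * 16 * 4)

# 与正文中的几种形状对应（假设各维度已按表3的要求对齐）：
print(forward4_output_shape(64, 64, 64, group=1,  hw_split=(4, 4)))    # ho/(4*4), wo/(4*4),  co/1,  1024
print(forward4_output_shape(64, 64, 64, group=1,  hw_split=(1, 16)))   # ho/4,     wo/(4*16), co/1,  1024
print(forward4_output_shape(64, 64, 64, group=4,  hw_split=(2, 2)))    # ho/(2*4), wo/(2*4),  co/4,  1024
print(forward4_output_shape(16, 16, 64, group=16, hw_split=(1, 1)))    # ho/4,     wo/4,      co/16, 1024
```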
由于Group在H*W方向上还有不同的拆分类别，上述中4*16*4里的16在具体的拆分上还有差异。由于Forward4以4B*4*4的块为计算单元，因此在计算时不可避免地存在对齐限制。对于不同的Group模式，以及相同Group模式下不同的H*W拆分方式，最终在计算时的对齐限制也不一样。在对齐的计算上，可以首先根据输出特征图的拆分方式确定ho*wo的对齐限制，再由ho*wo反推回hi*wi；由于输入神经元需要摆成拆分单元块的形式，从而还需要再对齐一次。上述对齐限制可以汇总如下表3：
表3、对齐限制（表格内容以图像形式给出）
综上,在输出时,硬件可以自动按照行内4*16*4(Y*SL*X)维度,行间Y*X*C维度的方式输出神经元。对于更大的卷积核同理。
Forward4方案中的偏置形状描述
偏置Bias是在卷积计算结束之后加到输出上的偏置数据，其原始格式为：[1 1 co]。
由于Forward4输出的数据格式为ho*wo*co/group*(4*16*4)，因此如果需要在片上直接对Forward4输出的数据加偏置，则需要改变偏置的基本形状。偏置在片上空间的摆放格式和Group分组模式有关。具体而言，各种分组模式下偏置的摆数如下：
Group1分组模式下，偏置的摆数：[1 1 co*64]
其中，64表示将单个偏置复制64次且连续摆放。
Group4分组模式下，偏置的摆数：[1 1 co*16]
其中，16表示将单个偏置复制16次且连续摆放。
Group16分组模式下，偏置的摆数：[1 1 co*4]
其中,4表示将单个偏置复制4次且连续摆放。
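下面给出一个按分组模式复制偏置的numpy示意。函数名为说明性假设，仅演示"将单个偏置连续复制若干次"这一摆数规则：

```python
import numpy as np

def layout_bias_for_forward4(bias, group):
    """把[1, 1, co]的偏置按分组模式摆成片上所需的复制格式（示意）。

    Group1复制64次、Group4复制16次、Group16复制4次，
    对应正文中[1 1 co*64] / [1 1 co*16] / [1 1 co*4]的摆数。
    """
    repeat = {1: 64, 4: 16, 16: 4}[group]
    co = bias.shape[-1]
    # np.repeat将每个偏置值连续复制repeat次
    return np.repeat(bias.reshape(co), repeat).reshape(1, 1, co * repeat)

bias = np.arange(8, dtype=np.float32).reshape(1, 1, 8)
print(layout_bias_for_forward4(bias, group=4).shape)   # (1, 1, 128)
```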
数据搬运过程
从前面对小卷积运算方案的描述可知，输入神经元和权值都需要进行拆分和存储维度变换，输出神经元也需要进行一定的维度变换。当基于例如图3b的多核计算装置的硬件结构时，出于硬件IO效率的考虑，输入数据需要先从全局内存中读取，加载之后存放在共享存储SRAM中。如前文所提到的，Forward4需要拆分神经元，考虑到对齐因素，其拆分的特点决定了Forward4在处理输入特征图比较大、通道数比较小的情况下更具有计算优势。因此，在涉及Forward4的硬件设计时，可以将较大的神经元存放在WRAM上，将相对较小的权值存放在NRAM中。同时，由于对权值和神经元数据需要摆放成前文描述的block的形式，因此，存放在WRAM上的神经元也需要经过一次NRAM，进行张量数据的形状变换。
图14示出了根据本披露实施例的整体的数据搬运过程。
如图所示，权值从片外存储（例如DDR）经由全局直接内存访问模块（GDMA）读取到SRAM中，并在SRAM上完成HW面对齐和填补pad操作。在数据从SRAM到NRAM的过程中利用分块指令（Tiling），既能完成数据搬运，也能同时完成数据的维度变换和对齐。
神经元的搬运过程与权值类似，只不过在通过分块指令搬运到NRAM之后，还需要搬运到WRAM上。由于神经元在计算时随着卷积核的滑动存在大量的数据重叠，如果重复搬运这些数据会大大降低数据搬运的效率。为了解决这一问题，本披露一些实施例中采用img2col指令进行分发数据，具体见后文详述。
输出数据可以回存到NRAM上,并且同样可以通过分块指令完成数据维度变化并搬运到SRAM上。接着,可以经由GDMA回存到片外存储DDR中。
分块指令的示例性原理
数据维度变化和摆数，是指将特定形状的张量数据摆放成所需要的另一特定形状的过程。数据搬运是指数据在不同的内存空间之间进行的读写操作。如前文所述，Forward4卷积运算方案要求用于卷积运算的神经元、权值都按照特定block模式摆放、对齐好；同时，输出数据也是按照Forward4特定的输出格式输出的。这就要求在计算之前将张量数据按block形式摆放好，并在计算结束之后按照要求摆回正常的张量形状。
在本披露实施例中，在输入的神经元、权值、偏置数据从SRAM搬运到NRAM的过程，以及输出数据从NRAM搬运到SRAM的过程中，均利用分块指令（Trans Tiling）来完成这一搬运操作。在这一过程中，既需要完成数据的基本搬运，还需要完成数据的维度变化和摆放，从而满足计算的需求。
Deform指令系列为IO数据通路提供了数据形状变换、数据类型转换的能力，主要包括TRANS（转置）、MOVE（搬运）、ROTATE（旋转）等功能。这一指令系列中实现转置功能的模式命名为Trans Tiling，主要是给小卷积的各种形状变换提供性能支持。Deform将一个3维数据块分成了内外两层，内层有三个维度（对应指令中的参数n0~n2），最低维的单位是字节，次低维和最高维是无单位的，代表上一层的个数。外层也有三个维度（对应指令中的参数n3~n5），均代表对应内层维度的倍数。
在实施小卷积拆分方案时,需要将以第一维度存储顺序(例如HWC)存储的输入数据以拆分单元为单位进行拆分、维度转换和存储,各个拆分单元内按照第二维度存储顺序(例如CHW)存储,拆分单元之间按照第三维度存储顺序(例如HWC)存储。
图15示出了根据本披露实施例的Trans Tiling的示意性概念图。
图中左图示出了变形前的输入数据。可以看出,三维的输入数据使用六个维度来描述,n0和n3对应原始三维数据的第一维(例如最低维),n1和n4对应原始三维数据的第二维(例如次低维),n2和n5对应数据块的第三维度(例如最高维)。在图中示例中,输入数据的内层对应拆分单元,以Forward4方案为例,输入数据的内层数据块为4B×4×4的数据块,其中n0=4B, n1=n2=4。
图中右图示出了变形后的输出数据。三维的输出数据同样使用六个维度来描述。此时，输出数据的内层对应变形后的拆分单元，在Forward4方案中，输出数据的内层数据块为64B×1×1的数据块，其中n0=64B，n1=n2=1。
此外,转置分块(Trans Tiling)还具有行内变换(Inline shuffle)功能,包括基于前配表(Pretable)的Tiling前行内变换,和基于后配表(Posttable)的Tiling后行内变换的功能。前配表是对Tiling输入的n0的数据进行重排的功能,后配表是对Tiling输出的n0的数据进行重排的功能。在不考虑表的标志位的情况下,前配表和后配表本质是一个表示64个字节数据位置的数组。
图16示出了前后配表的示意图。
如图所示，前后配表分别表示输入或输出的n0维度的一行数据的重排位置，其包括64B。每个表项的8比特位分别包括：6比特的Index位，记录此数据位存放的是原数据0~63字节中第几个字节的数据；1比特的zero_en位，表示是否置0，如该比特为1，则强制写0，此时Index（[5:0]位）无效；以及1比特的mask位，表示此位数据是否有效。
通过前后配表,可以在需要的时候对分块指令的输入数据的n0的数据进行重排,和/或对分块指令的输出数据的n0的数据进行重排。
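下面给出一个对64B的n0行数据应用前/后配表的示意实现。为便于阅读，表项用字典表示，仅演示重排逻辑；实际硬件中表项是打包的8比特字段，mask无效字节的具体处理以硬件定义为准：

```python
def apply_table(row64, table):
    """按前/后配表对一行64B数据进行重排（示意）。"""
    assert len(row64) == 64 and len(table) == 64
    out = bytearray(64)
    for i, e in enumerate(table):
        if not e.get("mask", 1):            # mask=0：该字节无效（此处示意为保持0）
            continue
        if e.get("zero_en", 0):             # zero_en=1：强制写0，Index无效
            out[i] = 0
        else:
            out[i] = row64[e["index"]]      # 取原数据中第index个字节
    return bytes(out)

# 示例：把64B数据按逆序重排
table = [{"index": 63 - i, "zero_en": 0, "mask": 1} for i in range(64)]
print(apply_table(bytes(range(64)), table)[:4])   # 前4个字节为原数据的第63、62、61、60字节
```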
表4示出了分块指令的各个参数的含义。假设需要进行分块的数据的位宽为dwidth,单位为B(字节),分块指令一次原子操作的数据量大小称为分块位宽T,单位为B(字节)。分块指令的参数中,需要用n0~n5,s1~s5共11个参数来描述内层数据和外层数据的张量形状,其中n0~n2,s1~s2是描述内层的参数,n3~n5,s3~s5是描述外层的参数。
表4、分块指令的参数含义（表格内容以图像形式给出）
对于分块指令执行前后的张量描述，输入张量、输出张量各需要一套参数，分别用in0~in5，is1~is5，on0~on5，os1~os5共计22个参数进行描述。分块指令可以支持多种分块位宽T，例如1B、2B、4B、6B、8B、16B、32B等等，基于不同的分块任务，可以设置相应的值。因此，分块指令中还包括分块位宽T这一参数。
分块指令在使用时,还存在一些基本的使用限制或约束条件,这些限制例如包括:in0,in1,in2,on0,on1,on2<=64;n0性能上要求64B对齐;in0=on1*on2*T,on0=in1*in2*T;in3*in4*in5=on3*on4*on5;T<=32B;前后配表=64B。
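下面给出一个按上述约束条件检查分块指令参数合法性的示意函数，参数以字典传入，字段名为说明性假设：

```python
def check_tiling_constraints(p):
    """校验分块指令参数是否满足正文列出的基本约束（示意，不含前后配表长度检查）。"""
    return (
        all(p[k] <= 64 for k in ("in0", "in1", "in2", "on0", "on1", "on2"))
        and p["in0"] == p["on1"] * p["on2"] * p["T"]     # in0 = on1*on2*T
        and p["on0"] == p["in1"] * p["in2"] * p["T"]     # on0 = in1*in2*T
        and p["in3"] * p["in4"] * p["in5"] == p["on3"] * p["on4"] * p["on5"]
        and p["T"] <= 32                                 # T <= 32B
    )
```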
此外，分块指令不能原位操作，也即需要两块存储区域。因此，在本披露实施例中，提供一种数据处理装置，包括控制电路、第一存储电路和第二存储电路。第一存储电路用于存储执行分块指令前的数据；第二存储电路用于存储执行分块指令后的数据。控制电路用于配置并执行分块指令。在一些实施例中，数据处理装置例如可以是图3b示出的多核计算装置中的处理器簇，控制电路例如是处理器簇内的处理器核，第一存储电路例如是处理器簇内的共享存储SRAM，而第二存储电路例如是处理器核内的NRAM。当针对不同数据（输入神经元、权值、输出神经元等）执行分块指令时，其需要实现的维度变化和搬运过程也不同，需要设计不同的分块指令配参方案。
神经元的分块指令通用方案
根据前文针对诸如Forward4一类的小卷积运算方案的描述可知,对于输入神经元而言,分块指令的作用在于在输入神经元从例如SRAM搬运到NRAM的过程中,将按第一维度存储顺序(例如HWC)存储的输入神经元以拆分单元为单位进行拆分、维度转换和存储,各个拆分单元内按照第二维度存储顺序(例如CHW)存储,拆分单元之间按照第三维度存储顺序(例如HWC)存储。拆分单元形状为CHW=U Ci×U H×U W。分块指令要求的对齐值M是U Ci的倍数。
具体地,对于Forward4方案中的神经元数据而言,需要将数据从[1*hi*wi*ci]摆放为:
[1*hi/4*wi/4*ci/4*(4*4*4)]
图17示出了根据本披露实施例的对神经元数据执行分块指令的示意图。
图中左图示出了分块处理前的神经元数据(也即分块指令的输入张量)。可以看出,三维的神经元数据[hi*wi*ci](此处省略N维度)被分成内外两层,各使用三个维度来描述。内层数据块1701的in0维度根据分块指令的限制,对齐到第一对齐值,例如M=64B;in1维度则可以根据拆分单元的形状设置为U W,此示例中为4;in2维度也可以根据拆分单元的形状设置为U H,此示例中为4。确定了内层数据块之后,外层三个维度in3、in4和in5的大小也可以相应地确定,其大小分别等于对应维度上含有该内层数据块的个数。
图中右图示出了分块处理后的神经元数据(也即分块指令的输出张量)。可以看出,此时的神经元数据形状变为[hi/4*wi/4*(ci*16)],其同样被分成内外两层,各使用三个维度来描述。由于需要对神经元数据按拆分单元进行拆分,因此结合分块指令的约束条件,分块位宽T可以设置为U Ci,也即一次原子操作的数据量为U Ci,从而便于以拆分单元为单位调整存储顺序。此时,内层数据块1702可以对应于输入张量的内层数据块1701,但是形状由M×U H×U W变为(M*U H*U W)×1×1,图中示为16个细长条构成的大长条。内层数据块的on0维度根据分块指令的限制条件,设置为in1*in2*T=M,也即64B;on1维度设置为in0/T=M/T,此示例中为16,on2维度设置为in0/T/on1=1。确定了内层数据块之后,外层三个维度on3、on4和on5的大小也可以相应地确定,其大小分别等于对应维度上含有输出张量的内层数据块的个数。
从上面的分块指令执行过程可以看出,虽然可以将输入神经元从[1*hi*wi*ci]摆放为[1*hi/4*wi/4*ci/4*(4*4*4)]的形式,但是最低维的4*4*4的拆分单元块依然还是HWC的顺序,还不是CHW的顺序。为了实现最低维拆分单元块内是CHW的顺序,需要使用前文描述的后配表。
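下面给出一个numpy等价示意，展示"分块指令+后配表"组合完成后的最终摆数效果。这里假设hi、wi、ci均已按要求对齐到4且省略N维度，函数名为说明性假设，仅用于说明目标数据排布，并非装置的实际实现：

```python
import numpy as np

def block_neuron_hwc_to_forward4(x_hwc):
    """把按HWC存储的神经元摆成Forward4拆分单元形式（numpy等价示意）。

    输入形状[hi, wi, ci]，输出形状[hi/4, wi/4, ci/4, 4, 4, 4]，
    其中最后三维为拆分单元内的C、H、W（块内CHW，块间HWC）。
    """
    hi, wi, ci = x_hwc.shape
    assert hi % 4 == 0 and wi % 4 == 0 and ci % 4 == 0
    x = x_hwc.reshape(hi // 4, 4, wi // 4, 4, ci // 4, 4)   # 维度依次为Hb,Uh,Wb,Uw,Cb,Uc
    return x.transpose(0, 2, 4, 5, 1, 3).copy()             # 块间按HWC排列，块内按CHW排列

x = np.arange(8 * 8 * 8).reshape(8, 8, 8)
print(block_neuron_hwc_to_forward4(x).shape)   # (2, 2, 2, 4, 4, 4)
```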
在一些实施例中,数据处理装置中的控制电路可以进一步用于按如下配置分块指令:设置分块指令的后配表,以将分块指令的输出张量的内层最低维数据按照后配表的指示进行重排。具体地,控制电路可以进一步用于按如下设置后配表:将按照第一维度存储顺序(例如HWC)排列的输出张量的内层最低维度on0数据转换成按照第二维度存储顺序(例如CHW)排列。
此时的后配表仅是获取输出张量的on0,也即输入的in1*in2*T=M=64B数据量的写入顺序。这个64B数据的写入顺序和数据的数据位宽dwidth有关。在一些实施例中,可以按照如下表5的伪代码所示的逻辑来配置后配表。
表5、神经元分块指令后配表制表伪代码（伪代码内容以图像形式给出）
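在此基础上，下面给出一个生成"块内HWC重排为CHW"后配表的示意。其中表项字段、块内数据进入on0时的排布顺序均为说明性假设，实际制表逻辑以表5的伪代码为准：

```python
def make_posttable_hwc_to_chw(dwidth, U_h=4, U_w=4, U_ci_bytes=4):
    """生成把64B块内数据从HWC顺序重排为CHW顺序的后配表（示意）。

    dwidth为元素位宽（字节），表项只填index，zero_en/mask均按有效处理。
    """
    c_elems = U_ci_bytes // dwidth                       # 块内C方向的元素个数
    table = []
    for c in range(c_elems):                             # 目标顺序：C最慢、H次之、W最快
        for h in range(U_h):
            for w in range(U_w):
                src_elem = (h * U_w + w) * c_elems + c   # 源数据在块内按HWC顺序排列
                for b in range(dwidth):
                    table.append({"index": src_elem * dwidth + b, "zero_en": 0, "mask": 1})
    return table

assert len(make_posttable_hwc_to_chw(dwidth=1)) == 64    # INT8
assert len(make_posttable_hwc_to_chw(dwidth=2)) == 64    # FP16
```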
在实际运算中，神经元的形状是动态变化的，也即ci的大小是任意的。由于分块指令要求内层数据块的最低维n0对齐到M=64B，因此需要将整个分块处理过程分成两段来执行：第一段为64B对齐的整数段，第二段为未达到64B的余数段。
由此,在一些实施例中,数据处理装置中的控制电路可以将输入数据(例如神经元)按照输入通道Ci维度分成整数段和余数段,其中整数段的Ci维度大小对齐到对齐值M,余数段的Ci维度大小则小于M。继而,可以针对整数段配置并执行第一分块指令,以及针对余数段配置并执行第二分块指令。
可以理解,取决于不同的ci值,可能只存在整数段、或者只存在余数段、或者整数段和余数段同时存在。假设ci中对齐了64B的整数段长度大小是ci_full,未对齐64B的余数段是ci_rem。比如,INT8类型的神经元1*256*256*96,ci=96,那么ci_full=64,ci_rem=32。
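下面给出一个按Ci维度划分整数段与余数段的示意计算，函数名为说明性假设：

```python
def split_ci_segments(ci, dwidth, M=64):
    """把Ci维度分成64B对齐的整数段与不足64B的余数段（示意）。

    dwidth为数据位宽（字节），例如INT8为1、FP16为2；返回值以元素个数计。
    """
    per_seg = M // dwidth                  # 一个64B段可容纳的元素个数
    ci_full = ci // per_seg * per_seg      # 对齐到64B的整数段
    ci_rem = ci - ci_full                  # 未达到64B的余数段
    return ci_full, ci_rem

print(split_ci_segments(96, dwidth=1))     # (64, 32)，对应正文INT8神经元ci=96的例子
print(split_ci_segments(100, dwidth=2))    # (96, 4)
```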
表6示出了神经元数据在执行分块指令前后的形状变化。
表6、神经元数据分块处理前后的形状变化（表格内容以图像形式给出）
注意,表6中的形状假设是已经按照前文表3中关于Forward4不同Group模式和不同H*W拆分方式下的对齐限制中的要求对齐之后的参数。
对于整数段部分,可以参考前文结合图17描述的内容配置第一分块指令。
在一个实施例中,可以按如下配置第一分块指令的输入张量的参数:将第一分块指令中的输入张量的内层最低维大小in0设置为M,内层次低维大小in1设置为U W,内层最高维大小in2设置为U H;以及根据输入数据的整数段各维度大小,设置第一分块指令中的输入张量的三个外层维度的大小值in3、in4和in5,其中三个外层维度的大小值分别表示对应维度上含有输入张量的内层数据块的个数。
附加地,在一个实施例中,可以按如下配置第一分块指令的输出张量的参数:将第一分块指令中的输出张量的内层最低维大小on0设置为in1*in2*T,内层次低维大小on1设置为M/T,内层最高维大小on2设置为1;以及根据输入数据的整数段各维度大小,设置第一分块指令中的输出张量的三个外层维度的大小值on3、on4和on5,其中三个外层维度的大小值分别表示对应维度上含有输出张量的内层数据块的个数。
除了维度参数之外，还需要设置维度步长。在一些实施例中，可以基于所设置的输入张量和输出张量各自六个维度的维度大小以及处理前的输入数据各维度大小，设置除了内层最低维之外其余五个维度上相邻数据点在存储空间上的步长。在一些实施例中，可以根据分块指令的约束条件，以及拆分单元在处理前后的形状变换，将分块位宽T设置为U Ci。
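按照上述文字规则，可以示意性地组装整数段第一分块指令的维度参数，如下面的草图所示。其中字段名、默认值均为说明性假设，步长从略，且假设输出张量外层各维度的内层块个数与输入张量一致，实际取值以下文表7为准：

```python
def forward4_neuron_int_segment_params(hi, wi, ci_full, dwidth,
                                        M=64, U_h=4, U_w=4, U_ci_bytes=4):
    """按正文文字规则推算整数段第一分块指令的维度参数（示意）。"""
    T = U_ci_bytes                                        # 分块位宽T取U_Ci（4B）
    params = {
        "T": T,
        # 输入张量内层：最低维对齐到M字节，次低维U_W，最高维U_H
        "in0": M, "in1": U_w, "in2": U_h,
        # 输入张量外层：对应维度上含有内层数据块的个数
        "in3": ci_full * dwidth // M, "in4": wi // U_w, "in5": hi // U_h,
        # 输出张量内层：on0 = in1*in2*T，on1 = M/T，on2 = 1
        "on0": U_w * U_h * T, "on1": M // T, "on2": 1,
    }
    params["on3"], params["on4"], params["on5"] = params["in3"], params["in4"], params["in5"]
    return params

p = forward4_neuron_int_segment_params(hi=256, wi=256, ci_full=64, dwidth=1)
assert check_tiling_constraints(p)   # 满足前文示意的基本约束
```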
在一个示例中,当针对Forward4方案,也即拆分单元为U Ci×U H×U W=4B×4×4,M=64B时,可以按如下表7配置针对整数段的第一分块指令。
表7、神经元整数段分块指令配参方案（表格内容以图像形式给出）
其中ci、hi、wi分别表示输入数据的Ci、H和W维度的数据个数,dwidth表示数据位宽,ci_full表示整数段的Ci维度的数据个数,B表示字节,T表示分块位宽,is1~is5表示输入张量的五个维度步长,os1~os5表示输出张量的五个维度步长。
对于余数段部分,可以基于整数段部分稍作调整来配置第二分块指令。
在一个实施例中,可以按如下配置第二分块指令:根据整数段的Ci维度大小,设置针对余数段执行的第二分块指令的输入张量偏置和输出张量偏置,其中输入张量偏置表示处理前的余数段相对于输入数据的起始存储地址的偏移,输出张量偏置表示处理后的余数段相对于输出数据的起始存储地址的偏移。通过设置第二分块指令的输入张量偏置和输出张量偏置,可以在考虑到整数段部分的内存存储空间后,调整第二分块指令的输入张量地址和输出张量地址。
在一个实施例中,可以按如下配置第二分块指令的输入张量的参数:将第二分块指令中的输入张量的内层最低维大小in0设置为R,R是余数段的Ci维度大小,将内层次低维大小in1设置为U W,将内层最高维大小in2设置为U H;以及根据输入数据的余数段各维度大小,设置第二分块指令中的输入张量的三个外层维度的大小值in3、in4和in5,其中三个外层维度的大小值分别表示对应维度上含有该输入张量的内层数据块的个数。
附加地,在一个实施例中,可以按如下配置第二分块指令的输出张量的参数:将第二分块指令中的输出张量的内层最低维大小on0设置为in1*in2*T,将内层次低维大小on1设置为R/T,将内层最高维大小on2设置为1;以及根据输入数据的余数段各维度大小,设置第二分块指令中的输出张量的三个外层维度的大小值on3、on4和on5,其中三个外层维度的大小值分别表示对应维度上含有该输出张量的内层数据块的个数。
类似地,控制电路可以基于所设置的输入张量和输出张量各自六个维度的维度大小以及处理前的输入数据各维度大小,设置除了内层最低维之外其余五个维度上相邻数据点在存储空间上的步长。
在一个示例中,当针对Forward4方案,也即拆分单元为U Ci×U H×U W=4B×4×4,M=64B时,可以按如下表8配置针对余数段的第二分块指令。
表8、神经元余数段分块指令配参方案（表格内容以图像形式给出）
其中ci、hi、wi分别表示输入数据的Ci、H和W维度的数据个数,dwidth表示数据位宽,ci_rem表示所述余数段的Ci维度的数据个数,B表示字节,T表示分块位宽,is1~is5表示输入张量的五个维度步长,os1~os5表示输出张量的五个维度步长。
由此,本披露实施例提供了一种针对神经元数据的分块处理方案。当神经元数据的形状为任意时,可以通过两段式分块处理,来实现将任意形状的神经元数据从[1*hi*wi*ci]摆放为[1*hi/4*wi/4*ci/4*(4*4*4)]的形式。
权值的分块指令通用方案
权值数据的分块处理和神经元数据的分块处理是类似的。具体地，对于Forward4方案中的权值数据而言，需要将数据从[co*kh*kw*ci]摆放为：
[co*kh/4*kw/4*ci/4*(4*4*4)]
与神经元数据不同的是,权值数据多了一个co维度。由于co维度和kh维度是连续的,因此可以将co维度并入kh维度中。
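下面给出一个权值摆数的numpy等价示意，复用前文针对神经元给出的block_neuron_hwc_to_forward4示意函数（函数名为说明性假设）：

```python
import numpy as np

def block_weight_for_forward4(w_co_kh_kw_ci):
    """把[co, kh, kw, ci]的权值摆成Forward4拆分单元形式（numpy等价示意）。

    由于co维度与kh维度在存储上连续，先把co并入kh，
    然后复用与神经元相同的摆数方式。
    """
    co, kh, kw, ci = w_co_kh_kw_ci.shape
    merged = w_co_kh_kw_ci.reshape(co * kh, kw, ci)    # co并入kh
    return block_neuron_hwc_to_forward4(merged)        # 复用神经元的分块摆数

w = np.zeros((8, 8, 8, 8), dtype=np.int8)
print(block_weight_for_forward4(w).shape)   # (16, 2, 2, 4, 4, 4)
```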
表9示出了权值数据在执行分块指令前后的形状变化。
表9、权值数据分块处理前后的形状变化（表格内容以图像形式给出）
注意,表9中的形状假设是已经按照前文表3中关于Forward4不同Group模式和不同H*W拆分方式下的对齐限制中的要求对齐之后的参数。
通过将co维度并入kh维度,针对权值数据仍然可以使用前文针对神经元数据描述的方案进行分块处理。具体地,在一些实施例中,可以针对任意规模的权值数据采用两段式处理,也即整数段分块处理和余数段分块处理。详细的分块指令配置方案可以参考前文描述。
在一个示例中,当针对Forward4方案,也即拆分单元为U Ci×U H×U W=4B×4×4,M=64B时,可以按如下表10配置针对权值数据的整数段的第一分块指令。
表10、权值整数段分块指令配参方案（表格内容以图像形式给出）
可选地或附加地,在一个示例中,当针对Forward4方案,也即拆分单元为U Ci×U H×U W=4B×4×4,M=64B时,可以按如下表11配置针对权值数据的余数段的第二分块指令。
表11、权值余数段分块指令配参方案（表格内容以图像形式给出）
本披露实施例还提供了利用前述数据处理装置执行分块指令的数据处理方法。本领域技术人员可以理解,执行分块指令的方法步骤与前面结合附图描述的计算装置的各个特征相对应,因此前面描述的特征同样适用于方法步骤,此处不再重复。
本披露实施例还提供了一种芯片,其可以包括前面结合附图描述的任一实施例的数据处理装置。进一步地,本披露还提供了一种板卡,该板卡可以包括前述芯片。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器计算簇、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个***,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。

Claims (20)

  1. 一种数据处理装置,包括控制电路、第一存储电路和第二存储电路,其中:
    所述第一存储电路用于存储处理前的数据;
    所述第二存储电路用于存储处理后的数据;以及
    所述控制电路用于配置并执行分块指令,以将按照第一维度存储顺序存储在第一存储电路上的输入数据以拆分单元为单位进行拆分并存储为第二存储电路上的输出数据,其中在所述第二存储电路上,各个拆分单元内按照第二维度存储顺序存储,拆分单元之间按照第三维度存储顺序存储。
  2. 根据权利要求1所述的数据处理装置,其中所述控制电路进一步用于按如下配置分块指令:
    设置分块指令的后配表,以将分块指令的输出张量的内层最低维数据按照所述后配表的指示进行重排。
  3. 根据权利要求2所述的数据处理装置,其中所述控制电路进一步用于按如下设置所述后配表:
    将按照所述第一维度存储顺序排列的所述输出张量的内层最低维度数据转换成按照所述第二维度存储顺序排列。
  4. 根据权利要求2-3任一所述的数据处理装置,其中所述控制电路进一步用于:
    将所述输入数据按照输入通道Ci维度分成整数段和余数段,其中所述整数段的Ci维度大小对齐到对齐值M,所述余数段的Ci维度大小小于所述M;
    针对所述整数段配置并执行第一分块指令;以及
    针对所述余数段配置并执行第二分块指令。
  5. 根据权利要求4所述的数据处理装置,其中所述拆分单元的形状为CHW=U Ci×U H×U W,其中C表示输入通道维度,H表示高度维度,W表示宽度维度,其中M是U Ci的倍数。
  6. 根据权利要求5所述的数据处理装置,其中,当存在所述整数段时,所述控制电路进一步用于按如下对所述整数段配置第一分块指令:
    将所述第一分块指令中的输入张量的内层最低维大小in0设置为M,内层次低维大小in1设置为U W,内层最高维大小in2设置为U H;以及
    根据所述输入数据的所述整数段各维度大小,设置所述第一分块指令中的输入张量的三个外层维度的大小值in3、in4和in5,其中所述三个外层维度的大小值分别表示对应维度上含有所述内层数据块的个数。
  7. 根据权利要求6所述的数据处理装置,其中所述控制电路进一步用于按如下对所述整数段配置第一分块指令:
    将所述第一分块指令中的输出张量的内层最低维大小on0设置为in1*in2*T,内层次低维大小on1设置为M/T,内层最高维大小on2设置为1,其中T为所述分块指令的分块位宽,其表示所述分块指令一次原子操作的数据量;以及
    根据所述输入数据的所述整数段各维度大小,设置所述第一分块指令中的输出张量的三个外层维度的大小值on3、on4和on5,其中所述三个外层维度的大小值分别表示对应维度上含有所述内层数据块的个数。
  8. 根据权利要求7所述的数据处理装置，其中所述控制电路进一步用于按如下对所述整数段配置第一分块指令：
    基于所设置的输入张量和输出张量各自六个维度的维度大小以及处理前的输入数据各维度大小,设置除了内层最低维之外其余五个维度上相邻数据点在存储空间上的步长。
  9. 根据权利要求8所述的数据处理装置，其中所述控制电路进一步用于按如下对所述整数段配置第一分块指令：
    根据分块指令的约束条件以及所述拆分单元在处理前后的形状变换,将所述分块位宽T设置为U Ci
  10. 根据权利要求4-9任一所述的数据处理装置,其中,当存在所述余数段时,所述控制电路进一步用于按如下配置所述第二分块指令:
    根据所述整数段的Ci维度大小,设置针对所述余数段执行的第二分块指令的输入张量偏置和输出张量偏置,其中所述输入张量偏置表示处理前的所述余数段相对于所述输入数据的起始存储地址的偏移,所述输出张量偏置表示处理后的所述余数段相对于所述输出数据的起始存储地址的偏移。
  11. 根据权利要求10所述的数据处理装置，其中所述控制电路进一步用于按如下对所述余数段配置第二分块指令：
    将所述第二分块指令中的输入张量的内层最低维大小in0设置为R,R是所述余数段的Ci维度大小,将内层次低维大小in1设置为U W,将内层最高维大小in2设置为U H;以及
    根据所述输入数据的所述余数段各维度大小,设置所述第二分块指令中的输入张量的三个外层维度的大小值in3、in4和in5,其中所述三个外层维度的大小值分别表示对应维度上含有所述内层数据块的个数。
  12. 根据权利要求11所述的数据处理装置,其中所述控制电路进一步用于按如下对所述余数段配置第二分块指令:
    将所述第二分块指令中的输出张量的内层最低维大小on0设置为in1*in2*T,将内层次低维大小on1设置为R/T,将内层最高维大小on2设置为1;以及
    根据所述输入数据的所述余数段各维度大小,设置所述第二分块指令中的输出张量的三个外层维度的大小值on3、on4和on5,其中所述三个外层维度的大小值分别表示对应维度上含有所述内层数据块的个数。
  13. 根据权利要求12所述的数据处理装置，其中所述控制电路进一步用于按如下对所述余数段配置第二分块指令：
    基于所设置的输入张量和输出张量各自六个维度的维度大小以及处理前的输入数据各维度大小,设置除了内层最低维之外其余五个维度上相邻数据点在存储空间上的步长。
  14. 根据权利要求9-13任一所述的数据处理装置，其中当U Ci×U H×U W=4B×4×4，M=64B时，所述第一分块指令的参数按表1配置：
    表1
    （表1的内容以图像形式给出）
    其中ci、hi、wi分别表示输入数据的Ci、H和W维度的数据个数,dwidth表示数据位宽,ci_full表示所述整数段的Ci维度的数据个数,B表示字节,T表示分块位宽,is1~is5表示输入张量的五个维度步长,os1~os5表示输出张量的五个维度步长。
  15. 根据权利要求13-14任一所述的数据处理装置，其中当U Ci×U H×U W=4B×4×4，M=64B时，所述第二分块指令的参数按表2配置：
    表2
    （表2的内容以图像形式给出）
    其中ci、hi、wi分别表示输入数据的Ci、H和W维度的数据个数,dwidth表示数据位宽,ci_rem表示所述余数段的Ci维度的数据个数,B表示字节,T表示分块位宽,is1~is5表示输入张量的五个维度步长,os1~os5表示输出张量的五个维度步长。
  16. 根据权利要求1-15任一所述的数据处理装置,其中所述输入数据是包括H、W、C三个维度的神经元数据,所述第一维度存储顺序为HWC,所述第二维度存储顺序为CHW,所述第三维度存储顺序为HWC,其中C表示输入通道维度,H表示高度维度,W表示宽度维度。
  17. 根据权利要求1-15任一所述的数据处理装置，其中所述输入数据是包括Ci、Co、Kh、Kw四个维度的权值数据，其中Co维度与Kh维度合并成H维度，Ci维度对应C维度，Kw维度对应W维度，所述第一维度存储顺序为HWC，所述第二维度存储顺序为CHW，所述第三维度存储顺序为HWC，其中Ci表示权值输入通道维度，Co表示权值输出通道维度，Kh表示权值高度维度，Kw表示权值宽度维度，C表示输入通道维度，H表示高度维度，W表示宽度维度。
  18. 一种芯片,包括根据权利要求1-17任一所述的数据处理装置。
  19. 一种板卡,包括根据权利要求18所述的芯片。
  20. 一种利用根据权利要求1-17任一所述的数据处理装置对输入数据执行分块指令的数据处理方法。
PCT/CN2022/100302 2021-09-26 2022-06-22 数据处理装置、数据处理方法及相关产品 WO2023045445A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111129610.8 2021-09-26
CN202111129610.8A CN113850380A (zh) 2021-09-26 2021-09-26 数据处理装置、数据处理方法及相关产品

Publications (1)

Publication Number Publication Date
WO2023045445A1 true WO2023045445A1 (zh) 2023-03-30

Family

ID=78979679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100302 WO2023045445A1 (zh) 2021-09-26 2022-06-22 数据处理装置、数据处理方法及相关产品

Country Status (2)

Country Link
CN (1) CN113850380A (zh)
WO (1) WO2023045445A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118041706A (zh) * 2024-04-12 2024-05-14 深圳市中农网有限公司 一种基于crm下的农产品数据双模式存储方法

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850380A (zh) * 2021-09-26 2021-12-28 安徽寒武纪信息科技有限公司 数据处理装置、数据处理方法及相关产品
TWI814618B (zh) * 2022-10-20 2023-09-01 創鑫智慧股份有限公司 矩陣運算裝置及其操作方法
CN115796239B (zh) * 2022-12-14 2023-10-31 北京登临科技有限公司 Ai算法架构的实现装置、卷积计算装置及相关方法与设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285715A1 (en) * 2017-03-28 2018-10-04 Samsung Electronics Co., Ltd. Convolutional neural network (cnn) processing method and apparatus
CN111079917A (zh) * 2018-10-22 2020-04-28 北京地平线机器人技术研发有限公司 张量数据分块存取的方法及装置
CN112416433A (zh) * 2020-11-24 2021-02-26 中科寒武纪科技股份有限公司 一种数据处理装置、数据处理方法及相关产品
CN113850380A (zh) * 2021-09-26 2021-12-28 安徽寒武纪信息科技有限公司 数据处理装置、数据处理方法及相关产品

Also Published As

Publication number Publication date
CN113850380A (zh) 2021-12-28

Similar Documents

Publication Publication Date Title
WO2023045445A1 (zh) 数据处理装置、数据处理方法及相关产品
WO2023045446A1 (zh) 计算装置、数据处理方法及相关产品
CN112799599B (zh) 一种数据存储方法、计算核、芯片和电子设备
WO2022134873A1 (zh) 数据处理装置、数据处理方法及相关产品
CN113469336A (zh) 优化神经网络模型的编译方法、执行方法及相关产品
CN112084023A (zh) 数据并行处理的方法、电子设备及计算机可读存储介质
CN113850379A (zh) 数据处理装置、数据处理方法及相关产品
CN113850377A (zh) 数据处理装置、数据处理方法及相关产品
CN113469337B (zh) 用于优化神经网络模型的编译方法及其相关产品
WO2022095675A1 (zh) 神经网络稀疏化的装置、方法及相关产品
CN114692844A (zh) 数据处理装置、数据处理方法及相关产品
CN114281561A (zh) 处理单元、用于处理单元的同步方法及相应产品
WO2022134872A1 (zh) 数据处理装置、数据处理方法及相关产品
CN114691353A (zh) 一种张量的读取方法、装置以及相关产品
WO2023087698A1 (zh) 执行卷积运算的计算装置、方法及相关产品
WO2023045638A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2023087814A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
CN113867800A (zh) 计算装置、集成电路芯片、板卡、电子设备和计算方法
WO2022063183A1 (zh) 执行神经网络计算的装置、板卡、方法及可读存储介质
WO2022257980A1 (zh) 计算装置、利用计算装置实施卷积运算的方法及相关产品
WO2022135599A1 (zh) 融合分支结构的装置、板卡、方法及可读存储介质
WO2022135600A1 (zh) 计算神经网络的装置、板卡、方法及可读存储介质
CN113850378A (zh) 数据处理装置、数据处理方法及相关产品
CN113837923A (zh) 数据处理装置、数据处理方法及相关产品
CN113837921A (zh) 数据处理装置、数据处理方法及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871506

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE