WO2023123919A1 - Data processing circuit, data processing method and related products - Google Patents

Data processing circuit, data processing method and related products

Info

Publication number
WO2023123919A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
convolution
input data
dimension
dimensional
Application number
PCT/CN2022/100306
Other languages
English (en)
French (fr)
Inventor
郑鎏韬
徐健
孙尧
Original Assignee
寒武纪行歌(南京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 寒武纪行歌(南京)科技有限公司
Publication of WO2023123919A1 publication Critical patent/WO2023123919A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491Computations with decimal numbers radix 12 or 20.
    • G06F7/498Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • the present disclosure relates generally to the field of data processing. More specifically, the present disclosure relates to a data processing circuit, a data processing method, a chip and a board.
  • LiDAR point cloud data is usually sparse, and the point density varies drastically due to factors such as uneven sampling of 3D space, the effective range of sensors, occlusions, and relative poses. Therefore, traditional convolutional neural networks suited to dense data become very inefficient when applied to such sparse data; in particular, when convolution operations are involved, a large amount of computing power and other resources is wasted on zero-valued data points.
  • the solution disclosed in the present disclosure provides a data processing circuit, a data processing method, a chip, and a board.
  • the present disclosure discloses a data processing circuit, including: a control circuit, a storage circuit, and an operation circuit, wherein:
  • the control circuit is configured to control the storage circuit and the operation circuit to perform N-dimensional convolution processing on the input data and the convolution kernel, where N>1 and N represents the number of convolution dimensions along which sliding accumulation is performed in the convolution operation, and wherein the input data is sparse data represented in a dense form;
  • the storage circuit is configured to store information, including at least information before, during, and/or after the processing;
  • the operation circuit is configured to perform, under the control of the control circuit, multiple one-dimensional convolution operations on the input data and the convolution kernel to obtain multi-path operation results and the corresponding output point coordinates in the first convolution dimension; and to merge the multi-path operation results into one path of fused data as the result of the convolution operation according to the corresponding output point coordinates, wherein operation results with the same output point coordinates are accumulated.
  • the present disclosure provides a chip, including the data processing circuit of any one of the embodiments of the foregoing first aspect.
  • the present disclosure provides a board, including the chip in any embodiment of the second aspect.
  • the present disclosure provides a method of processing data using the aforementioned data processing circuit.
  • the embodiments of the present disclosure provide a convolution scheme suitable for sparse data, which only operates on non-zero/non-empty data and the convolution kernel, thereby greatly reducing the amount of computation and improving processing efficiency.
  • the sparse convolution scheme provided by the embodiments of the present disclosure can be applied to multi-dimensional convolution operations, including but not limited to two-dimensional convolution and three-dimensional convolution, and thus can be applied to the processing of LiDAR point cloud data.
  • Fig. 1 shows a structural diagram of a board according to an embodiment of the present disclosure;
  • Fig. 2 shows a structural diagram of a combined processing device according to an embodiment of the present disclosure;
  • Fig. 3a shows a schematic diagram of the internal structure of a single-core computing device according to an embodiment of the present disclosure;
  • Fig. 3b shows a schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure;
  • Fig. 4a shows the operational principle of a conventional convolution scheme;
  • Fig. 4b shows an exemplary principle of a sparse convolution operation scheme according to an embodiment of the present disclosure;
  • Fig. 5 shows an exemplary representation method of input data according to an embodiment of the present disclosure;
  • Fig. 6 shows an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure;
  • Fig. 7 shows an example of splitting of an input data block according to an embodiment of the present disclosure;
  • Fig. 8 shows a flowchart of an exemplary method for screening valid input data points according to an embodiment of the present disclosure;
  • Fig. 9 shows a schematic diagram of scanning traversal for the third input parameter according to an embodiment of the present disclosure;
  • Fig. 10 shows a schematic diagram of constructing a Q matrix according to an embodiment of the present disclosure;
  • Fig. 11 shows exemplary logic for calculating the wo coordinates of output data according to an embodiment of the present disclosure;
  • Fig. 12 shows a schematic structural diagram of a data processing circuit according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure.
  • the board 10 includes a chip 101, which is a system-on-chip (System on Chip, SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements for the storage capacity and computing power of the platform.
  • the board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage, and powerful computing capabilities.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105 .
  • the storage device 104 is connected and data transmitted with the control device 106 and the chip 101 through the bus.
  • the control device 106 in the board 10 is configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • Fig. 2 is a block diagram showing the combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201 , an interface device 202 , a processing device 203 and a storage device 204 .
  • the computing device 201 is configured to perform operations specified by the user and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete user-specified operations.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 .
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 can also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control, including but not limited to data transfer and starting and/or stopping the computing device 201.
  • the processing device 203 may be one or more types of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU) or other general-purpose and/or special-purpose processors.
  • these processors include but are not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like, and their number can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.
  • the storage device 204 is used to store data to be processed. It may be a DRAM (DDR memory), usually 16G or larger in size, and is used to store data of the computing device 201 and/or the processing device 203.
  • Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device.
  • the computing device 301 is used to process input data in fields such as computer vision, speech, natural language, and data mining.
  • the computing device 301 includes three modules: a control module 31 , a computing module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks; it includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
  • the storage module 33 is used to store or transfer relevant data, including a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (direct memory access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons and calculated intermediate results;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • DMA 333 is connected to DRAM 204 through bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
  • Fig. 3b shows a simplified schematic diagram of the multi-core internal structure of the computing device 201.
  • Multi-core computing devices can be abstracted using a hierarchical hardware model. As shown in the figure, the multi-core computing device can be abstracted into four levels, namely card level (Card) 350 , chip level (Chip) 360 , processor cluster level (Cluster) 370 and processor core level (Core) 380 .
  • the embodiments of the present disclosure mainly involve data transmission between the storage unit and the calculation unit, so the drawings and description briefly introduce only the relevant computing structure and omit other parts.
  • each board contains local DDR storage, and each processor chip acts as a computing and control unit.
  • each processor chip contains multiple multiprocessors as computing units.
  • each multiprocessor includes multiple accelerator cores as control and computing units, and a shared storage SRAM as a storage unit.
  • each accelerator core contains local storage and an array of local processing units.
  • NFU refers to the Neuron Function Unit, which is used for convolution calculations.
  • the storage model includes board global memory, SRAM (shared memory) on the Cluster, NRAM, WRAM and registers on the Core, and the like.
  • SRAM is included in the storage processing unit MPU (Memory Process Unit Core, abbreviated MPU or Mem Core).
  • Mem Core refers to an intelligent processing core (Intelligent Process Unit Core, referred to as IPU Core or Core) in a multi-core computing device.
  • IPU Core contains NRAM, WRAM, NFU and so on.
  • Cluster refers to a processor cluster or a computing cluster.
  • a multi-core computing device includes several Clusters, and a Cluster includes 1 Mem Core+N IPU Cores.
  • the embodiments of the present disclosure provide a data processing circuit based on the aforementioned hardware environment, which supports convolution operations on sparse data.
  • the convolution processing associated with sparse data such as LiDAR point cloud data can be simplified and accelerated.
  • the sparse convolution scheme provided by the embodiments of the present disclosure can be applied to multi-dimensional convolution operations, including but not limited to two-dimensional convolution and three-dimensional convolution.
  • two-dimensional convolution is used as an example for illustration.
  • N represents the number of convolution dimensions for performing sliding accumulation in the convolution operation.
  • the convolution kernel performs translation and accumulation in two dimensions (eg, width W and height H) according to corresponding convolution steps.
  • the convolution kernel performs translation and accumulation in three dimensions (for example, width W, height H, and depth D) according to corresponding convolution steps.
  • the "non-convolution dimension” mentioned in the embodiments of the present disclosure refers to a dimension on which the convolution kernel does not perform sliding accumulation. There may be different required operations on different non-convolutional dimensions. For example, for conventional convolution, the input channel dimension Ci is required to be accumulated, and the output channel dimension Co is not accumulated; for another example, for depth-wise conv, the input channel dimension Ci is not accumulated.
  • Figure 4a shows the operational principle of a conventional convolution scheme.
  • the convolution kernel 410 is dense and is a 3 ⁇ 3 matrix, and the numbers in the convolution kernel are corresponding weight data.
  • the input data 420 is a 7 ⁇ 7 matrix, which is sparse with only four non-zero data: 2, 3, 5 and 6, as shown by the dark squares.
  • the convolution stride is set to 2 in both dimensions, the padding is 0, and there is no dilation.
  • the 3 ⁇ 3 gray square in the figure represents the sliding accumulation process of the convolution kernel on the input data.
  • 430 shows the calculation at the beginning of the convolution
  • 440 shows the calculation of one slide to the right (step size 2)
  • 450 shows the calculation of one slide down (step size 2).
  • the weight data of the convolution kernel and the input data are multiplied and accumulated.
  • 460 is the final calculation result as output data.
  • the output data is a matrix of size 3×3. It can be seen that the calculation of 430 corresponds to the data at coordinates (0,0) in the output data, the calculation of 440 corresponds to the data at coordinates (0,1), and the calculation of 450 corresponds to the data at coordinates (1,0).
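As a sketch of this conventional dense scheme (not the patent's circuit), a plain 2-D convolution with stride 2, no padding, and no dilation can be written in a few lines of Python. The exact positions of the four non-zero values 2, 3, 5 and 6 are not stated in the text, so the placement below is hypothetical:

```python
def conv2d(inp, kernel, stride=2):
    """Dense 2-D convolution with no padding/dilation: the kernel slides over
    the input with the given stride in both dimensions (Fig. 4a scheme)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for oy in range(0, len(inp) - kh + 1, stride):
        row = []
        for ox in range(0, len(inp[0]) - kw + 1, stride):
            # multiply-accumulate over the kh*kw window
            row.append(sum(kernel[ky][kx] * inp[oy + ky][ox + kx]
                           for ky in range(kh) for kx in range(kw)))
        out.append(row)
    return out

# Hypothetical 7x7 sparse input with four non-zero points (positions assumed),
# all-ones 3x3 kernel for easy checking; stride 2 yields a 3x3 output.
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
inp = [[0] * 7 for _ in range(7)]
inp[0][1], inp[0][2], inp[2][0], inp[2][6] = 2, 3, 5, 6
out = conv2d(inp, kernel, stride=2)
```

Note that most windows here touch only zeros, which is exactly the waste the sparse scheme below avoids.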
  • the convolution kernel is dense, and its input format can be the same as that of conventional convolution; the input data, however, is sparse, and its input format can differ from that of conventional convolution, thereby saving storage space.
  • these multiply-accumulate operations can be further divided into per-row one-dimensional convolutions, whose partial results are then accumulated element-wise across rows.
  • the N-dimensional convolution operation can be divided into multiple one-dimensional convolution operations for implementation.
  • FIG. 4b shows an exemplary principle of a sparse convolution scheme according to an embodiment of the present disclosure.
  • FIG. 4b still uses the data in FIG. 4a as an example to describe the sparse convolution operation scheme of the embodiment of the present disclosure.
  • one row of the operation result 460 corresponds to three rows of the input data 420; for example, the first three rows of input data are used to calculate the first row of output data, the middle three rows (overlapping the first three rows and the last three rows) are used to calculate the second row of output data, and the last three rows are used to calculate the last row of output data.
  • the sparse data can be filtered at the granularity of the input data required for one row of operation results (three rows of input data in the illustrated example), referred to here as the "first filtering granularity", thereby reducing invalid operations.
  • for example, if the middle three rows of input data are all zeros, that is, they contain no non-sparse points, the corresponding convolution operations can be omitted.
  • the two-dimensional (H, W dimension) convolution operation can be split into three one-dimensional (W dimension) convolution operations.
  • the sliding of the original 3*3 convolution window over three rows of input data to compute one row of output data can be converted into three 1*3 convolution windows (shown as dotted boxes) sliding over the three rows of input data (as shown in 470), obtaining three rows of partial-sum results (as shown in 480), which are then accumulated element-wise to obtain the final row of output data (460).
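The decomposition above can be sketched as follows (a hedged illustration, not the patent's circuit; the all-ones kernel and row values are hypothetical). Each kernel row is convolved 1-D along W, and the three partial-sum rows are then accumulated element-wise:

```python
def conv1d(row, krow, stride=2):
    """Slide a 1*len(krow) window along one input row (W dimension)."""
    return [sum(k * row[x + i] for i, k in enumerate(krow))
            for x in range(0, len(row) - len(krow) + 1, stride)]

def row_of_2d_conv(rows3, kernel, stride=2):
    """One row of 2-D convolution output, computed as three 1-D convolutions
    (one per kernel row, cf. 470) whose partial sums (cf. 480) are
    accumulated element-wise into the final output row (cf. 460)."""
    partials = [conv1d(rows3[r], kernel[r], stride) for r in range(3)]
    return [sum(vals) for vals in zip(*partials)]

# Hypothetical three input rows (the middle one all zeros, so its 1-D
# convolution could be skipped entirely) and an all-ones 3x3 kernel.
rows3 = [[0, 2, 3, 0, 0, 0, 0], [0] * 7, [5, 0, 0, 0, 0, 0, 6]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
out_row = row_of_2d_conv(rows3, ones, stride=2)
```

The all-zero middle row illustrates the second filtering granularity: its 1-D convolution contributes nothing and can be omitted.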
  • when the two-dimensional convolution operation is decomposed into multiple one-dimensional convolution operations, sparse data can be filtered during the operation at the granularity of one row (referred to here as the "second filtering granularity"), thereby reducing invalid operations.
  • the input data of the third row is all 0, that is, there is no non-sparse point, so the convolution operation of this row can be omitted.
  • sparse data may also be filtered at a granularity of a one-dimensional convolution window (herein referred to as “the third filtering granularity”), thereby further reducing invalid operations.
  • the sparse convolution operation may include the following steps: performing multiple one-dimensional convolution operations on the convolution kernel and the sparse input data to obtain multi-path operation results (such as product results or multiply-accumulate results, considering that some dimensions, such as the input channel dimension Ci, need to be accumulated) and the corresponding output point coordinates in the first convolution dimension; and then merging the multi-path operation results into one path of fused data according to their corresponding output point coordinates as the result of the sparse convolution operation. During the merging process, operation results with the same output point coordinates are accumulated.
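The merging step can be sketched as follows (a minimal illustration, assuming each path is a list of hypothetical (output_coordinate, value) pairs; the patent's circuit merges in hardware):

```python
def merge_paths(paths):
    """Merge multi-path (output_coord, value) partial results into one fused
    path; results sharing the same output coordinate are accumulated."""
    fused = {}
    for path in paths:
        for coord, val in path:
            fused[coord] = fused.get(coord, 0) + val
    return sorted(fused.items())  # one fused path, ordered by coordinate

# Three paths of partial results; coordinate 0 and 2 each appear twice
# and are accumulated during the merge.
fused = merge_paths([[(0, 5), (1, 3)], [(2, 0)], [(0, 5), (2, 6)]])
```
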
  • when applied to an N-dimensional convolution operation, N>1, the N-dimensional convolution operation can be split into M one-dimensional convolution operations, where M equals the product of the sizes of the convolution kernel in the N-1 convolution dimensions other than the first convolution dimension along which the one-dimensional convolutions are performed.
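The count M can be stated in one line (dimension ordering here is an assumption for illustration; the first convolution dimension is the one along which the 1-D convolutions slide):

```python
from math import prod

def num_1d_convs(kernel_sizes, first_conv_dim=0):
    """M = product of the kernel sizes in the N-1 convolution dimensions
    other than the first convolution dimension."""
    return prod(s for i, s in enumerate(kernel_sizes) if i != first_conv_dim)

# 2-D 3x3 kernel sliding 1-D along W -> M = 3, matching the three 1*3
# windows of Fig. 4b; 3-D 3x3x3 kernel -> M = 9.
m_2d = num_1d_convs([3, 3])
m_3d = num_1d_convs([3, 3, 3])
```
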
  • In the sparse convolution operation, input data, a convolution kernel, and output data are involved.
  • the input data is also called input neuron in the neural network, and the output data is called output neuron.
  • In convolution operations involving LiDAR point cloud data, the convolution kernel is dense, and its input format can be the same as that of regular convolution. In two-dimensional convolution, the size of the convolution kernel is usually 3*3, and a single convolution needs to accumulate 3*3*ci numbers; in three-dimensional convolution, the size of the convolution kernel is usually 3*3*3, and a single convolution needs to accumulate 3*3*3*ci numbers.
  • the input data to be convolutionally processed may include multi-dimensional data, and it is sparse in multiple dimensions.
  • the input data is detection data in three-dimensional space, characterizing the gray value, RGB, signal strength, etc. of each coordinate point in the three-dimensional space. Therefore, according to the information content to be represented, the input data element at each coordinate point can be one-dimensional, two-dimensional, three-dimensional, or higher-dimensional data. Due to the characteristics of point cloud data, coordinate points with non-zero data elements are sparse, that is, sparse in the three spatial dimensions (e.g., width W, height H, and depth D).
  • preprocessing may be performed before the sparse input data is provided to the operation circuit for operation.
  • preprocessing may include, for example: combining the sparse dimensions into one dimension; densifying the sparse data points of the input data in the combined dimension; and using several input parameters to represent the values and coordinates of the densified input data.
  • Fig. 5 shows an exemplary representation method of input data (sparse neurons).
  • the input data here may be data that has already been padded according to the requirements of the convolution operation.
  • a preprocessing operator can be used to convert the sparse input data into a dense input data representation, thereby saving storage space.
  • the input data 510 in sparse form includes five dimensions: the batch (B) dimension, the HWD three-dimensional space dimensions, and the input channel (Ci) dimension.
  • the input data is sparse in the B dimension and the HWD three-dimensional space.
  • the dark squares in the HWD three-dimensional matrix in the figure represent places with values (called valid input points), and all other parts are zero values.
  • There are multiple such HWD stereo matrices along the B dimension, and the sparse pattern (that is, the positions of the dark squares) of each stereo matrix may be different.
  • the input data is dense in the input channel Ci dimension, which is the lowest dimension. Due to the limitations of the drawing, 510 in the figure only shows four dimensions; the Ci dimension can be understood as the thickness of each dark square.
  • the size of the Ci dimension is uniform, that is, the thickness of each dark square is the same.
  • when converting the sparse input data 510 into dense input data, it can be represented with reference to the CSR format for sparse matrices.
  • non-zero element values are sometimes called effective element values.
  • the storage of a sparse matrix stores not only the values of the non-zero elements but also their coordinate positions (row index, column index).
  • the CSR storage method is called compressed sparse row format.
  • the CSR method uses three arrays to store the sparse matrix, which respectively store row pointers, column indexes and values.
  • the length of the column index array and the numeric array is the number of nonzero elements in the sparse matrix.
  • the row pointer array stores the offset of the first non-zero element of each row from the first non-zero element of the sparse matrix, and its last element stores the total number of non-zero elements in the sparse matrix; therefore, the length of the row pointer array is the number of rows of the sparse matrix plus 1. It can be understood that, by the definition of the row pointer array, the number of non-zero elements in a row can be obtained by subtracting that row's pointer value from the next row's pointer value.
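The three CSR arrays just described can be built in a few lines (a generic CSR sketch, not the patent's circuit; the 3×3 example matrix is hypothetical):

```python
def to_csr(matrix):
    """Build the three CSR arrays: row pointers, column indices, values."""
    row_ptr, col_idx, vals = [0], [], []
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(vals))  # running count = offset where next row starts
    return row_ptr, col_idx, vals

# Hypothetical sparse matrix: row 1 is empty, rows 0 and 2 hold two
# non-zeros each. row_ptr[i+1] - row_ptr[i] counts non-zeros in row i,
# and the last row_ptr entry is the total number of non-zeros.
row_ptr, col_idx, vals = to_csr([[0, 2, 3], [0, 0, 0], [5, 0, 6]])
```
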
  • three input parameters may be used for representation.
  • the first input parameter is the effective dense input data, that is, the densely arranged sparse data, represented by Min; the shape of Min is ain*ci, where ain is the number of non-sparse points in the input data and ci is the size of the input channel dimension.
  • a batch-by-batch processing method may be adopted, and the batches may be split across different processor cores (such as the cores in Fig. 3b) for processing.
  • for example, assuming the input has 12 batches, then based on the hardware environment of the example in Fig. 3b, up to 12 batches can be processed in parallel at one time.
  • ain corresponds to the number of valid input points of the three dimensions of HWD.
  • the second input parameter is the coordinate or index of each valid input point in the W dimension, represented by wi_coord, and its shape is 1*ain. As shown in 522 in the figure, for the 4 valid input points in the figure, wi_coord is [1,2,0,6].
  • the third input parameter is data in CSR (compressed sparse row) format of the input data along the H dimension (H direction), represented by hin. hin stores the offset of the first non-zero element of each row from the first non-zero element of the input data, and its last element stores the total number of non-zero elements in the input data.
  • the third input parameter may have multiple dimensions.
  • the dimensions of the third input parameter, from high to low, are: batch (B), depth in_d, and height in_h; its shape is B*Din*(Hin+1), where B, Din, and Hin are the B-, D-, and H-dimension sizes of the input data in sparse form, respectively.
  • the last element "4" represents the total number of all non-zero elements.
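The three input parameters can be sketched for a single H*W plane (a hedged illustration with ci folded to 1; the exact plane layout is assumed, chosen so that wi_coord matches the [1,2,0,6] cited above):

```python
def plane_to_input_params(plane):
    """Convert one H*W plane (Ci dimension omitted, i.e. ci=1) into the
    three input parameters: Min (dense values of valid points), wi_coord
    (W index of each valid point), and hin (CSR row pointers along H)."""
    m_in, wi_coord, hin = [], [], [0]
    for row in plane:
        for w, v in enumerate(row):
            if v != 0:
                m_in.append(v)
                wi_coord.append(w)
        hin.append(len(m_in))  # offset where the next H row starts
    return m_in, wi_coord, hin

# Hypothetical plane: valid points at W indices 1, 2 (first row) and
# 0, 6 (last row); the middle row is empty.
plane = [[0, 2, 3, 0, 0, 0, 0], [0] * 7, [5, 0, 0, 0, 0, 0, 6]]
m_in, wi_coord, hin = plane_to_input_params(plane)
```

The last element of hin equals ain, the total number of valid input points.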
  • the first output parameter is the effective dense output data, that is, the densely arranged sparse data, represented by Mout; the shape of Mout is aout*co, where aout is the number of non-sparse points in the output data and co is the size of the output channel dimension.
  • the second output parameter is the coordinate or index of each effective output point in the W dimension, represented by wo_coord, and its shape is 1*aout.
  • the third output parameter is data in CSR (compressed sparse row) format of the output data in the H direction, represented by hout.
  • hout stores the offset of the first non-zero element of each row in the output from the first non-zero element of the output data, and its last element stores the total number of non-zero elements in the output data.
  • the third output parameter is similar to the third input parameter and may also have multiple dimensions, for example, from high to low: batch (B), depth out_d, and height out_h; its shape is B*Dout*(Hout+1), where B, Dout, and Hout are the B-, D-, and H-dimension sizes of the output data in sparse form, respectively.
  • the output data includes 4 effective output points; the corresponding W-dimension coordinates wo_coord 532 are [0,1,0,2], the corresponding H-direction CSR data hout 533 is [0,2,2,4], and the first output parameter is not shown in the figure.
  • the coordinates of the effective output points in the output data can be determined according to the coordinates of the input data. Therefore, in the sparse convolution operation scheme of the embodiments of the present disclosure, the operation process can be disassembled into several steps of coordinate calculation and value calculation.
  • FIG. 6 shows an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure.
  • in step 610, the data to be processed by each processor core in each pass is first screened out based on the coordinates of the input data.
  • the input data here is data in CSR format that has been zero-padded and converted from sparse to dense form.
  • splitting may be performed according to the dimensions of the output data.
  • the convolution operation results may be calculated row by row. Therefore, the splitting method may be: each processor core calculates one row of Wo-dimension output data points at a time.
  • the shape of the input data block corresponding to one row of Wo-dimension output data points is (kz*ci)*ky*wi, where wi is the W-dimension size of the input data, ci is the Ci-dimension size of the input data, and kx, ky, and kz are the sizes of the convolution kernel in the W, H, and D dimensions, respectively.
  • Fig. 7 shows an example of splitting the input data block, where the gray input data block corresponds exactly to one row of Wo-dimension output data points after the convolution operation is completed.
  • screening may be performed according to the third input parameter of the input data, namely hin. From the meaning of hin, subtracting the i-th value from the (i+1)-th value yields the number of non-zero elements (valid input data points) in the i-th row. Based on this property of hin, it can therefore be judged whether valid input data points exist, for screening.
  • FIG. 8 shows a flowchart of an exemplary method for screening valid input data points according to an embodiment of the disclosure.
  • the third input parameter is loaded from an external storage circuit to an on-chip memory (such as SRAM).
  • the size of the storage space required by the third input parameter is (Hin+1+2*ph)*(Din+2*pd)*dwidth, where Din and Hin are the size of D dimension and H dimension of the input data in sparse form, respectively, ph and pd are the one-sided padding amounts of the H dimension and D dimension respectively, and dwidth is the data bit width.
  • step 812 traverse the third input parameter with the specified scanning window (the first scanning window) and the specified scanning step (the first scanning step), so as to find valid input data points.
  • the size of the first scanning window corresponds to the input data required to calculate a row of Wo-dimensional output data points, that is, corresponds to the "first screening granularity" mentioned above.
  • the size of the first scanning window is determined according to the size of the convolution kernel. Specifically, it is kz*(ky+1): kz corresponds to the size of the scanning window in the D dimension, and ky+1 to its size in the H dimension, because the third input parameter uses the CSR format, in which ky rows of H-dimension data require ky+1 values to represent.
  • the first scanning step is equal to the H-dimensional convolution step Sy.
  • a scanning window in which valid input data points are detected may be referred to as a block.
  • the hi and di coordinates corresponding to the block can be recorded, so that the corresponding ho and do coordinates in the output data can be calculated according to the hi and di coordinates.
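The first-granularity screening above can be sketched as follows for the simplified two-dimensional case (no D dimension): the CSR offset array hin is scanned with a window of ky+1 entries and step Sy, and a window whose offsets do not change contains no valid points and is skipped. The function name and the example hin values are illustrative assumptions, not part of the disclosure.

```python
def find_blocks(hin, ky, sy):
    """Scan the CSR-style row-offset array hin with a window of ky+1
    entries and step sy; a window whose offsets do not change covers
    rows with no valid input points and is skipped.

    Returns the hi coordinate (top row) of each window that contains
    at least one valid input point, i.e. one "block" per output row.
    """
    blocks = []
    hi = 0
    while hi + ky + 1 <= len(hin):
        if hin[hi + ky] - hin[hi] > 0:  # any valid point in rows hi..hi+ky-1?
            blocks.append(hi)
        hi += sy
    return blocks

# Hypothetical hin: rows 0 and 1 hold 2 points each, rows 2-6 are empty
hin = [0, 2, 4, 4, 4, 4, 4, 4]
print(find_blocks(hin, ky=3, sy=2))  # [0] -> only the first window has work
```

Only the window starting at hi=0 sees a change in the offsets, so only one block is emitted and the remaining windows are skipped without any convolution work.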
  • the found blocks may be sent to the processor cores (IPUs); for example, N_IPU blocks are sent to N_IPU different IPUs, respectively.
  • each IPU may be notified of the H- and D-dimension coordinates, i.e., the ho and do coordinates, of the output points it processes in the output data. It can be understood that when an IPU calculates the value of an output point (wo_x, ho, do), wo_x varies from 0 to (wo-1) while ho and do are fixed.
  • in this example, the size of the scanning window is 3*4 values.
  • the hin data of these 3 rows has not changed, that is, there is no valid input data point, so no processing is needed and scanning can continue to the next step.
  • the do coordinates corresponding to these 4 blocks are all 0, and the ho coordinates are 0, 2, 3, 4 respectively.
  • each IPU allocated data to be processed (indicated by the aforementioned block) can fetch the corresponding data according to the block's indication, construct the matrix to be convolved (hereafter referred to as the Q matrix), and calculate the output point coordinate information at the same time.
  • the Q matrix is constructed by fetching data from the shared memory SRAM, which is preloaded with input data from an external storage circuit.
  • the wi_coord vector of the valid input data points in the input data block corresponding to the allocated block may first be obtained from the second input parameter wi_coord. Next, according to this wi_coord vector, the corresponding input data can be traversed with the second scanning window and second scanning step to extract the corresponding input data points from the dense valid input data Min (that is, the first input parameter) to construct the Q matrix.
  • the block is obtained from the third input parameter hin, which records the distance between the first valid input data point of each row and a specified point (that is, the first valid input data point of the entire input data); based on this information, a specified amount of data can be taken from the specified position of the second input parameter wi_coord to form the corresponding wi_coord vector.
  • the wi_coord vector refers to the vector formed by the wi coordinates of all valid input data points in one W-dimension row of the input data. For example, if one row has 34 valid input data points in the W dimension, the vector has length 34, and each vector element is the wi coordinate of the corresponding data point.
  • the wi_coord vectors can be filtered during their construction. Since the difference between two adjacent elements of hin gives the number of valid input data points in the corresponding row, the wi_coord vector is empty for any row whose difference is 0, and such empty wi_coord vectors can be filtered out.
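The construction and filtering of wi_coord vectors described above can be sketched as follows; the function name and the example hin/wi_coord values are illustrative assumptions, using only the hin/wi_coord layout as described in the text.

```python
def build_wi_coord_vectors(hin, wi_coord_flat, row_ids):
    """For each requested row, slice its wi coordinates out of the flat
    wi_coord array using the hin offsets; rows whose two adjacent hin
    entries are equal yield an empty vector and are filtered out."""
    vectors = {}
    for r in row_ids:
        start, end = hin[r], hin[r + 1]
        if end > start:              # skip rows with no valid points
            vectors[r] = wi_coord_flat[start:end]
    return vectors

hin = [0, 2, 2, 5]                  # 3 rows; row 1 is empty
wi_coord_flat = [1, 4, 0, 2, 6]     # wi coords of the 5 valid points
print(build_wi_coord_vectors(hin, wi_coord_flat, [0, 1, 2]))
# {0: [1, 4], 2: [0, 2, 6]}
```

Row 1 is dropped entirely because hin[2] - hin[1] = 0, which is exactly the "second screening granularity" filtering step.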
  • This screening step corresponds to the "second screening granularity" described above.
  • the input data corresponding to the wi_coord vector can be traversed with the second scan window and the second scan step to extract the corresponding input data point from the first input parameter Min to construct the matrix Q.
  • FIG. 10 shows a schematic diagram of constructing a Q matrix according to an embodiment of the present disclosure.
  • the figure only shows the Q matrix construction of 3 rows w, and similar constructions can be carried out for w of other rows.
  • the wi_coord vector can be constructed based on the information in the third input parameter hin, that is, by determining whether the current row to be scanned has any valid input data points. Specifically, the number of valid input data points in the i-th row can be determined from the difference between the (i+1)-th and i-th values in hin.
  • the input data corresponding to the wi_coord vector is traversed with the second scan window and the second scan step to extract corresponding valid input data points to construct a matrix Q.
  • scanning is performed row by row to construct corresponding rows of the Q matrix.
  • the data covered by a second scanning window in which valid input data points are detected is extracted and tiled in order to form the corresponding row of the matrix Q; second scanning windows in which no valid input data points are detected are skipped.
  • the size of the second scanning window corresponds to the size of the convolution window of the convolution operation in the first convolution dimension (for example, the W-dimension size kx), and the second scanning step corresponds to the convolution stride Sx of the convolution operation in the W dimension.
  • This step of scanning and screening corresponds to the "third screening granularity" described above.
  • the input data is scanned and traversed row by row along the W dimension; when valid input data points exist in a scanning window, the input data covered by that window is extracted. The window data extracted each time is then expanded and tiled sequentially along the W dimension to construct the Q matrix.
  • a scanning window of size 1*3 is used, with a step of 2 data elements.
  • the scanning result is: there are 2 valid input data points in the scanning window 1004 , there are 2 valid input data points in the scanning window 1005 , and there is no valid input data point in the scanning window 1006 .
  • the result of extracting the data and constructing the Q matrix is shown in the second row of 1040, which is composed of the data covered by the scanning windows 1004 and 1005.
  • since the input data is in a compact (dense) form, when judging whether a valid input data point lies in a scanning window, whether it is located in a given window can be determined from the coordinate information of the input data; specifically, it can be judged from the block (1030) allocated to the IPU and the constructed wi_coord vector (1020). It can be understood that a scanning window essentially corresponds to one output point (a partial sum), so the wo coordinate of the output data point it contributes to can be deduced from the wi coordinate of a valid input data point, thereby determining whether the point falls into one or more scanning windows.
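The third-granularity scan described above can be sketched as follows: one input row is walked with windows of width kx and step Sx, window membership is decided from the wi coordinates alone, and kept windows are tiled into a Q-matrix row. The function name and the example point positions are illustrative assumptions.

```python
def scan_row(wi_vec, values, kx, sx, w_in):
    """Walk one input row with windows of width kx and step sx; for each
    window covering at least one valid wi coordinate, extract the kx
    values (zeros where no valid point exists) and tile them into the
    corresponding segment of a Q-matrix row.  Membership is decided from
    coordinates alone, since the input data is stored densely.

    wi_vec: wi coordinates of valid points; values: their data (aligned).
    Returns (tiled Q-row segments, wo coordinate of each kept window).
    """
    valid = dict(zip(wi_vec, values))
    q_row, wo_coords = [], []
    n_windows = (w_in - kx) // sx + 1
    for wo in range(n_windows):
        lo = wo * sx
        if any(c in valid for c in range(lo, lo + kx)):
            q_row.extend(valid.get(lo + t, 0) for t in range(kx))
            wo_coords.append(wo)
    return q_row, wo_coords

# Row of width 7, kernel kx=3, stride sx=2: valid points at wi = 1 and 2
print(scan_row([1, 2], [5.0, 7.0], kx=3, sx=2, w_in=7))
# ([0, 5.0, 7.0, 7.0, 0, 0], [0, 1])
```

The point at wi=2 contributes to two adjacent windows (wo=0 and wo=1), while the third window covers no valid point and is skipped, so no segment is emitted for it.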
  • FIG. 11 illustrates exemplary logic for calculating wo coordinates of output data points according to an embodiment of the present disclosure.
  • in this example, the convolution kernel size is 3*3*3 and the convolution stride is 2 in the H, W, and D directions; if the convolution parameters (specifically the kernel size kx and the stride Sx) change, the mapping relationship may change accordingly.
  • when valid input data points are extracted to construct the Q matrix, the Q matrix can also be constructed according to whether wi_coord is odd or even. Specifically, the position of an input data point within the second scanning window can be determined from the parity of its wi coordinate. For example, input data points with odd wi coordinates must fall in the middle of a second scanning window, while input data points with even wi coordinates fall at two adjacent positions of two adjacent second scanning windows.
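For the kx=3, Sx=2 case of the example, this parity rule can be verified with a small sketch (the helper name is hypothetical): window wo covers wi in [2*wo, 2*wo+2], so an odd wi lands only in the middle of one window, while an even wi lands at two adjacent positions of two adjacent windows.

```python
def windows_for_wi(wi, kx=3, sx=2, n_windows=None):
    """Return (wo, position-in-window) pairs for a valid point at wi.
    Window wo covers coordinates wo*sx .. wo*sx + kx - 1."""
    hits = []
    for pos in range(kx):
        if (wi - pos) % sx == 0:
            wo = (wi - pos) // sx
            if wo >= 0 and (n_windows is None or wo < n_windows):
                hits.append((wo, pos))
    return hits

print(windows_for_wi(5))  # odd wi -> middle of one window: [(2, 1)]
print(windows_for_wi(4))  # even wi -> two adjacent windows: [(2, 0), (1, 2)]
```

This gives each valid point's window membership and in-window position directly from its coordinate, without inspecting the data values.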
  • the value of the effective input data point can be read from, for example, the shared memory SRAM, and stored in the on-chip memory (such as NRAM).
  • a Q matrix can thus be constructed. Each of the M non-empty wi_coord vectors can be processed in turn in the above manner; the Q matrix constructed in this way has M rows, each composed of Li second scanning windows, where Li depends on the number of output data points of that row, as counted earlier when calculating the wo coordinates.
  • the counted number of output data points can be used to calculate the third output parameter hout in the output data.
  • in step 630, after the matrix Q is constructed, M one-dimensional convolution operations can be performed on it. The M one-dimensional convolution kernels of these operations are obtained by splitting the original N-dimensional convolution kernel along the first convolution dimension, and the convolution stride of the one-dimensional convolution equals the size of the second scanning window, that is, the size of the N-dimensional convolution kernel in the first convolution dimension.
  • the first convolution dimension is the W dimension, so the convolution step of the one-dimensional convolution operation is equal to kx.
  • M ways of partial sums can be obtained through the M one-dimensional convolution operations, all corresponding to output points of the same row of wo.
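Because each Q-matrix row is a tiling of kx-wide windows laid end to end, a one-dimensional convolution with kernel width kx and stride kx produces exactly one partial sum per window. The sketch below illustrates this (numpy assumed; the row values and kernel are hypothetical):

```python
import numpy as np

def conv1d_over_tiled_row(q_row, kernel_1d):
    """Each Q-matrix row consists of kx-wide windows laid end to end,
    so a 1D convolution with stride kx reduces to one dot product per
    window: reshape into (n_windows, kx) and multiply by the kernel."""
    kx = kernel_1d.size
    q = np.asarray(q_row, dtype=float).reshape(-1, kx)
    return q @ kernel_1d           # one partial sum per extracted window

q_row = [0, 5, 7, 7, 0, 0]         # two tiled windows of width kx=3
k_row = np.array([1.0, 2.0, 3.0])  # one row split from the N-D kernel
print(conv1d_over_tiled_row(q_row, k_row))  # partial sums 31.0 and 7.0
```

Each resulting value is one partial sum, to be merged later with partial sums of the other M-1 ways according to its wo coordinate.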
  • the corresponding wo coordinates are also determined for each partial-sum result.
  • in step 640, the M ways of partial-sum results are merged into one way of fused data according to their corresponding output data point coordinates, yielding the final result for the corresponding row of wo output data points.
  • partial-sum results with the same wo coordinates are accumulated.
  • the merging process described above can be performed in a number of ways.
  • the merge-and-fuse process may be implemented in hardware, for example by a hardware MERGE instruction.
  • the basic function of the MERGE instruction is to merge multiple ways of data to be fused into one way of fused data according to their index order, accumulating data with the same index.
  • software may be used to implement the merging and fusion processing.
  • the sorting in the merge-and-fuse process may be implemented by a fully vectorized sorting algorithm on a multi-core processor. The bang_add operator is then called to traverse the sorted data: when the coordinates are the same, the values are added directly; when they differ, no addition is needed and the traversal simply continues.
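A minimal software analogue of this merge step can be sketched as follows: sort the (coordinate, value) pairs of all partial-sum ways, then accumulate equal coordinates in one pass (the bang_add traversal is replaced here by a plain loop, and the data values are hypothetical).

```python
def merge_partial_sums(ways):
    """Merge several ways of (wo, value) partial sums into one fused way:
    sort by coordinate, then accumulate values with equal coordinates."""
    flat = sorted(p for way in ways for p in way)   # sort by wo coordinate
    fused = []
    for wo, val in flat:
        if fused and fused[-1][0] == wo:
            fused[-1][1] += val                     # same index: accumulate
        else:                                       # new index: keep going
            fused.append([wo, val])
    return [(wo, val) for wo, val in fused]

ways = [[(0, 31.0), (1, 7.0)], [(1, 2.0), (2, 4.0)]]
print(merge_partial_sums(ways))  # [(0, 31.0), (1, 9.0), (2, 4.0)]
```

The two partial sums at wo=1 collapse into a single accumulated value, exactly as the MERGE instruction does for data with equal indices.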
  • the convolution operation scheme of sparse data in the embodiment of the present disclosure has been described above from various aspects. Compared with conventional convolution schemes, the schemes of the disclosed embodiments only perform operations on non-zero/non-null data in sparse data, which can avoid excessive invalid operations, greatly save computation, and improve processing efficiency. Further, the input data is screened through different levels (for example, three levels) of screening granularity, so that data that needs to perform convolution operations can be extracted quickly and efficiently.
  • the sparse convolution scheme provided by the embodiments of the present disclosure can be especially suitable for processing based on LiDAR point cloud data.
  • Embodiments of the present disclosure also provide a data processing circuit for performing the convolution operation of the aforementioned sparse data, and a data processing method implemented by the data processing circuit.
  • FIG. 12 exemplarily shows a schematic structural diagram of a data processing circuit that can implement an embodiment of the present disclosure.
  • the data processing circuit 1200 includes a control circuit 1210 , a storage circuit 1220 and an operation circuit 1230 .
  • the control circuit 1210 is responsible for various functions of the data processing circuit 1200, including but not limited to control, instruction fetching, decoding, and computation.
  • the control circuit 1210 may include, for example, the control module 31 in FIG. 3 .
  • the control circuit 1210 may be configured to control the storage circuit 1220 and the operation circuit 1230 to perform N-dimensional convolution processing on the input data and the convolution kernel, where N>1 and N denotes the number of convolution dimensions along which sliding accumulation is performed in the convolution operation.
  • the input data is sparse data represented in a dense form.
  • control circuit 1210 may be configured to: filter the input data blocks allocated to the operation circuit 1230 in the current round according to the input parameters of the input data, wherein the operation circuit 1230 calculates a row of W-dimensional output data in each round.
  • the storage circuit 1220 can be used to store information, including at least information before and/or after processing, as well as intermediate information to be cached during processing; it can be, for example, the various RAMs shown in Fig. 3, or an on-chip cache. In some embodiments, the storage circuit 1220 may be configured to store the input data, the convolution kernel, and the convolution results, and/or to cache intermediate results such as partial sums, or to provide the cache space required during execution of the MERGE instruction.
  • the arithmetic circuit 1230 may be configured to perform various arithmetic operations according to related instructions.
  • the operation circuit 1230 can be configured to, under the control of the control circuit 1210, perform multiple one-dimensional convolution operations on the input data and the convolution kernel to obtain multi-way operation results and the corresponding output point coordinates in the first convolution dimension; and to merge the multi-way operation results, according to their corresponding output point coordinates, into one way of fused data as the convolution result, where operation results with the same output point coordinates are accumulated.
  • the above N-dimensional convolution operation is split into M one-dimensional convolution operations, where M equals the product of the sizes of the convolution kernel in the N-1 convolution dimensions other than the first convolution dimension.
  • the operation circuit 1230 may further include an arithmetic processing circuit (not shown), which may be configured, according to the operation instruction, to preprocess data before the operation circuit performs an operation or to post-process data afterwards.
  • the foregoing preprocessing and postprocessing may, for example, include data splitting and/or data splicing operations.
  • the operation circuit 1230 may include multiple processor cores, each of which may process the input data block allocated by the control circuit 1210 each time, for example calculating one row of W-dimension output points each time.
  • each processor core may be further configured to: construct the matrix Q on which the one-dimensional convolution operations are to be performed, based on an allocated block indicating a block of input data to be processed; calculate the coordinates, in the first convolution dimension, of the output points of each partial-sum result of the one-dimensional convolution operations; perform multiple one-dimensional convolution operations on the matrix Q to obtain the multi-way partial sums; and merge and fuse the multi-way partial sums to obtain the final convolution result.
  • although the step of determining the coordinates is described as being performed by the operation circuit, those skilled in the art will understand that this step can also be performed in software, for example by the control circuit.
  • although each processing step is generally described as being executed on the operation circuit, the operation circuit here may also be distributed, for example comprising operation circuits in a heterogeneous system, so that one part of the operations is executed on a CPU, for example, and another part on a GPU.
  • preprocessing of the input data, which may include, for example, densification of the input data in sparse form, may be performed on a CPU, for example.
  • the one-dimensional convolution of the input data with the convolution kernel and the merging and fusion of the multi-way partial sums can be performed on a GPU, thereby fully exploiting the advantages of the heterogeneous system.
  • the present disclosure also provides a chip, which may include the data processing device of any embodiment described above with reference to the accompanying drawings. Further, the present disclosure also provides a board, which may include the aforementioned chip.
  • the electronic equipment or devices of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships, and/or cars;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners, and/or electrocardiographs.
  • the electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal.
  • electronic devices or equipment with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or equipment with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and that of the terminal device and/or edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate those of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure describes some methods and embodiments thereof as a series of actions and combinations of actions, but those skilled in the art will understand that the solution of the present disclosure is not limited by the order of the described actions. Therefore, according to the disclosure or teaching of the present disclosure, certain steps may be performed in other orders or simultaneously. Further, the embodiments described in the present disclosure can be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required to realize one or more solutions of the present disclosure. In addition, depending on the scheme, the descriptions of some embodiments in this disclosure have different emphases; for parts not described in detail in a certain embodiment, reference may also be made to the related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory (Resistive Random Access Memory, RRAM), dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Static Random Access Memory (Static Random Access Memory, SRAM), Enhanced Dynamic Random Access Memory (Enhanced Dynamic Random Access Memory, EDRAM), High Bandwidth Memory (High Bandwidth Memory , HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM and RAM, etc.


Abstract

The present disclosure provides a data processing circuit, a data processing method, and related products. The data processing circuit may be implemented as a computing device included in a combined processing device, which may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further include a storage device, connected to the computing device and the other processing devices respectively, for storing data of the computing device and the other processing devices. The solution of the present disclosure provides a convolution processing scheme for sparse data, which can simplify processing and improve the processing efficiency of a machine.

Description

Data Processing Circuit, Data Processing Method, and Related Products
Cross-Reference to Related Applications
The present disclosure claims priority to Chinese Patent Application No. 202111642096.8, filed on December 29, 2021, and entitled "Data Processing Circuit, Data Processing Method, and Related Products".
Technical Field
The present disclosure relates generally to the field of data processing. More specifically, it relates to data processing circuits, data processing methods, chips, and boards.
Background
In recent years, great progress has been made in object detection, instance segmentation, and keypoint detection based on convolutional neural networks. These detections are typically based on LiDAR data or RGB-D data and can be applied in fields such as autonomous driving and robot vision.
Unlike image data, which is dense, LiDAR point cloud data is usually sparse, and the point density varies drastically due to factors such as non-uniform sampling of 3D space, the effective range of the sensor, occlusion, and relative pose. Therefore, conventional convolutional neural networks designed for dense data become very inefficient when applied to such sparse data; in particular, convolution operations waste a great deal of computing power and other resources on zero-valued data points.
In view of this, it is desirable to provide an improved convolution scheme suitable for sparse data such as point cloud data, thereby improving processing efficiency.
Summary
To at least partially solve one or more of the technical problems mentioned in the background, the present disclosure provides a data processing circuit, a data processing method, a chip, and a board.
In a first aspect, the present disclosure discloses a data processing circuit comprising a control circuit, a storage circuit, and an operation circuit, wherein:
the control circuit is configured to control the storage circuit and the operation circuit to perform N-dimensional convolution processing on input data and a convolution kernel, where N>1 and N denotes the number of convolution dimensions along which sliding accumulation is performed in the convolution operation, and the input data is sparse data represented in a dense form;
the storage circuit is configured to store information, the information including at least information before, during, and/or after processing; and
the operation circuit is configured to, under the control of the control circuit, perform multiple one-dimensional convolution operations on the input data and the convolution kernel to obtain multi-way operation results and the corresponding output point coordinates in the first convolution dimension; and to merge the multi-way operation results, according to their corresponding output point coordinates, into one way of fused data as the result of the convolution operation, where operation results with the same output point coordinates are accumulated.
In a second aspect, the present disclosure provides a chip comprising the data processing circuit of any embodiment of the first aspect.
In a third aspect, the present disclosure provides a board comprising the chip of any embodiment of the second aspect.
In a fourth aspect, the present disclosure provides a method for processing data using the aforementioned data processing circuit.
With the data processing circuit, the data processing method using the data processing circuit, the chip, and the board provided above, embodiments of the present disclosure provide a convolution scheme suitable for sparse data, which operates only on non-zero/non-null data against the convolution kernel, thereby greatly reducing the amount of computation and improving processing efficiency. The sparse convolution scheme provided by embodiments of the present disclosure is applicable to multi-dimensional convolution operations, including but not limited to two-dimensional and three-dimensional convolution, and is thus suitable for processing LiDAR point cloud data.
Brief Description of the Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts, in which:
Fig. 1 shows a structural diagram of a board according to an embodiment of the present disclosure;
Fig. 2 shows a structural diagram of a combined processing device according to an embodiment of the present disclosure;
Fig. 3a shows a schematic diagram of the internal structure of a single-core computing device according to an embodiment of the present disclosure;
Fig. 3b shows a schematic diagram of the internal structure of a multi-core computing device according to an embodiment of the present disclosure;
Fig. 4a shows the operating principle of a conventional convolution scheme;
Fig. 4b shows an exemplary principle of a sparse convolution scheme according to an embodiment of the present disclosure;
Fig. 5 shows an exemplary representation of input data according to an embodiment of the present disclosure;
Fig. 6 shows an exemplary process of a sparse convolution scheme according to an embodiment of the present disclosure;
Fig. 7 shows an example of splitting input data blocks according to an embodiment of the present disclosure;
Fig. 8 shows a flowchart of an exemplary method for screening valid input data points according to an embodiment of the present disclosure;
Fig. 9 shows a schematic diagram of scanning and traversing the third input parameter according to an embodiment of the present disclosure;
Fig. 10 shows a schematic diagram of constructing the Q matrix according to an embodiment of the present disclosure;
Fig. 11 shows exemplary logic for calculating the wo coordinates of output data according to an embodiment of the present disclosure; and
Fig. 12 shows a schematic structural diagram of a data processing circuit according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the description and claims indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this description and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary Hardware Environment
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in Fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial-intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing demands of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence; a notable characteristic of cloud intelligence applications is the huge amount of input data, which places high demands on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 via an external interface device 102. The external device 103 is, for example, a server, computer, camera, display, mouse, keyboard, network card, or WiFi interface. Data to be processed can be transferred from the external device 103 to the chip 101 via the external interface device 102, and the computation results of the chip 101 can be transmitted back to the external device 103 via the external interface device 102. Depending on the application scenario, the external interface device 102 may take different interface forms, such as a PCIe interface.
The board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to the control device 106 and the chip 101 via a bus for data transfer. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram of the combined processing device in the chip 101 of this embodiment. As shown in Fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation; it can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into the on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 may also read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including but not limited to data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors among central processing units (CPUs), graphics processing units (GPUs), and other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing device 201 of the present disclosure alone may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed; it may be a DRAM, such as DDR memory, typically 16 GB or larger, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device. The computing device 301 processes input data for computer vision, speech, natural language, data mining, and the like, and includes three main modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 coordinates and controls the work of the operation module 32 and the storage module 33 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 obtains instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoding results as control information to the operation module 32 and the storage module 33.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 performs vector operations and supports complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 stores input neurons, output neurons, and computed intermediate results; the WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; the DMA 333 is connected to the DRAM 204 via a bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Fig. 3b shows a simplified schematic diagram of the internal structure of the computing device 201 when it is multi-core. The multi-core computing device can be abstracted with a hierarchical hardware model. As shown, it can be abstracted into four levels: the board level (Card) 350, the chip level (Chip) 360, the processor cluster level (Cluster) 370, and the processor core level (Core) 380. Embodiments of the present disclosure mainly concern the data transfer of the storage units and the computation units, so the drawings and description briefly show and introduce the relevant computing structure and omit other parts.
At the board level, each board contains local DDR storage, and each processor chip serves as a compute and control unit.
At the chip level, each processor chip contains multiple multiprocessors as compute units.
At the cluster level, each multiprocessor includes multiple accelerator cores as control and compute units, plus a shared SRAM as a storage unit.
At the processor core level, each accelerator core contains local storage and an array of local processing units. NFU refers to the Neuron Function Unit, which performs convolution computation.
In this multi-core computing device, the storage model includes the board-level global memory, the SRAM (shared memory) in a Cluster, and the NRAM, WRAM, and registers in a Core. For better performance, data movement between the storage levels below the Card and the balance between memory access and computation can be explicitly controlled. The SRAM is contained in the memory processing unit (Memory Process Unit Core, MPU, or Mem Core). Core refers to the intelligent processing core (Intelligent Process Unit Core, IPU Core, or Core) of the multi-core computing device. One IPU Core contains an NRAM, a WRAM, an NFU, and so on. Cluster refers to a processor cluster or computing cluster; a multi-core computing device usually contains several Clusters, and one Cluster contains 1 Mem Core + N IPU Cores.
Exemplary Convolution Principle
Based on the foregoing hardware environment, embodiments of the present disclosure provide a data processing circuit supporting convolution operations on sparse data. By providing an optimized convolution scheme, convolution processing related to sparse data such as LiDAR point cloud data can be simplified and accelerated. The sparse convolution scheme of embodiments of the present disclosure is applicable to multi-dimensional convolution, including but not limited to two-dimensional and three-dimensional convolution. For simplicity and ease of understanding, some embodiments use two-dimensional convolution as an example.
In the "N-dimensional convolution" mentioned in embodiments of the present disclosure, N denotes the number of convolution dimensions along which sliding accumulation is performed in the convolution operation. For example, when N=2, the convolution kernel is translated and accumulated along two dimensions (e.g., width W and height H) according to the corresponding convolution strides. When N=3, the kernel is translated and accumulated along three dimensions (e.g., width W, height H, and depth D) according to the corresponding strides. A "non-convolution dimension" mentioned in embodiments of the present disclosure is a dimension along which the kernel does not slide and accumulate. Different non-convolution dimensions may require different operations. For example, in conventional convolution, the input channel dimension Ci requires accumulation while the output channel dimension Co does not; in depth-wise convolution, the input channel dimension Ci is not accumulated either.
To understand the convolution scheme of embodiments of the present disclosure more clearly, the operating principle of a conventional convolution scheme is first described using two-dimensional convolution as an example.
Fig. 4a shows the operating principle of a conventional convolution scheme. In this example, the convolution kernel 410 is dense, a 3×3 matrix whose numbers are the corresponding weight data. The input data 420 is a 7×7 matrix that is sparse, having only four non-zero values: 2, 3, 5, and 6, as shown by the dark squares. In this exemplary convolution, the stride in both dimensions is 2, the padding is 0, and there is no dilation. The 3×3 gray square in the figure represents the sliding accumulation of the kernel over the input data. 430 shows the computation at the start of the convolution, 440 shows the computation after sliding right once (stride 2), and 450 shows the computation after sliding down once (stride 2). In each step, the kernel weights are multiplied element-wise with the input data and accumulated. 460 is the final computation result as the output data, a 3×3 matrix. It can be seen that the computation of 430 corresponds to the data at coordinates (0,0) of the output, 440 to coordinates (0,1), and 450 to coordinates (1,0).
In the sparse convolution of embodiments of the present disclosure, the convolution kernel is dense and its input format can be the same as in conventional convolution, while the input data is sparse and its input format can differ from that of conventional convolution, thereby saving storage space.
As can be seen from the description of Fig. 4a, the final result of the sparse convolution depends only on the operation results of the non-zero input data elements. Therefore, the multiply-accumulate operations with the kernel can be performed only for these non-zero input elements, reducing invalid operations.
Further, as can be seen from the output data 460 of Fig. 4a, these multiply-accumulate operations can be further decomposed into row-wise one-dimensional convolutions whose partial sums are then accumulated element-wise by column. In other words, an N-dimensional convolution can be decomposed into multiple one-dimensional convolutions.
图4b示出了根据本披露实施例的稀疏卷积方案的示例性原理。图4b仍然以图4a的数据为例来描述本披露实施例的稀疏卷积运算方案。
如图4b所示,对于整个卷积运算,可以拆分成逐行计算卷积运算结果。在此示例中,一行运算结果460对应三行输入数据420,例如前三行输入数据可供计算第一行输出数据,中间三行 (与前三行和最后三行有重叠)输入数据可供计算第二行输出数据,最后三行输入数据则供计算最后一行输出数据。
从上述运算过程的拆分可以看出,可以以对应一行运算结果所需的输入数据(图中示例为三行输入数据)为粒度(此处称为“第一筛选粒度”)过滤稀疏数据,从而减少无效运算。例如,在图4b的示例中,中间三行输入数据为全0,也即不存在非稀疏点,因此可以省略次三行的卷积运算。
进一步地,在每行输出结果的计算过程中,又可以将二维(H、W维度)的卷积运算拆分成3个一维(W维度)的卷积运算。
如图所示,原本由3*3大小的卷积窗口(黑色方框所示)在三行输入数据上滑动计算得到一行输出数据,可以转换为由3个1*3大小的卷积窗口(虚线方框所示)分别在三行输入数据上滑动计算(如470所示),得到三行部分和结果(如480所示),再对位累加,得到最终的一行输出数据(460)。
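上述“将二维卷积拆分为多个一维卷积再对位累加”的等价关系,可以用如下示意性Python草图来体现(仅为原理性示例,函数名与数据为本文假设):

```python
def conv1d(row, kernel, stride):
    """一行输入与一行卷积核的一维滑动累加。"""
    ow = (len(row) - len(kernel)) // stride + 1
    return [sum(row[j * stride + q] * kernel[q] for q in range(len(kernel)))
            for j in range(ow)]

def conv2d_by_rows(rows, k, stride):
    """按图4b:一行输出 = ky个一维卷积得到的ky路部分和对位累加。
    rows为计算一行输出所需的ky行输入,k为ky行卷积核。"""
    partials = [conv1d(rows[p], k[p], stride) for p in range(len(k))]
    return [sum(col) for col in zip(*partials)]
```
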
从上述卷积运算的拆分可以看出,由于将二维的卷积运算分解成多个一维的卷积运算,因此在卷积运算过程中,可以以一个维度为粒度(此处称为“第二筛选粒度”)过滤稀疏数据,从而减少无效运算。例如,在图4b的示例中,第三行输入数据为全0,也即不存在非稀疏点,因此可以省略该行的卷积运算。
进一步地,在每个一维卷积运算中,还可以以一维卷积窗口为粒度(此处称为“第三筛选粒度”)过滤稀疏数据,从而进一步减少无效运算。例如,在图4b的示例中,第一行输入数据的后两个一维卷积窗口内不存在非稀疏点,因此可以省略这两个卷积窗口的运算;而第二行输入数据的最后一个一维卷积窗口内不存在非稀疏点,也可以省略该卷积窗口的运算。
由此可见,通过三个不同大小的筛选粒度,可以快速、有效地过滤稀疏数据,尽量减少无效运算。
相应地,在本披露实施例的稀疏卷积方案中,稀疏卷积运算可以包括如下步骤:对卷积核与稀疏输入数据执行多个一维卷积运算,获得多路运算结果(例如乘积结果或乘加结果,考虑到数据有些维度还需要进行累加,例如输入通道维度Ci)以及对应的第一卷积维度上的输出点坐标;然后将多路运算结果按其对应的输出点坐标,归并为一路融合数据,作为稀疏卷积运算的结果。在归并过程中,具有相同输出点坐标的运算结果进行累加。
可以理解,虽然上面以二维卷积为例阐释了本披露实施例的稀疏卷积运算方案的示例性原理,上述方案也可以应用于三维卷积运算甚至更高维卷积运算中。
按照上述原理,在应用于N维卷积运算时,N>1,可以将N维卷积运算拆分为M个一维卷积运算,其中M等于卷积核在除了一维卷积运算所在的第一卷积维度之外的N-1个卷积维度的大小之积。例如,在上述二维卷积示例中,卷积核大小为kx*ky,第一卷积维度为W(kx),则M=ky;而在三维卷积示例中,卷积核大小为kx*ky*kz,第一卷积维度为W(kx),则M=ky*kz。
下面详细描述实施本披露实施例的稀疏卷积方案的运算过程。
输入和输出数据的示例性结构
在稀疏卷积运算中,涉及输入数据、卷积核和输出数据,其中输入数据在神经网络中也称为输入神经元,输出数据称为输出神经元。
在涉及LiDAR点云数据的卷积运算中,卷积核是致密的,其输入格式可以与常规卷积相同。在二维卷积中,卷积核的尺寸通常为3*3,单次卷积需要将3*3*ci个数累积求和;在三维卷积中,卷积核的尺寸通常为3*3*3,单次卷积需要将3*3*3*ci个数累积求和。卷积的步长在各个维度上通常为2,例如S_H=S_W=S_D=2。
待卷积处理的输入数据可以包括多维数据,并且其在多个维度上是稀疏的。例如,在基于LiDAR数据的目标检测中,输入数据是三维空间内的检测数据,其例如表征每个三维空间坐标点的灰度值、RGB、信号强度等,因此根据其要表征的信息内容,每个坐标点处的输入数据元素可以是一维、二维、三维或更高维数据。由于点云数据的特性,具有非零值数据元素的坐标点是稀疏的,也即其在三个空间维度(例如,宽度W、高度H和深度D)上是稀疏的。
取决于输入数据的初始状态,可以在将稀疏的输入数据提供给运算电路进行运算之前进行预处理。在一些实施例中,这种预处理例如可以包括:将稀疏的多个维度合并成一个维度;将输入数据中的稀疏数据点在合并的维度上致密化;以及使用若干输入参数来表示致密化后的输入数据的数值和坐标。
图5示出了输入数据(稀疏神经元)的一种示例性表示方法。此处的输入数据可以是根据卷积运算的要求,做完填充处理之后的数据。在此示例中,可以通过一个预处理算子将稀疏的输入数据转换为致密输入数据表示,由此节省存储空间。
如图所示,稀疏形式的输入数据510包括五个维度:批次batch(B)维度、HWD三维空间维度和输入通道Ci维度。输入数据在B维度和HWD三维空间是稀疏的,图中HWD立体矩阵中的深色方块代表有数值的地方(称为有效输入点),其他部分全部为零值。B维度上存在多个这种HWD立体矩阵,每个立体矩阵上的稀疏样式(也即深色方块的位置)可以不同。输入数据在输入通道Ci维度是致密的,Ci维度是最低维度。由于附图表现能力有限,图中510仅示出了四个维度,但是Ci维度可以理解为每个深色方块的厚度。Ci维度的大小是统一的,也即每个深色方块的厚度是一样的。
在一些实施例中,将稀疏形式的输入数据510转换为致密输入数据时,可以参考稀疏矩阵中的CSR格式来表示。
在稀疏矩阵的存储中,为了达到压缩的目的,只存储非零元素值(有时也称为有效元素值),但是也要保留非零元素的位置,方便恢复。因此,稀疏矩阵的存储不仅存储非零元素值,同时存储其坐标位置(行索引,列索引)。
CSR存储方式,称为压缩稀疏行格式。CSR方式采用三个数组来存储稀疏矩阵,其分别存储行指针、列索引和数值。列索引数组和数值数组的长度为稀疏矩阵中非零元素的个数。行指针数组则存储每行第一个非零元素距离稀疏矩阵第一个非零元素的偏移位置,其最后一个元素存储稀疏矩阵中非零元素的总个数,因此行指针数组的长度是稀疏矩阵行数加1。可以理解,根据行指针数组的定义,后一行的指针值减去前一行的指针值可以得到前一行的非零元素的个数。
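作为参考,下面用一个示意性的Python草图演示CSR格式的构造过程(仅为说明性示例,并非任何特定库的实现;函数名为本文假设):

```python
def to_csr(mat):
    """将二维稀疏矩阵压缩为CSR三元组:行指针、列索引、数值。
    行指针长度为行数+1,最后一个元素为非零元素总个数。"""
    row_ptr, col_idx, vals = [0], [], []
    for row in mat:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(vals))  # 每行结束时累计的非零元素个数
    return row_ptr, col_idx, vals
```

按定义,后一行指针值减去前一行指针值即得该行的非零元素个数。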
类似地,在本披露实施例中,将稀疏形式的输入数据510转换为致密输入数据时,可以使用三个输入参数来表示。
第一输入参数是有效输入的致密数据,也即紧致排列的稀疏数据,用Min表示,Min的形状是ain*ci,其中ain是输入数据中非稀疏点的个数,ci是输入通道维度的大小。
在预处理过程中,可以将输入数据的四个稀疏维度(B维度和HWD三维空间维度)合并成一个维度ain,将稀疏数据点(图中的深色方块)在合并后的维度上进行致密化,从而形成致密的输入数据元素。也即,B维度上的每个HWD立体矩阵都执行同样的维度合并和致密化处理,从而得到预处理后的致密形式的输入数据521,其为二维矩阵,低维是Ci,高维是BHWD的合并维度ain。此示例中,ain=4。
在一些实施例中,针对存在多个批次batch的情况,可以采取逐个批次进行处理的方法,按批次拆分到不同的处理器核(例如图3b中的core)上进行处理。例如,SWIFT有12个batch,因此基于图3b示例的硬件环境,一次性最多可以并行处理12个batch。在这些实施例中,针对每个batch,只需要合并HWD这三个稀疏维度的有效输入点,此时ain对应HWD三个维度的有效输入点个数。
第二输入参数是每一个有效输入点在W维度的坐标或索引,用wi_coord表示,其形状为1*ain。如图中522所示,对于图中4个有效输入点,wi_coord为[1,2,0,6]。
第三输入参数是输入数据在H维度或H方向的CSR(压缩稀疏行)格式的数据,用hin表示。hin存储每行第一个非零元素距离输入数据中第一个非零元素的偏移位置,其最后一个元素存储输入数据中非零元素的总个数。取决于输入数据的维度数,第三输入参数可能具有多个维度。例如,当输入数据是图5示例的五维数据时,第三输入参数从高维到低维依次是:批次batch(B)、深度in_d和高度in_h,其形状为B*Din*(Hin+1),其中B、Din和Hin分别为稀疏形式的输入数据的B维度大小、D维度大小和H维度大小。
不失一般性地,第三输入参数hin的形状可以表示为X*(Hin+1),其中X表示输入数据可能存在的除H、W、Ci之外的其他维度的大小之积。例如,当输入数据为包括H、W和Ci的三维数据时,X不存在或者X=1;而当输入数据为包括D、H、W和Ci的四维数据时,X=D。
如图中523所示,在B=0、Din=0时,示例中的hin为[0,1,2,2,2,2,2,4]。具体地,第一个元素为“0”,也即表示hi=0这一行的第一个非零元素距离输入数据第一个非零元素的偏移位置为0,因为就是第一个非零元素自身。第二个元素“1”表示hi=1这一行的第一个非零元素距离输入数据第一个非零元素的偏移位置为1,其等于hi=0这一行非零元素的个数;第三个元素“2”表示hi=2这一行第一个非零元素(若有)距离输入数据第一个非零元素的偏移位置为2,其等于前两行hi=0和hi=1非零元素的个数。第四个元素“2”表示hi=3这一行第一个非零元素(若有)距离输入数据第一个非零元素的偏移位置为2,其等于前三行hi=0~2非零元素的个数。依次类推,最后一个元素“4”表示所有非零元素的总个数。
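按照上述定义,hin可以由每行有效输入点的个数通过前缀累加得到。下面的示意性Python草图复现了本段示例(函数名为本文假设):

```python
def build_hin(row_counts):
    """hin[i] = 前i行有效输入点的累计个数(前缀和);
    最后一个元素为非零元素总个数,长度为行数+1。"""
    hin = [0]
    for c in row_counts:
        hin.append(hin[-1] + c)
    return hin
```

例如,各行非零个数为[1,1,0,0,0,0,2]时,hin即为本段示例中的[0,1,2,2,2,2,2,4]。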
类似地,在本披露实施例中,由于卷积运算的输出数据也是稀疏的,因此可以同样使用三个输出参数来表示输出数据。
第一输出参数是有效输出的致密数据,也即紧致排列的稀疏数据,用Mout表示,Mout的形状是aout*co,其中aout是输出数据中非稀疏点的个数,co是输出通道维度的大小。
第二输出参数是每一个有效输出点在W维度的坐标或索引,用wo_coord表示,其形状为1*aout。
第三输出参数是输出数据在H方向的CSR(压缩稀疏行)格式的数据,用hout表示。hout存储输出中每行第一个非零元素距离输出数据中第一个非零元素的偏移位置,其最后一个元素存储输出数据中非零元素的总个数。第三输出参数与第三输入参数类似,也可能具有多个维度,例如从高维到低维依次是:批次batch(B)、深度out_d和高度out_h,其形状为B*Dout*(Hout+1),其中B、Dout和Hout分别为稀疏形式的输出数据的B维度大小、D维度大小和H维度大小。
例如,继续图5中的示例,假设卷积核为3*3,卷积步长为2,则输出数据中包括4个有效输出点,其对应的W维度坐标wo_coord 532为[0,1,0,2],对应的H方向的CSR格式的数据hout 533为[0,2,2,4],图中未示出第一输出参数Mout。
示例性稀疏卷积运算过程
从输入和输出数据的数据结构可以看出,已知卷积运算规则(例如,卷积核大小,卷积步长等)的情况下,根据输入数据的坐标,可以确定输出数据中的有效输出点的坐标。因此,在本披露实施例的稀疏卷积运算方案中,可以将运算过程拆解为坐标计算和数值计算几个步骤。
图6示出了根据本披露实施例的稀疏卷积方案的示例性过程。
如图所示,在步骤610中,首先基于输入数据的坐标,筛选出每个处理器核每次需要处理的数据。此处的输入数据是已经经过填补零和稀疏到致密转换后的CSR格式的数据。
在基于图3b所示的多核计算装置来执行稀疏卷积运算的情况下,鉴于存储器(例如图3b中的SRAM)的存储空间有限,需要对数据进行拆分处理。在一些实施例中,可以按照输出数据的维度进行拆分,如前面结合图4b描述的卷积运算原理可知,可以逐行计算卷积运算结果。因此,拆分方式可以是:每个处理器核每次计算一行Wo维度的输出数据点。相应地,对应一行Wo维度的输出数据点的输入数据块的形状为(kz*ci)*ky*wi,其中wi是输入数据的W维度大小,ci是输入数据的Ci维度大小,卷积核在W、H和D维度的大小分别为kx、ky和kz。
图7示出了输入数据块的拆分示例,其中灰色输入数据块做完卷积累加运算之后,正好对应一行Wo维度的输出数据点。
考虑到数据的稀疏性,为了取出对应计算一行Wo维度的输出数据点的有效输入数据点,在一些实施例中,可以根据输入数据的第三输入参数,也即hin来进行筛选。从hin的含义可知,第i+1个数值减去第i个数值可以得到第i行的非零元素(有效输入数据点)的个数。因此,可以基于hin的这一特性来判断是否存在有效输入数据点,从而进行筛选。
图8示出了根据本披露实施例的筛选有效输入数据点的示例性方法流程图。
如图所示,在步骤811,首先将第三输入参数从外部存储电路加载到片上存储器(例如SRAM)上。第三输入参数需要的存储空间大小为(Hin+1+2*ph)*(Din+2*pd)*dwidth,其中Din和Hin分别为稀疏形式的输入数据的D维度大小和H维度大小,ph和pd分别为H维度和D维度的单侧填补量,dwidth为数据位宽。第三输入参数的形状可以是二维矩阵(Din+2*pd)*(Hin+1+2*ph),此处假设批次B=1,因为逐批次进行处理。
接着,在步骤812处,对第三输入参数以指定的扫描窗口(第一扫描窗口)、指定的扫描步长(第一扫描步长)进行遍历,以查找有效输入数据点。
第一扫描窗口的大小对应于计算一行Wo维度的输出数据点所需的输入数据,也即对应于前文提到的“第一筛选粒度”。第一扫描窗口的大小根据卷积核的尺寸来确定。具体地,第一扫描窗口大小为kz*(ky+1),kz对应扫描窗口在D维度的大小,ky+1为扫描窗口在H维度的大小,这是因为第三输入参数使用CSR格式,其中ky行H维度数据需要ky+1个数据来表示。第一扫描步长等于H维度卷积步长Sy。
在扫描遍历过程中,检测第三输入参数的每行数据是否有变化,若任一行有变化,则说明存在有效输入数据点。检测到有效输入数据点的扫描窗口可以称为区块(range_block)。检测到区块之后,可以记录该区块对应的hi和di坐标,由此可以根据hi和di坐标,算出其在输出数据中对应的ho和do坐标。
在连续检测到N_IPU个区块之后,在步骤813处,可以将查找到的区块发送给处理器核(IPU),例如将这N_IPU个区块分别发给N_IPU个不同的IPU。同时,可以告知每个IPU其各自处理的输出点在输出数据中所对应的H和D维度坐标,也即ho和do坐标。可以理解,每个IPU在计算输出点(wo_x,ho,do)的数值的时候,wo_x是变化的,范围是0~(wo-1),ho和do是固定的。
图9示出了对第三输入参数进行扫描遍历的示意图。可以理解,图中仅示出了第三输入参数的一部分,例如di=0~2的前3行hin数据910。在此示例中扫描窗口的大小为3*4个数,扫描步长为2,沿着hin方向依次扫描。可以看出,对应输出点坐标ho=0的首个扫描窗口901中,3行hin数据都有变化,也即说明其中存在有效输入数据点,由此记录该扫描窗口901为区块0,并可以计算出其对应的ho=0,do=0。接着,向右移动2个数,在此扫描窗口902中,3行hin数据均没有变化,也即说明其中不存在有效输入数据点,可以无需处理,继续下一步扫描。按照这种方式,可以检测到连续的4个区块,分别发给4个不同的IPU,也即在此示例中N_IPU=4。这4个区块对应的do坐标均为0,ho坐标分别为0,2,3,4。
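上述扫描遍历过程可以用如下示意性Python草图表示(为简便起见仅示意沿H方向逐行扫描hin;利用CSR指针单调不减的性质,窗口首尾指针值不同即说明窗口内存在有效输入数据点;函数名与测试数据均为本文假设):

```python
def find_blocks(hin_rows, ky, sy):
    """以每行ky+1个数的第一扫描窗口、Sy步长遍历各行hin,
    返回检测到区块(存在有效输入数据点)的输出行坐标ho列表。"""
    blocks = []
    h_out = 0
    start = 0
    while start + ky + 1 <= len(hin_rows[0]):
        # CSR指针在窗口内有变化 <=> 窗口覆盖的行中存在非零元素
        window_has_points = any(row[start + ky] != row[start]
                                for row in hin_rows)
        if window_has_points:
            blocks.append(h_out)
        start += sy
        h_out += 1
    return blocks
```
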
继续图6,在步骤620中,被分配了待处理数据(通过前述区块来指示)的每个IPU,可以根据区块的指示取出对应的数据并构造待卷积矩阵,此后简称为Q矩阵,同时计算输出点坐标信息。例如,从预先从外部存储电路上加载了输入数据的共享存储器SRAM上取数来构造Q矩阵。
在一些实施例中,可以首先从第二输入参数wi_coord中取出所分配区块对应的输入数据块中有效输入数据点的wi_coord向量。接着,可以根据wi_coord向量,对该向量对应的输入数据以第二扫描窗口、第二扫描步长进行遍历以从有效输入的致密数据Min(也即第一输入参数)中取出对应的输入数据点来构造Q矩阵。
如前所述,区块取自于第三输入参数hin,其记录了各行首个有效输入数据点距指定点(也即整个输入数据的首个有效输入数据点)的距离,因而可以根据此信息从第二输入参数wi_coord的指定位置取出指定数量的数据构成对应的wi_coord向量。此处wi_coord向量的含义是指输入数据中一整行W维度的所有有效输入数据点的wi坐标构成的向量。例如,假设1行W维度上有34个有效输入数据点,则向量长度是34,每个向量元素的数值为对应数据点的wi坐标。
所取的wi_coord向量的数量为kz(D维度)*ky(H维度),也即对应于区块所指示的数据范围,也对应于前文描述的转换后的一维卷积运算的数量M。由此,这M个wi_coord向量可以在H和D维度上覆盖住一个卷积窗口的范围,W维度上则取整个wi。例如,在前述示例中,kz=ky=3,因此取出9个wi_coord向量。所取出的kz*ky个wi_coord向量可以进行连续拼接,以基于其来构造待卷积矩阵。
同样地,根据hin的含义,在构造wi_coord向量的过程中可以对wi_coord向量进行筛选。由于hin中前后两个元素之差代表对应行中有效输入数据点的个数,因此对于差值为0的行,wi_coord向量为空,由此可以筛除为空的wi_coord向量。此筛选步骤对应于前文描述的“第二筛选粒度”。
接着,在一些实施例中,可以根据所取出的wi_coord向量,对wi_coord向量对应的输入数据以第二扫描窗口、第二扫描步长进行遍历以从第一输入参数Min中取出对应的输入数据点来构造矩阵Q。
图10示出根据本披露实施例的构造Q矩阵的示意图。为了简单起见,图中仅示出了3行w的Q矩阵构造,其他行的w可以进行类似的构造。
如图所示,为了直观起见,图中1010展示了稀疏形式的输入数据,还示出了所取出的wi_coord向量1020(对应di=0,hi=4~6这三行)以及分配给当前IPU的区块中di=0的部分hin 1030,图中1040展示了基于di=0,hi=4~6这3行w构造的Q矩阵。
如前所述,可以基于第三输入参数hin中的信息构造wi_coord向量,也即确定当前要扫描的行中是否存在有效输入数据点。具体地,根据hin中第i+1个数值与第i个数值的差值,可以确定第i行中的有效输入数据点的个数。
例如,在图中示例中,根据区块(1030)的信息:4-2=2,可以确定hi=4这一行存在2个有效输入数据点,而根据区块1030中的信息:4-4=0,可以确定hi=5这一行不存在有效输入数据点,根据7-4=3,则可以确定hi=6这一行存在3个有效输入数据点。因此,只需扫描hi=4和hi=6这2行。从而,在此示例中,由于hi=5这一行数据为空,因此没有对应的wi_coord向量。按照hin 1030指示的起始位置和数量,可以从wi_coord取出相应的数据构造wi_coord向量,如图中1020所示,只有2个wi_coord向量。
接着,根据所取出的wi_coord向量,对wi_coord向量对应的输入数据以第二扫描窗口、第二扫描步长进行遍历以取出对应的有效输入数据点来构造矩阵Q。
具体地,逐行进行扫描来构造Q矩阵的对应行。在图中示例中,先扫描hi=4这一行,跳过hi=5,扫描hi=6这一行。在扫描期间,将检测到有效输入数据点的第二扫描窗口所覆盖的数据提取出来,按顺序平铺以构成矩阵Q的对应行,同时跳过未检测到有效输入数据点的第二扫描窗口。可以理解,因为是逐行进行滑动扫描,因此第二扫描窗口的大小对应卷积运算的卷积窗口在第一卷积维度的大小(例如W维度的大小,kx),第二扫描步长则对应卷积运算在W维度的卷积步长Sx。此扫描筛选的步骤对应于前文描述的“第三筛选粒度”。
在扫描时,利用第二扫描窗口(在此示例中为1*3大小)、W维度的卷积步长Sx(此示例中Sx=2),沿W维度逐行扫描遍历输入数据。当扫描窗口内存在有效输入数据点时,提取对应该扫描窗口的输入数据。如此,将各次提取的窗口数据沿W维度顺次展开平铺,以构造Q矩阵。
如图10示,可以首先针对hi=4这一行数据进行扫描,如首个扫描窗口1001所示。发现该窗口1001内存在有效输入数据点,则提取该窗口的数据构成Q矩阵1040的第1行前3列;接着向右平移2个数据,扫描下一扫描窗口1002,该窗口也存在有效输入数据点,同样提取该窗口的数据构成Q矩阵第1行接下来的3列;继续向右平移2个数据,扫描下一扫描窗口1003,该窗口也存在有效输入数据点,同样提取该窗口的数据构成Q矩阵第1行最后3列,该行数据扫描结束。扫描期间,若扫描窗口内不存在有效输入数据点,则跳至下一窗口。
接着,跳过hi=5这一行,针对hi=6这一行数据进行扫描。类似地,利用1*3大小的扫描窗口,步长为2个数据进行扫描。扫描结果为:扫描窗口1004内存在2个有效输入数据点,扫描窗口1005内存在2个有效输入数据点,扫描窗口1006内不存在有效输入数据点。提取数据构造Q矩阵的结果如1040第2行所示,由扫描窗口1004和1005覆盖的数据构成。
由于输入数据是致密形式的,因此在判断扫描窗口内是否存在有效输入数据点时,可以根据输入数据的坐标信息来确定其是否位于某个扫描窗口中。具体地,可以根据分配给该IPU的区块(1030)以及所构造的wi_coord向量(1020)来判断。可以理解,一个扫描窗口实质对应于一个输出点(的部分和),因此可以根据有效输入数据点的wi坐标来推断其所贡献于的输出数据点的wo坐标,从而确定是否落入一个或多个扫描窗口中。
图11示出了根据本披露实施例计算输出数据点的wo坐标的示例性逻辑。在此示例中,假设卷积核大小为3*3*3,卷积步长在HWD方向均为2。
从图中可以看出,按照卷积运算规则以及对应的卷积参数(包括卷积核大小和卷积步长),从wi_coord到wo_coord存在如下映射关系:
若wi_coord是奇数,则映射是一对一的,wo_coord=(wi_coord-1)/2;
若wi_coord是偶数,则根据wi坐标是否为边界,映射可能是一对一(边界点wo_coord=0或wi-1)或一对二(非边界点)。例如,映射关系可以是wo_coord=wi_coord/2-1,以及wo_coord=wi_coord/2。
根据该映射关系,可以计算出wo_coord。例如参见图10的示例,根据第一个wi_coord向量=[1,4],可以计算出wo_coord=[0,1,2],存在3个有效输出数据点;而根据第二个wi_coord向量=[0,2,3],可以计算出wo_coord=[0,0,1,1],去重后为[0,1],存在2个有效输出数据点。在计算过程中,可以统计每一行wo坐标的个数,也即每一行的有效输出数据点个数,以供后续使用。
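上述映射关系可以一般化为:输出点wo满足wo*Sx ≤ wi ≤ wo*Sx+kx-1。下面的示意性Python草图实现了该映射并复现本段示例(函数名为本文假设,默认kx=3、Sx=2,与图11一致):

```python
def wo_coords(wi, kx=3, sx=2, wo_size=None):
    """返回有效输入点wi坐标所贡献的全部输出点wo坐标。
    wo范围为[ceil((wi-kx+1)/sx), floor(wi/sx)],并裁剪到有效输出范围。"""
    lo = max(0, -(-(wi - kx + 1) // sx))  # 向上取整
    hi = wi // sx
    if wo_size is not None:
        hi = min(hi, wo_size - 1)
    return list(range(lo, hi + 1))
```

在kx=3、Sx=2时,奇数wi恰为一对一映射((wi-1)/2),非边界的偶数wi映射到两个输出点。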
可以理解,取决于卷积参数的不同,映射关系可能会随之改变。本领域技术人员可以根据具体的卷积参数(具体为卷积核kx尺寸和卷积步长Sx),推导出输入数据的wi坐标与输出数据的wo坐标之间的映射关系。
当根据有效输入数据点的wi坐标推断出其所贡献于的输出数据点的wo坐标后,即可以确定其落入哪一个或多个第二扫描窗口中。
例如,在图10的示例中,针对第一个wi_coord向量,推断出其中坐标为1的有效输入数据点落入wo=0的第二扫描窗口中,坐标为4的有效输入数据点落入wo=1和wo=2的两个第二扫描窗口中。
在上述示例中,在提取有效输入数据点来构造Q矩阵时,同样可以根据wi_coord为奇数还是偶数来分情况构造Q矩阵。具体地,可以根据wi坐标为奇数还是偶数来确定该输入数据点在第二扫描窗口中的具体位置。例如,wi坐标为奇数的输入数据点必然落在第二扫描窗口的中间位置,而wi坐标为偶数的输入数据点则落在两个相邻第二扫描窗口的两个相邻位置中。
由此,可以根据这一规律,从例如共享存储器SRAM中读取该有效输入数据点的值,存储到片上存储器(例如NRAM)上。
具体地,对于wi坐标为偶数的有效输入数据点,需要复制2份存储,因为会落入两个相邻扫描窗口中,并且位置在当前扫描窗口的尾部和下一扫描窗口的头部;对于wi坐标为奇数的有效输入数据点,只需要复制1份,其位置在当前扫描窗口的中部。
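上述逐窗口提取与复制规则可以用如下示意性Python草图表示(仅示意单行W维度数据的Q矩阵行构造;函数名与数据为本文假设,kx=3、Sx=2):

```python
def build_q_row(wi_coord, values, wi_size, kx=3, sx=2):
    """以第二扫描窗口(1*kx,步长Sx)扫描一行输入,
    仅保留含有效输入点的窗口并顺次平铺,同时返回各窗口对应的wo坐标。"""
    dense = [0] * wi_size
    for w, v in zip(wi_coord, values):
        dense[w] = v
    q_row, wo_list = [], []
    ow = (wi_size - kx) // sx + 1
    for wo in range(ow):
        window = dense[wo * sx: wo * sx + kx]
        if any(window):          # 窗口内存在有效输入点才保留
            q_row.extend(window)
            wo_list.append(wo)
    return q_row, wo_list
```

注意偶数wi坐标的点(如wi=4)会出现在两个相邻窗口中,即被复制2份,与上文规则一致。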
由此,通过对有效输入数据点进行逐行扫描,可以构建Q矩阵。针对M个非空的wi_coord向量中的每一个,可以按照上述方式顺次处理,由此构建的Q矩阵具有M行,每行由Li个第二扫描窗口构成,Li取决于该行的输出数据点的个数,如前面在计算wo坐标时所统计的。
所统计的输出数据点的个数可以用于计算输出数据中的第三输出参数hout。根据第三输出参数hout的定义可知,第i个数值代表前面i-1行中有效输出点的总个数。因此,可以根据前面基于wo_coord所统计的每一行wo坐标的个数,通过积分求和/累加求和,得到hout的各个数值。例如,在图5的示例中,由于输出数据在第0行有2个wo坐标,第1行没有wo坐标,第2行有2个wo坐标,因此,对应的hout=[0,2,2,4]。
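根据上述统计方法,hout同样可以由每行wo坐标(去重后)的个数通过累加求和得到,如下示意性Python草图所示(函数名为本文假设):

```python
def build_hout(per_row_wo):
    """per_row_wo为每一行输出的wo坐标列表(可能含重复);
    去重计数后前缀累加,得到H方向CSR格式的hout。"""
    hout = [0]
    for row in per_row_wo:
        hout.append(hout[-1] + len(set(row)))
    return hout
```
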
回到图6,在步骤630中,在构建了矩阵Q之后,即可以对矩阵Q执行M个一维卷积运算,该M个一维卷积运算的M个一维卷积核由原N维卷积核按照第一卷积维度拆分而得到,一维卷积运算的卷积步长等于第二扫描窗口的大小,也即等于N维卷积核在第一卷积维度的大小。在本披露实施例中,第一卷积维度是W维度,因此一维卷积运算的卷积步长等于kx。
由此,通过M个一维卷积运算可以得到M路部分和结果,其对应于同一行wo输出点。每路部分和结果中也确定了各个部分和对应的wo坐标。
接着,在步骤640中,将M路部分和结果按其对应的输出数据点坐标,归并为一路融合数据,得到对应行wo输出数据点的最终结果。在归并过程中,具有相同wo坐标的部分和结果进 行累加。
可以采取多种方式执行上述归并过程。
在一些实施例中,可以采用硬件方式来实现归并融合处理。在这些实施例中,可以通过一种硬件MERGE指令来实现。MERGE指令的基本功能就是将多路待融合数据,按照其索引顺序,合并成一路融合数据,并且相同索引的数据进行累加。
在另一些实施例中,可以采用软件方式来实现归并融合处理。在这些实施例中,例如可以通过一种基于多核处理器的全向量化排序算法来实现归并融合过程中的排序。然后调用bang_add算子对排序后的数据进行遍历,当坐标相同的时候,直接累加;如果不同,则无需累加,继续遍历即可。
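上述归并融合的基本功能(按索引顺序合并多路数据、相同索引累加)可以用如下示意性Python草图表示(这只是功能示意,并非MERGE硬件指令或bang_add算子的实际实现):

```python
def merge_partials(streams):
    """将多路(坐标, 部分和)按坐标归并为一路融合数据,
    相同坐标的部分和进行累加,结果按坐标升序排列。"""
    acc = {}
    for stream in streams:
        for coord, val in stream:
            acc[coord] = acc.get(coord, 0) + val
    return sorted(acc.items())
```
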
上面从多个方面描述了本披露实施例的稀疏型数据的卷积运算方案。相比于常规的卷积方案,本披露实施例的方案仅对稀疏数据中的非零/非空数据进行运算处理,可以避免过多的无效运算,极大节省运算量,提高处理效率。进一步地,通过不同级别(例如三个级别)的筛选粒度对输入数据进行筛选,可以快速、高效地提取需要执行卷积运算的数据。本披露实施例提供的稀疏卷积方案尤其可以适用于基于LiDAR点云数据的处理。
本披露实施例还提供了用于执行上述稀疏型数据的卷积运算的数据处理电路,以及由该数据处理电路实施的数据处理方法。
图12示例性示出了可以实施本披露实施例的数据处理电路的示意性结构图。如图12所示,数据处理电路1200包括控制电路1210、存储电路1220和运算电路1230。
控制电路1210负责处理数据处理电路1200上的各种功能,包括但不限于控制、取指、译码、计算等。控制电路1210例如可以包括图3中的控制模块31。
在一些实施例中,控制电路1210可以配置用于控制存储电路1220和运算电路1230对输入数据和卷积核执行N维卷积运算处理,N>1,N表示卷积运算中执行滑动累加的卷积维度数。在此卷积运算中,输入数据是稀疏化数据并采用致密形式表征。
进一步地,控制电路1210可以配置用于:根据输入数据的输入参数,筛选当前轮次分配给运算电路1230运算的输入数据块,其中运算电路1230每轮次计算一行W维度的输出数据。
存储电路1220可以用于存储信息,这些信息至少包括处理前和/或处理后的信息,也可以包括处理期间需要缓存的中间信息,其例如可以是图3所示的各种RAM,或称片上缓存。在一些实施例中,存储电路1220可以配置用于存储输入数据、卷积核、卷积运算结果和/或缓存中间结果,例如缓存部分和结果,或者提供MERGE指令执行期间所需的缓存空间。
运算电路1230可以配置用于根据相关指令执行各种运算操作。具体地,运算电路1230可以配置用于在控制电路1210的控制下,对输入数据与卷积核执行多个一维卷积运算,获得多路运算结果及对应的第一卷积维度上的输出点坐标;以及将多路运算结果按照其对应的输出点坐标,归并为一路融合数据作为卷积运算的结果,其中具有相同输出点坐标的运算结果进行累加。
在一些实施例中,上述N维卷积运算被拆分为M个一维卷积运算,M等于卷积核在除第一卷积维度之外的N-1个卷积维度的大小之积。
在一个实施例中,运算电路1230还可以包括运算处理电路(未示出),其可以配置成根据运算指令对运算电路执行运算前的数据进行预处理或者对运算后的数据进行后处理。在一些应用场景中,前述的预处理和后处理可以例如包括数据拆分和/或数据拼接操作。
在一些实施例中,运算电路1230可以包括多个处理器核,每个处理器核每次可以处理控制电路1210所分配的输入数据块,例如每次计算一行W输出点。
具体地,每个处理器核可以进一步配置用于按如下执行运算:基于分配的指示待处理的输入数据块的区块,构造待执行一维卷积运算的矩阵Q;计算该一维卷积运算的输出点的各部分和结果在第一卷积维度上的坐标;对矩阵Q执行多个一维卷积运算,得到多路部分和结果;以及对多路部分和结果进行归并融合处理,得到最终卷积运算结果。
尽管在上面的描述中,将确定坐标的步骤描述为由运算电路执行,本领域技术人员可以理解,确定坐标的步骤也可以通过软件来执行,例如由控制电路来执行。进一步地,尽管在上面的描述中,将各个处理步骤概括地描述成在运算电路上执行,此处的运算电路也可以是分布式的,例如包含异构系统中的运算电路,从而一部分运算例如在CPU上执行,另一部分运算例如在GPU上执行。在一个实现中,输入数据的预处理例如可以在CPU上执行,这些预处理例如可以包括稀疏形式的输入数据的致密化,等等。输入数据与卷积核的一维卷积运算、多路部分和结果的归并融合等处理可以在GPU上执行,由此充分发挥异构系统的优势。
本领域技术人员可以理解,前面结合附图描述的本披露实施例的与稀疏型数据的卷积运算处理的描述可以同样应用于图12的数据处理电路,因此不再进行重复描述。
本披露还提供了一种芯片,其可以包括前面结合附图描述的任一实施例的数据处理装置。进一步地,本披露还提供了一种板卡,该板卡可以包括前述芯片。
根据不同的应用场景,本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本披露的方法及其核心思想;同时,对于本领域的一般技术人员,依据本披露的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本披露的限制。

Claims (16)

  1. 一种数据处理电路,包括控制电路、存储电路和运算电路,其中:
    所述控制电路配置用于控制所述存储电路和所述运算电路对输入数据和卷积核执行N维卷积运算处理,N>1,N表示卷积运算中执行滑动累加的卷积维度数,其中所述输入数据是稀疏化数据并采用致密形式表征;
    所述存储电路配置用于存储信息,所述信息至少包括处理前、处理期间和/或处理后的信息;以及
    所述运算电路配置用于在所述控制电路的控制下,对所述输入数据与所述卷积核执行多个一维卷积运算,获得多路运算结果及对应的第一卷积维度上的输出点坐标;以及将所述多路运算结果按照其对应的输出点坐标,归并为一路融合数据作为所述卷积运算的结果,其中具有相同输出点坐标的运算结果进行累加。
  2. 根据权利要求1所述的数据处理电路,其中所述N维卷积运算被拆分为M个所述一维卷积运算,M等于所述卷积核在除所述第一卷积维度之外的N-1个卷积维度的大小之积,所述第一卷积维度为宽度W维度。
  3. 根据权利要求2所述的数据处理电路,其中所述输入数据至少包括宽度W、高度H和输入通道Ci维度,所述输入数据至少在Ci维度是致密的并大小一致,所述输入数据的所述致密形式包括三个输入参数:
    第一输入参数Min,表示有效输入的致密数据,形状为ain*ci,其中ain是所述输入数据中有效输入数据点的个数,ci是输入数据的Ci维度的大小;
    第二输入参数wi_coord,表示每一个所述有效输入数据点在W维度的坐标,形状为1*ain;以及
    第三输入参数hin,表示输入数据在H维度的压缩稀疏行CSR格式的数据,形状为X*(Hin+1),其中Hin表示输入数据的H维度的大小,X表示输入数据可能存在的除H、W、Ci之外的其他维度的大小之积。
  4. 根据权利要求3所述的数据处理电路,其中所述控制电路进一步配置用于:
    根据所述输入数据的输入参数,筛选当前轮次分配给所述运算电路运算的输入数据块,其中所述运算电路每轮次计算一行W维度的输出数据。
  5. 根据权利要求4所述的数据处理电路,其中所述控制电路进一步配置用于按如下筛选输入数据块:
    对所述第三输入参数以第一扫描窗口、第一扫描步长进行遍历以检测扫描窗口内是否存在有效输入数据点,其中第一扫描窗口对应于计算一行W维度的输出数据所需的输入数据块的大小,第一扫描步长等于所述卷积运算在H维度的卷积步长;以及
    将检测到存在有效输入数据点的第一扫描窗口所对应的第三输入参数中的区块分配给所述运算电路。
  6. 根据权利要求5所述的数据处理电路,其中所述运算电路包括N_IPU个处理器核,并且所述控制电路进一步用于:
    将连续检测到的N_IPU个区块分别发送给所述N_IPU个处理器核进行处理。
  7. 根据权利要求5-6任一所述的数据处理电路,其中所述运算电路进一步用于:
    基于接收到的区块,构造待执行所述一维卷积运算的矩阵Q;以及
    计算所述一维卷积运算的输出点的各部分和结果在所述第一卷积维度上的坐标。
  8. 根据权利要求7所述的数据处理电路,其中所述运算电路进一步用于按如下构造所述矩阵Q:
    从所述第二输入参数wi_coord中取出所述区块对应的输入数据块中有效输入数据点的wi_coord向量;
    根据所述wi_coord向量,对所述wi_coord向量对应的输入数据以第二扫描窗口、第二扫描步长进行遍历以从所述第一输入参数Min中取出对应的输入数据点来构造所述矩阵Q。
  9. 根据权利要求8所述的数据处理电路,其中所述运算电路进一步用于:
    根据所述区块的指示,从所述wi_coord的指定位置取出指定数量的数据构成所述wi_coord向量,所取wi_coord向量的数量等于所述M;以及
    滤除其中为空的wi_coord向量。
  10. 根据权利要求8-9任一所述的数据处理电路,其中所述运算电路进一步用于按如下根据所述wi_coord向量,逐行进行扫描以构造所述矩阵Q的对应行:
    将检测到有效输入数据点的第二扫描窗口所覆盖的数据提取出来,按顺序平铺以构造所述矩阵Q的对应行;以及
    跳过未检测到有效输入数据点的第二扫描窗口;
    其中所述第二扫描窗口对应于所述卷积运算的N维卷积窗口在W维度的大小,所述第二扫描步长等于所述卷积运算在W维度的卷积步长。
  11. 根据权利要求7-10任一所述的数据处理电路,其中所述运算电路进一步用于:
    根据所述卷积运算的卷积核尺寸和卷积步长,确定输入数据的W维度坐标与输出数据的W维度坐标的映射关系;以及
    基于所述映射关系,根据各个有效输入点的W维度坐标确定对应的一个或多个输出点的W维度坐标,以作为所述部分和结果的W维度坐标。
  12. 根据权利要求7-11任一所述的数据处理电路,其中所述运算电路进一步用于:
    对所述矩阵Q的各行分别执行所述一维卷积运算,得到多路部分和结果,其中所述一维卷积运算的一维卷积核对应于所述N维卷积运算的N维卷积核的对应W维度行,所述一维卷积运算的卷积步长等于所述N维卷积核在W维度的大小。
  13. 根据权利要求12所述的数据处理电路,其中所述运算电路进一步用于:
    将所述多路部分和结果按其对应的W维度坐标,归并为一路融合数据,在所述归并中,具有相同W维度坐标的部分和结果累加,以得到对应行的输出数据点的最终结果。
  14. 一种芯片,包括根据权利要求1-13任一所述的数据处理电路。
  15. 一种板卡,包括根据权利要求14所述的芯片。
  16. 一种使用权利要求1-13任一所述的数据处理电路来处理数据的方法。
PCT/CN2022/100306 2021-12-29 2022-06-22 数据处理电路、数据处理方法及相关产品 WO2023123919A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111642096.8 2021-12-29
CN202111642096.8A CN114329324A (zh) 2021-12-29 2021-12-29 数据处理电路、数据处理方法及相关产品

Publications (1)

Publication Number Publication Date
WO2023123919A1 true WO2023123919A1 (zh) 2023-07-06

Family

ID=81016541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100306 WO2023123919A1 (zh) 2021-12-29 2022-06-22 数据处理电路、数据处理方法及相关产品

Country Status (2)

Country Link
CN (1) CN114329324A (zh)
WO (1) WO2023123919A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114329324A (zh) * 2021-12-29 2022-04-12 寒武纪行歌(南京)科技有限公司 数据处理电路、数据处理方法及相关产品
CN116828070B (zh) * 2023-08-28 2023-11-07 无锡市锡容电力电器有限公司 一种智慧电网数据优化传输方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052989A (zh) * 2018-02-07 2018-05-18 深圳市唯特视科技有限公司 一种基于样条卷积神经网络的图像分类方法
US20180181857A1 (en) * 2016-12-27 2018-06-28 Texas Instruments Incorporated Reduced Complexity Convolution for Convolutional Neural Networks
CN109084796A (zh) * 2018-08-27 2018-12-25 深圳市烽焌信息科技有限公司 路径导航方法及相关产品
CN109840585A (zh) * 2018-01-10 2019-06-04 中国科学院计算技术研究所 一种面向稀疏二维卷积的运算方法和***
CN114329324A (zh) * 2021-12-29 2022-04-12 寒武纪行歌(南京)科技有限公司 数据处理电路、数据处理方法及相关产品


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117831596A (zh) * 2024-03-05 2024-04-05 悦芯科技股份有限公司 一种存储芯片稀疏失效单元电路的修复方法
CN117831596B (zh) * 2024-03-05 2024-05-24 悦芯科技股份有限公司 一种存储芯片稀疏失效单元电路的修复方法

Also Published As

Publication number Publication date
CN114329324A (zh) 2022-04-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913208

Country of ref document: EP

Kind code of ref document: A1