WO2020211049A1 - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
WO2020211049A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
group
coefficients
register
Prior art date
Application number
PCT/CN2019/083284
Other languages
French (fr)
Chinese (zh)
Inventor
任子木
陆正杰
吴穹蔗
仇晓颖
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to CN201980005020.9A (patent CN111213177A)
Priority to PCT/CN2019/083284 (patent WO2020211049A1)
Publication of WO2020211049A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00: Image enhancement or restoration
    • G06T5/20: Image enhancement or restoration using local operators
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00: Image enhancement or restoration
    • G06T5/70: Denoising; Smoothing

Definitions

  • This application relates to the field of data processing, and more specifically, to a data processing method and device.
  • Image data can be filtered to achieve, for example, suppression of image noise.
  • When filtering an image, a filter can be used to process the image. How to improve processing efficiency during image filtering is an urgent problem to be solved.
  • The embodiments of the present application provide a data processing method and device that can improve processing efficiency in an image filtering process.
  • A data processing method is provided. The method is used in filtering a matrix to be processed by using a coefficient matrix. The matrix to be processed includes at least one sub-matrix, and the sub-matrix includes H groups of data. Each group of data includes W sliding windows, each sliding window has N data, the coefficient matrix includes H sets of coefficients, and each set of coefficients includes W coefficients, where N, H, and W are positive integers.
  • The method includes: reading the i-th group of data in the sub-matrix and registering it in a first register, where i is an integer from 1 to H; reading the i-th group of coefficients in the coefficient matrix and registering it in a second register; multiplying each of the N data in the j-th sliding window of the W sliding windows included in the i-th group of data by the j-th coefficient of the W coefficients included in the i-th group of coefficients, where j is an integer from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are performed in parallel; and adding together the multiplication results corresponding to data at the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, to obtain N output data.
  • A data processing device is provided. The device is used for filtering a matrix to be processed by using a coefficient matrix. The matrix to be processed includes at least one sub-matrix, and the sub-matrix includes H groups of data. Each group of data includes W sliding windows, each sliding window has N data, the coefficient matrix includes H sets of coefficients, and each set of coefficients includes W coefficients, where N, H, and W are positive integers.
  • The device includes a control circuit, a multiplication circuit, an addition circuit, a first register, and a second register. The control circuit is used to read the i-th group of data in the sub-matrix and register it in the first register, where i is an integer from 1 to H, and to read the i-th group of coefficients in the coefficient matrix and register it in the second register. The multiplication circuit is used to multiply each of the N data in the j-th sliding window of the W sliding windows included in the i-th group of data by the j-th coefficient of the W coefficients included in the i-th group of coefficients, where j is an integer from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are performed in parallel. The addition circuit is used to add together the multiplication results corresponding to data at the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, to obtain N output data.
  • In the filtering of the matrix to be processed using the coefficient matrix, at least the N multiplications of the N data in the j-th sliding window of the i-th group of data of the sub-matrix by the j-th coefficient in the i-th group of coefficients are performed in parallel, and the multiplication results corresponding to data at the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, are added to obtain N output data. Filtering is thereby achieved, and because multiple multiplications run in parallel, hardware utilization and data processing efficiency can both be improved.
  • Fig. 1 is a schematic diagram of a sliding window operation in an embodiment of the present application.
  • Fig. 2 is another schematic diagram of the sliding window operation of the embodiment of the present application.
  • Fig. 3 is another schematic diagram of the sliding window operation of the embodiment of the present application.
  • Fig. 4 is another schematic diagram of the sliding window operation of the embodiment of the present application.
  • FIG. 5 is a schematic diagram of multiplication of a sub-matrix and a coefficient matrix in an embodiment of the present application.
  • Fig. 6 is another schematic diagram of multiplication of a sub-matrix and a coefficient matrix in an embodiment of the present application.
  • FIG. 7 is another schematic diagram of multiplication of a sub-matrix and a coefficient matrix in an embodiment of the present application.
  • FIG. 8 is another schematic diagram of the multiplication of the sliding window and the coefficient matrix in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of multiplying a row of a sliding window by a row of a coefficient matrix in an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a data processing method according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a manner of reading data in a matrix to be processed in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a register and a multiplexer according to an embodiment of the present application.
  • FIG. 13 is another schematic diagram of a register and a multiplexer according to an embodiment of the present application.
  • FIG. 14 is another schematic diagram of multiplying a row of a sliding window by a row of a coefficient matrix in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of operations in each cycle of an embodiment of the present application.
  • FIG. 16 is another schematic diagram of operations in each cycle of an embodiment of the present application.
  • Fig. 17 is a schematic diagram of a hardware component of an embodiment of the present application.
  • FIG. 18 is another schematic diagram of a data reading manner according to an embodiment of the present application.
  • FIG. 19 is a schematic block diagram of a data processing device according to an embodiment of the present application.
  • Image data can be filtered to achieve, for example, suppression of image noise.
  • When filtering an image, a filter can be used to process the image.
  • The filter mentioned in the embodiments of this application can be implemented by a coefficient matrix, and the filter can be applied to the matrix of image data to be processed by a sliding window operation.
  • The sliding window operation can be applied to various image processing algorithms.
  • The coefficient matrix can be slid over the matrix to be processed. After each slide, the part of the matrix to be processed that is covered by the coefficient matrix can be multiplied by the coefficient matrix, and one value can be output.
  • The sliding may proceed along rows first and then columns, or along columns first and then rows.
  • Take the 3x3 coefficient matrix shown in Figure 1 as an example.
  • The window covered after each slide may be called a sliding window.
  • The sliding window is multiplied by the coefficient matrix to obtain a value O(1,1), and the window then slides along the row direction with a step size of 1.
  • The sliding window obtained after sliding is shown in Figure 2.
  • Sliding and multiplication continue along the row direction until sliding on the corresponding row is complete, and the values O(2,1), O(2,2), ... can be obtained as shown in Figure 3 and Figure 4.
  • Multiplying a sliding window on the matrix to be processed by the coefficient matrix can mean multiplying the data at the same positions and adding the products to obtain one output value.
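  • A minimal Python sketch of the multiply-and-add step described above, assuming a step size of 1 and no padding at the borders. The function name `sliding_window_filter` and the list-of-lists layout are illustrative choices, not part of the application:

```python
def sliding_window_filter(image, coeffs):
    """Slide `coeffs` over `image` with step 1 and return the output matrix.

    Each output value is the sum of element-wise products between the
    coefficient matrix and the window it currently covers.
    """
    rows, cols = len(image), len(image[0])
    h, w = len(coeffs), len(coeffs[0])
    output = []
    for r in range(rows - h + 1):          # slide along the column direction
        out_row = []
        for c in range(cols - w + 1):      # slide along the row direction
            acc = 0
            for i in range(h):
                for j in range(w):
                    acc += image[r + i][c + j] * coeffs[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    coeffs = [[1, 0], [0, 1]]
    print(sliding_window_filter(image, coeffs))
```

  • In this reference form every output value is computed on its own; the method described below reorganizes the same arithmetic so that many multiplications share one coefficient and can run at once.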
  • The embodiments of the present application provide a processing method for the sliding window operation in which multiple multiplications of the sliding window operation are performed at the same time, so that hardware utilization and data processing efficiency can be improved while the above sliding window operation is realized.
  • The matrix to be processed can be divided into multiple sub-matrices, each with height H and width N+W-1. For two adjacent sub-matrices that occupy the same rows, the first columns differ by N columns; for two adjacent sub-matrices that occupy the same columns, the first rows differ by one row.
  • For example, the first column of sub-matrix 1 and the first column of sub-matrix 2 differ by N columns, and the first row of sub-matrix 1 and the first row of sub-matrix 3 differ by 1 row.
  • The coefficient matrix and sub-matrix 1 can be multiplied to obtain the first N data in the first row of the output matrix, and the coefficient matrix and sub-matrix 2 can be multiplied to obtain the (N+1)-th to 2N-th data in the first row of the output matrix.
  • The coefficient matrix is multiplied by sub-matrix 3 to obtain the first N data of the second row of the output matrix, and so on.
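  • The division into sub-matrices can be sketched as follows. This is an illustration only: indices are zero-based, the helper name `submatrix_origins` is hypothetical, and the output row width (columns minus W plus 1) is assumed to be a multiple of N so that every sub-matrix lies fully inside the matrix to be processed:

```python
def submatrix_origins(rows, cols, h, w, n):
    """Return the (top_row, left_col) origin of every sub-matrix.

    Each sub-matrix is h rows high and n + w - 1 columns wide.
    Horizontally adjacent sub-matrices start n columns apart;
    vertically adjacent sub-matrices start one row apart.
    """
    out_width = cols - w + 1
    if out_width % n != 0:
        raise ValueError("sketch assumes the output width is a multiple of n")
    origins = []
    for top in range(rows - h + 1):
        for left in range(0, out_width, n):
            origins.append((top, left))
    return origins


if __name__ == "__main__":
    # 4x10 matrix, 3x3 coefficients, N = 4: each sub-matrix is 3 x 6
    print(submatrix_origins(rows=4, cols=10, h=3, w=3, n=4))
```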
  • The j-th coefficient in the i-th row of the coefficient matrix needs to be multiplied separately by each element of the j-th sliding window in the i-th row of the sub-matrix. Each sliding window includes N elements, so each coefficient needs to be multiplied N times.
  • For the same sub-matrix, the multiplication results corresponding to data at the same position within the sliding window are added to obtain N output data.
  • The coefficient matrix can perform the sliding window operation on the sub-matrix, and N data can be obtained.
  • Coefficient (2,1) of the coefficient matrix can be multiplied separately by each data in the first sliding window of the second row of sub-matrix 1, coefficient (2,2) can be multiplied separately by each data in the second sliding window of the second row of sub-matrix 1, and so on, until coefficient (2,W) is multiplied separately by each data in the W-th sliding window of the second row of sub-matrix 1.
  • The multiplication results corresponding to data at the same position within the sliding window are added together to obtain N output data [O(1,1), O(1,2), ..., O(1,N)].
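  • The per-coefficient organization described above can be sketched in Python. The function name `filter_submatrix` is an assumption for illustration; the inner list comprehension stands in for the N multiplications that the hardware would perform in parallel:

```python
def filter_submatrix(sub, coeffs, n):
    """Compute the N outputs of one sub-matrix.

    `sub` has H rows of n + W - 1 data; `coeffs` has H rows of W coefficients.
    Sliding window j of row i is sub[i][j : j + n].
    """
    h, w = len(coeffs), len(coeffs[0])
    outputs = [0] * n
    for i in range(h):
        for j in range(w):
            window = sub[i][j:j + n]
            # N multiplications sharing one coefficient (parallel in hardware)
            products = [coeffs[i][j] * value for value in window]
            # add results that sit at the same position within the window
            outputs = [acc + p for acc, p in zip(outputs, products)]
    return outputs


if __name__ == "__main__":
    sub = [[1, 2, 3, 4], [5, 6, 7, 8]]   # H = 2, N = 3, W = 2
    coeffs = [[1, 2], [3, 4]]
    print(filter_submatrix(sub, coeffs, n=3))
```

  • The result is the same as sliding the coefficient matrix across the sub-matrix one position at a time; only the order of the multiplications changes.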
  • The calculation in the row direction can be equivalent to performing W internal sliding windows of width N on an input sequence of length (N+W-1) while, at the same time, performing W internal sliding windows of width 1 on the coefficient sequence of length W.
  • The specific implementation can be as shown in Figure 9, which yields [T(1,1), T(1,2), ..., T(1,N)]. After the (N+W-1) data of the first row are processed, the (N+W-1) data of the second row can be read in and the above operations repeated, and so on.
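  • Written out, the row-direction computation can be summarized as below. The symbols are notation assumed here for illustration: c_{(i,j)} is the j-th coefficient in row i of the coefficient matrix, p_{(i,k)} is the k-th datum in row i of the sub-matrix, and T_{(i,n)} generalizes the first-row partial results T(1,n) to any row i:

```latex
T_{(i,n)} = \sum_{j=1}^{W} c_{(i,j)} \, p_{(i,\,n+j-1)}, \qquad n = 1, \dots, N,
\qquad\qquad
O_{(1,n)} = \sum_{i=1}^{H} T_{(i,n)} .
```

  • For a fixed j, the N products c_{(i,j)} p_{(i,n+j-1)} for n = 1, ..., N share one coefficient, which is what allows them to be issued to N multipliers at once.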
  • The multiple multiplications of one coefficient with one window can be processed in parallel, which improves the efficiency of the sliding window operation. Because the device usually contains multiple multipliers, these multipliers can be used at the same time, which improves hardware utilization.
  • FIG. 10 is a schematic flowchart of a data processing method 100 according to an embodiment of the present application.
  • The method can be executed by a data processing device.
  • The data processing device in the embodiments of the present application may be a filter, an encoder, a decoder, a codec, or the like.
  • The method 100 can be used in filtering a matrix to be processed using a coefficient matrix. The matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, and each sliding window has N data. The coefficient matrix includes H sets of coefficients, and each set of coefficients includes W coefficients, where N, H, and W are positive integers.
  • In one implementation, each of the H groups of data is a row of data of the sub-matrix, and each of the H groups of coefficients is a row of coefficients of the coefficient matrix. That is, the sub-matrix includes H rows of data, each row of data includes W sliding windows, each sliding window includes N data, the coefficient matrix includes H rows of coefficients, and each row of coefficients includes W coefficients.
  • The step size with which the coefficient matrix slides in the sub-matrix can be 1.
  • A row of the sub-matrix can include N+W-1 data, which means that the sub-matrix can have N+W-1 columns.
  • Sub-matrix 1, sub-matrix 2, and sub-matrix 3 shown in FIGS. 5-7 are sub-matrices of the kind mentioned in the embodiments of this application.
  • Two adjacent sub-matrices differ by N columns of data.
  • Moving the first sub-matrix N columns to the right gives the second sub-matrix.
  • In another implementation, each of the H groups of data is a column of data of the sub-matrix, and each of the H groups of coefficients is a column of coefficients of the coefficient matrix. That is, the sub-matrix includes H columns of data, each column of data includes W sliding windows, each sliding window includes N data, the coefficient matrix includes H columns of coefficients, and each column of coefficients includes W coefficients.
  • The sliding step of the sliding window in the sub-matrix can be 1.
  • A column of the sub-matrix can include N+W-1 data, that is to say, the sub-matrix has N+W-1 rows.
  • The sub-matrix has H columns. In this implementation, two adjacent sub-matrices differ by N rows of data.
  • The data processing device reads the i-th group of data in the sub-matrix and registers it in the first register, where i is an integer from 1 to H.
  • The data processing device may use multiple cycles to read the i-th group of data. For example, assuming that each group of data includes N+W-1 data, N data can be read in each cycle.
  • The sub-matrix can be read as follows:
  • N data can be read in first, and then a further N data can be read in, until N+W-1 data have been read.
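  • As a rough illustration of this multi-cycle read, the sketch below splits one group of N+W-1 data into per-cycle chunks of at most N data. The helper names are hypothetical, and the chunking simply assumes the last read may be partial:

```python
def read_cycles(n, w):
    """Cycles needed to read n + w - 1 data when at most n are read per cycle."""
    total = n + w - 1
    return (total + n - 1) // n   # ceiling division


def read_schedule(group, n):
    """Split one group of data into the chunks read in successive cycles."""
    return [group[start:start + n] for start in range(0, len(group), n)]


if __name__ == "__main__":
    print(read_cycles(n=32, w=5))               # 36 data, 32 per cycle
    print(read_schedule(list(range(6)), n=4))   # N = 4, W = 3
```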
  • The value of N is determined based on the capacity of the first register used to register the sub-matrix and/or the number of multipliers.
  • The number of multipliers can also limit the amount of sub-matrix data read in each cycle.
  • N is less than or equal to W.
  • The data processing device reads the i-th group of coefficients in the coefficient matrix and registers it in a second register.
  • The W coefficients in the i-th group of coefficients can be read in the first cycle.
  • The data processing device multiplies each of the N data in the j-th sliding window among the W sliding windows included in the i-th group of data by the j-th coefficient among the W coefficients included in the i-th group of coefficients, where j is an integer from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are performed in parallel.
  • For example, the multiplications of coefficient (1,1) by each data in [p(1,1), p(1,2), ..., p(1,N)] can be carried out in parallel, the multiplications of coefficient (1,2) by each data in [p(1,2), p(1,3), ..., p(1,N+1)] can be carried out in parallel, and so on.
  • The data processing device adds the multiplication results corresponding to data at the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, to obtain N output data.
  • The data processing device may add the multiplication results corresponding to the data at the same position in all sliding windows to obtain N data.
  • The multiplication results corresponding to the first data in all sliding windows can be added to obtain the first of the N data, the multiplication results corresponding to the second data in all sliding windows can be added to obtain the second of the N data, and so on, until N data are obtained.
  • The N multiplications of the N data in the first sliding window by the first coefficient are processed in parallel to obtain N first processing results corresponding to the first sliding window.
  • The N first processing results corresponding to the first sliding window are output to a first register, to be combined with the processing results obtained from the sliding windows other than the first sliding window among the W*H sliding windows to obtain the N output data.
  • The processing results obtained from the first sliding window need to be added to the processing results obtained from the other sliding windows, and the other sliding windows are processed in other cycles, so the data can first be registered in the first register.
  • The s*N multiplications corresponding to s sliding windows are processed in parallel, where s is an integer greater than or equal to 2 and less than or equal to W-1.
  • The stored data can correspond to multiple sliding windows, so that the multiplications for multiple sliding windows are performed in parallel.
  • The value of s can be determined according to the number of available multipliers.
  • The s mentioned here is an integer, which means that the multiplications of an integer number of sliding windows are performed in parallel. The embodiments of the present application are not limited to this; the multiplications of a non-integer number of sliding windows may also be performed in parallel.
  • The multiplication results corresponding to data at the same position within the sliding window, across the s sliding windows, are added together to obtain N second processing results.
  • The N second processing results are stored in the third register, to be combined with the processing results obtained from the sliding windows other than the s sliding windows among the W*H sliding windows to obtain the N output data.
  • The multiplication results at the same position in each sliding window may be added.
  • The multiplication results at the same position within the sliding windows can be added, and the value obtained can then be added to the other multiplication results at the same position.
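  • A minimal sketch of processing s sliding windows per cycle for one row, under the simplifying assumption that every cycle handles the same number s of coefficients (the worked example later in the text uses one coefficient in the first multiplication cycle and two afterwards). The function name and the `accumulator` list, which plays the role of the register holding partial results, are illustrative:

```python
def filter_row_in_cycles(row, coeffs_row, n, s):
    """Accumulate one row's contribution, s coefficients (windows) per cycle.

    Returns the N partial results and the number of multiplication cycles used.
    """
    w = len(coeffs_row)
    accumulator = [0] * n
    cycles = 0
    for start in range(0, w, s):
        group = range(start, min(start + s, w))
        # up to s * N multiplications issued together in this cycle
        partial = [sum(coeffs_row[j] * row[j + k] for j in group) for k in range(n)]
        # combine with the results registered in earlier cycles
        accumulator = [acc + p for acc, p in zip(accumulator, partial)]
        cycles += 1
    return accumulator, cycles


if __name__ == "__main__":
    print(filter_row_in_cycles([1, 2, 3, 4, 5], [1, 1, 1], n=3, s=2))
```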
  • Any data that is no longer needed for multiplication is deleted from the first register.
  • If a datum in the sub-matrix is needed for only one multiplication, it becomes invalid after that multiplication and can then be deleted from the first register.
  • After a datum is deleted, the remaining data in the i-th group of data are moved so that the storage location the deleted datum occupied is filled.
  • The remaining data can be moved, and the vacated positions can be filled with new data.
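  • The delete, shift, and refill behaviour can be modelled with a plain list standing in for the first register. This is an illustrative model only; the function name and the `pending` queue of data still waiting to be loaded are assumptions:

```python
def retire_and_refill(register, retired, pending, capacity):
    """Drop `retired` leading data, shift the rest, and refill freed slots.

    Returns the new register contents and the data still waiting to be loaded.
    """
    kept = register[retired:]            # remaining data move into the freed slots
    free = capacity - len(kept)
    return kept + pending[:free], pending[free:]


if __name__ == "__main__":
    register, pending = retire_and_refill([1, 2, 3, 4], retired=1, pending=[5, 6], capacity=4)
    print(register, pending)
```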
  • For example, register 0 can be used to store coefficients, and registers 1 and 2 can be used to store image data. The bit width of each register can be 512 bits, that is, each register can store 512 bits of data, and each datum occupies 16 bits. There are two ports, a and b, used respectively to read the image data in the sub-matrix and the coefficients in the coefficient matrix. The available system bandwidth of each port is 512 bits, and the number of available multipliers is 64.
  • Since the bit width of the register is 512 bits, 32 image data can be stored, and the multiplications of 32 image data by a coefficient can be performed in parallel.
  • If only one coefficient is multiplied at a time, 32 multipliers are used. If more than 32 multipliers are available, for example 64, the multiplier utilization rate is 50%.
  • The multiplications for the first coefficient can be performed at the same time.
  • The multiplications for two coefficients can then be performed at the same time. This is because the stored data have reached 64, the data to be multiplied by the second coefficient are the 2nd to 33rd image data, and the data to be multiplied by the third coefficient are the 3rd to 34th image data.
  • In the cycles other than the first cycle used for multiplication, the multiplications for multiple coefficients, for example two coefficients, can be performed in parallel; that is to say, 64 multipliers may be in use. The utilization rate of the multipliers can be obtained by the following formula 3):
  • W may refer to the number of coefficients included in each row of data.
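  • The figures quoted above (32 data per 512-bit register, 64 multipliers, one coefficient in the first multiplication cycle and two in each later cycle) can be turned into a rough utilization estimate. The sketch below is an illustration under those assumptions only and is not the application's own formula; the cycle count 1 + ceil((W-1)/2) is an assumption chosen to agree with the 3-cycles-per-row figure given for the 3*5 coefficient matrix, and the function names are hypothetical:

```python
def multiply_cycles(w):
    """Assumed cycles to multiply one row of w coefficients:
    one coefficient in the first cycle, then two per cycle."""
    return 1 + -(-(w - 1) // 2)   # 1 + ceil((w - 1) / 2)


def multiplier_utilization(w, lanes=32, multipliers=64):
    """Fraction of multiplier slots doing useful work over one row.

    `lanes` is the number of data multiplied by each coefficient (512 / 16 = 32).
    """
    useful = lanes * w
    available = multipliers * multiply_cycles(w)
    return useful / available


if __name__ == "__main__":
    for w in (1, 3, 5, 9):
        print(w, multiply_cycles(w), round(multiplier_utilization(w), 3))
```

  • Under these assumptions a single-coefficient row keeps half of the 64 multipliers busy, and the estimate rises toward full utilization as W grows, because only the first cycle runs at half width.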
  • Each register may correspond to a multiplexer, and the multiplexer may have multiple selection signals, where each of the selection signals corresponds to one kind of processing of the register.
  • For signal 0, the data or coefficients prepared on the crossbar can be selected and stored in the register (for example, register 0 or 1 below), or no data is read and registered (for example, register 2 below); for signal 1, X bits of data or coefficients in the register are read and then eliminated (for example, registers 0, 1, and 2 below); for signal 2, Y bits of data or coefficients in the register are read and then eliminated (for example, registers 0, 1, and 2 below).
  • The value of X and the value of Y may be different.
  • For register 0, when the selection signal received by the multiplexer is 0, the processing circuit can select the coefficients prepared on the crossbar and store them in register 0 (registering the data in the register can lag selecting it on the crossbar matrix by one cycle).
  • When the selection signal received by the multiplexer is 1, the coefficient (16 bits) in the register can be eliminated; specifically, the data in the register can be shifted right by 16 bits.
  • When the selection signal received by the multiplexer is 2, two coefficients in register 0 can be read for multiplication in the filtering process, and the 2 coefficients (32 bits) in register 0 can be eliminated; specifically, the data in register 0 can be shifted right by 32 bits.
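  • The read-and-shift behaviour of the coefficient register can be modelled with a Python integer standing in for the 512-bit register. The sketch assumes that the next coefficient to use sits in the lowest 16 bits, which is consistent with eliminating it by a right shift; `pack_coefficients` and `pop_coefficients` are hypothetical helper names:

```python
COEFF_BITS = 16
COEFF_MASK = (1 << COEFF_BITS) - 1


def pack_coefficients(coeffs):
    """Pack 16-bit coefficients into one integer, first coefficient lowest."""
    register = 0
    for position, value in enumerate(coeffs):
        register |= (value & COEFF_MASK) << (COEFF_BITS * position)
    return register


def pop_coefficients(register, count):
    """Read `count` coefficients and eliminate them with a right shift.

    count = 1 models selection signal 1 (shift by 16 bits),
    count = 2 models selection signal 2 (shift by 32 bits).
    """
    coeffs = [(register >> (COEFF_BITS * k)) & COEFF_MASK for k in range(count)]
    return coeffs, register >> (COEFF_BITS * count)


if __name__ == "__main__":
    reg = pack_coefficients([1, 2, 3, 4, 5])
    first, reg = pop_coefficients(reg, 1)
    pair, reg = pop_coefficients(reg, 2)
    print(first, pair, reg == pack_coefficients([4, 5]))
```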
  • When the selection signal received by the multiplexers of register 1 and register 2 is 0, the 32 image data prepared on the crossbar matrix can be selected and stored in register 1 (registering the data in the register can lag selecting it from the crossbar matrix by one cycle), and no data is stored in register 2.
  • When the selection signal received by the multiplexers of register 1 and register 2 is 1, the multiplication for the first coefficient starts (a) in Figure 14); after the first image data has been multiplied, it can be deleted.
  • The data in register 1 can then be moved 16 bits to the right; the low-order 16 bits of the newly read 512-bit data can be stored in the high-order bits of register 1, and the remaining 496 bits can be stored in register 2. Starting from the third cycle, the selection signal received by the multiplexers of register 1 and register 2 is 2, and the multiplications for 2 coefficients are performed in each cycle (b) and c) in Figure 14 are performed at the same time), after which 2 image data can be deleted.
  • Register 1 and register 2 can be moved to the right by 32 bits as a whole, until one row of data is processed.
  • The number of cycles T required for the multiplication of a row of coefficients in the coefficient matrix can be obtained by the following formula 4):
  • W may refer to the number of coefficients included in each row of data.
  • A counter (counter 2 as shown in Figure 15) can be set. This counter can be used to determine the current state of the register. After the processing circuit receives the start signal, it can process one row (specifically, the processing starts with obtaining data from the crossbar switch matrix). At this time, the counter can start counting from 0 and is incremented by 1 after each cycle. When the counter reaches T-1, one row of the coefficient matrix has been calculated, the value of the counter can be reset to 0, and processing of the next row of the coefficient matrix can start. For a coefficient matrix with H rows and W columns, the above operations can be performed H times.
  • The count of counter 2 can run from 0 to W-1. When the count of the counter is 0, the selection signal of the register is 0; when the count of the counter is 1, the selection signal of the register is 1; and when the count of the counter is from 2 to W-1, the selection signal of the register is 2.
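  • The mapping from counter value to selection signal can be sketched as a small function. The number of cycles is left as a parameter, since the text does not pin it down here; the per-signal coefficient counts (0, 1, and 2) follow the register 0 description above, and the function names are illustrative:

```python
def selection_signal(count):
    """Selection signal for a given counter value: 0 loads, 1 uses one
    coefficient, 2 uses two coefficients."""
    if count == 0:
        return 0
    if count == 1:
        return 1
    return 2


def coefficients_consumed(num_cycles):
    """Total coefficients consumed when the counter runs 0 .. num_cycles - 1."""
    per_signal = {0: 0, 1: 1, 2: 2}
    return sum(per_signal[selection_signal(count)] for count in range(num_cycles))


if __name__ == "__main__":
    print([selection_signal(count) for count in range(4)])
    print(coefficients_consumed(4))
```

  • With counter values 0 through 3, for example, the signals are 0, 1, 2, 2 and five coefficients are consumed in total, which matches a row of a coefficient matrix with five columns.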
  • The image data and coefficients registered in the registers may be data prepared on the crossbar matrix, and the starting point of the above counter may be the moment data is obtained from the crossbar matrix for storage in the registers.
  • Ports a and b need to be enabled to send read requests to the memory storing the image data and the coefficients. The image data and coefficients are returned after the read request is received, which introduces a certain delay. For example, as shown in Figure 15, there is a three-cycle delay between ports a and b sending the enable signal and the data being available on the crossbar switch matrix for storing to the register.
  • Another counter (for example, counter 1 as shown in Figure 15) can be set, and counter 1 can be used to determine which enable signals ports a and b should send. For example, as shown in Figure 15, when the count of counter 1 is 0, ports a and b send enable signals, and when the count of counter 1 is 1, port a sends an enable signal.
  • FIG. 16 shows the pipeline processing of the sliding window process when the coefficient matrix is 3*5.
  • When the coefficient matrix is 3*5, the data processing for each row takes 3 cycles. The data is prepared on the crossbar matrix 3 cycles after the read request, and the data is registered in the register one cycle after it is prepared on the crossbar matrix. Request 1 and request 2 can be read requests for the first row of data, and data 1 and data 2 are the data requested by request 1 and request 2, respectively; request 3 and request 4 can be read requests for the second row of data, and data 3 and data 4 are the data requested by request 3 and request 4, respectively; request 5 and request 6 can be read requests for the third row of data, and data 5 and data 6 are the data requested by request 5 and request 6, respectively.
  • Before the data and the coefficients are multiplied, the data can be preprocessed. It can be seen from Figure 16 that the preprocessing of the data can be continuous, and the multiplication is also continuous. This is because data 1 and data 2 each include a large number of data, each cycle performs the multiplication for only one or two coefficients, and only one or two data need to be deleted each time. This ensures the continuity of data processing, maximizes the utilization of the multipliers, and improves processing efficiency.
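  • The delays quoted for FIG. 16 (data ready on the crossbar 3 cycles after the read request, and registered 1 cycle after that) can be laid out as a simple timeline. The request issue cycles passed in below are hypothetical inputs, and the function name is illustrative:

```python
CROSSBAR_DELAY = 3   # cycles from read request to data ready on the crossbar
REGISTER_DELAY = 1   # further cycles until the data is held in a register


def pipeline_timeline(request_cycles):
    """Return, per request, when its data reaches the crossbar and the register."""
    timeline = []
    for number, issued in enumerate(request_cycles, start=1):
        ready = issued + CROSSBAR_DELAY
        timeline.append({
            "request": number,
            "issued": issued,
            "on_crossbar": ready,
            "registered": ready + REGISTER_DELAY,
        })
    return timeline


if __name__ == "__main__":
    for entry in pipeline_timeline([0, 1, 3, 4]):
        print(entry)
```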
  • The hardware component shown in FIG. 17 may include at least one parallel streaming memory for storing input data (or coefficients), a parallel execution unit (which may be the processing circuit mentioned above), and at least one parallel streaming memory for storing output data.
  • The parallel streaming memory for storing input data may include, for example, parallel streaming memory A and parallel streaming memory B as shown in FIG. 17, and the parallel streaming memory for storing output data may be, for example, parallel streaming memory C as shown in FIG. 17.
  • Each parallel streaming memory may include at least one random access memory (Random Access Memory, RAM), for example, RAM #1, #2, #3, ..., #N as shown in FIG. 17.
  • The parallel execution unit may include at least one input port, for example, ports a and b (which may correspond to ports a and b mentioned above), respectively connected to at least one parallel streaming memory for storing input data (or coefficients), and the output port c of the parallel execution unit can be connected to the parallel streaming memory for storing output data.
  • The input ports a and b of the parallel execution unit are respectively connected to parallel streaming memory A and parallel streaming memory B, and the output port c of the parallel execution unit is connected to parallel streaming memory C.
  • The parallel streaming memory for storing input data may include an address generation unit (AGU), which may generate parallel read addresses for the RAMs to output data, based on the read request issued by the parallel execution unit.
  • The parallel streaming memory for storing output data may include an AGU, and the AGU may generate write addresses based on a write request of the parallel execution unit.
  • The working process of this hardware component can be as follows.
  • Each parallel read request contains N data (or coefficient) read requests.
  • The AGU of parallel streaming memory A/B generates parallel read addresses of N data (or coefficients) for the N RAMs.
  • Parallel streaming memories A and B each output N parallel read data (or coefficients).
  • The parallel execution unit processes the data (or coefficients) obtained from input ports a/b, and then sends a parallel write request to the AGU of parallel streaming memory C through output port c, together with N parallel write data for parallel streaming memory C.
  • Each parallel write request may include N data write requests.
  • The AGU of parallel streaming memory C generates N parallel write addresses for the N RAMs, and the parallel write data are written into these RAMs.
  • The parallel streaming memory can output/input N data in parallel.
  • The AGU generates N data addresses for N different RAMs.
  • The parallel execution unit can input/process/output N data in parallel.
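  • One way an AGU could produce N addresses that land in N different RAMs is to interleave consecutive data across the RAMs. The mapping below (element index modulo N selects the RAM, integer division gives the offset inside it) is an assumption made for illustration; the text does not specify the address mapping, and the function name is hypothetical:

```python
def agu_addresses(base, n_rams):
    """Return (ram_index, offset) for n_rams consecutive elements from `base`.

    Assumes element e is stored in RAM e % n_rams at offset e // n_rams,
    so any n_rams consecutive elements fall in distinct RAMs.
    """
    return [((base + k) % n_rams, (base + k) // n_rams) for k in range(n_rams)]


if __name__ == "__main__":
    addresses = agu_addresses(base=6, n_rams=4)
    print(addresses)
    # every RAM is addressed exactly once, so the N reads can proceed together
    print(sorted(ram for ram, _ in addresses) == list(range(4)))
```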
  • Figure 18 shows a schematic diagram of data processing. As shown in Figure 18, the address length occupied by each data can be address length 1. For the first row, based on the base address 1, W data can be read from RAM, which is the gray part in Figure 18. The dashed box at the back represents the data that needs to be read later until N+W-1 data are read in.
  • the counter can be 0.
  • the base address becomes address 3 (the address length between address 1 and address 3 is address length 3).
  • starting from address 3, N+W-1 data with address length 1 are read, similar to the processing of the left part of the data, until the right part of the multi-row data has been processed.
  • the hardware components shown in Figure 17 can be used. For each row of data, one data item can be read from each of the N RAMs, and then a further data item from each of the N RAMs, until N+W-1 data have been read.
  • in the filtering of the matrix to be processed using the coefficient matrix, at least the N multiplications of the N data in the j-th sliding window of the i-th group of data of the sub-matrix by the j-th coefficient in the i-th group of coefficients are processed in parallel, and the multiplication results corresponding to data having the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, are added to obtain N output data. Filtering is thereby achieved, and because multiple multiplications are performed in parallel, hardware utilization and data processing efficiency can both be improved.
  • FIG. 19 is a schematic block diagram of a data processing device 200 according to an embodiment of the present application.
  • the device is used for filtering processing of a matrix to be processed using a coefficient matrix
  • the matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, and each sliding window has N data; the coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers;
  • the device 200 includes a control circuit 210, a multiplication circuit 220, an addition circuit 230, a first register 240 and a second register 250;
  • the control circuit 210 is configured to: read the i-th group of data in the sub-matrix and register it in the first register 240, where i is an integer ranging from 1 to H; and read the i-th group of coefficients in the coefficient matrix and register it in the second register 250;
  • the multiplication circuit 220 is configured to: multiply each of the N data in the j-th sliding window among the W sliding windows included in the i-th group of data by the j-th coefficient among the W coefficients included in the i-th group of coefficients, where j is an integer ranging from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are processed in parallel;
  • the addition circuit 230 is configured to perform addition processing on the multiplication processing results corresponding to the data having the same position in the sliding window in the W*H sliding windows included in the sub-matrix, to obtain N output data.
  • the multiplication circuit in the embodiment of the present application may include at least one multiplier.
  • the adding circuit in the embodiment of the present application may include at least one adder.
  • the i-th group of data is read in multiple cycles, and N data are read in each cycle;
  • At least one coefficient in the i-th group of coefficients is read.
  • W coefficients in the i-th group of coefficients are read.
  • the N multiplications of the N data in the first sliding window by the first coefficient are processed in parallel to obtain N first processing results corresponding to the first sliding window.
  • the device 200 further includes a third register 260, and the control circuit 210 is configured to:
  • output the N first processing results corresponding to the first sliding window to the third register, to be combined with the processing results of the sliding windows other than the first sliding window among the W*H sliding windows, so as to obtain the N output data.
  • the s*N multiplications corresponding to s sliding windows are processed in parallel, where s is an integer greater than or equal to 2 and less than or equal to W-1.
  • the value of s is determined based on the number of available multipliers.
  • adding together the multiplication results corresponding to data having the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, to obtain N output data includes:
  • the N second processing results are stored in the third register, to be combined with the processing results of the sliding windows other than the s sliding windows among the W*H sliding windows, so as to obtain the N output data.
  • the W is less than or equal to N.
  • the value of N is determined based on the capacity of the first register and/or the number of multipliers included in the multiplication circuit.
  • control circuit 210 is further configured to:
  • when any data in the i-th group of data has been read from the first register for multiplication processing, that data is deleted from the first register.
  • control circuit 210 is further configured to:
  • the remaining data in the i-th group of data is shifted so that the storage location vacated by the deleted data is occupied.
  • each group of data in the H group of data is a row of data of the sub-matrix
  • each group of coefficients in the H group of coefficients is a row of coefficients in the coefficient matrix
  • two adjacent sub-matrices differ by N columns of data.
  • each group of data in the H group of data is a column of data of the sub-matrix
  • each group of coefficients in the H group of coefficients is a column of coefficients of the coefficient matrix
  • two adjacent sub-matrices differ by N rows of data.
  • the device 200 may be used to implement the corresponding operations implemented by the data processing device in the foregoing method embodiments, and for the sake of brevity, details are not repeated here.
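The step-wise behaviour described in the items above, in which s sliding windows (s*N multiplications) are handled per step and partial results are held in the third register, can be sketched in software. This is a minimal model under stated assumptions, not the patented circuit: the function name `filter_group_in_steps` and the list `third_register` are illustrative, and the inner loops run sequentially here whereas the device would use parallel multipliers.

```python
def filter_group_in_steps(data_group, coeff_group, N, s):
    """Process one group (N + W - 1 data, W coefficients), s windows per step."""
    W = len(coeff_group)
    assert len(data_group) == N + W - 1
    third_register = [0] * N  # holds partial results between steps
    steps = 0
    for start in range(0, W, s):
        # Up to s windows, i.e. up to s * N multiplications, belong to one step.
        for j in range(start, min(start + s, W)):
            for n in range(N):
                third_register[n] += coeff_group[j] * data_group[j + n]
        steps += 1
    return third_register, steps


partial, steps = filter_group_in_steps([1, 2, 3, 4, 5, 6], [1, 10, 100], N=4, s=2)
print(partial, steps)
```

With W = 3 coefficients and s = 2, the group needs two steps; a larger s (more available multipliers) would reduce the step count.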

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

A data processing method and device, which can improve the processing efficiency in the process of filtering an image. Said method is used for using a coefficient matrix to filter a matrix to be processed, a sub-matrix of the matrix to be processed comprises H groups of data, each group of data comprises W sliding windows, and each sliding window has N pieces of data, the coefficient matrix comprises H groups of coefficients, and each group of coefficients comprises W coefficients. Said method comprises: reading and registering an i-th group of data in the sub-matrix; reading and registering an i-th group of coefficients in the coefficient matrix; respectively performing multiplication processing on the N pieces of data in a j-th sliding window among the W sliding windows included in the i-th group of data and a j-th coefficient among the W coefficients included in the i-th group of coefficients, at least N multiplication processes of the N pieces of data in the j-th sliding window and the j-th coefficient being parallel processes, and adding the multiplication results corresponding to the data having the same position in the sliding windows among the W*H sliding windows included in the sub-matrix.

Description

Data processing method and device
Copyright notice
The content disclosed in this patent document contains material that is subject to copyright protection. The copyright belongs to the copyright owner. The copyright owner does not object to anyone copying the patent document or the patent disclosure as it appears in the official records and files of the Patent and Trademark Office.
Technical field
This application relates to the field of data processing, and more specifically, to a data processing method and device.
Background
In image processing, image data can be filtered, for example to suppress image noise.
When filtering an image, a filter can be used to process the image. How to improve processing efficiency during image filtering is an urgent problem to be solved.
Summary of the invention
The embodiments of the present application provide a data processing method and device that can improve processing efficiency during image filtering.
In a first aspect, a data processing method is provided. The method is used for filtering a matrix to be processed using a coefficient matrix. The matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, and each sliding window has N data; the coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers. The method includes: reading the i-th group of data in the sub-matrix and registering it in a first register, where i is an integer ranging from 1 to H; reading the i-th group of coefficients in the coefficient matrix and registering it in a second register; multiplying each of the N data in the j-th sliding window among the W sliding windows included in the i-th group of data by the j-th coefficient among the W coefficients included in the i-th group of coefficients, where j is an integer ranging from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are processed in parallel; and adding the multiplication results corresponding to data having the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, to obtain N output data.
In a second aspect, a data processing device is provided. The device is used for filtering a matrix to be processed using a coefficient matrix. The matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, and each sliding window has N data; the coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers. The device includes a control circuit, a multiplication circuit, an addition circuit, a first register, and a second register. The control circuit is configured to: read the i-th group of data in the sub-matrix and register it in the first register, where i is an integer ranging from 1 to H; and read the i-th group of coefficients in the coefficient matrix and register it in the second register. The multiplication circuit is configured to: multiply each of the N data in the j-th sliding window among the W sliding windows included in the i-th group of data by the j-th coefficient among the W coefficients included in the i-th group of coefficients, where j is an integer ranging from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are processed in parallel. The addition circuit is configured to: add the multiplication results corresponding to data having the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, to obtain N output data.
Therefore, in the embodiments of the present application, when the matrix to be processed is filtered using the coefficient matrix, at least the N multiplications of the N data in the j-th sliding window of the i-th group of data of the sub-matrix by the j-th coefficient in the i-th group of coefficients are processed in parallel, and the multiplication results corresponding to data having the same position within the sliding window, across the W*H sliding windows included in the sub-matrix, are added to obtain N output data. Filtering is thereby achieved, and because multiple multiplications are performed in parallel, hardware utilization and data processing efficiency can both be improved.
Description of the drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative work.
FIG. 1 is a schematic diagram of a sliding window operation according to an embodiment of the present application.
FIG. 2 is another schematic diagram of a sliding window operation according to an embodiment of the present application.
FIG. 3 is another schematic diagram of a sliding window operation according to an embodiment of the present application.
FIG. 4 is another schematic diagram of a sliding window operation according to an embodiment of the present application.
FIG. 5 is a schematic diagram of multiplying a sub-matrix by a coefficient matrix according to an embodiment of the present application.
FIG. 6 is another schematic diagram of multiplying a sub-matrix by a coefficient matrix according to an embodiment of the present application.
FIG. 7 is another schematic diagram of multiplying a sub-matrix by a coefficient matrix according to an embodiment of the present application.
FIG. 8 is another schematic diagram of multiplying a sliding window by a coefficient matrix according to an embodiment of the present application.
FIG. 9 is a schematic diagram of multiplying a row of a sliding window by a row of a coefficient matrix according to an embodiment of the present application.
FIG. 10 is a schematic flowchart of a data processing method according to an embodiment of the present application.
FIG. 11 is a schematic diagram of a manner of reading data in a matrix to be processed according to an embodiment of the present application.
FIG. 12 is a schematic diagram of a register and a multiplexer according to an embodiment of the present application.
FIG. 13 is another schematic diagram of a register and a multiplexer according to an embodiment of the present application.
FIG. 14 is another schematic diagram of multiplying a row of a sliding window by a row of a coefficient matrix according to an embodiment of the present application.
FIG. 15 is a schematic diagram of the operations in each cycle according to an embodiment of the present application.
FIG. 16 is another schematic diagram of the operations in each cycle according to an embodiment of the present application.
FIG. 17 is a schematic diagram of a hardware component according to an embodiment of the present application.
FIG. 18 is another schematic diagram of a data reading manner according to an embodiment of the present application.
FIG. 19 is a schematic block diagram of a data processing device according to an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative work fall within the protection scope of this application.
Unless otherwise specified, all technical and scientific terms used in the embodiments of the present application have the same meaning as commonly understood by those skilled in the technical field of the present application. The terminology used in this application is only for the purpose of describing specific embodiments and is not intended to limit the scope of this application.
In image processing, image data can be filtered, for example to suppress image noise.
When filtering an image, a filter can be used to process the image. The filter mentioned in the embodiments of the present application can be implemented by a coefficient matrix, and filtering can be realized by performing a sliding window operation with the filter over the matrix of image data to be processed. The sliding window operation can be applied in various image processing algorithms.
Specifically, the coefficient matrix can be slid over the matrix to be processed. After each slide, the part of the matrix to be processed that is covered by the coefficient matrix can be multiplied by the coefficient matrix, and one value can be output. The coefficient matrix can slide over the matrix to be processed along rows first and then columns, or along columns first and then rows.
The function to be realized by the sliding window operation is described below with reference to Figure 1, which takes a 3x3 coefficient matrix
Figure PCTCN2019083284-appb-000001
as an example. In the embodiments of the present application, the window covered after each slide may be called a sliding window.
As shown in Figure 1, the sliding window
Figure PCTCN2019083284-appb-000002
is multiplied by the coefficient matrix to obtain a value O(1,1). The window then slides by a stride of 1 in the row direction, and the resulting sliding window is, as shown in Figure 2,
Figure PCTCN2019083284-appb-000003
This sliding window is multiplied by the coefficient matrix to obtain the value O(1,2), and so on, until sliding in the row direction is complete. The window then slides once by a stride of 1 in the column direction and continues to slide by a stride of 1 in the row direction, with a multiplication at each position, until sliding on that row is complete. This yields the output values O(2,1), O(2,2), ... as shown in Figures 3 and 4, and so on, until sliding over the entire matrix to be processed is complete.
Multiplying a sliding window on the matrix to be processed by the coefficient matrix can consist of multiplying the data at the same positions and then adding the resulting products, which gives one output value.
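As a point of reference, the sliding window operation described above (stride 1, no padding, same-position products summed into one output value) can be written as a naive nested loop. This is a minimal sketch for illustration only; the function name `sliding_window_filter` and the sample matrices are assumptions, not taken from the application.

```python
def sliding_window_filter(p, alpha):
    """Naive reference: slide alpha over p with stride 1 and no padding."""
    H, W = len(alpha), len(alpha[0])
    rows, cols = len(p), len(p[0])
    out = []
    for r in range(rows - H + 1):          # slide in the column direction
        out_row = []
        for c in range(cols - W + 1):      # slide in the row direction
            acc = 0
            for i in range(H):
                for j in range(W):
                    # Same-position product between window and coefficients.
                    acc += alpha[i][j] * p[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


p = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
alpha = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 1]]
print(sliding_window_filter(p, alpha))  # [[18, 21]]
```

Every output value here costs H*W sequential multiplications; the scheme described in this application reorganizes the same arithmetic so that N of them can run in parallel.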
For example, the output value O(1,1) can be calculated as shown in formula 1):
Figure PCTCN2019083284-appb-000004
Likewise, the output value O(1,2) can be calculated as shown in formula 2):
Figure PCTCN2019083284-appb-000005
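The formula images are not reproduced in this text. Based on the surrounding description (a 3x3 coefficient matrix with entries α(i,j), data p(i,j), same-position products summed), formulas 1) and 2) presumably take the following form. This is a reconstruction under that assumption, not a copy of the original figures:

```latex
\begin{align}
O_{(1,1)} &= \sum_{i=1}^{3}\sum_{j=1}^{3} \alpha_{(i,j)}\, p_{(i,j)} \tag{1} \\
O_{(1,2)} &= \sum_{i=1}^{3}\sum_{j=1}^{3} \alpha_{(i,j)}\, p_{(i,j+1)} \tag{2}
\end{align}
```

More generally, under the same assumption and with stride 1, $O_{(r,c)} = \sum_{i}\sum_{j} \alpha_{(i,j)}\, p_{(r+i-1,\,c+j-1)}$.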
The embodiments of the present application provide a way of processing the sliding window operation in which multiple multiplications of the operation are performed simultaneously, so that hardware utilization and data processing efficiency are improved while the sliding window operation described above is realized.
The operating principle of the embodiments of the present application is introduced first. From formulas 1) and 2) above and Figures 1 to 4, it can be seen that α(1,1) needs to be multiplied by each of the data p(1,1), p(1,2), ... in the first row of the matrix to be processed, up to the third-to-last data of that row; α(1,2) needs to be multiplied by each of the data p(1,2), p(1,3), ... in the first row, up to the second-to-last data of that row; and α(1,3) needs to be multiplied by each of the data p(1,3), p(1,4), ... in the first row, up to the last data of that row. Besides the first row, α(1,1), α(1,2), and α(1,3) also need to be multiplied by data in the second row, the third row, ... up to the third-to-last row of the matrix to be processed, and the positions of the data to be multiplied in each of those rows are analogous to those in the first row.
Correspondingly, α(2,1) needs to be multiplied by each of the data p(2,1), p(2,2), ... in the second row of the matrix to be processed, up to the third-to-last data of that row; α(2,2) needs to be multiplied by each of the data p(2,2), p(2,3), ... in the second row, up to the second-to-last data of that row; and α(2,3) needs to be multiplied by each of the data p(2,3), p(2,4), ... in the second row, up to the last data of that row. Besides the second row, α(2,1), α(2,2), and α(2,3) also need to be multiplied by data in the third row, the fourth row, ... up to the second-to-last row of the matrix to be processed, and the positions of the data to be multiplied in each of those rows are analogous to those in the second row.
Similarly, α(3,1) needs to be multiplied by each of the data p(3,1), p(3,2), ... in the third row of the matrix to be processed, up to the third-to-last data of that row; α(3,2) needs to be multiplied by each of the data p(3,2), p(3,3), ... in the third row, up to the second-to-last data of that row; and α(3,3) needs to be multiplied by each of the data p(3,3), p(3,4), ... in the third row, up to the last data of that row. Besides the third row, α(3,1), α(3,2), and α(3,3) also need to be multiplied by data in the fourth row, the fifth row, ... up to the last row of the matrix to be processed, and the positions of the data to be multiplied in each of those rows are analogous to those in the third row.
In the embodiments of the present application, as shown in Figures 5 to 7, assume that the coefficient matrix is an H×W matrix with height H and width W. The matrix to be processed can be divided into multiple sub-matrices, each with height H and width N+W-1. For two adjacent sub-matrices occupying the same rows, their first columns are N columns apart; for two adjacent sub-matrices occupying the same columns, their first rows are one row apart. As shown in Figures 5 and 6, the first column of sub-matrix 1 and the first column of sub-matrix 2 are N columns apart, and as shown in Figures 5 and 7, the first row of sub-matrix 1 and the first row of sub-matrix 3 are 1 row apart.
As shown in Figures 5 to 7, multiplying the coefficient matrix by sub-matrix 1 gives the first N data of the first row of the output matrix, multiplying the coefficient matrix by sub-matrix 2 gives the (N+1)-th to 2N-th data of the first row of the output matrix, multiplying the coefficient matrix by sub-matrix 3 gives the first N data of the second row of the output matrix, and so on.
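The partitioning described above can be sketched as a small generator that walks the matrix to be processed and yields sub-matrices of height H and width N+W-1. This is a sketch under stated assumptions: the function name `split_into_submatrices` is illustrative, zero-based indices are used, and the output width (columns - W + 1) is assumed to be a multiple of N, since the text does not say how a leftover edge would be handled.

```python
def split_into_submatrices(p, H, W, N):
    """Yield (out_row, out_col, submatrix); each submatrix is H x (N + W - 1)."""
    rows, cols = len(p), len(p[0])
    width = N + W - 1
    for r in range(rows - H + 1):                # vertical neighbours are 1 row apart
        for c in range(0, cols - width + 1, N):  # horizontal neighbours are N columns apart
            sub = [row[c:c + width] for row in p[r:r + H]]
            yield r, c, sub


# 4 x 6 matrix whose entry at (r, c) is 10 * r + c, filtered by a 3 x 3 kernel, N = 2.
p = [[10 * r + c for c in range(6)] for r in range(4)]
tiles = list(split_into_submatrices(p, H=3, W=3, N=2))
print(len(tiles))      # 4
print(tiles[1][2][0])  # [2, 3, 4, 5]
```

Each yielded tuple carries the position of the first of the N output values that the sub-matrix produces, so the results can be written back into the output matrix in any order.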
From the formulas for the elements of the output matrix shown in formulas 1) and 2) above, it can be deduced that the j-th coefficient in the i-th row of the coefficient matrix needs to be multiplied by each element in the i-th row of the j-th sliding window of the sub-matrix. Each sliding window includes N elements, so each coefficient requires N multiplications. Within the same sub-matrix, the multiplication results corresponding to data having the same position within the sliding window are added to obtain N output data. In other words, performing the sliding window operation of the coefficient matrix on one sub-matrix yields N data.
Taking the data [O(1,1), O(1,2), ..., O(1,N)] as an example, the calculation can proceed as shown in Figure 8. There, α(1,1) is multiplied by each data in the first sliding window of the first row of sub-matrix 1, α(1,2) is multiplied by each data in the second sliding window of the first row of sub-matrix 1, and so on, until α(1,W) is multiplied by each data in the W-th sliding window of the first row of sub-matrix 1. For the second row, α(2,1) is multiplied by each data in the first sliding window of the second row of sub-matrix 1, α(2,2) is multiplied by each data in the second sliding window of the second row of sub-matrix 1, and so on, until α(2,W) is multiplied by each data in the W-th sliding window of the second row of sub-matrix 1. The other rows follow by analogy. The multiplication results corresponding to data having the same position within the sliding window are then added to obtain the N output data [O(1,1), O(1,2), ..., O(1,N)].
Based on the above analysis, and as can be seen from Figure 9, taking the first row as an example, the computation in the row direction is equivalent to performing W internal slides of a window of width N over an input sequence of length (N+W-1), while performing W internal slides of a window of width 1 over a coefficient sequence of length W. A specific implementation may be as shown in Figure 9, and yields [T(1,1), T(1,2), ..., T(1,N)]. After the (N+W-1) data of the first row have been processed, the (N+W-1) data of the second row can be read in and the above operations repeated, and so on.
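The row-direction computation described above can be sketched as W multiply-accumulate passes, each applying one coefficient to a width-N window of the input sequence. This is a minimal sketch under stated assumptions: the function name `row_partial_sums` is illustrative, and the list comprehension stands in for the N multiplications that the hardware would perform in parallel.

```python
def row_partial_sums(data_row, coeff_row, N):
    """Return [T(1), ..., T(N)] for one row of N + W - 1 data and W coefficients."""
    W = len(coeff_row)
    assert len(data_row) == N + W - 1
    T = [0] * N
    for j in range(W):
        window = data_row[j:j + N]                     # j-th internal window, width N
        products = [coeff_row[j] * x for x in window]  # N multiplications, parallel in hardware
        T = [t + q for t, q in zip(T, products)]       # add by position within the window
    return T


print(row_partial_sums([1, 2, 3, 4, 5], [2, 3], N=4))  # [8, 13, 18, 23]
```

Summing the T vectors of all H rows of a sub-matrix position by position gives the N output data of that sub-matrix.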
As can be seen from the above description, in the sliding window operation one coefficient of the coefficient matrix is multiplied by all the data in one sliding window, different coefficients are multiplied by the data of different sliding windows, and the products obtained for the individual sliding windows are added position by position. The multiple multiplications of one coefficient over one window can therefore be processed in parallel, which improves the efficiency of the sliding window operation. Because a device typically contains multiple multipliers, those multipliers can be used simultaneously, which improves hardware utilization.
FIG. 10 is a schematic flowchart of a data processing method 100 according to an embodiment of the present application. The method can be executed by a data processing device. The data processing device in the embodiments of the present application may be a filter, an encoder, a decoder, a codec, or the like.
The method 100 can be used for filtering a matrix to be processed using a coefficient matrix. The matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, and each sliding window has N data; the coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers.
在一种实现方式中,所述H组数据中每组数据为所述子矩阵的一行数据,所述H组系数中每组系数为所述系数矩阵的一行系数。In an implementation manner, each group of data in the H group of data is a row of data of the sub-matrix, and each group of coefficients in the H group of coefficients is a row of coefficients of the coefficient matrix.
也就是说,所述子矩阵包括H行数据,每行数据包括W个滑窗,每个滑窗包括N个数据,系数矩阵包括H行系数,每行系数包括W个系数。That is, the sub-matrix includes H rows of data, each row of data includes W sliding windows, each sliding window includes N data, and the coefficient matrix includes H rows of coefficients, and each row of coefficients includes W coefficients.
在本申请实施例中,系数矩阵在子矩阵中滑动的步长可以为1,则此时,子矩阵一行包括的数据可以是N+W-1个,也就是说该子矩阵可以具有N+W-1列。In the embodiment of this application, the step length of the coefficient matrix sliding in the sub-matrix can be 1. At this time, the data included in a row of the sub-matrix can be N+W-1, which means that the sub-matrix can have N+ W-1 column.
例如,如图5-7中所示的子矩阵1、子矩阵2和子矩阵3即为本申请实施例提到的子矩阵,For example, sub-matrix 1, sub-matrix 2, and sub-matrix 3 shown in FIGS. 5-7 are the sub-matrices mentioned in the embodiment of this application.
在该种实现方式中,相邻两个子矩阵相差N列数据。例如,如图5和6所示,第一个子矩阵向右移动N列得到了第二个子矩阵。In this implementation, two adjacent sub-matrices differ by N columns of data. For example, as shown in Figures 5 and 6, the first sub-matrix is moved N columns to the right to get the second sub-matrix.
在一种实现方式中,所述H组数据中每组数据为所述子矩阵的一列数据,所述H组系数中每组系数为所述系数矩阵的一列系数。In an implementation manner, each group of data in the H group of data is a column of data of the sub-matrix, and each group of coefficients in the H group of coefficients is a column of coefficients of the coefficient matrix.
That is, the sub-matrix includes H columns of data, each column includes W sliding windows, and each sliding window includes N data; the coefficient matrix includes H columns of coefficients, and each column includes W coefficients.
In the embodiments of the present application, the sliding window may slide within the sub-matrix with a step size of 1. In that case, one column of the sub-matrix may include N+W-1 data; that is, the sub-matrix has N+W-1 rows and H columns. In this implementation, two adjacent sub-matrices are offset by N rows of data.
In 110, the data processing device reads the i-th group of data in the sub-matrix and registers it in a first register, where i is an integer ranging from 1 to H.
Optionally, in the embodiments of the present application, the data processing device may read the i-th group of data over multiple cycles. For example, assuming each group of data includes N+W-1 data and N data can be read per cycle, the group can be read completely over multiple cycles.
For example, as shown in FIG. 11, the sub-matrix may be read as follows:
1) For row 1, N data may be read in first, then another N data, and so on until N+W-1 data have been read in.
2) For row 2, N data are read in first, then another N data, and so on until N+W-1 data have been read in.
3) And so on, until all H rows have been read.
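The reading order above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes that the last fetch of a row is truncated so that exactly N+W-1 data are read, which the text does not specify.

```python
def read_row_in_chunks(row, n, w):
    """Yield the data fetched in each cycle for one row of N + W - 1 data."""
    total = n + w - 1
    fetched = 0
    while fetched < total:
        size = min(n, total - fetched)   # assumption: the last fetch may be partial
        yield row[fetched:fetched + size]
        fetched += size


def read_submatrix(sub, n, w):
    """Read the H rows one after another, each row in chunks of at most N data."""
    return [list(read_row_in_chunks(row, n, w)) for row in sub]


sub = [[0, 1, 2, 3, 4, 5],
       [6, 7, 8, 9, 10, 11]]            # H = 2 rows, N + W - 1 = 6 columns
print(read_submatrix(sub, n=4, w=3))
# [[[0, 1, 2, 3], [4, 5]], [[6, 7, 8, 9], [10, 11]]]
```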
Optionally, in the embodiments of the present application, there may be multiple first registers for registering the data of the sub-matrix.
Optionally, in the embodiments of the present application, the value of N is determined based on the capacity of the first register used to register the sub-matrix and/or the number of multipliers.
Specifically, because the capacity of the register used to register the sub-matrix is limited, it limits the amount of data read per cycle. Because the data read are used for multiplication, the number of multipliers may likewise limit the amount of sub-matrix data read per cycle.
Optionally, N is less than or equal to W.
In 120, the data processing device reads the i-th group of coefficients in the coefficient matrix and registers it in a second register.
Optionally, in the embodiments of the present application, in the first of the multiple cycles in which the i-th group of data is read, at least one coefficient of the i-th group of coefficients is read.
Specifically, after the data in the sub-matrix are read, they need to be multiplied with the coefficients in the coefficient matrix. To ensure that this multiplication can be performed in time, the coefficients may be read in the first cycle in which the i-th group of data is read. That is, in the first cycle, coefficients and data may be read in parallel.
Because the amount of data in each group of coefficients of the coefficient matrix is small, all W coefficients of the i-th group may be read within that first cycle.
In 130, the data processing device multiplies each of the N data in the j-th sliding window of the W sliding windows included in the i-th group of data by the j-th coefficient of the W coefficients included in the i-th group of coefficients, where j is an integer ranging from 1 to W, and at least the N multiplications of the N data in the j-th sliding window by the j-th coefficient are performed in parallel.
For example, as shown in FIG. 9, the multiplications of α(1,1) with each of the data [p(1,1), p(1,2), ..., p(1,N)] may be performed in parallel, the multiplications of α(1,2) with each of the data [p(1,2), p(1,3), ..., p(1,N+1)] may be performed in parallel, and so on.
In 140, the data processing device adds together the multiplication results corresponding to data that occupy the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to obtain N output data.
Specifically, the data processing device may add together the multiplication results corresponding to the data at the same position in all sliding windows to obtain N data. For example, the multiplication results corresponding to the first datum of every sliding window are added to obtain the first of the N data, the multiplication results corresponding to the second datum of every sliding window are added to obtain the second of the N data, and so on until all N data are obtained.
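Steps 110 to 140 can be put together in a short Python sketch for the row-wise implementation. The function name and the sample values are hypothetical; the sketch assumes a step size of 1 and runs sequentially, so the parallelism of step 130 is only indicated in a comment.

```python
def filter_submatrix(sub, coeff, n):
    """Produce N output data from an H x (N + W - 1) sub-matrix and an H x W coefficient matrix."""
    h, w = len(coeff), len(coeff[0])
    assert len(sub) == h and all(len(row) == n + w - 1 for row in sub)
    out = [0] * n
    for i in range(h):
        data = sub[i]        # step 110: i-th group of data
        alphas = coeff[i]    # step 120: i-th group of coefficients
        for j in range(w):
            window = data[j:j + n]   # j-th sliding window of the group
            # Step 130: N multiplications by the j-th coefficient
            # (independent of each other, hence parallelizable).
            products = [alphas[j] * p for p in window]
            # Step 140: add results that share a position within the window.
            for k in range(n):
                out[k] += products[k]
    return out


sub = [[1, 2, 3, 4],
       [5, 6, 7, 8]]      # H = 2, N = 3, W = 2
coeff = [[1, 2],
         [3, 4]]
print(filter_submatrix(sub, coeff, n=3))   # [44, 54, 64]
```

The first output, for instance, is 1*1 + 2*2 + 3*5 + 4*6 = 44, the sum over all W*H windows of the product at window position 1.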
Optionally, in the embodiments of the present application, after the data read in the first cycle have been registered and before the data read in the second cycle are registered, the N multiplications of the N data in the first sliding window by the first coefficient are performed in parallel to obtain N first processing results corresponding to the first sliding window.
Optionally, in the embodiments of the present application, the N first processing results corresponding to the first sliding window are output to the first register, to be combined with the processing results obtained from the sliding windows other than the first among the W*H sliding windows so as to obtain the N output data.
Specifically, the processing results obtained from the first sliding window in the first cycle need to be added to the processing results obtained from the other sliding windows, and those other sliding windows are processed in other cycles. The results can therefore be held in the first register in the meantime.
Optionally, in the embodiments of the present application, after the data are read and registered in the second cycle, the s*N multiplications corresponding to s sliding windows are performed in parallel, where s is an integer greater than or equal to 2 and less than or equal to W-1.
Specifically, after the data of the first cycle have been read, new data are read in the other cycles. When the step size of the sliding window is small, for example 1, the stored data can correspond to multiple sliding windows, so the multiplications for multiple sliding windows can be performed in parallel.
The value of s may be determined according to the number of available multipliers.
The s mentioned here is an integer; that is, the multiplications of an integer number of sliding windows may be performed in parallel. It should be understood, however, that the embodiments of the present application are not limited to this: the multiplications of a non-integer number of sliding windows may also be performed in parallel.
Optionally, in the embodiments of the present application, the multiplication results corresponding to data that occupy the same position within the s sliding windows are added to obtain N second processing results; the N second processing results are stored in a third register, to be combined with the processing results obtained from the sliding windows other than the s sliding windows among the W*H sliding windows so as to obtain the N output data.
In the embodiments of the present application, the multiplication results at the same position within the individual sliding windows may be added after all image data of the current sub-matrix have been multiplied. Alternatively, the multiplication results at the same position may be added after only part of the image data has been multiplied, and the resulting sum is then added to the remaining multiplication results at the same position.
Optionally, in the embodiments of the present application, after any datum is read from the first register for multiplication, that datum is deleted from the first register.
Specifically, because a datum in the sub-matrix may be needed for multiplication only once, the datum becomes invalid after that multiplication and can then be deleted from the first register.
Optionally, in the embodiments of the present application, after the datum is deleted from the first register, the remaining data of the i-th group are moved so that the storage location formerly occupied by that datum is occupied.
Specifically, because the data are read in a fixed order, the remaining data can be shifted, and the vacated locations can be filled with new data.
To make the present application easier to understand, the following scenario is used as an example: N=32 and W=32; there are three registers, namely register 0, register 1, and register 2; register 0 may be used to store coefficients, and registers 1 and 2 may be used to store image data; each register may be 512 bits wide, that is, each register can store 512 bits of data; each datum occupies 16 bits; there are two ports, a and b, used to read the image data of the sub-matrix and the coefficients of the coefficient matrix, respectively; the available system bandwidth of each port is 512 bits; and 64 multipliers are available.
Because each register is 512 bits wide, it can store 32 image data, so the multiplications of 32 image data with a coefficient can be performed in parallel.
If the multiplications for only one coefficient are performed in each cycle, 32 multipliers are used. If more than 32 multipliers exist internally, for example 64, the multiplier utilization is 50%.
When the processing circuit reads data, in cycle A the 32 image data read through port a may be registered in register 1, and the 32 coefficients read through port b may be registered in register 0. In the cycle after cycle A, the 32 image data read through port a may be registered in register 2 (part of these data may be registered in register 1), so that 64 image data may then be present.
For the data stored in cycle A, the multiplications by the first coefficient can be performed simultaneously. For the data stored in the cycle after cycle A, the multiplications by two coefficients can be performed simultaneously, because 64 data are now stored: the second coefficient needs to be multiplied with the 2nd to 33rd image data, and the third coefficient with the 3rd to 34th image data.
Therefore, to improve multiplier utilization, the multiplications for multiple coefficients, for example two coefficients, may be performed in parallel in every cycle other than the first cycle used for multiplication. In that case all 64 multipliers can be used, and the multiplier utilization can be obtained from the following formula 3):
[Formula 3): Figure PCTCN2019083284-appb-000006]
Here, W may denote the number of coefficients included in each row.
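Formula 3) is available only as an image, so the following Python sketch does not reproduce it. It merely works through the arithmetic of the schedule described in the text, under these assumptions: the first multiplication cycle handles one coefficient, every later cycle handles up to `per_cycle` coefficients, and each coefficient occupies N multipliers. The function names and the `per_cycle` parameter are hypothetical.

```python
import math


def multiply_cycles(w, per_cycle=2):
    """Cycles needed for W coefficients: one in the first cycle, `per_cycle` in each later cycle."""
    return 1 + math.ceil((w - 1) / per_cycle)


def multiplier_utilization(w, n, multipliers, per_cycle=2):
    """Fraction of multiplier slots doing useful work over one row of coefficients."""
    useful = w * n                                   # one multiplication per coefficient and window position
    available = multiply_cycles(w, per_cycle) * multipliers
    return useful / available


# Scenario from the text: N = 32, W = 32, 64 multipliers.
print(multiply_cycles(32))                                    # 17
print(multiplier_utilization(32, 32, 64, per_cycle=1))        # 0.5
print(round(multiplier_utilization(32, 32, 64), 3))           # 0.941
```

Under these assumptions, handling one coefficient per cycle gives the 50% figure stated in the text, while handling two coefficients per cycle after the first raises the figure to 1024/1088, roughly 94%.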
In the embodiments of the present application, each register may have a corresponding multiplexer, and the multiplexer may have multiple selection signals, each of which corresponds to one kind of processing of the register. When the multiplexer receives a selection signal, it can enable the register to perform the corresponding processing.
For example, there are three selection signals: signal 0, signal 1, and signal 2. For signal 0, the data or coefficients prepared on the crossbar are selected and stored in the register (for example, register 0 or 1 below), or no data are read and registered (for example, register 2 below). For signal 1, X bits of data or coefficients in the register are read and then eliminated (for example, registers 0, 1, and 2 below). For signal 2, Y bits of data or coefficients in the register are read and then eliminated (for example, registers 0, 1, and 2 below). The values of X and Y may differ.
As shown in FIG. 12, for register 0, which stores the coefficients: in the first cycle, the selection signal received by the multiplexer may be 0, and the processing circuit then selects the coefficients prepared on the crossbar and stores them in register 0 (registering data in the register may lag selecting the data on the crossbar by one cycle). In the second cycle, the selection signal received by the multiplexer may be 1; one coefficient in register 0 is read for the multiplication in the filtering process (see a) in FIG. 14), and that coefficient (16 bits) can then be eliminated from the register, specifically by shifting the data in the register right by 16 bits. From the third cycle onward, the selection signal received by the multiplexer may be 2; two coefficients in register 0 are read for the multiplication in the filtering process, and those two coefficients (32 bits) can then be eliminated from register 0, specifically by shifting the data in register 0 right by 32 bits.
As shown in FIG. 13, for registers 1 and 2, which store the image data: in the first cycle, the selection signal received by the multiplexers of register 1 and register 2 may be 0; the 32 image data prepared on the crossbar are then selected and stored in register 1 (registering data in the register may lag selecting the data on the crossbar by one cycle), and nothing is stored in register 2. In the second cycle, the selection signal received by the multiplexers of register 1 and register 2 is 1 and the multiplications for the first coefficient begin (see a) in FIG. 14). The first image datum can be deleted once it has been multiplied, so the data in register 1 can be shifted right by 16 bits; the low-order bits of the newly read 512 bits of data can be stored in the high-order positions of register 1, and the remaining 496 bits stored in register 2. From the third cycle onward, the selection signal received by the multiplexers of register 1 and register 2 is 2, and the multiplications for two coefficients are performed in each cycle (b) and c) in FIG. 14 are performed simultaneously). Two image data can then be deleted, so register 1 and register 2 can be shifted right as a whole by 32 bits, until one row of data has been processed.
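The interplay of loading, multiplying, and shifting described above can be modelled with a small cycle-by-cycle Python sketch. It is a behavioural illustration only: registers 1 and 2 are modelled as one Python list (`buffer`), the bit-level shifts become list deletions, and new data are fetched whenever the buffer would otherwise be too short, which is an assumption about timing rather than something stated in the text. All names are hypothetical.

```python
def simulate_row(row, coeffs, n):
    """Return the N outputs and the number of coefficients applied in each cycle."""
    w = len(coeffs)
    assert n >= 1 and len(row) == n + w - 1
    buffer = list(row[:n])      # cycle 1 (selection signal 0): load N data, no multiplication
    next_read = n               # index of the next datum to fetch from the row
    out = [0] * n
    per_cycle = [0]             # coefficients applied per cycle; the load cycle applies none
    j = 0                       # index of the next coefficient
    while j < w:
        count = 1 if j == 0 else min(2, w - j)   # one coefficient first, then up to two
        # Top up the buffer so that `count` overlapping windows of width N fit.
        while len(buffer) < n + count - 1:
            chunk = row[next_read:next_read + n]
            buffer.extend(chunk)
            next_read += len(chunk)
        for c in range(count):
            window = buffer[c:c + n]             # window for coefficient j + c
            for k in range(n):
                out[k] += coeffs[j + c] * window[k]
        del buffer[:count]      # models the right shift that eliminates used data
        per_cycle.append(count)
        j += count
    return out, per_cycle


print(simulate_row([1, 2, 3, 4, 5], [1, 1, 1, 1], n=2))
# ([10, 14], [0, 1, 2, 1])
```

After each cycle the buffer starts at the first datum that is still needed, which mirrors the idea that used data are shifted out and the vacated positions are refilled with newly read data.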
It should be understood that the registers are described above using a right shift to eliminate a certain number of bits of data or coefficients as an example, but the embodiments of the present application are not limited to this; a left shift may also be used to eliminate a certain number of bits of data or coefficients.
With the processing described above, the number of cycles T required for the multiplications of one row of coefficients in the coefficient matrix can be obtained from the following formula 4):
[Formula 4): Figure PCTCN2019083284-appb-000007]
Here, W may denote the number of coefficients included in each row.
In the embodiments of the present application, a counter (counter 2 in FIG. 15) may be provided to determine the state the registers should currently be in. After the processing circuit receives a start signal, it can begin processing one row of coefficients (specifically, beginning with fetching data from the crossbar), and the counter starts counting from 0. The counter is incremented by 1 each cycle. When the counter reaches T-1, that is, when one row of the coefficient matrix has been computed, the counter value returns to 0 and processing of the next row of the coefficient matrix begins. For a coefficient matrix with H rows and W columns, this operation is performed H times.
As shown in FIG. 15, assuming T=W, counter 2 counts from 0 to W-1. When counter 2 is 0, the register selection signal is 0; when counter 2 is 1, the selection signal is 1; and when counter 2 is between 2 and W-1, the selection signal is 2.
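The mapping from counter 2 to the register selection signal can be written down as a tiny Python sketch. The function names are hypothetical, and the wrap-around once per coefficient row follows the description of the counter above.

```python
def select_signal(counter):
    """Selection signal for a given value of counter 2."""
    if counter == 0:
        return 0    # load data or coefficients from the crossbar
    if counter == 1:
        return 1    # read and eliminate the data for one coefficient
    return 2        # read and eliminate the data for two coefficients


def counter_values(h, t):
    """Counter 2 over H coefficient rows: counts 0 to T - 1, then returns to 0."""
    return [c for _ in range(h) for c in range(t)]


signals = [select_signal(c) for c in counter_values(h=2, t=4)]
print(signals)   # [0, 1, 2, 2, 0, 1, 2, 2]
```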
In the embodiments of the present application, the image data and coefficients registered in the registers may be data already prepared on the crossbar, and the starting point of the above counter may be the fetching of data from the crossbar for storage in the registers. Before the image data and coefficients are ready on the crossbar, port a and port b mentioned above need to be enabled so as to send read requests to the memory that stores the image data and coefficients. After receiving a read request, the memory returns the image data and coefficients, which introduces some delay. For example, as shown in FIG. 15, there is a delay of three cycles between ports a and b issuing the enable signal and the data being fetched from the crossbar for storage in the registers. In this case, another counter (for example, counter 1 in FIG. 15) may be provided to determine the state ports a and b should be in. For example, as shown in FIG. 15, when counter 1 is 0, ports a and b issue an enable signal, and when counter 1 is 1, port a issues an enable signal.
The embodiments of the present application enable pipelined processing of the data in the matrix to be processed. FIG. 16 shows the pipelined processing of the sliding window procedure when the coefficient matrix is 3*5.
When the coefficient matrix is 3*5, processing the data of each row of the matrix to be processed takes 3 cycles; data become ready on the crossbar 3 cycles after the read request is issued, and data are registered in the registers one cycle after they become ready on the crossbar. Request 1 and request 2 may be read requests for the first row of data, and data 1 and data 2 are the data returned for request 1 and request 2, respectively; request 3 and request 4 may be read requests for the second row of data, and data 3 and data 4 are the data returned for request 3 and request 4, respectively; request 5 and request 6 may be read requests for the third row of data, and data 5 and data 6 are the data returned for request 5 and request 6, respectively. Before the data are multiplied with the coefficients, the data may be preprocessed; as can be seen from FIG. 16, the preprocessing of the data can be continuous. The multiplication is also continuous: data 1 and data 2 each contain a relatively large number of data, while the multiplications for only one or two coefficients are performed per cycle and only one or two data need to be deleted each time. This keeps data processing continuous and maximizes multiplier utilization, which improves processing efficiency.
It should be understood that the scheme shown in FIG. 16 is described using three coefficients per row of the coefficient matrix as an example, but the embodiments of the present application are not limited to this; in that case, the number of cycles between request 2 and request 3 may be T-2.
The data processing method of the embodiments of the present application has been described above. This method may be implemented with the hardware components shown in FIG. 17.
The hardware components shown in FIG. 17 may include at least one parallel streaming memory for storing input data (or coefficients), a parallel execution unit (which may be the processing circuit mentioned above), and at least one parallel streaming memory for storing output data.
The parallel streaming memories for storing input data (or coefficients) may include, for example, parallel streaming memory A and parallel streaming memory B shown in FIG. 17, and the parallel streaming memory for storing output data may be, for example, parallel streaming memory C shown in FIG. 17. Each parallel streaming memory may include at least one random access memory (RAM), for example RAM #1, #2, #3, ..., #N shown in FIG. 17.
The parallel execution unit may include at least one input port (for example, ports a and b, which may correspond to the ports a and b mentioned above), each connected to one of the at least one parallel streaming memory for storing input data (or coefficients), and output port c of the parallel execution unit may be connected to the parallel streaming memory for storing output data. As shown in FIG. 17, input ports a and b of the parallel execution unit are connected to parallel streaming memory A and parallel streaming memory B, respectively, and output port c of the parallel execution unit is connected to parallel streaming memory C.
A parallel streaming memory for storing input data (or coefficients) may include an address generation unit (AGU), which may generate, based on a read request issued by the parallel execution unit, the parallel read addresses used by the RAMs to output data. A parallel streaming memory for storing output data may likewise include an AGU, which may generate write addresses based on a write request from the parallel execution unit.
The working process of these hardware components is described below and may specifically be as follows.
1) Input port a/b of the parallel execution unit issues a read request to the AGU of parallel streaming memory A/B. Each parallel read request contains read requests for N data (or coefficients).
2) The AGU of parallel streaming memory A/B generates parallel read addresses for the N data (or coefficients) and supplies them to the N RAMs.
3) Parallel streaming memories A and B each output N data (or coefficients) read in parallel.
4) The parallel execution unit processes the data (or coefficients) obtained from input ports a/b, and then, through output port c, issues one parallel write request to the AGU of parallel streaming memory C and sends N parallel write data to parallel streaming memory C. Each parallel write request may contain write requests for N data.
5) The AGU of parallel streaming memory C generates N parallel write addresses for the N RAMs, and the parallel write data are written into these RAMs.
6) Steps 1) to 5) may be repeated.
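The working process above can be mimicked with a small Python model. This is a sketch under a stated assumption that the text does not make: element x of a stream is stored in RAM number x mod N at local address x div N, so that any N consecutive elements fall into N different RAMs. The class and method names are hypothetical.

```python
class ParallelStreamMemory:
    """Toy model of a parallel streaming memory built from N RAMs plus an AGU."""

    def __init__(self, n_rams, depth):
        self.n = n_rams
        self.rams = [[0] * depth for _ in range(n_rams)]

    def generate_addresses(self, start):
        """AGU: one (RAM index, local address) pair per element of a parallel request."""
        # Assumption: interleaved layout, element x lives in RAM x % N at address x // N.
        return [((start + k) % self.n, (start + k) // self.n) for k in range(self.n)]

    def read(self, start):
        """Serve one parallel read request for N consecutive elements."""
        return [self.rams[ram][addr] for ram, addr in self.generate_addresses(start)]

    def write(self, start, values):
        """Serve one parallel write request carrying N values."""
        for (ram, addr), value in zip(self.generate_addresses(start), values):
            self.rams[ram][addr] = value


memory_a = ParallelStreamMemory(n_rams=4, depth=4)
memory_a.write(0, [10, 11, 12, 13])
memory_a.write(4, [14, 15, 16, 17])
print(memory_a.generate_addresses(2))   # [(2, 0), (3, 0), (0, 1), (1, 1)]
print(memory_a.read(2))                 # [12, 13, 14, 15]
```

With this layout every parallel request touches each RAM exactly once, which is what allows N data to be read or written in the same cycle.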
基于以上所述的硬件组件可以看出,并行流式存储器可以并行地输出/输入N个数据。其中,AGU产生N个数据的地址,给N个不同的RAM。并行执行单元可以并行地输入/处理/输出N个数据。Based on the hardware components described above, it can be seen that the parallel stream memory can output/input N data in parallel. Among them, AGU generates N data addresses for N different RAMs. The parallel execution unit can input/process/output N data in parallel.
图18示出的数据处理的示意性图。如图18所示,每个数据占用的地址长度可以为地址长度1,针对第一行,以基地址1为基准,可以从RAM中读取W个数据,也即图18中的灰色部分,后面的虚线框代表后续需要读取的数据,直到N+W-1个数据被读入。Figure 18 shows a schematic diagram of data processing. As shown in Figure 18, the address length occupied by each data can be address length 1. For the first row, based on the base address 1, W data can be read from RAM, which is the gray part in Figure 18. The dashed box at the back represents the data that needs to be read later until N+W-1 data are read in.
在计数器重新变为0时,针对第二行,基地址变为地址2(地址1与地址2之间的地址长度为地址长度2),可以以地址2为基准,读取N+W-1个具有地址长度1的数据(读取方式类似于第一行),依次类推,直到把图18中所示的左部分的多行数据被读取。When the counter becomes 0 again, for the second row, the base address becomes address 2 (the address length between address 1 and address 2 is address length 2), you can read N+W-1 based on address 2 Data with address length 1 (reading method is similar to the first row), and so on, until the multiple rows of data shown in the left part of Fig. 18 are read.
然后可以跳转到图18中右部分的数据,此时计数器可以为0,针对第一行,基地址变为地址3(地址1与地址3之间的地址长度为地址长度3),可以以地址3为基准,读取N+W-1个具有地址长度为1的数据,类似于左部分的数据的处理,直到把该右部分的多行数据处理完毕。Then you can jump to the data in the right part of Figure 18. At this time, the counter can be 0. For the first row, the base address becomes address 3 (the address length between address 1 and address 3 is address length 3). Based on address 3, read N+W-1 data with address length 1, similar to the processing of the left part of the data, until the right part of the multi-line data is processed.
When data is read from the RAMs, the hardware components shown in Figure 17 may be used. For each row of data, one data item is first read from each of the N RAMs, and then one more data item is read from each of the N RAMs, until N+W-1 data items have been read.
Therefore, in the embodiments of this application, when a coefficient matrix is used to filter a matrix to be processed, at least the N multiplications of the N data items in the j-th sliding window of the i-th group of data of a sub-matrix by the j-th coefficient of the i-th group of coefficients are performed in parallel. The multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows of the sub-matrix, are then added to obtain N output data items. Because these multiplications run in parallel, the filtering is achieved with higher hardware utilization and higher data processing efficiency.
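The computation summarized above can be written out in software as a minimal sketch. It assumes that each group of data holds N+W-1 data items and that the j-th sliding window is the run of N consecutive items starting at offset j; the function name and the example values are illustrative only. The inner list comprehension stands in for the N multiplications that the hardware performs in parallel.

```python
def filter_submatrix(sub, coef, n):
    """Filter one sub-matrix with an H x W coefficient matrix.

    sub  : H groups (rows here), each with N+W-1 data items (assumed layout)
    coef : H groups of W coefficients
    Returns the N output data items.
    """
    h, w = len(coef), len(coef[0])
    out = [0] * n
    for i in range(h):                      # i-th group of data / coefficients
        for j in range(w):                  # j-th sliding window / coefficient
            window = sub[i][j:j + n]        # N data items of the j-th window
            c = coef[i][j]
            products = [x * c for x in window]   # parallel in hardware
            # Add results that share the same position within the window.
            out = [acc + p for acc, p in zip(out, products)]
    return out


# Example with N = 4, W = 3, H = 2 (values chosen for illustration).
sub = [[1, 2, 3, 4, 5, 6],
       [0, 1, 0, 1, 0, 1]]
coef = [[1, 0, -1],
        [2, 1, 0]]
result = filter_submatrix(sub, coef, 4)
```

Output k equals the sum over i and j of `sub[i][k + j] * coef[i][j]`, which is the usual 2-D correlation restricted to N output positions.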
Figure 19 is a schematic block diagram of a data processing device 200 according to an embodiment of this application. As shown in Figure 19, the device is used for filtering a matrix to be processed with a coefficient matrix. The matrix to be processed includes at least one sub-matrix; the sub-matrix includes H groups of data, each group of data includes W sliding windows, and each sliding window has N data items. The coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers.
The device 200 includes a control circuit 210, a multiplication circuit 220, an addition circuit 230, a first register 240, and a second register 250.
The control circuit 210 is configured to: read the i-th group of data in the sub-matrix and register it in the first register 240, where i is an integer from 1 to H; and read the i-th group of coefficients in the coefficient matrix and register it in the second register 250.
The multiplication circuit 220 is configured to: multiply each of the N data items in the j-th sliding window of the W sliding windows included in the i-th group of data by the j-th coefficient of the W coefficients included in the i-th group of coefficients, where j is an integer from 1 to W, and at least the N multiplications of the N data items in the j-th sliding window by the j-th coefficient are performed in parallel.
The addition circuit 230 is configured to: add the multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to obtain N output data items.
Optionally, the multiplication circuit in the embodiments of this application may include at least one multiplier. Optionally, the addition circuit in the embodiments of this application may include at least one adder.
Optionally, in the embodiments of this application, the i-th group of data is read over multiple cycles, with N data items read in each cycle.
In the first of the multiple cycles in which the i-th group of data is read, at least one coefficient of the i-th group of coefficients is read.
Optionally, in the embodiments of this application, the W coefficients of the i-th group of coefficients are read in the first cycle.
Optionally, in the embodiments of this application, after the data read in the first cycle has been registered and before the data read in the second cycle is registered, the N multiplications of the N data items in the first sliding window by the first coefficient are performed in parallel, to obtain N first processing results corresponding to the first sliding window.
Optionally, in the embodiments of this application, the device 200 further includes a third register 260, and the control circuit 210 is configured to:
output the N first processing results corresponding to the first sliding window to the third register, to be combined with the processing results obtained from the sliding windows other than the first sliding window among the W*H sliding windows, to obtain the N output data items.
Optionally, in the embodiments of this application, after the data read in the second cycle has been registered, the s*N multiplications corresponding to s sliding windows are performed in parallel, where s is an integer greater than or equal to 2 and less than or equal to W-1.
Optionally, in the embodiments of this application, the value of s is determined based on the number of available multipliers.
Optionally, in the embodiments of this application, adding the multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to output N data items includes:
adding the multiplication results that correspond to data items at the same position within their sliding windows, across the s sliding windows, to obtain N second processing results; and
storing the N second processing results in the third register, to be combined with the processing results obtained from the sliding windows other than the s sliding windows among the W*H sliding windows, to obtain the N output data items.
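The two-cycle schedule and the role of the third register can be illustrated with a small software model. It is a sketch under stated assumptions: the second cycle is taken to supply the remaining W-1 data items of the group, the list `acc` stands in for the third register, and the function name and example values are hypothetical.

```python
def process_group(group, coeffs, n, s, acc):
    """Process one group of data (N+W-1 items) against its W coefficients.

    acc models the third register: it carries partial sums between
    sliding windows and between groups.
    """
    w = len(coeffs)

    # Cycle 1: the first N data items are registered, and the first
    # sliding window is multiplied by the first coefficient.
    reg = list(group[:n])
    first = [x * coeffs[0] for x in reg]            # N first processing results
    acc = [a + f for a, f in zip(acc, first)]

    # Cycle 2: the remaining data items are registered (assumed W-1 items).
    reg += list(group[n:n + w - 1])

    # Remaining windows are handled s at a time (s*N parallel products).
    j = 1
    while j < w:
        batch = range(j, min(j + s, w))
        second = [sum(reg[k + b] * coeffs[b] for b in batch) for k in range(n)]
        acc = [a + p for a, p in zip(acc, second)]  # N second processing results
        j += s
    return acc


# Example with N = 4, W = 3, H = 2 and s = 2 (illustrative values).
sub = [[1, 2, 3, 4, 5, 6],
       [0, 1, 0, 1, 0, 1]]
coef = [[1, 0, -1],
        [2, 1, 0]]
acc = [0, 0, 0, 0]
for data_group, coef_group in zip(sub, coef):
    acc = process_group(data_group, coef_group, n=4, s=2, acc=acc)
```

After all H groups have been processed, `acc` holds the N output data items. A larger s trades more multipliers for fewer accumulation steps, which matches the statement that s depends on the number of available multipliers.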
Optionally, in the embodiments of this application, W is less than or equal to N.
Optionally, in the embodiments of this application, the value of N is determined based on the capacity of the first register and/or the number of multipliers included in the multiplication circuit.
Optionally, in the embodiments of this application, the control circuit 210 is further configured to:
delete any data item of the i-th group of data from the first register after that data item has been read from the first register for multiplication.
Optionally, in the embodiments of this application, the control circuit 210 is further configured to:
after the data item has been deleted from the first register, move the remaining data of the i-th group of data so that the storage location previously occupied by the deleted data item is occupied.
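One possible reading of the delete-and-shift behaviour is sketched below, with the first register modelled as a Python list. The interpretation is an assumption: once a sliding window has been processed, its oldest data item is no longer needed by later windows, so it is deleted and the remaining items move down to fill the freed location. The function name is hypothetical.

```python
def windows_via_shift(group, n, w):
    """Return the W sliding windows seen at the head of the register.

    group : the N+W-1 data items of one group (assumed layout)
    """
    reg = list(group)              # model of the first register
    seen = []
    for _ in range(w):
        seen.append(reg[:n])       # the current window always sits at the head
        del reg[0]                 # delete the used item; the rest shift down
    return seen


windows = windows_via_shift([1, 2, 3, 4, 5, 6], n=4, w=3)
```

With this arrangement the multipliers can always take their inputs from the same N register positions, whichever window is being processed.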
Optionally, in the embodiments of this application, each of the H groups of data is one row of data of the sub-matrix, and each of the H groups of coefficients is one row of coefficients of the coefficient matrix.
Optionally, in the embodiments of this application, two sub-matrices that are adjacent in the row direction are offset from each other by N columns of data.
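For the row-wise case, the N-column offset between adjacent sub-matrices can be illustrated by listing the column range each sub-matrix covers. The sketch assumes each sub-matrix is N+W-1 columns wide and ignores any border handling; the function name is hypothetical.

```python
def submatrix_column_ranges(total_cols, n, w):
    """Half-open column ranges [start, stop) of consecutive sub-matrices.

    Each sub-matrix is assumed to span N+W-1 columns, and adjacent
    sub-matrices start N columns apart.
    """
    width = n + w - 1
    ranges = []
    start = 0
    while start + width <= total_cols:
        ranges.append((start, start + width))
        start += n
    return ranges


ranges = submatrix_column_ranges(total_cols=14, n=4, w=3)
```

Adjacent ranges overlap by W-1 columns, so each sub-matrix yields N new output columns without recomputing outputs of its neighbour.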
Optionally, in the embodiments of this application, each of the H groups of data is one column of data of the sub-matrix, and each of the H groups of coefficients is one column of coefficients of the coefficient matrix.
Optionally, in the embodiments of this application, two sub-matrices that are adjacent in the row direction are offset from each other by N rows of data.
It should be understood that the device 200 may be used to implement the corresponding operations performed by the data processing device in the foregoing method embodiments; for brevity, they are not repeated here.
The foregoing describes only specific implementations of this application, but the protection scope of this application is not limited to them. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (32)

  1. A data processing method, wherein the method is used for filtering a matrix to be processed with a coefficient matrix, the matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, each sliding window has N data items, the coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers;
    the method includes:
    reading the i-th group of data in the sub-matrix and registering it in a first register, where i is an integer from 1 to H;
    reading the i-th group of coefficients in the coefficient matrix and registering it in a second register;
    multiplying each of the N data items in the j-th sliding window of the W sliding windows included in the i-th group of data by the j-th coefficient of the W coefficients included in the i-th group of coefficients, where j is an integer from 1 to W, and at least the N multiplications of the N data items in the j-th sliding window by the j-th coefficient are performed in parallel; and
    adding the multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to obtain N output data items.
  2. The method according to claim 1, wherein the i-th group of data is read over multiple cycles, with N data items read in each cycle;
    in the first of the multiple cycles in which the i-th group of data is read, at least one coefficient of the i-th group of coefficients is read.
  3. The method according to claim 2, wherein the W coefficients of the i-th group of coefficients are read in the first cycle.
  4. The method according to claim 2 or 3, wherein after the data read in the first cycle has been registered and before the data read in the second cycle is registered, the N multiplications of the N data items in the first sliding window by the first coefficient are performed in parallel, to obtain N first processing results corresponding to the first sliding window.
  5. The method according to claim 4, wherein the method further includes:
    outputting the N first processing results corresponding to the first sliding window to a third register, to be combined with the processing results obtained from the sliding windows other than the first sliding window among the W*H sliding windows, to obtain the N output data items.
  6. The method according to any one of claims 2 to 5, wherein after the data read in the second cycle has been registered, the s*N multiplications corresponding to s sliding windows are performed in parallel, where s is an integer greater than or equal to 2 and less than or equal to W-1.
  7. The method according to claim 6, wherein the value of s is determined based on the number of available multipliers.
  8. The method according to claim 6 or 7, wherein adding the multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to output N data items includes:
    adding the multiplication results that correspond to data items at the same position within their sliding windows, across the s sliding windows, to obtain N second processing results; and
    storing the N second processing results in the third register, to be combined with the processing results obtained from the sliding windows other than the s sliding windows among the W*H sliding windows, to obtain the N output data items.
  9. The method according to any one of claims 1 to 8, wherein W is less than or equal to N.
  10. The method according to any one of claims 1 to 9, wherein the value of N is determined based on the capacity of the first register and/or the number of multipliers used for the multiplication.
  11. The method according to any one of claims 1 to 10, wherein the method further includes:
    deleting any data item of the i-th group of data from the first register after that data item has been read from the first register for multiplication.
  12. The method according to claim 11, wherein the method further includes:
    after the data item has been deleted from the first register, moving the remaining data of the i-th group of data so that the storage location previously occupied by the deleted data item is occupied.
  13. The method according to any one of claims 1 to 12, wherein each of the H groups of data is one row of data of the sub-matrix, and each of the H groups of coefficients is one row of coefficients of the coefficient matrix.
  14. The method according to claim 13, wherein two sub-matrices that are adjacent in the row direction are offset from each other by N columns of data.
  15. The method according to any one of claims 1 to 12, wherein each of the H groups of data is one column of data of the sub-matrix, and each of the H groups of coefficients is one column of coefficients of the coefficient matrix.
  16. The method according to claim 15, wherein two sub-matrices that are adjacent in the row direction are offset from each other by N rows of data.
  17. A data processing device, wherein the device is used for filtering a matrix to be processed with a coefficient matrix, the matrix to be processed includes at least one sub-matrix, the sub-matrix includes H groups of data, each group of data includes W sliding windows, each sliding window has N data items, the coefficient matrix includes H groups of coefficients, and each group of coefficients includes W coefficients, where N, H, and W are positive integers;
    the device includes a control circuit, a multiplication circuit, an addition circuit, a first register, and a second register;
    the control circuit is configured to: read the i-th group of data in the sub-matrix and register it in the first register, where i is an integer from 1 to H; and read the i-th group of coefficients in the coefficient matrix and register it in the second register;
    the multiplication circuit is configured to: multiply each of the N data items in the j-th sliding window of the W sliding windows included in the i-th group of data by the j-th coefficient of the W coefficients included in the i-th group of coefficients, where j is an integer from 1 to W, and at least the N multiplications of the N data items in the j-th sliding window by the j-th coefficient are performed in parallel;
    the addition circuit is configured to: add the multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to obtain N output data items.
  18. The device according to claim 17, wherein the i-th group of data is read over multiple cycles, with N data items read in each cycle;
    in the first of the multiple cycles in which the i-th group of data is read, at least one coefficient of the i-th group of coefficients is read.
  19. The device according to claim 18, wherein the W coefficients of the i-th group of coefficients are read in the first cycle.
  20. The device according to claim 18 or 19, wherein after the data read in the first cycle has been registered and before the data read in the second cycle is registered, the N multiplications of the N data items in the first sliding window by the first coefficient are performed in parallel, to obtain N first processing results corresponding to the first sliding window.
  21. The device according to claim 20, further including a third register, wherein the control circuit is configured to:
    output the N first processing results corresponding to the first sliding window to the third register, to be combined with the processing results obtained from the sliding windows other than the first sliding window among the W*H sliding windows, to obtain the N output data items.
  22. The device according to any one of claims 18 to 21, wherein after the data read in the second cycle has been registered, the s*N multiplications corresponding to s sliding windows are performed in parallel, where s is an integer greater than or equal to 2 and less than or equal to W-1.
  23. The device according to claim 22, wherein the value of s is determined based on the number of available multipliers.
  24. The device according to claim 22 or 23, wherein adding the multiplication results that correspond to data items at the same position within their sliding windows, across the W*H sliding windows included in the sub-matrix, to output N data items includes:
    adding the multiplication results that correspond to data items at the same position within their sliding windows, across the s sliding windows, to obtain N second processing results; and
    storing the N second processing results in the third register, to be combined with the processing results obtained from the sliding windows other than the s sliding windows among the W*H sliding windows, to obtain the N output data items.
  25. The device according to any one of claims 17 to 24, wherein W is less than or equal to N.
  26. The device according to any one of claims 17 to 25, wherein the value of N is determined based on the capacity of the first register and/or the number of multipliers included in the multiplication circuit.
  27. The device according to any one of claims 17 to 26, wherein the device further includes:
    deleting any data item of the i-th group of data from the first register after that data item has been read from the first register for multiplication.
  28. The device according to claim 27, wherein the device further includes:
    after the data item has been deleted from the first register, moving the remaining data of the i-th group of data so that the storage location previously occupied by the deleted data item is occupied.
  29. The device according to any one of claims 17 to 28, wherein each of the H groups of data is one row of data of the sub-matrix, and each of the H groups of coefficients is one row of coefficients of the coefficient matrix.
  30. The device according to claim 29, wherein two sub-matrices that are adjacent in the row direction are offset from each other by N columns of data.
  31. The device according to any one of claims 17 to 28, wherein each of the H groups of data is one column of data of the sub-matrix, and each of the H groups of coefficients is one column of coefficients of the coefficient matrix.
  32. The device according to claim 31, wherein two sub-matrices that are adjacent in the row direction are offset from each other by N rows of data.
PCT/CN2019/083284 2019-04-18 2019-04-18 Data processing method and device WO2020211049A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980005020.9A CN111213177A (en) 2019-04-18 2019-04-18 Data processing method and device
PCT/CN2019/083284 WO2020211049A1 (en) 2019-04-18 2019-04-18 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/083284 WO2020211049A1 (en) 2019-04-18 2019-04-18 Data processing method and device

Publications (1)

Publication Number Publication Date
WO2020211049A1 true WO2020211049A1 (en) 2020-10-22

Family

ID=70790117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/083284 WO2020211049A1 (en) 2019-04-18 2019-04-18 Data processing method and device

Country Status (2)

Country Link
CN (1) CN111213177A (en)
WO (1) WO2020211049A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375721A (en) * 2010-08-23 2012-03-14 联想(北京)有限公司 Matrix multiplying method, graphic processor and electronic equipment
CN103227622A (en) * 2013-04-19 2013-07-31 中国科学院自动化研究所 Parallel filtering method and corresponding device
CN107066235A (en) * 2017-04-24 2017-08-18 北京华大信安科技有限公司 Computational methods and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7869994B2 (en) * 2007-01-30 2011-01-11 Qnx Software Systems Co. Transient noise removal system using wavelets
CN104751417B (en) * 2013-12-31 2018-08-10 展讯通信(上海)有限公司 A kind of method, apparatus and image processing system of removal color noise
CN106023091B (en) * 2016-04-22 2019-05-24 西安电子科技大学 The real-time defogging method of image based on graphics processor
CN106788714B (en) * 2016-12-05 2019-01-18 重庆工商大学 A kind of sparse solution mixing method based on optical computing
CN108475188A (en) * 2017-07-31 2018-08-31 深圳市大疆创新科技有限公司 Data processing method and equipment

Also Published As

Publication number Publication date
CN111213177A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
US5500811A (en) Finite impulse response filter
CN109271133B (en) Data processing method and system
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
NL8601183A (en) DISCRETE COSINUS TRANSFORMATION DEVICE.
US20070052557A1 (en) Shared memory and shared multiplier programmable digital-filter implementation
JPH0612487A (en) Sample rate converter for picture data
CN108762719B (en) Parallel generalized inner product reconstruction controller
WO2020211049A1 (en) Data processing method and device
EP0474246A2 (en) Image signal processor
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
Van A new 2-D systolic digital filter architecture without global broadcast
Mehendale et al. DA-based circuits for inner-product computation
Cardarilli et al. Impact of RNS coding overhead on FIR filters performance
JP4920559B2 (en) Data processing device
WO2021035715A1 (en) Data processing method and device
JP4156538B2 (en) Matrix operation unit
CN101594122A (en) Finite impulse response filter and implementation method thereof
CN107193784A (en) The sinc interpolation realization method and systems of the low hardware complexity of high accuracy
JP3363974B2 (en) Signal processing device
US6944640B2 (en) Progressive two-dimensional (2D) pyramid filter
JP2001160736A (en) Digital filter circuit
JPS59194242A (en) Digital multiplying and cumulative adding device
JP2960595B2 (en) Digital signal processor
JP3654622B2 (en) DCT arithmetic device and IDCT arithmetic device
JPH0741213Y2 (en) FIR filter

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19925468

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19925468

Country of ref document: EP

Kind code of ref document: A1