CN117372495A - Calculation method for accelerating dot products with different bit widths in digital image processing

Info

Publication number
CN117372495A
CN117372495A CN202311195824.4A CN202311195824A CN117372495A CN 117372495 A CN117372495 A CN 117372495A CN 202311195824 A CN202311195824 A CN 202311195824A CN 117372495 A CN117372495 A CN 117372495A
Authority
CN
China
Prior art keywords
register
data
coefficient
input
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311195824.4A
Other languages
Chinese (zh)
Inventor
俞林杰
罗嘉蕙
张丹枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jindi Space Time Hangzhou Technology Co ltd
Original Assignee
Jindi Space Time Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jindi Space Time Hangzhou Technology Co ltd
Priority to CN202311195824.4A
Publication of CN117372495A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the field of image processing, and discloses a calculation method for accelerating dot products with different bit widths in digital image processing. Input data and coefficient data with different bit widths are respectively placed into a first register and a second register according to specific arrangement. The first register and the second register are guaranteed to have the same total length of data under the condition of different element bit widths, and hardware is convenient to read in the data. Meanwhile, when dot product operation is carried out on input data and coefficient data, the coefficient data is multiplexed by utilizing the characteristics of the same coefficients among different channels, so that the number of times of repeated reading of the data into an execution unit is effectively reduced.

Description

Calculation method for accelerating dot products with different bit widths in digital image processing
Technical Field
The invention belongs to the field of image processing, and particularly relates to a calculation method for accelerating dot products with different bit widths in digital image processing.
Background
With the widespread use of digital image processing algorithms in everyday life and the large-scale deployment of visual AI algorithms, the performance of these algorithms is becoming increasingly important. In AI applications, the network requires an image of a fixed resolution as input. However, the output resolution differs from one video device to another and is usually much greater than the resolution required by the network, so the image must be scaled to meet the input requirements of the network. In scenarios such as face recognition, the angle of certain objects in the image must also be adjusted so that the algorithm can recognize them better. These scenarios rely on the image resizing algorithm (resize), the numerical remapping algorithm (remap) and the affine transformation algorithm (affine) from the field of digital image processing. The key calculation in these algorithms is interpolation. Interpolation modes include linear interpolation, nearest-neighbor interpolation and others, of which bilinear interpolation is the most frequently used. In addition, neural networks place requirements on the input image format, which is generally RGB, whereas the raw data output by a camera is generally in YUV format, so an image format conversion algorithm is also needed.
Whether for bilinear interpolation or for image format conversion, the core computation is the dot product. Bilinear interpolation is used as the example below.
The core idea of bilinear interpolation is to perform one linear interpolation in each of the X and Y directions. As an interpolation algorithm from numerical analysis, it is widely applied in signal processing, digital image processing, video processing and similar fields.
Bilinear interpolation extends linear interpolation to functions of two variables. As shown in Fig. 1, let P = (x, y) be the point to be solved, and let (x1, y1), (x1, y2), (x2, y1), (x2, y2) be its four nearest points, with values Q11, Q12, Q21 and Q22 respectively. To obtain the value at P, linear interpolation is performed twice in the X direction and then once in the Y direction.
The calculation of P can be expressed simply as:
P = W1×Q11 + W2×Q12 + W3×Q21 + W4×Q22,
where
W1 = ((x2-x)/(x2-x1)) × ((y2-y)/(y2-y1)),
W2 = ((x2-x)/(x2-x1)) × ((y-y1)/(y2-y1)),
W3 = ((x-x1)/(x2-x1)) × ((y2-y)/(y2-y1)),
W4 = ((x-x1)/(x2-x1)) × ((y-y1)/(y2-y1)).
In image processing, the source coordinates corresponding to a target pixel are generally floating-point numbers, because they are obtained by multiplying the target coordinates by a scaling factor that is itself most likely a floating-point number. Let the coordinates be (i+u, j+v), where i and j are the integer parts and u and v are the fractional parts, each a floating-point number in the interval [0, 1). The pixel value f(i+u, j+v) is then determined by the values of the four surrounding pixels at coordinates (i, j), (i+1, j), (i, j+1) and (i+1, j+1) in the original image. Because these four pixels are adjacent, the denominators (x2-x1) and (y2-y1) in W1, W2, W3 and W4 are both equal to 1, and the calculation formula becomes:
f(i+u, j+v) = (1-u)×(1-v)×f(i, j) + (1-u)×v×f(i, j+1) + u×(1-v)×f(i+1, j) + u×v×f(i+1, j+1).
The key calculation is therefore a dot product of four inputs and four coefficients.
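The formula above can be sketched in a few lines of Python. This is only an illustration of the floating-point reference computation, not the accelerated method; the function names `bilinear_weights` and `bilinear_sample` and the tiny 2x2 test image are assumptions made for the example, and no boundary handling is included.

```python
import math

def bilinear_weights(u, v):
    """Weights for f(i,j), f(i,j+1), f(i+1,j), f(i+1,j+1), in that order."""
    return [(1 - u) * (1 - v), (1 - u) * v, u * (1 - v), u * v]

def bilinear_sample(img, x, y):
    """Interpolate img (indexed img[i][j]) at the fractional coordinate (x, y)."""
    i, j = math.floor(x), math.floor(y)
    u, v = x - i, y - j  # fractional parts in [0, 1)
    neighbours = [img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1]]
    # The interpolation itself is a four-element dot product.
    return sum(w * q for w, q in zip(bilinear_weights(u, v), neighbours))

img = [[10, 20], [30, 40]]
centre = bilinear_sample(img, 0.5, 0.5)  # equal weights on all four pixels
```

The four weights always sum to 1, which is what later allows them to be quantized to a fixed-point scale without changing the overall brightness.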
In image processing, however, the two operands of a dot product often have different bit widths. The pixel gray value is fixed as an 8-bit unsigned integer (uint8), whereas the coefficients are quantized to fixed-point values scaled by 2^N. To preserve precision, the coefficients are quantized to 16-bit signed integers (int16) or 32-bit signed integers (int32). For bilinear interpolation, quantizing the coefficients to int16 meets the precision requirement, and the calculation becomes result = a_0×b_0 + a_1×b_1 + a_2×b_2 + a_3×b_3, where a_0, a_1, a_2, a_3 correspond to Q12, Q22, Q11, Q21 in the figure above. Accelerating this calculation accelerates bilinear interpolation.
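A minimal sketch of such a mixed-width fixed-point dot product follows. The choice N = 14, the rounding step and the helper names `quantize` and `fixed_point_dot` are assumptions for illustration; the text only states that the coefficients are scaled by 2^N and stored as int16 or int32.

```python
N = 14  # assumed number of fractional bits; the text only says "2^N"

def quantize(weights, n=N):
    """Scale floating-point weights by 2^n and round to integers (int16 range assumed)."""
    return [round(w * (1 << n)) for w in weights]

def fixed_point_dot(pixels_u8, qweights, n=N):
    """Dot product of uint8 pixels with quantized weights, rescaled with rounding."""
    acc = sum(p * w for p, w in zip(pixels_u8, qweights))  # fits in int32
    return (acc + (1 << (n - 1))) >> n

pixels = [10, 20, 30, 40]          # uint8 inputs a_0..a_3
weights = [0.25, 0.25, 0.25, 0.25]  # floating-point coefficients
q = quantize(weights)               # integer coefficients b_0..b_3
result = fixed_point_dot(pixels, q)
```

With N = 14 a weight of 1.0 becomes 16384, which still fits in a signed 16-bit value.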
On current general-purpose CPUs, Single Instruction Multiple Data (SIMD) technology can be used to accelerate such computation. However, for multiply-accumulate operations on operands of different bit widths, a bit-expansion operation must be performed first so that the two input bit widths are aligned; only then can the multiply-accumulate be performed. With SIMD instructions on existing hardware there are two main approaches, whose computation flows are described below using 128-bit registers as an example:
In the first mode, the arrangement is as shown in Fig. 2, and the four input data and four coefficients are placed in five registers. The calculation consists of two steps: bit expansion and bit-expansion multiply-accumulate. The uint8 input data must first be expanded to int16, after which one bit-expansion multiply and three bit-expansion multiply-accumulates are performed.
C_0 = a_00×b_00 + a_01×b_01 + a_02×b_02 + a_03×b_03
C_1 = a_10×b_00 + a_11×b_01 + a_12×b_02 + a_13×b_03
C_2 = a_20×b_00 + a_21×b_01 + a_22×b_02 + a_23×b_03
C_3 = a_30×b_00 + a_31×b_01 + a_32×b_02 + a_33×b_03
C_0, C_1, C_2 and C_3 are the output image calculation results for channels 0, 1, 2 and 3, respectively.
In total, eight bit-expansion operations are involved, counting both plain bit expansions and bit-expansion multiply-accumulates.
In the second mode, the arrangement is as shown in Fig. 3, and the four input data and four coefficients are placed in four registers. The calculation consists of two steps: bit expansion and 16-bit dot product (dot) calculation. The uint8 input data is first expanded to int16, and then two 16-bit dot calculations are performed.
C_0 = a_00×b_00 + a_01×b_01 + a_02×b_02 + a_03×b_03
C_1 = a_10×b_00 + a_11×b_01 + a_12×b_02 + a_13×b_03
C_2 = a_20×b_00 + a_21×b_01 + a_22×b_02 + a_23×b_03
C_3 = a_30×b_00 + a_31×b_01 + a_32×b_02 + a_33×b_03
C_0, C_1, C_2 and C_3 are the output image calculation results for channels 0, 1, 2 and 3, respectively.
Compared with the first mode, the second mode replaces the bit-expansion multiply-accumulate with a 16-bit dot product, which saves four bit-expansion operations but still requires four.
From the standpoint of the overall bilinear interpolation, these bit-expansion operations produce only intermediate values and are not inherently necessary. For hardware, the write-back of a bit-expansion operation is also expensive: one bit-expansion operation needs at least two execution cycles or must be split into two instructions. In addition, both modes use 16×16 multipliers, which occupy more resources.
Disclosure of Invention
The invention aims to provide a calculation method for accelerating dot products with different bit widths in digital image processing which, through element arrangement, reuses (multiplexes) the coefficient data while ensuring that the two dot-product inputs have the same total data bit width, which simplifies hardware processing.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A calculation method for accelerating dot products with different bit widths in digital image processing, comprising the following steps:
(1) Place the input data into a first register in a given arrangement and the coefficient data into a second register in a given arrangement, such that the total data lengths of the first register and the second register are the same;
(2) Take the input data from the first register and the coefficient data from the second register, and perform a dot product operation on the input data and coefficient data of different bit widths to obtain the output image calculation result.
When an output image is generated from an input image by scaling, the input image pixel values are placed as input data into the first register, register V_n, and the coefficient data are placed into the second register, register V_p. Typically the image data are 8 bits and the coefficient data are 16 or 32 bits. Suppose register V_n stores x sets of input image pixel values. For the total data lengths of registers V_n and V_p to match, register V_p holds x/2 sets of coefficient data when the coefficients are 16 bits, and x/4 sets when they are 32 bits.
Taking 16-bit coefficient data as an example, a set of input image pixel values is fetched from register V_n and a set of coefficient data from register V_p, a dot product is computed with 8×16 multipliers to obtain a result, and the step is repeated. Two sets of input image pixel values correspond to one set of coefficient data: for example, the first and second sets of input image pixel values in register V_n are dot-multiplied with the first set of coefficient data in register V_p, the third and fourth sets of input image pixel values in register V_{n+1} with the second set of coefficient data in register V_p, and so on, until the output image pixel values are obtained. In the 16-bit case, each 16-bit coefficient is therefore reused twice.
For 32-bit coefficient data, four sets of input image pixel values correspond to one set of coefficient data, so each 32-bit coefficient can be reused four times.
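The relationship between register width, element widths and coefficient reuse can be checked with a small calculation. The helper name `register_layout` and the parameter `taps` (the number of elements per dot product, four for bilinear interpolation) are hypothetical names introduced for this sketch.

```python
def register_layout(register_bits, input_bits, coeff_bits, taps=4):
    """Return (input groups, coefficient groups, reuse factor) for one register pair."""
    input_groups = register_bits // (input_bits * taps)  # x in the text
    coeff_groups = register_bits // (coeff_bits * taps)  # x/2 or x/4 in the text
    reuse = coeff_bits // input_bits                     # times each coefficient is reused
    return input_groups, coeff_groups, reuse

layout_16 = register_layout(128, 8, 16)  # uint8 inputs, int16 coefficients
layout_32 = register_layout(128, 8, 32)  # uint8 inputs, int32 coefficients
```

For a 128-bit register this gives four input groups against two coefficient groups in the 16-bit case, and four input groups against one coefficient group in the 32-bit case.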
A bilinear interpolation algorithm using the above calculation method for accelerating dot products with different bit widths in digital image processing, comprising the following steps:
(1) Obtain the scaling coefficient from the input image and the output image, and calculate the pixel coordinates of the output image corresponding to the input image;
(2) Multiply the pixel coordinates of the output image by the scaling coefficient to obtain the integer part and the fractional part of the corresponding input-image coordinates. The integer part is the coordinate value in the input image and is stored in memory. The fractional part is quantized and then used to compute the coefficient values, as follows: multiply the fractional part of the abscissa by 128 to obtain the coefficient u, and subtract u from 128 to obtain (1-u); multiply the fractional part of the ordinate by 128 to obtain the coefficient v, and subtract v from 128 to obtain (1-v). Denote u and (1-u) as coefficient group 1, and v and (1-v) as coefficient group 2. Perform bit-expansion multiplication on coefficient group 1 and coefficient group 2 to obtain (1-u)×(1-v), u×(1-v), (1-u)×v and u×v, then interleave and reorder the elements into the order (1-u)×(1-v), (1-u)×v, u×(1-v), u×v and place them into the second register as coefficient data. Place the input image pixel values in sequence into the first register as input data, such that the total data lengths of the first register and the second register are the same;
(3) Take the input image pixel values from the first register and the coefficient data from the second register, perform the dot product operation with the multipliers, and repeat until all pixel data in the first register have undergone the dot product operation, yielding all pixel values of the output image.
By adopting the above technical scheme, the invention has the following beneficial effects:
1. The method solves the problem of arranging elements of different bit widths: although the element bit widths differ, the total lengths of the input registers stay aligned, so the hardware can execute efficiently without additional read-port resources.
2. Because the coefficient data are the same across the channels of a digital image, the invention reuses the coefficients between channels, which reduces the number of times data must be re-read into the execution unit. The acceleration is especially pronounced for images with an even number of channels.
3. The bit-expansion operations otherwise needed in the calculation are eliminated, which greatly reduces the instruction count.
4. Multiplier resources are reduced.
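The coefficient quantization in step (2) can be sketched as follows. The function name `quantized_coefficients` is hypothetical, and truncation toward zero is an assumption, since the text does not say how the product with 128 is rounded.

```python
def quantized_coefficients(frac_x, frac_y, scale=128):
    """Return the four products in the order (1-u)(1-v), (1-u)v, u(1-v), uv."""
    u = int(frac_x * scale)   # quantized abscissa fraction (truncation assumed)
    v = int(frac_y * scale)   # quantized ordinate fraction (truncation assumed)
    one_minus_u = scale - u   # "(1-u)" on the 128 scale
    one_minus_v = scale - v   # "(1-v)" on the 128 scale
    return [one_minus_u * one_minus_v, one_minus_u * v, u * one_minus_v, u * v]

coeffs = quantized_coefficients(0.25, 0.5)
```

Because (128-u) + u = 128 and (128-v) + v = 128, the four products always sum to 128 × 128 = 16384, so each of them fits in an int16 and the final result can be rescaled by a 14-bit shift.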
Drawings
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of bilinear interpolation.
Fig. 2 is the arrangement diagram of the first mode in the background art.
Fig. 3 is the arrangement diagram of the second mode in the background art.
Fig. 4 is an arrangement diagram in example 1.
Fig. 5 is an arrangement diagram in example 2.
Fig. 6 is an arrangement diagram in example 3.
Fig. 7 is an arrangement calculation chart in example 4.
Fig. 8 is a schematic diagram of the four-channel bilinear interpolation method in example 4.
Fig. 9 is a flowchart in embodiment 4.
Detailed Description
Example 1
A calculation method for accelerating dot products with different bit widths in digital image processing, comprising the following steps:
(1) The input image has 2 channels and the coefficient data are 16 bits. Denote the input values of each channel as a_ck, where c is the channel index and k is the index of the pixel within each group, and denote the coefficient data as b_0, b_1, b_2, b_3. Using an interleaving instruction, the input data are arranged as a_00, a_01, a_02, a_03, a_10, a_11, a_12, a_13 and placed into the first register; the coefficient data in the second register are arranged as b_0, b_1, b_2, b_3. The arrangement is shown in Fig. 4;
(2) The two groups of arranged input data, a_00, a_01, a_02, a_03 and a_10, a_11, a_12, a_13, are each dot-multiplied with the coefficient data b_0, b_1, b_2, b_3 in a single dot product calculation using 8×16 multipliers, i.e.
C_00 = a_00×b_0 + a_01×b_1 + a_02×b_2 + a_03×b_3
C_10 = a_10×b_0 + a_11×b_1 + a_12×b_2 + a_13×b_3
C_00 is the output image calculation result for channel 0, and C_10 is the output image calculation result for channel 1.
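A software model of this example is sketched below. It only mimics the arithmetic of the proposed operation with Python lists standing in for register lanes; the function name `dot_reuse` and the sample values are assumptions, not part of the patent.

```python
def dot_reuse(inputs_u8, coeffs_i16, taps=4):
    """Dot every group of `taps` uint8 inputs with the same int16 coefficient group."""
    if len(coeffs_i16) != taps or len(inputs_u8) % taps != 0:
        raise ValueError("inputs must be a multiple of the coefficient group size")
    results = []
    for start in range(0, len(inputs_u8), taps):
        group = inputs_u8[start:start + taps]
        # The coefficient group is reused unchanged for every input group.
        results.append(sum(a * b for a, b in zip(group, coeffs_i16)))
    return results

vn = [10, 20, 30, 40, 1, 2, 3, 4]  # a_00..a_03 then a_10..a_13 (8 x 8 bit = 64 bits)
vp = [4096, 4096, 4096, 4096]      # b_0..b_3 (4 x 16 bit = 64 bits)
c00, c10 = dot_reuse(vn, vp)
```

Both operands occupy 64 bits here, which is the "same total data length" property the arrangement is designed to guarantee.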
Example 2
A calculation method for accelerating dot products with different bit widths in digital image processing, comprising the following steps:
(1) The input image has 4 channels and the coefficient data are 16 bits. Denote the input values of each channel as a_ck, where c is the channel index and k is the index of the pixel within each group, and denote the coefficient data as b_0, b_1, b_2, b_3. Using an interleaving instruction, the input data in one first register are arranged as a_00, a_01, a_02, a_03, a_10, a_11, a_12, a_13 and the input data in the other first register as a_20, a_21, a_22, a_23, a_30, a_31, a_32, a_33; the coefficient data in the second register are arranged as b_0, b_1, b_2, b_3. The input data thus occupy two first registers. The arrangement is shown in Fig. 5;
(2) With the arrangement of step (1), two dot product calculations using 8×16 multipliers are required.
First, the two groups of input data a_00, a_01, a_02, a_03 and a_10, a_11, a_12, a_13 are each dot-multiplied with the coefficient data b_0, b_1, b_2, b_3 using 8×16 multipliers, i.e.
C_00 = a_00×b_0 + a_01×b_1 + a_02×b_2 + a_03×b_3
C_10 = a_10×b_0 + a_11×b_1 + a_12×b_2 + a_13×b_3
Second, the two groups of input data a_20, a_21, a_22, a_23 and a_30, a_31, a_32, a_33 are each dot-multiplied with the coefficient data b_0, b_1, b_2, b_3 using 8×16 multipliers, i.e.
C_20 = a_20×b_0 + a_21×b_1 + a_22×b_2 + a_23×b_3
C_30 = a_30×b_0 + a_31×b_1 + a_32×b_2 + a_33×b_3
C_00, C_10, C_20 and C_30 are the output image calculation results for channels 0, 1, 2 and 3, respectively.
Example 3
A calculation method for accelerating dot products with different bit widths in digital image processing, comprising the following steps:
(1) The input image has 4 channels and the coefficient data are 32 bits. Denote the input values of each channel as a_ck, where c is the channel index and k is the index of the pixel within each group, and denote the coefficient data as b_0, b_1, b_2, b_3. Using an interleaving instruction, the input data are arranged as a_00, a_01, a_02, a_03, a_10, a_11, a_12, a_13, a_20, a_21, a_22, a_23, a_30, a_31, a_32, a_33 and placed into the first register; the coefficient data in the second register are arranged as b_0, b_1, b_2, b_3. The arrangement is shown in Fig. 6;
(2) With the arrangement of step (1), a single dot product calculation using 8×32 multipliers is required, i.e.
C_00 = a_00×b_0 + a_01×b_1 + a_02×b_2 + a_03×b_3
C_10 = a_10×b_0 + a_11×b_1 + a_12×b_2 + a_13×b_3
C_20 = a_20×b_0 + a_21×b_1 + a_22×b_2 + a_23×b_3
C_30 = a_30×b_0 + a_31×b_1 + a_32×b_2 + a_33×b_3
C_00, C_10, C_20 and C_30 are the output image calculation results for channels 0, 1, 2 and 3, respectively.
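The equal-length property of this 32-bit case can be illustrated by packing both operands into byte strings with Python's `struct` module. This is a software model only; the function name `dot_8x32`, the little-endian lane order and the sample values are assumptions for the sketch.

```python
import struct

inputs = list(range(16))                 # a_00..a_33 as sixteen uint8 lanes
coeffs = [65536, -65536, 32768, 32768]   # b_0..b_3 as four int32 lanes

vn = struct.pack("<16B", *inputs)  # 16 x 8 bit  = 128 bits
vp = struct.pack("<4i", *coeffs)   # 4 x 32 bit = 128 bits

def dot_8x32(vn_bytes, vp_bytes):
    """Four dot products; the single coefficient group is reused for all four channels."""
    a = struct.unpack("<16B", vn_bytes)
    b = struct.unpack("<4i", vp_bytes)
    return [sum(a[4 * c + k] * b[k] for k in range(4)) for c in range(4)]

results = dot_8x32(vn, vp)  # C_00, C_10, C_20, C_30
```

Both packed operands are exactly 16 bytes long, so one register of inputs pairs with one register of coefficients and each coefficient is used four times.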
Example 4
A calculation method for accelerating dot products with different bit widths in digital image processing, whose arrangement is shown in Fig. 7. Take bilinear interpolation of a four-channel image as an example, and let f(i, j) denote the pixel value at point (i, j) of the image. The principle of four-channel bilinear interpolation is shown in Fig. 8: the pixels of each channel are drawn with one shape (diamond, circle, triangle or hexagon), pixels of the same shape belong to the same channel, and interpolation must be performed for each of the four channels. The positions of the four points Q11, Q12, Q21 and Q22 are selected in the same way as described in the background art. The difference is that, once each Q position has been obtained, four consecutive values (one from each channel) must be read for the interpolation of the four channels. As shown in Fig. 9, the specific steps are as follows.
1. Compute the horizontal scaling coefficient P_W = W_in / W_out and the vertical scaling coefficient P_H = H_in / H_out from the input and output images, where H_in and W_in are the height and width of the input image, and H_out and W_out are the height and width of the output image.
2. Multiply each abscissa value in [0, W_out) of the output image pixels P by the horizontal scaling coefficient P_W to obtain the corresponding input-image abscissa values [0, W_float). Each value is expressed as an integer part plus a fractional part, with m denoting the integer part of the abscissa. Rounding [0, W_float) down gives the integer parts [0, W_m), and the difference between each abscissa and its integer part gives the corresponding fractional part. Similarly, from the ordinate values [0, H_out) one obtains the integer parts [0, H_n) of the corresponding input ordinates and the ordinate fractional parts, with n denoting the integer part of the ordinate.
As shown in Fig. 8, the four input pixel points are Q11 = f(i, j), Q12 = f(i, j+1), Q21 = f(i+1, j) and Q22 = f(i+1, j+1), where i ∈ [0, W_m) and j ∈ [0, H_n). The abscissa and ordinate integer parts are recorded as a table T, which is stored in memory for later use. The abscissa and ordinate fractional parts are each multiplied by 128 and quantized to uint8 to give u and v, where u is the quantized abscissa fractional part and v is the quantized ordinate fractional part; subtracting u from 128 gives (1-u), and subtracting v from 128 gives (1-v). Denote u and (1-u) as coefficient group 1, and v and (1-v) as coefficient group 2. Coefficient group 1 is stored in memory in the order (1-u), u, and coefficient group 2 in the order (1-v), v.
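Steps 1 and 2 can be sketched for a single axis as follows. The function name `map_axis`, the truncating quantization and the absence of edge clamping are assumptions made to keep the example short; the same function would be applied once to the widths and once to the heights.

```python
import math

def map_axis(out_size, in_size, scale=128):
    """Map output coordinates to input coordinates along one axis.

    Returns the integer parts (one axis of table T) and the fractional
    parts quantized to the 0..127 range (the coefficient u or v).
    """
    p = in_size / out_size  # scaling coefficient P_W or P_H
    integer_parts, quantized_fractions = [], []
    for out_coord in range(out_size):
        src = out_coord * p
        m = math.floor(src)                                  # integer part
        integer_parts.append(m)
        quantized_fractions.append(int((src - m) * scale))   # truncation assumed
    return integer_parts, quantized_fractions

table_x, u_values = map_axis(out_size=4, in_size=6)
```

A production implementation would also need to clamp i+1 and j+1 at the image border, which this sketch omits.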
3. Taking a register width of 128 bits as an example, table T is loaded from memory and used to fetch from the input image the 32 pixel points corresponding to the output image: 8 groups in total, 2 groups per channel, 4 points per group. The four points of each group are located at Q11 (i, j), Q12 (i, j+1), Q21 (i+1, j) and Q22 (i+1, j+1).
As shown in Fig. 7, the gray value of each pixel point is denoted a_ck, where c is the channel index and k is the index of the pixel point within the groups of that channel. Using an interleaving instruction (see the ARM instruction set manual), the two groups of pixels of channel 1, a_00 to a_03 and a_04 to a_07, are placed into bits [0,31] and [64,95] of one of the first registers, register V_n, and the two groups of pixels of channel 2, a_10 to a_13 and a_14 to a_17, are placed into bits [32,63] and [96,127] of register V_n. Register V_n stores pixel values of data type uint8.
Likewise, the four groups of data of channel 3 and channel 4 are placed in the same arrangement into another first register, register V_{n+1}.
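The lane arrangement of register V_n described above can be modelled with a list of sixteen bytes, where byte index k corresponds to bits [8k, 8k+7]. The function name `pack_vn` and the sample values are hypothetical.

```python
def pack_vn(channel_a, channel_b):
    """Interleave two channels, each holding two groups of four uint8 pixels.

    Resulting byte lanes:
      0..3   first group of channel_a   (bits [0,31])
      4..7   first group of channel_b   (bits [32,63])
      8..11  second group of channel_a  (bits [64,95])
      12..15 second group of channel_b  (bits [96,127])
    """
    if len(channel_a) != 8 or len(channel_b) != 8:
        raise ValueError("each channel must provide two groups of four pixels")
    return channel_a[0:4] + channel_b[0:4] + channel_a[4:8] + channel_b[4:8]

ch1 = [0, 1, 2, 3, 4, 5, 6, 7]           # stands for a_00..a_07
ch2 = [10, 11, 12, 13, 14, 15, 16, 17]   # stands for a_10..a_17
vn = pack_vn(ch1, ch2)
```

Placing the two channels side by side within each 64-bit half is what lets one 64-bit coefficient group serve both channels.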
4. Load two sets of coefficient group 1 into register V_m. Load two sets of coefficient group 2 and place one set in each of registers X_k and X_q. Register V_m then holds two (1-u), u data pairs, and registers X_k and X_q each hold one (1-v), v data pair. Performing bit-expansion multiplication of V_m with X_k and with X_q yields two sets of the coefficients (1-u)×(1-v), u×(1-v), (1-u)×v and u×v. An interleaving instruction reorders each set into (1-u)×(1-v) as b_0, (1-u)×v as b_1, u×(1-v) as b_2 and u×v as b_3, and the result is placed into the second register, register V_p. As shown in Fig. 9, register V_p stores coefficients of data type int16, arranged contiguously group by group.
5. After steps 3 and 4, the pixel points and the corresponding coefficients are arranged as in Fig. 7. As shown in Fig. 4, a dot product operation is performed on the pixel data in register V_n and the coefficient data in register V_p using 8×16 multipliers. The specific calculation is:
C_00 = a_00×b_0 + a_01×b_1 + a_02×b_2 + a_03×b_3
C_10 = a_10×b_0 + a_11×b_1 + a_12×b_2 + a_13×b_3
C_01 = a_04×b_4 + a_05×b_5 + a_06×b_6 + a_07×b_7
C_11 = a_14×b_4 + a_15×b_5 + a_16×b_6 + a_17×b_7
The results C_00, C_10, C_01 and C_11 are stored in register C, with data type int32. The calculation results of the remaining two channels are obtained in the same way.
In this calculation, although registers V_n and V_p hold data of different bit widths, their total data lengths are aligned, and the elements of register V_p are reused.
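The four results above can be reproduced with a short software model in which each coefficient group of V_p serves two consecutive input groups of V_n. The function name `dot_8x16`, the `reuse` parameter and the sample values are assumptions for this sketch.

```python
def dot_8x16(vn, vp, taps=4, reuse=2):
    """vn: uint8 lanes, vp: int16 lanes; each coefficient group serves `reuse` input groups."""
    results = []
    for g in range(len(vn) // taps):
        a = vn[g * taps:(g + 1) * taps]
        c = (g // reuse) * taps          # start of the coefficient group for this input group
        b = vp[c:c + taps]
        results.append(sum(x * y for x, y in zip(a, b)))
    return results

# Lane order: a_00..a_03, a_10..a_13, a_04..a_07, a_14..a_17
vn = [10, 20, 30, 40, 1, 2, 3, 4, 50, 60, 70, 80, 5, 6, 7, 8]
# Lane order: b_0..b_3, b_4..b_7
vp = [4096, 4096, 4096, 4096, 16384, 0, 0, 0]
c00, c10, c01, c11 = dot_8x16(vn, vp)
```

No widening of the uint8 lanes is modelled as a separate step, which reflects the point that the mixed-width multiply removes the explicit bit-expansion instructions.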
6. The results are narrowed from int32 to uint8 by two narrowing operations with saturating truncation. The results are then interleaved once, after which they can be written to memory with a contiguous store instruction.
The narrowing method may be as described in CN 202211664911.5.
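A minimal sketch of a two-stage saturating narrowing is given below. The actual method is the one in the referenced application and is not reproduced here; the 14-bit right shift (matching coefficients scaled by 128 × 128), the placement of the shift in the first stage and the helper names are all assumptions.

```python
def saturate(value, low, high):
    """Clamp value to the closed range [low, high]."""
    return max(low, min(high, value))

def narrow_to_u8(acc_i32, shift=14):
    """Narrow an int32 accumulator to uint8 in two saturating stages."""
    as_i16 = saturate(acc_i32 >> shift, -32768, 32767)  # stage 1: int32 -> int16 (shift assumed)
    return saturate(as_i16, 0, 255)                     # stage 2: int16 -> uint8

pixel = narrow_to_u8(409600)  # 409600 / 16384 = 25
```

Saturation rather than wrap-around matters for image data, because an overflowing bright pixel should stay at 255 instead of turning dark.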
Comparative example 1
Bilinear interpolation is performed as in Example 4, but the dot product is computed using the first mode described in the background art. The main differences in the overall calculation process are:
1. In step 3, the four pixel points of each group are placed in four separate registers, as shown in Fig. 2.
2. In step 4, the first mode does not need the bit-expansion multiplication and reordering operations; only a bit-expansion operation is needed.
3. In step 5, the first mode must first bit-expand the pixel points and then multiply and accumulate them with the coefficient data using bit-expansion multiply-accumulate operations: one bit-expansion multiply and three bit-expansion multiply-accumulates are needed. These four multiplications correspond to the dot product calculation of the present scheme. In total there are 8 more bit-expansion operations than in Example 4. Furthermore, 16×16 multiplier resources are used, and the coefficients are not reused between channels.
4. In step 6, because the channels are calculated separately, two additional interleaving operations are required at storage time compared with the present scheme.
Next, using the open-source computer vision library OpenCV as the implementation platform, a version based on the standard vector extension 1.0 of RISC-V (the fifth-generation open-source reduced instruction set) is compared with a version that adds the present scheme on top of the RISC-V standard vector extension 1.0. The results are shown in Table 1.
TABLE 1
In Table 1, two calculations, image reduction from 1280x720 to 640x480 and image enlargement from 640x480 to 1280x720, were performed using Example 4 and Comparative Example 1 respectively. The total instruction count is the sum of all instructions required to execute the whole resize algorithm, including both general-purpose and vector instructions, while the total vector instruction count is the sum of vector instructions only. Whether the image is reduced or enlarged, Example 4 reduces the instruction count by more than half compared with Comparative Example 1. The scheme thus greatly reduces the instruction count, and the corresponding hardware performance can be more than doubled.
The above is only a specific embodiment of the present invention, but the technical features of the present invention are not limited thereto. Any simple changes, equivalent substitutions or modifications made on the basis of the present invention to solve the substantially same technical problems and achieve the substantially same technical effects are encompassed within the scope of the present invention.

Claims (10)

1. A method for accelerating computation of dot products of different bit widths in digital image processing, comprising the steps of:
(1) Placing input data into a first register according to arrangement, and placing coefficient data into a second register according to arrangement, so that the total data length of the first register and the second register is consistent;
(2) Taking out the input data in the first register and the coefficient data in the second register, and performing a dot product operation on the input data and the coefficient data, which have different bit widths, to obtain an output image calculation result.
2. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 1, wherein: in the step (1), each output data corresponds to n input data and t coefficient data, the corresponding n input data and t coefficient data are continuously arranged, and the arrangement requirement of dot product calculation is met, wherein n and t are integers larger than zero respectively.
3. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 1, wherein: in the step (2), the coefficients in the second register are multiplexed during the dot product operation, that is, the different groups of input data in the same first register are multiplied by the same coefficient values, and the number of times the coefficients are multiplexed equals the coefficient bit width divided by the input data bit width.
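The relation between register length, element widths and the multiplexing count can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names `lanes` and `reuse_count` and the 64-bit and 128-bit register lengths are assumptions chosen to match the 2-channel and 4-channel cases in the later claims.

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in a register."""
    if register_bits % element_bits:
        raise ValueError("element width must divide the register length")
    return register_bits // element_bits

def reuse_count(coeff_bits, input_bits):
    """Multiplexing count: coefficient bit width / input data bit width."""
    if coeff_bits % input_bits:
        raise ValueError("coefficient width must be a multiple of input width")
    return coeff_bits // input_bits

# 64-bit registers, 8-bit inputs, 16-bit coefficients:
# 8 inputs and 4 coefficients, so each coefficient serves 2 input groups.
assert lanes(64, 8) == lanes(64, 16) * reuse_count(16, 8)

# 128-bit registers, 8-bit inputs, 32-bit coefficients:
# 16 inputs and 4 coefficients, so each coefficient serves 4 input groups.
assert lanes(128, 8) == lanes(128, 32) * reuse_count(32, 8)
```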
4. The method of claim 2, wherein in step (1), the method of matching the total data length of the first register and the second register is as follows: in the case where the input image has 2 channels and the coefficient data is 16 bits, it is assumed that the input value of each channel is $a_{ck}$, wherein $c$ is the channel value and $k$ is the index of the pixel point in each group, and the coefficient data is recorded as $b_0, b_1, b_2, b_3$; the input data in the first register is arranged through interleaving instructions as $a_{00}, a_{01}, a_{02}, a_{03}, a_{10}, a_{11}, a_{12}, a_{13}$, the coefficient data in the second register is arranged as $b_0, b_1, b_2, b_3$, and the input data is placed into a first register.
5. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 4, wherein: in the step (2), based on the arrangement of the step (1), the two arranged groups of input data $a_{00}, a_{01}, a_{02}, a_{03}$ and $a_{10}, a_{11}, a_{12}, a_{13}$ are each combined with the coefficient data $b_0, b_1, b_2, b_3$ in a dot product calculation performed by 8×16 multipliers, i.e.
$C_{00} = a_{00} \times b_0 + a_{01} \times b_1 + a_{02} \times b_2 + a_{03} \times b_3$
$C_{10} = a_{10} \times b_0 + a_{11} \times b_1 + a_{12} \times b_2 + a_{13} \times b_3$
where $C_{00}$ is the output image calculation result for channel 0 and $C_{10}$ is the output image calculation result for channel 1.
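As an illustration of the 2-channel case, the following sketch models both registers as 64-bit integers so that the equal total length is explicit. It rests on assumptions not stated in the claims: lanes are unsigned, lane 0 sits in the lowest bits, and the helper names `pack`, `unpack` and `dot_8x16` are hypothetical.

```python
def pack(values, bits):
    """Pack unsigned lane values into one integer, lane 0 in the lowest bits."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bits)
        word |= v << (i * bits)
    return word

def unpack(word, bits, count):
    """Split an integer back into `count` unsigned lanes of `bits` bits each."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

def dot_8x16(first_reg, second_reg, total_bits=64):
    """8-bit inputs times 16-bit coefficients; coefficients reused per group."""
    inputs = unpack(first_reg, 8, total_bits // 8)      # 8 input lanes
    coeffs = unpack(second_reg, 16, total_bits // 16)   # 4 coefficient lanes
    n = len(coeffs)
    return [sum(a * b for a, b in zip(inputs[g:g + n], coeffs))
            for g in range(0, len(inputs), n)]

channel0 = [1, 2, 3, 4]        # a00..a03
channel1 = [5, 6, 7, 8]        # a10..a13
b = [10, 20, 30, 40]           # b0..b3
first = pack(channel0 + channel1, 8)   # 8 x 8 bits  = 64 bits
second = pack(b, 16)                   # 4 x 16 bits = 64 bits
c00, c10 = dot_8x16(first, second)
```

Here `c00` is 300 and `c10` is 700, matching the two formulas above evaluated by hand.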
6. The method of claim 2, wherein in step (1), the method of matching the total data length of the first register and the second register is as follows: in the case where the input image has 4 channels and the coefficient data is 16 bits, it is assumed that the input value of each channel is $a_{ck}$, wherein $c$ is the channel value and $k$ is the index of the pixel point in each group, and the coefficient data is recorded as $b_0, b_1, b_2, b_3$; through interleaving instructions, the input data in one of the first registers is arranged as $a_{00}, a_{01}, a_{02}, a_{03}, a_{10}, a_{11}, a_{12}, a_{13}$ and the input data in the other first register is arranged as $a_{20}, a_{21}, a_{22}, a_{23}, a_{30}, a_{31}, a_{32}, a_{33}$, the coefficient data in the second register is arranged as $b_0, b_1, b_2, b_3$, and the input data is placed into two first registers.
7. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 6, wherein: in step (2), in the case where the input image has 4 channels and the coefficient data is 16 bits, based on the arrangement of step (1), two dot product calculations by 8×16 multipliers are required,
the first time, the two groups of input data $a_{00}, a_{01}, a_{02}, a_{03}$ and $a_{10}, a_{11}, a_{12}, a_{13}$ are each combined with the coefficient data $b_0, b_1, b_2, b_3$ in a dot product calculation by 8×16 multipliers, i.e.
$C_{00} = a_{00} \times b_0 + a_{01} \times b_1 + a_{02} \times b_2 + a_{03} \times b_3$
$C_{10} = a_{10} \times b_0 + a_{11} \times b_1 + a_{12} \times b_2 + a_{13} \times b_3$
the second time, the two groups of input data $a_{20}, a_{21}, a_{22}, a_{23}$ and $a_{30}, a_{31}, a_{32}, a_{33}$ are each combined with the coefficient data $b_0, b_1, b_2, b_3$ in a dot product calculation by 8×16 multipliers, i.e.
$C_{20} = a_{20} \times b_0 + a_{21} \times b_1 + a_{22} \times b_2 + a_{23} \times b_3$
$C_{30} = a_{30} \times b_0 + a_{31} \times b_1 + a_{32} \times b_2 + a_{33} \times b_3$
where $C_{00}$ is the output image calculation result for channel 0, $C_{10}$ is the output image calculation result for channel 1, $C_{20}$ is the output image calculation result for channel 2, and $C_{30}$ is the output image calculation result for channel 3.
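The two-pass 4-channel case can be sketched as follows. The starting layout (pixel-major tuples, one value per channel) is an assumption made for illustration, as are the helper names `group_by_channel` and `dot_pass`; the claims only specify the arrangement inside the registers.

```python
def group_by_channel(pixels):
    """Rearrange pixel-major samples [(c0, c1, c2, c3), ...] into per-channel lists."""
    return [list(channel) for channel in zip(*pixels)]

def dot_pass(first_reg, coeffs):
    """One dot-product pass: every group of len(coeffs) inputs shares the coefficients."""
    n = len(coeffs)
    return [sum(a * b for a, b in zip(first_reg[g:g + n], coeffs))
            for g in range(0, len(first_reg), n)]

# Four pixels (k = 0..3), each with four channel values (c = 0..3).
pixels = [(1, 10, 100, 200), (2, 20, 110, 210), (3, 30, 120, 220), (4, 40, 130, 230)]
coeffs = [1, 2, 3, 4]                       # b0..b3, loaded once

ch = group_by_channel(pixels)
first_regs = [ch[0] + ch[1],                # a00..a03, a10..a13
              ch[2] + ch[3]]                # a20..a23, a30..a33

results = []
for reg in first_regs:                      # two passes, same coefficient register
    results.extend(dot_pass(reg, coeffs))
```

After both passes, `results` holds the four channel outputs in channel order: `[30, 300, 1200, 2200]`.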
8. The method of claim 2, wherein in step (1), the method of matching the total data length of the first register and the second register is as follows: in the case where the input image has 4 channels and the coefficient data is 32 bits, it is assumed that the input value of each channel is $a_{ck}$, wherein $c$ is the channel value and $k$ is the index of the pixel point in each group, and the coefficient data is recorded as $b_0, b_1, b_2, b_3$; the input data in the first register is arranged through interleaving instructions as $a_{00}, a_{01}, a_{02}, a_{03}, a_{10}, a_{11}, a_{12}, a_{13}, a_{20}, a_{21}, a_{22}, a_{23}, a_{30}, a_{31}, a_{32}, a_{33}$, the coefficient data in the second register is arranged as $b_0, b_1, b_2, b_3$, and the input data is placed into a first register.
9. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 8, wherein: in step (2), in the case where the input image has 4 channels and the coefficient data is 32 bits, based on the arrangement of step (1), one dot product calculation by 8×32 multipliers is required, i.e.
$C_{00} = a_{00} \times b_0 + a_{01} \times b_1 + a_{02} \times b_2 + a_{03} \times b_3$
$C_{10} = a_{10} \times b_0 + a_{11} \times b_1 + a_{12} \times b_2 + a_{13} \times b_3$
$C_{20} = a_{20} \times b_0 + a_{21} \times b_1 + a_{22} \times b_2 + a_{23} \times b_3$
$C_{30} = a_{30} \times b_0 + a_{31} \times b_1 + a_{32} \times b_2 + a_{33} \times b_3$
where $C_{00}$ is the output image calculation result for channel 0, $C_{10}$ is the output image calculation result for channel 1, $C_{20}$ is the output image calculation result for channel 2, and $C_{30}$ is the output image calculation result for channel 3.
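The four formulas of the 32-bit case share one pattern, which can be written compactly. The following is only a restatement under the assumption, implied by the 8×32 multipliers, that each input value is 8 bits wide.

```latex
\[
  C_{c0} \;=\; \sum_{k=0}^{3} a_{ck}\, b_k , \qquad c = 0, 1, 2, 3
\]
\[
  \underbrace{16 \times 8\ \text{bits}}_{\text{first register}}
  \;=\;
  \underbrace{4 \times 32\ \text{bits}}_{\text{second register}}
  \;=\; 128\ \text{bits},
  \qquad
  \frac{32\ \text{bits}}{8\ \text{bits}} = 4\ \text{groups sharing } b_0, \dots, b_3 .
\]
```

With 32-bit coefficients, the four channel groups fit in one first register, so a single dot product calculation produces all four channel results.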
10. A bilinear interpolation algorithm using the method of accelerating the calculation of different bit-width dot products in digital image processing according to any of claims 1-9, characterized in that it comprises the following steps:
(1) Obtaining a scaling coefficient according to the input image and the output image, and calculating pixel point coordinates of the output image corresponding to the input image;
(2) Multiplying the coordinates of the pixel points of the output image by the scaling coefficient to obtain the integer part and the decimal part of the corresponding pixel point coordinates in the input image, wherein the integer part gives the coordinate values in the input image and is put into a memory, and the decimal part is quantized and then used to calculate the coefficient values; the decimal part is quantized as follows: multiplying the decimal part of the abscissa by 128 to obtain a coefficient u, and subtracting u from 128 to obtain (1-u); multiplying the decimal part of the ordinate by 128 to obtain a coefficient v, and subtracting v from 128 to obtain (1-v); recording u and (1-u) as coefficient group 1, and v and (1-v) as coefficient group 2; performing bit-extension multiplication on coefficient group 1 and coefficient group 2 to obtain (1-u)×(1-v), u×(1-v), (1-u)×v and u×v; performing element interleaving and reordering to obtain (1-u)×(1-v), (1-u)×v, u×(1-v), u×v; and sequentially placing the input image pixel values into the first register as input data, so that the total data lengths of the first register and the second register are consistent;
(3) Taking out the pixel values of the input image from the first register and the coefficient data from the second register, performing the dot product operation through the multiplier, and repeating these steps until the dot product operation has been performed for all the pixel data of the output image, so as to obtain all the pixel values of the output image.
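The steps of claim 10 can be sketched for a single-channel image as follows. This is a minimal scalar sketch, not the vectorized implementation: the function name `bilinear_pixel`, the border clamping, the final right shift by 14 bits and the absence of rounding are all assumptions that the claims do not specify, and the scaling coefficient is taken to be input size divided by output size.

```python
import math

Q = 128  # quantization scale from step (2): decimal parts are multiplied by 128

def bilinear_pixel(img, x_out, y_out, scale_x, scale_y):
    """Compute one output pixel of a single-channel image (list of rows)."""
    # Step (2): map the output coordinate back into the input image.
    sx, sy = x_out * scale_x, y_out * scale_y
    x0, y0 = int(math.floor(sx)), int(math.floor(sy))   # integer parts
    u = int((sx - x0) * Q)      # quantized decimal part of the abscissa
    v = int((sy - y0) * Q)      # quantized decimal part of the ordinate
    # Clamp the neighbours at the image border (assumption).
    x1 = min(x0 + 1, len(img[0]) - 1)
    y1 = min(y0 + 1, len(img) - 1)
    # Coefficient data in the order (1-u)(1-v), (1-u)v, u(1-v), uv.
    coeffs = [(Q - u) * (Q - v), (Q - u) * v, u * (Q - v), u * v]
    # Input data in the matching order:
    # top-left, bottom-left, top-right, bottom-right.
    inputs = [img[y0][x0], img[y1][x0], img[y0][x1], img[y1][x1]]
    # Step (3): dot product of 8-bit pixels with 16-bit coefficients.
    acc = sum(a * b for a, b in zip(inputs, coeffs))
    return acc >> 14            # the four weights sum to 128 * 128 = 2**14

img = [[0, 100],
       [100, 200]]
centre = bilinear_pixel(img, 1, 1, 0.5, 0.5)   # halfway between all four pixels
```

For this input, `centre` is 100, the average of the four neighbours. Each coefficient is at most 128 × 128 = 16384, so it fits in the 16-bit coefficient lanes used in the earlier claims.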
CN202311195824.4A 2023-09-15 2023-09-15 Calculation method for accelerating dot products with different bit widths in digital image processing Pending CN117372495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311195824.4A CN117372495A (en) 2023-09-15 2023-09-15 Calculation method for accelerating dot products with different bit widths in digital image processing

Publications (1)

Publication Number Publication Date
CN117372495A true CN117372495A (en) 2024-01-09

Family

ID=89395385

Country Status (1)

Country Link
CN (1) CN117372495A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030225998A1 (en) * 2002-01-31 2003-12-04 Khan Mohammed Noshad Configurable data processor with multi-length instruction set architecture
US20160125263A1 (en) * 2014-11-03 2016-05-05 Texas Instruments Incorporated Method to compute sliding window block sum using instruction based selective horizontal addition in vector processor
CN109298886A (en) * 2017-07-25 2019-02-01 合肥君正科技有限公司 SIMD instruction executes method, apparatus and processor
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN114064122A (en) * 2021-11-12 2022-02-18 龙芯中科技术股份有限公司 Instruction processing method, device, equipment and storage medium
US20230108629A1 (en) * 2021-10-04 2023-04-06 Arm Limited Matrix Multiply Accelerator For Variable Bitwidth Operands

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
石蕤: "基于RISC-Ⅴ架构的扩展指令微处理器设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 March 2022 (2022-03-15), pages 137 - 59 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination