CN117372495A

CN117372495A - Calculation method for accelerating dot products with different bit widths in digital image processing

Info

Publication number: CN117372495A
Application number: CN202311195824.4A
Authority: CN
Inventors: 俞林杰; 罗嘉蕙; 张丹枫
Original assignee: Jindi Space Time Hangzhou Technology Co ltd
Current assignee: Jindi Space Time Hangzhou Technology Co ltd
Priority date: 2023-09-15
Filing date: 2023-09-15
Publication date: 2024-01-09

Abstract

The invention belongs to the field of image processing and discloses a calculation method for accelerating dot products with different bit widths in digital image processing. Input data and coefficient data with different bit widths are placed into a first register and a second register, respectively, in a specific arrangement. This arrangement guarantees that the first register and the second register hold the same total data length even though their element bit widths differ, which makes it easy for hardware to read in the data. In addition, when the dot product of the input data and the coefficient data is computed, the coefficient data is multiplexed by exploiting the fact that different channels share the same coefficients, which effectively reduces the number of times data is repeatedly read into the execution unit.

Description

Calculation method for accelerating dot products with different bit widths in digital image processing

Technical Field

The invention belongs to the field of image processing, and particularly relates to a calculation method for accelerating dot products with different bit widths in digital image processing.

Background

With the widespread use of digital image processing algorithms in everyday life and the large-scale deployment of visual AI algorithms, the performance of these algorithms is becoming increasingly important. In AI applications, the network requires an image of a fixed resolution as input, but the output resolution of each video device is different and substantially greater than the resolution required by the network. The image therefore has to be scaled to meet the input requirements of the network. In scenarios such as face recognition, the angle of certain objects in the image also needs to be adjusted so that the algorithm can recognize them better. These scenarios all rely on the image size adjustment algorithm (resize algorithm), the numerical remapping algorithm (remap algorithm) and the affine transformation algorithm (affine algorithm) from the field of digital image processing. The key calculation in these algorithms is interpolation. Interpolation modes include linear interpolation, nearest-neighbor interpolation and the like, of which bilinear interpolation is the most frequently used. In addition, the neural network also places requirements on the input format of the image, which is generally RGB, whereas the raw data output by a camera is generally in YUV format, so an algorithm for converting the image format is also needed.

Whether for bilinear interpolation or for image format conversion, the core computation is a dot product. Bilinear interpolation is used as the example below.

Bilinear interpolation, as an interpolation algorithm from numerical analysis, is widely applied in signal processing, digital image processing, video processing and similar fields.

Bilinear interpolation extends linear interpolation to functions of two variables; its core idea is to perform linear interpolation once in each of the two directions. As shown in fig. 1, let P = (x, y) be the point to be solved and let (x1, y1), (x1, y2), (x2, y1), (x2, y2) be its four nearest points, whose values are Q11, Q12, Q21 and Q22, respectively. To obtain the value of P, interpolation may be performed twice in the X direction and then once in the Y direction.

The calculation of P can be expressed simply as:

P = W1×Q11 + W2×Q12 + W3×Q21 + W4×Q22,

where

W1 = ((x2-x)/(x2-x1)) × ((y2-y)/(y2-y1)),

W2 = ((x2-x)/(x2-x1)) × ((y-y1)/(y2-y1)),

W3 = ((x-x1)/(x2-x1)) × ((y2-y)/(y2-y1)),

W4 = ((x-x1)/(x2-x1)) × ((y-y1)/(y2-y1)).

In image processing, the source coordinates that correspond to a target pixel are almost always floating-point numbers, because they are obtained by multiplying by a scaling factor that is itself most likely a floating-point number. Assume the coordinates are (i+u, j+v), where i and j are the integer parts of the floating-point coordinates and u and v are the fractional parts, which are floating-point numbers in the interval [0, 1). The value f(i+u, j+v) of this pixel is then determined by the values of the four surrounding pixels at coordinates (i, j), (i+1, j), (i, j+1) and (i+1, j+1) in the original image. Because these four pixels are adjacent, the denominator terms (x2-x1) and (y2-y1) of W1, W2, W3 and W4 in the formula for P are both equal to 1, and the formula becomes:

f(i+u, j+v) = (1-u)×(1-v)×f(i, j) + (1-u)×v×f(i, j+1) + u×(1-v)×f(i+1, j) + u×v×f(i+1, j+1).

It can be seen that the key calculation is a dot product of four inputs and four coefficients.
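The dot-product form of this formula can be illustrated with a short sketch. It is only an illustration in floating point, not part of the claimed method; the function names `bilinear_weights` and `bilinear_sample` and the indexing convention `f[i][j]` are assumptions made for this example.

```python
def bilinear_weights(u, v):
    """Weights for f(i,j), f(i,j+1), f(i+1,j), f(i+1,j+1), in that order."""
    return [(1 - u) * (1 - v), (1 - u) * v, u * (1 - v), u * v]


def bilinear_sample(f, i, j, u, v):
    """Interpolate at (i+u, j+v) as a dot product of four pixels and four weights.

    f is a 2-D list indexed as f[i][j] (an assumption of this sketch);
    u and v are the fractional parts in [0, 1).
    """
    pixels = [f[i][j], f[i][j + 1], f[i + 1][j], f[i + 1][j + 1]]
    weights = bilinear_weights(u, v)
    return sum(w * p for w, p in zip(weights, pixels))


if __name__ == "__main__":
    image = [[10, 20], [30, 40]]
    print(bilinear_sample(image, 0, 0, 0.5, 0.5))
```

The four weights always sum to 1, so the result stays within the range of the four surrounding pixel values.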

In the field of image processing, however, the two operands of a dot product often have different bit widths. The pixel gray value is fixed as an 8-bit unsigned integer (uint8), while the coefficients are quantized to fixed-point values scaled by 2^N. To preserve precision, the coefficients are quantized to 16-bit signed integers (int16) or 32-bit signed integers (int32). Taking bilinear interpolation as an example, quantizing the coefficients to int16 meets the accuracy requirement, and the calculation becomes result = a₀×b₀ + a₁×b₁ + a₂×b₂ + a₃×b₃, where a₀, a₁, a₂, a₃ correspond to Q12, Q22, Q11, Q21 in the figure above, respectively. Accelerating this calculation accelerates bilinear interpolation.
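As a rough illustration of this mixed-width calculation, the following sketch quantizes four floating-point weights to integers scaled by 2^N and multiplies them with uint8 pixel values. The choice N = 14, the round-to-nearest rule and the function names are assumptions for this example; the text only states that the coefficients are quantized to int16 or int32.

```python
def quantize_weights(weights, n_bits=14):
    """Quantize float weights to fixed point scaled by 2**n_bits (assumed N = 14)."""
    scale = 1 << n_bits
    coeffs = [int(round(w * scale)) for w in weights]
    # The quantized coefficients must fit in a signed 16-bit integer.
    assert all(-32768 <= c <= 32767 for c in coeffs), "coefficient does not fit int16"
    return coeffs


def fixed_point_dot(pixels_u8, coeffs, n_bits=14):
    """Dot product of uint8 pixels and int16 coefficients, scaled back by 2**n_bits."""
    assert all(0 <= p <= 255 for p in pixels_u8), "pixel does not fit uint8"
    acc = sum(p * c for p, c in zip(pixels_u8, coeffs))  # fits in int32
    return (acc + (1 << (n_bits - 1))) >> n_bits         # round to nearest


if __name__ == "__main__":
    b = quantize_weights([0.25, 0.25, 0.25, 0.25])
    print(b, fixed_point_dot([10, 20, 30, 40], b))
```

Note that the products here are 8-bit by 16-bit; no operand has to be widened to make the arithmetic correct, which is the property the method below exploits in hardware.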

On current general-purpose CPUs, Single Instruction Multiple Data (SIMD) technology can be used to accelerate such computation. However, for a multiply-accumulate whose operands have different bit widths, a bit-widening (bit-expansion) operation must be performed first so that the two input bit widths are aligned, and only then can the multiply-accumulate be performed. With SIMD instructions and existing hardware there are mainly two modes, whose computation flows are described below using 128-bit registers as an example:

In the first mode, the arrangement is as shown in fig. 2: the four input data and the four coefficients are placed into five registers. The calculation mainly comprises two steps, bit-widening and widening multiply-accumulate. The uint8 input data is first widened to int16, and then one widening multiplication and three widening multiply-accumulates are performed.

C₀ = a₀₀×b₀₀ + a₀₁×b₀₁ + a₀₂×b₀₂ + a₀₃×b₀₃,

C₁ = a₁₀×b₀₀ + a₁₁×b₀₁ + a₁₂×b₀₂ + a₁₃×b₀₃,

C₂ = a₂₀×b₀₀ + a₂₁×b₀₁ + a₂₂×b₀₂ + a₂₃×b₀₃,

C₃ = a₃₀×b₀₀ + a₃₁×b₀₁ + a₃₂×b₀₂ + a₃₃×b₀₃,

where C₀ is the output image calculation result of channel 0, C₁ is that of channel 1, C₂ is that of channel 2, and C₃ is that of channel 3.

Counting both plain widening and widening multiply-accumulate, a total of eight widening operations are involved.

In the second mode, the arrangement is as shown in fig. 3: the four input data and the four coefficients are placed into four registers. The calculation mainly comprises two steps, bit-widening and 16-bit dot product (dot) calculation. The uint8 input data is first widened to int16, and then the 16-bit dot calculation is performed twice.

Compared with the first mode, the second mode replaces the widening multiply-accumulate with a 16-bit dot product, which removes four widening operations but still requires four.

From the standpoint of the overall bilinear interpolation, the results of these widening operations are only intermediate values and are not needed in themselves. For hardware, moreover, the write-back cost of a widening operation is high: one widening operation requires at least two execution cycles or has to be split into two instructions. Both modes also use 16×16 multipliers, which occupy more resources.

Disclosure of Invention

The invention aims to provide a calculation method for accelerating dot products with different bit widths in digital image processing which, through the arrangement of elements, multiplexes the coefficient data while ensuring that the two inputs of the dot product have the same total data bit width, so as to facilitate hardware processing.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A method of accelerating computation of dot products of different bit widths in digital image processing, comprising the steps of:

(1) Placing input data into a first register according to an arrangement, and placing coefficient data into a second register according to an arrangement, so that the total data lengths of the first register and the second register are consistent;

(2) Taking out the input data in the first register and the coefficient data in the second register, and performing a dot product operation on the input data and the coefficient data with different bit widths to obtain an output image calculation result.

An output image is generated from an input image by scaling. The input image pixel values are placed as input data into the first register, i.e. register V_n, and the coefficient data is placed into the second register, i.e. register V_p. Typically, the image data is 8 bits and the coefficient data is 16 bits or 32 bits. Assume register V_n stores x groups of pixel values. For the total data lengths of register V_n and register V_p to be consistent, register V_p holds x/2 groups of coefficient data for 16-bit coefficients and x/4 groups for 32-bit coefficients.

Taking 16-bit coefficient data as an example, a first group of pixel values is fetched from register V_n and the first group of coefficient data is fetched from register V_p, the dot product is computed with 8×16 multipliers to obtain the result, and these steps are repeated. Two groups of input image pixel values correspond to one group of coefficient data: the first and second groups of input image pixel values in register V_n are dot-multiplied with the first group of coefficient data in register V_p, the third and fourth groups of input image pixel values in register V_(n+1) are dot-multiplied with the second group of coefficient data in register V_p, and so on, until the pixel values of the output image are finally obtained. In the 16-bit scenario, the 16-bit coefficients are multiplexed twice.

For 32bit coefficient data, four sets of input image pixel values correspond to one set of coefficient data. In a 32bit scene, the 32bit coefficients may be multiplexed four times.
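The relationship between the group counts and the multiplexing factor can be checked with a small calculation. This is a sketch only; the function name `register_layout`, the 128-bit register width and the group size of four elements (the four taps of bilinear interpolation) are assumptions chosen for illustration.

```python
def register_layout(register_bits, pixel_bits, coeff_bits, taps=4):
    """Return (pixel groups, coefficient groups, multiplexing factor).

    A group holds `taps` elements, one per term of the dot product.
    Both registers are assumed to have the same width `register_bits`.
    """
    pixel_groups = register_bits // (pixel_bits * taps)
    coeff_groups = register_bits // (coeff_bits * taps)
    reuse = coeff_bits // pixel_bits  # coefficient bit width / input bit width
    # Every coefficient group serves `reuse` pixel groups, so the counts must agree.
    assert pixel_groups == coeff_groups * reuse
    return pixel_groups, coeff_groups, reuse


if __name__ == "__main__":
    print(register_layout(128, 8, 16))  # 16-bit coefficients
    print(register_layout(128, 8, 32))  # 32-bit coefficients
```

With x = 4 pixel groups in a 128-bit register, the sketch gives x/2 = 2 coefficient groups for 16-bit coefficients and x/4 = 1 group for 32-bit coefficients, matching the counts stated above.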

A bilinear interpolation algorithm using the method of accelerating computation of dot products of different bit widths in digital image processing, comprising the steps of:

(1) Obtaining scaling coefficients from the input image and the output image, and calculating the coordinates in the input image that correspond to the pixel points of the output image;

(2) Multiplying the coordinates of each pixel point of the output image by the scaling coefficient to obtain the integer part and the fractional part of the corresponding input image coordinates. The integer part is the coordinate value in the input image and is placed into memory. The fractional part is quantized and then used to calculate the coefficient values, as follows: the fractional part of the abscissa is multiplied by 128 to obtain coefficient u, and u is subtracted from 128 to obtain (1-u); the fractional part of the ordinate is multiplied by 128 to obtain coefficient v, and v is subtracted from 128 to obtain (1-v). u and (1-u) are recorded as coefficient group 1, and v and (1-v) as coefficient group 2. Coefficient group 1 and coefficient group 2 are multiplied with bit-widening to obtain the four products (1-u)×(1-v), (1-u)×v, u×(1-v) and u×v, which are interleaved and reordered into the order (1-u)×(1-v), (1-u)×v, u×(1-v), u×v and placed into the second register as coefficient data. The input image pixel values are placed in sequence into the first register as input data, so that the total data lengths of the first register and the second register are consistent;

(3) Taking the input image pixel values out of the first register and the coefficient data out of the second register, performing the dot product through the multipliers, and repeating these steps until the dot product has been computed for all pixel data in the first register, thereby obtaining all pixel values of the output image.
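The coefficient quantization of step (2) can be sketched as follows. The function name `quantized_coefficients` and the use of truncation when converting the scaled fraction to an integer are assumptions of this example; the text specifies only the multiplication by 128 and the subtraction from 128.

```python
def quantized_coefficients(frac_x, frac_y, scale=128):
    """Return [(1-u)(1-v), (1-u)v, u(1-v), uv] in quantized form.

    frac_x, frac_y are the fractional parts of the source coordinates in [0, 1).
    Truncation toward zero is an assumption of this sketch.
    """
    u = int(frac_x * scale)          # quantized abscissa fraction, fits uint8
    v = int(frac_y * scale)          # quantized ordinate fraction, fits uint8
    one_minus_u = scale - u          # quantized (1-u)
    one_minus_v = scale - v          # quantized (1-v)
    # Widening products: 8-bit x 8-bit values stored as int16.
    return [one_minus_u * one_minus_v, one_minus_u * v, u * one_minus_v, u * v]


if __name__ == "__main__":
    print(quantized_coefficients(0.5, 0.25))
```

The four products always sum to 128 × 128 = 16384, so each product fits in int16 and the final result has to be scaled back by 2^14.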

By adopting the technical scheme, the invention has the following beneficial effects:

1. The method effectively solves the problem of arranging elements with different bit widths: although the element bit widths differ, the total lengths of the input registers remain aligned. Hardware can execute efficiently without adding extra read-port resources.

2. Because the coefficient data is the same across the different channels of a digital image, the invention uses this property to multiplex the coefficients between channels, which effectively reduces the number of times data is repeatedly read into the execution units. The acceleration is especially pronounced for images with an even number of channels.

3. The bit-widening operations otherwise needed in the calculation are eliminated, which greatly reduces the instruction count.

4. Multiplier resources are effectively reduced.

Drawings

The invention is further described below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of bilinear interpolation.

Fig. 2 is an arrangement diagram of the first mode in the background art.

Fig. 3 is an arrangement diagram of the second mode in the background art.

Fig. 4 is an arrangement diagram in example 1.

Fig. 5 is an arrangement diagram in example 2.

Fig. 6 is an arrangement diagram in example 3.

Fig. 7 is an arrangement calculation chart in example 4.

Fig. 8 is a schematic diagram of the four-channel bilinear interpolation method in example 4.

Fig. 9 is a flowchart in embodiment 4.

Detailed Description

Example 1

(1) In the case where the input image has 2 channels and the coefficient data is 16 bits, let the input value of each channel be a_ck, where c is the channel index and k is the index of the pixel point within each group, and let the coefficient data be b₀, b₁, b₂, b₃. Using interleaving instructions, the input data in the first register is arranged as a₀₀, a₀₁, a₀₂, a₀₃, a₁₀, a₁₁, a₁₂, a₁₃, and the coefficient data in the second register is arranged as b₀, b₁, b₂, b₃. The input data is placed into one first register; the arrangement is shown in fig. 4;

(2) The two groups of arranged input data, a₀₀, a₀₁, a₀₂, a₀₃ and a₁₀, a₁₁, a₁₂, a₁₃, are each dot-multiplied with the coefficient data b₀, b₁, b₂, b₃ in a single dot product calculation using 8×16 multipliers, i.e.

C₀₀ = a₀₀×b₀ + a₀₁×b₁ + a₀₂×b₂ + a₀₃×b₃,

C₁₀ = a₁₀×b₀ + a₁₁×b₁ + a₁₂×b₂ + a₁₃×b₃,

where C₀₀ is the output image calculation result of channel 0 and C₁₀ is that of channel 1.
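The arrangement of this example can be simulated in software. The sketch below is illustrative only: the function name `example1_dot` is hypothetical, registers are modeled as Python lists, and the hardware 8×16 multipliers are replaced by ordinary integer multiplication.

```python
def example1_dot(first_register, second_register):
    """Two-channel, 16-bit-coefficient case.

    first_register:  8 uint8 values [a00, a01, a02, a03, a10, a11, a12, a13]
    second_register: 4 int16 values [b0, b1, b2, b3]
    Returns (C00, C10).
    """
    # Both registers hold the same total number of bits (8*8 == 4*16).
    assert len(first_register) * 8 == len(second_register) * 16
    # The same coefficient group is used for both channels.
    c00 = sum(a * b for a, b in zip(first_register[0:4], second_register))
    c10 = sum(a * b for a, b in zip(first_register[4:8], second_register))
    return c00, c10


if __name__ == "__main__":
    pixels = [10, 20, 30, 40, 1, 2, 3, 4]
    coeffs = [4096, 4096, 4096, 4096]
    print(example1_dot(pixels, coeffs))
```

The coefficient list is read once and used for both channel groups, which is the multiplexing the example describes.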

Example 2

(1) In the case where the input image has 4 channels and the coefficient data is 16 bits, let the input value of each channel be a_ck, where c is the channel index and k is the index of the pixel point within each group, and let the coefficient data be b₀, b₁, b₂, b₃. Using interleaving instructions, the input data in one first register is arranged as a₀₀, a₀₁, a₀₂, a₀₃, a₁₀, a₁₁, a₁₂, a₁₃, the input data in the other first register is arranged as a₂₀, a₂₁, a₂₂, a₂₃, a₃₀, a₃₁, a₃₂, a₃₃, and the coefficient data in the second register is arranged as b₀, b₁, b₂, b₃. The input data is placed into two first registers; the arrangement is shown in fig. 5;

(2) In the case where the input image has 4 channels and the coefficient data is 16 bits, based on the arrangement of step (1), two dot product calculations using 8×16 multipliers are required.

First, the two groups of input data a₀₀, a₀₁, a₀₂, a₀₃ and a₁₀, a₁₁, a₁₂, a₁₃ are each dot-multiplied with the coefficient data b₀, b₁, b₂, b₃ using 8×16 multipliers, i.e.

C₀₀ = a₀₀×b₀ + a₀₁×b₁ + a₀₂×b₂ + a₀₃×b₃,

C₁₀ = a₁₀×b₀ + a₁₁×b₁ + a₁₂×b₂ + a₁₃×b₃,

Second, the two groups of input data a₂₀, a₂₁, a₂₂, a₂₃ and a₃₀, a₃₁, a₃₂, a₃₃ are each dot-multiplied with the coefficient data b₀, b₁, b₂, b₃ using 8×16 multipliers, i.e.

C₂₀ = a₂₀×b₀ + a₂₁×b₁ + a₂₂×b₂ + a₂₃×b₃,

C₃₀ = a₃₀×b₀ + a₃₁×b₁ + a₃₂×b₂ + a₃₃×b₃,

where C₀₀ is the output image calculation result of channel 0, C₁₀ is that of channel 1, C₂₀ is that of channel 2, and C₃₀ is that of channel 3.

Example 3

(1) In the case where the input image has 4 channels and the coefficient data is 32 bits, let the input value of each channel be a_ck, where c is the channel index and k is the index of the pixel point within each group, and let the coefficient data be b₀, b₁, b₂, b₃. Using interleaving instructions, the input data in the first register is arranged as a₀₀, a₀₁, a₀₂, a₀₃, a₁₀, a₁₁, a₁₂, a₁₃, a₂₀, a₂₁, a₂₂, a₂₃, a₃₀, a₃₁, a₃₂, a₃₃, and the coefficient data in the second register is arranged as b₀, b₁, b₂, b₃. The input data is placed into one first register; the arrangement is shown in fig. 6;

(2) In the case where the input image has 4 channels and the coefficient data is 32 bits, based on the arrangement of step (1), a single dot product calculation using 8×32 multipliers is required: each of the four groups of input data, one per channel, is multiplied and accumulated with the same coefficient data b₀, b₁, b₂, b₃.

Example 4

This example provides a calculation method for accelerating dot products with different bit widths in digital image processing, with the arrangement shown in fig. 7. Taking bilinear interpolation of a four-channel image as an example, the pixel value at point (i, j) of the image is denoted f(i, j). The principle of four-channel bilinear interpolation is shown in fig. 8: the pixel points of each channel are drawn with one shape (diamond, circle, triangle or hexagon), pixel points of the same shape belong to the same channel, and interpolation has to be computed for each of the four channels. The positions of the four points Q11, Q12, Q21 and Q22 are selected on the same principle as described in the background art. The difference is that, once each Q position has been obtained, four points (one from each of the four channels) must be read consecutively for the interpolation of the four channels. As shown in fig. 9, the specific steps are as follows.

1. Compute the lateral scaling coefficient P_W = W_in / W_out and the longitudinal scaling coefficient P_H = H_in / H_out from the input image and the output image, where H_in is the height of the input image, W_in is the width of the input image, H_out is the height of the output image, and W_out is the width of the output image.

2. Multiply the abscissa values [0, W_out) of the pixel points P of the output image by the lateral scaling coefficient P_W to obtain the corresponding input image abscissa values [0, W_float). Each value is expressed as an integer part plus a fractional part, with m denoting the abscissa integer part. Rounding [0, W_float) down gives the integer parts [0, W_m), and the difference between each abscissa and its integer part gives the corresponding abscissa fractional part. Similarly, for the ordinate values [0, H_out), the integer parts [0, H_n) of the corresponding input coordinates and the ordinate fractional parts are obtained, with n denoting the ordinate integer part.

As shown in fig. 8, the four input pixel points are Q11, Q12, Q21 and Q22, where Q11 is f(i, j), Q12 is f(i, j+1), Q21 is f(i+1, j) and Q22 is f(i+1, j+1), with i in [0, W_m) and j in [0, H_n). The abscissa integer parts and ordinate integer parts are recorded as a table T, which is placed into memory for later use. The abscissa fractional part and the ordinate fractional part are each multiplied by 128 to quantize them to uint8, giving u, the quantized abscissa fractional part, and v, the quantized ordinate fractional part. Subtracting u from 128 gives (1-u), and subtracting v from 128 gives (1-v). u and (1-u) are recorded as coefficient group 1, and v and (1-v) as coefficient group 2. Coefficient group 1 is placed into memory in the order (1-u), u, and coefficient group 2 in the order (1-v), v.
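Steps 1 and 2 can be sketched for one axis as follows. The function name `coordinate_table` is hypothetical, and border handling (the case where i+1 or j+1 falls outside the input image) is deliberately left out because the text does not describe it.

```python
import math


def coordinate_table(size_in, size_out):
    """For one axis, map every output coordinate to (integer part, fractional part).

    The scaling coefficient is size_in / size_out, as in step 1.
    The integer parts correspond to table T; the fractional parts are the
    values that are later multiplied by 128 and quantized.
    """
    scale = size_in / size_out
    table = []
    for p in range(size_out):
        src = p * scale              # floating-point source coordinate
        integer = math.floor(src)    # round down
        table.append((integer, src - integer))
    return table


if __name__ == "__main__":
    # Horizontal axis of a 1280 -> 640 reduction: every source coordinate is even.
    print(coordinate_table(1280, 640)[:3])
    # Vertical axis of a 720 -> 480 reduction: scaling coefficient 1.5.
    print(coordinate_table(720, 480)[:3])
```

Because the table depends only on the image sizes, it can be computed once per axis and reused for every row or column, which is consistent with storing table T in memory for later use.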

3. Taking a register width of 128 bits as an example, table T is loaded from memory, and through table T the 32 input image pixel points corresponding to the output image are obtained. These form 8 groups in total: 2 groups per channel, with 4 points per group. The four points of each group are located at Q11 (i, j), Q12 (i, j+1), Q21 (i+1, j) and Q22 (i+1, j+1).

As shown in fig. 7, the gray value of each pixel point is denoted a_ck, where c is the channel index and k is the index of the pixel point within each group. Using interleaving instructions (see the ARM instruction set manual), the two groups of pixel points of channel 1, a₀₀-a₀₃ and a₀₄-a₀₇, are placed in order into bits [0,31] and bits [64,95] of one of the first registers, register V_n, and the two groups of pixel points of channel 2, a₁₀-a₁₃ and a₁₄-a₁₇, are placed in order into bits [32,63] and bits [96,127] of register V_n. Register V_n stores pixel values, and its data type is uint8.

The four groups of data of channel 3 and channel 4 are then placed, in the same arrangement, into another first register, register V_(n+1).

4. Two groups of coefficient group 1 are loaded into register V_m. Two groups of coefficient group 2 are loaded, one into register X_k and one into register X_q. At this point V_m holds 2 pairs of (1-u), u data, and X_k and X_q each hold 1 pair of (1-v), v data. Widening multiplications of V_m with X_k and with X_q then yield two groups of the coefficients (1-u)×(1-v), (1-u)×v, u×(1-v) and u×v. Interleaving instructions reorder each group so that (1-u)×(1-v) is b₀, (1-u)×v is b₁, u×(1-v) is b₂ and u×v is b₃, and the result is placed into the second register, register V_p. As shown in fig. 9, register V_p stores coefficients, its data type is int16, and the coefficients are arranged contiguously group by group.
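The interleaved pixel layout of step 3 can be modeled with plain lists. This is a sketch under the assumption that a 128-bit register is represented as a list of 16 uint8 values ordered from bit 0 upward; the function name `pack_first_register` is hypothetical and stands in for the hardware interleaving instructions.

```python
def pack_first_register(channel_a, channel_b, taps=4):
    """Interleave two channels group by group into one first register.

    channel_a, channel_b: 8 uint8 values each (two groups of four points).
    Result order: A group 0, B group 0, A group 1, B group 1,
    i.e. bits [0,31], [32,63], [64,95], [96,127] of a 128-bit register.
    """
    assert len(channel_a) == len(channel_b)
    register = []
    for start in range(0, len(channel_a), taps):
        register.extend(channel_a[start:start + taps])
        register.extend(channel_b[start:start + taps])
    return register


if __name__ == "__main__":
    first = list(range(0, 8))      # stands for a00..a07
    second = list(range(10, 18))   # stands for a10..a17
    print(pack_first_register(first, second))
```

Placing the two channels of the same pixel group next to each other is what lets one coefficient group serve two adjacent pixel groups in the dot product.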

5. After steps 3 and 4, the pixel points and the corresponding coefficients are arranged according to fig. 7. As shown in fig. 4, the dot product of the pixel data in register V_n and the coefficient data in register V_p is computed using 8×16 multipliers. The specific calculation is as follows:

C₀₀ = a₀₀×b₀ + a₀₁×b₁ + a₀₂×b₂ + a₀₃×b₃

C₁₀ = a₁₀×b₀ + a₁₁×b₁ + a₁₂×b₂ + a₁₃×b₃

C₀₁ = a₀₄×b₄ + a₀₅×b₅ + a₀₆×b₆ + a₀₇×b₇

C₁₁ = a₁₄×b₄ + a₁₅×b₅ + a₁₆×b₆ + a₁₇×b₇

The results C₀₀, C₁₀, C₀₁, C₁₁ are stored into register C, and their data type is int32. The calculation results of the remaining two channels are obtained in the same way.

In this calculation, although registers V_n and V_p hold data of different bit widths, their total data lengths are aligned, and the elements of register V_p are multiplexed.
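A software model of this mixed-width dot product is sketched below. The function name `mixed_width_dot` and the list representation of the registers are assumptions; the rule that pixel group g uses coefficient group g // (coefficient bit width / pixel bit width) is inferred from the four equations of step 5 and from the 32-bit case described earlier.

```python
def mixed_width_dot(vn, vp, pixel_bits=8, coeff_bits=16, taps=4):
    """Dot product of pixel groups in vn with multiplexed coefficient groups in vp.

    vn: list of uint8 pixels, `taps` per group.
    vp: list of coefficients, `taps` per group.
    Returns one int32-range accumulator per pixel group.
    """
    # The two registers must hold the same total number of bits.
    assert len(vn) * pixel_bits == len(vp) * coeff_bits
    reuse = coeff_bits // pixel_bits  # pixel groups served by each coefficient group
    results = []
    for g in range(len(vn) // taps):
        pixels = vn[g * taps:(g + 1) * taps]
        k = g // reuse
        coeffs = vp[k * taps:(k + 1) * taps]
        results.append(sum(p * c for p, c in zip(pixels, coeffs)))
    return results


if __name__ == "__main__":
    vn = [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 1, 1, 2, 2, 2, 2]
    vp16 = [1, 2, 3, 4, 10, 20, 30, 40]   # two 16-bit groups: b0..b3, b4..b7
    print(mixed_width_dot(vn, vp16))       # C00, C10, C01, C11
    vp32 = [1, 2, 3, 4]                    # one 32-bit group
    print(mixed_width_dot(vn, vp32, coeff_bits=32))
```

With 16-bit coefficients each coefficient group is used twice, and with 32-bit coefficients four times, without any widening of the pixel data.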

6. The result is narrowed from int32 to uint8 by two narrowing operations with saturating truncation. The results are then interleaved once, after which they can be written to memory with a contiguous store instruction.

The narrowing method may be as described in CN 202211664911.5.
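One possible form of the two-stage saturating narrowing is sketched below. It is not taken from CN 202211664911.5; the function names, the right shift by 14 bits (chosen because the coefficients are scaled by 128 × 128 = 2^14) and the placement of that shift in the first stage are all assumptions of this example.

```python
def saturate(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def narrow_int32_to_uint8(acc, shift=14):
    """Narrow an int32 accumulator to uint8 in two saturating stages.

    Stage 1: shift out the fixed-point scale and saturate to int16.
    Stage 2: saturate the int16 value to uint8.
    """
    stage1 = saturate(acc >> shift, -32768, 32767)  # int32 -> int16
    return saturate(stage1, 0, 255)                 # int16 -> uint8


if __name__ == "__main__":
    print(narrow_int32_to_uint8(25 * 16384))
    print(narrow_int32_to_uint8(300 * 16384))
```

Saturation matters because rounding or overshoot in other interpolation kernels can push the accumulator slightly outside the representable pixel range; clamping avoids wrap-around artifacts.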

Comparative example 1

Bilinear interpolation is again performed, but the dot product is calculated using the first mode described in the background art. Over the whole calculation process, the main differences are:

1. In step 3, the four pixel points of each group are placed into four separate registers, as shown in fig. 2.

2. In step 4, the first mode does not need the widening multiplication and reordering operations; only a bit-widening operation is needed.

3. In step 5, the first mode has to widen the pixel points first and then multiply and accumulate them with the coefficient data using widening multiply-accumulate operations: one widening multiplication and three widening multiply-accumulates. These four multiplications correspond to the dot product calculation of the present scheme. In total there are 8 more widening operations than in the embodiment. In addition, 16×16 multiplier resources are used, and the coefficients are not multiplexed between channels.

4. In step 6, because the channels are calculated separately, two additional interleaving operations are required at storage time compared with the present scheme.

Next, using the open source computer vision library OpenCV as the implementation platform, a version based on the standard vector extension 1.0 of the fifth-generation open reduced instruction set (RISC-V) is compared with a version that adds the present scheme on top of RISC-V standard vector 1.0. The results are shown in table 1.

TABLE 1

Table 1 covers two calculations, an image size reduction from 1280x720 to 640x480 and an image size enlargement from 640x480 to 1280x720, each performed with Example 4 and with Comparative example 1. The total instruction number is the sum of all instructions required to execute the whole resize algorithm, including both general instructions and vector instructions; the total vector instruction number is the sum of vector instructions only. Whether the image is reduced or enlarged, Example 4 reduces the instruction count by more than half compared with Comparative example 1. The scheme thus greatly reduces the instruction count, and the corresponding hardware performance can be more than doubled.

The above is only a specific embodiment of the present invention, but the technical features of the present invention are not limited thereto. Any simple changes, equivalent substitutions or modifications made on the basis of the present invention to solve the substantially same technical problems and achieve the substantially same technical effects are encompassed within the scope of the present invention.

Claims

1. A method for accelerating computation of dot products of different bit widths in digital image processing, comprising the steps of:

(1) Placing input data into a first register according to arrangement, and placing coefficient data into a second register according to arrangement, so that the total data length of the first register and the second register is consistent;

(2) And taking out the input data in the first register and the coefficient data in the second register, and performing dot product operation on the input data and the coefficient data with different bit widths to obtain an output image calculation result.

2. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 1, wherein: in the step (1), each output data corresponds to n input data and t coefficient data, the corresponding n input data and t coefficient data are continuously arranged, and the arrangement requirement of dot product calculation is met, wherein n and t are integers larger than zero respectively.

3. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 1, wherein: in the step (2), the coefficients in the second register are multiplexed during dot product operation, and the coefficient values multiplied by the different groups of input data in the same first register are the same, and the multiplexing times are coefficient bit width/input data bit width.

4. The method of claim 2, wherein in step (1), the total data lengths of the first register and the second register are made consistent as follows: in the case where the input image has 2 channels and the coefficient data is 16 bits, let the input value of each channel be a_ck, where c is the channel index and k is the index of the pixel point within each group, and let the coefficient data be b₀, b₁, b₂, b₃; using interleaving instructions, the input data in the first register is arranged as a₀₀, a₀₁, a₀₂, a₀₃, a₁₀, a₁₁, a₁₂, a₁₃, and the coefficient data in the second register is arranged as b₀, b₁, b₂, b₃; the input data is placed into one first register.

5. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 4, wherein: in step (2), based on the arrangement of step (1), the two groups of arranged input data a₀₀, a₀₁, a₀₂, a₀₃ and a₁₀, a₁₁, a₁₂, a₁₃ are each dot-multiplied with the coefficient data b₀, b₁, b₂, b₃ in a single dot product calculation using 8×16 multipliers, that is

6. The method of claim 2, wherein in step (1), the total data lengths of the first register and the second register are made consistent as follows: in the case where the input image has 4 channels and the coefficient data is 16 bits, let the input value of each channel be a_ck, where c is the channel index and k is the index of the pixel point within each group, and let the coefficient data be b₀, b₁, b₂, b₃; using interleaving instructions, the input data in one first register is arranged as a₀₀, a₀₁, a₀₂, a₀₃, a₁₀, a₁₁, a₁₂, a₁₃, the input data in the other first register is arranged as a₂₀, a₂₁, a₂₂, a₂₃, a₃₀, a₃₁, a₃₂, a₃₃, and the coefficient data in the second register is arranged as b₀, b₁, b₂, b₃; the input data is placed into two first registers.

7. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 6, wherein: in step (2), in the case where the input image has 4 channels and the coefficient data is 16 bits, based on the arrangement of step (1), by 8x16 multipliers, two dot product calculations are required,

8. The method of claim 2, wherein in step (1), the total data lengths of the first register and the second register are made consistent as follows: in the case where the input image has 4 channels and the coefficient data is 32 bits, let the input value of each channel be a_ck, where c is the channel index and k is the index of the pixel point within each group, and let the coefficient data be b₀, b₁, b₂, b₃; using interleaving instructions, the input data in the first register is arranged as a₀₀, a₀₁, a₀₂, a₀₃, a₁₀, a₁₁, a₁₂, a₁₃, a₂₀, a₂₁, a₂₂, a₂₃, a₃₀, a₃₁, a₃₂, a₃₃, and the coefficient data in the second register is arranged as b₀, b₁, b₂, b₃; the input data is placed into one first register.

9. A method of accelerating the computation of different bit-width dot products in digital image processing according to claim 8, wherein: in step (2), in the case where the input image has 4 channels and the coefficient data is 32 bits, based on the arrangement of step (1), one dot product calculation is required by 8×32 multipliers, that is

10. A bilinear interpolation algorithm using the method of accelerating the calculation of different bit-width dot products in digital image processing according to any of claims 1-9, characterized in that it comprises the following steps: