CN102043605B - Multimedia transformation multiplier and processing method thereof - Google Patents

Multimedia transformation multiplier and processing method thereof

Info

Publication number
CN102043605B
Authority
CN
China
Prior art keywords
matrix
data
multimedia
module
mtd
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010603133A
Other languages
Chinese (zh)
Other versions
CN102043605A (en
Inventor
胡伟武
刘宏伟
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Loongson Technology Corp Ltd
Original Assignee
Loongson Technology Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Loongson Technology Corp Ltd
Priority to CN201010603133A
Publication of CN102043605A
Application granted
Publication of CN102043605B

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to a multimedia transformation multiplier and a processing method thereof. The multimedia transformation multiplier comprises a matrix multiplication module and an operation control module, wherein the matrix multiplication module is used for carrying out a matrix multiplication operation on data of a first matrix and data of a second matrix to obtain data of an intermediate result matrix; and the operation control module is used for reading operation control parameter values and carrying out an operation on the data of the intermediate result matrix in accordance with the operation control parameter values so as to obtain data of a result matrix. The multimedia transformation multiplier can accelerate the multimedia processing procedure, has fine universality, realizes commutative operation procedures for different requirements and can complete the multimedia commutative operation at low hardware cost.

Description

Multimedia transformation multiplier and processing method thereof
Technical Field
The present invention relates to the field of computer technology, and more particularly to a multimedia transform multiplier and a processing method thereof.
Background
With the development of processor technology, the range of processor applications keeps expanding. In particular, as demand grows for workloads such as multimedia and scientific computing, general-purpose processors are increasingly adding single-instruction, multiple-data (SIMD) instruction sets.
In the multimedia field, SIMD instructions can greatly increase the speed of multimedia processing. Transform operations are very common in multimedia processing. This is because most images share a common feature: flat and slowly changing areas make up the majority, while details and abrupt content changes make up the minority. Stated another way, the DC and low-frequency components make up the majority of the image, while the high-frequency components make up only a small portion. The spatial-domain image can therefore be transformed to the frequency domain or to another transform domain, yielding transform coefficients with low correlation. Various kinds of processing can then be carried out conveniently on these coefficients, such as direct image processing, or compression coding (so-called transform coding) to achieve a compression effect.
Generally, there is a class of transforms called orthogonal transforms that can be used for image coding, such as the fast Fourier transform, the K-L transform and the discrete cosine transform. These transforms share a relatively general format.
The general form of a one-dimensional transform is:
$$F = A \times f$$
where $A$ is the transformation matrix, $f$ is the matrix of original image values, and $F$ is the matrix of transformed coefficients. Correspondingly, the inverse transform is $f = A^{T} \times F$.
A two-dimensional transform can generally be understood as a one-dimensional transform applied to each row followed by a one-dimensional transform applied to each column. In matrix form:
$$F = A \times f \times A^{T}$$
The corresponding inverse transform is $f = A^{T} \times F \times A$.
Integer transforms are mostly used in the multimedia field, because floating-point operations introduce precision errors that cause a mismatch with the inverse transform.
In some specific media formats, addition and shift operations are also performed during the integer transform, which amounts to rounding. For example, the calculation of $F = A \times f \times A^{T}$ becomes:
$$E = (f \times A^{T} + 2^{shift1-1}) \gg shift1$$
$$F = (A \times E + 2^{shift2-1}) \gg shift2$$
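The two rounding-shift steps above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the helper names (`matmul`, `transpose`, `round_shift`, `forward_2d`) are hypothetical, and the 2 × 2 matrix and the shift amounts are made-up values chosen to keep the arithmetic easy to follow.

```python
def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def round_shift(m, shift):
    """Add 2**(shift-1) to every element, then shift right arithmetically."""
    offset = 1 << (shift - 1)
    return [[(x + offset) >> shift for x in row] for row in m]


def forward_2d(f, a, shift1, shift2):
    """Two-stage integer transform with a rounding shift after each stage."""
    e = round_shift(matmul(f, transpose(a)), shift1)  # E = (f x A^T + 2^(shift1-1)) >> shift1
    return round_shift(matmul(a, e), shift2)          # F = (A x E + 2^(shift2-1)) >> shift2


# Hypothetical 2 x 2 example (illustrative values only).
A = [[1, 1], [1, -1]]
f = [[10, 6], [4, 2]]
result = forward_2d(f, A, 1, 1)
```

Python's `>>` on negative integers is an arithmetic (floor) shift, which matches the usual behaviour assumed for signed fixed-point transforms.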
The following takes the transforms in the codecs of several mainstream media formats as examples to explain the general form presented above. It is to be understood that these merely illustrate usage scenarios and are not limiting.
The integer discrete cosine transform, the Hadamard transform and their inverses in multimedia transform operations can all use this general form. Media formats compress spatially unimportant information by combining a transform with quantization, so the transform operation is an important step in media encoding and decoding, and one that involves a large amount of computation and occupies a large amount of processor time. Popular media formats such as H.264, VC-1, AVS, RMVB and MPEG-4 all contain such steps.
Taking multimedia decoding as an example, the basic form of the inverse discrete cosine transform is as follows:
$f = T' \times X \times T$, where $X$, $T$ and $T'$ are all matrices, typically 8 × 8 or 4 × 4. In particular media formats such as AVS, H.264 and VC-1, the specific inverse discrete cosine transforms are as follows:
(I) AVS format
The process of converting an 8 × 8 transform coefficient matrix CoeffMatrix into an 8 × 8 residual sample matrix ResidueMatrix in AVS includes the following steps:
First, the transform coefficient matrix is subjected to the following horizontal inverse transform with a rounding shift:
$$E_{8\times 8} = (CoeffMatrix \times T_8^{T} + 4) \gg 3$$
where $T_8$ is the 8 × 8 inverse transform matrix, $T_8^{T}$ is the transpose of $T_8$, and $CoeffMatrix \times T_8^{T}$ is the intermediate result after the horizontal inverse transform. For a bitstream conforming to this part, the elements of $CoeffMatrix \times T_8^{T}$ should lie in the range $-2^{15}$ to $2^{15}-5$.
$$T_8 = \begin{bmatrix}
8 & 10 & 10 & 9 & 8 & 6 & 4 & 2 \\
8 & 9 & 4 & -2 & -8 & -10 & -10 & -6 \\
8 & 6 & -4 & -10 & -8 & 2 & 10 & 9 \\
8 & 2 & -10 & -6 & 8 & 9 & -4 & -10 \\
8 & -2 & -10 & 6 & 8 & -9 & -4 & 10 \\
8 & -6 & -4 & 10 & -8 & -2 & 10 & -9 \\
8 & -9 & 4 & 2 & -8 & 10 & -10 & 6 \\
8 & -10 & 10 & -9 & 8 & -6 & 4 & -2
\end{bmatrix}$$
Next, the following vertical inverse transform with a rounding shift is applied to the matrix $E_{8\times 8}$:
$$R_{8\times 8} = (T_8 \times E_{8\times 8} + 2^{6}) \gg 7$$
where $T_8 \times E_{8\times 8}$ is the 8 × 8 matrix after the vertical inverse transform. For a bitstream conforming to this part, the elements of $T_8 \times E_{8\times 8}$ should lie in the range $-2^{15}$ to $2^{15}-65$. The resulting $R_{8\times 8}$ is the residual sample matrix ResidueMatrix.
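The two AVS steps above can be traced with a short Python sketch. It is a plain software illustration under stated assumptions, not the hardware described in this patent: the names `T8_AVS`, `matmul`, `transpose` and `avs_inverse_8x8` are hypothetical, and the DC-only input block is a made-up test value.

```python
# AVS 8 x 8 inverse transform matrix as listed in the text above.
T8_AVS = [
    [8, 10, 10, 9, 8, 6, 4, 2],
    [8, 9, 4, -2, -8, -10, -10, -6],
    [8, 6, -4, -10, -8, 2, 10, 9],
    [8, 2, -10, -6, 8, 9, -4, -10],
    [8, -2, -10, 6, 8, -9, -4, 10],
    [8, -6, -4, 10, -8, -2, 10, -9],
    [8, -9, 4, 2, -8, 10, -10, 6],
    [8, -10, 10, -9, 8, -6, 4, -2],
]


def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def avs_inverse_8x8(coeff_matrix):
    """Horizontal pass with (+4) >> 3, then vertical pass with (+64) >> 7."""
    horizontal = matmul(coeff_matrix, transpose(T8_AVS))
    e = [[(x + 4) >> 3 for x in row] for row in horizontal]
    vertical = matmul(T8_AVS, e)
    return [[(x + 64) >> 7 for x in row] for row in vertical]


# Hypothetical input: only the DC coefficient is non-zero.
dc_only = [[0] * 8 for _ in range(8)]
dc_only[0][0] = 64
residual = avs_inverse_8x8(dc_only)
```

With a DC-only block every output sample is identical, which gives a quick sanity check of both rounding shifts.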
(II) VC-1 Format
The units of the inverse transform of VC-1 are 8 × 8, 8 × 4, 4 × 8 and 4 × 4. The inverse-quantized coefficients are 12-bit signed numbers and the inverse-transformed coefficients are 10-bit signed numbers. The inverse transform steps of VC-1 are as follows:
$$E_{M\times N} = (D_{M\times N} \cdot T_M + 4) \gg 3$$
$$R_{M\times N} = (T'_N \cdot E_{M\times N} + C_N \cdot 1_M + 64) \gg 7$$
where $D_{M\times N}$ is the input matrix, i.e. the inverse-quantized coefficient matrix; $E_{M\times N}$ is the intermediate result matrix, whose elements are 13-bit signed numbers; and $R_{M\times N}$ is the output matrix, i.e. the coefficient matrix after the inverse transform. $M$ and $N$ may each be 4 or 8, and $T_8$, $T_4$, $C_8$ and $C_4$ are as follows:
$$T_8 = \begin{bmatrix}
12 & 12 & 12 & 12 & 12 & 12 & 12 & 12 \\
16 & 15 & 9 & 4 & -4 & -9 & -15 & -16 \\
16 & 6 & -6 & -16 & -16 & -6 & 6 & 16 \\
15 & -4 & -16 & -9 & 9 & 16 & 4 & -15 \\
12 & -12 & -12 & 12 & 12 & -12 & -12 & 12 \\
9 & -16 & 4 & 15 & -15 & -4 & 16 & -9 \\
6 & -16 & 16 & -6 & -6 & 16 & -16 & 6 \\
4 & -9 & 15 & -16 & 16 & -15 & 9 & -4
\end{bmatrix}$$
$$T_4 = \begin{bmatrix}
17 & 17 & 17 & 17 \\
22 & 10 & -10 & -22 \\
17 & -17 & -17 & 17 \\
10 & -22 & 22 & -10
\end{bmatrix}$$
$$C_8 = (0\ 0\ 0\ 0\ 1\ 1\ 1\ 1)^{T}, \quad C_4 = (0\ 0\ 0\ 0)^{T}$$
In practice, the multiplication of a relatively large matrix, such as 8 × 8, can be realized by block matrix multiplication using 4 × 4 matrix multiplication instructions. The implementation is not described in detail here.
The second step of the transform, $T' \times (X \times T)$, uses $T'$, the transpose of the transformation matrix of the corresponding dimension. For example, for a 4 × 8 block:
$$T' = T_4^{T}, \quad T = T_8$$
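The VC-1 steps above, including the $C_N \cdot 1_M$ correction term, can be sketched in Python as follows. This is a hedged illustration only. It assumes that the input block is stored as N rows by M columns, that $T'_N$ is the transpose of $T_N$ as stated above, and that $C_N \cdot 1_M$ is the outer product that adds $C_N[i]$ to every element of row $i$. The names `T8_VC1`, `T4_VC1`, `TRANSFORMS`, `C` and `vc1_inverse` are hypothetical.

```python
T8_VC1 = [
    [12, 12, 12, 12, 12, 12, 12, 12],
    [16, 15, 9, 4, -4, -9, -15, -16],
    [16, 6, -6, -16, -16, -6, 6, 16],
    [15, -4, -16, -9, 9, 16, 4, -15],
    [12, -12, -12, 12, 12, -12, -12, 12],
    [9, -16, 4, 15, -15, -4, 16, -9],
    [6, -16, 16, -6, -6, 16, -16, 6],
    [4, -9, 15, -16, 16, -15, 9, -4],
]
T4_VC1 = [
    [17, 17, 17, 17],
    [22, 10, -10, -22],
    [17, -17, -17, 17],
    [10, -22, 22, -10],
]
TRANSFORMS = {4: T4_VC1, 8: T8_VC1}
C = {4: [0, 0, 0, 0], 8: [0, 0, 0, 0, 1, 1, 1, 1]}


def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def vc1_inverse(d):
    """Two-pass inverse transform; d has N rows and M columns (assumed layout)."""
    n, m = len(d), len(d[0])
    # First pass: E = (D . T_M + 4) >> 3
    e = [[(x + 4) >> 3 for x in row] for row in matmul(d, TRANSFORMS[m])]
    # Second pass: R = (T_N' . E + C_N . 1_M + 64) >> 7
    second = matmul(transpose(TRANSFORMS[n]), e)
    return [[(second[i][j] + C[n][i] + 64) >> 7 for j in range(m)]
            for i in range(n)]


# Hypothetical DC-only 8 x 8 block.
block = [[0] * 8 for _ in range(8)]
block[0][0] = 16
output = vc1_inverse(block)
```

Keeping the two transform matrices in a dictionary keyed by size lets the same function cover the 8 × 8, 8 × 4, 4 × 8 and 4 × 4 cases.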
(III) H.264 format
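The transformation matrices $T$ in H.264 are as follows.
4 × 4 inverse transform matrix:
$$T_4 = \begin{bmatrix}
2 & 2 & 2 & 2 \\
2 & 1 & -1 & -2 \\
2 & -2 & -2 & 2 \\
1 & -2 & 2 & -1
\end{bmatrix}$$
8 × 8 inverse transform matrix:
$$T_8 = \begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix}$$
For the 8 × 8 inverse transform, the process is as follows:
$$E_{8\times 8} = (CoeffMatrix \times T_8 + 4) \gg 3$$
$$H_{8\times 8} = (T_8^{T} \times E_{8\times 8} + 4) \gg 3$$
$$R_{8\times 8} = (H_{8\times 8} + 2^{5}) \gg 6$$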
For the encoding process, i.e. the forward transform, the form is $X = T \times F \times T^{T}$. For the H.264 format, the transformation matrices are:
$$T_{4\times 4} = \begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix},$$
$$T_{8\times 8} = \begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix}$$
The mode of operation, however, is identical to the general form.
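Other transforms follow the same pattern. For example, the Hadamard transform applied to the luminance block $W_D$ in H.264 has the form:
$$Y_D = \left( \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & 1 & -1
\end{bmatrix} W_D \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & 1 & -1
\end{bmatrix} \right) / 2$$
This also conforms to the general form described above.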
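The Hadamard form above is the same two-sided matrix product followed by a scaling step, as the following Python sketch illustrates. The function names are hypothetical, and the division by 2 is implemented as floor division, which is an assumption since the text above does not say how odd values are rounded.

```python
# 4 x 4 Hadamard matrix as listed in the text above.
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hadamard_4x4(w_d):
    """Compute (H4 x W_D x H4) / 2, using floor division (assumed rounding)."""
    product = matmul(matmul(H4, w_d), H4)
    return [[x // 2 for x in row] for row in product]


# Hypothetical input: a flat block where every DC value equals 1.
flat = [[1] * 4 for _ in range(4)]
y_d = hadamard_4x4(flat)
```

A flat input concentrates all energy in the top-left element, which is the expected behaviour of a Hadamard transform.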
In summary, matrix multiplication, addition and shifting are the main parts of the transform operations in multimedia processing. In the prior art, a large number of multiply and add instructions must be executed in repeated loops to complete a single encoding or decoding transform, so the processor time consumed is long, the implementation is complex, and the versatility is low.
Disclosure of Invention
The present invention is directed to overcome the drawbacks of the prior art, and provides a multimedia transform multiplier and a processing method thereof, which accelerate the multimedia processing process, have good versatility, and can complete the transform operation of multimedia data with less hardware cost.
The multimedia transform multiplier provided for realizing the purpose of the invention comprises a matrix multiplication module and an operation control module;
the matrix multiplication module is used for carrying out matrix multiplication operation on the data of the first matrix and the data of the second matrix to obtain the data of an intermediate result matrix;
and the operation control module is used for reading the operation control parameter values and controlling the data of the intermediate result matrix to perform operation according to the operation control parameter values to obtain the data of the result matrix.
Preferably, the multimedia transform multiplier further comprises a parameter loading module for loading the data of the first matrix, the data of the second matrix and the operation control parameter value.
Preferably, the data of the first matrix is data of a coefficient matrix of a multimedia transformation operation; and the data of the second matrix is the data of a transformation matrix for multimedia transformation operation.
Preferably, the data of the first matrix is data of a transpose matrix of a coefficient matrix of the multiplier which performs the multimedia transform operation last time; and the data of the second matrix is the data of a result matrix obtained after the last operation of the operation control module of the multiplier.
Preferably, the multimedia transform multiplier further includes a transposer for performing a transposing operation on the coefficient matrix of the multimedia transform operation to obtain data of the first matrix.
Preferably, the operation control parameter values include operation mode parameter values and operation parameter values;
the operation control module comprises a judgment module and an operation module;
the judging module is used for reading the operation mode parameter value loaded in the loading parameter module and determining the operation mode;
and the operation module is used for reading the operation parameter values loaded in the loading parameter module and controlling the intermediate result matrix to carry out corresponding operation in the operation mode determined by the judgment module according to the operation parameter values.
Preferably, the judging module comprises a bit-precision bit and an operation mode bit; the judging module determines the operation mode from the read operation mode parameter value through the bit-precision bit and the operation mode bit, each of which is expressed as a binary number.
Preferably, the bit-precision bit indicates whether the significant bits of the data of the intermediate result matrix exceed 16 bits or fit within 16 bits;
when the significant bits of the data of the intermediate result matrix fit within 16 bits, the operation mode bit indicates whether an addition operation is performed; when they exceed 16 bits, the operation mode bit indicates whether the lower half or the upper half of the intermediate result matrix is taken out.
Preferably, the operation module includes an operation control bit representing the number of shift bits.
Preferably, the multimedia transform multiplier further comprises a first storage module, a second storage module, a third storage module and a fourth storage module; wherein:
the first storage module is used for storing the data of the first matrix;
the second storage module is used for storing the data of the second matrix;
the third storage module is used for storing the data of the intermediate result matrix obtained after the matrix multiplication module carries out matrix multiplication operation;
and the fourth storage module is used for storing the data of the result matrix obtained after the operation of the operation control module.
To achieve the object of the present invention, there is also provided a processing method of a multimedia transform multiplier, comprising the steps of:
step A, performing a matrix multiplication operation on the data of a first matrix and the data of a second matrix to obtain the data of an intermediate result matrix;
and step B, controlling the data of the intermediate result matrix to carry out an operation according to the loaded operation control parameter values to obtain the data of the result matrix.
Preferably, step A is preceded by the following step:
step A', loading the data of the first matrix, the data of the second matrix and the operation control parameter values.
Preferably, the data of the first matrix is data of a coefficient matrix of a multimedia transformation operation, and the data of the second matrix is data of a transformation matrix of the multimedia transformation operation.
Preferably, the data of the first matrix is the data of the transpose of the coefficient matrix with which the multiplier last performed a multimedia transform operation; and the data of the second matrix is the data of the result matrix obtained from the last operation of the operation control module of the multiplier.
Preferably, step A further comprises the following step:
step A', transposing the coefficient matrix subjected to the multimedia transformation operation last time to obtain data of the first matrix.
Preferably, the operation control parameter values include operation mode parameter values and operation parameter values, and step B includes the following steps:
step B1, reading the parameter value of the operation mode, and determining the operation mode;
and step B2, reading the operation parameter values, and controlling the intermediate result matrix to perform corresponding operation according to the operation parameter values in the operation mode determined in the step B1 to obtain the data of the final result matrix.
The beneficial effects of the invention are as follows: the multimedia transform multiplier and its processing method implement matrix multiplication quickly, greatly speed up transform operations in multimedia processing, and thereby accelerate the multimedia processing procedure. They also have good versatility, realizing the transform operations required under different requirements, and complete multimedia transform operations at a small hardware cost. In addition, they save the time spent on matrix transposition during transform processing and reduce the number of registers occupied.
Drawings
FIG. 1 is a diagram illustrating a multimedia transform multiplier according to an embodiment of the present invention;
FIG. 2 is a flow chart of a processing method of a multimedia transform multiplier according to a third embodiment;
FIG. 3 is a flowchart of a processing method of the multimedia transform multiplier according to the fourth embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the multimedia transform multiplier and the processing method thereof according to the present invention are further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting.
In the transformation operation process of multimedia processing, multiplication, addition and shift are main parts, a large number of multiplication and addition loop iterations are needed to realize one-time transformation operation, and the transformation operation process can be realized by the cascade cooperation of a single or a small number of multimedia transformation multipliers.
From the existing transform processes of different multimedia formats, a common matrix multiplication process can be abstracted as follows:
$$E_{m\times m} = (CoeffMatrix \times T_m + 2^{shift-1}) \gg shift \qquad \text{(I)}$$
or
$$E_{m\times m} = CoeffMatrix \times T_m \qquad \text{(II)}$$
where CoeffMatrix is the input matrix required by the multimedia processing (taking image transforms as an example, it holds the original image values for a forward transform and the coded coefficients for an inverse transform); shift is the number of bits shifted; $E_{m\times m}$ is the output matrix, i.e. the matrix multiplication result; and $T_m$ is the specific coefficient matrix required for the m × m transform (m = 4 or 8), which takes different fixed values in the transforms of different multimedia formats, where:
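A) in the discrete cosine transform of the AVS format, m is 8 and $T_8$ is as follows:
$$T_8 = \begin{bmatrix}
8 & 10 & 10 & 9 & 8 & 6 & 4 & 2 \\
8 & 9 & 4 & -2 & -8 & -10 & -10 & -6 \\
8 & 6 & -4 & -10 & -8 & 2 & 10 & 9 \\
8 & 2 & -10 & -6 & 8 & 9 & -4 & -10 \\
8 & -2 & -10 & 6 & 8 & -9 & -4 & 10 \\
8 & -6 & -4 & 10 & -8 & -2 & 10 & -9 \\
8 & -9 & 4 & 2 & -8 & 10 & -10 & 6 \\
8 & -10 & 10 & -9 & 8 & -6 & 4 & -2
\end{bmatrix}$$
B) in the discrete cosine transform of the VC-1 format, m is 8 or 4, and $T_8$ and $T_4$ are respectively as follows:
$$T_8 = \begin{bmatrix}
12 & 12 & 12 & 12 & 12 & 12 & 12 & 12 \\
16 & 15 & 9 & 4 & -4 & -9 & -15 & -16 \\
16 & 6 & -6 & -16 & -16 & -6 & 6 & 16 \\
15 & -4 & -16 & -9 & 9 & 16 & 4 & -15 \\
12 & -12 & -12 & 12 & 12 & -12 & -12 & 12 \\
9 & -16 & 4 & 15 & -15 & -4 & 16 & -9 \\
6 & -16 & 16 & -6 & -6 & 16 & -16 & 6 \\
4 & -9 & 15 & -16 & 16 & -15 & 9 & -4
\end{bmatrix}$$
$$T_4 = \begin{bmatrix}
17 & 17 & 17 & 17 \\
22 & 10 & -10 & -22 \\
17 & -17 & -17 & 17 \\
10 & -22 & 22 & -10
\end{bmatrix}$$
C) in the discrete cosine transform of the H.264 format, m is 8 or 4, and $T_8$ and $T_4$ are respectively as follows:
$$T_8 = \begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix}$$
$$T_4 = \begin{bmatrix}
2 & 2 & 2 & 2 \\
2 & 1 & -1 & -2 \\
2 & -2 & -2 & 2 \\
1 & -2 & 2 & -1
\end{bmatrix}$$
D) in the Hadamard transform of the H.264 format, m is 4 and $T_4$ is as follows:
$$T_4 = \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 \\
1 & -1 & 1 & -1
\end{bmatrix}$$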
As an implementation, the multimedia transform multiplier disclosed by the invention can complete the matrix multiplication operation of formula (I) or formula (II).
Since 8 × 8 matrix multiplication can be realized by using 4 × 4 matrix to perform block matrix multiplication, in the embodiment of the present invention, 4 × 4 matrix multiplication is described first; the process of implementing 8 × 8 matrix multiplication by block matrix multiplication using a 4 × 4 matrix is then described.
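The block decomposition mentioned above can be illustrated with a short Python sketch that builds an 8 × 8 product out of 4 × 4 products only. It is a software illustration of standard block matrix multiplication, not the embodiment described later in this patent. The function names are hypothetical, and `matmul4` merely stands in for a single 4 × 4 multiply operation.

```python
def matmul4(a, b):
    """Product of two 4 x 4 integer matrices (stand-in for one 4 x 4 multiply)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def block(m, r, c):
    """Extract the 4 x 4 sub-block at block row r, block column c of an 8 x 8 matrix."""
    return [row[4 * c:4 * c + 4] for row in m[4 * r:4 * r + 4]]


def add4(a, b):
    """Element-wise sum of two 4 x 4 matrices."""
    return [[a[i][j] + b[i][j] for j in range(4)] for i in range(4)]


def matmul8_blocked(a, b):
    """8 x 8 product computed as sums of 4 x 4 sub-block products."""
    result = [[0] * 8 for _ in range(8)]
    for bi in range(2):
        for bj in range(2):
            # C[bi][bj] = A[bi][0] x B[0][bj] + A[bi][1] x B[1][bj]
            acc = add4(matmul4(block(a, bi, 0), block(b, 0, bj)),
                       matmul4(block(a, bi, 1), block(b, 1, bj)))
            for i in range(4):
                for j in range(4):
                    result[4 * bi + i][4 * bj + j] = acc[i][j]
    return result


# Hypothetical deterministic test matrices.
a8 = [[i * 8 + j - 30 for j in range(8)] for i in range(8)]
b8 = [[(i - j) * 3 + 1 for j in range(8)] for i in range(8)]
blocked = matmul8_blocked(a8, b8)
```

Each 4 × 4 output block needs two sub-block products and one addition, so a full 8 × 8 product uses eight 4 × 4 multiplications.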
As shown in fig. 1, the multimedia transform multiplier disclosed in the present invention includes a loading parameter module 1, a first storage module 2, a second storage module 3, a matrix multiplication module 4, a third storage module 5, a fourth storage module 7 and an extraction control module 6.
Wherein:
the loading parameter module 1 is used for loading and respectively storing the data of the transformation matrix and the coefficient matrix for multimedia transformation operation into the second storage module 3 and the first storage module 2; loading the extraction control parameter values and storing the extraction control parameter values in the extraction control module 6;
the second storage module 3 and the first storage module 2 are used for storing data of the input transformation matrix CoeffMatrix and data of the coefficient matrix T;
As an implementation, each element of the input transformation matrix CoeffMatrix is a 16-bit signed integer, and the elements are stored in the second storage module 3 in row-major order, from the low-order bits to the high-order bits.
As an implementation, the second storage module 3 comprises at least 16 16-bit registers for storing the 16 signed numbers of the transformation matrix.
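One possible reading of this storage order is sketched below in Python: the matrix is flattened row by row, and element 0 occupies the lowest 16 bits of a wide word. This layout is an assumption made for illustration, since the text does not give the exact register arrangement, and the names `pack16` and `unpack16` are hypothetical.

```python
def pack16(matrix):
    """Pack signed 16-bit elements row by row, first element in the lowest bits."""
    word = 0
    flat = [value for row in matrix for value in row]
    for index, value in enumerate(flat):
        if not -0x8000 <= value <= 0x7FFF:
            raise ValueError("element does not fit in 16 signed bits")
        # Two's-complement encode the element and place it in its 16-bit slot.
        word |= (value & 0xFFFF) << (16 * index)
    return word


def unpack16(word, rows=4, cols=4):
    """Inverse of pack16: recover a rows x cols matrix of signed 16-bit values."""
    flat = []
    for index in range(rows * cols):
        raw = (word >> (16 * index)) & 0xFFFF
        flat.append(raw - 0x10000 if raw & 0x8000 else raw)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical 4 x 4 matrix of 16-bit signed values.
sample = [[1, -1, 2, -2], [3, -3, 4, -4], [5, -5, 6, -6], [7, -7, 8, -32768]]
packed = pack16(sample)
restored = unpack16(packed)
```

A 4 × 4 matrix of 16-bit elements occupies 256 bits in this layout, which matches the count of sixteen 16-bit registers.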
The first memory module 2 is used to store each element of the coefficient matrix T.
As an implementation, the first storage module 2 stores each element $T_{ij}$ of the coefficient matrix T in the same way as the second storage module 3, so this is not described in detail in the embodiments of the present invention.
The matrix multiplication module 4 is configured to fetch the elements of the transformation matrix stored in the second storage module 3 and each element of the coefficient matrix T stored in the first storage module 2, according to the storage start address parameters of the first storage module 2 and the second storage module 3 loaded by the parameter loading module, perform the matrix multiplication operation, and store the resulting product in the third storage module 5, thereby completing CoeffMatrix × T.
It should be noted that matrix multiplication on a computer is prior art and is therefore not described in detail in the embodiments of the present invention.
And a third storage module 5 for storing each element of the intermediate result matrix multiplied by the matrix multiplication module.
As an implementable way, each element of the intermediate result matrix stored by the third storage module 5 is a 32-bit signed integer. The third memory module 5 comprises at least 16 32-bit registers for 16 32-bit signed integers of the intermediate result matrix.
The extraction control module 6 is used for reading the extraction control parameter values loaded in the loading parameter module, performing corresponding addition and shift operations by using the data of the intermediate result matrix in the third storage module 5, and storing the data of the result matrix in the fourth storage module 7;
and the fourth storage module 7 is used for storing each element of the final multiplication result matrix obtained by extraction by the extraction control module 6.
As an implementation manner, the fourth storage module 7 stores each element of the final multiplication result matrix extracted by the extraction control module 6 according to the storage head address parameter loaded by the parameter loading module, and each element of the final multiplication result matrix is stored the same as that of the first storage module 2.
The third storage module 5 temporarily stores intermediate result matrix data obtained by multiplying the input transformation matrix data and the coefficient matrix data, and the intermediate result matrix data is extracted by the extraction control module 6 to obtain a required result and then stored in the fourth storage module 7.
Preferably, the multimedia transform multiplier according to the embodiment of the present invention further includes a transposer 8, configured to perform a transposing operation on a coefficient matrix of the multimedia transform operation to obtain data of the first matrix.
Preferably, as an implementation manner, the extraction control module 6 of the present invention includes a judging module 61 and a shifting module 62;
wherein:
the judgment module 61 is provided with an extraction mode bit and an operation control bit, and is used for reading an extraction mode value and an operation control value loaded in the loading parameter module and determining an extraction mode;
the shift module 62 is provided with a shift control bit, and is used for reading the shift parameter value loaded in the loading parameter module and performing shift operation on the extracted intermediate result matrix according to the shift parameter value;
the extraction control parameter values include extraction mode values, operation control values and shift parameter values.
The extraction mode is one of mode one, two, three or four. It is determined first by setting the extraction mode value to 1 or 0: for example, when the extraction mode value is 1, mode one or mode two is used, and when it is 0, mode three or mode four is used;
next, the operation control value set in the extraction control module 6 to 1 or 0 determines whether a rounding operation is required in modes one and two, or whether the upper half or the lower half of the intermediate result matrix is extracted in modes three and four.
Denoting the extraction mode value by m and the operation control value by n (flag values unrelated to the matrix dimension m used earlier): if m is 1 and n is 1, a rounding operation is performed, which is the first extraction mode; if m is 1 and n is 0, no rounding operation is performed, which is the second extraction mode; if m is 0 and n is 1, the lower half of the intermediate result matrix is taken, which is the third extraction mode; if m is 0 and n is 0, the upper half of the intermediate result matrix is taken, which is the fourth extraction mode.
Preferably, as an implementation, there are four extraction modes:
1) The extraction mode is 11, for transform processes that require addition and shifting after the matrix multiplication. This mode first performs an addition (adding $2^{shift-1}$) and then extracts the desired 16-bit result by shifting (right-shifting by shift bits). It is generally applicable to transform operations whose results have no more than 16 significant bits and require rounding. See embodiment 1 for an application;
2) The extraction mode is 10, for transform processes that do not require rounding. This mode performs no addition and directly shifts (right-shifting by shift bits) to extract the desired 16-bit result. It applies to transform processes whose results have no more than 16 significant bits and that either need no rounding or use a simplified approximation to reduce the number of operations;
3) The extraction mode is 01: the lower half of the intermediate result matrix, i.e. the lower two rows of the intermediate result matrix, is extracted directly, for a total of 16 32-bit elements. Together with mode 4), it preserves the entire 32-bit-precision matrix multiplication result, and can be used for transform processes whose result precision exceeds 16 bits or for the sub-block multiplications of a block matrix multiplication. When the result precision exceeds 16 bits, the lower half and the upper half of the result matrix must be obtained separately, and the corresponding 32-bit addition and shift operations are then performed on them to obtain the result. When the block is larger than 4 × 4, the transform must be realized by block matrix multiplication, and because an addition follows the sub-matrix multiplications, the sub-matrix products must be kept at 32-bit precision. This is because, with signed operands, the sum C00 × T00 + C01 × T10 may fit in 16 bits while each of the two addends C00 × T00 and C01 × T10 may need up to 32 bits. This mode is widely used in block matrix multiplication, as can be seen in embodiment 4;
4) The extraction mode is 00: the upper half of the intermediate result matrix is extracted. Together with mode 3), it preserves the entire 32-bit-precision matrix multiplication result, and can be used for transform processes that require high intermediate precision or for the sub-block multiplications of a block matrix multiplication.
Preferably, the shift parameter is held in shift control bits, and enough bits must be allocated to them. For example, a shift of 7 bits needs at least three binary digits to represent, so no fewer than three bits must be designated as shift control bits.
The multimedia transformation multiplier provided by the invention realizes the transformation of multimedia data with different formats, has good universality, can greatly accelerate the media decoding speed and save computer resources.
The processing method of the multimedia transform multiplier provided by the invention is further explained by combining a matrix multiplication formula (I).
Let vs be the matrix CoeffMatrix stored in the second memory module 3, vt be the coefficient matrix T stored in the first memory module 2, temp be the intermediate result matrix temp stored in the third memory module 5, the elements of the vs and vt matrices are 16-bit signed integers, and each element of temp is 32-bit signed integer, then
$$t_{ij} = \sum_{k=0}^{3} a_{ik} \times b_{kj} \quad (i = 0, \ldots, 3;\ j = 0, \ldots, 3)$$
The extraction process is represented as follows:
Figure BDA0000040146220000133
vd is the result matrix stored in the fourth memory block 7.
In an implementation, one byte serves as the mode control value: the extraction mode bit m is the 7th bit (bit index 6), the operation control bit n is the 6th bit (bit index 5), and the 1st to 5th bits (bit indices 0 to 4) are the shift control bits, which can indicate a shift of at most 31 bits (binary 11111).
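The byte layout just described can be illustrated with a minimal Python sketch. The function name `decode_control_byte` and the dictionary keys are hypothetical and chosen only for illustration; the bit positions are the ones stated above.

```python
def decode_control_byte(imm8):
    """Split the 8-bit mode control value into its three fields.

    Hypothetical helper: bit 6 is the extraction mode bit m, bit 5 is the
    operation control bit n, and bits 4..0 are the shift control bits.
    """
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in one byte")
    return {
        "m": (imm8 >> 6) & 1,   # extraction mode bit (7th bit, index 6)
        "n": (imm8 >> 5) & 1,   # operation control bit (6th bit, index 5)
        "shift": imm8 & 0x1F,   # shift control bits, range 0..31
    }
```

With this layout the largest shift that can be encoded is 31, matching the five shift control bits.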
Correspondingly, the invention also provides a processing method of the multimedia transform multiplier, which comprises the following steps:
Step S100: load the data of the first matrix, the data of the second matrix, and the operation control parameter values.
As an implementable manner, the processing method of the multimedia transform multiplier according to the embodiment of the present invention loads the storage start address parameters of the first storage module, the second storage module, and the fourth storage module, as well as the extraction mode parameter and the shift bit number parameter;
Step S200: perform a matrix multiplication operation on the data of the first matrix and the data of the second matrix to obtain the data of an intermediate result matrix.
As an implementable manner, the processing method of the multimedia transform multiplier according to the embodiment of the present invention fetches, according to the storage start address parameters of the first storage module and the second storage module loaded by the parameter loading module, the elements of the transform matrix stored in the second storage module and the elements of the coefficient matrix T stored in the first storage module, performs the matrix multiplication, and stores the multiplication result in the third storage module.
Step S300: operate on the data of the intermediate result matrix according to the loaded operation control parameter values to obtain the data of the result matrix.
As an implementable manner, the processing method of the multimedia transform multiplier according to the embodiment of the present invention performs extraction and shifting on the intermediate result matrix stored in the third storage module according to the extraction mode parameter and the shift bit number parameter loaded by the parameter loading module, and stores the final multiplication result in the fourth storage module. Preferably, the roles of the first and second storage modules in the multiplier are identical, so that either the transform matrix or the coefficient matrix can be stored in the second storage module or the first storage module.
Preferably, the step S300 includes the steps of:
Step S310: read the operation mode parameter value and determine the operation mode;
As an implementation mode, the extraction mode parameter loaded by the parameter loading module is examined, and the corresponding extraction mode is applied to the intermediate result matrix;
Step S320: read the operation parameter values and, in the operation mode determined in step S310, operate on the intermediate result matrix according to the operation parameter values to obtain the data of the final result matrix.
As an implementation manner, each element of the final multiplication result matrix is obtained after the intermediate result matrix has been shifted according to the extraction mode and the shift parameter.
As an implementable manner, the following instructions may be utilized:
load $v0, CoeffMatrix: loads a 4x4 matrix CoeffMatrix into the vector register $v0;
vaddh $v2, $v1, $v0: performs signed addition on each pair of 16-bit units at corresponding positions of the vector registers $v0 and $v1, and stores the results at the corresponding positions of $v2;
vaddw $v2, $v1, $v0: performs signed addition on each pair of 32-bit units at corresponding positions of the vector registers $v0 and $v1, and stores the results at the corresponding positions of $v2;
vshifth $v1, $v0, imm: shifts each 16-bit unit in $v0 by imm bits;
vshiftw $v1, $v0, imm: shifts each 32-bit unit in $v0 by imm bits;
convert32to16 $v2, $v1, $v0: packs each 32-bit element in $v1 and $v0 into a 16-bit halfword, with saturation.
As an implementable manner, the matrix multiplication in the multimedia transform multiplier is implemented as follows:
temp00[31:0]=vs[15:0]*vt[15:0]+vs[31:16]*vt[79:64]+vs[47:32]*vt[143:128]+vs[63:48]*vt[207:192];
temp01[31:0]=vs[15:0]*vt[31:16]+vs[31:16]*vt[95:80]+vs[47:32]*vt[159:144]+vs[63:48]*vt[223:208];
temp02[31:0]=vs[15:0]*vt[47:32]+vs[31:16]*vt[111:96]+vs[47:32]*vt[175:160]+vs[63:48]*vt[239:224];
temp03[31:0]=vs[15:0]*vt[63:48]+vs[31:16]*vt[127:112]+vs[47:32]*vt[191:176]+vs[63:48]*vt[255:240];
temp10[31:0]=vs[79:64]*vt[15:0]+vs[95:80]*vt[79:64]+vs[111:96]*vt[143:128]+vs[127:112]*vt[207:192];
temp11[31:0]=vs[79:64]*vt[31:16]+vs[95:80]*vt[95:80]+vs[111:96]*vt[159:144]+vs[127:112]*vt[223:208];
temp12[31:0]=vs[79:64]*vt[47:32]+vs[95:80]*vt[111:96]+vs[111:96]*vt[175:160]+vs[127:112]*vt[239:224];
temp13[31:0]=vs[79:64]*vt[63:48]+vs[95:80]*vt[127:112]+vs[111:96]*vt[191:176]+vs[127:112]*vt[255:240];
temp20[31:0]=vs[143:128]*vt[15:0]+vs[159:144]*vt[79:64]+vs[175:160]*vt[143:128]+vs[191:176]*vt[207:192];
temp21[31:0]=vs[143:128]*vt[31:16]+vs[159:144]*vt[95:80]+vs[175:160]*vt[159:144]+vs[191:176]*vt[223:208];
temp22[31:0]=vs[143:128]*vt[47:32]+vs[159:144]*vt[111:96]+vs[175:160]*vt[175:160]+vs[191:176]*vt[239:224];
temp23[31:0]=vs[143:128]*vt[63:48]+vs[159:144]*vt[127:112]+vs[175:160]*vt[191:176]+vs[191:176]*vt[255:240];
temp30[31:0]=vs[207:192]*vt[15:0]+vs[223:208]*vt[79:64]+vs[239:224]*vt[143:128]+vs[255:240]*vt[207:192];
temp31[31:0]=vs[207:192]*vt[31:16]+vs[223:208]*vt[95:80]+vs[239:224]*vt[159:144]+vs[255:240]*vt[223:208];
temp32[31:0]=vs[207:192]*vt[47:32]+vs[223:208]*vt[111:96]+vs[239:224]*vt[175:160]+vs[255:240]*vt[239:224];
temp33[31:0]=vs[207:192]*vt[63:48]+vs[223:208]*vt[127:112]+vs[239:224]*vt[191:176]+vs[255:240]*vt[255:240];
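The sixteen assignments above follow one pattern: element (i, j) of temp is the dot product of row i of vs and column j of vt, where element e of a register occupies bits [16e+15:16e]. A minimal Python sketch of that pattern follows. The helper names `pack16`, `unpack16` and `matmul4x4` are hypothetical, a 256-bit register is modelled as a plain Python integer, and wrap-around of temp to 32 bits is not modelled.

```python
def pack16(elements):
    """Pack sixteen signed 16-bit values into one integer, element 0 in bits [15:0]."""
    reg = 0
    for idx, value in enumerate(elements):
        reg |= (value & 0xFFFF) << (16 * idx)
    return reg


def unpack16(reg):
    """Split a 256-bit integer into sixteen signed 16-bit values, lowest bits first."""
    out = []
    for idx in range(16):
        raw = (reg >> (16 * idx)) & 0xFFFF
        out.append(raw - 0x10000 if raw & 0x8000 else raw)
    return out


def matmul4x4(vs, vt):
    """Return temp as a 4x4 list: temp[i][j] = sum over k of vs[i][k] * vt[k][j]."""
    a = unpack16(vs)
    b = unpack16(vt)
    return [[sum(a[4 * i + k] * b[4 * k + j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

For instance, `matmul4x4(...)[0][0]` multiplies elements 0 to 3 of vs by elements 0, 4, 8 and 12 of vt, which is the same selection of bit ranges as in the line for temp00.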
if (imm8[6]) // imm8 is the mode control immediate
{
if (imm8[5]) // 7th bit is 1 and 6th bit is 1: first extraction mode
dij = (tij + (1 << (imm8[4:0] - 1))) >> imm8[4:0]; // rounding operation
else // 7th bit is 1 and 6th bit is 0: second extraction mode
dij = tij >> imm8[4:0];
// i and j each take a value from 0 to 3
}
else
{
if (imm8[5]) // 7th bit is 0 and 6th bit is 1: third extraction mode
{
d01,d00=t00;
d03,d02=t01;
...
d33,d32=t13;
}
else // 7th bit is 0 and 6th bit is 0: fourth extraction mode
{
d01,d00=t20;
d03,d02=t21;
...
d33,d32=t33;
}
}
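The four branches of the pseudocode above can be tried out with a small Python sketch. It is only an illustration under stated assumptions: the name `extract` is hypothetical, the intermediate matrix is a 4x4 list of Python integers, the 16-bit result is modelled by keeping the low 16 bits, and a shift of 0 in the first mode is treated as "no rounding", which the text does not specify.

```python
def to_int16(x):
    """Keep the low 16 bits of x as a signed value (assumed truncation)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def extract(temp, imm8):
    """Apply the extraction mode selected by imm8 to a 4x4 matrix temp."""
    m = (imm8 >> 6) & 1
    n = (imm8 >> 5) & 1
    shift = imm8 & 0x1F
    if m:
        # First mode adds 2^(shift-1) before shifting; second mode only shifts.
        bias = (1 << (shift - 1)) if (n and shift > 0) else 0
        return [[to_int16((t + bias) >> shift) for t in row] for row in temp]
    # Third mode keeps t00..t13, fourth mode keeps t20..t33, both at full width.
    return [list(row) for row in (temp[0:2] if n else temp[2:4])]
```

Python's `>>` on negative integers is an arithmetic shift, so negative intermediate values round toward minus infinity in the second mode.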
If the operation procedure of the multimedia transform multiplier is integrated into a single instruction, named vmtxmulh, the format of the instruction is as follows:
vmtxmulh first element address of the first memory module, first element address of the second memory module, first element address of the fourth memory module, extraction mode parameter and shift parameter;
Taking a specific application environment as an example, one operation of the multimedia transform multiplier can be regarded as performing a one-dimensional transform. Processing an image, for example, requires two transform steps, horizontal and vertical (i.e. a two-dimensional transform), so a two-dimensional transform can be performed by combining operations of the above multimedia transform multiplier. The two-dimensional transformation realized by the multimedia transform multiplier and the processing method thereof is further explained below with reference to the attached drawings.
As an implementable mode, the processing method of the multiplier provided by the invention realizes two-dimensional transformation. The two-step multiplication operations that the two-dimensional transformation needs to complete are as follows:
$$E = (f \times A^{T} + 2^{\mathrm{shift1}-1}) \gg \mathrm{shift1} \qquad (*)$$
$$F = (A \times E + 2^{\mathrm{shift2}-1}) \gg \mathrm{shift2}$$
Therefore, as an implementable manner, when performing two-dimensional transformation, the processing method of the multimedia transform multiplier according to the embodiment of the present invention further includes, before step S200, the following step:
Step S200': transpose the coefficient matrix used in the previous multimedia transform operation to obtain the data of the first matrix.
When two-dimensional transformation is carried out, the coefficient matrix A is transposed to obtain the transposed matrix A^T; then the result matrix obtained from the previous operation of the operation control module of the multiplier and the transposed coefficient matrix A^T are loaded into the second and first storage modules, respectively;
Then, step S200 and step S300 are repeated to carry out the iterative operation and obtain the final result matrix E;
As an implementation manner, the result matrix E obtained from the previous operation of the operation control module of the multiplier is sent to the second storage module of the multimedia transform multiplier to take part in the iterative multiplication, so as to obtain the final result matrix E.
Preferably, after the iterative operation is finished, a rounding operation is further performed according to the required number of data bits, so as to obtain the final result matrix E.
As another possible implementation, the matrices E and A can be used as the new input transform matrix and coefficient matrix, respectively, and loaded into another multimedia transform multiplier, so that the second step is completed serially or iteratively.
It follows from this embodiment that a two-dimensional multimedia transform can be completed with the invention either by running the same multimedia transform multiplier iteratively, or by running two multimedia transform multipliers serially or iteratively; multi-dimensional multimedia transforms are obtained in the same way.
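The two steps of formula (*) can be written out as a short Python sketch. The function names are hypothetical, matrices are plain nested lists of integers, the 16-bit width of the results is not modelled, and both shift amounts are assumed to be at least 1 so that the rounding constant 2^(shift-1) is defined.

```python
def matmul(x, y):
    """Plain integer matrix product of two square matrices."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def transpose(x):
    return [list(col) for col in zip(*x)]


def round_shift(x, shift):
    """Add 2^(shift-1) to every element, then shift right by shift bits."""
    bias = 1 << (shift - 1)
    return [[(v + bias) >> shift for v in row] for row in x]


def transform_2d(f, a, shift1, shift2):
    e = round_shift(matmul(f, transpose(a)), shift1)  # E = (f x A^T + 2^(shift1-1)) >> shift1
    return round_shift(matmul(a, e), shift2)          # F = (A x E + 2^(shift2-1)) >> shift2
```

As a sanity check, choosing A as 8 times the identity matrix with both shifts equal to 3 returns the input f unchanged, because each step multiplies by 8 and then divides by 8 with rounding.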
Therefore, the operational flexibility of the multimedia transform multiplier can be greatly improved through the transposer of the processor and through iterative or serial operation. For example, in the above embodiment the coefficient matrix is transposed before it is written to the multiplier and multiplied by the input transform matrix; to distinguish this multiplication from the one described previously, the instruction may be named vmtxinvmulh.
Similar to the vmtxmulh instruction, the format of the vmtxinvmulh instruction is:
vmtxinvmulh first element address of the first memory module, first element address of the second memory module, first element address of the fourth memory module, extraction mode parameter and shift parameter;
It should be noted that the above formula is only used to illustrate the transformation process of the present invention. Different transformation processes may or may not include a transposition, and may iterate more than twice; that is, with the multiplier and the multimedia transform processing method provided by the present invention, one-dimensional, two-dimensional or even higher-dimensional transforms can be realized. The formula (*) of the present invention serves only to illustrate the transformation process better and is not a limitation. In addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the claims of the present invention.
Example 1:
As an implementable embodiment, the procedure for implementing the two-dimensional transformation of formula (*) using the combination of the vmtxmulh and vmtxinvmulh instructions in the multiplier is as follows:
load $v0, f; ($v0 denotes the first matrix data; this statement loads matrix f into $v0)
load $v1, A; ($v1 denotes the second matrix data; this statement loads matrix A into $v1)
vmtxmulh $v2, $v0, $v1, 0x63; (first extraction mode, shift 3 bits)
vmtxinvmulh $v3, $v1, $v2, 0x67; (first extraction mode, shift 7 bits)
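The immediates 0x63 and 0x67 in this listing can be rebuilt from their fields. The following Python sketch uses a hypothetical helper name and assumes the byte layout given earlier in the description (extraction mode bit at index 6, operation control bit at index 5, shift in bits 4 to 0).

```python
def encode_control_byte(m, n, shift):
    """Build the 8-bit mode control immediate from its three fields."""
    if m not in (0, 1) or n not in (0, 1):
        raise ValueError("m and n must each be 0 or 1")
    if not 0 <= shift <= 31:
        raise ValueError("shift must fit in five bits")
    return (m << 6) | (n << 5) | shift
```

The first extraction mode (m = 1, n = 1) with a shift of 3 gives 0x63, and with a shift of 7 gives 0x67.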
Example 2:
For some decoders with less stringent accuracy requirements, rounding can be omitted to reduce the amount of computation and thereby increase the operation speed. Based on the above example, the following procedure is obtained:
E4x4=(CoeffMatrix×T4)>>3
R4x4=(T4×E4x4)>>7 (**)
With the instructions of the present invention, this is implemented as follows:
load $v0,CoeffMatrix;
load $v1,T4;
vmtxmulh $v2,$v0,$v1,0x43; (second extraction mode, shift 3 bits)
//imm8[6]=1,imm8[5]=0;imm8[4:0]=3
vmtxinvmulh $v3,$v1,$v2,0x47;
//imm8[6]=1,imm8[5]=0;imm8[4:0]=7
Example 3:
For matrices larger than 4x4 whose intermediate results can be represented in 16 bits:
For a block such as 8x8, since the above instruction can complete only one 4x4 matrix multiplication, the operation can be implemented with block matrix multiplication. The block matrix multiplication method is as follows:
Figure BDA0000040146220000181
T8^T × E8x8 can be calculated as follows:
Figure BDA0000040146220000182
Owing to the nature of media applications, the intermediate result generally does not exceed 16 bits. Taking VC-1 as an example, the operation can be implemented by the following procedure; fig. 2 is a flow chart of the method.
1:
load $v0,CoeffMatrix00;
load $v1,CoeffMatrix01;
load $v2,CoeffMatrix10;
load $v3,CoeffMatrix11;
load $v4,T00;
load $v5,T01;
load $v6,T10;
load $v7,T11;
2:
vmtxmulh $v8,$v0,$v4,0x40
//imm8[6]=1,imm8[5]=0;imm8[4:0]=0
vmtxmulh $v9,$v1,$v6,0x40
vmtxmulh $v10,$v0,$v5,0x40
vmtxmulh $v11,$v1,$v7,0x40
vmtxmulh $v12,$v2,$v4,0x40
vmtxmulh $v13,$v3,$v6,0x40
vmtxmulh $v14,$v2,$v5,0x40
vmtxmulh $v15,$v3,$v7,0x40
3:
vaddh $v8,$v8,$v9
vaddh $v9,$v10,$v11
vaddh $v10,$v12,$v13
vaddh $v11,$v14,$v15
4:
load $v12,{4(16bits)<repeat 16 times>};
5:
vaddh $v8,$v8,$v12
vaddh $v9,$v9,$v12
vaddh $v10,$v10,$v12
vaddh $v11,$v11,$v12
6:
vsrahi $v8,$v8,3
vsrahi $v9,$v9,3
vsrahi $v10,$v10,3
vsrahi $v11,$v11,3
7:
vmtxinvmulh $v12,$v4,$v8,0x40
//imm8[6]=1,imm8[5]=0;imm8[4:0]=0
vmtxinvmulh $v13,$v6,$v10,0x40
vmtxinvmulh $v14,$v4,$v9,0x40
vmtxinvmulh $v15,$v6,$v11,0x40
vmtxinvmulh $v16,$v5,$v8,0x40
vmtxinvmulh $v17,$v7,$v10,0x40
vmtxinvmulh $v18,$v5,$v9,0x40
vmtxinvmulh $v19,$v7,$v11,0x40
8:
vaddh $v8,$v12,$v13
vaddh $v9,$v14,$v15
vaddh $v10,$v16,$v17
vaddh $v11,$v18,$v19
9:
load $v12,{64(16bits)<repeat 16 times>};
10:
vaddh $v8,$v8,$v12
vaddh $v9,$v9,$v12
vaddh $v10,$v10,$v12
vaddh $v11,$v11,$v12
11:
vsrahi $v8,$v8,7
vsrahi $v9,$v9,7
vsrahi $v10,$v10,7
vsrahi $v11,$v11,7;
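Steps 2 and 3 of the listing above (and likewise steps 7 and 8) compute an 8x8 product from eight 4x4 sub-block products followed by four additions. The Python sketch below illustrates that decomposition only; the function names are hypothetical, matrices are nested lists of integers, and the 16-bit register width, rounding and shifting of the other steps are not modelled.

```python
def matmul(x, y):
    """Plain integer matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(m, r, c):
    """Return the 4x4 sub-block of an 8x8 matrix at block row r, block column c."""
    return [row[4 * c:4 * c + 4] for row in m[4 * r:4 * r + 4]]


def block_matmul8(c, t):
    """8x8 product built from 4x4 products: E[r][s] = C[r][0]*T[0][s] + C[r][1]*T[1][s]."""
    e = {}
    for r in range(2):
        for s in range(2):
            e[r, s] = add(matmul(block(c, r, 0), block(t, 0, s)),
                          matmul(block(c, r, 1), block(t, 1, s)))
    top = [e[0, 0][i] + e[0, 1][i] for i in range(4)]
    bottom = [e[1, 0][i] + e[1, 1][i] for i in range(4)]
    return top + bottom
```

For block row 0 and block column 0 this is C00 × T00 + C01 × T10, the pair of products that the listing places in $v8 and $v9 before adding them.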
Example 4:
For matrices larger than 4x4 (here 8x8) whose intermediate results need 32 bits:
Because each element of a sub-block product A × B has 32 bits after the multiplication, a vector register can hold only half of the product matrix. The flow chart is shown in fig. 3, and the implementation is as follows:
1:
load $v0,CoeffMatrix00;
load $v1,CoeffMatrix01;
load $v2,CoeffMatrix10;
load $v3,CoeffMatrix11;
load $v4,T00;
load $v5,T01;
load $v6,T10;
load $v7,T11;
2:
vmtxmulh $v8,$v0,$v4,0x0;
vmtxmulh $v9,$v0,$v4,0x20;
vmtxmulh $v10,$v1,$v6,0x0;
vmtxmulh $v11,$v1,$v6,0x20;
vmtxmulh $v12,$v0,$v5,0x0;
vmtxmulh $v13,$v0,$v5,0x20;
vmtxmulh $v14,$v1,$v7,0x0;
vmtxmulh $v15,$v1,$v7,0x20;
vmtxmulh $v16,$v2,$v4,0x0;
vmtxmulh $v17,$v2,$v4,0x20;
vmtxmulh $v18,$v3,$v6,0x0;
vmtxmulh $v19,$v3,$v6,0x20;
vmtxmulh $v20,$v2,$v5,0x0;
vmtxmulh $v21,$v2,$v5,0x20;
vmtxmulh $v22,$v3,$v7,0x0;
vmtxmulh $v23,$v3,$v7,0x20;
3:
vaddw $v8,$v8,$v10
vaddw $v9,$v9,$v11
vaddw $v10,$v12,$v14
vaddw $v11,$v13,$v15
vaddw $v12,$v16,$v18
vaddw $v13,$v17,$v19
vaddw $v14,$v20,$v22
vaddw $v15,$v21,$v23
4:
load $v16,{4(32bits)<repeat 8 times>};
5:
vaddw $v8,$v8,$v16
vaddw $v9,$v9,$v16
vaddw $v10,$v10,$v16
vaddw $v11,$v11,$v16
vaddw $v12,$v12,$v16
vaddw $v13,$v13,$v16
vaddw $v14,$v14,$v16
vaddw $v15,$v15,$v16
6:
vsrawi $v8,$v8,3
vsrawi $v9,$v9,3
vsrawi $v10,$v10,3
vsrawi $v11,$v11,3
vsrawi $v12,$v12,3
vsrawi $v13,$v13,3
vsrawi $v14,$v14,3
vsrawi $v15,$v15,3
7:
convert32to16 $v8,$v8,$v9
convert32to16 $v9,$v10,$v11
convert32to16 $v10,$v12,$v13
convert32to16 $v11,$v14,$v15
8:
vmtxinvmulh $v12,$v4,$v8,0x0;
vmtxinvmulh $v13,$v4,$v8,0x20;
vmtxinvmulh $v14,$v6,$v10,0x0;
vmtxinvmulh $v15,$v6,$v10,0x20;
vmtxinvmulh $v16,$v4,$v9,0x0;
vmtxinvmulh $v17,$v4,$v9,0x20;
vmtxinvmulh $v18,$v6,$v11,0x0;
vmtxinvmulh $v19,$v6,$v11,0x20;
vmtxinvmulh $v20,$v5,$v8,0x0;
vmtxinvmulh $v21,$v5,$v8,0x20;
vmtxinvmulh $v22,$v7,$v10,0x0;
vmtxinvmulh $v23,$v7,$v10,0x20;
vmtxinvmulh $v24,$v5,$v9,0x0;
vmtxinvmulh $v25,$v5,$v9,0x20;
vmtxinvmulh $v26,$v7,$v11,0x0;
vmtxinvmulh $v27,$v7,$v11,0x20;
9:
vaddw $v8,$v12,$v14
vaddw $v9,$v13,$v15
vaddw $v10,$v16,$v18
vaddw $v11,$v17,$v19
vaddw $v12,$v20,$v22
vaddw $v13,$v21,$v23
vaddw $v14,$v24,$v26
vaddw $v15,$v25,$v27
10:
load $v16,{64(32bits)<repeat 8 times>};
11:
vaddw $v8,$v8,$v16
vaddw $v9,$v9,$v16
vaddw $v10,$v10,$v16
vaddw $v11,$v11,$v16
vaddw $v12,$v12,$v16
vaddw $v13,$v13,$v16
vaddw $v14,$v14,$v16
vaddw $v15,$v15,$v16
12:
vsrawi $v8,$v8,7
vsrawi $v9,$v9,7
vsrawi $v10,$v10,7
vsrawi $v11,$v11,7
vsrawi $v12,$v12,7
vsrawi $v13,$v13,7
vsrawi $v14,$v14,7
vsrawi $v15,$v15,7
convert32to16 $v8,$v8,$v9
convert32to16 $v9,$v10,$v11
convert32to16 $v10,$v12,$v13
convert32to16 $v11,$v14,$v15
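For a single output block, steps 3 to 7 of this example (and likewise steps 9 to 12) amount to: add two sub-block products at full width, add the rounding constant, shift right, and narrow to 16 bits with saturation. A minimal Python sketch of that sequence follows. The function names are hypothetical, each sub-block product is a whole 4x4 list rather than two half-matrix registers, and the saturation bounds are the usual signed 16-bit limits, which is an assumption about convert32to16.

```python
def saturate16(x):
    """Clamp x to the signed 16-bit range (assumed saturation behaviour)."""
    return max(-32768, min(32767, x))


def combine_sub_products(p, q, shift):
    """Return saturate16((p + q + 2^(shift-1)) >> shift) element by element.

    p and q model two 4x4 sub-block products kept at 32-bit precision;
    shift is assumed to be at least 1.
    """
    bias = 1 << (shift - 1)
    return [[saturate16((a + b + bias) >> shift) for a, b in zip(row_p, row_q)]
            for row_p, row_q in zip(p, q)]
```

With a shift of 3 the constant is 4, and with a shift of 7 it is 64, matching the constants loaded into $v16 in steps 4 and 10.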
Example 5:
For a general matrix multiplication, a matrix product truncated to 16 bits can be obtained by the following procedure:
load $v0,M1;
load $v1,M2;
vmtxmulh $v2,$v0,$v1,0x40
//imm8[6]=1,imm8[5]=0;imm8[4:0]=0
The multimedia transform multiplier and the processing method thereof greatly improve the speed at which the processor encodes and decodes streaming multimedia. For the 4x4 inverse transform of the VC-1 format described above, the assembly produced by GCC with -O2 optimization shows that the core loop requires roughly 110 instructions, whereas the method of the present invention needs only a few instructions. As for execution time, the core loop is executed four times in the general method, which takes about 450 beats (clock cycles) under ideal fully pipelined conditions, whereas the present method takes from a dozen to a few tens of beats under the same conditions. In practice neither method achieves ideal full pipelining, so even when memory access latency and similar factors are taken into account, the present method retains a very large speedup. For other formats, the results are essentially the same as in this example because the inverse transforms are similar in form.
Finally, it should be noted that it is obvious that various changes and modifications can be made to the present invention by those skilled in the art without departing from the spirit and scope of the present invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (12)

1. A multimedia transform multiplier is characterized by comprising a matrix multiplication module and an operation control module;
the matrix multiplication module is used for carrying out matrix multiplication operation on the data of the first matrix and the data of the second matrix to obtain the data of an intermediate result matrix;
the operation control module is used for reading the operation control parameter values and controlling the data of the intermediate result matrix to perform operation according to the operation control parameter values to obtain the data of the result matrix;
the data of the first matrix is data of a coefficient matrix of multimedia transformation operation; the data of the second matrix is data of a transformation matrix for multimedia transformation operation;
or,
the data of the first matrix is the data of a transposed matrix of a coefficient matrix of the multiplier which carries out multimedia transformation operation last time; and the data of the second matrix is the data of a result matrix obtained after the last operation of the operation control module of the multiplier.
2. The multimedia transform multiplier of claim 1, further comprising a parameter loading module for loading data of the first matrix, data of the second matrix, and operation control parameter values.
3. The multimedia transform multiplier of claim 1, further comprising a transposer for transposing the coefficient matrix of the multimedia transform operation to obtain data of the first matrix.
4. The multimedia transform multiplier device according to any of claims 1 to 3, wherein the operation control parameter values comprise an operation mode parameter value and an operation parameter value;
the operation control module comprises a judgment module and an operation module;
the judging module is used for reading the operation mode parameter value loaded in the loading parameter module and determining the operation mode;
and the operation module is used for reading the operation parameter values loaded in the loading parameter module and controlling the intermediate result matrix to carry out corresponding operation in the operation mode determined by the judgment module according to the operation parameter values.
5. The multimedia transform multiplier of claim 4 wherein the decision module comprises a bit precision bit and an operation mode bit; the judging module determines an operation mode according to the read operation mode parameter value through a bit precision bit and an operation mode bit; the digit precision bit and the operation mode bit are respectively expressed by binary numbers.
6. The multimedia transform multiplier of claim 5, wherein the bit precision bit indicates whether the significant data precision required of the intermediate matrix is higher than 16 bits or lower than 16 bits;
the operation mode bit indicates, when the significant data precision of the intermediate matrix is lower than 16 bits, whether an addition operation is performed; and, when the significant data precision of the intermediate matrix is higher than 32 bits, whether the lower half or the upper half of the intermediate matrix is taken out.
7. The multimedia transform multiplier of claim 4, wherein the operation module comprises operation control bits, the operation control bits representing the number of shift bits.
8. The multimedia transform multiplier of any of claims 1 to 3, further comprising a first storage module, a second storage module, a third storage module and a fourth storage module; wherein:
the first storage module is used for storing the data of the first matrix;
the second storage module is used for storing the data of the second matrix;
the third storage module is used for storing the data of the intermediate result matrix obtained after the matrix multiplication module carries out matrix multiplication operation;
and the fourth storage module is used for storing the data of the result matrix obtained after the operation of the operation control module.
9. A method for processing a multimedia transform multiplier, comprising the steps of:
step A, performing a matrix multiplication operation on data of a first matrix and data of a second matrix to obtain data of an intermediate result matrix;
step B, controlling the data of the intermediate result matrix to undergo an operation according to the loaded operation control parameter values to obtain the data of the result matrix;
the data of the first matrix is data of a coefficient matrix of multimedia transformation operation, and the data of the second matrix is data of a transformation matrix of multimedia transformation operation;
or,
the data of the first matrix is the data of a transposed matrix of the coefficient matrix with which the multiplier last carried out a multimedia transformation operation; and the data of the second matrix is the data of a result matrix obtained after the last operation of the operation control module of the multiplier.
10. The method of claim 9, wherein step A is preceded by the following step:
step A', loading the data of the first matrix, the data of the second matrix, and the operation control parameter values.
11. The method of claim 9, wherein step A is preceded by the following step:
step A', transposing the coefficient matrix subjected to the multimedia transformation operation last time to obtain the data of the first matrix.
12. The processing method of the multimedia transform multiplier of claim 9, wherein the operation control parameter values comprise operation mode parameter values and operation parameter values, and the step B comprises the steps of:
step B1, reading the parameter value of the operation mode, and determining the operation mode;
and step B2, reading the operation parameter values, and controlling the intermediate result matrix to perform corresponding operation according to the operation parameter values in the operation mode determined in the step B1 to obtain the data of the final result matrix.
CN201010603133A 2010-12-23 2010-12-23 Multimedia transformation multiplier and processing method thereof Active CN102043605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010603133A CN102043605B (en) 2010-12-23 2010-12-23 Multimedia transformation multiplier and processing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010603133A CN102043605B (en) 2010-12-23 2010-12-23 Multimedia transformation multiplier and processing method thereof

Publications (2)

Publication Number Publication Date
CN102043605A CN102043605A (en) 2011-05-04
CN102043605B true CN102043605B (en) 2012-10-24

Family

ID=43909766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010603133A Active CN102043605B (en) 2010-12-23 2010-12-23 Multimedia transformation multiplier and processing method thereof

Country Status (1)

Country Link
CN (1) CN102043605B (en)

Also Published As

Publication number Publication date
CN102043605A (en) 2011-05-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: 100095 Building 2, Longxin Industrial Park, Zhongguancun environmental protection technology demonstration park, Haidian District, Beijing

Patentee after: Loongson Zhongke Technology Co.,Ltd.

Address before: 100190 No. 6 South Road, Zhongguancun Academy of Sciences, Beijing, Haidian District

Patentee before: LOONGSON TECHNOLOGY Corp.,Ltd.