CN104503732A - One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor - Google Patents

One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Info

Publication number
CN104503732A
Authority
CN
China
Prior art keywords
vec
fix
value
byte
designated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410835382.XA
Other languages
Chinese (zh)
Inventor
吴玲达
王宇
吕雅帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Equipment College
Original Assignee
PLA Equipment College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Equipment College filed Critical PLA Equipment College
Priority to CN201410835382.XA priority Critical patent/CN104503732A/en
Publication of CN104503732A publication Critical patent/CN104503732A/en
Priority to PCT/CN2015/081035 priority patent/WO2016107083A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a one-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for the Feiteng processor. The method uses the sub-word parallel instructions of the Feiteng processor to parallelize and optimize the one-dimensional eight-point IDCT computation, so that the many structurally identical multiply and add operations in the one-dimensional eight-point IDCT are carried out by a small number of sub-word parallel instructions. The advantage of the method is that the sub-word parallel instructions of the Feiteng processor parallelize the one-dimensional eight-point IDCT and thereby improve the efficiency of the IDCT computation.

Description

A one-dimensional eight-point IDCT parallelization method for the Feiteng processor
Technical field
The present invention relates to a method for parallelizing the one-dimensional eight-point IDCT algorithm on the domestically developed Feiteng processor, and in particular to the IDCT algorithm used in image and video decoding programs.
Background technology
The two-dimensional 8×8 IDCT is a common transform in image and video decoding and is the most time-consuming transform computation in image decoding. To improve computational efficiency, current image and video decoding programs generally convert the two-dimensional 8×8 IDCT into multiple one-dimensional eight-point IDCT computations, as illustrated by the sketch below.
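The row-column decomposition itself is not claimed by the patent; the following minimal C sketch only illustrates it. The function name idct_2d_8x8, the short-typed block layout, and the callback idct_1d_8 are illustrative assumptions, not part of the invention.

    /* Illustrative only: a 2-D 8x8 IDCT computed as 8 row transforms
       followed by 8 column transforms, each a 1-D 8-point IDCT supplied
       by the caller through idct_1d_8. */
    typedef void (*idct_1d_fn)(short *data, int stride);

    static void idct_2d_8x8(short block[64], idct_1d_fn idct_1d_8)
    {
        int i;
        for (i = 0; i < 8; i++)
            idct_1d_8(&block[8 * i], 1);   /* transform the 8 rows    */
        for (i = 0; i < 8; i++)
            idct_1d_8(&block[i], 8);       /* transform the 8 columns */
    }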
The Feiteng processors are a family of chip multithreading (CMT) processors developed by the National University of Defense Technology. The VIS multimedia instruction set of the Feiteng series supports sub-word parallel computation. In sub-word parallel computation, a word is treated as a collection of data, and a sub-word is a lower-precision data unit contained within the word. Because a single instruction can be applied to all sub-words in a word, an operation that would otherwise require several instructions can be completed by one sub-word parallel instruction. For example, if the word length is 64 bits, a sub-word can be 8, 16, or 32 bits wide, so one instruction can process eight 8-bit sub-words, four 16-bit sub-words, or two 32-bit sub-words in parallel. The present invention uses the VIS instruction set of the Feiteng processor to realize a sub-word parallelization of the one-dimensional eight-point IDCT algorithm and thereby increase the speed of IDCT computation on the Feiteng processor.
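As a concrete illustration of sub-word parallelism, the minimal sketch below assumes GCC's SPARC VIS builtins and vector types (which the embodiment later in this description also relies on); the function name add4 is illustrative. A single fpadd16 instruction adds four 16-bit sub-words packed in one 64-bit word:

    /* Compile with a VIS-capable SPARC GCC, e.g. with -mvis.
       v4hi holds four 16-bit sub-words in one 64-bit word. */
    typedef short v4hi __attribute__((vector_size(8)));

    static v4hi add4(v4hi a, v4hi b)
    {
        /* One instruction performs four 16-bit additions in parallel,
           replacing four scalar add instructions. */
        return __builtin_vis_fpadd16(a, b);
    }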
To date, no method has been reported that uses the VIS instruction set of the Feiteng processor to realize a sub-word parallelization of the one-dimensional eight-point IDCT algorithm.
Summary of the invention
The object of the invention is to use the VIS instructions of the Feiteng processor to improve the efficiency of the one-dimensional eight-point IDCT computation.
The one-dimensional eight-point IDCT parallelization method for the Feiteng processor according to the present invention comprises the following steps:
Let x(n), n = 0, 1, 2, ..., 7, be the input of the one-dimensional eight-point IDCT and y(n), n = 0, 1, 2, ..., 7, be the output, where x(n) and y(n) are integers between 0 and 255. The one-dimensional eight-point IDCT computation can then be expressed as:
a0 = x(0)*C4 + x(2)*C2 + x(4)*C4 + x(6)*C6        y(0) = a0 + b0
a1 = x(0)*C4 + x(2)*C6 - x(4)*C4 - x(6)*C2        y(1) = a1 + b1
a2 = x(0)*C4 - x(2)*C6 - x(4)*C4 + x(6)*C2        y(2) = a2 + b2
a3 = x(0)*C4 - x(2)*C2 + x(4)*C4 - x(6)*C6        y(3) = a3 + b3
b0 = x(1)*C1 + x(3)*C3 + x(5)*C5 + x(7)*C7        y(4) = a4 - b4
b1 = x(1)*C3 - x(3)*C7 - x(5)*C1 - x(7)*C5        y(5) = a5 - b5
b2 = x(1)*C5 - x(3)*C1 + x(5)*C7 + x(7)*C3        y(6) = a6 - b6
b3 = x(1)*C7 - x(3)*C5 + x(5)*C3 - x(7)*C1        y(7) = a7 - b7
where ak and bk (k = 0, 1, 2, ..., 7) denote intermediate results and Ck (k = 1, 2, 3, ..., 7) are constants. The steps for parallelizing the one-dimensional eight-point IDCT computation with the VIS instruction set of the Feiteng processor are as follows:
(1) Multiply every Ck by 2^14 and round, and denote the result Fix_Ck (a C sketch of this constant preparation is given after this list);
(2) Use the prefix Vec4_ to denote a vector of four integers, and let Vec4_Xk = {x(k), x(k), x(k), x(k)}, k = 0, 1, 2, ..., 7. Because the value of x(k) lies between 0 and 255 it occupies only one byte, so the vector Vec4_Xk formed by the four copies of x(k) is stored in one 32-bit word;
(3) Let Vec4_C0 = {Fix_C4, Fix_C4, Fix_C4, Fix_C4},
Vec4_C1 = {Fix_C1, Fix_C3, Fix_C5, Fix_C7},
Vec4_C2 = {Fix_C2, Fix_C6, -Fix_C6, -Fix_C2},
Vec4_C3 = {Fix_C3, -Fix_C7, -Fix_C1, -Fix_C5},
Vec4_C4 = {Fix_C4, -Fix_C4, -Fix_C4, Fix_C4},
Vec4_C5 = {Fix_C5, -Fix_C1, Fix_C7, Fix_C3},
Vec4_C6 = {Fix_C6, -Fix_C2, Fix_C2, -Fix_C6},
Vec4_C7 = {Fix_C7, -Fix_C5, Fix_C3, -Fix_C1}.
Because each Fix_Ck lies between 0 and 2^14 and therefore occupies only two bytes, each Vec4_Ck is stored in one 64-bit word;
(4) Use the fmul8x16 instruction of the Feiteng processor to compute Vec4_Xk × Vec4_Ck, and denote the result Vec4_Tk, k = 0, 1, 2, ..., 7;
(5) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_T0 + Vec4_T1 + Vec4_T2 + Vec4_T3, denoted Vec4_A; similarly, use fpadd16 to compute Vec4_T4 + Vec4_T5 + Vec4_T6 + Vec4_T7, denoted Vec4_B;
(6) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_A + Vec4_B, denoted Vec4_Ya, and use the fpsub16 instruction to compute Vec4_A - Vec4_B, denoted Vec4_Yb;
(7) Use the fpack16 instruction to pack Vec4_Ya into a 32-bit word, denoted Vec4_Yap; similarly, use fpack16 to pack Vec4_Yb into a 32-bit word, denoted Vec4_Ybp. The 1st byte of Vec4_Yap is y(0), the 2nd byte is y(1), the 3rd byte is y(2), and the 4th byte is y(3); the 1st byte of Vec4_Ybp is y(4), the 2nd byte is y(5), the 3rd byte is y(6), and the 4th byte is y(7).
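The following is a minimal C sketch of how steps (1) and (3) can be prepared. The patent does not spell out the values of Ck; the sketch assumes the conventional choice Ck = cos(k*pi/16), GCC's vector extension types, and illustrative names (Fix_C, Vec_C, init_idct_constants) that are not part of the patent text.

    /* Sketch of steps (1) and (3): build Fix_Ck = round(Ck * 2^14) and the
       eight coefficient vectors as four 16-bit sub-words per 64-bit word.
       Assumes Ck = cos(k*pi/16); names are illustrative. */
    #include <math.h>

    typedef short v4hi __attribute__((vector_size(8)));

    short Fix_C[8];
    v4hi  Vec_C[8];

    void init_idct_constants(void)
    {
        const double pi = 3.14159265358979323846;
        int k;

        for (k = 0; k < 8; k++)                     /* step (1): scale by 2^14 and round */
            Fix_C[k] = (short)lround(cos(k * pi / 16.0) * 16384.0);

        /* Step (3): sign and order pattern taken directly from the description. */
        Vec_C[0] = (v4hi){ Fix_C[4],  Fix_C[4],  Fix_C[4],  Fix_C[4] };
        Vec_C[1] = (v4hi){ Fix_C[1],  Fix_C[3],  Fix_C[5],  Fix_C[7] };
        Vec_C[2] = (v4hi){ Fix_C[2],  Fix_C[6], -Fix_C[6], -Fix_C[2] };
        Vec_C[3] = (v4hi){ Fix_C[3], -Fix_C[7], -Fix_C[1], -Fix_C[5] };
        Vec_C[4] = (v4hi){ Fix_C[4], -Fix_C[4], -Fix_C[4],  Fix_C[4] };
        Vec_C[5] = (v4hi){ Fix_C[5], -Fix_C[1],  Fix_C[7],  Fix_C[3] };
        Vec_C[6] = (v4hi){ Fix_C[6], -Fix_C[2],  Fix_C[2], -Fix_C[6] };
        Vec_C[7] = (v4hi){ Fix_C[7], -Fix_C[5],  Fix_C[3], -Fix_C[1] };
    }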
When the method is applied on the Feiteng FT1000 processor, the parallelized one-dimensional eight-point IDCT computation is 312.76% faster than the non-parallelized one-dimensional eight-point IDCT computation.
Embodiment
A specific embodiment of the present invention is described below, taking the Feiteng FT1000 processor as an example (the implementation is written in C and compiled with the GCC compiler); a consolidated code sketch is given after step 10:
1. Multiply every Ck by 2^14 and round, and denote the result FIX_Ck;
2. Store four identical copies of x(k) in a 32-bit variable of built-in type v4qi, denoted Vec_Xk, so that each x(k) occupies one byte;
3. Declare eight 64-bit variables of built-in type v4hi: Vec_C0, Vec_C1, Vec_C2, Vec_C3, Vec_C4, Vec_C5, Vec_C6 and Vec_C7;
4. Let Vec_C0 = {FIX_C4, FIX_C4, FIX_C4, FIX_C4}, so that each FIX_C4 occupies two bytes of Vec_C0; similarly, let Vec_C1 = {FIX_C1, FIX_C3, FIX_C5, FIX_C7},
Vec_C2={FIX_C2,FIX_C6,-FIX_C6,-FIX_C2},
Vec_C3={FIX_C3,-FIX_C7,-FIX_C1,-FIX_C5},
Vec_C4={FIX_C4,-FIX_C4,-FIX_C4,FIX_C4},
Vec_C5={FIX_C5,-FIX_C1,FIX_C7,FIX_C3},
Vec_C6={FIX_C6,-FIX_C2,FIX_C2,-FIX_C6},
Vec_C7={FIX_C7,-FIX_C5,FIX_C3,-FIX_C1};
5. Use the built-in sub-word parallel function __builtin_vis_fmul8x16 to compute Vec_Xk × Vec_Ck, denoted Vec_Tk, i.e. Vec_Tk = __builtin_vis_fmul8x16(Vec_Xk, Vec_Ck);
6. Declare two variables of built-in type v4hi: Vec_A and Vec_B;
7. Use the built-in sub-word parallel function __builtin_vis_fpadd16 to compute Vec_T0 + Vec_T1 + Vec_T2 + Vec_T3 and store the result in the variable Vec_A, i.e. complete this computation with the following code:
Vec_A = __builtin_vis_fpadd16(Vec_T0, Vec_T1);
Vec_A = __builtin_vis_fpadd16(Vec_A, Vec_T2);
Vec_A = __builtin_vis_fpadd16(Vec_A, Vec_T3);
8. Similarly, use the built-in sub-word parallel function __builtin_vis_fpadd16 to compute Vec_T4 + Vec_T5 + Vec_T6 + Vec_T7 and store the result in the variable Vec_B;
9. Use the built-in sub-word parallel function __builtin_vis_fpadd16 to compute Vec_A + Vec_B and store the result in the variable Vec_Ya of built-in type v4hi; similarly, use the built-in sub-word parallel function __builtin_vis_fpsub16 to compute Vec_A - Vec_B and store the result in the variable Vec_Yb of built-in type v4hi;
10. Use the built-in function __builtin_vis_fpack16 to pack Vec_Ya into a 32-bit variable of type v4qi, denoted Vec_YPa, i.e. Vec_YPa = __builtin_vis_fpack16(Vec_Ya); similarly, pack Vec_Yb into a 32-bit variable of type v4qi, denoted Vec_YPb. The 1st byte of Vec_YPa is y(0), the 2nd byte is y(1), the 3rd byte is y(2) and the 4th byte is y(3); the 1st byte of Vec_YPb is y(4), the 2nd byte is y(5), the 3rd byte is y(6) and the 4th byte is y(7).
Tested on a computer equipped with the Feiteng FT1000 processor, the parallelized one-dimensional eight-point IDCT computation is 312.76% faster than the non-parallelized one-dimensional eight-point IDCT computation.

Claims (1)

1. A one-dimensional eight-point IDCT parallelization method for the Feiteng processor, characterized in that the parallelization method comprises the following steps:
Let x(n), n = 0, 1, 2, ..., 7, be the input of the one-dimensional eight-point IDCT and y(n), n = 0, 1, 2, ..., 7, be the output, where x(n) and y(n) are integers between 0 and 255. The one-dimensional eight-point IDCT computation can then be expressed as:
a0 = x(0)*C4 + x(2)*C2 + x(4)*C4 + x(6)*C6        y(0) = a0 + b0
a1 = x(0)*C4 + x(2)*C6 - x(4)*C4 - x(6)*C2        y(1) = a1 + b1
a2 = x(0)*C4 - x(2)*C6 - x(4)*C4 + x(6)*C2        y(2) = a2 + b2
a3 = x(0)*C4 - x(2)*C2 + x(4)*C4 - x(6)*C6        y(3) = a3 + b3
b0 = x(1)*C1 + x(3)*C3 + x(5)*C5 + x(7)*C7        y(4) = a4 - b4
b1 = x(1)*C3 - x(3)*C7 - x(5)*C1 - x(7)*C5        y(5) = a5 - b5
b2 = x(1)*C5 - x(3)*C1 + x(5)*C7 + x(7)*C3        y(6) = a6 - b6
b3 = x(1)*C7 - x(3)*C5 + x(5)*C3 - x(7)*C1        y(7) = a7 - b7
where ak and bk (k = 0, 1, 2, ..., 7) denote intermediate results and Ck (k = 1, 2, 3, ..., 7) are constants. The steps for parallelizing the one-dimensional eight-point IDCT computation with the VIS instruction set of the Feiteng processor are as follows:
(1) Multiply every Ck by 2^14 and round, and denote the result Fix_Ck;
(2) Use the prefix Vec4_ to denote a vector of four integers, and let Vec4_Xk = {x(k), x(k), x(k), x(k)}, k = 0, 1, 2, ..., 7. Because the value of x(k) lies between 0 and 255 it occupies only one byte, so the vector Vec4_Xk formed by the four copies of x(k) is stored in one 32-bit word;
(3) Let Vec4_C0 = {Fix_C4, Fix_C4, Fix_C4, Fix_C4},
Vec4_C1 = {Fix_C1, Fix_C3, Fix_C5, Fix_C7},
Vec4_C2 = {Fix_C2, Fix_C6, -Fix_C6, -Fix_C2},
Vec4_C3 = {Fix_C3, -Fix_C7, -Fix_C1, -Fix_C5},
Vec4_C4 = {Fix_C4, -Fix_C4, -Fix_C4, Fix_C4},
Vec4_C5 = {Fix_C5, -Fix_C1, Fix_C7, Fix_C3},
Vec4_C6 = {Fix_C6, -Fix_C2, Fix_C2, -Fix_C6},
Vec4_C7 = {Fix_C7, -Fix_C5, Fix_C3, -Fix_C1}.
Because each Fix_Ck lies between 0 and 2^14 and therefore occupies only two bytes, each Vec4_Ck is stored in one 64-bit word;
(4) Use the fmul8x16 instruction of the Feiteng processor to compute Vec4_Xk × Vec4_Ck, and denote the result Vec4_Tk, k = 0, 1, 2, ..., 7;
(5) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_T0 + Vec4_T1 + Vec4_T2 + Vec4_T3, denoted Vec4_A; similarly, use fpadd16 to compute Vec4_T4 + Vec4_T5 + Vec4_T6 + Vec4_T7, denoted Vec4_B;
(6) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_A + Vec4_B, denoted Vec4_Ya, and use the fpsub16 instruction to compute Vec4_A - Vec4_B, denoted Vec4_Yb;
(7) Use the fpack16 instruction to pack Vec4_Ya into a 32-bit word, denoted Vec4_Yap; similarly, use fpack16 to pack Vec4_Yb into a 32-bit word, denoted Vec4_Ybp. The 1st byte of Vec4_Yap is y(0), the 2nd byte is y(1), the 3rd byte is y(2), and the 4th byte is y(3); the 1st byte of Vec4_Ybp is y(4), the 2nd byte is y(5), the 3rd byte is y(6), and the 4th byte is y(7).
CN201410835382.XA 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor Pending CN104503732A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410835382.XA CN104503732A (en) 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor
PCT/CN2015/081035 WO2016107083A1 (en) 2014-12-30 2015-06-09 One-dimensional eight-point idct parallelism method for feiteng processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410835382.XA CN104503732A (en) 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Publications (1)

Publication Number Publication Date
CN104503732A true CN104503732A (en) 2015-04-08

Family

ID=52945133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410835382.XA Pending CN104503732A (en) 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Country Status (2)

Country Link
CN (1) CN104503732A (en)
WO (1) WO2016107083A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016107083A1 (en) * 2014-12-30 2016-07-07 中国人民解放军装备学院 One-dimensional eight-point idct parallelism method for feiteng processor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102804165A (en) * 2009-02-11 2012-11-28 四次方有限公司 Front end processor with extendable data path
CN103079079A (en) * 2013-01-23 2013-05-01 中国人民解放军装备学院 Subword parallel method for color spatial transformation
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5452466A (en) * 1993-05-11 1995-09-19 Teknekron Communications Systems, Inc. Method and apparatus for preforming DCT and IDCT transforms on data signals with a preprocessor, a post-processor, and a controllable shuffle-exchange unit connected between the pre-processor and post-processor
CN101694627B (en) * 2009-10-23 2013-09-11 天津大学 Compiler system based on TCore configurable processor
CN104503732A (en) * 2014-12-30 2015-04-08 中国人民解放军装备学院 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102804165A (en) * 2009-02-11 2012-11-28 四次方有限公司 Front end processor with extendable data path
CN103079079A (en) * 2013-01-23 2013-05-01 中国人民解放军装备学院 Subword parallel method for color spatial transformation
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIYI XIAO, HAI HUANG: "A Novel CORDIC Based Unified Architecture for DCT and IDCT", 2012 International Conference on Optoelectronics and Microelectronics (ICOM) *
李京, 沈泊: "Design of a Low-Power 2D DCT/IDCT Processor" (一种低功耗2D DCT/IDCT处理器设计), Research and Progress of Solid State Electronics (固体电子学研究与进展) *
胡小涛, 梁利平: "Optimization and Implementation of an Integer 2D-IDCT Algorithm" (整型2D-IDCT算法的优化与实现), Video Engineering (电视技术) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016107083A1 (en) * 2014-12-30 2016-07-07 中国人民解放军装备学院 One-dimensional eight-point idct parallelism method for feiteng processor

Also Published As

Publication number Publication date
WO2016107083A1 (en) 2016-07-07

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
CN111213125B (en) Efficient direct convolution using SIMD instructions
Cho et al. MEC: Memory-efficient convolution for deep neural network
US20180107630A1 (en) Processor and method for executing matrix multiplication operation on processor
FI4099168T3 (en) Compute optimizations for low precision machine learning operations
CN102436438B (en) Sparse matrix data storage method based on ground power unit (GPU)
US8539201B2 (en) Transposing array data on SIMD multi-core processor architectures
US20180203673A1 (en) Execution of computation graphs
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
US9436465B2 (en) Moving average processing in processor and processor
US20190042195A1 (en) Scalable memory-optimized hardware for matrix-solve
JP2017517082A (en) Parallel decision tree processor architecture
US9934199B2 (en) Digital filter device, digital filtering method, and storage medium having digital filter program stored thereon
US9785614B2 (en) Fast Fourier transform device, fast Fourier transform method, and recording medium storing fast Fourier transform program
CN104503732A (en) One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor
Park et al. A highly parallelized decoder for random network coding leveraging GPGPU
CN114117896A (en) Method and system for realizing binary protocol optimization for ultra-long SIMD pipeline
Lee Performance Study of Multicore Digital Signal Processor Architectures
CN103327332B (en) The implementation method of 8 × 8IDCT conversion in a kind of HEVC standard
CN107491288B (en) Data processing method and device based on single instruction multiple data stream structure
Wakatani Improvement of adaptive fractal image coding on GPUs
Jin et al. NOP compression scheme for high speed DSPs based on VLIW architecture
Amiri Performance Improvement of Multimedia Kernels Using Data-and Thread-Level Parallelism on CPU Platform
Benterki et al. Improving complexity of Karmarkar's approach for linear programming
Moradifar et al. Performance improvement of multimedia Kernels using data-and thread-level parallelism on CPU platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150408