CN104503732A - One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor - Google Patents
- Publication number
- CN104503732A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention relates to a one-dimensional eight-point IDCT (inverse discrete cosine transform) parallelization method for the Feiteng processor. The method uses the subword-parallel instructions of the Feiteng processor to parallelize the one-dimensional eight-point IDCT: the groups of multiplications and additions that share the same structure in the transform are carried out by subword-parallel instructions. Parallelizing the one-dimensional eight-point IDCT with these instructions improves the computational efficiency of the IDCT.
Description
Technical field
The present invention relates to a method for parallelizing the one-dimensional 8-point IDCT algorithm on the domestically developed Feiteng processor, and in particular to the IDCT algorithm used in image and video decoding programs.
Background art
The two-dimensional 8×8 IDCT is a commonly used transform in image and video decoding, and it is the most time-consuming transform in image decoding. To improve computational efficiency, current image and video decoding programs generally convert the two-dimensional 8×8 IDCT into multiple one-dimensional 8-point IDCTs.
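The conversion of a two-dimensional 8×8 transform into one-dimensional 8-point passes can be sketched as follows. This is a minimal illustration that assumes the usual row-column decomposition (rows first, then columns); the text does not state the pass order. The names `transform8_fn`, `transform_8x8_separable` and `demo_prefix_sum8` are hypothetical, and `demo_prefix_sum8` is only a placeholder 1-D transform used to make the data flow visible; it is not an IDCT.

```c
#include <stddef.h>

/* Hypothetical signature for any one-dimensional 8-point transform. */
typedef void (*transform8_fn)(const double in[8], double out[8]);

/* Placeholder 1-D transform (running sum); NOT an IDCT, illustration only. */
void demo_prefix_sum8(const double in[8], double out[8])
{
    double acc = 0.0;
    for (size_t i = 0; i < 8; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}

/* Apply a 1-D 8-point transform to every row, then to every column,
   of an 8x8 block, overwriting the block in place. */
void transform_8x8_separable(double block[8][8], transform8_fn transform1d)
{
    double col[8], out[8];

    /* Pass 1: eight 1-D transforms, one per row. */
    for (size_t r = 0; r < 8; ++r) {
        transform1d(block[r], out);
        for (size_t c = 0; c < 8; ++c)
            block[r][c] = out[c];
    }

    /* Pass 2: eight 1-D transforms, one per column. */
    for (size_t c = 0; c < 8; ++c) {
        for (size_t r = 0; r < 8; ++r)
            col[r] = block[r][c];
        transform1d(col, out);
        for (size_t r = 0; r < 8; ++r)
            block[r][c] = out[r];
    }
}
```

With this structure one 8×8 block costs sixteen 1-D transforms, which is why speeding up the 1-D 8-point IDCT speeds up the whole 2-D computation.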
The Feiteng processor is a family of chip multithreading (CMT) processors developed by the National University of Defense Technology. The VIS multimedia instruction set of the Feiteng series supports subword-parallel computation. In subword-parallel computation, a word is treated as a set of data, and a subword is a lower-precision data unit contained in the word. Because a single instruction can be applied to all subwords of a word at once, an operation that would otherwise take several instructions can be completed with one subword-parallel instruction. For example, if a word is 64 bits long, a subword can be 8, 16, or 32 bits, so one instruction can process eight 8-bit subwords, four 16-bit subwords, or two 32-bit subwords in parallel. The present invention uses the VIS instruction set of the Feiteng processor to implement subword parallelization of the one-dimensional 8-point IDCT algorithm, thereby increasing the speed of IDCT computation on the Feiteng processor.
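To make the idea of subwords concrete, the following portable C sketch treats a 64-bit word as four independent 16-bit subwords and adds two such words lane by lane. It is a software model written for illustration only; the function name `padd16_emulated` is hypothetical, and on real hardware a single subword-parallel instruction would replace the loop.

```c
#include <stdint.h>

/* Add four 16-bit subwords packed in a 64-bit word, lane by lane.
   A carry never crosses a lane boundary: each lane wraps modulo 2^16. */
uint64_t padd16_emulated(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t sum = (uint16_t)(x + y);          /* wraps inside the lane */
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}
```

The difference from an ordinary 64-bit addition is visible when a lane overflows: the overflow is discarded inside that lane instead of carrying into its neighbour.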
To date, no method has been reported that uses the VIS instruction set of the Feiteng processor to implement subword parallelization of the one-dimensional 8-point IDCT algorithm.
Summary of the invention
The object of the invention is to use the VIS instructions of the Feiteng processor to improve the efficiency of the one-dimensional 8-point IDCT computation.
The one-dimensional 8-point IDCT parallelization method for the Feiteng processor according to the invention comprises the following steps:
Let x(n), n = 0, 1, 2, ..., 7, be the input of the one-dimensional 8-point IDCT and y(n), n = 0, 1, 2, ..., 7, be the output, where x(n) and y(n) are integers between 0 and 255. The one-dimensional 8-point IDCT can then be expressed as:

$$
\begin{aligned}
a_0 &= x(0)\cdot C_4 + x(2)\cdot C_2 + x(4)\cdot C_4 + x(6)\cdot C_6, & y(0) &= a_0 + b_0 \\
a_1 &= x(0)\cdot C_4 + x(2)\cdot C_6 - x(4)\cdot C_4 - x(6)\cdot C_2, & y(1) &= a_1 + b_1 \\
a_2 &= x(0)\cdot C_4 - x(2)\cdot C_6 - x(4)\cdot C_4 + x(6)\cdot C_2, & y(2) &= a_2 + b_2 \\
a_3 &= x(0)\cdot C_4 - x(2)\cdot C_2 + x(4)\cdot C_4 - x(6)\cdot C_6, & y(3) &= a_3 + b_3 \\
b_0 &= x(1)\cdot C_1 + x(3)\cdot C_3 + x(5)\cdot C_5 + x(7)\cdot C_7, & y(4) &= a_4 - b_4 \\
b_1 &= x(1)\cdot C_3 - x(3)\cdot C_7 - x(5)\cdot C_1 - x(7)\cdot C_5, & y(5) &= a_5 - b_5 \\
b_2 &= x(1)\cdot C_5 - x(3)\cdot C_1 + x(5)\cdot C_7 + x(7)\cdot C_3, & y(6) &= a_6 - b_6 \\
b_3 &= x(1)\cdot C_7 - x(3)\cdot C_5 + x(5)\cdot C_3 - x(7)\cdot C_1, & y(7) &= a_7 - b_7
\end{aligned}
$$

where $a_k$ and $b_k$ denote intermediate results, k = 0, 1, 2, ..., 7, and $C_k$ are constants, k = 1, 2, 3, ..., 7. The steps for parallelizing the one-dimensional 8-point IDCT computation with the VIS instruction set of the Feiteng processor are as follows:
(1) Multiply every C_k by 2^14 and round the result to an integer, denoted Fix_C_k;
(2) Let the prefix Vec4_ denote a vector of 4 integers, and let Vec4_X_k = {x(k), x(k), x(k), x(k)}, k = 0, 1, 2, ..., 7. Because the value of x(k) lies between 0 and 255, it occupies only one byte, so the vector Vec4_X_k formed from four copies of x(k) is stored in one 32-bit word;
(3) Let
Vec4_C_0 = {Fix_C_4, Fix_C_4, Fix_C_4, Fix_C_4},
Vec4_C_1 = {Fix_C_1, Fix_C_3, Fix_C_5, Fix_C_7},
Vec4_C_2 = {Fix_C_2, Fix_C_6, -Fix_C_6, -Fix_C_2},
Vec4_C_3 = {Fix_C_3, -Fix_C_7, -Fix_C_1, -Fix_C_5},
Vec4_C_4 = {Fix_C_4, -Fix_C_4, -Fix_C_4, Fix_C_4},
Vec4_C_5 = {Fix_C_5, -Fix_C_1, Fix_C_7, Fix_C_3},
Vec4_C_6 = {Fix_C_6, -Fix_C_2, Fix_C_2, -Fix_C_6},
Vec4_C_7 = {Fix_C_7, -Fix_C_5, Fix_C_3, -Fix_C_1}.
Because the value of Fix_C_k lies between 0 and 2^14, it occupies only two bytes, so each Vec4_C_k is stored in one 64-bit word;
(4) Use the fmul8x16 instruction of the Feiteng processor to compute Vec4_X_k × Vec4_C_k, denoted Vec4_T_k, k = 0, 1, 2, ..., 7;
(5) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_T_0 + Vec4_T_1 + Vec4_T_2 + Vec4_T_3, denoted Vec4_A; similarly, use fpadd16 to compute Vec4_T_4 + Vec4_T_5 + Vec4_T_6 + Vec4_T_7, denoted Vec4_B;
(6) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_A + Vec4_B, denoted Vec4_Y_a, and use the fpsub16 instruction to compute Vec4_A - Vec4_B, denoted Vec4_Y_b;
(7) Use the fpack16 instruction to pack Vec4_Y_a into a 32-bit word, denoted Vec4_Y_ap; similarly, use fpack16 to pack Vec4_Y_b into a 32-bit word, denoted Vec4_Y_bp. The 1st byte of Vec4_Y_ap holds y(0), the 2nd byte y(1), the 3rd byte y(2), and the 4th byte y(3); the 1st byte of Vec4_Y_bp holds y(4), the 2nd byte y(5), the 3rd byte y(6), and the 4th byte y(7).
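As a cross-check for the vector steps, the equations for a_k and b_k can be written as a plain scalar routine. The sketch below is an illustration only: the function name `idct8_scalar_reference` is hypothetical, the constants are passed in by the caller (so either C_k or the fixed-point Fix_C_k can be used), and the routine returns the four lane-wise sums a_k + b_k and the four lane-wise differences a_k - b_k for k = 0..3, which the text assigns to y(0)..y(3) and y(4)..y(7) respectively.

```c
/* Scalar reference for the a_k / b_k equations.
   x[0..7] : inputs x(0)..x(7)
   C[1..7] : constants C_1..C_7 (C[0] is unused)
   sum[k]  = a_k + b_k, diff[k] = a_k - b_k, for k = 0..3 */
void idct8_scalar_reference(const int x[8], const int C[8],
                            int sum[4], int diff[4])
{
    int a[4], b[4];

    /* Even-indexed inputs feed a_0..a_3. */
    a[0] = x[0] * C[4] + x[2] * C[2] + x[4] * C[4] + x[6] * C[6];
    a[1] = x[0] * C[4] + x[2] * C[6] - x[4] * C[4] - x[6] * C[2];
    a[2] = x[0] * C[4] - x[2] * C[6] - x[4] * C[4] + x[6] * C[2];
    a[3] = x[0] * C[4] - x[2] * C[2] + x[4] * C[4] - x[6] * C[6];

    /* Odd-indexed inputs feed b_0..b_3. */
    b[0] = x[1] * C[1] + x[3] * C[3] + x[5] * C[5] + x[7] * C[7];
    b[1] = x[1] * C[3] - x[3] * C[7] - x[5] * C[1] - x[7] * C[5];
    b[2] = x[1] * C[5] - x[3] * C[1] + x[5] * C[7] + x[7] * C[3];
    b[3] = x[1] * C[7] - x[3] * C[5] + x[5] * C[3] - x[7] * C[1];

    for (int k = 0; k < 4; ++k) {
        sum[k]  = a[k] + b[k];
        diff[k] = a[k] - b[k];
    }
}
```

Reading the coefficients of one input x(k) down the four rows gives exactly the four lanes of the corresponding constant vector; for example, x(2) contributes C_2, C_6, -C_6, -C_2 to a_0..a_3, matching {Fix_C_2, Fix_C_6, -Fix_C_6, -Fix_C_2}.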
When the method is applied on the Feiteng FT1000 processor, the parallelized one-dimensional 8-point IDCT computation is 312.76% faster than the non-parallel one-dimensional 8-point IDCT computation.
Embodiment
Taking the Feiteng FT1000 processor as an example, a specific embodiment of the invention is described below (the implementation is written in C and compiled with GCC):
1. Multiply every C_k by 2^14 and round the result to an integer, denoted FIX_Ck;
2. Store four identical copies of x(k) in a 32-bit variable of built-in vector type v4qi, denoted Vec_Xk, so that each x(k) occupies one byte;
3. Declare eight 64-bit variables of built-in vector type v4hi: Vec_C0, Vec_C1, Vec_C2, Vec_C3, Vec_C4, Vec_C5, Vec_C6 and Vec_C7;
4. Let Vec_C0={FIX_C4,FIX_C4,FIX_C4,FIX_C4}, so that each FIX_C4 occupies two bytes of Vec_C0; similarly, let
Vec_C1={FIX_C1,FIX_C3,FIX_C5,FIX_C7},
Vec_C2={FIX_C2,FIX_C6,-FIX_C6,-FIX_C2},
Vec_C3={FIX_C3,-FIX_C7,-FIX_C1,-FIX_C5},
Vec_C4={FIX_C4,-FIX_C4,-FIX_C4,FIX_C4},
Vec_C5={FIX_C5,-FIX_C1,FIX_C7,FIX_C3},
Vec_C6={FIX_C6,-FIX_C2,FIX_C2,-FIX_C6},
Vec_C7={FIX_C7,-FIX_C5,FIX_C3,-FIX_C1};
5. Use the built-in subword-parallel function __builtin_vis_fmul8x16 to compute Vec_Xk × Vec_Ck, denoted Vec_Tk, i.e. Vec_Tk=__builtin_vis_fmul8x16(Vec_Xk,Vec_Ck);
6. Declare two variables of built-in vector type v4hi: Vec_A and Vec_B;
7. Use the built-in subword-parallel function __builtin_vis_fpadd16 to compute Vec_T0+Vec_T1+Vec_T2+Vec_T3 and store the result in the variable Vec_A, i.e. perform the computation with the following code:
Vec_A=__builtin_vis_fpadd16(Vec_T0,Vec_T1);
Vec_A=__builtin_vis_fpadd16(Vec_A,Vec_T2);
Vec_A=__builtin_vis_fpadd16(Vec_A,Vec_T3);
8. Similarly, use the built-in subword-parallel function __builtin_vis_fpadd16 to compute Vec_T4+Vec_T5+Vec_T6+Vec_T7 and store the result in the variable Vec_B;
9. Use the built-in subword-parallel function __builtin_vis_fpadd16 to compute Vec_A+Vec_B and store the result in the variable Vec_Ya of built-in vector type v4hi; similarly, use the built-in subword-parallel function __builtin_vis_fpsub16 to compute Vec_A-Vec_B and store the result in the variable Vec_Yb of built-in vector type v4hi;
10. Use the built-in function __builtin_vis_fpack16 to pack Vec_Ya into a 32-bit variable of type v4qi, denoted Vec_YPa, i.e. Vec_YPa=__builtin_vis_fpack16(Vec_Ya); similarly, pack Vec_Yb into a 32-bit variable of type v4qi, denoted Vec_YPb. The 1st byte of Vec_YPa holds y(0), the 2nd byte y(1), the 3rd byte y(2), and the 4th byte y(3); the 1st byte of Vec_YPb holds y(4), the 2nd byte y(5), the 3rd byte y(6), and the 4th byte y(7).
In a test of this example on a computer equipped with a Feiteng FT1000 processor, the parallelized one-dimensional 8-point IDCT computation was 312.76% faster than the non-parallel one-dimensional 8-point IDCT computation.
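The fixed-point scaling in steps 5 and 10 can be followed with a portable software model of the instructions involved. The sketch below is not the VIS hardware and does not use the GCC built-ins; every `emu_*` function, the `vec4_*` types and `scale_and_pack` are hypothetical stand-ins. It assumes that fmul8x16 keeps the upper 16 bits of the rounded 24-bit product, and that fpack16 shifts each lane left by a scale factor, drops 7 fraction bits by truncation, and clips to 0..255. The scale factor itself is not given in the text; a value of 1 is assumed here because a constant scaled by 2^14 is reduced by 2^8 in the multiply, leaving 2^6 to be removed when packing.

```c
#include <stdint.h>

/* Portable stand-ins for the 32-bit v4qi and 64-bit v4hi vector types. */
typedef struct { uint8_t b[4]; } vec4_u8;
typedef struct { int16_t h[4]; } vec4_i16;

/* floor(v / 2^s) without relying on right shifts of negative values. */
static int32_t floor_shift(int32_t v, int s)
{
    return v >= 0 ? v >> s : -((-v + (1 << s) - 1) >> s);
}

/* Assumed model of fmul8x16: unsigned 8-bit times signed 16-bit,
   rounded, keeping the upper 16 bits of the 24-bit product. */
vec4_i16 emu_fmul8x16(vec4_u8 x, vec4_i16 c)
{
    vec4_i16 t;
    for (int i = 0; i < 4; ++i)
        t.h[i] = (int16_t)floor_shift((int32_t)x.b[i] * c.h[i] + 128, 8);
    return t;
}

/* Model of fpadd16 / fpsub16: lane-wise 16-bit add and subtract. */
vec4_i16 emu_fpadd16(vec4_i16 p, vec4_i16 q)
{
    vec4_i16 r;
    for (int i = 0; i < 4; ++i)
        r.h[i] = (int16_t)(p.h[i] + q.h[i]);
    return r;
}

vec4_i16 emu_fpsub16(vec4_i16 p, vec4_i16 q)
{
    vec4_i16 r;
    for (int i = 0; i < 4; ++i)
        r.h[i] = (int16_t)(p.h[i] - q.h[i]);
    return r;
}

/* Assumed model of fpack16: scale, drop 7 fraction bits, clip to 0..255. */
vec4_u8 emu_fpack16(vec4_i16 v, int scale)
{
    vec4_u8 out;
    for (int i = 0; i < 4; ++i) {
        int32_t s = floor_shift((int32_t)v.h[i] * (1 << scale), 7);
        out.b[i] = (uint8_t)(s < 0 ? 0 : (s > 255 ? 255 : s));
    }
    return out;
}

/* Steps 5 and 10 for one input: replicate x, multiply by a constant
   vector, and pack the four products back into bytes. */
vec4_u8 scale_and_pack(uint8_t x, vec4_i16 c, int scale)
{
    vec4_u8 vx = {{x, x, x, x}};
    return emu_fpack16(emu_fmul8x16(vx, c), scale);
}
```

For instance, under these assumptions an input of 200 multiplied by the fixed-point constant 8192 (that is, 0.5 × 2^14) packs to 100, and a negative product clips to 0, which is why the sums and differences are formed in 16-bit lanes before packing.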
Claims (1)
1. A one-dimensional 8-point IDCT parallelization method for the Feiteng processor, characterized in that the steps of the method are as follows:
Let x(n), n = 0, 1, 2, ..., 7, be the input of the one-dimensional 8-point IDCT and y(n), n = 0, 1, 2, ..., 7, be the output, where x(n) and y(n) are integers between 0 and 255. The one-dimensional 8-point IDCT can then be expressed as:

$$
\begin{aligned}
a_0 &= x(0)\cdot C_4 + x(2)\cdot C_2 + x(4)\cdot C_4 + x(6)\cdot C_6, & y(0) &= a_0 + b_0 \\
a_1 &= x(0)\cdot C_4 + x(2)\cdot C_6 - x(4)\cdot C_4 - x(6)\cdot C_2, & y(1) &= a_1 + b_1 \\
a_2 &= x(0)\cdot C_4 - x(2)\cdot C_6 - x(4)\cdot C_4 + x(6)\cdot C_2, & y(2) &= a_2 + b_2 \\
a_3 &= x(0)\cdot C_4 - x(2)\cdot C_2 + x(4)\cdot C_4 - x(6)\cdot C_6, & y(3) &= a_3 + b_3 \\
b_0 &= x(1)\cdot C_1 + x(3)\cdot C_3 + x(5)\cdot C_5 + x(7)\cdot C_7, & y(4) &= a_4 - b_4 \\
b_1 &= x(1)\cdot C_3 - x(3)\cdot C_7 - x(5)\cdot C_1 - x(7)\cdot C_5, & y(5) &= a_5 - b_5 \\
b_2 &= x(1)\cdot C_5 - x(3)\cdot C_1 + x(5)\cdot C_7 + x(7)\cdot C_3, & y(6) &= a_6 - b_6 \\
b_3 &= x(1)\cdot C_7 - x(3)\cdot C_5 + x(5)\cdot C_3 - x(7)\cdot C_1, & y(7) &= a_7 - b_7
\end{aligned}
$$

where $a_k$ and $b_k$ denote intermediate results, k = 0, 1, 2, ..., 7, and $C_k$ are constants, k = 1, 2, 3, ..., 7. The steps for parallelizing the one-dimensional 8-point IDCT computation with the VIS instruction set of the Feiteng processor are as follows:
(1) Multiply every C_k by 2^14 and round the result to an integer, denoted Fix_C_k;
(2) Let the prefix Vec4_ denote a vector of 4 integers, and let Vec4_X_k = {x(k), x(k), x(k), x(k)}, k = 0, 1, 2, ..., 7. Because the value of x(k) lies between 0 and 255, it occupies only one byte, so the vector Vec4_X_k formed from four copies of x(k) is stored in one 32-bit word;
(3) Let
Vec4_C_0 = {Fix_C_4, Fix_C_4, Fix_C_4, Fix_C_4},
Vec4_C_1 = {Fix_C_1, Fix_C_3, Fix_C_5, Fix_C_7},
Vec4_C_2 = {Fix_C_2, Fix_C_6, -Fix_C_6, -Fix_C_2},
Vec4_C_3 = {Fix_C_3, -Fix_C_7, -Fix_C_1, -Fix_C_5},
Vec4_C_4 = {Fix_C_4, -Fix_C_4, -Fix_C_4, Fix_C_4},
Vec4_C_5 = {Fix_C_5, -Fix_C_1, Fix_C_7, Fix_C_3},
Vec4_C_6 = {Fix_C_6, -Fix_C_2, Fix_C_2, -Fix_C_6},
Vec4_C_7 = {Fix_C_7, -Fix_C_5, Fix_C_3, -Fix_C_1}.
Because the value of Fix_C_k lies between 0 and 2^14, it occupies only two bytes, so each Vec4_C_k is stored in one 64-bit word;
(4) Use the fmul8x16 instruction of the Feiteng processor to compute Vec4_X_k × Vec4_C_k, denoted Vec4_T_k, k = 0, 1, 2, ..., 7;
(5) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_T_0 + Vec4_T_1 + Vec4_T_2 + Vec4_T_3, denoted Vec4_A; similarly, use fpadd16 to compute Vec4_T_4 + Vec4_T_5 + Vec4_T_6 + Vec4_T_7, denoted Vec4_B;
(6) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_A + Vec4_B, denoted Vec4_Y_a, and use the fpsub16 instruction to compute Vec4_A - Vec4_B, denoted Vec4_Y_b;
(7) Use the fpack16 instruction to pack Vec4_Y_a into a 32-bit word, denoted Vec4_Y_ap; similarly, use fpack16 to pack Vec4_Y_b into a 32-bit word, denoted Vec4_Y_bp. The 1st byte of Vec4_Y_ap holds y(0), the 2nd byte y(1), the 3rd byte y(2), and the 4th byte y(3); the 1st byte of Vec4_Y_bp holds y(4), the 2nd byte y(5), the 3rd byte y(6), and the 4th byte y(7).
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410835382.XA CN104503732A (en) | 2014-12-30 | 2014-12-30 | One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor |
PCT/CN2015/081035 WO2016107083A1 (en) | 2014-12-30 | 2015-06-09 | One-dimensional eight-point idct parallelism method for feiteng processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410835382.XA CN104503732A (en) | 2014-12-30 | 2014-12-30 | One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104503732A true CN104503732A (en) | 2015-04-08 |
Family
ID=52945133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410835382.XA Pending CN104503732A (en) | 2014-12-30 | 2014-12-30 | One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104503732A (en) |
WO (1) | WO2016107083A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016107083A1 (en) * | 2014-12-30 | 2016-07-07 | 中国人民解放军装备学院 | One-dimensional eight-point idct parallelism method for feiteng processor |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102804165A (en) * | 2009-02-11 | 2012-11-28 | 四次方有限公司 | Front end processor with extendable data path |
CN103079079A (en) * | 2013-01-23 | 2013-05-01 | 中国人民解放军装备学院 | Subword parallel method for color spatial transformation |
CN103984528A (en) * | 2014-05-15 | 2014-08-13 | 中国人民解放军国防科学技术大学 | Multithread concurrent data compression method based on FT processor platform |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5452466A (en) * | 1993-05-11 | 1995-09-19 | Teknekron Communications Systems, Inc. | Method and apparatus for preforming DCT and IDCT transforms on data signals with a preprocessor, a post-processor, and a controllable shuffle-exchange unit connected between the pre-processor and post-processor |
CN101694627B (en) * | 2009-10-23 | 2013-09-11 | 天津大学 | Compiler system based on TCore configurable processor |
CN104503732A (en) * | 2014-12-30 | 2015-04-08 | 中国人民解放军装备学院 | One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor |
-
2014
- 2014-12-30 CN CN201410835382.XA patent/CN104503732A/en active Pending
-
2015
- 2015-06-09 WO PCT/CN2015/081035 patent/WO2016107083A1/en active Application Filing
Non-Patent Citations (3)
Title |
---|
LIYI XIAO, HAI HUANG: "A Novel CORDIC Based Unified Architecture for DCT and IDCT", 《212 INTERNATIONAL CONFERENCE ON OPTOELECTRONICS AND MICROELECTRONICS(ICOM)》 * |
李京,沈泊: "一种低功耗2D DCT/IDCT处理器设计", 《固体电子学研究与进展》 * |
胡小涛,梁利平: "整型2D-IDCT算法的优化与实现", 《电视技术》 * |
Also Published As
Publication number | Publication date |
---|---|
WO2016107083A1 (en) | 2016-07-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20150408 |