CN104503732A - One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor - Google Patents

One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Info

Publication number
CN104503732A
Authority
CN
China
Prior art keywords
vec
fix
value
byte
designated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410835382.XA
Other languages
Chinese (zh)
Inventor
吴玲达
王宇
吕雅帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PLA Equipment College
Original Assignee
PLA Equipment College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PLA Equipment College filed Critical PLA Equipment College
Priority to CN201410835382.XA priority Critical patent/CN104503732A/en
Publication of CN104503732A publication Critical patent/CN104503732A/en
Priority to PCT/CN2015/081035 priority patent/WO2016107083A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a one-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for the Feiteng processor. The method uses the sub-word parallel instructions of the Feiteng processor to parallelize and optimize the one-dimensional eight-point IDCT computation, so that the many structurally identical multiply and add operations in the one-dimensional eight-point IDCT are carried out by a small number of sub-word parallel instructions. The advantage of the method is that the sub-word parallel instructions of the Feiteng processor parallelize the one-dimensional eight-point IDCT and thereby improve the efficiency of the IDCT computation.

Description

A one-dimensional eight-point IDCT parallelization method for the Feiteng processor
Technical field
The present invention relates to a method for parallelizing the one-dimensional eight-point IDCT algorithm on the domestically developed Feiteng processor, and in particular to the IDCT algorithm used in image and video decoding programs.
Background technology
The two-dimensional 8×8 IDCT is a common transform in image and video decoding and is the most time-consuming transform computation in image decoding. To improve computational efficiency, current image and video decoding programs generally convert the two-dimensional 8×8 IDCT into multiple one-dimensional eight-point IDCT computations, as illustrated by the sketch below.
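The row-column decomposition itself is not claimed by the patent; the following minimal C sketch only illustrates it. The function name idct_2d_8x8, the short-typed block layout, and the callback idct_1d_8 are illustrative assumptions, not part of the invention.

    /* Illustrative only: a 2-D 8x8 IDCT computed as 8 row transforms
       followed by 8 column transforms, each a 1-D 8-point IDCT supplied
       by the caller through idct_1d_8. */
    typedef void (*idct_1d_fn)(short *data, int stride);

    static void idct_2d_8x8(short block[64], idct_1d_fn idct_1d_8)
    {
        int i;
        for (i = 0; i < 8; i++)
            idct_1d_8(&block[8 * i], 1);   /* transform the 8 rows    */
        for (i = 0; i < 8; i++)
            idct_1d_8(&block[i], 8);       /* transform the 8 columns */
    }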
The Feiteng processors are a family of chip multithreading (CMT) processors developed by the National University of Defense Technology. The VIS multimedia instruction set of the Feiteng series supports sub-word parallel computation. In sub-word parallel computation, a word is treated as a collection of data, and a sub-word is a lower-precision data unit contained within the word. Because a single instruction can be applied to all sub-words in a word, an operation that would otherwise require several instructions can be completed by one sub-word parallel instruction. For example, if the word length is 64 bits, a sub-word can be 8, 16, or 32 bits wide, so one instruction can process eight 8-bit sub-words, four 16-bit sub-words, or two 32-bit sub-words in parallel. The present invention uses the VIS instruction set of the Feiteng processor to realize a sub-word parallelization of the one-dimensional eight-point IDCT algorithm and thereby increase the speed of IDCT computation on the Feiteng processor.
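As a concrete illustration of sub-word parallelism, the minimal sketch below assumes GCC's SPARC VIS builtins and vector types (which the embodiment later in this description also relies on); the function name add4 is illustrative. A single fpadd16 instruction adds four 16-bit sub-words packed in one 64-bit word:

    /* Compile with a VIS-capable SPARC GCC, e.g. with -mvis.
       v4hi holds four 16-bit sub-words in one 64-bit word. */
    typedef short v4hi __attribute__((vector_size(8)));

    static v4hi add4(v4hi a, v4hi b)
    {
        /* One instruction performs four 16-bit additions in parallel,
           replacing four scalar add instructions. */
        return __builtin_vis_fpadd16(a, b);
    }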
To date, no method has been reported that uses the VIS instruction set of the Feiteng processor to realize a sub-word parallelization of the one-dimensional eight-point IDCT algorithm.
Summary of the invention
The object of the invention is to use the VIS instructions of the Feiteng processor to improve the efficiency of the one-dimensional eight-point IDCT computation.
The one-dimensional eight-point IDCT parallelization method for the Feiteng processor according to the present invention comprises the following steps:
Let x(n), n = 0, 1, 2, ..., 7, be the input of the one-dimensional eight-point IDCT and y(n), n = 0, 1, 2, ..., 7, be the output, where x(n) and y(n) are integers between 0 and 255. The one-dimensional eight-point IDCT computation can then be expressed as:
a0 = x(0)*C4 + x(2)*C2 + x(4)*C4 + x(6)*C6        y(0) = a0 + b0
a1 = x(0)*C4 + x(2)*C6 - x(4)*C4 - x(6)*C2        y(1) = a1 + b1
a2 = x(0)*C4 - x(2)*C6 - x(4)*C4 + x(6)*C2        y(2) = a2 + b2
a3 = x(0)*C4 - x(2)*C2 + x(4)*C4 - x(6)*C6        y(3) = a3 + b3
b0 = x(1)*C1 + x(3)*C3 + x(5)*C5 + x(7)*C7        y(4) = a4 - b4
b1 = x(1)*C3 - x(3)*C7 - x(5)*C1 - x(7)*C5        y(5) = a5 - b5
b2 = x(1)*C5 - x(3)*C1 + x(5)*C7 + x(7)*C3        y(6) = a6 - b6
b3 = x(1)*C7 - x(3)*C5 + x(5)*C3 - x(7)*C1        y(7) = a7 - b7
where ak and bk (k = 0, 1, 2, ..., 7) denote intermediate results and Ck (k = 1, 2, 3, ..., 7) are constants. The steps for parallelizing the one-dimensional eight-point IDCT computation with the VIS instruction set of the Feiteng processor are as follows:
(1) Multiply every Ck by 2^14 and round, and denote the result Fix_Ck (a C sketch of this constant preparation is given after this list);
(2) Use the prefix Vec4_ to denote a vector of four integers, and let Vec4_Xk = {x(k), x(k), x(k), x(k)}, k = 0, 1, 2, ..., 7. Because the value of x(k) lies between 0 and 255 it occupies only one byte, so the vector Vec4_Xk formed by the four copies of x(k) is stored in one 32-bit word;
(3) Let Vec4_C0 = {Fix_C4, Fix_C4, Fix_C4, Fix_C4},
Vec4_C1 = {Fix_C1, Fix_C3, Fix_C5, Fix_C7},
Vec4_C2 = {Fix_C2, Fix_C6, -Fix_C6, -Fix_C2},
Vec4_C3 = {Fix_C3, -Fix_C7, -Fix_C1, -Fix_C5},
Vec4_C4 = {Fix_C4, -Fix_C4, -Fix_C4, Fix_C4},
Vec4_C5 = {Fix_C5, -Fix_C1, Fix_C7, Fix_C3},
Vec4_C6 = {Fix_C6, -Fix_C2, Fix_C2, -Fix_C6},
Vec4_C7 = {Fix_C7, -Fix_C5, Fix_C3, -Fix_C1}.
Because each Fix_Ck lies between 0 and 2^14 and therefore occupies only two bytes, each Vec4_Ck is stored in one 64-bit word;
(4) Use the fmul8x16 instruction of the Feiteng processor to compute Vec4_Xk × Vec4_Ck, and denote the result Vec4_Tk, k = 0, 1, 2, ..., 7;
(5) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_T0 + Vec4_T1 + Vec4_T2 + Vec4_T3, denoted Vec4_A; similarly, use fpadd16 to compute Vec4_T4 + Vec4_T5 + Vec4_T6 + Vec4_T7, denoted Vec4_B;
(6) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_A + Vec4_B, denoted Vec4_Ya, and use the fpsub16 instruction to compute Vec4_A - Vec4_B, denoted Vec4_Yb;
(7) Use the fpack16 instruction to pack Vec4_Ya into a 32-bit word, denoted Vec4_Yap; similarly, use fpack16 to pack Vec4_Yb into a 32-bit word, denoted Vec4_Ybp. The 1st byte of Vec4_Yap is y(0), the 2nd byte is y(1), the 3rd byte is y(2), and the 4th byte is y(3); the 1st byte of Vec4_Ybp is y(4), the 2nd byte is y(5), the 3rd byte is y(6), and the 4th byte is y(7).
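The following is a minimal C sketch of how steps (1) and (3) can be prepared. The patent does not spell out the values of Ck; the sketch assumes the conventional choice Ck = cos(k*pi/16), GCC's vector extension types, and illustrative names (Fix_C, Vec_C, init_idct_constants) that are not part of the patent text.

    /* Sketch of steps (1) and (3): build Fix_Ck = round(Ck * 2^14) and the
       eight coefficient vectors as four 16-bit sub-words per 64-bit word.
       Assumes Ck = cos(k*pi/16); names are illustrative. */
    #include <math.h>

    typedef short v4hi __attribute__((vector_size(8)));

    short Fix_C[8];
    v4hi  Vec_C[8];

    void init_idct_constants(void)
    {
        const double pi = 3.14159265358979323846;
        int k;

        for (k = 0; k < 8; k++)                     /* step (1): scale by 2^14 and round */
            Fix_C[k] = (short)lround(cos(k * pi / 16.0) * 16384.0);

        /* Step (3): sign and order pattern taken directly from the description. */
        Vec_C[0] = (v4hi){ Fix_C[4],  Fix_C[4],  Fix_C[4],  Fix_C[4] };
        Vec_C[1] = (v4hi){ Fix_C[1],  Fix_C[3],  Fix_C[5],  Fix_C[7] };
        Vec_C[2] = (v4hi){ Fix_C[2],  Fix_C[6], -Fix_C[6], -Fix_C[2] };
        Vec_C[3] = (v4hi){ Fix_C[3], -Fix_C[7], -Fix_C[1], -Fix_C[5] };
        Vec_C[4] = (v4hi){ Fix_C[4], -Fix_C[4], -Fix_C[4],  Fix_C[4] };
        Vec_C[5] = (v4hi){ Fix_C[5], -Fix_C[1],  Fix_C[7],  Fix_C[3] };
        Vec_C[6] = (v4hi){ Fix_C[6], -Fix_C[2],  Fix_C[2], -Fix_C[6] };
        Vec_C[7] = (v4hi){ Fix_C[7], -Fix_C[5],  Fix_C[3], -Fix_C[1] };
    }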
When the method is applied on the Feiteng FT1000 processor, the parallelized one-dimensional eight-point IDCT computation is 312.76% faster than the non-parallelized one-dimensional eight-point IDCT computation.
Embodiment
A specific embodiment of the present invention is described below, taking the Feiteng FT1000 processor as an example (the implementation is written in C and compiled with the GCC compiler); a consolidated code sketch is given after step 10:
1. Multiply every Ck by 2^14 and round, and denote the result FIX_Ck;
2. Store four identical copies of x(k) in a 32-bit variable of built-in type v4qi, denoted Vec_Xk, so that each x(k) occupies one byte;
3. Declare eight 64-bit variables of built-in type v4hi: Vec_C0, Vec_C1, Vec_C2, Vec_C3, Vec_C4, Vec_C5, Vec_C6 and Vec_C7;
4. Let Vec_C0 = {FIX_C4, FIX_C4, FIX_C4, FIX_C4}, so that each FIX_C4 occupies two bytes of Vec_C0; similarly, let Vec_C1 = {FIX_C1, FIX_C3, FIX_C5, FIX_C7},
Vec_C2={FIX_C2,FIX_C6,-FIX_C6,-FIX_C2},
Vec_C3={FIX_C3,-FIX_C7,-FIX_C1,-FIX_C5},
Vec_C4={FIX_C4,-FIX_C4,-FIX_C4,FIX_C4},
Vec_C5={FIX_C5,-FIX_C1,FIX_C7,FIX_C3},
Vec_C6={FIX_C6,-FIX_C2,FIX_C2,-FIX_C6},
Vec_C7={FIX_C7,-FIX_C5,FIX_C3,-FIX_C1};
5. Use the built-in sub-word parallel function __builtin_vis_fmul8x16 to compute Vec_Xk × Vec_Ck, denoted Vec_Tk, i.e. Vec_Tk = __builtin_vis_fmul8x16(Vec_Xk, Vec_Ck);
6. Declare two variables of built-in type v4hi: Vec_A and Vec_B;
7. Use the built-in sub-word parallel function __builtin_vis_fpadd16 to compute Vec_T0 + Vec_T1 + Vec_T2 + Vec_T3 and store the result in the variable Vec_A, i.e. complete this computation with the following code:
Vec_A = __builtin_vis_fpadd16(Vec_T0, Vec_T1);
Vec_A = __builtin_vis_fpadd16(Vec_A, Vec_T2);
Vec_A = __builtin_vis_fpadd16(Vec_A, Vec_T3);
8. Similarly, use the built-in sub-word parallel function __builtin_vis_fpadd16 to compute Vec_T4 + Vec_T5 + Vec_T6 + Vec_T7 and store the result in the variable Vec_B;
9. Use the built-in sub-word parallel function __builtin_vis_fpadd16 to compute Vec_A + Vec_B and store the result in the variable Vec_Ya of built-in type v4hi; similarly, use the built-in sub-word parallel function __builtin_vis_fpsub16 to compute Vec_A - Vec_B and store the result in the variable Vec_Yb of built-in type v4hi;
10. Use the built-in function __builtin_vis_fpack16 to pack Vec_Ya into a 32-bit variable of type v4qi, denoted Vec_YPa, i.e. Vec_YPa = __builtin_vis_fpack16(Vec_Ya); similarly, pack Vec_Yb into a 32-bit variable of type v4qi, denoted Vec_YPb. The 1st byte of Vec_YPa is y(0), the 2nd byte is y(1), the 3rd byte is y(2) and the 4th byte is y(3); the 1st byte of Vec_YPb is y(4), the 2nd byte is y(5), the 3rd byte is y(6) and the 4th byte is y(7).
Tested on a computer equipped with the Feiteng FT1000 processor, the parallelized one-dimensional eight-point IDCT computation is 312.76% faster than the non-parallelized one-dimensional eight-point IDCT computation.

Claims (1)

1. A one-dimensional eight-point IDCT parallelization method for the Feiteng processor, characterized in that the parallelization method comprises the following steps:
Let x(n), n = 0, 1, 2, ..., 7, be the input of the one-dimensional eight-point IDCT and y(n), n = 0, 1, 2, ..., 7, be the output, where x(n) and y(n) are integers between 0 and 255. The one-dimensional eight-point IDCT computation can then be expressed as:
a0 = x(0)*C4 + x(2)*C2 + x(4)*C4 + x(6)*C6        y(0) = a0 + b0
a1 = x(0)*C4 + x(2)*C6 - x(4)*C4 - x(6)*C2        y(1) = a1 + b1
a2 = x(0)*C4 - x(2)*C6 - x(4)*C4 + x(6)*C2        y(2) = a2 + b2
a3 = x(0)*C4 - x(2)*C2 + x(4)*C4 - x(6)*C6        y(3) = a3 + b3
b0 = x(1)*C1 + x(3)*C3 + x(5)*C5 + x(7)*C7        y(4) = a4 - b4
b1 = x(1)*C3 - x(3)*C7 - x(5)*C1 - x(7)*C5        y(5) = a5 - b5
b2 = x(1)*C5 - x(3)*C1 + x(5)*C7 + x(7)*C3        y(6) = a6 - b6
b3 = x(1)*C7 - x(3)*C5 + x(5)*C3 - x(7)*C1        y(7) = a7 - b7
where ak and bk (k = 0, 1, 2, ..., 7) denote intermediate results and Ck (k = 1, 2, 3, ..., 7) are constants. The steps for parallelizing the one-dimensional eight-point IDCT computation with the VIS instruction set of the Feiteng processor are as follows:
(1) Multiply every Ck by 2^14 and round, and denote the result Fix_Ck;
(2) Use the prefix Vec4_ to denote a vector of four integers, and let Vec4_Xk = {x(k), x(k), x(k), x(k)}, k = 0, 1, 2, ..., 7. Because the value of x(k) lies between 0 and 255 it occupies only one byte, so the vector Vec4_Xk formed by the four copies of x(k) is stored in one 32-bit word;
(3) Let Vec4_C0 = {Fix_C4, Fix_C4, Fix_C4, Fix_C4},
Vec4_C1 = {Fix_C1, Fix_C3, Fix_C5, Fix_C7},
Vec4_C2 = {Fix_C2, Fix_C6, -Fix_C6, -Fix_C2},
Vec4_C3 = {Fix_C3, -Fix_C7, -Fix_C1, -Fix_C5},
Vec4_C4 = {Fix_C4, -Fix_C4, -Fix_C4, Fix_C4},
Vec4_C5 = {Fix_C5, -Fix_C1, Fix_C7, Fix_C3},
Vec4_C6 = {Fix_C6, -Fix_C2, Fix_C2, -Fix_C6},
Vec4_C7 = {Fix_C7, -Fix_C5, Fix_C3, -Fix_C1}.
Because each Fix_Ck lies between 0 and 2^14 and therefore occupies only two bytes, each Vec4_Ck is stored in one 64-bit word;
(4) Use the fmul8x16 instruction of the Feiteng processor to compute Vec4_Xk × Vec4_Ck, and denote the result Vec4_Tk, k = 0, 1, 2, ..., 7;
(5) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_T0 + Vec4_T1 + Vec4_T2 + Vec4_T3, denoted Vec4_A; similarly, use fpadd16 to compute Vec4_T4 + Vec4_T5 + Vec4_T6 + Vec4_T7, denoted Vec4_B;
(6) Use the fpadd16 instruction of the Feiteng processor to compute Vec4_A + Vec4_B, denoted Vec4_Ya, and use the fpsub16 instruction to compute Vec4_A - Vec4_B, denoted Vec4_Yb;
(7) Use the fpack16 instruction to pack Vec4_Ya into a 32-bit word, denoted Vec4_Yap; similarly, use fpack16 to pack Vec4_Yb into a 32-bit word, denoted Vec4_Ybp. The 1st byte of Vec4_Yap is y(0), the 2nd byte is y(1), the 3rd byte is y(2), and the 4th byte is y(3); the 1st byte of Vec4_Ybp is y(4), the 2nd byte is y(5), the 3rd byte is y(6), and the 4th byte is y(7).
CN201410835382.XA 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor Pending CN104503732A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410835382.XA CN104503732A (en) 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor
PCT/CN2015/081035 WO2016107083A1 (en) 2014-12-30 2015-06-09 One-dimensional eight-point idct parallelism method for feiteng processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410835382.XA CN104503732A (en) 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Publications (1)

Publication Number Publication Date
CN104503732A true CN104503732A (en) 2015-04-08

Family

ID=52945133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410835382.XA Pending CN104503732A (en) 2014-12-30 2014-12-30 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Country Status (2)

Country Link
CN (1) CN104503732A (en)
WO (1) WO2016107083A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016107083A1 (en) * 2014-12-30 2016-07-07 中国人民解放军装备学院 One-dimensional eight-point idct parallelism method for feiteng processor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102804165A (en) * 2009-02-11 2012-11-28 四次方有限公司 Front end processor with extendable data path
CN103079079A (en) * 2013-01-23 2013-05-01 中国人民解放军装备学院 Subword parallel method for color spatial transformation
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5452466A (en) * 1993-05-11 1995-09-19 Teknekron Communications Systems, Inc. Method and apparatus for preforming DCT and IDCT transforms on data signals with a preprocessor, a post-processor, and a controllable shuffle-exchange unit connected between the pre-processor and post-processor
CN101694627B (en) * 2009-10-23 2013-09-11 天津大学 Compiler system based on TCore configurable processor
CN104503732A (en) * 2014-12-30 2015-04-08 中国人民解放军装备学院 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102804165A (en) * 2009-02-11 2012-11-28 四次方有限公司 Front end processor with extendable data path
CN103079079A (en) * 2013-01-23 2013-05-01 中国人民解放军装备学院 Subword parallel method for color spatial transformation
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIYI XIAO, HAI HUANG: "A Novel CORDIC Based Unified Architecture for DCT and IDCT", 2012 International Conference on Optoelectronics and Microelectronics (ICOM) *
李京, 沈泊: "Design of a Low-Power 2D DCT/IDCT Processor" (一种低功耗2D DCT/IDCT处理器设计), Research and Progress of Solid State Electronics (固体电子学研究与进展) *
胡小涛, 梁利平: "Optimization and Implementation of an Integer 2D-IDCT Algorithm" (整型2D-IDCT算法的优化与实现), Video Engineering (电视技术) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016107083A1 (en) * 2014-12-30 2016-07-07 中国人民解放军装备学院 One-dimensional eight-point idct parallelism method for feiteng processor

Also Published As

Publication number Publication date
WO2016107083A1 (en) 2016-07-07

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
CN111213125B (en) Efficient direct convolution using SIMD instructions
Cho et al. MEC: Memory-efficient convolution for deep neural network
US20180107630A1 (en) Processor and method for executing matrix multiplication operation on processor
FI4099168T3 (en) Compute optimizations for low precision machine learning operations
CN102436438B (en) Sparse matrix data storage method based on ground power unit (GPU)
US8539201B2 (en) Transposing array data on SIMD multi-core processor architectures
US20180203673A1 (en) Execution of computation graphs
CN106846235B (en) Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instruction
US9436465B2 (en) Moving average processing in processor and processor
US20190042195A1 (en) Scalable memory-optimized hardware for matrix-solve
JP2017517082A (en) Parallel decision tree processor architecture
US9934199B2 (en) Digital filter device, digital filtering method, and storage medium having digital filter program stored thereon
US9785614B2 (en) Fast Fourier transform device, fast Fourier transform method, and recording medium storing fast Fourier transform program
CN104503732A (en) One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor
Park et al. A highly parallelized decoder for random network coding leveraging GPGPU
CN114117896A (en) Method and system for realizing binary protocol optimization for ultra-long SIMD pipeline
Lee Performance Study of Multicore Digital Signal Processor Architectures
CN103327332B (en) The implementation method of 8 × 8IDCT conversion in a kind of HEVC standard
CN107491288B (en) Data processing method and device based on single instruction multiple data stream structure
Wakatani Improvement of adaptive fractal image coding on GPUs
Jin et al. NOP compression scheme for high speed DSPs based on VLIW architecture
Amiri Performance Improvement of Multimedia Kernels Using Data-and Thread-Level Parallelism on CPU Platform
Benterki et al. Improving complexity of Karmarkar's approach for linear programming
Moradifar et al. Performance improvement of multimedia Kernels using data-and thread-level parallelism on CPU platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150408