CN102857756A

CN102857756A - Transform coder suitable for the High Efficiency Video Coding (HEVC) standard

Info

Publication number: CN102857756A
Application number: CN2012102511159A
Authority: CN
Inventors: 李甫; 樊春晓; 石光明; 张犁; 周蕾蕾; 林杰; 杨海舟; 董伟生; 王晓甜
Original assignee: Xidian University
Current assignee: Chongqing Institute Of Integrated Circuit Innovation Xi'an University Of Electronic Science And Technology
Priority date: 2012-07-19
Filing date: 2012-07-19
Publication date: 2013-01-02
Anticipated expiration: 2032-07-19
Also published as: CN102857756B

Abstract

The invention discloses a transform coder suitable for the High Efficiency Video Coding (HEVC) standard. It mainly addresses two problems in the prior art: transform blocks come in different sizes, and too many multipliers are used. The transform coder comprises a one-dimensional discrete cosine transform/discrete sine transform (DCT/DST) module (1), a transpose buffer module (2) and a top-level control unit (3). The one-dimensional DCT/DST module (1) provides a unified HEVC transform coding framework and combines a butterfly structure with matrix multiplier arrays so that resources are selected and shared. The transpose buffer module (2) uses the path delays between registers and different write and read orders of the memories to transpose data simply and efficiently. The top-level control unit (3) generates the reset and enable signals of the one-dimensional DCT/DST module and the transpose buffer module, controls the one-dimensional DCT/DST module to perform the one-dimensional row transform on the input data, and controls the transpose buffer module to transpose the row-transform results and feed them back to the one-dimensional DCT/DST module to complete the one-dimensional column transform. The transform coder has a simple, regular structure and a high degree of reuse, is easy to implement as an integrated circuit, and can achieve high-throughput transform coding.

Description

Transform coder suitable for the HEVC standard

Technical field

The invention belongs to the field of electronic circuit technology and relates in particular to the structure of a transform coder for video coding under the HEVC standard, which can be applied to very-large-scale integration (VLSI) circuits.

Background technology

Video images are well known to be the main source from which people obtain information about the outside world. With the rapid development of video acquisition equipment, communication networks and multimedia technology, high-resolution/high-definition video plays an increasingly important role in national security, the national economy, scientific research and daily life. However, once this content-rich video is digitized, the data volume is enormous, which makes the transmission and storage of high-resolution/high-definition video very difficult. In particular, when transmission bandwidth is limited, the high data rate and large data volume of high-resolution video become the bottleneck that limits its wider application. They also pose a greater challenge to existing compression algorithms: traditional video compression methods cannot meet the requirements of high-resolution video.

New-generation video compression methods are the way to solve the compression of high-resolution video. In 2010, ISO/IEC and ITU-T jointly began to establish the next-generation video compression standard, High Efficiency Video Coding (HEVC), and set up a test model. Compared with the currently widely used H.264/AVC standard, HEVC can save up to about 40% of the bit rate at the same decoded image quality. HEVC intra coding still adopts the hybrid coding framework that has been in use since H.263: predictive coding, transform coding and entropy coding.

However, this highly efficient transform coding comes at the cost of increased computational complexity. HEVC intra coding has three different processing units: the coding unit (CU), the prediction unit (PU) and the transform unit (TU). Logically, each of them can be split in turn into smaller processing units. A CU can be decomposed into a PU of its own size as well as four PUs of 1/4 its size. A PU can be further decomposed into a TU of its own size as well as four TUs of 1/4 its size, and each block has more directions than in H.264. The coding framework is implemented with recursion and a quadtree, so its structure is very complex. The TU is the most frequently used unit in HEVC and the final processing unit. Its block size ranges from 4×4 to 32×32, and at the high-efficiency level in particular the size of a TU can be obtained by splitting a CU over three levels. At the same time, the DCT is the main computationally intensive part of the HEVC coding standard. The computational complexity of the TU is therefore the bottleneck for hardware implementations of HEVC, and no circuit structure that implements this standard efficiently exists yet.

Summary of the invention

The object of the invention is to address the difficulties and shortcomings described in the background above by proposing a transform coder suitable for the HEVC standard that implements the five transform types of the standard: 4×4 DCT, 4×4 DST, 8×8 DCT, 16×16 DCT and 32×32 DCT.

To achieve this object, the transform coder of the invention comprises a one-dimensional DCT/DST module, a transpose buffer module and a top-level control unit. The one-dimensional DCT/DST module performs the various one-dimensional transforms of the HEVC standard. The transpose buffer module transposes the data, that is, it outputs by column the row-transform results that were input by row. The top-level control unit generates the reset and enable signals of the one-dimensional DCT/DST module and the transpose buffer module, controls the one-dimensional DCT/DST module to perform the one-dimensional row transform on the original input data, and generates control signals so that the transpose buffer module receives the row-transform results of the one-dimensional DCT/DST module. After all data rows have been processed, it controls the transpose buffer module to send the transposed results back to the one-dimensional DCT/DST module for the one-dimensional column transform. The transform coder is characterized in that:

The one-dimensional DCT/DST module comprises:

Butterfly units, which perform the additions and subtractions between data: the input data are added and subtracted pairwise, first with last, and the results are sent to the multiplexers;

A butterfly-like unit, which performs the additions, subtractions and delays between the input data of the 4-point DST and sends the results to the matrix multiplier array;

Multiplexers, which select among the results supplied by the butterfly units according to the transform type and the current state, and output the selection to the matrix multiplier arrays;

Matrix multiplier arrays, each with two groups of inputs, one of 4 data values and one of 16 data values; each array multiplies the 4 input values by the 4 groups of 4 coefficients on the other input and sends the 16 resulting products to the add-shift units;

Add-shift units, which add and shift the results supplied by the matrix multiplier arrays;

The transpose buffer module transposes the row-transform results of the one-dimensional DCT/DST module. It comprises a register-array transpose submodule and a RAM-access transpose submodule. The register-array transpose submodule adopts a register array structure and uses the different path delays of the registers, different input and output directions and different register regions to transpose data of sizes other than 32 points. The RAM-access transpose submodule adopts an address access structure over 8 groups of RAM and transposes 32-point data by controlling the input and output addresses of each RAM.

Preferably, the top-level control unit of the above transform coder comprises a reset/enable module and a data-flow control module. The reset/enable module is connected to the enable and reset terminals of the one-dimensional DCT/DST module and to the enable and reset terminals of the transpose buffer module, and supplies the enable and reset signals of these two modules. The data-flow control module is connected to the data control terminal of the one-dimensional DCT/DST module and to the data control terminal of the transpose buffer module, and controls the direction of data flow of the two modules according to the progress of the one-dimensional row transform and of the data transposition.

Preferably, the above transform coder has three matrix multiplier arrays: a first, a second and a third multiplier array. The first multiplier array is used for the 4×4 DCT, 4×4 DST, 8×8 DCT, 16×16 DCT and 32×32 DCT; the second multiplier array is used for the 8×8 DCT, 16×16 DCT and 32×32 DCT; the third multiplier array is used for the 16×16 DCT and 32×32 DCT.

Preferably, the multiplexers of the above transform coder are arranged as three groups: a first, a second and a third multiplexer, connected to the inputs of the first, second and third multiplier arrays respectively. According to the transform type and the current state, they select among the results of the butterfly units and pass the selection to the next stage for multiplication: the first multiplexer feeds the first multiplier array, the second multiplexer feeds the second multiplier array, and the third multiplexer feeds the third multiplier array.

Preferably, the butterfly units of the above transform coder are arranged as four groups: a 32-input butterfly, a 16-input butterfly, an 8-input butterfly and a 4-input butterfly, which add and subtract the first and last elements of 32, 16, 8 and 4 input data respectively. The sums of the 32-input butterfly are sent to the 16-input butterfly, and its differences are divided into three groups and sent to the first, second and third multiplexers. The sums of the 16-input butterfly are sent to the 8-input butterfly, and its differences are divided into three groups and sent to the first, second and third multiplexers. The sums of the 8-input butterfly are sent to the 4-input butterfly, and its differences are divided into two groups and sent to the first and second multiplexers. Both the sums and the differences of the 4-input butterfly are sent to the first multiplexer.

Preferably, the add-shift units of the above transform coder are arranged as eight groups: the first, second, third, fourth, fifth, sixth, seventh and eighth add-shift units;

The first add-shift unit is connected to the first multiplier array; it divides the results of the first multiplier array into four groups, sums each group separately and shifts each sum to the right;

The second add-shift unit is connected to the second multiplier array; it divides the results of the second multiplier array into four groups, sums each group separately and shifts each sum to the right;

The third add-shift unit is connected to the third multiplier array; it divides the results of the third multiplier array into four groups, sums each group separately and shifts each sum to the right;

The fourth add-shift unit is connected to the first, second and third multiplier arrays; it divides the results of these multiplier arrays into eight groups, sums each group separately and shifts each sum to the right;

The fifth add-shift unit is connected to the first and second add-shift units and adds the results of the first add-shift unit to those of the second;

The sixth add-shift unit is connected to the third add-shift unit and adds the results of the third add-shift unit from two adjacent cycles;

The seventh add-shift unit is connected to the fourth add-shift unit and adds the results of the fourth add-shift unit from three adjacent cycles;

The eighth add-shift unit is connected to the first multiplier array; it divides the results of the first multiplier array into eight groups, sums each group separately and shifts each sum to the right.

The number of bits by which each add-shift unit shifts a group sum to the right is determined by the transform block size and by whether the row or the column transform is being performed: 4 bits for the 32-point DCT row transform and 11 bits for its column transform; 3 bits for the 16-point DCT row transform and 10 bits for its column transform; 2 bits for the 8-point DCT row transform and 9 bits for its column transform; 1 bit for the 4-point DCT and DST row transforms and 8 bits for their column transforms.
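The shift amounts listed above follow a regular pattern in the transform size N: the row-transform shift equals log2(N) - 1 and the column-transform shift equals log2(N) + 6. The sketch below captures that pattern. The closed form is an observation about the listed values, not something stated in the text, and the function name `shift_bits` is hypothetical.

```python
def shift_bits(n: int, stage: str) -> int:
    """Right-shift amount for an n-point transform, n in {4, 8, 16, 32}.

    Assumption: the closed forms below are inferred from the eight values
    listed in the text; they are not given there as formulas.
    """
    if n not in (4, 8, 16, 32):
        raise ValueError("unsupported transform size")
    log2n = n.bit_length() - 1          # exact log2 for powers of two
    if stage == "row":
        return log2n - 1                # 1, 2, 3, 4 bits
    if stage == "column":
        return log2n + 6                # 8, 9, 10, 11 bits
    raise ValueError("stage must be 'row' or 'column'")


for size in (4, 8, 16, 32):
    print(size, shift_bits(size, "row"), shift_bits(size, "column"))
```

A lookup table with the eight listed values would serve equally well in hardware; the formula only makes the regularity visible.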

Preferably, the register-array transpose submodule of the above transform coder consists of a register array of 16 rows and 16 columns. The array is divided into three regions, a 4×4 transpose region, an 8×8 transpose region and a 16×16 transpose region, which perform the 4-point, 8-point and 16-point transposes respectively. The 4×4 transpose region consists of the 4×4 registers in the upper-left corner of the register array, the 8×8 transpose region consists of the 8×8 registers in the upper-left corner, and the 16×16 transpose region consists of the whole register array. The input data of each region enter from the left side of that region and are shifted one register to the right every clock cycle. Once all data have been input, the shift direction of the registers changes to bottom-to-top and the data are shifted one register upward every clock cycle, which completes the transpose. The transposed data are output from the top.

Preferably, the RAM-access transpose submodule of the above transform coder comprises a RAM unit and an input/output address control unit. The RAM unit consists of 8 memories, each connected to the one-dimensional DCT/DST module. The input/output address control unit is connected to the address terminal of each memory and controls the input and output addresses of each memory, so that the 32-point DCT row-transform results of the one-dimensional DCT/DST module are written into the 8 memories by row and then read out by column, which completes the transpose of the 32×32 data.

The present invention compared with prior art has the following advantages:

First, the invention combines a one-dimensional DCT/DST module with a transpose buffer module and uses the same one-dimensional DCT/DST module for both the row transform and the column transform, which increases circuit reuse and reduces circuit size and complexity;

Second, the one-dimensional DCT/DST module of the invention implements the transforms with a structure that combines butterfly units with matrix multipliers. This structure needs only 48 multipliers to perform all DCTs from 4×4 to 32×32 as well as the 4×4 DST, and is well suited to hardware encoders;

Third, the invention selects between two transpose modules according to the transform block size: the register-array transpose submodule and the RAM-access transpose submodule. The register-array transpose submodule makes clever use of the path delays between registers to transpose data simply and efficiently, and its regular structure is easy to implement as an integrated circuit. The RAM-access transpose submodule takes both data size and memory resources into account and uses different write and read orders to transpose large blocks of data efficiently.

Description of drawings

Fig. 1 is the overall block diagram of the invention;

Fig. 2 is the block diagram of the one-dimensional DCT/DST module of the invention;

Fig. 3 is the block diagram of the matrix multiplier array of the invention;

Fig. 4 is the connection diagram of the butterfly units and multiplexers of the invention;

Fig. 5 is the structure with which the invention implements the one-dimensional 4×4 DCT/DST;

Fig. 6 is the structure with which the invention implements the one-dimensional 8×8 DCT;

Fig. 7 is the structure with which the invention implements the one-dimensional 16×16 DCT;

Fig. 8 is the structure with which the invention implements the one-dimensional 32×32 DCT;

Fig. 9 is a schematic diagram of the data input order of each multiplier array at each stage when the invention implements the one-dimensional 32×32 DCT;

Fig. 10 is a schematic diagram of the structure of the register-array transpose submodule of the invention;

Fig. 11 is a schematic diagram of the structure of the RAM-access transpose submodule of the invention.

Embodiment

The invention is described in detail below with reference to the drawings and embodiments.

With reference to Fig. 1, the transform coder suitable for the HEVC standard proposed by the invention consists of a top-level control unit 3, a one-dimensional DCT/DST module 1 and a transpose buffer module 2. The output of the top-level control unit 3 is divided into two paths: the first is connected to the reset and enable terminals of the one-dimensional DCT/DST module 1, and the second is connected to the reset and enable terminals of the transpose buffer module 2. The input of the one-dimensional DCT/DST module 1 is divided into two paths: the first is connected to the external input data, and the second is connected to the data output of the transpose buffer module 2. The output of the one-dimensional DCT/DST module 1 is divided into two paths: the first is connected to the data input of the transpose buffer module 2, and the second is connected to the external output. The data input of the transpose buffer module 2 is connected to the data output of the one-dimensional DCT/DST module 1, and the data output of the transpose buffer module 2 is connected to the data input of the one-dimensional DCT/DST module 1. In detail:

The top-level control unit 3 comprises a reset/enable module 31 and a data-flow control module 32. The reset/enable module 31 is connected to the enable and reset terminals of the one-dimensional DCT/DST module 1 and to the enable and reset terminals of the transpose buffer module 2, and supplies the enable and reset signals of these two modules. The data-flow control module 32 is connected to the data control terminal of the one-dimensional DCT/DST module 1 and to the data control terminal of the transpose buffer module 2. Both the reset/enable module 31 and the data-flow control module 32 consist of counters and logic circuits. According to the counter state and the transform type currently being performed, the logic circuits generate the reset, enable and data-flow control signals of the one-dimensional DCT/DST module 1 and of the transpose buffer module 2. These signals make the one-dimensional DCT/DST module 1 perform the one-dimensional row transform on the input data of the transform coder and make the transpose buffer module 2 receive the row-transform results of the one-dimensional DCT/DST module 1. After all data rows have been processed, the transpose buffer module 2 is controlled to output the transposed row-transform results to the one-dimensional DCT/DST module 1 for the one-dimensional column transform.
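The data flow controlled by the top-level control unit (row transform, transpose, then a second pass through the same one-dimensional module) can be sketched in software for the 4×4 DCT case. This is a minimal behavioral sketch under stated assumptions, not the circuit itself: the coefficient matrix `DCT4` is the 4-point integer DCT matrix of the HEVC standard, which the text does not list; the plain right shift without a rounding offset follows the text, which only mentions right shifts; all function names are hypothetical.

```python
# Assumption: 4-point integer DCT matrix from the HEVC standard (not listed in the text).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def transform_1d(vec, matrix, shift):
    """One pass through the 1-D module: matrix-vector product, then right shift."""
    return [sum(c * x for c, x in zip(row, vec)) >> shift for row in matrix]


def transpose(block):
    """Behavior of the transpose buffer: rows in, columns out."""
    return [list(col) for col in zip(*block)]


def transform_2d(block, matrix, row_shift, col_shift):
    rows = [transform_1d(r, matrix, row_shift) for r in block]    # row transform
    buffered = transpose(rows)                                    # transpose buffer
    # Column transform: the same 1-D routine is reused on the transposed data.
    return [transform_1d(c, matrix, col_shift) for c in buffered]


flat = [[10] * 4 for _ in range(4)]
coeffs = transform_2d(flat, DCT4, row_shift=1, col_shift=8)
print(coeffs[0])   # only the DC coefficient is non-zero for a flat block
```

Each list in the result holds one transformed column, so the output is arranged column by column; a consumer that expects row order would read it transposed.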

The transpose buffer module 2 transposes the row-transform results of the one-dimensional DCT/DST module 1. It comprises a register-array transpose submodule 9 and a RAM-access transpose submodule 10. The structure of the register-array transpose submodule 9 is shown in Fig. 10, and the structure of the RAM-access transpose submodule 10 is shown in Fig. 11.

With reference to Fig. 10, the register-array transpose submodule 9 consists of a register array of 16 rows and 16 columns. The array is divided into three regions, a 4×4 transpose region 91, an 8×8 transpose region 92 and a 16×16 transpose region 93, which perform the 4-point, 8-point and 16-point transposes respectively. The 4×4 transpose region 91 consists of the 4×4 registers in the upper-left corner of the register array, the 8×8 transpose region 92 consists of the 8×8 registers in the upper-left corner, and the 16×16 transpose region 93 consists of the whole register array. The input data of each region enter from the left side of that region and are shifted one register to the right every clock cycle. Once all data have been input, the shift direction of the registers changes to bottom-to-top and the data are shifted one register upward every clock cycle, which completes the transpose. The transposed data are output from the top.
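The shift-right-then-shift-up behavior of one transpose region can be modeled cycle by cycle. The sketch below is a software model under assumptions of its own: each row-transform result enters the left column in one clock cycle with element i in register row i, and the top row is read from right to left so that the earliest input comes first. The text does not specify either detail, and the function name is hypothetical.

```python
def register_array_transpose(vectors):
    """Cycle-level model of an n x n transpose region (n = 4, 8 or 16)."""
    n = len(vectors)
    regs = [[0] * n for _ in range(n)]

    # Input phase: every clock cycle the contents shift one column to the
    # right and a new vector enters the leftmost column.
    for vec in vectors:
        for i in range(n):
            regs[i] = [vec[i]] + regs[i][:-1]

    # Output phase: every clock cycle the top row is read and the contents
    # shift one row upward.
    out = []
    for _ in range(n):
        # Assumption: the rightmost register holds the earliest input, so
        # the row is read right to left to restore the original order.
        out.append(regs[0][::-1])
        regs = regs[1:] + [[0] * n]
    return out


print(register_array_transpose([[1, 2], [3, 4]]))
```

The model needs n cycles to fill and n cycles to drain, with no address logic, which is the regularity the text attributes to this submodule.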

With reference to Fig. 11, the RAM-access transpose submodule 10 comprises a RAM unit 101 and an input/output address control unit 102. The RAM unit 101 consists of 8 memories, each connected to the one-dimensional DCT/DST module 1. The input/output address control unit 102 is connected to the address terminal of each memory and controls the input and output addresses of each memory, so that the 32-point DCT row-transform results of the one-dimensional DCT/DST module 1 are written into the 8 memories by row and then read out by column, which completes the transpose of the 32×32 data.
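The text does not give the address mapping used by the input/output address control unit. One common way to let 8 memories serve both row-wise writes and column-wise reads without bank conflicts is a diagonal (skewed) mapping, sketched below purely as an illustration. The mapping `bank = (row + col) mod 8`, the access width of 8 values per cycle, and all names are assumptions and are not taken from the patent.

```python
N, BANKS = 32, 8


def bank_and_addr(r, c):
    """Hypothetical skewed mapping of element (r, c) to (bank, address)."""
    return (r + c) % BANKS, r * (N // BANKS) + c // BANKS


def ram_transpose(block):
    rams = [[None] * (N * N // BANKS) for _ in range(BANKS)]

    # Write phase: 8 consecutive elements of a row per access.
    for r in range(N):
        for c0 in range(0, N, BANKS):
            targets = [bank_and_addr(r, c) for c in range(c0, c0 + BANKS)]
            assert len({b for b, _ in targets}) == BANKS   # one value per bank
            for (b, a), c in zip(targets, range(c0, c0 + BANKS)):
                rams[b][a] = block[r][c]

    # Read phase: 8 consecutive elements of a column per access.
    out = []
    for c in range(N):
        column = []
        for r0 in range(0, N, BANKS):
            sources = [bank_and_addr(r, c) for r in range(r0, r0 + BANKS)]
            assert len({b for b, _ in sources}) == BANKS   # one value per bank
            column.extend(rams[b][a] for b, a in sources)
        out.append(column)
    return out


block = [[r * N + c for c in range(N)] for r in range(N)]
transposed = ram_transpose(block)
print(transposed[1][:4])   # first elements of column 1 of the input
```

With this mapping every 8-wide access, whether along a row or along a column, touches each memory exactly once, so only the addresses differ between the write and read phases.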

The one-dimensional DCT/DST module 1 performs the one-dimensional 4-point DCT, 4-point DST, 8-point DCT, 16-point DCT and 32-point DCT of the HEVC standard. Its structure is shown in Fig. 2.

With reference to Fig. 2, the one-dimensional DCT/DST module 1 comprises four groups of butterfly units 4, one butterfly-like unit 5, three groups of multiplexers 6, three matrix multiplier arrays 7 and eight groups of add-shift units 8, wherein:

The three matrix multiplier arrays 7 are the first multiplier array 71, the second multiplier array 72 and the third multiplier array 73. The input data of each matrix multiplier array are divided into two paths: the first path carries the four data values a, b, c and d, and the second path carries the 16 data values a0, a1, a2, a3, b0, b1, b2, b3, c0, c1, c2, c3, d0, d1, d2 and d3. The structure contains 16 multipliers, which multiply a by each of a0, a1, a2 and a3, b by each of b0, b1, b2 and b3, c by each of c0, c1, c2 and c3, and d by each of d0, d1, d2 and d3, and output the 16 products, as shown in Fig. 3;
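The behavior of one multiplier array, followed by the grouping done in an add-shift unit, can be sketched as follows. The sketch assumes that product M_{i,j} is the i-th data input times the j-th coefficient of its group, and that an add-shift unit sums the four products sharing the same j; this indexing is an interpretation of the groups named in the text, and the function names are hypothetical.

```python
def multiplier_array(data, coeffs):
    """16 multipliers: data = (a, b, c, d), coeffs = 4 groups of 4 values.

    Returns products[i][j] = data[i] * coeffs[i][j] (assumed to be M_{i,j}).
    """
    return [[x * k for k in group] for x, group in zip(data, coeffs)]


def add_shift(products, shift):
    """Sum the four products of each group M_{0,j}..M_{3,j}, then shift right."""
    return [sum(products[i][j] for i in range(4)) >> shift for j in range(4)]


# Usage: a 4x4 matrix-vector product, with the coefficients supplied so that
# coeffs[i][j] is the weight of input i in output j.
data = (1, 2, 3, 4)
coeffs = [[1, 0, 2, 0],
          [0, 1, 0, 2],
          [1, 1, 0, 0],
          [0, 0, 1, 1]]
products = multiplier_array(data, coeffs)
print(add_shift(products, shift=0))
```

Seen this way, one array plus one add-shift unit evaluates a 4×4 matrix-vector product per cycle, and three arrays account for the 48 multipliers mentioned in the text.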

The three groups of multiplexers 6 are the first multiplexer 61, the second multiplexer 62 and the third multiplexer 63, whose outputs are connected to the inputs of the first multiplier array 71, the second multiplier array 72 and the third multiplier array 73 respectively. According to the current state and the transform type, they select the input data of the first multiplier array 71, the second multiplier array 72 and the third multiplier array 73;

The four groups of butterfly units 4 are the 32-input butterfly 41, the 16-input butterfly 42, the 8-input butterfly 43 and the 4-input butterfly 44. Each butterfly unit consists of adders and subtractors and adds and subtracts the first and last elements of its input data. The 32-input butterfly 41 sends its sums to the 16-input butterfly 42; of its differences, O1_0, O1_1, O1_6, O1_7, O1_12 and O1_13 are sent to the first multiplexer 61, O1_2, O1_3, O1_8, O1_9, O1_14 and O1_15 are sent to the second multiplexer 62, and O1_4, O1_5, O1_10 and O1_11 are sent to the third multiplexer 63. The 16-input butterfly 42 sends its sums to the 8-input butterfly 43; of its differences, O2_4, O2_5, O2_6 and O2_7 are sent to the first multiplexer 61 and the second multiplexer 62, and O2_0, O2_1, O2_2 and O2_3 are sent to the third multiplexer 63. The 8-input butterfly 43 sends its sums to the 4-input butterfly 44 and the butterfly-like unit 5, and sends its differences to the second multiplexer 62. Both the sums and the differences of the 4-input butterfly 44 are sent to the first multiplexer 61, as shown in Fig. 4.
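The cascade of butterflies described above can be sketched in a few lines: each stage adds and subtracts the first and last elements, passes the sums to the next smaller butterfly and hands the differences to the multiplexers. The function names and the dictionary used to collect the differences are illustrative choices, not part of the text.

```python
def butterfly(x):
    """Add and subtract first/last pairs: x[i] with x[n-1-i]."""
    n = len(x)
    sums = [x[i] + x[n - 1 - i] for i in range(n // 2)]
    diffs = [x[i] - x[n - 1 - i] for i in range(n // 2)]
    return sums, diffs


def butterfly_cascade(x):
    """Run the 32/16/8/4-input butterflies in sequence, starting at len(x).

    Returns the sums of the final 4-input butterfly and, per stage size,
    the differences that would go to the multiplexers.
    """
    diffs_by_stage = {}
    while len(x) >= 4:
        sums, diffs = butterfly(x)
        diffs_by_stage[len(x)] = diffs
        x = sums
    return x, diffs_by_stage


final_sums, stage_diffs = butterfly_cascade(list(range(32)))
print({size: len(d) for size, d in stage_diffs.items()})
```

For a 32-point input the stages produce 16, 8, 4 and 2 differences, which matches the 16 signals O1_0 to O1_15 and the 8 signals O2_0 to O2_7 distributed to the multiplexers.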

The eight groups of add-shift units 8 are the first add-shift unit 81, the second add-shift unit 82, the third add-shift unit 83, the fourth add-shift unit 84, the fifth add-shift unit 85, the sixth add-shift unit 86, the seventh add-shift unit 87 and the eighth add-shift unit 88. The first add-shift unit 81 is connected to the first multiplier array 71; it divides the results of the first multiplier array 71 into four groups, sums each group separately and shifts each sum to the right. The second add-shift unit 82 is connected to the second multiplier array 72; it divides the results of the second multiplier array 72 into four groups, sums each group separately and shifts each sum to the right. The third add-shift unit 83 is connected to the third multiplier array 73; it divides the results of the third multiplier array 73 into four groups, sums each group separately and shifts each sum to the right. The fourth add-shift unit 84 is connected to the first multiplier array 71, the second multiplier array 72 and the third multiplier array 73; it divides the results of these multiplier arrays into eight groups, sums each group separately and shifts each sum to the right. The fifth add-shift unit 85 is connected to the first add-shift unit 81 and the second add-shift unit 82 and adds the results of the first add-shift unit 81 to those of the second add-shift unit 82. The sixth add-shift unit 86 is connected to the third add-shift unit 83 and adds the results of the third add-shift unit 83 from two adjacent cycles. The seventh add-shift unit 87 is connected to the fourth add-shift unit 84 and adds the results of the fourth add-shift unit 84 from three adjacent cycles. The eighth add-shift unit 88 is connected to the first multiplier array 71; it divides the results of the first multiplier array 71 into eight groups, sums each group separately and shifts each sum to the right. The number of bits by which each add-shift unit shifts a group sum to the right is determined by the transform block size and by whether the row or the column transform is being performed: 4 bits for the 32-point DCT row transform and 11 bits for its column transform; 3 bits for the 16-point DCT row transform and 10 bits for its column transform; 2 bits for the 8-point DCT row transform and 9 bits for its column transform; 1 bit for the 4-point DCT and DST row transforms and 8 bits for their column transforms.

The one-dimensional DCT/DST module 1 of the invention performs the one-dimensional 4-point DCT, 4-point DST, 8-point DCT, 16-point DCT and 32-point DCT as follows:

With reference to Fig. 5, when the one-dimensional DCT/DST module 1 performs the one-dimensional 4-point DCT or 4-point DST, it uses the 4-input butterfly 44, the butterfly-like unit 5, the first multiplexer 61, the first multiplier array 71 and the eighth add-shift unit 88 of this structure. After the data to be transformed are input from outside, they enter the 4-input butterfly 44 and the butterfly-like unit 5, which perform the additions, subtractions and delays among the 4 data values. The results of the butterfly-like unit 5 and the DCT basis vectors are input through the 16 input ports of the first multiplier array 71, while the results of the 4-input butterfly 44 and the DST basis vectors are input through the first multiplexer 61 to the 4 input ports of the first multiplier array 71. The first multiplexer 61 selects the data that correspond to the current transform type, 4-point DCT or 4-point DST, and feeds them to the first multiplier array 71. The 4-point DCT uses the 6 multipliers M_{0,0} to M_{3,0}, M_{2,1} and M_{3,1}; the 4-point DST uses the 8 multipliers M_{0,1} to M_{1,1}, M_{0,2} to M_{2,2} and M_{0,3} to M_{2,3}. After the multiplications, the products are sent to the eighth add-shift unit 88, which performs the additions and shifts between the data to obtain the final transform results.
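The multiplier counts quoted above (6 for the 4-point DCT, 8 for the 4-point DST) can be reproduced in software. The sketch below is one possible arrangement, not the patent's exact wiring: the coefficient matrices `DCT4` and `DST4` are those of the HEVC standard, which the text does not list, and the DST factorization is the well-known fast form used in HEVC reference software, assumed here to correspond to the butterfly-like unit. Right shifts are omitted so the results can be compared with plain matrix products.

```python
# Assumption: HEVC 4-point integer DCT and DST matrices (not listed in the text).
DCT4 = [[64, 64, 64, 64], [83, 36, -36, -83], [64, -64, -64, 64], [36, -83, 83, -36]]
DST4 = [[29, 55, 74, 84], [74, 74, 0, -74], [84, -29, -74, 55], [55, -84, 74, -29]]


def matvec(matrix, x):
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


def dct4(x):
    # 4-input butterfly: first/last sums and differences.
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    # Six products, then additions only.
    p = [64 * s0, 64 * s1, 83 * d0, 36 * d1, 36 * d0, 83 * d1]
    return [p[0] + p[1], p[2] + p[3], p[0] - p[1], p[4] - p[5]]


def dst4(x):
    # Butterfly-like pre-additions (assumed factorization).
    c0, c1, c2 = x[0] + x[3], x[1] + x[3], x[0] - x[1]
    # Eight products, then additions only.
    p = [29 * c0, 55 * c1, 74 * x[2], 74 * (x[0] + x[1] - x[3]),
         29 * c2, 55 * c0, 55 * c2, 29 * c1]
    return [p[0] + p[1] + p[2], p[3], p[4] + p[5] - p[2], p[6] - p[7] + p[2]]


sample = [3, -7, 12, 5]
print(dct4(sample) == matvec(DCT4, sample), dst4(sample) == matvec(DST4, sample))
```

A direct 4×4 matrix product would need 16 multiplications per transform; the pre-additions reduce this to 6 and 8 respectively, consistent with the multiplier counts in the text.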

With reference to Fig. 6, when the one-dimensional DCT/DST module 1 performs the one-dimensional 8-point DCT, it uses the 8-input butterfly 43, the 4-input butterfly 44, the first multiplexer 61, the second multiplexer 62, the first multiplier array 71, the second multiplier array 72, the second add-shift unit 82 and the eighth add-shift unit 88 of this structure. After the data to be transformed are input, the 8-input butterfly 43 adds and subtracts the first and last elements of the input data. The sums undergo the 4-point DCT described above, which yields transform results 0, 2, 4 and 6. The differences of the 8-input butterfly 43 are sent through the second multiplexer 62 to the 4 input ports of the second multiplier array 72, and the basis matrix of the 8-point DCT is sent to the 16 input ports of the second multiplier array 72, which performs the 16 required multiplications. The products are sent to the second add-shift unit 82, which adds and shifts the four groups of products M_{0,0} to M_{3,0}, M_{0,1} to M_{3,1}, M_{0,2} to M_{3,2} and M_{0,3} to M_{3,3} separately, which yields transform results 1, 3, 5 and 7.
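The even/odd split described above can be checked numerically: the sums of the 8-input butterfly go through a 4-point DCT to give outputs 0, 2, 4, 6, and the differences are multiplied by a 4×4 odd-part matrix (16 products) to give outputs 1, 3, 5, 7. The sketch assumes the 8-point integer DCT matrix of the HEVC standard, which the text does not list; the 4-point part is written as a plain matrix product for brevity, and shifts are omitted. All names are hypothetical.

```python
# Assumption: HEVC 8-point integer DCT matrix (not listed in the text).
DCT8 = [
    [64,  64,  64,  64,  64,  64,  64,  64],
    [89,  75,  50,  18, -18, -50, -75, -89],
    [83,  36, -36, -83, -83, -36,  36,  83],
    [75, -18, -89, -50,  50,  89,  18, -75],
    [64, -64, -64,  64,  64, -64, -64,  64],
    [50, -89,  18,  75, -75, -18,  89, -50],
    [36, -83,  83, -36, -36,  83, -83,  36],
    [18, -50,  75, -89,  89, -75,  50, -18],
]
# Left halves of the even rows form the 4-point DCT matrix; left halves of
# the odd rows form the 4x4 odd-part matrix fed to the second array.
EVEN4 = [row[:4] for row in DCT8[0::2]]
ODD4 = [row[:4] for row in DCT8[1::2]]


def matvec(matrix, x):
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


def dct8(x):
    sums = [x[i] + x[7 - i] for i in range(4)]     # 8-input butterfly, sums
    diffs = [x[i] - x[7 - i] for i in range(4)]    # 8-input butterfly, differences
    even = matvec(EVEN4, sums)                     # outputs 0, 2, 4, 6
    odd = matvec(ODD4, diffs)                      # outputs 1, 3, 5, 7 (16 products)
    out = [0] * 8
    out[0::2], out[1::2] = even, odd
    return out


sample = [5, -3, 8, 1, 0, 7, -2, 4]
print(dct8(sample) == matvec(DCT8, sample))
```

The split works because the even rows of the matrix are symmetric and the odd rows antisymmetric about the center, so each half of the outputs depends only on the sums or only on the differences.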

With reference to Fig. 7, when the one-dimensional DCT/DST module 1 performs a one-dimensional 16-point DCT, the structure uses the 16-input butterfly 42, the 8-input butterfly 43, the 4-input butterfly 44, the first multiplexer 61, the second multiplexer 62, the third multiplexer 63, the first multiplier array 71, the second multiplier array 72, the third multiplier array 73, the first addition shift unit 81, the second addition shift unit 82, the third addition shift unit 83, the fifth addition shift unit 85, the sixth addition shift unit 86 and the eighth addition shift unit 88. After input, the data to be transformed enter the 16-input butterfly 42, which performs the head-tail additions and subtractions among the 16 data. The addition results undergo the 8-point DCT described above, giving transform results 0, 2, 4, 6, 8, 10, 12 and 14. Among the subtraction results, O2_4, O2_5, O2_6 and O2_7 are sent to the first multiplexer 61 and the second multiplexer 62, and O2_0, O2_1, O2_2 and O2_3 are sent to the third multiplexer 63. Completing the 16-point DCT requires the following 2 clock cycles:

First clock cycle: The addition results of the 16-input butterfly 42 undergo the 8-point DCT described above, giving transform results 0, 2, 4, 6, 8, 10, 12 and 14. The subtraction results O2_0, O2_1, O2_2 and O2_3 are sent through the third multiplexer 63 to the 4 input ports of the third multiplier array 73, and the 16-point DCT basis matrix is sent to the 16 input ports of the third multiplier array 73, completing the 16 required multiplications. The outputs are sent to the third addition shift unit 83, which adds and shifts the results of the four multiplier groups M_{0,0}~M_{3,0}, M_{0,1}~M_{3,1}, M_{0,2}~M_{3,2} and M_{0,3}~M_{3,3} separately, giving 4 intermediate results that are sent to the sixth addition shift unit 86.

Second clock cycle: The subtraction results O2_4, O2_5, O2_6 and O2_7 are sent through the first multiplexer 61 and the second multiplexer 62 to the 4 input ports of the first multiplier array 71 and of the second multiplier array 72, and the corresponding 16-point DCT basis matrices are sent to the 16 input ports of the first multiplier array 71 and of the second multiplier array 72 respectively, completing the 32 required multiplications. The outputs are sent to the fifth addition shift unit 85, which adds and shifts the results of the four multiplier groups M_{0,0}~M_{3,0}, M_{0,1}~M_{3,1}, M_{0,2}~M_{3,2} and M_{0,3}~M_{3,3} of the first multiplier array 71 and the second multiplier array 72 separately, giving transform results 1, 3, 5 and 7. The subtraction results O2_0, O2_1, O2_2 and O2_3 are sent through the third multiplexer 63 to the 4 input ports of the third multiplier array 73, and the corresponding 16-point DCT basis matrix is sent to the 16 input ports of the third multiplier array 73, completing the 16 required multiplications. The outputs are sent to the third addition shift unit 83, which adds and shifts the results of the four multiplier groups M_{0,0}~M_{3,0}, M_{0,1}~M_{3,1}, M_{0,2}~M_{3,2} and M_{0,3}~M_{3,3} separately, giving 4 intermediate results that are sent to the sixth addition shift unit 86. The sixth addition shift unit 86 adds the results produced by the third addition shift unit 83 in the two adjacent cycles and right-shifts the sum, giving transform results 9, 11, 13 and 15.
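The odd half of a 16-point DCT is an 8×8 matrix-vector product, that is, 64 multiplications, which the two clock cycles above spread over 16-multiplier arrays with accumulation of partial results. The sketch below illustrates only that tiling principle: it splits the 8×8 odd block into four 4×4 tiles, each sized for one 16-multiplier array, and accumulates the partial sums. The coefficients are assumed to be the odd rows of the HEVC 16-point matrix; the order in which tiles are processed and their assignment to arrays 71, 72 and 73 are assumptions of this sketch, not the patent's schedule.

```python
# Left half of the odd rows (1, 3, ..., 15) of the HEVC 16-point DCT matrix.
ODD16 = [
    [90, 87, 80, 70, 57, 43, 25, 9],
    [87, 57, 9, -43, -80, -90, -70, -25],
    [80, 9, -70, -87, -25, 57, 90, 43],
    [70, -43, -87, 9, 90, 25, -80, -57],
    [57, -80, -25, 90, -9, -87, 43, 70],
    [43, -90, 57, 25, -87, 70, 9, -80],
    [25, -70, 90, -80, 43, 9, -57, 87],
    [9, -25, 43, -57, 70, -80, 87, -90],
]


def multiplier_array(inputs4, coeffs4x4):
    """Model of one 16-multiplier array: 4 inputs times 4 groups of 4 coefficients."""
    return [[coeffs4x4[k][n] * inputs4[n] for n in range(4)] for k in range(4)]


def add_groups(products):
    """Model of an addition unit: sum each of the four product groups."""
    return [sum(group) for group in products]


def odd16_tiled(diffs):
    """Odd outputs of a 16-point DCT built from 4x4 tiles with accumulation."""
    acc = [0] * 8
    array_uses = 0
    for out_block in range(2):        # outputs 0-3, then 4-7 of the odd set
        for in_block in range(2):     # differences 0-3, then 4-7
            rows = ODD16[4 * out_block:4 * out_block + 4]
            tile = [row[4 * in_block:4 * in_block + 4] for row in rows]
            partial = add_groups(
                multiplier_array(diffs[4 * in_block:4 * in_block + 4], tile)
            )
            for k in range(4):
                acc[4 * out_block + k] += partial[k]   # accumulate partial sums
            array_uses += 1
    return acc, array_uses


diffs = [5, -2, 7, 1, 0, -3, 4, 6]
result, uses = odd16_tiled(diffs)
reference = [sum(c * d for c, d in zip(row, diffs)) for row in ODD16]
print(result == reference, uses)
```

Four array uses of 16 multiplications each give the 64 products; this matches the count in the text, where the third multiplier array 73 is used in both cycles and the first and second multiplier arrays are used in the second cycle.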

With reference to Fig. 8, when the one-dimensional DCT/DST module 1 performs a one-dimensional 32-point DCT, the structure uses the 32-input butterfly 41, the 16-input butterfly 42, the 8-input butterfly 43, the 4-input butterfly 44, the first multiplexer 61, the second multiplexer 62, the third multiplexer 63, the first multiplier array 71, the second multiplier array 72, the third multiplier array 73, the first addition shift unit 81, the second addition shift unit 82, the third addition shift unit 83, the fourth addition shift unit 84, the fifth addition shift unit 85, the sixth addition shift unit 86, the seventh addition shift unit 87 and the eighth addition shift unit 88. After input, the data to be transformed enter the 32-input butterfly 41, which performs the head-tail additions and subtractions among the 32 data. The addition results undergo the 16-point DCT described above, giving transform results 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28 and 30. Among the subtraction results, O1_0, O1_1, O1_6, O1_7, O1_12 and O1_13 are sent to the first multiplexer 61; O1_2, O1_3, O1_8, O1_9, O1_14 and O1_15 are sent to the second multiplexer 62; and O1_4, O1_5, O1_10 and O1_11 are sent to the third multiplexer 63. Completing the 32-point DCT requires 8 clock cycles in 2 stages: the first stage takes 2 clock cycles and the second stage takes 6 clock cycles. The operations of each stage are as follows:

Stage 1: The addition results of the 32-input butterfly 41 undergo the 16-point DCT described above, giving transform results 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28 and 30. This takes 2 clock cycles in total, and 8 transform results are obtained in each clock cycle.

Stage 2: This stage comprises six clock cycles, T1, T2, T3, T4, T5 and T6. According to the current state, the first multiplexer 61, the second multiplexer 62 and the third multiplexer 63 send the subtraction results output by the 32-input butterfly 41 to the first multiplier array 71, the second multiplier array 72 and the third multiplier array 73 respectively, so that up to 48 multiplications are completed in each clock cycle.

With reference to Fig. 9, at T1 and T4, O0, O0, O1, O1 are input through the 4 ports of the first multiplier array 71, and the corresponding basis vectors are input through its 16 ports; O2, O2, O3, O3 are input through the 4 ports of the second multiplier array 72, and the corresponding basis vectors are input through its 16 ports; O4, O4, O5, O5 are input through the 4 ports of the third multiplier array 73, and the corresponding basis vectors are input through its 16 ports.

At T2 and T5, O6, O6, O7, O7 are input through the 4 ports of the first multiplier array 71, and the corresponding basis vectors are input through its 16 ports; O8, O8, O9, O9 are input through the 4 ports of the second multiplier array 72, and the corresponding basis vectors are input through its 16 ports; O10, O10, O11, O11 are input through the 4 ports of the third multiplier array 73, and the corresponding basis vectors are input through its 16 ports.

At T3 and T6, O12, O12, O13, O13 are input through the 4 ports of the first multiplier array 71, and the corresponding basis vectors are input through its 16 ports; O14, O14, O15, O15 are input through the 4 ports of the second multiplier array 72, and the corresponding basis vectors are input through its 16 ports.

The results of the first multiplier array 71, the second multiplier array 72 and the third multiplier array 73 are sent to the fourth addition shift unit 84, which adds and shifts separately the results of the 8 corresponding multiplier groups in the three multiplier arrays: M_{0,0} and M_{2,0}; M_{1,0} and M_{3,0}; M_{0,1} and M_{2,1}; M_{1,1} and M_{3,1}; M_{0,2} and M_{2,2}; M_{1,2} and M_{3,2}; M_{0,3} and M_{2,3}; M_{1,3} and M_{3,3}. The results are input to the seventh addition shift unit 87, which adds the results produced by the fourth addition shift unit 84 in three adjacent cycles to obtain the final transform results.
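The six-cycle schedule of Stage 2 can be modelled as a table of which pair of subtraction results each multiplier array receives in each cycle. The sketch below is written under stated assumptions: the 16×16 odd-part matrix `W` is a stand-in computed from the cosine formula rather than the integer table of the HEVC standard, and the split of outputs between T1~T3 (first 8 odd results) and T4~T6 (last 8 odd results) is assumed, since the text does not state which results each group of three cycles produces.

```python
import math

# Stand-in odd-part matrix of a 32-point DCT (assumption: rounded cosines,
# not the standard's integer table).
W = [
    [round(64 * math.sqrt(2) * math.cos((2 * n + 1) * (2 * k + 1) * math.pi / 64))
     for n in range(16)]
    for k in range(16)
]

# Pairs of subtraction results fed to arrays 71, 72, 73 in T1/T4, T2/T5, T3/T6.
# Each pair (a, b) appears on the 4 input ports as Oa, Oa, Ob, Ob.
PAIRS = [
    [(0, 1), (2, 3), (4, 5)],       # T1 and T4
    [(6, 7), (8, 9), (10, 11)],     # T2 and T5
    [(12, 13), (14, 15)],           # T3 and T6: only arrays 71 and 72 are listed
]


def odd32_scheduled(diffs):
    """Odd outputs of a 32-point DCT accumulated over six cycles (illustrative)."""
    out = [0] * 16
    mults_per_cycle = []
    for cycle in range(6):                   # T1 .. T6
        half = cycle // 3                    # assumed output half for this cycle
        mults = 0
        for a, b in PAIRS[cycle % 3]:        # one pair per multiplier array
            for k in range(8):               # 8 groups of 2 products per array
                row = W[8 * half + k]
                out[8 * half + k] += row[a] * diffs[a] + row[b] * diffs[b]
                mults += 2
        mults_per_cycle.append(mults)
    return out, mults_per_cycle


diffs = list(range(-8, 8))
result, counts = odd32_scheduled(diffs)
reference = [sum(c * d for c, d in zip(row, diffs)) for row in W]
print(result == reference, counts, sum(counts))
```

In this model the three-cycle accumulation plays the role of the seventh addition shift unit 87, and the total of 256 multiplications equals the size of the 16×16 odd block. Cycles T3 and T6 use only two arrays here because the text lists inputs for the first and second multiplier arrays only in those cycles.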

Explanation of terms:

DCT, discrete cosine transform;

DST, discrete sine transform;

One-dimensional DCT/DST, one-dimensional discrete cosine transform and discrete sine transform.

Claims

1. A transform coder suitable for the HEVC standard, comprising a one-dimensional DCT/DST module (1), a transpose buffer module (2) and a top-level control unit (3), wherein the one-dimensional DCT/DST module (1) performs the various one-dimensional transforms of the HEVC standard; the transpose buffer module (2) performs the data transposition, that is, it outputs by column the row-transform results that were input by row; and the top-level control unit (3) generates the reset and enable signals of the one-dimensional DCT/DST module (1) and the transpose buffer module (2), controls the one-dimensional DCT/DST module (1) to perform the one-dimensional row transform on the input original data, generates control signals so that the transpose buffer module (2) receives the row-transform results of the one-dimensional DCT/DST module (1) and, after all rows of data have been processed, controls the transpose buffer module (2) to feed the transposed results back to the one-dimensional DCT/DST module (1) for the one-dimensional column transform, characterized in that:

the one-dimensional DCT/DST module (1) comprises:

a butterfly processing unit (4), for performing the additions and subtractions among the data and sending the results of the pairwise head-tail addition and subtraction of the input data to the multiplexer (6);

a butterfly-like processing unit (5), for performing the addition, subtraction and delay operations among the input data of the 4-point DST and sending the results to the matrix multiplier array (7);

a multiplexer (6), for selecting, according to the transform type and the current state, among the results input from the butterfly processing unit (4) and outputting the selected data to the matrix multiplier array (7);

a matrix multiplier array (7) having two groups of inputs, one group receiving 4 data and the other receiving 16 data, for multiplying the 4 input data by each of the 4 groups of 4 coefficients received on the other group of inputs and sending the 16 resulting products to the addition shift unit (8);

an addition shift unit (8), for adding and shifting the results input from the matrix multiplier array (7);

the transpose buffer module (2) performs the transposition of the row-transform results of the one-dimensional DCT/DST module (1) and comprises a register array transposition submodule (9) and a RAM access transposition submodule (10); the register array transposition submodule (9) adopts a register array structure and uses the different path delays of the registers, different input and output directions and different register regions to complete the transposition of data for transforms other than the 32-point transform; the RAM access transposition submodule (10) adopts a structure of 8 RAM groups accessed by address and completes the transposition of 32-point data by controlling the input and output addresses of each RAM.

2. The transform coder according to claim 1, characterized in that the top-level control unit (3) comprises a reset-enable module (31) and a data flow control module (32); the reset-enable module (31) is connected to the enable and reset terminals of the one-dimensional DCT/DST module (1) and to the enable and reset terminals of the transpose buffer module (2), and provides enable and reset signals to these two modules; the data flow control module (32) is connected to the data control terminal of the one-dimensional DCT/DST module (1) and to the data control terminal of the transpose buffer module (2), and controls the data flow direction of the DCT/DST module and the transpose buffer module according to the completion status of the one-dimensional row transform and of the data transposition.

3. The transform coder according to claim 1, characterized in that three matrix multiplier arrays (7) are provided, namely a first multiplier array (71), a second multiplier array (72) and a third multiplier array (73); the first multiplier array (71) is used for the 4×4 DCT, 4×4 DST, 8×8 DCT, 16×16 DCT and 32×32 DCT transforms; the second multiplier array (72) is used for the 8×8 DCT, 16×16 DCT and 32×32 DCT transforms; and the third multiplier array (73) is used for the 16×16 DCT and 32×32 DCT transforms.

4. The transform coder according to claim 1, characterized in that three groups of multiplexers (6) are provided, namely a first multiplexer (61), a second multiplexer (62) and a third multiplexer (63), which are connected to the inputs of the first multiplier array (71), the second multiplier array (72) and the third multiplier array (73) respectively, and which select among the calculation results of the butterfly processing unit (4) according to the transform type and the current state and send the selected data to the next stage for multiplication; that is, the data selected by the first multiplexer (61) are sent to the first multiplier array (71), the data selected by the second multiplexer (62) are sent to the second multiplier array (72), and the data selected by the third multiplexer (63) are sent to the third multiplier array (73).

5. The transform coder according to claim 1, characterized in that four groups of butterfly processing units (4) are provided, namely a 32-input butterfly (41), a 16-input butterfly (42), an 8-input butterfly (43) and a 4-input butterfly (44), which perform the head-tail addition and head-tail subtraction of 32, 16, 8 and 4 input data respectively; the addition results of the 32-input butterfly (41) are sent to the 16-input butterfly (42), and its subtraction results are divided into three groups and sent to the first multiplexer (61), the second multiplexer (62) and the third multiplexer (63) respectively; the addition results of the 16-input butterfly (42) are sent to the 8-input butterfly (43), and its subtraction results are divided into three groups and sent to the first multiplexer (61), the second multiplexer (62) and the third multiplexer (63) respectively; the addition results of the 8-input butterfly (43) are sent to the 4-input butterfly (44), and its subtraction results are sent to the second multiplexer (62); and both the addition results and the subtraction results of the 4-input butterfly (44) are sent to the first multiplexer (61).

6. The transform coder according to claim 1, characterized in that eight groups of addition shift units (8) are provided, namely a first addition shift unit (81), a second addition shift unit (82), a third addition shift unit (83), a fourth addition shift unit (84), a fifth addition shift unit (85), a sixth addition shift unit (86), a seventh addition shift unit (87) and an eighth addition shift unit (88);

the first addition shift unit (81) is connected to the first multiplier array (71), and is used for dividing the results of the first multiplier array (71) into four groups, adding the data of each group separately and right-shifting the sums;

the second addition shift unit (82) is connected to the second multiplier array (72), and is used for dividing the results of the second multiplier array (72) into four groups, adding the data of each group separately and right-shifting the sums;

the third addition shift unit (83) is connected to the third multiplier array (73), and is used for dividing the results of the third multiplier array (73) into four groups, adding the data of each group separately and right-shifting the sums;

the fourth addition shift unit (84) is connected to the first multiplier array (71), the second multiplier array (72) and the third multiplier array (73), and is used for dividing the results of these multiplier arrays into eight groups, adding the data of each group separately and right-shifting the sums;

the fifth addition shift unit (85) is connected to the first addition shift unit (81) and the second addition shift unit (82), and is used for adding the results of the first addition shift unit (81) and the second addition shift unit (82);

the sixth addition shift unit (86) is connected to the third addition shift unit (83), and is used for adding the results produced by the third addition shift unit (83) in two adjacent cycles;

the seventh addition shift unit (87) is connected to the fourth addition shift unit (84), and is used for adding the results produced by the fourth addition shift unit (84) in three adjacent cycles;

the eighth addition shift unit (88) is connected to the first multiplier array (71), and is used for dividing the results of the first multiplier array (71) into eight groups, adding the data of each group separately and right-shifting the sums.

7. The transform coder according to claim 6, characterized in that the number of bits by which each addition shift unit right-shifts the sum of each group of data is determined by the transform block size and by whether the row or the column transform is being performed: for the 32-point DCT, the right shift is 4 bits in the row transform and 11 bits in the column transform; for the 16-point DCT, 3 bits in the row transform and 10 bits in the column transform; for the 8-point DCT, 2 bits in the row transform and 9 bits in the column transform; and for the 4-point DCT and DST, 1 bit in the row transform and 8 bits in the column transform.

8. The transform coder according to claim 1, characterized in that the register array transposition submodule (9) consists of a register array of 16 rows and 16 columns divided into three regions, namely a 4×4 transposition region (91), an 8×8 transposition region (92) and a 16×16 transposition region (93), which perform the 4-point, 8-point and 16-point transpositions respectively; the 4×4 transposition region (91) consists of the 4×4 registers in the upper-left corner of the register array, the 8×8 transposition region (92) consists of the 8×8 registers in the upper-left corner of the register array, and the 16×16 transposition region (93) consists of the entire register array; the input data of each of the three regions enter from the left side of the region and are shifted one register to the right in each clock cycle; after the data input is complete, the shift direction of the registers changes to bottom-to-top and the data are shifted upward in each clock cycle, which completes the transposition, and the transposed data are output from the top.

9. The transform coder according to claim 1, characterized in that the RAM access transposition submodule (10) comprises a RAM unit (101) and an input/output address control unit (102); the RAM unit (101) consists of 8 memories, each connected to the one-dimensional DCT/DST module (1); the input/output address control unit (102) is connected to the address terminal of each memory and controls the input and output addresses of each memory, so that the 32-point DCT row-transform results of the one-dimensional DCT/DST module (1) are stored in the 8 memories by row and then read out by column, which completes the transposition of the 32×32 data.
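The shift-right-then-shift-up behaviour of the register array in claim 8 can be simulated with a small model. The sketch below is illustrative only: the function name is hypothetical, one input vector is assumed to enter the left column per clock, and the top row is read right-to-left on output, which is an assumption about the output wiring made so that the elements appear in natural order.

```python
def transpose_by_shifting(rows):
    """Transpose an n x n block with an n x n register array (illustrative model)."""
    n = len(rows)
    regs = [[0] * n for _ in range(n)]

    # Input phase: each clock, one vector enters at the left column and the
    # existing contents of every register row move one position to the right.
    for vec in rows:
        for i in range(n):
            regs[i] = [vec[i]] + regs[i][:-1]

    # Output phase: each clock, the top register row leaves the array and the
    # remaining contents move one position upward.
    out = []
    for _ in range(n):
        top = regs[0]
        out.append(top[::-1])          # newest data sits leftmost, so read right-to-left
        regs = regs[1:] + [[0] * n]
    return out


block = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(transpose_by_shifting(block))
```

The same routine applied to a 4×4, 8×8 or 16×16 block corresponds to using the transposition regions (91), (92) or (93); input takes n clocks and output takes another n clocks.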