A FAST MULTIPLIERLESS TRANSFORM
This application claims the benefit of U.S. Provisional Patent Application No.
60/124,746, filed on March 17, 1999, which is incorporated herein by reference in
its entirety.
BACKGROUND OF THE INVENTION
Mathematical transforms (e.g., the fourier transform, the discrete cosine
transform, the wavelet transform, etc.) have found application in signal processing
computer algorithms, particularly the compression and coding of speech, audio,
image and video data. Such transforms are used in the currendy popular image
compression standard referred to as "JPEG" developed by the Joint Photographic
Experts Group, as well as in high-performance video coding standards such as MPEG
and H.263. These operate by dividing up an image or video into sections or blocks,
performing transforms on the data in the blocks, and compressing or coding the
coefficients resulting from the transform.
The JPEG standard, for example, operates by dividing an image into 8x8
blocks of pixels (i.e., each block is an array of 64 pixels: 8 pixels in height and 8 pixels
in width). A mathematical transform known as the "discrete cosine transform
(DCT)" is calculated for each block in the image to produce a data set of coefficients.
A variable length code-based compression technique is then used to compress the
data set and store the compressed data stream in an output file. The output file can
then be decompressed to recover the DCT coefficients, the inverse DCT transform
calculated, and the resulting blocks of pixels displayed to represent the original form
of the image.
Mathematically, the DCT transforms the blocks of pixels from the spatial
representation to the frequency representation. In the frequency representation, the
image or video information can be divided into frequency components, some of
which are more important than others. The compression algorithm selectively
quantizes or discards the frequency components that do not adversely affect the
reconstructed picture contents. In this manner, compression is achieved.
From a statistical signal processing standpoint, the DCT is a robust
approximation to the optimal discrete-time Karhunen-Loeve transform (KLT) of a
first-order Gauss-Markov process with a positive correlation coefficient p when p —»
1. The KLT is optimal in the sense that it provides the most accurate representation
of a signal by producing transform coefficients that are the eigen values of the input's
autocorrelation matrix, and as known in the art, the eigen values are the optimum
achievable result. In other words, the KLT is optimal in the energy compaction
sense, i.e., among unitary transforms, the KLT packs signal energy into the fewest
number of coefficients. However, the KLT is signal-dependent, therefore, it is
computationally complex and expensive to implement. The DCT has proven to be a
much better alternative in practice in that it is signal independent, has linear phase,
has real coefficients, and can be implemented using fast algorithms.
The DCT coefficients, X[0] through X[M-1], of an M x M transform (i.e., a
transform having M2 individual signal samples at its input to be processed row by row
as M basis functions) and the input image pixels, x[0] through x[M-l], are related by
the following equations:
M-\
Tim
X[m] = a ^ x[n] cos (2« + 1) 0 < m < M - 1, n=0 2M
with
It has a closed-form expression and it is orthogonal. Orthogonality is desirable in a
transform inasmuch as there is no redundancy among the M transform bases (e.g.,
they are orthogonal to each other). In the matrix format, we can represent the
transformation of an input, x, to a transform coefficient, X, as X = Px, and the
reconstruction of the transform coefficient back into the input x as x = PrX, where P
is the DCT transform matrix and Pτ is its inverse transform matrix. Exploiting the
symmetry of the coefficients cos [(πm/2M) (2n + 1)] in the equation above, the
functions performed by the DCT transform matrix P can be represented by a series of
±1 butterflies 12 and rotation angles 14, as illustrated in Fig. 1. The ±1 butterflies
are a series of crossings which process one pair of inputs at a time (i.e., x[0] through
x[7]) by adding and subtracting the inputs in pairs. For example, the first butterfly
encountered by the data set x[0] through x[7] shown in Fig. 1 operates on the data
pair x[0], x[7]. In the top branch of the first butterfly, the term x[0] is added to the
term x[7], while in the bottom branch, the term x[7] is subtracted from x[0].
Similarly, in the second butterfly, the term x[l] is added to the term x[6], while in
the bottom branch the term x[6] is subtracted from x[l]. In an 8 x 8 transform, the
DCT initially performs four such butterflies (i.e., crossings) for a total of four discrete
addition operations and four discrete subtraction operations. The DCT rotation
angles are a series of multiplication operations carried out on the results achieved by
the ±1 butterflies. As depicted in Fig. 1, "C π/4" signifies a multiplication operation
(i.e., multiplying an input by cosine π/4). Similarly, "S π/4" signifies a multiplication
operation (i.e., multiplying an input by sine π/4). This process results in a fast DCT
implementation. Using this method, eight DCT coefficients N[0] - X[7] can be
computed using 13 multiplication operations and 29 addition operations.
Despite all of the above advantages of the DCT, there is still room for
improvement. For example, the DCT is a floating-point transform, i.e., the cos and
sin multipliers in the 5 rotation angles {π/4, 3π/8, π/4, 7π/16, 3π/16} are real
numbers with infinite precision. Hence, the DCT cannot map integers to integers
losslessly inasmuch as the product of the inputs, as processed by the ±1 butterflies,
and the DCT rotation angles result in floating-point outputs that cannot be
represented in binary form without truncation, thereby resulting in the loss of input
data. More importantly, floating-point implementations are relatively slow, require
too much hardware space, and consume too much power.
To combat these problems, most implementations of the DCT are based on
integer arithmetic by scaling up the floating-point multipliers by very large factors
(e.g., multiplying by a factor of 1000 or greater to rid the calculation of floating
decimal points). Unfortunately, however, this ad-hoc approximation method still has
problems; for instance, it does not reduce the complexity of the circuit and it still has
truncation errors. Moreover, if more computational accuracy is needed, the scaling
factors must be very large, thereby adding more complexity to the implementation.
SUMMARY OF THE INVENTION
A preferred embodiment of the invention provides an M x M multiplierless
perfect reconstruction block transform (i.e., perfect reconstruction is realized if the
invention is utilized for both analysis and synthesis, otherwise, if the invention is used
on only one end, near-perfect reconstruction results) with symmetric/antisymmetric
basis functions, i.e., linear phase filters. In a preferred embodiment, a cascade of ±1
butterfly stages is provided followed by two invertible matrices. Each invertible
matrix is made up of a cascade of lifting steps and scaling factors (αt). The scaling
factors can be folded into the quantization stepsizes. A forward binDCT (e.g., for
signal analysis), can be implemented using a cascade of ±1 butterflies and dyadic
lifting steps. Furthermore, an inverse binDCT (e.g., for signal synthesis) can be easily
realized using lifting steps of reverse order and inverted polarity, cascaded with ±1
butterflies.
In accordance with an embodiment of the invention, all of the lifting
coefficients have been chosen to be dyadic rational numbers. Each dyadic lifting step
can be constructed by a simple combination of shift-and-add operations.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other advantages, features, and applications of the invention will be
apparent from the following detailed description of the invention which is provided
in connection with the accompanying drawings in which:
Fig. 1 is a signal flow graph which depicts a conventional fast implementation
of an 8 x 8 floating-point DCT;
Figs. 2(a)-2(c) are signal flow graphs which depict variations on lifting steps
used in the analysis and synthesis of data samples;
Fig. 3 is a signal flow graph which depicts a general solution in accordance
with the invention for a perfect reconstruction block transform whose basis functions
have linear phase;
Fig. 4 is a signal flow graph which depicts how an invertible matrix U can be
decomposed into a cascade of lifting steps and scaling factors (α.j) in accordance with
a preferred embodiment of the invention;
Figs. 5-11 depict a forward binDCT in accordance with preferred
embodiments of the invention;
Fig. 12 depicts frequency responses of a binDCT in accordance with a
preferred embodiment of the invention;
Fig. 13 depicts a transform and inverse transform matrix of the first
embodiment of a binDCT;
Fig. 14 depicts peak signal-to-noise ratios (PSNR) realized for two
embodiments of the binDCT; and
Fig. 15 depicts a simplified block diagram of a transmission system
incorporating analysis and synthesis filters which employ a binDCT block transform
in accordance with preferred embodiments of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Preferred embodiments and applications of the invention will now be
described with reference to Figs. 2-15. Other embodiments may be realized and
structural or logical changes may be made to the disclosed embodiments without
departing from the spirit or scope of the invention. Although the invention is
particularly described as applied to the analysis and synthesis of image and video data
samples, it should be readily apparent that the invention may be applied to any other
data samples or sets having the same or similar problems.
Instead of factoring the DCT into a cascade of rotation angles, as in
conventional practice in the art, the invention utilizes a series of lifting steps (also
known as "shear" operations or "ladder" structures). The basic principle behind a
lifting step is demonstrated in the signal flow graphs depicted in Figs. 2(a)-2(c). In
accordance with a preferred embodiment of the invention, the denominator of the
lifting coefficient /, is deliberately chosen to be dyadic in that it is a power of 2 (i.e., a
rational number that can be written in the form of l/2m, k, m e -Z) for easy binary
implementation. Fig. 2(b) shows that the lifting step can achieve perfect
reconstruction of a signal even with the presence of a nonlinear truncation operator
such as a "floor" (which rounds down to the nearest integer), a "round" (which
rounds either up or down to the nearest integer), or a ceiling" (which rounds up to
the nearest integer). Fig. 2(c) illustrates how this property is exploited to implement
the lifting step very efficiently using a combination of addition operations and bit-
shift operations. The method is always feasible as long as the lifting coefficient is
dyadic. For example, a lifting coefficient of 3/8 signifies "multiplying" an input
value by 3 followed by a "division" by 8 implemented by a right-shift of 3 binary
places (i.e., 23 = 8 in the denominator). It should be noted that the term
"multiplying" is actually inaccurate since a series of bit-shift and addition operations
are performed upon the signal to achieve the same result without performing the
actual multiplication operation.
As known in the art, binary multiplication may be simplified through a series
of bit-shift and addition operations. Generally, in order to implement a
multiplication operation, a multiplicand (i.e., a number that is to be multiplied by
another) is multiplied by a multiplier (i.e., a number by which a multiplicand is to be
multiplied). When multiplying binary numbers, bit-shift and addition operations
may be implemented to effectively perform a multiplication without the same level of
complexity required for multiplication.
A general description of binary bit-shift and addition operations, as substitutes
for multiplication is found in U.S. Patent No. 3,691,359 to Dell et al., which is
incorporated herein by reference.
For example, the numerator 3 of the lifting coefficient 3/8 signifies
performing the following steps: a) a left-shift of 1 binary place upon the input signal
to the lifting step; and b) after a left-shift of one place is performed (e.g., the decimal
value 5 is represented as 101 in binary; after a left-shift, 101 becomes 1010), the
result is then added to the original value of the input signal (e.g., 101), resulting in a
lifting step output of 1111 binary (or decimal value 15). A multiplication by 3 has
just been performed with just one bit-shift and one addition operation. In
accordance with a preferred embodiment of the invention, each of the DCT's
rotation angles (e.g., 14 of Fig. 1) are replaced with a minimal number of dyadic
lifting steps, yet the resulting transform still retains most of the DCT's superior
performance.
The DCT and the novel transformation referenced herein as "binDCT" are
both M x M block transforms. For purposes of illustration only, an 8 x 8 block will
be described herein. At the forward transform in the encoder, 8 input samples x[0] -
x[7] are processed at a time, yielding 8 transform coefficients X[0] - X[7] that can
then be quantized and encoded (e.g., entropy encoded) in a manner known in the
art. In the decoder, the 8 encoded transform coefficients are entropy decoded,
inverse quantized, and then fed to the inverse transform module. For ID signals
such as speech and audio, the input stream can be divided into 8 -sample segments,
each is then processed independently. For 2D and 3D signals (images and videos
respectively), the input can be segmented into non-overlapped 8 x 8 blocks or 8 x 8 x
8 cubes. Each block or cube can be processed independently in separable fashion.
For example, the ID transformation can be applied to every row, then to every
column, to obtain a 2D transformation.
As depicted in Figs. 5-8, each of which depicts a preferred embodiment of the
invention, the 8 input samples labeled from Λ;[0] to x[7] are first processed in four
pairs of two by the ±1 butterflies 12. Each of the four ±1 butterflies 12 takes a pair of
data samples and computes their sum as well as their difference.
As depicted in Fig. 3, each of the resulting sums may be arranged in the 4
upper branches Ul, U2, U3, U4, which will be processed by invertible matrix U to
produce the 4 even-indexed transform coefficients X[0], X[2], X[4], X[6] associated
with the transform's 4 symmetric basis functions. Likewise, all of the resulting
differences are fed into the 4 lower branches VI, V2, V3, V4, that will be further
transformed by invertible matrix V to produce the 4 odd-indexed transform
coefficients X[l], X[3], X[5], X[7] associated with the transform's 4 antisymmetric
basis functions.
In accordance with an embodiment of the invention, the outputs of the
butterflies are processed significantly different by the binDCT, as compared with the
DCT. For instance, whereas the DCT relies on rotation angles (e.g., 14 of Fig. 1)
which are depicted below in matrix form as
cos#, - sin#, cos θ, sin θt or sinfi cos# sin#, - cos#
the binDCT, in accordance with a preferred embodiment of the invention, relies on
lifting steps, or cascades of shears (e.g., 16 of Figs. 5-8), a general matrix
representation of which (to be described more fully below) is depicted below as
1 /, 0 1 /, + ι 1
Since a rotation angle is orthogonal, its inverse is simply its transpose, e.g.,
cos#, - sin6> T cos#, - sin6> cos θ, sin θ, cos θ, - sin6ϊ 1 0 sin#, cos6> sin θ, cos θ, - sin#, cos#, sin θ, cos θ, 0 1
Inverting a lifting step is achieved simply by inverting its polarity, i.e.
This is always true as long as the diagonal elements of the matrix are unity with the
same polarity (both +1 or -1). In the case that the diagonal elements have opposite
polarity, the inverse of the lifting step is simply itself, i.e.,
Therefore, transforms constructed from series of lifting steps are still perfectly
invertible. Since the multiplication operations associated with the DCT rotation
angles require floating-point arithmetic, and since it is very difficult to choose
rotation angles that yield integer multiplications, in accordance with a preferred
embodiment of the invention, the orthogonality associated with the DCT may be
sacrificed in the binDCT in favor of lower complexity. Specifically, lifting steps are
used to replace the DCT's rotation angles, preferably having the denominator of the
lifting coefficients /, chosen to be dyadic, both the forward and the inverse transform
can be implemented without using any multiplication (or division) arithmetic
operations, particularly floating point multiplication. Fig. 9 depicts an embodiment
of an inverse binDCT which employs inverted lifting steps having dyadic lifting
coefficients.
In accordance with another aspect of the invention, specific choices of lifting
coefficients /, were made to result in high-performance transforms.
In each of the 4 upper branches of the binDCTs depicted within Figs. 5-8, the
DCT's 2 butterflies (12a as in Fig. 1) are retained; however, the two upper branch
DCT rotation angles of π/4 and 3π/8 can be replaced with one of the following sets
of lifting steps depicted below in matrix form.
For example, the upper branch DCT rotation angle of π/4 (14a as seen in
Fig. 1) is replaced with a pair of lifting steps 16a (of Figs. 5-8). For illustrative
purposes, consider an input A on the upper branch of the first lifting step of 16a and
an input of B on the lower branch of the first lifting step of 16a. Consequently, the
output on the upper branch of the first lifting step is A+B and the output on the
lower branch of the first lifting step is B. Similarly, the input on the lower branch of
the second lifting step is B and the input for the upper branch of the second lifting
step is A+B. Consequently, the output for the upper branch of the second lifting
step is A+B and the output for the lower branch of the second lifting step is V2(A+B)-
B. In accordance with a preferred embodiment of the invention, these lifting steps
can be implemented in software and/or hardware using a series of binary shifts and
adds on the input data to produce the appropriate transform coefficients (e.g.,
X[0]...X[7]) representative of the original image data samples.
While any one of the above 3 matrix sets (or other equivalent sets apparent to
those of ordinary skill in the art) may be employed to obtain these output results
(albeit, with differing levels of accuracy), the first matrix set set forth above will be
derived, by way of example, as a replacement for the upper branch π/4 rotation angle
in Fig. 1.
One way in which respective outputs of A+B and B for the upper and lower
branches of the first lifting step may be obtained is to multiply the input vector
A B
by a matrix which represents the first lifting step of 16a:
1 1
0 1
Performing matrix multiplication results in the respective upper and lower branch
outputs of A+B and B in the first lifting step.
One way in which respective outputs of A+B and V2(A+B)-B for the upper and
lower branches of the second lifting step of 16a may be obtained is to multiply the
vector input to the second lifting step
A + B B
by a matrix which represents the second lifting step of 16a:
1 0
1/2 -1
Using a method similar to that described above for the upper branch DCT
rotation angle of π/4 in the graph of Fig. 1, the upper branch DCT rotation angle of
3π/8 (14b in Fig. 1) may be replaced with the following sets of lifting steps (or
equivalent lifting steps apparent to those skilled in the art) depicted below in matrix
In each of the 4 lower branches of the binDCTs, depicted within Figs. 5-8,
which yield the odd-indexed transform coefficients (i.e., X[l], X[3], X[5], X[7]),
three DCT rotation angles (i.e., π/4 (14d), 7π/16 (14e), and 3π/16 (14f, as shown
in Fig. 1 ) are also replaced with lifting steps, particularly those lifting steps having
dyadic denominators in their lifting coefficients. It should be noted that special care
must be taken in choosing a lifting coefficient to replace rotation angle π/4 (14d) in
the lower branches since it precedes the remaining angles in the structure. The
following choices of lifting steps (depicted in matrix form) are preferable to yield
better coding performances although other equivalent lifting steps may be apparent
to those skilled in the art:
π . -1 5/8 1 0 -1 3/8 1 0 1 -7/16 -1 7/16 1 0 1 -7/16 4 ' 0 1 3/8 1 0 1 11/16 1 0 1 0 1 3/4 1 0 1
Next, the DCT rotation angle 7π/16 (14e shown in Fig. 1) may be replaced with
one of the following sets of lifting steps (or equivalent steps) depicted below in
matrix form:
"1 -1/8" 1 0" "l - 3/16" 1 0 1 1 1 0
, or
16 ' 0 1 3/16 1 _ _0 1 ■ 13/16 1 0 1 ■ 13/16 1
Finally, the DCT rotation angle 3π/16 (14f shown in Fig. 1) may be replaced with
one of the following sets of lifting steps (or equivalent steps) depicted below in
matrix form:
Note that each embodiment of the binDCT still employs the following ±1 butterfly
(of Fig. 1) which may be represented in matrix form as:
1 1
1 - 1
and whose inverse is itself scaled down by a factor of two. To recover xfi] exactly on
the synthesis end, the outputs of the inverse binDCTs have to be attenuated by Vt,
which is preferably implemented in software and/or hardware as a right-shift of the
data two binary places. It should be noted that in lossless coding, the ±1 butterfly is
not very efficient since it amplifies the signal by 2. To compensate for this effect, in
accordance with another aspect of the invention one or more ±1 butterflies (e.g., 12
in Figs. 5-8) may be replaced with either of the following sets of lifting steps (or
equivalent steps) as depicted below in matrix form:
As an example, Figs. 10 and 11 are respectively provided to illustrate lossless
versions of the embodiment of the binDCT shown in Fig. 5 and its inverse binDCT
in which the ±1 butterflies (e.g., 12 of Fig. 5-8) are replaced with lifting steps having
dyadic denominators in their lifting coefficients. Similar modifications may be made
to each embodiment (e.g., as respectively depicted in Figs. 5-8) without departing
from the spirit or scope of the invention.
Fig. 12 depicts a graphical representation of frequency response for a forward
and inverse binDCT, respectively, in accordance with a preferred embodiment of the
invention.
Fig. 13 illustrates a table of coefficients for a transform matrix which may be
used to produce the same or substantially the same coefficients produced by binDCT,
in accordance with embodiments of the invention (e.g., as in Fig. 5). The top
portion of the chart 130 depicts transform coefficients used for signal analysis. The
bottom portion of the chart 132 depicts transform coefficients used for signal
synthesis. Each portion of the table 130, 132 is obtained by first converting each
±1 butterfly and each lifting step within binDCT, e.g., the preferred embodiment of
the invention shown in Fig. 5, into matrix form. Next, multiplying each of the
matrices within a basis function (e.g., x[0]) to respectively produce results in the 8
rows representative of the 8 output coefficients produced using binDCT. That is, the
Fig. 13 table provides a matrix of transform coefficients, h0 through h7, which are M
basis functions of an M x M transform consistent with a preferred embodiment of the
invention.
Fig. 14 depicts a table showing empirical data of peak signal-to-noise ratios
(PSNR) between reconstructed and original images. The table depicts results for
three separate test images represented in 512 x 512 8-bit grayscale; "Lena,"
"Goldhill" and "Barbara." The table depicts the PSNR for the DCT in the left
column of each test image and offers a comparison with two preferred embodiments
of the invention (of Figs. 5 and 7, respectively). As can be seen in Fig. 14, the
performance of the binDCT is very similar to that of the DCT.
Referring now to Fig. 15, a transmission system is depicted in which each
embodiment of the binDCT may be employed within a respective analysis block 150,
152, and each inverted binDCT may be employed within a respective synthesis block
154, 156. At the analysis side, an input signal Y(z) is processed by a pair of analysis
block 150, 152, each employing an embodiment of the binDCT block transform in
accordance with the invention. The resulting sub -band signals are then decimated by
a factor of two. At the synthesis side, the signals Y0(z), Yi(z) are interpolated, and
transformed by analysis blocks 154, 156, each of which employs an embodiment of
the inverse binDCT block transform in accordance with the invention. The
synthesized signals are then added to produce the output signal Y(z)'. As should be
readily apparent, the application of the invention is not limited to a transmission
system and may be employed for other applications including, but not limited to,
storing the coded data and reproduction of the data using any given medium.
Furthermore, although an 8 x 8 transform has been depicted, the implementation of
the invention is not limited to 8 x 8 transforms and, accordingly, may be used to
process matrices of any dimension without departing from the spirit or scope of the
invention.
The invention provides a simplified block transform that can be implemented
using only bit-shift and add operations. The transform replaces DCT rotation angles
with a series of lifting steps preferably having dyadic denominators in their lifting
coefficients specifically chosen to allow the binDCT to operate with substantially the
same degree of accuracy as the DCT, but without the DCT's level of complexity.
The result is a simpler, faster, transform process which retains all of the practical
benefits of the DCT.
The novel block transform of the invention, "binDCT," offers at least the
following advantages:
• The binDCT has a fast, elegant implementation utilizing only shift-and-
add operations. No multiplication is needed. For example, eight
transform coefficients can be computed using only 14 bit shift operations
and 31 addition operations (13 floating point-multiplication operations
and 29 addition operations are required for the DCT);
• The binDCT can map integers to integers with exact reconstruction. It
also allows a unifying lossy/lossless coding framework;
• In software implementation, the binDCT is at least 3 - 4 times faster than
the floating-point DCT. In hardware implementations, the binDCT may
be approximately 10 times or more faster than the floating-point DCT;
• The multiplierless property of the binDCT allows efficient very large scale
integrated circuit (VLSI) implementations in terms of both chip area and
power consumption;
• The binDCT approximates the DCT very closely. Perceptual quantization
matrices and coding strategies designed specifically for the DCT can be
applied to the binDCT immediately without any complicated
modification; and
• The cascade of a well-tuned forward binDCT followed by an inverse DCT
(or a forward DCT followed by an inverse binDCT) produces a reasonable
near-perfect-reconstruction system.
The binDCT provides very high coding performance. For instance, the
coding gain (a popular measure of a transform's energy compaction property) of
several binDCT versions may range in certain examples from 8.77 to 8.82 dB,
whereas the DCT has a coding gain of 8.83 dB. In image coding experiments, the
binDCT may be approximately 0.1 - 1.0 dB below the DCT in the signal-to-noise
ratios of the reconstructed images in certain examples. The binDCT may be found
to offer higher visual reconstructed quality than the DCT.
While preferred embodiments of the invention have been described and
illustrated, it should be apparent that many modifications can be made to the
invention and the invention's application without departing from its spirit or scope.
Accordingly, the invention is not limited by the foregoing description or drawings,
but is only limited by the scope of the appended claims.