IMAGE COMPRESSION AND TRANSMISSION
This invention relates to a method and apparatus for compression of images, in particular moving images such as video sequences and the like, for transmission across a communication network.
Digital video has been developed to a great extent over recent years and, in view of the large range of applications to which it lends itself, particularly with the very high uptake of and growth in personal computers and workstations and the popularity of the global Internet, substantial research and development has been dedicated to the development of techniques for compression, decompression and transmission of video. In general, the aim is to improve the efficiency of compression as well as the effectiveness of the transport medium.
From a technical perspective, the main aim is to reduce both storage and transmission costs, i.e. to improve coding efficiency. However, one of the main concerns is the inherent trade-off between coding efficiency and video fidelity. Industry standards such as H.261 and MPEG define standard formats for compressed video data (but not implementations), such that video fidelity can be improved as better codecs are developed without having to redefine the standard. Further, the defined standards enable a range of bitrates to be supported, so that the quality of the reproduced video becomes a function of the cost of the hardware that the user can afford.
Thus, MPEG specifies both a syntax and a semantics for a legal video bitstream at the encoder stage, and a definition for
synchronisation and demultiplexing of the bitstream into its constituent parts (i.e. video, audio and other data) at the decoder stage, the latter permitting the video playback quality to scale with the abilities of the target hardware.
The video algorithms defined by MPEG are based on a class of video compression algorithms that aim to maximally reduce the natural spatio-temporal redundancy both within and between video frames in order to deliver compression. A key feature to exploit in such redundancy elimination is the motion of rigid bodies in a sequence of frames. Clearly, by encoding the relevant object once, and subsequently transmitting merely its spatial translation, much irrelevant data is eliminated from the encoding process. Algorithms which attempt to achieve this effect are known as motion compensation algorithms.
Substantial work has been carried out to develop sophisticated models for motion compensation. Two main classes of such algorithms have become predominant, namely block-matching algorithms (BMA) which look at the translation of groups of pixels, and pel-recursive algorithms which are concerned with individual pixel translations.
By their very nature, pel-recursive algorithms produce better video fidelity. However, such algorithms are also more expensive to compute. BMA routines have therefore become the de facto standard in a great majority of modern implementations, and a substantial amount of research and development has been put into improving BMA over the first basic procedure outlined back in 1981. Such improvements aim
to reduce BMA's computational expense, as well as increase its overall quality.
It is well known that video sequences contain both intra-frame (spatially local) and inter-frame (temporally global) correlations, and methods to exploit this redundancy have been considered since the early 1970s. The earliest method considered interframe ('delta-coded') sequences, where the intensity difference between pels in successive frames was coded, and this method provided the basis for all modern predictive coding techniques. The basic idea is to look at the following two variables:
* p(x, y; t): the value of a pixel at location (x, y) at time t; and
* p_pred(x, y; t): the predicted value of the same pixel.
The difference between the two, ε = p_pred(x, y; t) - p(x, y; t), is known as the error signal or residual that is to be transmitted to the receiver, where it is combined with p_pred(x, y; t) to reconstruct p(x, y; t). Clearly, the better the predicted value, the smaller the error signal or residual (ε). Conversely, compression is optimised by minimising the residual. The overall compression can be further improved if the error signal is transformed (the Discrete Cosine Transform, or DCT, is the usual preferred approach), leading to the so-called hybrid coding techniques.
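The delta-coding idea above can be sketched as follows; this is a minimal illustration assuming the previous frame is used as the predictor, and the function names are illustrative rather than taken from the source:

```python
import numpy as np

def delta_code(prev_frame, curr_frame):
    """Error signal (residual): epsilon = p_pred - p, following the sign
    convention above, with the previous frame serving as the predictor."""
    return prev_frame.astype(np.int16) - curr_frame.astype(np.int16)

def reconstruct(prev_frame, residual):
    """Decoder side: combine p_pred with the residual to recover p."""
    return (prev_frame.astype(np.int16) - residual).astype(np.uint8)
```

Since the residual of a slowly-changing scene is mostly near zero, it compresses far better than the raw frame, which is precisely what the hybrid coders exploit before applying the DCT.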
Even further compression results if the motion of rigid bodies is taken into account. For still frames, the position vector of an object, represented as r = (x, y), does not change between frames. However, where there is motion, some translated r' = (x', y') makes a better predictor. Hence, the motion vector Δr = r - r' can be encoded and transmitted in addition to ε, at least for the pels for which motion can be identified. This is known as motion-compensated interframe predictive coding. The evaluation of Δr at the encoder is called motion estimation, whereas, at the decoder end, the exploitation of this information in pel reconstruction is called motion compensation.
Various motion compensation algorithms have been proposed and, as stated above, the block matching algorithm (BMA) remains the most widely used, primarily for the simplicity of its concept and its hardware realisability. The BMA typically begins by partitioning a frame of video pixels into non-overlapping macroblocks of size N x N. Each macroblock in the frame being encoded (the 'current block') is compared with potential matches ('candidate blocks') in the previous, or reference, frame. For a maximum vector displacement of w pixels, a given macroblock is searched within a search window of size (N+2w) x (N+2w), as shown in Figure 1 of the drawings. The range of the motion vector is constrained by controlling the size of the search window. The displacement is taken to be that of the comparison which maximises or minimises a function, a distortion measure, representing the matching criterion. Many such functions have been proposed, such as the cross-correlation function (CCF), the mean square error (MSE) and the mean absolute error (MAE), together with search strategies such as the cross-search algorithm (CSA).
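A full-search BMA with the MAE criterion can be sketched as follows; this is a minimal, unoptimised illustration, and the function name and parameter defaults are illustrative rather than taken from the source:

```python
import numpy as np

def full_search_bma(ref, cur, bx, by, N=16, w=7):
    """Exhaustively match the N x N current block at (bx, by) in `cur`
    against every candidate inside the (N+2w) x (N+2w) search window
    of the reference frame, minimising the mean absolute error (MAE)."""
    block = cur[by:by + N, bx:bx + N].astype(np.int32)
    best_err, best_mv = None, (0, 0)
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + N > ref.shape[1] or y + N > ref.shape[0]:
                continue  # candidate falls outside the reference frame
            cand = ref[y:y + N, x:x + N].astype(np.int32)
            mae = np.abs(block - cand).mean()
            if best_err is None or mae < best_err:
                best_err, best_mv = mae, (dx, dy)
    return best_mv, best_err
```

The (2w+1)² candidate positions examined here are exactly what makes full-search BMA expensive, which motivates the hashing approach described later.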
As stated above, block matching algorithms represent a tradeoff between block reconstruction accuracy and
hardware/computational expense vis-à-vis pel-recursive techniques. Furthermore, the magnitude of the motion vectors generated by means of block-matching can be relatively large, which is counter-productive within a compression strategy, especially in the case where the video data is to be transmitted at a relatively low bitrate, in which case the proportion of the transmission burst assigned to motion vectors can become disproportionate. It is for this reason that residuals from motion estimation are first compressed themselves (by means of a lossy transform encoder and an entropy encoder) before transmission. This encoding scheme has been adopted by the MPEG, H.261 and H.263 standards.
In spite of the trade-off in quality which is inherent in block-matching algorithms, the search for good motion vectors adds substantial computational overhead to the encoding process. It takes substantially longer to perform motion estimation than it does to perform motion compensation. In fact, the block-matching process is the most time-consuming part of the entire encoding process. Thus, in encoding MPEG video, the algorithms must perform a tight balancing act between the conflicting requirements of short encoding times, high image quality and high compression ratios. Encoding times can be reduced by reducing the search area for good motion vectors, but this has a direct impact on the image quality. Further, image quality varies inversely with the compression ratio.
The problem is even more severe for H.261, which is intended for video conferencing applications and the like, in which case the encoding process takes place on-line.
We have now devised a technique which overcomes the problems outlined above and provides a method and apparatus for encoding image data which substantially reduces motion estimation times relative to the prior art techniques identified above. In accordance with a first aspect of the present invention, there is provided a method of compressing image data comprising the steps of generating a set of motion vectors representative of one or more image frames, generating, by means of a predetermined hash function, a set of hash values corresponding to said motion vectors, and storing as a codebook said hash values in the form of a table or array.
Also in accordance with the first aspect of the present invention, there is provided an apparatus for compressing image data comprising means for generating a set of motion vectors representative of one or more image frames, means for generating, using a predetermined hash function, a set of hash values corresponding to said motion vectors, and codebook means for storing said hash values in the form of a table or array.
In accordance with a second aspect of the present invention, there is provided a method of compressing image data comprising the steps of generating a set of motion vectors representative of one or more image frames, storing as a codebook data representative of said motion vectors in the form of a table or array, and using vector quantisation to index the data stored in said table or array for retrieval of said data by decoding means .
Also in accordance with the second aspect of the present invention, there is provided an apparatus for compressing image data comprising
means for generating a set of motion vectors representative of one or more image frames, codebook means for storing data representative of said motion vectors in the form of a table or array, and vector quantisation means for indexing the data stored in said table or array so that it can be retrieved.
An exemplary embodiment of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram illustrating a macroblock and search window used in a BMA compression technique according to the prior art;
Figure 2 is a schematic diagram illustrating the integration of a vector quantiser codebook with a hash table, the diagram showing a hash table with buckets, each bucket having M slots;
Figure 3 is a schematic diagram illustrating an exemplary embodiment of hardware for Vector Quantised Hashing (VQH), the name of our proposed algorithm for motion estimation; and
Figure 4 is a graph showing the plot of P_BMA = N²A².
The concept of a look-up table is well-known in engineering, and may be defined as a set of (name, attribute) pairings for storing data items. There are three basic operations which may be required to be performed on such a look-up table:
1. Insert a data item
2. Delete a data item
3. Search for a data item
Intuitively, since tables are stored just like arrays, such operations may be expected to cost O(n) for n items. However, in accordance with this exemplary embodiment of the present invention, a better performance can be obtained by the use of a technique known as 'hashing', whereby the search criterion replaces a sequence of operations by a single operation involving the computation of a function known as a 'hash function'.
For the purpose of the present description, assume that the size of the hash table is fixed (i.e. 'static hashing', as opposed to 'dynamic hashing' in which the table size may vary). The address of a data item x stored within the hash table may be computed by evaluating the hash function h(x). Typically, hash tables are partitioned into b 'buckets', with each bucket consisting of s 'slots'. Each slot is capable of storing exactly one data item, and it is often the case that s=1, i.e. each bucket stores just one data item.
The construction of the hash function h(·) is the most crucial aspect of designing a hash table. Not only should h(·) be easy to compute, but it should also ideally generate a unique address within the hash table for each argument. It is not possible for the hash table to hold every possible value of the argument. Hence, it has been found that collisions often occur, i.e. h(x) = h(y) for two distinct data items x ≠ y. Another problem is that of overflow, whereby a data item is mapped by h(·) into an existing bucket that is already full. Ideally, therefore, hash functions should be designed to minimise the possibility of both collisions and overflow.
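The bucket-and-slot organisation, together with the collision and overflow behaviour just described, can be sketched as follows (the class and method names are illustrative, not taken from the source):

```python
class StaticHashTable:
    """Static hash table with b buckets of s slots each."""
    def __init__(self, b=8, s=4):
        self.b, self.s = b, s
        self.buckets = [[] for _ in range(b)]

    def _h(self, x):
        # Toy hash function; a real design minimises collisions and overflow.
        return hash(x) % self.b

    def insert(self, x):
        bucket = self.buckets[self._h(x)]
        if len(bucket) >= self.s:
            raise OverflowError("bucket is full: overflow")
        bucket.append(x)  # a collision merely shares the bucket

    def search(self, x):
        # One hash evaluation plus at most s slot comparisons.
        return x in self.buckets[self._h(x)]

    def delete(self, x):
        bucket = self.buckets[self._h(x)]
        if x in bucket:
            bucket.remove(x)
```

Note that a search costs one hash evaluation plus at most s slot comparisons, rather than the O(n) scan of a plain table.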
It has been found that there are advantages to encoding groups of image samples, as opposed to encoding individual samples. A technique known as vector quantisation (VQ) utilises this finding and offers a way of performing lossy compression along the way.
VQ is essentially the multi-dimensional generalisation of scalar quantisation, as is commonly employed in analog-to-digital conversion processes. In analytical terms, if X is an N-dimensional source vector, then VQ is a mapping such that:
Q : R^N → C

where C is an L-element set such that

C = {Y_1, ..., Y_L}, and Y_i ∈ R^N for i = 1, ..., L. C is usually termed the 'codebook', and the Y_i the 'code vectors'. The VQ operator Q partitions R^N into L disjoint and exhaustive regions {P_1, ..., P_L}, each of which has a single coarse-grained representation.
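The operator Q amounts to a nearest-neighbour search over the codebook; a minimal sketch follows (the function name is illustrative, not taken from the source):

```python
import numpy as np

def vq_encode(X, codebook):
    """Map the N-dimensional source vector X to the index i of its
    nearest code vector Y_i (squared Euclidean distortion), i.e. the
    region P_i of R^N into which Q quantises X."""
    distortions = np.sum((codebook - X) ** 2, axis=1)
    return int(np.argmin(distortions))
```

Only the index needs to be stored or transmitted; the decoder recovers the coarse-grained representation by looking up the same codebook.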
In multi-dimensional signal processing, X may be taken to be a pixel macroblock that is quantised under the operation Q into a finite codebook. The latter is generated once, and a copy is provided to both the encoder and the decoder. It is then sufficient to merely store or transmit the output of the codebook in order to represent any source vector. The technique operates as a pattern matching algorithm. It is well-known in engineering literature and is an integral part of MPEG's repertoire of routines.
In this exemplary embodiment of the present invention, Q is reinterpreted as the hash function h(X) and, together with an appropriately sized two-dimensional array, enables the implementation of a hash table.
Referring to Figure 2 of the drawings, a source vector X is mapped by Q into a bucket, and occupies a unique, but arbitrary, slot position. Each bucket therefore holds all the source vectors that are sufficiently close to the appropriate code vector which is their quantised representation within the source regions P_i. With this interpretation, and assuming that there are no restrictions on the size of the hash table, it is possible to represent the entire domain of source vectors completely accurately, in spite of the fact that Q is usually a dimensionality-reduction operator. In other words, the combination of VQ and a hash table loaded in the manner described above provides a way for non-lossy representation of a source frame. This combined structure will be hereinafter referred to as a 'Vector Quantised Hash Table' or VQHT.
In order to support motion compensation, MPEG classifies video frames into three categories as follows.
1. I(ntra)-frames, which are independently coded without reference to any other frames.
2. P(redicted)-frames, which exploit motion compensation in order to improve compression. A predicted frame is coded with reference to a preceding I- or P-frame.
3. B(idirectional)-frames, which rely upon both preceding and subsequent frames. Such frames use bidirectional interpolation between I- and P-frames, but are not used for coding other frames. They also have the highest compression efficiency.
Furthermore, MPEG specifies two parameters, N and M, which keep a count of the frame distance (i.e. number of frames) between, respectively, two successive I-frames (this interval also being known as a GOP or 'Group of Pictures') and two successive P-frames. Typically, the boundary between GOPs is dictated by a scene cut; hence, N is a function of the number of such cuts in a video. The value of M, however, is not fixed by MPEG, and is left to the discretion of the encoder.
The generation of P-frames is crucial for efficient coding, but is also the most expensive part of MPEG, since motion estimation is directly involved. The decoding process uses a macroblock and a motion vector to reconstruct a P-frame, based on a closest match search of the preceding frame. Note that the use of the word 'preceding' does not imply frame adjacency, since B-frames typically interleave I- and P-frames.
In addition, MPEG does not specify how a closest match should be implemented; encoders have the task therefore of minimising the difference between a predicted and an actual macroblock.
In the following, the concept of performing motion estimation and compensation using the VQHT technique discussed above is described. For simplicity, both forward-predicted and bidirectionally-predicted frames are referred to as P-frames in the following description.
For a given GOP, the process begins by encoding an I-frame (or a P-frame from which a subsequent P-frame is to be deduced) into a VQHT. As described above, this provides a complete and non-lossy representation of an I-frame. From an implementation perspective, encoding involves a two-stage process:
1. Codebook generation, in which a decision is made on the number of bucket entries L in the codebook C. Representative code vectors from the I-frame are computed
(using any of the standard VQ training algorithms) and stored in C. Clearly, the larger the codebook, the less the quantisation error during encoding and look-up. It follows, therefore, that the minimum bound on L should be at least equal to the maximum number of motion vectors that any subsequent predicted frame will require. Thus, video fidelity becomes a function of the codebook size, as well as the size of the hash table.
2. Hash Table loading, in which the VQHT bucket slots are filled up by feeding every possible source vector (macroblock) from the I-frame through the hash function and storing it (together with its co-ordinates) in its appropriate bucket. Bucket slots are filled up sequentially in this manner.
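The hash table loading stage can be sketched as follows, assuming the codebook has already been trained; the function and variable names are illustrative, not taken from the source, and macroblocks are treated as flattened vectors:

```python
import numpy as np

def build_vqht(iframe_blocks, coords, codebook):
    """Load the VQHT: feed every I-frame macroblock through the hash
    function (vector quantisation against the codebook) and store it,
    together with its co-ordinates, in the corresponding bucket."""
    buckets = [[] for _ in range(len(codebook))]
    for block, xy in zip(iframe_blocks, coords):
        distortions = np.sum((codebook - block) ** 2, axis=1)
        buckets[int(np.argmin(distortions))].append((xy, block))
    return buckets
```

Storing the co-ordinates alongside each block is what later allows a motion vector to be derived from a simple lookup.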
Clearly, the above two processes must be performed exactly once for a given GOP. The resulting VQHT structure must be made available to the encoder.
To encode a subsequent P-frame, a set of motion vectors are required for those macroblocks which will be predicted during
the decoding stage. The generation of motion vectors using the VQHT involves the simple act of a hash table lookup. The P-frame macroblock whose motion vector is required is hashed directly into a bucket entry. The corresponding motion vector is then obtained simply by searching all slots for that I-frame macroblock which minimises a distance metric. The co-ordinate difference between the P-frame and I-frame macroblocks so found defines the motion vector. This can now be DCT-encoded before being transmitted to the decoder in the usual MPEG manner.
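The lookup just described can be sketched as follows; this is a simplification that ignores empty buckets and ties, and the names are illustrative, not taken from the source:

```python
import numpy as np

def motion_vector(p_block, p_xy, codebook, buckets):
    """Hash the P-frame macroblock into its bucket, search the slots for
    the I-frame macroblock minimising the distortion, and return the
    co-ordinate difference as the motion vector."""
    distortions = np.sum((codebook - p_block) ** 2, axis=1)
    bucket = buckets[int(np.argmin(distortions))]
    best_xy, _ = min(bucket, key=lambda slot: np.sum((slot[1] - p_block) ** 2))
    return (p_xy[0] - best_xy[0], p_xy[1] - best_xy[1])
```

Note that no search window is scanned at all: the cost is one codebook scan plus one bucket scan, which is the source of the speed-up over BMA.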
The encoder structures required in a hardware implementation of VQH can be partitioned into pre-processing and postprocessing stages. For pre-processing, all that is required is a vector quantiser (which is normally a part of MPEG anyway) and some local buffer memory which stores the buckets and slots comprising the VQHT. It is possible to construct control logic that will directly fill up the VQHT from the vector quantiser's output when it is given an I-frame to encode. This, however, could also be done in software without incurring a significant performance penalty.
In BMA, the generation of motion vectors from subsequent P-frames is, as noted above, a computationally intensive task. Several high-throughput systolic designs have been suggested and implemented in order to achieve this.
In the VQH approach, it is possible to design, for postprocessing purposes, efficient dataflow hardware which will give rise to a high-performance motion vector generation engine. Such a design is illustrated in Figure 3 of the drawings, and consists of the following:
A shift register array, which takes as its input a linearised macroblock from a P-frame that is to be encoded. The geometry of this array is arranged such that the outputs are simply equal to the inputs, but with each component staggered by one computational cycle from its predecessor;
A codebook buffer, which contains the code vectors which will be filled in by the hardware vector quantiser;

A VQHT buffer, which contains the representation of the I-frame, and is filled in during the pre-processing stage;
A systolic sorter: the P-frame macroblock that is to be encoded needs to be hashed into its appropriate bucket, and the corresponding I-frame macroblock with the least distance metric needs to be found. For this reason, a systolic sorter is included, the function of which is two-fold. Firstly, it sorts the output metrics from the codebook array in order to find the corresponding bucket. Secondly, it sorts the output metrics from the bucket in order to find those with the least distortion;

An array of comparators: the distortions from the sorter array need not be unique, particularly if the macroblock is representing a region of low spatial gradient (i.e. minimum motion). Thus it is necessary to compare the sorted outputs with each other in order to tag all those which are equal. The comparator performs this task; and

A mean absolute differencer: at this stage, there exists a set of bucket entries which have an identical (and minimum) distance metric between the P-frame macroblock to be encoded and the I-frame. It now remains to find
from these entries that unique entry which minimises the co-ordinate metric. The 2-dimensional differencer performs this task. It takes as its input the (x, y) coordinates of the P-frame macroblock to be encoded as well as the outputs from the comparator array. It then performs a metric computation (an L2-norm) between this coordinate and the coordinates of all candidate I-frame macroblocks. The resulting calculation tags the coordinates of the best-matching I-frame macroblock.
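A software analogue of these three post-processing stages (sorter, comparator array, differencer) might look as follows; `candidates` pairs each bucket entry's co-ordinates with its distortion, and the names are illustrative, not taken from the source:

```python
def select_match(p_xy, candidates):
    """Pick the best I-frame match from (coords, distortion) pairs:
    sort by distortion (systolic sorter), tag all entries tied at the
    minimum (comparator array), then break the tie with the L2 norm
    between co-ordinates (the 2-dimensional differencer)."""
    ranked = sorted(candidates, key=lambda c: c[1])      # sorter stage
    d_min = ranked[0][1]
    ties = [c for c in ranked if c[1] == d_min]          # comparator stage
    # differencer stage: minimise the squared L2 co-ordinate metric
    return min(ties, key=lambda c: (c[0][0] - p_xy[0]) ** 2
                                   + (c[0][1] - p_xy[1]) ** 2)[0]
```

The tie-break towards the spatially nearest candidate favours small motion vectors, which cost fewer bits to encode.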
In BMA, a single comparison of an N x N macroblock requires O(N²) operations (the O-notation is well-known in complexity analysis, and provides a way of expressing an upper bound). The number of such macroblock searches required within a search window is, from Figure 1, (2w+1)² ≈ O(w²). Consider a square frame of dimensions A x A pixels. Since N+2w=A, we have O(w²) ≡ O(A²). If an exhaustive search for a motion vector is carried out over the entire frame, a total of P_BMA = O(N²A²) = O(N²)·O(A²) operations are required for every P-frame. The function grows relatively rapidly, as shown in Figure 4 of the drawings.
In the VQHT approach to motion estimation according to this exemplary embodiment of the present invention, it is necessary to factor in the initial, but one-off, cost of generating the codebook at the start of a GOP. Using the convention illustrated in Figure 2, if there are L bucket entries, with the largest bucket having at most M slots, then a VQHT training and loading algorithm based on k-means clustering can be shown to require O(LM) operations. (Note: M is bounded by A², as explained above, but it is realistic to expect that M < A².)
The look-up for generating a motion vector requires simply O(L) operations followed by at most O(M) slot searches. This gives a total of P_VQHT = O(LM) operations to set up a GOP, followed by O(L) + O(M) operations for every P-frame that is subsequently encoded using it. With standard VQ, the greater the codebook size, the more accurate is the quantised representation. However, with the VQHT of the present invention, the reduction in accuracy entailed by small values of L is compensated for by an increase in the maximum slot size M. The extremal cases are simply L=A², M=1 against L=1, M=A². By choosing mid-point values L=M=(1/2)A, we obtain P_VQHT = O(A²) + O(A).
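The two operation counts can be compared numerically with a small sketch; the frame size, block size and GOP length below are illustrative assumptions, not figures from the source:

```python
def ops_bma(N, A):
    """Exhaustive BMA over an A x A frame: O(N^2 A^2) operations per P-frame."""
    return N * N * A * A

def ops_vqht(L, M, p_frames):
    """VQHT: a one-off O(LM) set-up per GOP, then O(L) + O(M) per P-frame."""
    return L * M + p_frames * (L + M)

# Illustrative figures: a 352 x 352 frame, 16 x 16 macroblocks, and the
# mid-point codebook choice L = M = (1/2)A, for a GOP of 10 P-frames.
A, N = 352, 16
L = M = A // 2
bma_total = 10 * ops_bma(N, A)    # BMA pays the full search for every P-frame
vqht_total = ops_vqht(L, M, 10)   # VQHT amortises its set-up over the GOP
```

Even including the one-off set-up cost, the VQHT total is orders of magnitude below the exhaustive-search figure, consistent with the asymptotic argument above.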
Thus, in the above description a new method of finding the closest match in video compression is presented based on two new ideas, namely the use of a hash table for storing the motion vectors, and the use of vector quantisation (VQ) as an indexing method for a hash table. A systolic architecture is also proposed for implementing the described algorithm in hardware.
An embodiment of the present invention has been described above by way of example only and it will be apparent to persons skilled in the art that modifications and variations can be made to the described embodiment without departing from the scope of the invention.