US20060239345A1 - Method of signalling motion information for efficient scalable video compression - Google Patents

Method of signalling motion information for efficient scalable video compression

Info

Publication number
US20060239345A1
US20060239345A1 (application US 10/528,965)
Authority
US
United States
Prior art keywords
motion
video
frame
mapping
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/528,965
Inventor
David Taubman
Andrew Secker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisearch Ltd
Original Assignee
Unisearch Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisearch Ltd filed Critical Unisearch Ltd
Assigned to UNISEARCH LIMITED reassignment UNISEARCH LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SECKER, MR. ANDREW, TAUBMAN, MR. DAVID
Publication of US20060239345A1
Priority claimed by US13/421,788 (issued as US10205951B2)
Legal status: Abandoned

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/184 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being bits, e.g. of the compressed video stream
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/513 Processing of motion vectors
    • H04N 19/517 Processing of motion vectors by encoding
    • H04N 19/52 Processing of motion vectors by encoding by predictive encoding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H04N 19/122 Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/146 Data rate or code amount at the encoder output
    • H04N 19/147 Data rate or code amount at the encoder output according to rate distortion criteria
    • H04N 19/189 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N 19/19 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding using optimisation based on Lagrange multipliers
    • H04N 19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N 19/31 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the temporal domain
    • H04N 19/46 Embedding additional information in the video signal during the compression process
    • H04N 19/537 Motion estimation other than block-based
    • H04N 19/54 Motion estimation other than block-based using feature points or meshes
    • H04N 19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N 19/62 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding by frequency transforming in three dimensions
    • H04N 19/63 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets
    • H04N 19/635 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets, characterised by filter definition or implementation details

Definitions

  • the present invention relates to efficient compression of motion video sequences and, in preferred embodiments, to a method for producing a fully scalable compressed representation of the original video sequence while exploiting motion and other spatio-temporal redundancies in the source material.
  • the invention relates specifically to the representation and signalling of motion information within a scalable compression framework which employs motion adaptive wavelet lifting steps. Additionally, the present invention relates to the estimation of motion parameters for scalable video compression and to the successive refinement of motion information by temporal resolution, spatial resolution or precision of the parameters.
  • the term “Internet” will be used both in its familiar sense and also in its generic sense, to identify a network connection over any electronic communications medium or collection of cooperating communications systems.
  • Scalable compression refers to the generation of a bit-stream which contains embedded subsets, each of which represents an efficient compression of the original video with successively higher quality.
  • a scalable compressed video bit-stream might contain embedded subsets with bit-rates of R, 2R, 3R, 4R and 5R, each of comparable quality to a non-scalable bit-stream having the same bit-rate. Because these subsets are all embedded within one another, however, the storage required on the video server is identical to that for the highest available bit-rate alone.
  • a version at rate R might be streamed directly to the client in real-time; if the quality is insufficient, the next rate-R increment could be streamed to the client and added to the previous, cached bit-stream to recover a higher quality rendition in real time. This process could continue indefinitely without sacrificing the ability to display the incrementally improving video content in real time as it is being received from the server.
  • the above application could be extended in a number of exciting ways. Firstly, if the scalable bit-stream also contains distinct subsets corresponding to different intervals in time, then a client could interactively choose to refine the quality associated with specific time segments which are of the greatest interest. Secondly, if the scalable bit-stream also contains distinct subsets corresponding to different spatial regions, then clients could interactively choose to refine the quality associated with specific spatial regions over specific periods of time, according to their level of interest. In a training video, for example, a remote client could interactively “revisit” certain segments of the video and continue to stream higher quality information for these segments from the server, without incurring any delay.
  • low bit-rate subsets of the video must be visually intelligible. In practice, this means that most of the bits available in a low bit-rate portion of the stream are likely to contribute to the reconstruction of the video at a reduced frame rate, since attempting to recover the full frame rate video over a low bit-rate channel will result in unacceptable deterioration of the spatial details within each frame.
  • the details required to recover higher frame rates must contribute to the refinement of a model which involves motion sensitive temporal interpolation.
  • motion information is important to highly scalable video compression; moreover, the motion itself must be represented in a manner which can be scaled, according to the temporal resolution (frame rate), spatial resolution and quality of the sample data.
  • the motion sensitive transform has several desirable properties: 1) it may be perfectly inverted, in the absence of any compression artefacts; 2) the low temporal resolution subsets of the wavelet hierarchy offer high spatial fidelity so that the transform allows excellent frame rate scalability; 3) the high-pass temporal detail subbands produced by the transform have very low energy, allowing high compression efficiency; 4) in the absence of motion, the transform reduces to a regular wavelet decomposition along the temporal axis; and 5) in the presence of locally translational motion, the transform is equivalent to applying a regular wavelet decomposition along the motion trajectories.
  • Any two-channel FIR subband transform can be described as a finite sequence of lifting steps [W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186-200, April 1996]. It is instructive to begin with an example based upon the Haar wavelet transform.
  • l k [n] and h k [n] correspond to the scaled sum and the difference of each original pair of frames.
  • An example is shown in FIG. 1A . Since motion is ignored, ghosting artefacts are clearly visible in the low-pass temporal subband, and the high-pass subband frame has substantial energy.
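The un-compensated Haar lifting steps described above can be sketched as follows. This is an illustrative reconstruction (the patent gives no code); it uses the standard predict/update factorisation with orthonormal scaling, so that l and h are the scaled sum and difference of each frame pair:

```python
import numpy as np

def haar_lifting_pair(x_even, x_odd):
    # Predict step: the high-pass frame is the difference of the pair.
    h = x_odd - x_even
    # Update step: the low-pass frame is the average of the pair.
    l = x_even + 0.5 * h
    # Orthonormal scaling yields the 'scaled sum and difference'.
    return np.sqrt(2.0) * l, h / np.sqrt(2.0)

def haar_lifting_inverse(l, h):
    # Undo the scaling, then run the lifting steps in reverse order.
    l, h = l / np.sqrt(2.0), h * np.sqrt(2.0)
    x_even = l - 0.5 * h
    x_odd = h + x_even
    return x_even, x_odd

# Two small synthetic 'frames'; lifting transforms invert perfectly.
x0 = np.arange(16.0).reshape(4, 4)
x1 = x0 + 1.0
l, h = haar_lifting_pair(x0, x1)
y0, y1 = haar_lifting_inverse(l, h)
assert np.allclose(y0, x0) and np.allclose(y1, x1)
```

Because each lifting step is undone exactly by its negated counterpart, perfect inversion holds regardless of the operations used inside the steps, which is what allows motion compensation to be inserted below.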
  • Let W k1→k2 denote a motion-compensated mapping of frame k 1 onto the coordinate system of frame k 2 , so that W k1→k2 (x k1 )[n]≈x k2 [n] for all n.
  • the lifting steps are modified as follows.
  • W 2k→2k+1 and W 2k+1→2k represent forward and backward motion mappings, respectively.
  • the high-pass subband frames correspond to motion-compensated residuals. These will be close to zero in regions where the motion is accurately modelled. The result is shown in FIG. 1B .
  • FIG. 2 demonstrates the effect of these modified lifting steps.
  • the highpass frames are now essentially the residual from a bidirectional motion compensated prediction of the odd-indexed original frames. When the motion is adequately captured, these high-pass frames have little energy and the low-pass frames have excellent spatial fidelity.
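As a sketch of these modified lifting steps, the following illustrative code implements one motion adaptive 5/3 stage. A simple global integer column shift stands in for the motion mapping W (an assumption made purely for illustration; the patent's preferred embodiment uses a deformable triangular mesh, and the function names and boundary handling here are hypothetical):

```python
import numpy as np

def warp(frame, dx):
    # Stand-in motion mapping W: a global circular shift by an integer
    # number of columns (illustrative assumption only).
    return np.roll(frame, dx, axis=1)

def lift_53_stage(frames, shifts):
    # One motion adaptive 5/3 lifting stage (sketch). shifts[k] models
    # the motion mapping of frame k onto frame k+1; the reciprocal
    # mapping is taken to be the negated shift, i.e. one field per
    # reciprocal pair. Boundaries are handled by simple index clamping.
    N = len(frames)
    def x(k): return frames[min(max(k, 0), N - 1)]
    def s(k): return shifts[min(max(k, 0), len(shifts) - 1)]
    # Predict: h_k is the residual of a bidirectional motion-compensated
    # prediction of the odd-indexed frame.
    h = [x(2*k + 1) - 0.5 * (warp(x(2*k), s(2*k))
                             + warp(x(2*k + 2), -s(2*k + 1)))
         for k in range(N // 2)]
    # Update: l_k smooths the even-indexed frame along the motion.
    l = [x(2*k) + 0.25 * (warp(h[max(k - 1, 0)], s(2*k - 1))
                          + warp(h[k], -s(2*k)))
         for k in range(N // 2)]
    return l, h

# A sequence translating uniformly by one column per frame: the motion
# model captures it exactly, so the interior high-pass frame vanishes.
base = np.arange(16.0).reshape(4, 4)
frames = [np.roll(base, k, axis=1) for k in range(4)]
l, h = lift_53_stage(frames, [1, 1, 1])
assert np.allclose(h[0], 0.0)
```

When the shifts match the true motion, the high-pass frames carry essentially no energy and the low-pass frames retain full spatial fidelity, exactly the behaviour claimed for the transform.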
  • the cost of estimating, coding and transmitting the above motion fields can be substantial. Moreover, this cost may adversely affect the scalability of the entire compression scheme, since it is not immediately clear how to progressively refine the motion fields without destroying the subjective properties of the reconstructed video when the motion is represented with reduced accuracy.
  • the present invention provides a method for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples, said method comprising the steps of:
  • the present invention also provides a system for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples, said system comprising:
  • (b) means for interleaving incremental contributions from said embedded motion fields with incremental contributions from said transformed video samples.
  • Since each motion field is represented in coarse to fine fashion and interleaved with the video data bit-stream, the accuracy required for the motion representation can be balanced with the accuracy of the transformed sample values which may be recovered from the bit-stream. Therefore, a fully scalable video bit-stream may be progressively refined, both in regard to its quantised sample representations and in regard to its motion representation.
  • the embedded motion field bit-stream is obtained by applying embedded quantization and coding techniques to the motion field parameter values.
  • the embedded motion field bit-stream is obtained by coding the node displacement parameters associated with a triangular mesh motion model on a coarse to fine grid, each successive segment of the embedded bit-stream providing displacement parameters for node positions which lie on a finer grid than the previous stage, all coarser grids of node positions being subsets of all finer grids of node points.
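The coarse-to-fine grid structure of this embodiment can be illustrated with a small sketch; the frame size and the grid spacings below are arbitrary assumptions:

```python
def node_grid(width, height, spacing):
    # Node positions of a regular mesh grid with the given spacing.
    return {(x, y) for x in range(0, width + 1, spacing)
                   for y in range(0, height + 1, spacing)}

W, H = 16, 16
grids = [node_grid(W, H, s) for s in (8, 4, 2)]   # coarse to fine

# Subset property: every coarser grid of node positions is a subset
# of every finer grid, as the embodiment requires.
assert grids[0] <= grids[1] <= grids[2]

# Incremental signalling: each successive segment of the embedded
# bit-stream carries displacement parameters only for the new nodes.
increments = [grids[0]] + [fine - coarse
                           for coarse, fine in zip(grids, grids[1:])]
assert set().union(*increments) == grids[-1]
assert sum(map(len, increments)) == len(grids[-1])
```

The disjoint increments mean no node displacement is ever signalled twice, so truncating the embedded motion bit-stream after any segment still leaves a complete, coarser mesh.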
  • a coarse to fine motion representation is obtained by first transforming the motion parameters and then coding the transform coefficients using embedded quantization and coding techniques.
  • the motion parameters are transformed by applying spatial discrete wavelet transforms and/or temporal transforms thereto.
  • the spatial and/or temporal transforms are reversible integer-to-integer transforms, suitable for lossless compression.
  • the embedded motion bit-streams are arranged into a sequence of quality layers, and the transformed video samples are also encoded into embedded bit-streams which are arranged into a separate sequence of quality layers.
  • said interleaving of the contributions from the embedded motion bit-streams and from the transformed video samples is performed in a manner which minimizes the expected distortion in the reconstructed video sequence at each of a plurality of compressed video bit-rates.
  • the measure of distortion is Mean Squared Error.
  • the measure of distortion is a weighted sum of the Mean Squared Error contributions from different spatial frequency bands, weighted according to perceptual relevance factors.
  • the distortion associated with inaccurate representation of the motion parameters is determined using an estimate of the spatial power spectrum of the video source.
  • the distortion associated with inaccurate representation of the motion parameters is determined using information about the spatial resolution at which the video bit-stream is to be decompressed.
  • the power spectrum of the video source is estimated using spatio-temporal video sample subbands created during compression.
  • the proportions of contributions from said embedded motion fields and said transformed video samples in the embedded bit-stream are determined on the basis of a plurality of tables associated with each frame, each table being associated with a spatial resolution at which the video bit-stream is to be decompressed.
  • the tables identify the number of motion quality layers which are to be included with each number of video sample quality layers.
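One simple way to realise such rate-distortion balanced interleaving is a greedy merge of the two embedded layer sequences by distortion reduction per bit. The sketch below is illustrative only: the layer sizes and distortion figures are made-up numbers, and a real system would use the tables and distortion estimates described above:

```python
def interleave(motion_layers, sample_layers):
    # Greedy interleaving of two embedded layer sequences. Each layer
    # is (bits, distortion_reduction); layers within a stream must be
    # taken in order, so at every step we take whichever stream's next
    # layer offers the larger distortion reduction per bit.
    order, i, j = [], 0, 0
    while i < len(motion_layers) or j < len(sample_layers):
        slope_m = (motion_layers[i][1] / motion_layers[i][0]
                   if i < len(motion_layers) else -1.0)
        slope_s = (sample_layers[j][1] / sample_layers[j][0]
                   if j < len(sample_layers) else -1.0)
        if slope_m >= slope_s:
            order.append(('motion', i)); i += 1
        else:
            order.append(('sample', j)); j += 1
    return order

# Made-up layer costs: (bits, distortion reduction) per quality layer.
motion = [(100, 900), (200, 500), (400, 300)]
samples = [(500, 4000), (500, 1000), (500, 200)]
plan = interleave(motion, samples)
```

Truncating `plan` at any prefix yields a bit-stream in which motion accuracy and sample accuracy remain balanced, which is the behaviour the layered representation is designed to support at every target bit-rate.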
  • the preferred structure of the motion representation allows rate-distortion optimal algorithms to balance the contributions of motion information and sample accuracy, as it is being included into an incrementally improving (or layered) compressed representation. While rate-distortion optimisation strategies for balancing motion and sample accuracy have been described in the literature, those algorithms were applicable only to static optimisation of a compressed bit-stream for a single target bit-rate.
  • the preferred embodiment of the present invention allows for the rate-distortion optimised balancing of motion and sample accuracy to be extended to scalable content in which the target bit-rate cannot be known a priori.
  • a method for estimating and signalling motion information for a motion adaptive transform based on temporal lifting steps comprises the steps of:
  • the present invention also provides a system for estimating and signalling motion information for a motion adaptive transform based on temporal lifting steps, said system comprising:
  • (b) means for inferring a second mapping between either said source frame or said target frame, and another frame, based on the estimated and signalled motion parameters associated with said first mapping.
  • the number of motion fields which must be signalled to the decompressor can be reduced, as some motion fields can be inferred from others.
  • said second mapping is the reciprocal mapping from said target frame to said source frame, for use within another one of the lifting steps.
  • said reciprocal mapping is the inverse of the first mapping.
  • the preferred embodiment provides a method for estimating and representing only one of the motion fields in each pair, W 2k→2k+1 and W 2k+1→2k , or W 2k→2k−1 and W 2k−1→2k .
  • Such pairs of motion fields will be known here as “reciprocal pairs.” This allows the total amount of motion information to be reduced to one motion field per frame for the Haar case, and 2 motion fields per frame for the 5/3 case. It is found that collapsing reciprocal pairs to a single motion field, from which the pair is recovered, actually improves the properties of the motion adaptive transform, resulting in increased compression efficiency, even when the benefits of reduced motion cost are not taken into account.
  • the motion parameters of said first mapping correspond to a deformable triangular mesh motion model.
  • said reciprocal mapping is inferred by inverting the affine transformations associated with the triangular mesh used to represent said first mapping.
  • the motion parameters of said first mapping correspond to a block displacement motion model.
  • said motion adaptive transform involves multiple stages of temporal decomposition, corresponding to different temporal frame rates.
  • motion parameters at each temporal resolution are deduced from original video frames.
  • said second mapping is a mapping between frames at a lower temporal resolution than said first mapping, and said second mapping is inferred by compositing the first mapping with at least one further mapping between frames at the higher temporal resolution.
  • This embodiment enables all of the required motion fields at lower temporal resolutions (higher temporal displacements) to be derived from an initial set of frame-to-frame motion fields.
  • the compressor need only estimate the motion between each successive pair of frames, x k [n] and x k+1 [n]. This substantially reduces the cost in memory and computation of the motion estimation task, without significantly altering the compression efficiency or other properties of the motion adaptive transform.
  • said second mapping is a mapping between frames at a higher temporal resolution than said first mapping, and said second mapping is inferred by compositing the first mapping with at least one further mapping at the higher temporal resolution.
  • the higher resolution is double said lower resolution, and alternate mappings at the higher temporal resolution are explicitly signalled to a decompressor, the remaining mappings at the higher temporal resolution being replaced by the mappings inferred by compositing the lower resolution mappings with respective higher resolution mappings.
  • said replaced mappings are used within the lifting steps of said motion adaptive transform, in place of the originally estimated mappings which were replaced.
  • the method of this embodiment has the property that the motion representation is temporally scalable. In particular, only one motion field must be made available to the decoder for each video frame which it can reconstruct, at any selected temporal resolution. This method involves judicious compositing of the forward and backward motion fields from different temporal resolution levels and is compatible with the efficient motion estimation method described above, of compositing motion fields at higher resolutions to obtain motion fields at lower resolutions.
  • said replaced mappings are refined with additional motion parameters, said refinement parameters being signalled for use in decompression, and said replaced and refined mappings being used within the lifting steps of said motion adaptive transform, in place of the originally estimated mappings which were replaced.
  • inversion or composition of motion transformations is accomplished by applying said motion transformations to the node positions of a triangular mesh motion model, the composited or inverted motion transformation being subsequently applied by performing the affine transformations associated with said mesh motion model.
  • the source frame is partitioned into a regular mesh and the inversion or composition operations are applied to each node of the regular mesh to find a corresponding location in the target frame, the composited or inverted motion transformation being subsequently applied by performing the affine transformations associated with said mesh motion model.
  • This is a particularly efficient computational method for performing the various motion field transformations required by other aspects of the invention. These methods are preferably replicated at both the compressor and the decompressor, if the transform is to remain strictly invertible.
  • FIG. 1A illustrates the lifting steps for the Haar temporal transform
  • FIG. 1B illustrates a motion adaptive modification of the lifting steps for the Haar temporal transform
  • FIG. 2 illustrates the lifting steps for a motion adaptive 5/3 temporal transform
  • FIG. 3 illustrates a triangular mesh motion model
  • FIG. 4 illustrates schematically the compositing of two motion fields at a higher temporal resolution to create one at a lower resolution
  • FIG. 5 illustrates schematically the compositing of motion fields in one embodiment of a temporally scalable motion representation for the motion adaptive 5/3 lifting transform.
  • The most obvious approach to estimating W 2k→2k+1 and W 2k+1→2k would be to determine the parameters of W 2k→2k+1 which minimise some measure (e.g., energy) of its mapping residual, x 2k+1 −W 2k→2k+1 (x 2k ), and to separately determine the parameters of W 2k+1→2k which minimise some measure of its residual signal, x 2k −W 2k+1→2k (x 2k+1 ).
  • When the two mappings of a reciprocal pair are exact inverses of one another, the motion adaptive transform is equivalent to a one-dimensional DWT, applied along the underlying motion trajectories. If they are not inverses of one another, this desirable characteristic is lost, no matter how well they are able to minimise motion compensated residuals.
  • To limit the cost of motion information delivered to the decompressor, only one motion field from each reciprocal pair should be directly estimated and communicated to the decompressor. Unless otherwise prohibited (e.g., by the later aspects of the invention), it is mildly preferable to directly estimate and communicate the parameters of the motion field which is used in the first (predictive) lifting step. This is the lifting step described by equations (1) and (3), for the Haar and 5/3 cases, respectively.
  • When the motion is represented by a continuously deformable triangular mesh [Y. Nakaya and H. Harashima, “Motion compensation based on spatial transformations”, IEEE Trans. Circ. Syst. for Video Tech., vol. 4, pp 339-367, June 1994], the affine motion which describes the deformation of each triangle in W 2k→2k+1 or W 2k→2k−1 may be directly inverted to recover W 2k+1→2k and W 2k−1→2k , respectively.
  • a triangular mesh model for motion field W k1→k2 involves a collection of node positions, {t i }, in the target frame, x k2 , together with the locations, {s i }, of those same node positions, as they appear in the source frame, x k1 .
  • the target node positions, {t i }, are fixed, and the motion field is parametrized by the set of node displacements, {s i −t i }.
  • the target frame, x k2 is partitioned into a collection of disjoint triangles, whose vertices correspond to the node positions. Since the partition must cover the target frame, some of the target node positions must lie on the boundaries of the frame. An example involving a rectangular grid of target node vertices is shown in FIG. 3 .
  • Write {Δ j } for the set of target frame triangles.
  • Let t j,0 , t j,1 and t j,2 denote the vertices of target triangle ⁇ j .
  • the triangular mesh then maps the source triangle, Δ′ j , described by the vertices s j,0 , s j,1 and s j,2 , onto the target triangle Δ j ,
  • the motion map itself is described by an affine transformation.
  • The source location s does not generally lie on an integer grid, and so the source frame must be interpolated, using any of a number of well-known methods, to recover the value of (W k1→k2 (x k1 ))[t].
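The affine mapping associated with one mesh triangle, and its inversion for the reciprocal mapping, can be sketched as follows. This is an illustrative reconstruction; the function name and the pure-translation example triangle are assumptions, not taken from the patent:

```python
import numpy as np

def affine_from_triangles(t_verts, s_verts):
    # Solve s = A @ t + b from the three vertex correspondences
    # (t_{j,0..2} -> s_{j,0..2}) of one mesh triangle.
    T = np.array([[tx, ty, 1.0] for tx, ty in t_verts])   # 3x3
    M = np.linalg.solve(T, np.array(s_verts, float))      # 3x2
    return M[:2].T, M[2]                                  # A (2x2), b (2,)

# Hypothetical triangle: the source is the target translated by (1, 1).
t_tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
s_tri = [(1.0, 1.0), (5.0, 1.0), (1.0, 5.0)]
A, b = affine_from_triangles(t_tri, s_tri)

t = np.array([1.0, 2.0])
s = A @ t + b   # source location; in general off the integer grid,
                # so the source frame would be interpolated at s.

# The reciprocal mapping is recovered by inverting the affine map.
A_inv = np.linalg.inv(A)
b_inv = -A_inv @ b
assert np.allclose(A_inv @ s + b_inv, t)
```

Because each triangle's map is affine and invertible (for non-degenerate triangles), the reverse field is obtained exactly wherever s falls inside a source triangle, which is the property the reciprocal-pair scheme relies on.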
  • When a target node position, t i , lies on a boundary of frame x k2 , the corresponding source node position, s i , may be constrained to lie on the same boundary of frame x k1 , as depicted in FIG. 3 .
  • Constraining boundary nodes, t i to map to nodes, s i , on the same boundary, tends to produce unrealistic motion fields in the neighbourhood of the frame boundaries, adversely affecting the ability of the mesh to track true scene motion trajectories. For this reason, the preferred embodiment of the invention does not involve any such constraints.
  • the source triangles Δ′ j will not generally cover frame x k1 , and inversion of the affine transformations yields values for (W k2→k1 (x k2 ))[s] only when s lies within one of the source triangles, Δ′ j .
  • For locations s which do not lie within any source triangle, any of a number of policies may be adopted.
  • For example, the nearest source triangle, Δ′ j , to s may be found and its affine parameters used to find a location t in frame x k2 .
  • An alternative approach is to first extrapolate the mesh to one which is defined over a larger region than that required by the forward motion field W k1→k2 . So long as this region is large enough to cover the source frame, each location s in frame x k1 will belong to some source triangle within the extrapolated mesh and the corresponding affine map can be inverted to find the location t in frame x k2 .
  • The node vector t e −s e at each extrapolated node position n e in frame x k2 is obtained by linear extrapolation of two node vectors, t b −s b and t o −s o , having corresponding node positions n b and n o , where:
  • the extrapolated node position n e is outside the boundaries of frame x k2 ;
  • n b is the location of the nearest boundary node to n e ; and
  • n o =2n b −n e is the mirror image of n e through the boundary node, n b .
  • the extrapolated node vectors are not explicitly communicated to the decoder, since it extrapolates them from the available interior node positions, following the same procedure as the encoder.
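The boundary extrapolation rule above can be sketched directly. The node positions and node vectors below are made-up values, and for simplicity the sketch picks the nearest available node rather than searching only boundary nodes:

```python
def extrapolate_node_vector(n_e, nodes):
    # nodes: dict mapping node position -> node vector (t - s).
    # n_b is the nearest available node to the exterior position n_e,
    # and n_o = 2*n_b - n_e is its mirror image through n_b; both are
    # assumed to be present in `nodes`.
    n_b = min(nodes, key=lambda n: (n[0] - n_e[0])**2 + (n[1] - n_e[1])**2)
    n_o = (2 * n_b[0] - n_e[0], 2 * n_b[1] - n_e[1])
    v_b, v_o = nodes[n_b], nodes[n_o]
    # Since n_e = 2*n_b - n_o, linear extrapolation of the two node
    # vectors gives v_e = 2*v_b - v_o.
    return (2 * v_b[0] - v_o[0], 2 * v_b[1] - v_o[1])

# Made-up interior node vectors along one row of the mesh.
nodes = {(0, 0): (1.0, 0.0), (4, 0): (2.0, 0.0)}
v_e = extrapolate_node_vector((-4, 0), nodes)   # exterior node position
```

Because the rule is deterministic given the interior nodes, the decoder can reproduce the extrapolated vectors exactly, which is why they need not be signalled.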
  • Triangular mesh models are particularly suitable for the recovery of a reverse motion field, W k2→k1 , from its forward counterpart W k1→k2 .
  • the transformation between target locations, t, and source locations, s is continuous over the whole of the target frame. This is a consequence of the fact that the affine transformation maps straight lines to straight lines.
  • a block displacement model consists of a partition of the target frame into blocks, {B i }, and a corresponding set of displacement vectors, one per block, identifying the locations of each block within the source frame.
  • block displacement models represent the motion field in a discontinuous (piecewise constant) manner. As a result, they may not properly be inverted. Nevertheless, when reciprocal pairs of motion maps, W k1→k2 and W k2→k1 , use block displacement models, it is still preferable to estimate and transmit only one of the two motion fields to the decoder, inferring the other through an approximate inverse relationship. Since displacements are usually small, it is often sufficient simply to reverse the sign of the per-block displacement vectors when forming W k2→k1 from W k1→k2 or vice-versa.
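A minimal sketch of this sign-reversal approximation, assuming displacements are stored per block (the block labels and vectors below are made-up):

```python
def approx_inverse_block_motion(displacements):
    # Block models are piecewise constant and cannot be exactly
    # inverted; negating each block's displacement vector gives the
    # approximate reciprocal mapping described above.
    return {blk: (-dx, -dy) for blk, (dx, dy) in displacements.items()}

fwd = {'B0': (2, -1), 'B1': (0, 3)}      # W_{k1 -> k2}, per-block vectors
bwd = approx_inverse_block_motion(fwd)   # inferred W_{k2 -> k1}
```

Only `fwd` would be coded and transmitted; the decoder derives `bwd` by the same rule, halving the motion signalling cost for block models.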
  • the transform consists of a sequence of stages, each of which produces a low- and a high-pass temporal subband sequence, from its input sequence. Each stage in the temporal decomposition is applied to the low-pass subband sequence produced by the previous stage.
  • each stage of the temporal decomposition involves the same steps, one might consider applying an identical estimation strategy within each stage, estimating the relevant motion fields from the frame sequence which appears at the input to that stage.
  • the problem with such a strategy is that estimation of the true motion, based on subband frames, may be hampered by the existence of unwanted artefacts such as ghosting. Such artefacts can arise as a result of model failure or poor motion estimation in previous stages of the decomposition.
  • the first stage of decomposition employs motion mappings W k1→k2 (0) , producing low- and high-pass subband frames, l k (1) [n] and h k (1) [n].
  • the temporal displacement over which motion estimation must be performed will span many original frames.
  • the actual temporal displacement between neighbouring subband frames is 16 times the original frame displacement.
  • Motion estimation is generally very difficult over large temporal displacements due to the large possible range of motion. This complexity can be reduced by using knowledge of motion mappings already obtained in previous levels of the decomposition.
  • the first stage of decomposition with the 5/3 kernel involves estimation of W 2k→2k+1 (0) and W 2k+2→2k+1 (0) . These may be composited to form an initial approximation for W k→k+1 (1) , which is required for the second stage of decomposition. This is shown in FIG. 4 , where the arrows indicate the direction of the motion mapping. It is often computationally simpler to create composite mappings from source mappings that have the same temporal orientation, as suggested in the figure. If necessary, the source mappings can be inverted to achieve this. However, it is preferable to directly estimate source mappings having the same direction as the composite mapping.
  • the initial approximation, formed by motion field composition in the manner described above, can be refined based on original video data, using motion estimation procedures well known to those skilled in the art. It turns out, however, that the method of compositing motion fields with a frame displacement of 1 to produce motion fields corresponding to larger frame displacements often produces highly accurate motion mappings that do not need any refinement. In some cases the composite mappings lead to superior motion adaptive transforms compared with motion mappings formed by direct estimation, or with the aid of refinement steps.
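A minimal sketch of such a composition, assuming dense displacement fields on a pixel grid and nearest-neighbour sampling; the function name and (row, col, 2) array layout are illustrative assumptions, not the embodiment's actual representation.

```python
import numpy as np

def composite_displacements(d1, d2):
    """Composite two dense displacement fields (illustrative sketch).

    Each field gives the source location of target pixel n as n + d[n].
    If d1 describes W_{a->b} (defined on frame b) and d2 describes
    W_{b->c} (defined on frame c), the composite W_{a->c} is
        d[n] = d2[n] + d1[n + d2[n]],
    sampled here with nearest-neighbour rounding for simplicity.
    Fields are (H, W, 2) arrays of (row, col) offsets.
    """
    H, W, _ = d2.shape
    out = np.empty_like(d2)
    for r in range(H):
        for c in range(W):
            rb = int(round(r + d2[r, c, 0]))
            cb = int(round(c + d2[r, c, 1]))
            rb = min(max(rb, 0), H - 1)       # clamp to frame b
            cb = min(max(cb, 0), W - 1)
            out[r, c] = d2[r, c] + d1[rb, cb]
    return out
```

Only coordinates are chained here; no frame data is warped, so the composite can serve directly as an initial approximation for the next decomposition stage.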
  • the motion field composition method described here can be repeated throughout the temporal decomposition hierarchy so that all the mappings for the entire transform can be derived from the frame to frame motion fields estimated in the first stage.
  • the composition method described above eliminates a significant portion of the computational load associated with direct estimation of the required motion fields. A total of one motion mapping must be estimated for each original frame, having a temporal displacement of only one frame. This is sufficient to determine the complete set of motion mappings for the entire transform.
  • This method is independent of the particular wavelet kernel on which the lifting framework is based; however, the effectiveness of the composition procedure does depend on the selected motion model.
  • An efficient method for performing the composition procedure is described in the 4th aspect of this invention.
  • An efficient temporally scalable motion representation should satisfy two requirements. Firstly, at most one motion mapping per video frame should be needed to reconstruct the video at any temporal resolution. This is consistent with the above observation that just one mapping per frame is sufficient to derive all mappings for the entire transform.
  • the above property should apply at each temporal resolution available from the transform.
  • This property allows the video content to be reconstructed at each available temporal resolution, without recourse to redundant motion information.
  • This aspect of the invention involves a temporally scalable motion information hierarchy, based on the method of motion field composition, as introduced in the description of the second aspect. This representation achieves both of the objectives mentioned above.
  • the motion information hierarchy described here is particularly important for motion adaptive lifting structures that are based on kernels longer than the simple Haar.
  • Block transforms such as the Haar require only the motion information between every second pair of consecutive frames, at each stage of the decomposition. Therefore an efficient temporally scalable motion representation can be easily achieved by transmitting a single motion mapping for every reciprocal pair.
  • the motion representation for two stages of the 5/3 transform is given in FIG. 5 .
  • the mappings required to perform the lifting steps are again shown as arrows, where the i th forward mapping in the j th transform level is denoted F i j .
  • the term “forward mapping” is applied to those which approximate a current frame by warping a previous frame.
  • backward mappings, denoted B i j correspond to warping a later frame to spatially align it with a current frame.
  • the entire set of motion mappings depicted in FIG. 5 can be represented using only F 1 2 and B 2 1 . Inverting F 1 2 produces the backward mapping B 1 2 .
  • the forward mapping F 1 1 is inferred by compositing the upper-level forward mapping F 1 2 with the lower-level backward mapping B 2 1 .
  • the remaining mappings B 1 1 and F 2 1 are recovered by inverting F 1 1 and B 2 1 , respectively.
  • composited fields such as F 1 1 in FIG. 5 may be inaccurate; the compressor may correct this by transmitting optional refinement fields, possibly based on direct estimation using original data.
  • mappings F 2 1 and B 2 1 are not required, so it is sufficient to code mappings F 1 1 and F 1 2 , recovering the corresponding backward motion fields by inversion.
  • the methods described above can be applied recursively to any number of transform stages, and the total number of required mappings is upper bounded by one per original frame. Temporal scalability is achieved since reversing a subset of the temporal decomposition stages requires no motion information from higher resolution levels.
  • a 4th aspect of the present invention describes an efficient method for performing the motion field composition and inversion transformations mentioned in previous aspects.
  • One possible way to represent a composited mapping is in terms of a sequence of warpings through each individual mapping. Motion compensation could be performed by warping the actual data through each mapping in turn. However, this approach suffers from the accumulation of spatial aliasing and other distortions that typically accompany each warping step.
  • each location in the target frame of the composite motion field may be mapped back through the various individual mappings to find its location in the source frame of the composite motion field.
  • the preferred method is to construct a triangular mesh model for the composite motion field, deducing the displacements of the mesh node points by projecting them through the various component motion mappings.
  • the triangular mesh model provides a continuous interpolation of the projected node positions and can be represented compactly in internal memory buffers. This method is particularly advantageous when used in conjunction with triangular mesh models for all of the individual motion mappings, since the frame warping machinery required to perform the motion adaptive temporal transformation involves only one type of operation—the affine transformation described previously.
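One way to sketch this node projection, assuming each component mapping is available as a function returning the source-frame position of a given target-frame position (all names here are illustrative):

```python
import numpy as np

def project_mesh_nodes(nodes, source_fns):
    """Trace triangular-mesh node positions back through a chain of
    component motion mappings.

    `nodes` is an (N, 2) array of node positions in the composite
    target frame; `source_fns` lists the component mappings in
    traversal order, each mapping a target position to its source
    position.  Only coordinates are projected -- no frame data is
    warped, so spatial aliasing does not accumulate.
    """
    pts = np.asarray(nodes, dtype=float)
    for src in source_fns:
        pts = np.array([src(p) for p in pts])
    return pts
```

The projected node positions then define the composite mesh, which the usual affine warping machinery can interpolate continuously.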
  • Motion field inversion may be performed using a similar strategy.
  • the inverted motion mapping is represented using a forward triangular mesh motion model, whose node displacements are first found by tracing them through the inverse motion field.
  • the accuracy associated with both composite and inverse motion field representations may be adjusted by modifying the size of the triangular mesh grid.
  • the mesh node spacing used for representing composite and inverse motion fields is no larger than 8 frame pixels and no smaller than 4 frame pixels.
  • an accurate motion representation is determined and used to adapt the various lifting steps in the motion adaptive transform.
  • the motion parameters are encoded using an embedded quantisation and coding strategy.
  • Such strategies are now well known to those skilled in the art, being employed in scalable image and video codecs such as those described in J. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients”, IEEE Trans. Sig. Proc., vol. 41, pp. 3445-3462, December 1993, and D. Taubman and A. Zakhor, “Multi-rate 3-D subband coding of video”, IEEE Trans. Image Proc., vol. 3, pp. 572-588, September 1994.
  • the client requests information for the video at some particular spatial resolution and temporal resolution (frame rate). Based on this information, the server determines the distortion which will be introduced by approximating the relevant motion information with only L q (M) bits from the respective embedded bit-streams, where the available values for L q (M) are determined by the particular embedded quantisation and coding strategy which has been used.
  • Let D q (M) denote this distortion, measured in terms of Mean Squared Error (MSE), or a visually weighted MSE.
  • the values D q (M) may be estimated from the spatial-frequency power spectrum of the relevant frames.
  • D q (M) depends not only on the accuracy with which the motion parameters are represented by the L q (M) bits of embedded motion information, but also on the spatial resolution of interest. At lower spatial resolutions, less accuracy is required for the motion information, since the magnitude of the phase shifts associated with motion error are directly proportional to spatial frequency.
  • the server would also estimate or know the distortion, D p (S) , associated with the first L p (S) bits of the embedded representation generated during scalable coding of the sample values produced by the motion adaptive transform.
  • scalable sample data compression schemes are well known to those skilled in the art. Assuming an additive model for these two different distortion contributions, the server balances the amount of information delivered for the motion and sample data components, following the usual Lagrangian policy.
  • the server finds the largest values of p λ and q λ such that

    −ΔD p λ (S) / ΔL p λ (S) ≥ λ and −ΔD q λ (M) / ΔL q λ (M) ≥ λ  (5)

    adjusting λ > 0 so that L p λ (S) + L q λ (M) is as large as possible, while not exceeding L max .
  • ΔD p (S) / ΔL p (S) and ΔD q (M) / ΔL q (M) are discrete approximations to the distortion-length slope at the embedded truncation points p (for sample data) and q (for motion data), respectively.
  • the client-server application described above is only an example. Similar techniques may be used to construct scalable compressed video files which contain an embedded hierarchy of progressively higher quality video, each level in the hierarchy having its own balance between the amount of information contributed from the embedded motion representation and the embedded sample data representation.
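The Lagrangian balancing rule of equation (5) can be sketched as follows. The function and argument names are illustrative; the distortion-length slopes are assumed pre-computed and monotonically decreasing, as is usual for embedded bit-streams.

```python
def find_truncation_points(slopes_s, lengths_s, slopes_m, lengths_m, l_max):
    """Find the largest truncation points p and q such that the
    distortion-length slopes satisfy -dD/dL >= lambda for both the
    sample (S) and motion (M) bit-streams, adjusting lambda > 0 so the
    combined length is as large as possible without exceeding l_max.

    slopes_* are decreasing lists of slope magnitudes at each embedded
    truncation point; lengths_* are the corresponding cumulative
    lengths in bits.
    """
    def keep(slopes, lam):
        p = 0
        while p < len(slopes) and slopes[p] >= lam:
            p += 1
        return p                      # truncation points retained

    best = (0, 0)
    for lam in sorted(set(slopes_s) | set(slopes_m), reverse=True):
        p, q = keep(slopes_s, lam), keep(slopes_m, lam)
        total = (lengths_s[p - 1] if p else 0) + \
                (lengths_m[q - 1] if q else 0)
        if total <= l_max:
            best = (p, q)             # feasible; try a smaller lambda
        else:
            break                     # lengths only grow from here
    return best
```

Sweeping λ from large to small grows both truncated prefixes together, so motion and sample accuracy are refined jointly rather than one at the expense of the other.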
  • rate-distortion optimisation strategies have previously been described in the literature for balancing the costs of motion and sample data information, this has not previously been done in a scalable setting, where both the motion and the sample data accuracy are progressively refined together.
  • a spatial wavelet transform is found to offer two chief benefits over coding the motion parameters directly. Firstly, the transform typically produces a large number of near-zero valued coefficients which can be quantized to zero with negligible error and then efficiently encoded. Secondly, the DWT shapes the quantization errors incurred when the motion representation is scaled, and this shaping is found to significantly reduce the reconstructed video distortion incurred at any given level of motion quantization error. In the preferred embodiment, a reversible (integer-to-integer) spatial DWT is used to allow exact recovery of the originally estimated motion parameters from the encoded transform coefficients, which is useful at high video bit-rates. Reversible wavelet transforms are well-known to those skilled in the art. One example is the reversible 5/3 spatial DWT which forms part of the JPEG2000 image compression standard, IS 15444-1.
  • Temporal transformation of the motion parameter information can have similar benefits to spatial transformation, and the effects are found to be somewhat complementary. That is, the use of both a spatial DWT and a temporal transform together is recommended.
  • each pair of temporally adjacent motion fields is replaced by the sum and the difference of the corresponding motion vectors. These sums and differences may be interpreted as temporal low- and high-pass subbands.
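A sketch of this pairwise sum/difference transform on motion vector fields, together with its obvious inverse (function names are illustrative):

```python
import numpy as np

def motion_haar(v0, v1):
    """Replace two temporally adjacent motion fields by their sum
    (temporal low-pass) and difference (temporal high-pass)."""
    return v0 + v1, v1 - v0

def motion_haar_inverse(low, high):
    """Recover the original pair of motion fields exactly from the
    sum and difference subbands."""
    v0 = (low - high) / 2
    v1 = (low + high) / 2
    return v0, v1
```

Since motion tends to vary slowly in time, the difference (high-pass) fields are typically near zero and compress well.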
  • the preferred embodiments are those which use techniques derived from the general class of bit-plane coders.
  • the highly efficient and finely embedded fractional bit-plane coding techniques which form part of the JPEG2000 image compression standard are to be recommended.
  • each subband produced by the motion parameter transform is partitioned into code-blocks, and each code-block is encoded using a fractional bit-plane coder, producing a separate finely embedded bit-stream for each motion subband code-block.
  • the code-block partitioning principles enshrined in the JPEG2000 standard can be useful when compressing very large video frames, each of which has a large number of motion vectors. In general, then, the motion information is represented by a collection of code-blocks, each of which has a finely embedded bit-stream which may be truncated to any of a variety of coded lengths.
  • the EBCOT algorithm [D. Taubman, “High performance scalable image compression with EBCOT”, IEEE Trans. Image Proc ., vol. 9, pp. 1158-1170, July 2000] represents an excellent framework for converting a large number of embedded code-block bit-streams, each with its own set of truncation points, into a global collection of abstract “quality” layers.
  • Each quality layer contains incremental contributions from each code-block's embedded bit-stream, where these contributions are balanced in a manner which minimises the distortion associated with the overall representation at the total bit-rate associated with the quality layer.
  • By arranging the quality layers in sequence one obtains a succession of truncation points, at each of which the representation is as accurate as it can be, relative to the size of the included quality layers.
  • D M denotes mean squared error in the motion vectors due to truncation of the embedded motion parameter code-block bit-streams
  • D x,M represents the total induced squared error in the reconstructed video sequence.
  • the scaling factor, ⁇ R,S depends upon the spatial resolution at which the video signal is to be reconstructed and also upon the accuracy with which the video samples are represented.
  • motion parameter quality layers are constructed from the embedded motion block bit-streams, following the EBCOT paradigm.
  • the rate-distortion optimality of the layered motion representation holds over a wide range of spatial resolutions and levels of video sample quantization error. This is extremely convenient, since it means that the rate-distortion optimization problem expressed in equation (5) can be solved once, while constructing the motion quality layers, after which a video server or transcoder need only decide how many motion layers are to be included in the video bit-stream for a given spatial resolution and a given level of error in the video sample data.
  • the same layering strategy of EBCOT is used to construct a separate set of rate-distortion optimal quality layers for the video sample data. These are obtained by subjecting the temporal subbands produced by the motion-compensated temporal lifting steps to spatial wavelet transform, partitioning the spatio-temporal video subbands into their own code-blocks, and generating embedded bit-streams for each video sample code-block.
  • the video sample quality layers then consist of incremental contributions from the various video sample code-blocks, such that the video sample distortion is as small as it can be, relative to the total size of those quality layers.
  • the construction of rate-distortion optimal video sample quality layers is substantially independent of the spatial resolution (the number of resolution levels from the spatial video sample DWT which will be sent to the decoder) and the temporal resolution (the number of temporal subbands produced by the motion compensated lifting steps which will be sent to the decoder). It also turns out that the optimality of the layer boundaries is approximately independent of the level of motion distortion, at least for combinations of motion and video sample bit-rates which are approximately optimal.
  • preferred embodiments of the invention produce a single set of motion quality layers and a single set of video sample quality layers.
  • the layers are internally rate-distortion optimal over the temporal interval within which they are formed. Since video streams can have unbounded duration, we divide the time scale into epochs known as “frame slots”. In each frame slot, a separate set of motion quality layers and video sample quality layers is formed. The optimization problem associated with equation (5) then reduces to that of balancing the number of motion quality layers with the number of video sample quality layers which are sent to a decoder within each frame slot.
  • a complete implementation of the preferred embodiment of the invention must provide a means for deciding how many motion quality layers, q, are to be included with a subset of the video bit-stream which includes p video sample quality layers, given the spatial resolution R, at which the video content is to be viewed.
  • the preferred way to do this is to include a collection of tables with each frame slot, there being one table per spatial resolution which may be of interest, where each table provides an entry for each number of video sample quality layers, p, identifying the corresponding best number of motion layers, q p .
  • a video server or transcoder needing to meet a compressed length constraint L max within each frame slot, can use these tables to determine p and q p which are jointly optimal, such that the total length of the respective quality layers is as small as possible, but no smaller than L max . It is then preferable to discard data from the p th video sample quality layer, until the length target L max is satisfied. This approach is preferable to that of discarding motion data, since there is generally more video sample data.
  • One way to build the aforementioned tables is simply to decompress the video at each spatial resolution, using each combination of motion and sample quality layers, q and p, so as to find the value of q p which minimises the distortion for the total bit-rate in each frame slot, for each p. Of course, this can be computationally expensive. Nevertheless, this brute-force search strategy is computationally feasible.
  • a preferred means to build the aforementioned tables is to use the fact that these tables depend only on the linear scaling factors, ⁇ R,S . These scaling factors depend, in turn, on the power spectra of the video frames which are reconstructed at each level of video sample error, i.e., at each video sample quality layer p. In the preferred embodiment of the invention, these power spectra are estimated directly from the video sample subband data during the compression process. We find, in practice, that such estimation strategies can produce results almost as good as the brute force search method described above, at a fraction of the computational cost.

Abstract

A method for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples comprises the steps of: (a) producing an embedded bit-stream, representing each motion field in coarse to fine fashion; and (b) interleaving incremental contributions from said embedded motion fields with incremental contributions from said transformed video samples. A further embodiment of a method for estimating and signalling motion information for a motion adaptive transform based on temporal lifting steps comprises the steps of: (a) estimating and signalling motion parameters describing a first mapping from a source frame onto a target frame within one of the lifting steps; and (b) inferring a second mapping between either said source frame or said target frame, and another frame, based on the estimated and signalled motion parameters associated with said first mapping.

Description

    FIELD OF THE INVENTION
  • The present invention relates to efficient compression of motion video sequences and, in preferred embodiments, to a method for producing a fully scalable compressed representation of the original video sequence while exploiting motion and other spatio-temporal redundancies in the source material. The invention relates specifically to the representation and signalling of motion information within a scalable compression framework which employs motion adaptive wavelet lifting steps. Additionally, the present invention relates to the estimation of motion parameters for scalable video compression and to the successive refinement of motion information by temporal resolution, spatial resolution or precision of the parameters.
  • BACKGROUND OF THE INVENTION
  • For the purpose of the present discussion, the term “internet” will be used both in its familiar sense and also in its generic sense to identify a network connection over any electronic communications medium or collection of cooperating communications systems.
  • Currently, most video content which is available over the internet must be pre-loaded in a process which can take many minutes over typical modem connections, after which the video quality and duration can still be quite disappointing. In some contexts video streaming is possible, where the video is decompressed and rendered in real-time as it is being received; however, this is limited to compressed bit-rates which are lower than the capacity of the relevant network connections. The most obvious way of addressing these problems would be to compress and store the video content at a variety of different bit-rates, so that individual clients could choose to browse the material at the bit-rate and attendant quality most appropriate to their needs and patience. Approaches of this type, however, do not represent effective solutions to the video browsing problem. To see this, suppose that the video is compressed at bit-rates of R, 2R, 3R, 4R and 5R. Then storage must be found on the video server for all these separate compressed bit-streams, which is clearly wasteful. More importantly, if the quality associated with a low bit-rate version of the video is found to be insufficient, a complete new version must be downloaded at a higher bit-rate; this new bit-stream must take longer to download, which generally rules out any possibility of video streaming.
  • To enable real solutions to the remote video browsing problem, scalable compression techniques are essential. Scalable compression refers to the generation of a bit-stream which contains embedded subsets, each of which represents an efficient compression of the original video with successively higher quality. Returning to the simple example above, a scalable compressed video bit-stream might contain embedded sub-sets with the bit-rates of R, 2R, 3R, 4R and 5R, with comparable quality to non-scalable bit-streams, having the same bit-rates. Because these subsets are all embedded within one another, however, the storage required on the video server is identical to that of the highest available bit-rate. More importantly, if the quality associated with a low bit-rate version of the video is found to be insufficient, only the incremental contribution required to achieve the next higher level of quality must be retrieved from the server. In a particular application, a version at rate R might be streamed directly to the client in real-time; if the quality is insufficient, the next rate-R increment could be streamed to the client and added to the previous, cached bit-stream to recover a higher quality rendition in real time. This process could continue indefinitely without sacrificing the ability to display the incrementally improving video content in real time as it is being received from the server.
  • The above application could be extended in a number of exciting ways. Firstly, if the scalable bit-stream also contains distinct subsets corresponding to different intervals in time, then a client could interactively choose to refine the quality associated with specific time segments which are of the greatest interest. Secondly, if the scalable bit-stream also contains distinct subsets corresponding to different spatial regions, then clients could interactively choose to refine the quality associated with specific spatial regions over specific periods of time, according to their level of interest. In a training video, for example, a remote client could interactively “revisit” certain segments of the video and continue to stream higher quality information for these segments from the server, without incurring any delay.
  • To satisfy the needs of applications such as that mentioned above, low bit-rate subsets of the video must be visually intelligible. In practice, this means that most of the bits available in a low bit-rate subset of the video are likely to contribute to the reconstruction of the video at a reduced frame rate, since attempting to recover the full frame rate video over a low bit-rate channel will result in unacceptable deterioration of the spatial details within each frame. In order to achieve smooth quality scalability within a compressed video sequence which also offers frame rate scalability, the details required to recover higher frame rates must contribute to the refinement of a model which involves motion sensitive temporal interpolation.
  • Without temporal interpolation, missing frames cannot be introduced into a low rate video sequence without first augmenting their spatial fidelity to a level commensurate with the frames already available, and this implies a large discontinuous jump in the amount of information which must be provided to the decoder in order to smoothly increase the reconstructed video quality. Continuing this line of argument, we see that motion information is important to highly scalable video compression; moreover, the motion itself must be represented in a manner which can be scaled, according to the temporal resolution (frame rate), spatial resolution and quality of the sample data.
  • Motion Adaptive Transforms Based on Wavelet Lifting
  • The present invention is best appreciated in the context of an earlier invention, which is the subject of WO 02/50772. This earlier patent application describes a method for modifying the individual lifting steps in a lifting implementation of a temporal wavelet decomposition, so as to compensate for the effects of motion. This method has the following advantageous properties: 1) the motion sensitive transform may be perfectly inverted, in the absence of any compression artefacts; 2) the low temporal resolution subsets of the wavelet hierarchy offer high spatial fidelity, so that the transform allows excellent frame rate scalability; 3) the high-pass temporal detail subbands produced by the transform have very low energy, allowing high compression efficiency; 4) in the absence of motion, the transform reduces to a regular wavelet decomposition along the temporal axis; and 5) in the presence of locally translational motion, the transform is equivalent to applying a regular wavelet decomposition along the motion trajectories.
  • To assist in the present discussion, we briefly summarise the key ideas behind this earlier invention. Any two-channel FIR subband transform can be described as a finite sequence of lifting steps [W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” Applied and Computational Harmonic Analysis, vol. 3, pp. 186-200, April 1996]. It is instructive to begin with an example based upon the Haar wavelet transform. Up to a scale factor, this transform may be realised in the temporal domain, through a sequence of two lifting steps, as

    hk[n] = x2k+1[n] − x2k[n]

    lk[n] = x2k[n] + ½ hk[n]

    where xk[n] ≡ xk[n1, n2] denotes the samples of frame k from the original video sequence, and hk[n] ≡ hk[n1, n2] and lk[n] ≡ lk[n1, n2] denote the high-pass and low-pass subband frames.
  • lk[n] and hk[n] correspond to the scaled sum and the difference of each original pair of frames. An example is shown in FIG. 1A. Since motion is ignored, ghosting artefacts are clearly visible in the low-pass temporal subband, and the high-pass subband frame has substantial energy.
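The two lifting steps above, and their exact inversion, can be sketched as follows (motion ignored, as in FIG. 1A; function names are illustrative):

```python
import numpy as np

def haar_lift(x_even, x_odd):
    """Temporal Haar transform, up to a scale factor, via two lifting
    steps: h = x_odd - x_even, then l = x_even + h/2.
    The result is the pairwise average and difference of the frames."""
    h = x_odd - x_even
    l = x_even + h / 2
    return l, h

def haar_unlift(l, h):
    """Exact inverse: undo the lifting steps in reverse order."""
    x_even = l - h / 2
    x_odd = h + x_even
    return x_even, x_odd
```

Each lifting step only adds a function of one channel to the other, so inversion is trivial: subtract the same quantity in reverse order.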
  • Now let Wk1→k2 denote a motion-compensated mapping of frame k1 onto the coordinate system of frame k2, so that Wk1→k2(xk1)[n] ≈ xk2[n] for all n. The lifting steps are modified as follows:

    hk[n] = x2k+1[n] − W2k→2k+1(x2k)[n]  (1)

    lk[n] = x2k[n] + ½ W2k+1→2k(hk)[n]  (2)
    Note that W2k→2k+1 and W2k+1→2k represent forward and backward motion mappings, respectively. The high-pass subband frames correspond to motion-compensated residuals. These will be close to zero in regions where the motion is accurately modelled. The result is shown in FIG. 1B.
  • The framework described above is readily extended to any two-channel FIR subband transform, by motion-compensating the relevant lifting steps.
  • We demonstrate this in the important case of the biorthogonal 5/3 wavelet transform [D. Le Gall and A. Tabatabai, “Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques,” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 761-764, April 1988]. As before, x2k[n] and x2k+1[n] denote the even and odd indexed frames from the original sequence. Without motion, the 5/3 transform may be implemented by alternately updating each of these two frame sub-sequences, based on filtered versions of the other sub-sequence. The lifting steps are

    hk[n] = x2k+1[n] − ½(x2k[n] + x2k+2[n])

    lk[n] = x2k[n] + ¼(hk−1[n] + hk[n])
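One stage of these motion-free 5/3 lifting steps can be sketched as follows; the boundary handling shown (simple symmetric extension) is one possible illustrative choice, not necessarily the embodiment's.

```python
import numpy as np

def lift_53(x):
    """One stage of the 5/3 temporal transform (motion ignored).

    x is a list of frames with even length.  Each odd frame is
    predicted by the average of its even neighbours (high-pass h_k);
    each even frame is then updated from the adjacent high-pass
    frames (low-pass l_k).  Boundaries use simple symmetric
    extension, for illustration only.
    """
    n = len(x)
    h = [x[2 * k + 1]
         - 0.5 * (x[2 * k] + x[min(2 * k + 2, n - 2)])
         for k in range(n // 2)]
    l = [x[2 * k] + 0.25 * (h[max(k - 1, 0)] + h[k])
         for k in range(n // 2)]
    return l, h
```

For static content the prediction is exact: the high-pass frames vanish and the low-pass frames reproduce the input, halving the frame rate without ghosting.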
  • As before, we introduce motion warping operators within each lifting step, which yields the following:

    hk[n] = x2k+1[n] − ½(W2k→2k+1(x2k)[n] + W2k+2→2k+1(x2k+2)[n])  (3)

    lk[n] = x2k[n] + ¼(W2k−1→2k(hk−1)[n] + W2k+1→2k(hk)[n])  (4)
  • FIG. 2 demonstrates the effect of these modified lifting steps. The highpass frames are now essentially the residual from a bidirectional motion compensated prediction of the odd-indexed original frames. When the motion is adequately captured, these high-pass frames have little energy and the low-pass frames have excellent spatial fidelity.
  • Counting the Cost of Motion
  • In the example of the Haar transform, given above, two separate motion mapping operators, W2k→2k+1 and W2k+1→2k, are required to process every pair of frames, x2k[n] and x2k+1[n]. Their respective motion parameters must be transmitted to the decoder. To provide a larger number of temporal resolution levels, the transform is re-applied to the low-pass subband frames, lk[n], for which motion mapping operators W4k→4k+2 and W4k+2→4k are required for every four frames. Continuing in this way, an arbitrarily large number of temporal resolutions may be obtained, using 2/2 + 2/4 + 2/8 + … ≈ 2 motion fields per original frame.
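  • The geometric series behind this count can be checked directly: level j of the Haar hierarchy requires 2 motion fields for every 2^j original frames.

```python
from fractions import Fraction

def haar_fields_per_frame(levels):
    # Level j needs 2 motion fields per 2**j frames, so the total per
    # original frame is 2/2 + 2/4 + ... + 2/2**levels, approaching 2.
    return sum(Fraction(2, 2 ** j) for j in range(1, levels + 1))
```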
  • For the example of the 5/3 transform, also given above, four motion mapping operators, W2k→2k+1, W2k→2k−1, W2k+1→2k and W2k−1→2k are required for every pair of frames (indexed by k), for just one level of temporal decomposition. Continuing the transformation to an arbitrarily large number of temporal resolutions involves approximately 4 motion fields per original video frame.
  • The cost of estimating, coding and transmitting the above motion fields can be substantial. Moreover, this cost may adversely affect the scalability of the entire compression scheme, since it is not immediately clear how to progressively refine the motion fields without destroying the subjective properties of the reconstructed video when the motion is represented with reduced accuracy.
  • The previous invention shows that a wide variety of motion modelling techniques are compatible with the motion adaptive lifting transform, and also recommends the use of continuously deformable motion models such as those associated with triangular or quadrilateral meshes (see, for example, Y. Nakaya and H. Harashima, “Motion compensation based on spatial transformations,” IEEE Trans. Circ. Syst. For Video Tech., Vol. 4, pp 339-367, June 1994). However, no particular solution is presented to the difficulties described above.
  • SUMMARY OF THE INVENTION
  • Accordingly, in one aspect, the present invention provides a method for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples, said method comprising the steps of:
  • (a) producing an embedded bit-stream, representing each motion field in coarse to fine fashion; and
  • (b) interleaving incremental contributions from said embedded motion fields with incremental contributions from said transformed video samples.
  • The present invention also provides a system for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples, said system comprising:
  • (a) means for producing an embedded bit-stream, representing each motion field in coarse to fine fashion; and
  • (b) means for interleaving incremental contributions from said embedded motion fields with incremental contributions from said transformed video samples.
  • Thus, because each motion field is represented in coarse to fine fashion and interleaved with the video data bit-stream, the accuracy required for motion representation can be balanced with the accuracy of the transformed sample values which may be recovered from the bit-stream. Therefore, a fully scalable video bit-stream may be progressively refined, both in regard to its quantised sample representations and in regard to its motion representation.
  • Preferably, the embedded motion field bit-stream is obtained by applying embedded quantization and coding techniques to the motion field parameter values.
  • Preferably, the embedded motion field bit-stream is obtained by coding the node displacement parameters associated with a triangular mesh motion model on a coarse to fine grid, each successive segment of the embedded bit-stream providing displacement parameters for node positions which lie on a finer grid than the previous stage, all coarser grids of node positions being subsets of all finer grids of node points.
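  • The nested-grid structure described above can be sketched as follows. The dyadic layout and the power-of-two frame dimension are illustrative assumptions; the essential property is that every coarser grid of node positions is a subset of every finer one, so each bit-stream segment signals only the newly introduced nodes.

```python
def grid_nodes(level, size):
    # Node positions on a dyadic grid over a (size x size) frame;
    # level 0 is the coarsest. Assumes size is a power of two.
    step = size >> level
    return {(x, y) for x in range(0, size + 1, step)
                   for y in range(0, size + 1, step)}

def new_nodes(level, size):
    # Nodes whose displacement parameters are signalled in segment
    # `level` of the embedded bit-stream: those not on any coarser grid.
    if level == 0:
        return grid_nodes(0, size)
    return grid_nodes(level, size) - grid_nodes(level - 1, size)
```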
  • Preferably, a coarse to fine motion representation is obtained by first transforming the motion parameters and then coding the transform coefficients using embedded quantization and coding techniques.
  • Preferably, the motion parameters are transformed by applying spatial discrete wavelet transforms and/or temporal transforms thereto.
  • Preferably, the spatial and/or temporal transforms are reversible integer-to-integer transforms, suitable for lossless compression.
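  • One reversible integer-to-integer transform suitable for this purpose is the classic S-transform (integer Haar), shown here as a sketch; the document does not mandate this particular kernel.

```python
def s_transform(a, b):
    # Reversible integer-to-integer Haar (S-transform) on a pair of
    # integer motion parameters: difference, then floor-mean update.
    d = a - b
    s = b + (d >> 1)   # arithmetic shift gives the floored mean
    return s, d

def s_inverse(s, d):
    # Exact integer inverse, enabling lossless motion coding.
    b = s - (d >> 1)
    return b + d, b
```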
  • Preferably, the embedded motion bit-streams are arranged into a sequence of quality layers, and the transformed video samples are also encoded into embedded bit-streams which are arranged into a separate sequence of quality layers.
  • Preferably, said interleaving of the contributions from the embedded motion bit-streams and from the transformed video samples is performed in a manner which minimizes the expected distortion in the reconstructed video sequence at each of a plurality of compressed video bit-rates.
  • Preferably, the measure of distortion is Mean Squared Error. Preferably, the measure of distortion is a weighted sum of the Mean Squared Error contributions from different spatial frequency bands, weighted according to perceptual relevance factors.
  • Preferably, the distortion associated with inaccurate representation of the motion parameters is determined using an estimate of the spatial power spectrum of the video source.
  • Preferably, the distortion associated with inaccurate representation of the motion parameters is determined using information about the spatial resolution at which the video bit-stream is to be decompressed.
  • Preferably, the power spectrum of the video source is estimated using spatio-temporal video sample subbands created during compression.
  • Preferably, the proportions of contributions from said embedded motion fields and said transformed video samples in the embedded bit-stream are determined on the basis of a plurality of tables associated with each frame, each table being associated with a spatial resolution at which the video bit-stream is to be decompressed. In the embodiment wherein the embedded motion bit-streams and the transformed video samples are each encoded as a series of quality layers, the tables identify the number of motion quality layers which are to be included with each number of video sample quality layers.
  • The preferred structure of the motion representation allows rate-distortion optimal algorithms to balance the contributions of motion information and sample accuracy, as it is being included into an incrementally improving (or layered) compressed representation. While rate-distortion optimisation strategies for balancing motion and sample accuracy have been described in the literature, those algorithms were applicable only to static optimisation of a compressed bit-stream for a single target bit-rate. The preferred embodiment of the present invention allows for the rate-distortion optimised balancing of motion and sample accuracy to be extended to scalable content in which the target bit-rate cannot be known a priori.
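  • The rate-distortion balancing of motion and sample layers can be sketched as a merge of two embedded streams. This is a simplified illustration, not the patent's optimisation procedure: each quality layer is summarised by hypothetical (bits, distortion-reduction) pairs, assumed positive and already in decreasing-slope (embedded) order.

```python
def interleave_layers(motion, video):
    # motion, video: quality layers as (bits, distortion_reduction),
    # in embedded order with decreasing R-D slope. Merge the two
    # streams, always taking the stream whose next layer removes the
    # most distortion per bit, preserving within-stream order.
    order, i, j = [], 0, 0
    while i < len(motion) or j < len(video):
        sm = motion[i][1] / motion[i][0] if i < len(motion) else -1.0
        sv = video[j][1] / video[j][0] if j < len(video) else -1.0
        if sm >= sv:
            order.append(('motion', i)); i += 1
        else:
            order.append(('video', j)); j += 1
    return order
```

Truncating the merged stream at any prefix then yields a near rate-distortion optimal balance at every bit-rate, which is the scalable behaviour the text describes.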
  • According to a further aspect of the present invention, a method for estimating and signalling motion information for a motion adaptive transform based on temporal lifting steps, comprises the steps of:
  • (a) estimating and signalling motion parameters describing a first mapping from a source frame onto a target frame within one of the lifting steps; and
  • (b) inferring a second mapping between either said source frame or said target frame, and another frame, based on the estimated and signalled motion parameters associated with said first mapping.
  • The present invention also provides a system for estimating and signalling motion information for a motion adaptive transform based on temporal lifting steps, said system comprising:
  • (a) means for estimating and signalling motion parameters describing a first mapping from a source frame onto a target frame within one of the lifting steps; and
  • (b) means for inferring a second mapping between either said source frame or said target frame, and another frame, based on the estimated and signalled motion parameters associated with said first mapping.
  • Accordingly, the number of motion fields which must be signalled to the decompressor can be reduced, as some motion fields can be inferred from others.
  • For instance, in one embodiment said second mapping is the reciprocal mapping from said target frame to said source frame, for use within another one of the lifting steps. Preferably, said reciprocal mapping is the inverse of the first mapping.
  • Thus, the preferred embodiment provides a method for estimating and representing only one of the motion fields in each pair, W2k→2k+1 and W2k+1→2k, or W2k→2k−1, and W2k−1→2k. Such pairs of motion fields will be known here as “reciprocal pairs.” This allows the total amount of motion information to be reduced to one motion field per frame for the Haar case, and 2 motion fields per frame for the 5/3 case. It is found that collapsing reciprocal pairs to a single motion field, from which the pair is recovered, actually improves the properties of the motion adaptive transform, resulting in increased compression efficiency, even when the benefits of reduced motion cost are not taken into account.
  • In one embodiment, the motion parameters of said first mapping correspond to a deformable triangular mesh motion model. Preferably, said reciprocal mapping is inferred by inverting the affine transformations associated with the triangular mesh used to represent said first mapping.
  • In another embodiment, the motion parameters of said first mapping correspond to a block displacement motion model.
  • Preferably, said motion adaptive transform involves multiple stages of temporal decomposition, corresponding to different temporal frame rates.
  • Preferably, motion parameters at each temporal resolution are deduced from original video frames.
  • In one embodiment said second mapping is a mapping between frames at a lower temporal resolution than said first mapping, and said second mapping is inferred by compositing the first mapping with at least one further mapping between frames at the higher temporal resolution.
  • This embodiment enables all of the required motion fields at lower temporal resolutions (higher temporal displacements) to be derived from an initial set of frame-to-frame motion fields. Thus, the compressor need only estimate the motion between each successive pair of frames, xk[n] and xk+1[n]. This substantially reduces the cost in memory and computation of the motion estimation task, without significantly altering the compression efficiency or other properties of the motion adaptive transform.
  • In another embodiment, said second mapping is a mapping between frames at a higher temporal resolution than said first mapping, and said second mapping is inferred by compositing the first mapping with at least one further mapping at the higher temporal resolution. For example, preferably the higher resolution is double said lower resolution, and alternate mappings at the higher temporal resolution are explicitly signalled to a decompressor, the remaining mappings at the higher temporal resolution being replaced by the mappings inferred by compositing the lower resolution mappings with respective higher resolution mappings. Preferably said replaced mappings are used within the lifting steps of said motion adaptive transform, in place of the originally estimated mappings which were replaced.
  • This further reduces the motion information to 1 motion field per video frame, even for the 5/3 transform. The method of this embodiment has the property that the motion representation is temporally scalable. In particular, only one motion field must be made available to the decoder for each video frame which it can reconstruct, at any selected temporal resolution. This method involves judicious compositing of the forward and backward motion fields from different temporal resolution levels and is compatible with the efficient motion estimation method described above, of compositing motion fields at higher resolutions to obtain motion fields at lower resolutions.
  • Preferably said replaced mappings are refined with additional motion parameters, said refinement parameters being signalled for use in decompression, and said replaced and refined mappings being used within the lifting steps of said motion adaptive transform, in place of the originally estimated mappings which were replaced.
  • Preferably, inversion or composition of motion transformations is accomplished by applying said motion transformations to the node positions of a triangular mesh motion model, the composited or inverted motion transformation being subsequently applied by performing the affine transformations associated with said mesh motion model.
  • Preferably, the source frame is partitioned into a regular mesh and the inversion or composition operations are applied to each node of the regular mesh to find a corresponding location in the target frame, the composited or inverted motion transformation being subsequently applied by performing the affine transformations associated with said mesh motion model. This is a particularly efficient computational method for performing the various motion field transformations required by other aspects of the invention. These methods are preferably replicated at both the compressor and the decompressor, if the transform is to remain strictly invertible.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Embodiments of the invention will now be described with reference to the accompanying drawings, in which:
  • FIG. 1A illustrates the lifting steps for the Haar temporal transform;
  • FIG. 1B illustrates a motion adaptive modification of the lifting steps for the Haar temporal transform;
  • FIG. 2 illustrates the lifting steps for a motion adaptive 5/3 temporal transform;
  • FIG. 3 illustrates a triangular mesh motion model;
  • FIG. 4 illustrates schematically the compositing of two motion fields at a higher temporal resolution to create one at a lower resolution; and
  • FIG. 5 illustrates schematically the compositing of motion fields in one embodiment of a temporally scalable motion representation for the motion adaptive 5/3 lifting transform.
  • DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
  • 1st Aspect: Reciprocal Motion Fields
  • A natural strategy for estimating the reciprocal motion fields, W2k→2k+1 and W2k+1→2k, would be to determine the parameters for W2k→2k+1 which minimise some measure (e.g., energy) of the mapping residual x2k+1−W2k→2k+1(x2k) and to separately determine the parameters for W2k+1→2k which minimise some measure of its residual signal, x2k−W2k+1→2k(x2k+1). In general, such a procedure will lead to parameters for W2k→2k+1, which cannot be deduced from those for W2k+1→2k and vice-versa, so that both sets of parameters must be sent to the decoder.
  • It turns out that only one of the two motion fields must be directly estimated. The other can then be deduced by “inverting” the motion field which was actually estimated. Both the compressor and the decompressor may perform this inversion so that only one motion field must actually be transmitted.
  • True scene motion fields cannot generally be inverted, due to the presence of occlusions and uncovered background. One would expect, therefore, to degrade the properties of the motion adaptive transform (e.g., compression performance, or quality of the low temporal resolution frames) by replacing W2k→2k+1 with an approximate inverse of W2k+1→2k or vice-versa.
  • It turns out, however, that the opposite is the case. Rather than degrading the transform, representing each reciprocal pair with only one motion field actually improves the compression efficiency and the quality of the low temporal resolution frames.
  • An explanation for the above phenomenon is given in A. Secker and D. Taubman, “Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression”, accepted to appear in IEEE Trans. Image Proc., 2003, a copy of which is available at www.ee.unsw.edu.au/˜taubman/. Briefly, the excellent properties of the motion adaptive temporal lifting transform are closely linked to the reciprocal relationship between the pairs, W2k→2k+1 and W2k+1→2k, and W2k→2k−1 and W2k−1→2k. If the frame warping operations described by each pair are truly inverses of one another, the motion adaptive transform is equivalent to a one-dimensional DWT, applied along the underlying motion trajectories. If they are not inverses of one another, this desirable characteristic is lost, no matter how well they are able to minimise motion compensated residuals.
  • According to the first aspect of the present invention, only one motion field from each reciprocal pair should be directly estimated and communicated to the decompressor. Unless otherwise prohibited (e.g., by the later aspects of the invention), it is mildly preferable to directly estimate and communicate the parameters of the motion field which is used in the first (predictive) lifting step. This is the lifting step described by equations (1) and (3), for the Haar and 5/3 cases, respectively.
  • Inversion of Triangular Mesh Motion Models
  • Where the motion is represented by a continuously deformable triangular mesh [Y. Nakaya and H. Harashima, “Motion compensation based on spatial transformations”, IEEE Trans. Circ. Syst. For Video Tech., vol. 4, pp 339-367, June 1994], the affine motion which describes the deformation of each triangle in W2k→2k+1 or W2k→2k−1 may be directly inverted to recover W2k+1→2k and W2k−1→2k, respectively. A triangular mesh model for motion field Wk1→k2 involves a collection of node positions, {ti}, in the target frame, xk2, together with the locations, {si}, of those same node positions as they appear in the source frame, xk1. Although scene adaptive meshes have been described, in the preferred embodiment of the invention the target node positions, {ti}, are fixed, and the motion field is parametrized by the set of node displacements, {si−ti}. The target frame, xk2, is partitioned into a collection of disjoint triangles, whose vertices correspond to the node positions. Since the partition must cover the target frame, some of the target node positions must lie on the boundaries of the frame. An example involving a rectangular grid of target node vertices is shown in FIG. 3.
  • As suggested by the figure, it is convenient to write {Δj} for the set of target frame triangles. Let tj,0, tj,1 and tj,2 denote the vertices of target triangle Δj. The triangular mesh then maps the source triangle, Δ′j, described by the vertices sj,0, sj,1 and sj,2, onto the target triangle Δj. The motion map itself is described by an affine transformation. Specifically, for each location, t∈Δj, within the target frame, the corresponding location, s, within the source frame is given by the affine equation

    s = Ajt + bj

    where t, s and bj are regarded as column vectors and Aj is a 2×2 matrix. Aj and bj may be deduced from the motion parameters, using the fact that tj,i must map to sj,i for each i=0, 1, 2. Of course, s does not generally lie on an integer grid, and so the source frame must be interpolated, using any of a number of well-known methods, to recover the value of (Wk1→k2(xk1))[t].
  • In the simplest case, whenever a target node position, ti, lies on the boundary of frame xk2, the corresponding source node position, si, is constrained to lie on the same boundary of frame xk1, as depicted in FIG. 3. In this case, the source triangles, Δ′j, completely cover the source frame and so each location, s, in frame xk1, may be associated with one of the triangles, Δ′j, and hence mapped back onto the target frame through the inverse affine relation
    t = Aj−1(s − bj)
  • In this way, the value of (Wk2→k1(xk2))[s] may be found for each location, s, by interpolating frame xk2 to the location, t.
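  • The affine machinery above reduces to two small linear solves. The sketch below (an illustration, with hypothetical function names) recovers Aj and bj from the three vertex correspondences tj,i → sj,i, applies the forward map s = Ajt + bj, and inverts it via t = Aj−1(s − bj).

```python
import numpy as np

def affine_from_triangle(t_verts, s_verts):
    # Recover A_j (2x2) and b_j from t_{j,i} -> s_{j,i}, i = 0, 1, 2,
    # by solving s = A t + b as one 3x3 system per output coordinate.
    t = np.asarray(t_verts, float)           # (3, 2) target vertices
    s = np.asarray(s_verts, float)           # (3, 2) source vertices
    M = np.hstack([t, np.ones((3, 1))])      # rows [t_x, t_y, 1]
    p = np.linalg.solve(M, s)                # columns hold (A row, b) per axis
    return p[:2].T, p[2]                     # A_j, b_j

def forward_map(A, b, t):
    # s = A_j t + b_j : target-frame location to source-frame location.
    return A @ np.asarray(t, float) + b

def inverse_map(A, b, s):
    # t = A_j^{-1}(s - b_j) : used to recover the reciprocal mapping.
    return np.linalg.solve(A, np.asarray(s, float) - b)
```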
  • Constraining boundary nodes, ti, to map to nodes, si, on the same boundary, tends to produce unrealistic motion fields in the neighbourhood of the frame boundaries, adversely affecting the ability of the mesh to track true scene motion trajectories. For this reason, the preferred embodiment of the invention does not involve any such constraints. In this case, the source triangles Δ′j will not generally cover frame xk1, and inversion of the affine transformations yields values for (Wk2→k1(xk2))[s] only when s lies within one of the source triangles, Δ′j. For locations s which do not belong to any of the source triangles, Δ′j, any of a number of policies may be described. As a simple example, the nearest source triangle, Δ′j, to s may be found and its affine parameters used to find a location t in frame xk2.
  • An alternative approach is to first extrapolate the mesh to one which is defined over a larger region than that required by the forward motion field Wk1→k2. So long as this region is large enough to cover the source frame, each location s in frame xk1 will belong to some source triangle within the extrapolated mesh and the corresponding affine map can be inverted to find the location t in frame xk2. In the preferred embodiment of this approach, the node vector te−se at each extrapolated node position ne in frame xk2 is obtained by linear extrapolation of two node vectors, tb−sb and to−so, having corresponding node positions nb and no. Here, the extrapolated node position ne is outside the boundaries of frame xk2, nb is the location of the nearest boundary node to ne, and no=2nb−ne is the mirror image of ne through the boundary node, nb. The extrapolated node vectors are not explicitly communicated to the decoder, since it extrapolates them from the available interior node positions, following the same procedure as the encoder.
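  • Since ne, nb and no = 2nb − ne are equally spaced, one natural reading of the linear extrapolation is ve = 2vb − vo. The sketch below assumes this reading, with node vectors held in a hypothetical dictionary keyed by node position.

```python
import numpy as np

def extrapolate_node_vector(n_e, n_b, vectors):
    # n_e: extrapolated node position outside the frame; n_b: nearest
    # boundary node. The mirror node n_o = 2*n_b - n_e lies inside the
    # frame; equal spacing of n_o, n_b, n_e gives v_e = 2*v_b - v_o.
    n_e, n_b = np.asarray(n_e), np.asarray(n_b)
    n_o = 2 * n_b - n_e
    v_b = np.asarray(vectors[tuple(n_b)])
    v_o = np.asarray(vectors[tuple(n_o)])
    return 2 * v_b - v_o
```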
  • “Inversion” of Block-Displacement Motion Models
  • Triangular mesh models are particularly suitable for the recovery of a reverse motion field, Wk2→k1, from its forward counterpart Wk1→k2. Most significantly, the transformation between target locations, t, and source locations, s, is continuous over the whole of the target frame. This is a consequence of the fact that the affine transformation maps straight lines to straight lines.
  • Block displacement models, however, are more popular for video compression due to their relative computational simplicity. A block displacement model consists of a partition of the target frame into blocks, {Bi}, and a corresponding set of displacements, {δi}, identifying the locations of each block within the source frame.
  • Unlike the triangular mesh, block displacement models represent the motion field in a discontinuous (piecewise constant) manner. As a result, they may not properly be inverted. Nevertheless, when reciprocal pairs of motion maps, Wk1→k2 and Wk2→k1, use block displacement models, it is still preferable to estimate and transmit only one of the two motion fields to the decoder, inferring the other through an approximate inverse relationship. Since displacements are usually small, it is often sufficient simply to reverse the sign of the displacement vectors, {δi}, when forming Wk2→k1 from Wk1→k2 or vice-versa.
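  • The sign-reversal approximation just described amounts to one line per block, sketched here with block displacements held in a hypothetical dictionary keyed by block index.

```python
def approx_invert(displacements):
    # Approximate inverse of a block-displacement field W_{k1->k2}:
    # negate every block's vector to form W_{k2->k1}. Exact only for
    # globally constant motion, but adequate for the small
    # displacements typical between adjacent frames.
    return {blk: (-dx, -dy) for blk, (dx, dy) in displacements.items()}
```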
  • 2nd Aspect: Compositing of Simple Motion Fields
  • For high energy compaction and low temporal resolution frames with high fidelity, it is essential to have accurate motion mappings for each level of a multi-resolution temporal subband decomposition. The transform consists of a sequence of stages, each of which produces a low- and a high-pass temporal subband sequence, from its input sequence. Each stage in the temporal decomposition is applied to the low-pass subband sequence produced by the previous stage.
  • Since each stage of the temporal decomposition involves the same steps, one might consider applying an identical estimation strategy within each stage, estimating the relevant motion fields from the frame sequence which appears at the input to that stage. The problem with such a strategy is that estimation of the true motion, based on subband frames, may be hampered by the existence of unwanted artefacts such as ghosting. Such artefacts can arise as a result of model failure or poor motion estimation in previous stages of the decomposition.
  • To avoid this difficulty, it is preferred to perform motion estimation on the appropriate original frames instead of the input frames to the decomposition stage in question. For example, in the second stage of temporal decomposition it is more effective to estimate the motion mapping Wk1→k2 (1) between subband frames lk1 (1)[n] and lk2 (1)[n], by using the corresponding original frames x2k1[n] and x2k2[n]. Similarly, in the third stage, it is more effective to estimate the motion mapping Wk1→k2 (2) between subband frames lk1 (2)[n] and lk2 (2)[n], by using the corresponding original frames x4k1[n] and x4k2[n]. To clarify the notation being used here, it is noted that the first stage of decomposition employs motion mappings Wk1→k2 (0), producing low and high-pass subband frames, lk (1)[n] and hk (1)[n].
  • After several levels of subband decomposition, the temporal displacement over which motion estimation must be performed will span many original frames. For example, in the fifth level of decomposition the actual temporal displacement between neighbouring subband frames is 16 times the original frame displacement. At a typical frame rate of 30 frames per second (fps), this corresponds to more than half a second of video.
  • Motion estimation is generally very difficult over large temporal displacements due to the large possible range of motion. This complexity can be reduced by using knowledge of motion mappings already obtained in previous levels of the decomposition. For example, as described by equations (3) and (4), the first stage of decomposition with the 5/3 kernel involves estimation of W2k→2k+1 (0) and W2k+2→2k+1 (0). These may be composited to form an initial approximation for Wk→k+1 (1), which is required for the second stage of decomposition. This is shown in FIG. 4, where the arrows indicate the direction of the motion mapping. It is often computationally simpler to create composite mappings from source mappings that have the same temporal orientation, as suggested in the figure. If necessary, the source mappings can be inverted to achieve this. However, it is preferable to directly estimate source mappings, having the same direction as the composite mapping.
  • The initial approximation, formed by motion field composition in the manner described above, can be refined based on original video data, using motion estimation procedures well known to those skilled in the art. It turns out, however, that the method of compositing motion fields with a frame displacement of 1 to produce motion fields corresponding to larger frame displacements often produces highly accurate motion mappings that do not need any refinement. In some cases the composite mappings lead to superior motion adaptive transforms compared with motion mappings formed by direct estimation, or with the aid of refinement steps. The motion field composition method described here can be repeated throughout the temporal decomposition hierarchy, so that all the mappings for the entire transform can be derived from the frame-to-frame motion fields estimated in the first stage.
  • The composition method described above eliminates a significant portion of the computational load associated with direct estimation of the required motion fields. A total of one motion mapping must be estimated for each original frame, having a temporal displacement of only one frame. This is sufficient to determine the complete set of motion mappings for the entire transform.
  • This method is independent of the particular wavelet kernel on which the lifting framework is based; however, the effectiveness of the composition procedure does depend on the selected motion model. An efficient method for performing the composition procedure is described in the 4th aspect of this invention.
  • 3rd Aspect: Efficient Temporally Scalable Motion Representation
  • An efficient temporally scalable motion representation should satisfy two requirements. Firstly, at most one motion mapping per video frame should be needed to reconstruct the video at any temporal resolution. This is consistent with the above observation that just one mapping per frame is sufficient to derive all mappings for the entire transform.
  • Secondly, the above property should apply at each temporal resolution available from the transform. In particular, this means that the motion information must be temporally embedded, with each successively higher temporal resolution requiring one extra motion mapping per pair of reconstructed video frames. This property allows the video content to be reconstructed at each available temporal resolution, without recourse to redundant motion information.
  • This aspect of the invention involves a temporally scalable motion information hierarchy, based on the method of motion field composition, as introduced in the description of the second aspect. This representation achieves both of the objectives mentioned above.
  • The motion information hierarchy described here is particularly important for motion adaptive lifting structures that are based on kernels longer than the simple Haar. Block transforms such as the Haar require only the motion information between every second pair of consecutive frames, at each stage of the decomposition. Therefore an efficient temporally scalable motion representation can be easily achieved by transmitting a single motion mapping for every reciprocal pair.
  • It is generally preferable to use longer wavelet kernels such as the 5/3. In fact, results given in A. Secker and D. Taubman, “Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression”, (accepted to appear in IEEE Trans. Image Proc., 2003) reveal that this can lead to considerable improvements in performance.
  • The motion representation for two stages of the 5/3 transform is given in FIG. 5. The mappings required to perform the lifting steps are again shown as arrows, where the ith forward mapping in the jth transform level is denoted Fi j. The term “forward mapping” is applied to those which approximate a current frame by warping a previous frame. Likewise, backward mappings, denoted Bi j, correspond to warping a later frame to spatially align it with a current frame. Observe that the entire set of motion mappings depicted in FIG. 5 can be represented using only F1 2 and B2 1. Inverting F1 2 produces the backward mapping B1 2. The forward mapping F1 1 is inferred by compositing the upper-level forward mapping F1 2 with the lower-level backward mapping B2 1. The remaining mappings B1 1 and F2 1 are recovered by inverting F1 1 and B2 1, respectively.
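  • The recovery of the full mapping set from F1 2 and B2 1 can be traced with a toy sketch. Pure translations stand in for the mesh mappings (an illustrative assumption: for translations, inversion is negation and composition is vector addition), and the numeric values are arbitrary.

```python
def invert(v):
    # Translational stand-in for inverting a mesh mapping.
    return (-v[0], -v[1])

def compose(a, b):
    # Stand-in for compositing two mappings; for translations the
    # composite displacement is simply the sum of the two vectors.
    return (a[0] + b[0], a[1] + b[1])

# Signalled mappings: one per reconstructible frame at each level.
F1_2 = (3, 1)    # level-2 forward mapping F_1^2
B2_1 = (-1, 0)   # level-1 backward mapping B_2^1

B1_2 = invert(F1_2)           # reciprocal of F_1^2
F1_1 = compose(F1_2, B2_1)    # F_1^1 inferred from the two signalled fields
B1_1 = invert(F1_1)           # reciprocal of F_1^1
F2_1 = invert(B2_1)           # reciprocal of B_2^1
```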
  • For scenes with rapid motion, composited fields such as F1 1 in FIG. 5 may suffer from an accumulation of the model failure regions present in the individual mappings. If so, the compressor may correct this by transmitting an optional refinement field, possibly based on direct estimation using original data.
  • As mentioned, the case for the Haar wavelet is much simpler. Mappings F2 1 and B2 1 are not required, so it is sufficient to code mappings F1 1 and F1 2, recovering the corresponding backward motion fields by inversion. The methods described above can be applied recursively to any number of transform stages, and the total number of required mappings is upper bounded by one per original frame. Temporal scalability is achieved since reversing a subset of the temporal decomposition stages requires no motion information from higher resolution levels.
  • Evidently, a motion mapping between any pair of frames can be obtained by a combination of composition and inversion operators involving the sequence of coded mappings F_1^j and B_2^j. It follows that this motion representation strategy is easily modified to encompass any wavelet kernel.
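The composition and inversion operations above can be illustrated with a small sketch in which each motion mapping is a single global 2-D affine transform, in the spirit of the per-triangle affine maps used by the mesh models in this document. The numeric mapping values, the names (F1_2, B2_1, and so on), and the composition order are illustrative assumptions, not values taken from the invention.

```python
# An affine motion mapping x' = A x + t, stored as (a, b, c, d, tx, ty)
# where A = [[a, b], [c, d]] and t = (tx, ty).  All values are hypothetical.

def apply(m, p):
    a, b, c, d, tx, ty = m
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)

def compose(outer, inner):
    """Composite mapping: apply `inner` first, then `outer`."""
    a1, b1, c1, d1, tx1, ty1 = outer
    a2, b2, c2, d2, tx2, ty2 = inner
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            a1 * tx2 + b1 * ty2 + tx1,
            c1 * tx2 + d1 * ty2 + ty1)

def invert(m):
    """Backward mapping obtained by inverting a forward mapping."""
    a, b, c, d, tx, ty = m
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return (ia, ib, ic, id_, -(ia * tx + ib * ty), -(ic * tx + id_ * ty))

# Hypothetical coded mappings for a two-level 5/3 transform:
F1_2 = (1.0, 0.0, 0.0, 1.0, 3.0, -1.0)     # upper-level forward mapping
B2_1 = (1.0, 0.02, -0.01, 1.0, 0.5, 2.0)   # lower-level backward mapping

B1_2 = invert(F1_2)          # recovered by inversion
F1_1 = compose(F1_2, B2_1)   # composited mapping (composition order is illustrative)
B1_1, F2_1 = invert(F1_1), invert(B2_1)

# Warping a point forward and then backward recovers the original position.
p = (10.0, 20.0)
q = apply(B1_2, apply(F1_2, p))
assert abs(q[0] - p[0]) < 1e-9 and abs(q[1] - p[1]) < 1e-9
```

In practice each mapping is a mesh of such affine transforms, one per triangle, but the algebra of compositing and inverting is the same.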
  • 4th Aspect: Efficient Implementation of Motion Field Transformations
  • A 4th aspect of the present invention describes an efficient method for performing the motion field composition and inversion transformations mentioned in previous aspects.
  • One possible way to represent a composited mapping is in terms of a sequence of warpings through each individual mapping. Motion compensation could be performed by warping the actual data through each mapping in turn. However, this approach suffers from the accumulation of spatial aliasing and other distortions that typically accompany each warping step.
  • A second problem with this approach is that errors due to boundary approximations also accumulate over the sequence of mappings. Boundary regions are prone to model failure, particularly when the scene undergoes global motion such as camera panning.
  • To avoid these problems, each location in the target frame of the composite motion field may be mapped back through the various individual mappings to find its location in the source frame of the composite motion field.
  • The preferred method, described here, however, is to construct a triangular mesh model for the composite motion field, deducing the displacements of the mesh node points by projecting them through the various component motion mappings. The triangular mesh model provides a continuous interpolation of the projected node positions and can be represented compactly in internal memory buffers. This method is particularly advantageous when used in conjunction with triangular mesh models for all of the individual motion mappings, since the frame warping machinery required to perform the motion adaptive temporal transformation involves only one type of operation: the affine transformation described previously.
  • Motion field inversion may be performed using a similar strategy. The inverted motion mapping is represented using a forward triangular mesh motion model, whose node displacements are first found by tracing them through the inverse motion field. The accuracy associated with both composite and inverse motion field representations may be adjusted by modifying the size of the triangular mesh grid. In the preferred embodiment of the invention, the mesh node spacing used for representing composite and inverse motion fields is no larger than 8 frame pixels and no smaller than 4 frame pixels.
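A minimal sketch of the node-projection strategy follows: composite mesh node displacements are found by chaining each node position through the component mappings, so no sample data is ever warped repeatedly. The grid extent, the component mappings and all names are hypothetical; the 8-pixel node spacing is taken from the preferred 4-to-8-pixel range stated above.

```python
# Sketch: build a composite motion field by projecting triangular-mesh node
# points through each component mapping in turn, rather than warping sample
# data through each mapping.  Mapping representation and names are illustrative.

def project_nodes(nodes, mappings):
    """Chain each node position through the component mappings in turn."""
    out = []
    for p in nodes:
        for m in mappings:
            p = m(p)
        out.append(p)
    return out

# Node grid with 8-pixel spacing (the preferred spacing is 4 to 8 pixels).
nodes = [(x, y) for y in range(0, 33, 8) for x in range(0, 33, 8)]

shift = lambda p: (p[0] + 1.5, p[1] - 0.5)    # hypothetical component mappings
scale = lambda p: (1.01 * p[0], 0.99 * p[1])

composite = project_nodes(nodes, [shift, scale])
# The displacements at the nodes define the composite mesh; positions between
# nodes are interpolated affinely over each triangle.
```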
  • 5th Aspect: Successive Refinement of Motion and Sample Accuracy
  • In order to provide for scalable video bit-streams which span a wide range of bit-rates, from a few 10's of kilo-bits/s (kb/s) to 10's of mega-bits/s (Mb/s), the accuracy with which motion information is represented must also be scaled. Otherwise, the cost of coding motion information would consume an undue proportion (all or more) of the overall bit budget at low bit-rates and would be insufficient to provide significant coding gain at high bit-rates. In the 3rd aspect above, a method for providing temporally scalable motion information has been described. In this 5th aspect, a method is described for further scaling the cost of motion information, in a manner which is sensitive to both the accuracy and the spatial resolution required of the reconstructed video sequence.
  • During compression, an accurate motion representation is determined and used to adapt the various lifting steps in the motion adaptive transform. During decompression, however, it is not necessary to receive exactly the same motion parameters which were used during compression. The motion parameters are encoded using an embedded quantisation and coding strategy. Such strategies are now well known to those skilled in the art, being employed in scalable image and video codecs such as those described in J. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients", IEEE Trans. Sig. Proc., vol. 41, pp. 3445-3462, December 1993; D. Taubman and A. Zakhor, "Multi-rate 3-D subband coding of video", IEEE Trans. Image Proc., vol. 3, pp. 572-588, September 1994; A. Said and W. Pearlman, "A new, fast and efficient image codec based on set partitioning in hierarchical trees", IEEE Trans. Circ. Syst. for Video Tech., pp. 243-250, June 1996; and D. Taubman, "High performance scalable image compression with EBCOT", IEEE Trans. Image Proc., vol. 9, pp. 1158-1170, July 2000. They allow the coded bit-stream to provide a successively more accurate representation of the information being coded. For the present purposes, this information consists of the motion parameters themselves, and each motion field, W_{k1→k2}, is provided with its own embedded bit-stream.
  • As an example of the way in which such an embedded motion representation may be used, consider an interactive client-server application, in which the client requests the video at some particular spatial resolution and temporal resolution (frame rate). Based on this information, the server determines the distortion which will be introduced by approximating the relevant motion information with only L_q^(M) bits from the respective embedded bit-streams, where the available values for L_q^(M) are determined by the particular embedded quantisation and coding strategy which has been used. Let D_q^(M) denote this distortion, measured in terms of Mean Squared Error (MSE), or a visually weighted MSE. The values D_q^(M) may be estimated from the spatial-frequency power spectrum of the relevant frames. Most notably, D_q^(M) depends not only on the accuracy with which the motion parameters are represented by the L_q^(M) bits of embedded motion information, but also on the spatial resolution of interest. At lower spatial resolutions, less accuracy is required for the motion information, since the magnitude of the phase shifts associated with motion error is directly proportional to spatial frequency.
  • Continuing the example above, the server would also estimate or know the distortion, D_p^(S), associated with the first L_p^(S) bits of the embedded representation generated during scalable coding of the sample values produced by the motion adaptive transform. As already noted, scalable sample data compression schemes are well known to those skilled in the art. Assuming an additive model for these two different distortion contributions, the server balances the amount of information delivered for the motion and sample data components, following the usual Lagrangian policy. Specifically, given a total budget of Lmax bits for both components, deduced from estimates of the network transport rate, or by any other means, the server finds the largest values of p_λ and q_λ such that
    −ΔD_{p_λ}^(S) / ΔL_{p_λ}^(S) ≥ λ and −ΔD_{q_λ}^(M) / ΔL_{q_λ}^(M) ≥ λ    (5)
    adjusting λ > 0 so that L_{p_λ}^(S) + L_{q_λ}^(M) is as large as possible, while not exceeding Lmax. Here, ΔD_p^(S)/ΔL_p^(S) and ΔD_q^(M)/ΔL_q^(M) are discrete approximations to the distortion-length slope at the embedded truncation points p (for sample data) and q (for motion data) respectively.
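The allocation in equation (5) can be sketched as a greedy merge of the two embedded bit-streams in decreasing distortion-length slope order, which is equivalent to the Lagrangian policy when the truncation points lie on their convex hulls. All truncation points below are hypothetical numbers, not measured data.

```python
# Sketch: spend a budget of Lmax bits across two embedded bit-streams
# (sample data and motion data) by always taking the next increment with
# the steepest distortion-length slope.  Truncation points are hypothetical
# (length, distortion) pairs and are assumed to be on their convex hull.

def allocate(sample_pts, motion_pts, budget):
    """Return truncation indices (p, q) whose total length fits the budget."""
    def slopes(pts):
        return [(pts[i - 1][1] - pts[i][1]) / (pts[i][0] - pts[i - 1][0])
                for i in range(1, len(pts))]
    ssl, msl = slopes(sample_pts), slopes(motion_pts)
    p = q = 0
    spent = sample_pts[0][0] + motion_pts[0][0]
    while True:
        cands = []
        if p < len(ssl):
            cands.append((ssl[p], 'S', sample_pts[p + 1][0] - sample_pts[p][0]))
        if q < len(msl):
            cands.append((msl[q], 'M', motion_pts[q + 1][0] - motion_pts[q][0]))
        cands = [c for c in cands if spent + c[2] <= budget]
        if not cands:
            return p, q
        best = max(cands)          # steepest remaining slope first
        spent += best[2]
        if best[1] == 'S':
            p += 1
        else:
            q += 1

# Hypothetical embedded truncation points: (length in bits, distortion).
sample = [(0, 100.0), (1000, 40.0), (2500, 20.0), (5000, 12.0)]
motion = [(0, 30.0), (400, 14.0), (1200, 8.0), (3000, 6.0)]
p, q = allocate(sample, motion, budget=4000)
```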
  • The client-server application described above is only an example. Similar techniques may be used to construct scalable compressed video files which contain an embedded hierarchy of progressively higher quality video, each level in the hierarchy having its own balance between the amount of information contributed from the embedded motion representation and the embedded sample data representation.
  • The strategy described above, whereby an embedded motion representation is produced by embedded quantisation and coding of the individual motion parameters, may be extended to include progressive refinement according to the density of the motion parameters themselves. To see how this works, suppose that every second row and every second column were dropped from the rectangular grid of node positions in the triangular mesh of FIG. 3. In this coarse mesh, motion parameters would be sent only for the remaining node positions and the coarse triangular mesh model induced by this information would represent a coarse approximation to the original motion model. Such approximations are readily included within an embedded motion representation, from which an appropriate distribution between the motion and sample data coding costs may again be formed.
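The coarse-mesh approximation can be sketched in two lines: keeping every second row and every second column of the node grid roughly quarters the number of transmitted motion parameters. The grid extent and spacing below are illustrative.

```python
# Sketch: a coarser mesh for embedded motion coding, obtained by keeping
# every second row and column of the node grid (extent and spacing are
# hypothetical, not taken from FIG. 3).
full_nodes = [[(x, y) for x in range(0, 33, 8)] for y in range(0, 33, 8)]
coarse_nodes = [row[::2] for row in full_nodes[::2]]
# A 5x5 grid of nodes becomes a 3x3 grid; motion parameters are sent first
# for the coarse nodes only, and the remaining nodes refine the mesh later.
```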
  • While rate-distortion optimisation strategies have previously been described in the literature for balancing the costs of motion and sample data information, this has not previously been done in a scalable setting, where both the motion and the sample data accuracy are progressively refined together. There are, in our opinion, two principal reasons why progressively refined motion fields have not been investigated in the past for video compression. Firstly, most existing video compression systems (e.g., those described by international standards) employ motion compensated predictive coding, so if the decoder were to use different motion parameters from the encoder, their states would progressively drift apart. This problem does not exist in the context of motion adaptive wavelet transforms and, in particular, those based on the motion compensated lifting paradigm taught in WO02/50772.
  • The second reason why we believe others have not investigated progressively refinable motion for scalable video coding is that the motion information interacts in a complex manner with the video sample data, making it more difficult to deduce the impact of motion quantization on system performance. The invention disclosed here, however, is inspired by the following interesting observation. Although the interaction between motion errors and video sample data errors is generally complex, at all experimentally optimal combinations of the motion and sample data accuracy, this relationship simplifies and may be approximately modeled using linear methods. In the ensuing sub-sections, we teach some specific methods for scalable motion coding and for optimally balancing the distribution of motion information with video sample information.
  • Scalable Motion Coding Methods
  • As mentioned above, a variety of methods for embedded coding of data are well known to those skilled in the art. Amongst these various methods, the authors' experimental investigations have suggested particular preferred embodiments. Rather than coding the motion parameters directly, it is preferable to first subject the motion parameter fields to a spatial discrete wavelet transform (DWT). That is, the horizontal components of each motion vector are treated as a two dimensional image and the vertical components are similarly treated as a two dimensional image; each image is subjected to a spatial DWT and the transform coefficients are then coded in place of the original motion vectors.
  • The use of a spatial wavelet transform is found to offer two chief benefits over coding the motion parameters directly. Firstly, the transform typically produces a large number of near-zero valued coefficients which can be quantized to zero with negligible error and then efficiently encoded. Secondly, the DWT shapes the quantization errors incurred when the motion representation is scaled, and this shaping is found to significantly reduce the reconstructed video distortion incurred at any given level of motion quantization error. In the preferred embodiment, a reversible (integer-to-integer) spatial DWT is used to allow exact recovery of the originally estimated motion parameters from the encoded transform coefficients, which is useful at high video bit-rates. Reversible wavelet transforms are well-known to those skilled in the art. One example is the reversible 5/3 spatial DWT which forms part of the JPEG2000 image compression standard, IS 15444-1.
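As a concrete sketch of a reversible 5/3 lifting transform applied to one row of a motion-component image, the following uses integer lifting steps with symmetric boundary extension. The function names are ours, not from any standard API, and Python's `>>` supplies the floor behaviour the integer steps require; the exact round trip is the property that matters for lossless recovery of the motion parameters at high bit-rates.

```python
# A minimal sketch of the reversible (integer-to-integer) 5/3 lifting DWT,
# one level, applied to an even-length 1-D row of motion components.

def fwd53(x):
    """Forward reversible 5/3 transform with symmetric boundary extension."""
    n = len(x)
    xr = lambda i: x[i] if i < n else x[n - 2]          # right-edge extension
    h = [x[2*i + 1] - ((x[2*i] + xr(2*i + 2)) >> 1) for i in range(n // 2)]
    hl = lambda i: h[i] if i >= 0 else h[0]             # left-edge extension
    l = [x[2*i] + ((hl(i - 1) + h[i] + 2) >> 2) for i in range(n // 2)]
    return l, h

def inv53(l, h):
    """Exact inverse: recovers the original integer samples."""
    hl = lambda i: h[i] if i >= 0 else h[0]
    even = [l[i] - ((hl(i - 1) + h[i] + 2) >> 2) for i in range(len(l))]
    er = lambda i: even[i] if i < len(even) else even[-1]
    x = []
    for i in range(len(l)):
        x.append(even[i])
        x.append(h[i] + ((even[i] + er(i + 1)) >> 1))
    return x

# Horizontal motion components along one mesh row (hypothetical values):
row = [4, 5, 6, 8, 9, 9, 7, 4]
low, high = fwd53(row)
assert inv53(low, high) == row      # perfectly reversible
```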
  • Temporal transformation of the motion parameter information can have similar benefits to spatial transformation, and the effects are found to be somewhat complementary. That is, the use of both a spatial DWT and a temporal transform together is recommended. In one particular embodiment, each pair of temporally adjacent motion fields is replaced by the sum and the difference of the corresponding motion vectors. These sums and differences may be interpreted as temporal low- and high-pass subbands.
  • Again, it is preferable to do this in a reversible manner which is compatible with efficient lossless coding, since at high video bit-rates it is best to preserve all of the estimated motion information. For this reason, the operations of sum and difference mentioned above should be replaced by the S-transform [V. Heer and H. E. Reinfelder, “A comparison of reversible methods for data compression”, Proc. SPIE conference, ‘Medical Imaging IV’, vol 1233, pp. 354-365, 1990].
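A sketch of the reversible sum/difference pairing follows; the particular S-transform variant below is one of several equivalent integer formulations, chosen here so that the integer round trip is exact, and the function names are ours.

```python
# Reversible S-transform pairing of temporally adjacent motion components:
# the low-pass output approximates the average of the pair, the high-pass
# output is their difference, and all arithmetic stays in integers.

def s_fwd(a, b):
    h = a - b
    l = b + (h >> 1)          # equals floor((a + b) / 2)
    return l, h

def s_inv(l, h):
    b = l - (h >> 1)
    return b + h, b           # recovers (a, b) exactly

# Exhaustive check of exact invertibility over a small range:
for a in range(-4, 5):
    for b in range(-4, 5):
        assert s_inv(*s_fwd(a, b)) == (a, b)
```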
  • As for the coding of motion transform coefficients, the preferred embodiments are those which use techniques derived from the general class of bit-plane coders. In particular, the highly efficient and finely embedded fractional bit-plane coding techniques which form part of the JPEG2000 image compression standard are to be recommended. In general, each subband produced by the motion parameter transform is partitioned into code-blocks, and each code-block is encoded using a fractional bit-plane coder, producing a separate finely embedded bit-stream for each motion subband code-block.
  • In many cases, there are insufficient motion parameters to justify dividing motion subbands into multiple code-blocks, but the code-block partitioning principles enshrined in the JPEG2000 standard can be useful when compressing very large video frames, each of which has a large number of motion vectors. In general, then, the motion information is represented by a collection of code-blocks, each of which has a finely embedded bit-stream which may be truncated to any of a variety of coded lengths.
  • A Layered Framework for Joint Scaling of Motion and Video Sample Data
  • The EBCOT algorithm [D. Taubman, "High performance scalable image compression with EBCOT", IEEE Trans. Image Proc., vol. 9, pp. 1158-1170, July 2000] represents an excellent framework for converting a large number of embedded code-block bit-streams, each with its own set of truncation points, into a global collection of abstract "quality" layers. Each quality layer contains incremental contributions from each code-block's embedded bit-stream, where these contributions are balanced in a manner which minimises the distortion associated with the overall representation at the total bit-rate associated with the quality layers. By arranging the quality layers in sequence, one obtains a succession of truncation points, at each of which the representation is as accurate as it can be, relative to the size of the included quality layers.
  • Although the interaction between motion errors and video sample errors is non-trivial, it turns out that for combinations of motion and video sample bit-rates which are optimal, the relationship between motion error and reconstructed video quality is approximately linear. We may represent this linear relationship as
    D_{x,M} ≈ Ψ_{R,S} · D_M
  • where D_M denotes the mean squared error in the motion vectors due to truncation of the embedded motion parameter code-block bit-streams, and D_{x,M} represents the total induced squared error in the reconstructed video sequence. The scaling factor, Ψ_{R,S}, depends upon the spatial resolution at which the video signal is to be reconstructed and also upon the accuracy with which the video samples are represented. In preferred embodiments of the present invention, motion parameter quality layers are constructed from the embedded motion block bit-streams, following the EBCOT paradigm.
  • In view of the above relationship, and noting that the scaling factor, Ψ_{R,S}, is substantially similar for all motion coefficient subbands and code-blocks, the rate-distortion optimality of the layered motion representation holds over a wide range of spatial resolutions and levels of video sample quantization error. This is extremely convenient, since it means that the rate-distortion optimization problem expressed in equation (5) can be solved once, while constructing the motion quality layers, after which a video server or transcoder need only decide how many motion layers are to be included in the video bit-stream for a given spatial resolution and a given level of error in the video sample data.
  • In preferred embodiments of the invention, the same layering strategy of EBCOT is used to construct a separate set of rate-distortion optimal quality layers for the video sample data. These are obtained by subjecting the temporal subbands produced by the motion-compensated temporal lifting steps to a spatial wavelet transform, partitioning the spatio-temporal video subbands into their own code-blocks, and generating embedded bit-streams for each video sample code-block. The video sample quality layers then consist of incremental contributions from the various video sample code-blocks, such that the video sample distortion is as small as it can be, relative to the total size of those quality layers. It turns out, most conveniently, that the generation of rate-distortion optimal video sample quality layers is substantially independent of the spatial resolution (number of resolution levels from the spatial video sample DWT which will be sent to the decoder) and the temporal resolution (number of temporal subbands produced by the motion compensated lifting steps which will be sent to the decoder). It also turns out that the optimality of the layer boundaries is approximately independent of the level of motion distortion, at least for combinations of motion and video sample bit-rates which are approximately optimal.
  • In summary, preferred embodiments of the invention produce a single set of motion quality layers and a single set of video sample quality layers. The layers are internally rate-distortion optimal over the temporal interval within which they are formed. Since video streams can have unbounded duration, we divide the time scale into epochs known as "frame slots". In each frame slot, a separate set of motion quality layers and video sample quality layers is formed. The optimization problem associated with equation (5) then reduces to that of balancing the number of motion quality layers with the number of video sample quality layers which are sent to a decoder within each frame slot. The solution to this problem is dealt with below, but we note that it depends on the parameter Ψ_{R,S}, which is a function of both the spatial resolution of interest to the decoder and the accuracy of the video sample data. Equivalently, for any particular number of video sample layers, p, the number of motion layers, q, which balances the rate-distortion slopes of the motion and video sample information is a function of both p and the spatial resolution of interest.
  • Methods for Optimizing the Distribution of Motion and Video Sample Data
  • In view of the preceding discussion, a complete implementation of the preferred embodiment of the invention must provide a means for deciding how many motion quality layers, q, are to be included with a subset of the video bit-stream which includes p video sample quality layers, given the spatial resolution, R, at which the video content is to be viewed. The preferred way to do this is to include a collection of tables with each frame slot, there being one table per spatial resolution which may be of interest, where each table provides an entry for each number of video sample quality layers, p, identifying the corresponding best number of motion layers, q_p. Depending upon the application, there may be no need to send the table itself to a decoder.
  • A video server or transcoder, needing to meet a compressed length constraint Lmax within each frame slot, can use these tables to determine p and q_p which are jointly optimal, such that the total length of the respective quality layers is as small as possible, but no smaller than Lmax. It is then preferable to discard data from the p-th video sample quality layer, until the length target Lmax is satisfied. This approach is preferable to that of discarding motion data, since there is generally more video sample data. One way to build the aforementioned tables is simply to decompress the video at each spatial resolution, using each combination of motion and sample quality layers, q and p, so as to find, for each p, the value of q_p which minimises the reconstructed distortion for the associated total bit-rate in each frame slot. Of course, this can be computationally expensive. Nevertheless, this brute-force search strategy is computationally feasible.
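The table-driven selection can be sketched as follows: given hypothetical quality-layer lengths and a table mapping p to q_p for one spatial resolution, the server picks the smallest combination whose total length is at least Lmax, then trims the excess from the last video sample quality layer, as described above. All table entries and lengths are invented for illustration.

```python
# Sketch of a server using the per-frame-slot tables.  For one spatial
# resolution, `table` maps the number of sample layers p to the best number
# of motion layers q_p.  A feasible (total >= budget) entry is assumed to exist.

def choose(table, s_len, m_len, budget):
    """Pick (p, q_p) with the smallest total length that is >= budget,
    then trim the excess from the p-th (last) sample quality layer."""
    best = None
    for p, qp in table.items():
        total = sum(s_len[:p]) + sum(m_len[:qp])
        if total >= budget and (best is None or total < best[0]):
            best = (total, p, qp)
    total, p, qp = best
    trim = total - budget      # bits discarded from the p-th sample layer
    return p, qp, trim

table = {1: 1, 2: 1, 3: 2, 4: 2}      # hypothetical p -> q_p table
s_len = [300, 250, 220, 200]          # video sample quality-layer lengths
m_len = [60, 40]                      # motion quality-layer lengths
p, qp, trim = choose(table, s_len, m_len, budget=600)
```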
  • A preferred means to build the aforementioned tables is to use the fact that these tables depend only on the linear scaling factors, Ψ_{R,S}. These scaling factors depend, in turn, on the power spectra of the video frames which are reconstructed at each level of video sample error, i.e., at each video sample quality layer p. In the preferred embodiment of the invention, these power spectra are estimated directly from the video sample subband data during the compression process. We find, in practice, that such estimation strategies can produce results almost as good as the brute force search method described above, at a fraction of the computational cost.

Claims (4)

1. A method for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples, said method comprising the steps of:
(a) producing an embedded bit-stream, representing each motion field in coarse to fine fashion; and
(b) interleaving incremental contributions from said embedded motion fields with incremental contributions from said transformed video samples.
2-32. (canceled)
33. A system for incrementally coding and signalling motion information for a video compression system involving a motion adaptive transform and embedded coding of transformed video samples, said system comprising:
(a) means for producing an embedded bit-stream, representing each motion field in coarse to fine fashion; and
(b) means for interleaving incremental contributions from said embedded motion fields with incremental contributions from said transformed video samples.
34. A system for estimating and signalling motion information for a motion adaptive transform based on temporal lifting steps, said system comprising:
(a) means for estimating and signalling motion parameters describing a first mapping from a source frame onto a target frame within one of the lifting steps; and
(b) means for inferring a second mapping between either said source frame or said target frame, and another frame, based on the estimated and signalled motion parameters associated with said first mapping.
US10/528,965 2002-09-20 2003-09-19 Method of signalling motion information for efficient scalable video compression Abandoned US20060239345A1 (en)

Priority Applications (1)

US 13/421,788, filed 2012-03-15: continuation of this application, granted as US 10,205,951 B2.

Applications Claiming Priority (3)

AU 2002951574, filed 2002-09-20; PCT/AU2003/001233, filed 2003-09-19, published as WO 2004/028166 A1; US 10/528,965, the national-stage (371) entry of the PCT application.

Publications (1)

US 2006/0239345 A1, published 2006-10-26.

Family

ID=28047329

AUPR222500A0 (en) * 2000-12-21 2001-01-25 Unisearch Limited Method for efficient scalable compression of video
US6801573B2 (en) * 2000-12-21 2004-10-05 The Ohio State University Method for dynamic 3D wavelet transform for video compression
US7110608B2 (en) * 2001-07-02 2006-09-19 Canon Kabushiki Kaisha Digital image compression
US7627040B2 (en) * 2003-06-10 2009-12-01 Rensselaer Polytechnic Institute (Rpi) Method for processing I-blocks used with motion compensated temporal filtering

Cited By (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050264560A1 (en) * 2004-04-02 2005-12-01 David Hartkop Method for formating images for angle-specific viewing in a scanning aperture display device
US7573491B2 (en) * 2004-04-02 2009-08-11 David Hartkop Method for formatting images for angle-specific viewing in a scanning aperture display device
US20060008162A1 (en) * 2004-07-09 2006-01-12 Liang-Gee Chen Pre-compression rate-distortion optimization method for JPEG 2000
US7450771B2 (en) * 2004-07-09 2008-11-11 National Taiwan University Pre-compression rate-distortion optimization method for JPEG 2000
US20090285306A1 (en) * 2004-10-15 2009-11-19 Universita Degli Studi Di Brescia Scalable Video Coding Method
US8233526B2 (en) * 2004-10-15 2012-07-31 Universita Degli Studi Di Brescia Scalable video coding method
US20080240251A1 (en) * 2004-11-19 2008-10-02 Patrick Gioia Method For the Encoding of Wavelet-Encoded Images With Bit Rate Control, Corresponding Encoding Device and Computer Program
US8300693B2 (en) * 2005-01-18 2012-10-30 Ecole Polytechnique Federale De Lausanne Image transform for video coding
US20060159179A1 (en) * 2005-01-18 2006-07-20 Markus Flierl Image transform for video coding
US8755440B2 (en) * 2005-09-27 2014-06-17 Qualcomm Incorporated Interpolation techniques in wavelet transform multimedia coding
US20070201755A1 (en) * 2005-09-27 2007-08-30 Peisong Chen Interpolation techniques in wavelet transform multimedia coding
US8325798B1 (en) * 2005-12-15 2012-12-04 Maxim Integrated Products, Inc. Adaptive motion estimation cache organization
US20130094570A1 (en) * 2005-12-15 2013-04-18 Maxim Integrated Products, Inc. Adaptive Motion Estimation Cache Organization
US9414062B2 (en) * 2005-12-15 2016-08-09 Geo Semiconductor Inc. Adaptive motion estimation cache organization
US8325796B2 (en) 2008-09-11 2012-12-04 Google Inc. System and method for video coding using adaptive segmentation
US8326075B2 (en) 2008-09-11 2012-12-04 Google Inc. System and method for video encoding using adaptive loop filter
US20100061455A1 (en) * 2008-09-11 2010-03-11 On2 Technologies Inc. System and method for decoding using parallel processing
USRE49727E1 (en) 2008-09-11 2023-11-14 Google Llc System and method for decoding using parallel processing
US8311111B2 (en) 2008-09-11 2012-11-13 Google Inc. System and method for decoding using parallel processing
US9924161B2 (en) 2008-09-11 2018-03-20 Google Llc System and method for video coding using adaptive segmentation
US9357223B2 (en) 2008-09-11 2016-05-31 Google Inc. System and method for decoding using parallel processing
US8897591B2 (en) 2008-09-11 2014-11-25 Google Inc. Method and apparatus for video coding using adaptive loop filter
US20160316009A1 (en) * 2008-12-31 2016-10-27 Google Technology Holdings LLC Device and method for receiving scalable content from multiple sources having different content quality
US9813707B2 (en) 2010-01-22 2017-11-07 Thomson Licensing Dtv Data pruning for video compression using example-based super-resolution
US9602814B2 (en) 2010-01-22 2017-03-21 Thomson Licensing Methods and apparatus for sampling-based super resolution video encoding and decoding
CN103141092A (en) * 2010-09-10 2013-06-05 汤姆逊许可公司 Methods and apparatus for encoding video signals using motion compensated example-based super-resolution for video compression
US9338477B2 (en) 2010-09-10 2016-05-10 Thomson Licensing Recovering a pruned version of a picture in a video sequence for example-based data pruning using intra-frame patch similarity
US9544598B2 (en) 2010-09-10 2017-01-10 Thomson Licensing Methods and apparatus for pruning decision optimization in example-based data pruning compression
US8780971B1 (en) 2011-04-07 2014-07-15 Google, Inc. System and method of encoding using selectable loop filters
US8780996B2 (en) 2011-04-07 2014-07-15 Google, Inc. System and method for encoding and decoding video data
US8781004B1 (en) 2011-04-07 2014-07-15 Google Inc. System and method for encoding video using variable loop filter
US9154799B2 (en) 2011-04-07 2015-10-06 Google Inc. Encoding and decoding motion via image segmentation
US9554132B2 (en) 2011-05-31 2017-01-24 Dolby Laboratories Licensing Corporation Video compression implementing resolution tradeoffs and optimization
US8885706B2 (en) 2011-09-16 2014-11-11 Google Inc. Apparatus and methodology for a video codec system with noise reduction capability
US9762931B2 (en) 2011-12-07 2017-09-12 Google Inc. Encoding time management in parallel real-time video encoding
US9262670B2 (en) 2012-02-10 2016-02-16 Google Inc. Adaptive region of interest
US9131073B1 (en) 2012-03-02 2015-09-08 Google Inc. Motion estimation aided noise reduction
US9344729B1 (en) 2012-07-11 2016-05-17 Google Inc. Selective prediction signal filtering
US9367928B2 (en) * 2012-10-05 2016-06-14 Universidade De Coimbra Method for aligning and tracking point regions in images with radial distortion that outputs motion model parameters, distortion calibration, and variation in zoom
US20150254872A1 (en) * 2012-10-05 2015-09-10 Universidade De Coimbra Method for Aligning and Tracking Point Regions in Images with Radial Distortion that Outputs Motion Model Parameters, Distortion Calibration, and Variation in Zoom
US11425395B2 (en) 2013-08-20 2022-08-23 Google Llc Encoding and decoding using tiling
US11722676B2 (en) 2013-08-20 2023-08-08 Google Llc Encoding and decoding using tiling
US9392272B1 (en) 2014-06-02 2016-07-12 Google Inc. Video coding using adaptive source variance based partitioning
US9578324B1 (en) 2014-06-27 2017-02-21 Google Inc. Video coding using statistical-based spatially differentiated partitioning
US10102613B2 (en) 2014-09-25 2018-10-16 Google Llc Frequency-domain denoising
US20170223409A1 (en) * 2014-09-30 2017-08-03 Orange Method and device for adapting the display of a video stream by a client
US10631046B2 (en) * 2014-09-30 2020-04-21 Orange Method and device for adapting the display of a video stream by a client
US10499996B2 (en) 2015-03-26 2019-12-10 Universidade De Coimbra Methods and systems for computer-aided surgery using intra-operative video acquired by a free moving camera
USRE49930E1 (en) 2015-03-26 2024-04-23 Universidade De Coimbra Methods and systems for computer-aided surgery using intra-operative video acquired by a free moving camera
US10504239B2 (en) 2015-04-13 2019-12-10 Universidade De Coimbra Methods and systems for camera characterization in terms of response function, color, and vignetting under non-uniform illumination
US9794574B2 (en) 2016-01-11 2017-10-17 Google Inc. Adaptive tile data size coding for video and image compression
US10542258B2 (en) 2016-01-25 2020-01-21 Google Llc Tile copying for video compression
US11122281B2 (en) * 2016-09-21 2021-09-14 Kakadu R&D Pty Ltd. Base anchored models and inference for the compression and upsampling of video and multiview imagery
AU2017331736B2 (en) * 2016-09-21 2022-10-27 Kakadu R & D Pty Ltd Base anchored models and inference for the compression and upsampling of video and multiview imagery
US11335075B2 (en) 2017-03-14 2022-05-17 Universidade De Coimbra Systems and methods for 3D registration of curves and surfaces using local differential information
US10796499B2 (en) 2017-03-14 2020-10-06 Universidade De Coimbra Systems and methods for 3D registration of curves and surfaces using local differential information
CN111586418A (en) * 2020-05-09 2020-08-25 北京电信易通信息技术股份有限公司 Video compression method and device
CN113077906A (en) * 2021-04-28 2021-07-06 上海德衡数据科技有限公司 Metadata-based medical information acquisition method, system, device and medium
CN113887419A (en) * 2021-09-30 2022-01-04 四川大学 Human behavior identification method and system based on video temporal-spatial information extraction

Also Published As

Publication number Publication date
US10205951B2 (en) 2019-02-12
WO2004028166A1 (en) 2004-04-01
US20120230414A1 (en) 2012-09-13
AU2002951574A0 (en) 2002-10-03
US20170163993A9 (en) 2017-06-08

Similar Documents

Publication Publication Date Title
US10205951B2 (en) Method of signalling motion information for efficient scalable video compression
EP1354297B1 (en) Method and apparatus for scalable compression of video
US7382926B2 (en) Transcoding a JPEG2000 compressed image
Ding et al. Adaptive directional lifting-based wavelet transform for image coding
Christopoulos et al. The JPEG2000 still image coding system: an overview
Skodras et al. The JPEG 2000 still image compression standard
US7565020B2 (en) System and method for image coding employing a hybrid directional prediction and wavelet lifting
Taubman Successive refinement of video: fundamental issues, past efforts, and new directions
Leung et al. Transform and embedded coding techniques for maximum efficiency and random accessibility in 3-D scalable compression
WO2003107683A1 (en) Method and apparatus for scalable compression of video
AU2003260192B2 (en) Method of signalling motion information for efficient scalable video compression
Danyali Highly scalable wavelet image and video coding for transmission over heterogeneous networks
Skodras The JPEG2000 image compression standard in mobile health
Secker Motion-adaptive transforms for highly scalable video compression
Bojkovic et al. Advanced image compression techniques: Comparative functionalities and performances
Rüfenacht et al. Scalable Image and Video Compression
Mei A DWT based perceptual video coding framework: concepts, issues and techniques
Mandal Digital image compression techniques
Brangoulo et al. A new video coder powered with second-generation wavelets
Feideropoulou et al. Model-based quality enhancement of scalable video
Demaude et al. Using interframe correlation in a low-latency and lightweight video codec
Choraś JPEG 2000 Image Coding Standard-A Review and Applications
Ilgin DCT Video Compositing with Embedded Zerotree Coding for Multi-Point Video Conferencing
EP1639828A1 (en) Method for transcoding a jpeg2000 compressed image

Legal Events

Date Code Title Description
AS Assignment

Owner name: UNISEARCH LIMITED, AUSTRALIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TAUBMAN, MR. DAVID;SECKER, MR. ANDREW;REEL/FRAME:017133/0848;SIGNING DATES FROM 20050517 TO 20050906

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION