AU768455B2 - A method for analysing apparent motion in digital video sequences - Google Patents

A method for analysing apparent motion in digital video sequences

Info

Publication number
AU768455B2
AU768455B2 (application AU97159/01A, publication AU9715901A)
Authority
AU
Australia
Prior art keywords
model
image
code
pan
tilt
Prior art date
Legal status
Ceased
Application number
AU97159/01A
Other versions
AU9715901A (en)
Inventor
Julian Frank Andrew Magarey
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Filing date
Publication date
Priority claimed from AUPR2125A external-priority patent/AUPR212500A0/en
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU97159/01A priority Critical patent/AU768455B2/en
Publication of AU9715901A publication Critical patent/AU9715901A/en
Application granted granted Critical
Publication of AU768455B2 publication Critical patent/AU768455B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Landscapes

  • Image Analysis (AREA)

Description

S&F Ref: 575666

AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION FOR A STANDARD PATENT
ORIGINAL

Name and Address of Applicant: Canon Kabushiki Kaisha, 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo 146, Japan
Actual Inventor(s): Julian Frank Andrew Magarey
Address for Service: Spruson & Ferguson, St Martins Tower, Level 31 Market Street, Sydney NSW 2000 (CCN 3710000177)
Invention Title: A Method for Analysing Apparent Motion in Digital Video Sequences
Associated Provisional Application Details: [33] Country: AU; [31] Application No: PR2125; [32] Application Date: 18 Dec 2000

The following statement is a full description of this invention, including the best method of performing it known to me/us:

A METHOD FOR ANALYSING APPARENT MOTION IN DIGITAL VIDEO SEQUENCES

Field of Invention

The current invention relates to the field of video content analysis and, in particular, to the task of analysing the apparent motion undergone between frames of a video sequence in order to estimate camera operation parameters.
Background

The vast volume of data generated by digital video cameras requires intelligent indexing to enable efficient searching and retrieval. Indexing based on the actual content of the video is the most promising avenue for research because it enables search cues which are more naturally meaningful to human users than simple time codes. One example of a content-based indexing cue is the set of camera operation parameters used to record a particular sequence of frames. Camera operations, such as panning, tilting, and zooming are, in principle, measurable quantities when properly defined, and are often useful clues as to the underlying (qualitative) semantic content of a sequence of frames.
For example, certain patterns of camera operation have been used to locate events of interest in sports video footage, such as video footage of goals and shots in a football match. Many other aspects of digital video processing are enabled or enhanced by knowledge of camera operation parameters. These include background mosaicing, de-interlacing, super-resolution, and automatic stabilisation.
In some cameras, operation parameters, and in particular zoom parameters, are available directly from internal mechanical sensors and stored alongside the video in "metadata" fields, making extraction of the operation parameters trivial.
However, in most cameras this metadata is not available at capture time, and even if it is available, it is not stored.
Two distinct scenarios thus emerge for camera operation parameter estimation: in the first, the parameters are estimated at capture time from the raw video data and stored alongside the video; in the second, estimation is performed as a post-processing annotation step on the (possibly compressed) stored video.
Algorithms targeted at the first scenario may also be applied in the second scenario, provided the video is first decompressed.
Most known techniques of camera operation parameter estimation make use of the evolution of projected intensity patterns from frame to frame in the video sequence.
A few of these techniques use each frame independently, relying on blurring patterns caused by sensor motion during the capture of the video sequence. The main distinction is whether the primitives for parameter extraction are the actual spatio-temporal (SPT) intensity derivatives or the motion vector field. Motion vectors, each of which is related to the frame intensity evolution in its corresponding local region, are generally available directly from the compressed video stream. Accordingly, motion vectors afford flexibility to algorithms using them as primitives, since such algorithms can also be applied as post-processors on compressed video. If they are to be applied to a sequence of raw video data, the motion vectors must clearly be estimated first. Even so, the motion vector approach is generally less computationally demanding, while the SPT-derivatives-based approach is potentially more robust because these primitives are richer in information about the underlying scene evolution.
Camera operation with respect to a static scene imposes a global constraint on the motion vectors, expressible as a model-fitting equation. Global model parameters are related to the camera operation via known parameters such as focal length and pixel size.
However, any independently moving ("foreground") objects give rise to apparent motion that does not fit this constraint, and can therefore corrupt global parameter estimation.
Motion-vector-based algorithms may therefore be further categorised by how they handle this common phenomenon.
The most common approach is to use robust statistics, treating independent objects as outliers in a model-fitting error distribution. This requires an iterative strategy which is guaranteed to converge only if the "outliers" represent less than 50% of the total population.
Another known approach is a clustering strategy, where the dominant or global motion is taken to be the largest cluster. In this case the problem is to decide in advance how many cluster centres should be used, a quantity usually unknown a priori.
Moreover, the clustering strategy does not take into account the spatial coherence of motion vectors belonging to the same cluster.
Summary of the Invention

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more problems associated with existing arrangements.

In accordance with one aspect of the present invention there is disclosed a method of identifying camera operation parameters prevailing at an image from a sequence of images, said image being formed by a plurality of pixels, said method comprising the steps of: receiving apparent motion vectors at known pixel coordinates of said image; segmenting said image into segmented regions using a tilt-pan-roll-zoom model; finding a largest segmented region; extracting model parameters of said tilt-pan-roll-zoom model from said largest segmented region; and computing said camera operation parameters from said model parameters of said tilt-pan-roll-zoom model extracted from said largest segmented region.

In accordance with another aspect of the present invention there is disclosed a method of determining an activity measure of motion in an image, said method comprising the steps of: receiving motion vectors at known pixel coordinates of said image; segmenting said image into segmented regions using a tilt-pan-roll-zoom model; determining a model motion vector at said known pixel coordinates of said image; finding a largest segmented region; extracting dominant model parameters from said largest segmented region; determining a dominant model motion vector at said known pixel coordinates, derived from said dominant model parameters; and determining a deviation of said model motion vectors from said dominant model motion vectors over said image.

In accordance with another aspect of the present invention there is disclosed a method of determining an error measure of motion in an image, said method comprising the steps of: receiving motion vectors at known pixel coordinates of said image; segmenting said image into segmented regions using a tilt-pan-roll-zoom model; finding a largest segmented region; determining a model motion vector at said known pixel coordinates over said largest segmented region of said image; and determining a mean angular difference between said motion vectors and said model motion vectors over said largest segmented region.
Other aspects of the present invention are also disclosed.
Brief Description of the Drawings

A number of embodiments of the present invention will now be described with reference to the drawings, in which:

Fig. 1 is a symbolic representation of image formation geometry;
Fig. 2 is a block diagram of a processing algorithm for identifying camera operation parameters prevailing at an image from an image sequence;
Fig. 3 is a block diagram showing the two separate steps of the model-based segmentation step shown in Fig. 2;
Fig. 4 is a block diagram representing the sub-steps in the region merging segmentation step shown in Fig. 3;
Fig. 5 is a plot showing the value of the minimum merging cost as the segmentation algorithm proceeds in a typical case;
Fig. 6 is a schematic block diagram of a general purpose computer upon which the arrangements described can be practiced; and
Figs. 7A and 7B show examples of region adjacency graphs.
Detailed Description of Embodiments of the Invention

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that the above and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "calculating", "determining", "replacing", "generating", "initializing", "outputting", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the registers and memories of the computer system into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general-purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus.
Various general-purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional generalpurpose computer will appear from the description below.
In addition, the present specification also discloses a computer readable medium comprising a computer program for performing the operations of the methods. The computer readable medium is taken herein to include any transmission medium for communicating the computer program between a source and a destination. The transmission medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general-purpose computer. The transmission medium may also include a hard-wired medium such as exemplified in the Internet system, or a wireless medium such as exemplified in the GSM mobile telephone system. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the arrangements described herein.
Problem formulation

Fig. 1 is a graphical representation of the geometry of a dynamic image formation process. A pinhole camera geometry is assumed, in which an image plane 10 is situated a distance f, known as the focal length, from a centre of projection Q. A scene point P has coordinates X = (X, Y, Z)^T, and its image p on the image plane 10 has coordinates x = (x, y)^T given by perspective projection as:

x = (f / Z) [X, Y]^T.    (1)

The image plane 10 is discretely sampled by a sensor with a given sampling interval. The sampling interval is assumed to be equal in both horizontal and vertical directions. Image plane coordinates x and y are measured in sampling intervals.
Equation (1) then holds if the focal length f is also given in units of sampling intervals.
The image plane quantities are usually given as matrices of "pixel values".
The conversion between matrix (pixel) co-ordinates (r, l) and image plane coordinates x is given by

x = l − (L + 1)/2,    y = r − (R + 1)/2,    (2)

with the matrix having a size of R rows by L columns. From this point, image plane co-ordinates will be referred to as pixels.
Consider the case where the scene point P remains fixed while the camera, and therefore the coordinate axes, translates and rotates between frames (i.e. undergoes rigid motion). The new position P' = (X', Y', Z')^T of the scene point with respect to the new coordinate axes is given by:

X' = X + Ωz·Y − Ωy·Z − Tx
Y' = −Ωz·X + Y + Ωx·Z − Ty
Z' = Ωy·X − Ωx·Y + Z − Tz    (3)

where (Tx, Ty, Tz) are the components of translation and (Ωx, Ωy, Ωz) are the Euler angles which characterise the rotation. The Euler angles (Ωx, Ωy, Ωz) are assumed to be small.
The new image plane coordinates x' of the scene point P on the image plane are given by:

x' = (f' / Z') [X', Y']^T,    (4)

where f' is the new focal length resulting from the camera zoom s between the two frames:

f' = (s + 1)·f.    (5)

Combining Equations (1) and (3) to (5), an expression is obtained for the new image coordinates x' of the scene point P:

x' = (s + 1)·[x + Ωz·y − f·Ωy − f·Tx/Z,  −Ωz·x + y + f·Ωx − f·Ty/Z]^T / (1 + Ωy·x/f − Ωx·y/f − Tz/Z).    (6)

This expression may be simplified by clarifying the definition of "small" rotation, which requires of the Euler angles (Ωx, Ωy, Ωz) that:

|Ωy·x| << f    (7)
|Ωx·y| << f    (8)

so that the rotation terms in the denominator of Equation (6) may be neglected. It may be shown using Equation (2) that the expressions in Equations (7) and (8) require that the vertical and horizontal rotational angles Ωx and Ωy are much less than the reciprocal of the vertical and horizontal fields of view respectively, which is a reasonable requirement for video captured with a frame capture frequency of 30 Hz.
The motion vector u at coordinate x is given by the change in position of the image p of the scene point P as:

u(x) = x' − x.    (9)

Using Equations (6) to (9), an expression is obtained for the motion vector u at coordinate x:

u(x) = (s + 1)·[x + Ωz·y − f·Ωy − f·Tx/Z,  −Ωz·x + y + f·Ωx − f·Ty/Z]^T / (1 − Tz/Z) − x.    (10)

This is a point relation only, dependent on the depth Z of the scene point P. For a motion model which is valid over a region, a model for depth Z is needed in terms of image plane coordinates x and y. The chosen model is a plane, which in image plane coordinates x and y may be written as:

1/Z = α + β·x + γ·y,    (11)

where (α, β, γ) are the parameters of the plane.
A strategy for recovering the camera operation parameters (Tx, Ty, Tz, fx, Qfy, Qf2z, s) now suggests itself. The input is a field of motion vectors u(x) at known image plane coordinates x. If the motion vector field u(x) can somehow be partitioned or segmented into regions each of which adheres to the model of Equation and it is deemed that one of the regions Oi, say the largest, is the static background having no 575666.doc -11intrinsic motion of its own, then the camera operation parameters Ty, Tz, Q Q 2 s) can be recovered from the region model parameters pi.
There are two problems with this strategy. Firstly, it can not be certain that the largest projective model region is indeed static. Even if the scene consists mainly of background with only a few independently moving foreground objects, as is commonly assumed, the background may consist of many small planar facets, each of which has its own projective model upon translation T, while a foreground object could be planar and larger in size than each of the background regions individually. The independent motion of the foreground object will then be wrongly taken as the camera motion.
Secondly, even if the background is correctly identified, the recovery of the camera operation parameters (Tx, Ty, Tz, Q, Q2y, s) from the region model parameters pi, which involves the inversion of complicated interlinked non-linear relations, is extremely difficult while the plane parameters 13, y) are also unknown. This problem is related to the whole "structure from uncalibrated motion" research area which has been the subject of a great deal of activity over the years. Suggested solutions are usually iterative, causing such solutions to be time-consuming, and extremely sensitive to small S* estimation errors.
o,.
S"To simplify the problem, an additional assumption is made, namely that the camera does not translate, but only rotates and zooms. The camera operation parameters .ooooi are therefore limited to the three components of rotation Qy, Q 2 which are respectively tilt, pan, and roll, and a further component, zoom s. This is known as the Tilt, Pan, Roll, Zoom (TPRZ) model of camera operation. Using this TPRZ model, the projective model for apparent motion vector u of Equation (12) becomes: 575666.doc -12- [sx+(s+1)Q -f(s+l1)Qy] sy f(s x (13) This model is now linear in its unknown camera operation parameters (2x, Qy, Qz, and may be written as: u(x) A(x)a (14) with A(x) encapsulating the nature of the model: A(x) x y 1 0 ly -x01 1 and a is a model parameter vector s (S 1)J, a= s (16) (S I)Jx (s+1)lxn Note that the operation parameters (Qx, Q2y, Q, and therefore also the model S* 10 parameter vector a, are independent of the scene depth Z. Segmentation of the motion vector field u(x) into spatially coherent regions Oj, each of which has its own distinct region model parameter vector ai using this four-parameter model, should therefore, if camera operation fits the TPRZ model, return the background as a single region Obg regardless of its topography. Assuming the scene consists mainly of static background, this will be by far the largest region. It is then a simple matter to recover the zoom s and rotation components (Qx, 2y, Oz) from the region model parameter vector a, of the largest region by inverting Equation (16).
Summary ofproposed method Fig. 2 shows a processing algorithm 100 for identifying camera operation parameters prevailing at an image from an image sequence, which may be implemented 575666.doc -13in a programmable device such as the digital computer shown in Fig. 6. One set of inputs to the algorithm are estimated motion vectors u(xB), where x, Bx (17) B is the spacing in pixels of the motion estimate field U(XB). For a full-density motion field, the spacing B 1, but often the spacing B 1. From this point, it is assumed that a full density motion field is used and the subscript B will be omitted.
Each motion vector estimate u(x) is subject to a random error e(x) which is assumed to be drawn from a zero-mean normal (Gaussian) distribution with covariance A(x): e(x) (18) A symmetric 2 by 2 confidence matrix which is the inverse of the covariance A(x) at each pixel x, is an additional input to the algorithm 100. It is noted that this formulation allows distinct, and possibly mutually dependent estimation error components e(x).
15 While a motion vector u(x) is a bare summary of the frame intensity evolution in a local region, the confidence matrix C(x) enriches that information by describing the \relative reliabilities of each component of the motion vector Correlation-based or error-minimisation motion estimation techniques may be easily supplemented to provide such confidence matrices which are simply the curvature matrices of the correlation or error surfaces. The use of confidence matrices C(x) lends the algorithm 100 some of the robustness of full SPT-derivative-based algorithms.
If the confidence measures A(x) are not available, as in the case of compressed video, the algorithm 100 may still be used with all confidence matrices C(x) set to the identity matrix.
575666.doc 14- The algorithm 100 proceeds to step 110 where the estimated motion vectors u(x) and confidence matrices C(x) are segmented or partitioned into disjoint regions Oj. Each region Oj adheres to the 4-parameter (linear) TPRZ model of Equation (14).
The chosen segmentation strategy is a region-merging algorithm described below in detail. Its outputs are a segmentation label map and a model parameter vector a for each region Oj. The label map S(x) is simply a matrix whose entries are a region label j assigned to the corresponding pixel location x. A lookup table associated with the label map S(x) lists the 4 components in the region model parameter vector ia that corresponds to each region Oj.
Step 125 calculates a model motion field which is effectively a smoothed version of the motion vectors where the smoothing takes place within region boundaries.
e Step 120 of the algorithm 100 processes the label map S(x) and the region model parameter vector a. for the largest region Oj to return the estimated camera operation parameters (Qx, Qy, Qz, In addition, step 130 returns a motion-based activity measure Act calculated from the camera operation parameters (Q x, 2y, s) and the model motion field and step 140 calculates a normalised error measure Err from the motion vectors u(x) and the camera operation parameters (2x, Qy, z, as described below.
The following sections elaborate upon each of these steps 110 to 140.
20 Model-based region-merging segmentation algorithm The assumption underlying the segmentation problem is that each motion vector u(x) is associated with a particular state. The form of the model used to define the states is decided upon in advance. The 2 by n matrix A(x) in Equation (14) encapsulates the nature of the model. In the preferred implementation, the matrix A(x) is given by Equation this corresponds to the TPRZ model Another implementation (n=2) 575666.doc sets matrix A(x) equal to the 2 by 2 identity matrix; this corresponds to the tilt-pan (TP) model in which it is assumed that the roll (Qz) and zoom parameters are both zero.
Each state j is defined by unknown model parameters, which are components of the region model parameter vector ai of length n, with each state being assumed to be valid over an associated connected region of pixels Oj. These regions Oj disjointly cover the entire image, i.e. each pixel x belongs to one, and only one, region Oj. The aim of segmentation is to identify these regions Oj and the model parameter Wj for each region i0.
The pixel lattice is a regular two-dimensional (2D) grid, with each pixel x having four neighbours, namely the pixels directly above, below, and to the left and right of it.
The neighbourhood or adjacency rule for pixels x, known per se in the art, extends to the regions Oi. That is, a region Oi is said to be adjacent to another region O, if some pixel in the first region Oi is a neighbour of any pixel in the second region Oj.
The size of region Oj shall be denoted by nj, which is simply the number of pixels contained in the region Oj. From Equation the model motion field g(x) over each region Oj is assumed to be a linear projection of the region model parameter vector ai for each region O: g(x) di, x Oj (19) The model-based region merging segmentation step 110 shown in Fig. 2 is shown in more detail in Fig. 3. This region merging segmentation step 110 has two substeps. The first sub-step 115 is a full-density model fitting step, in which, from the motion vectors u(x) and confidence matrices C(x) as inputs, a model parameter vector along with a confidence matrix K(x) (defined below), is fitted to each pixel x. In the preferred implementation, a robust least-squares fit is carried out within a small window 575666.doc -16of pixels which slides over the motion field centring on each pixel x in turn. Since at least two measurements are required for a four-parameter model fit, the window must contain two or more pixels x. The preferred implementation uses a four-pixel window.
The second sub-step 116 is the actual region-merging step. Fig. 4 is a schematic flow diagram of the steps making up the region-merging sub-step 116. Initially, the trivial segmentation is applied wherein each pixel x is a separate region From the list of initial regions may be constructed a region adjacency graph (RAG), an example of which is shown in Fig. 7A. Each region 01 to 09 is a node in the RAG, and neighbour relations amongst the regions Oi are represented by graph edges. For example, edges 701, 702 and 703 indicate that region 02 is a neighbour of regions O1, 03 and 05 respectively.
To each edge is assigned a merging cost ti, which will be defined below. Region merging is then carried out as a process of graph simplification. The edge with the minimum merging cost tyj is deleted and the two corresponding nodes, representing two "i regions Oi and O, are merged to form a single region That merging cost t, is oooo 15 recorded as a parameter X. The graph is then updated wherein the nodes representinR the merged regions Oi and Oj are indicated as a single node representing the merged region 9 9 Oij0. The merging costs tyk for all regions Ok adjacent to the merged region O 0 are calculated.
Fig. 7B shows the RAG of Fig. 7A after the mergers of regions O0 and 04 to form region 014, regions 05 and 08 to form region 058, regions 02 and 03, and then later 949999 also region 06 to form region 0236. From the RAG, region 07 now has regions 014 and 058 as neighbours.
The merging process continues until only a single region remains. From a plot of parameter X at each merging step, an optimal halting point Xstop is determined as described below. From this point, one of two procedures may be followed. Either the 575666.doc -17entire process may be recommenced, and halted when the minimum merging cost ty exceeds the determined optimal halting point Xstop; or alternatively, the merging process is reversed, splitting regions Oy until the optimal halting point stop is reached. The former option is more expensive in terms of time, while the latter is more expensive in terms of system memory. In the preferred implementation, the former option is used as the regionmerging sub-step 116 and described below with reference to Fig. 4.
Merging cost definition The merging cost ty for an adjacent region pair Oi and O is based on linear least squares model fitting. First, a model fitting error is defined over a region Oj as a statistical residual quantity I xeoj which may be rewritten as Ej(a) (Uj Hija)K(U Ha) (21) where: H is a 2nj by n matrix composed of the individual A(x) matrices stacked on top of one another as x varies over region Of; Uj is a column vector of length 2 nj composed of the individual motion vectors u(x) stacked on top of one another; Kj is a 2nj by 2nj block diagonal matrix, where each 2 by 2 diagonal block is an individual confidence matrix C(x).
By weighted least squares theory, a minimising model parameter aj for the region Oj is given by the minimising argument for Ej, given by ai Ki/ H K F j (22) 575666.doc 18where Kj is the n by n confidence in the model parameter estimate aj: K =H KjH (23) The corresponding residual is given by Ej U Kj(I 2 -HKj-'HJK)U (24) In the initial full-density model fitting sub-step 115 of the model-based region merging segmentation step 110 shown in Fig. 3, the region O is a small window surrounding each pixel x. A two-stage strategy is employed: first, all pixels x in the window contribute to the model parameter estimate then the estimate is recomputed from the three pixels with the lowest residuals with respect to the first estimate. This is an attempt to handle the case where the window straddles two distinct underlying regions.
Accordingly, the outputs of sub-step 115 are the model parameter vector a(x) and confidences K(x).
When merging two regions Oi and Oj, the "merged" matrix Hij is obtained by concatenating Hi with Hj; likewise for matrices Uij and Ki. These facts may be used to *o* show that the best fitting region model parameter vector a- for the merged region O is given by a ai K-K'j(i -a where the merged confidence KI is simply K K i K (26) The merged residual Ey is given by Ey Ei E +(aaj) Kj i -aj) (27) 575666.doc -19- The merging cost tyj may then be defined as the net increase in model fitting residual E, divided by the length of the common boundary l(6S) of the merged region O-: (ai-aj Ki Kja i-aj) t j (28) It may be shown that this definition of the merging cost tyj leads to a segmentation which is a global compromise between low model fitting error and low model complexity, in terms of the total length of boundaries 1(S6). It thus encourages the formation of regions Oj which are not only uniform in terms of model adherence, but reasonably compact.
If the merge is allowed, then Equations (25) and (26) give the region model parameters ai and confidences Kr of the merged region Oi. Note that under this strategy, only Equations and (28) need to be applied throughout the region "merging process. Only the region model parameters ij and their confidences Kj for each region O are therefore required as segmentation proceeds, and neither the motion vector estimates u(x) nor the model structure itself, i.e the matrix A(x) are required. If a heap structure is used to store the RAG, the complete region merging may be achieved with S. complexity O(V log where V is the number of estimates in the motion field u(x).
Finding the optimal haltingpoint As noted above, the merging cost ty at each merge step is recorded as the parameter X. After all regions Oi have been merged and the whole image forms a single region, a graph of parameter X may be plotted. From this graph, the optimal halting point Xstop may be determined. As noted above, in the preferred implementation the region merging is simply repeated again from the start until the merging cost tyj exceeds the optimal halting point ,stop.
575666.doc To illustrate how the optimal halting point ,tp is determined, it is noted that, as merging proceeds, the merging cost tbof the regions O; and O being merged, and consequently also parameter X, generally increases. This increase however is not purely monotonic. In fact, the overall rise in parameter X is punctuated by departures from monotonicity, which herein are termed local minima. A local minimum represents the collapse of what might be termed a self-supporting group of adjacent regions. Such occurs if one boundary within the group is removed, and the merging costs for adjacent boundaries then suddenly reduce. In effect, the hypothesis that the regions of the group are part of the same object is confirmed as more regions merge and so parameter X decreases. The result is that all the boundaries in the group are removed in quick succession. These self-supporting groups tend to represent the internal structure of objects and background clutter. A figure of merit, such as the number of boundaries removed, their total length, or the maximum (absolute or relative) decrease in parameter X ••may be assigned to each local minimum.
15 The point immediately after a local minimum, being a return to substantial monotonicity, is termed herein a stable configuration. Visually, a stable configuration represents a point in the segmentation process at which an area of internal object structure or background clutter has just been removed, and is thus a good potential halting point.
*VVVV.
Each stable configuration has an associated parameter X. Fig. 5 shows a plot of parameter X during the latter part of a segmentation of a typical real motion field showing local minima and stable configurations. All local minima and stable configurations for the image may be found automatically by analysing the graph of parameter X. Significant local minima, being those whose figure of merit exceeds a certain threshold, are flagged.
In the preferred implementation, the optimal halting point stop is chosen to be the parameter X at the last such stable configuration.
575666.doc -21 Fig. 4 shows the steps forming the region merging segmentation sub-step 116 shown in Fig. 3. A first pass of the region merging segmentation sub-step 116 is used to determine the optimal halting point A subsequent pass halts once the merging cost tij reaches the optimal halting point sop.
The region merging segmentation sub-step 116 starts at step 303 and proceeds to step 304 which receives the model parameter vector a(x) and model confidences K(x) for each pixel x. The region merging segmentation sub-step 116 starts with the trivial segmentation where each pixel x forms its own region Oi. Accordingly, the region model parameter vector field ai- and their confidences K, for each region Oi are initially set as the model parameter vector a(x) and the confidence at each pixel x.
Step 306 then determines all adjacent region pairs Oi and Oj, and computes the merging cost tij according to Equation (28) for each of the adjacent region pairs Oi and Oj.
Step 308 inserts the merging cost tij into a heap T in priority order.
Step 310 takes the first entry from the heap T(1) and merges the corresponding 15 region pair Oi and Oj the region pair Oi and Oj with the lowest merging cost tj) to form a new region O. Step 312 records the merging cost tj in a list L. List L holds the values of the parameter X.
S'.i Step 314 identifies all regions Ok adjoining either of the merged regions Oi and Oj. All the pairs containing either of the merged regions Oi and Oj are removed from the heap T in step 316. Step 318 follows by calculating a new merging cost tijk for each pair of adjacent regions Oyj and Ok. The new merging costs tij.k are inserted into the heap T at the appropriate point(s) in step 320.
During the first complete pass, the region merging segmentation sub-step 116 then proceeds to step 324 where it is determined whether null segmentation has been reached. This is done by determining whether the heap T contains more than zero entries 575666.doc -22and thus the segmentation contains more than one region. If the null segmentation has not been reached, then control returns to step 310. Steps 310 to 324 are repeated until null segmentation is reached.
When all regions have been combined into the null segmentation, step 324 passes control to step 326 which then identifies the merging cost ty, stored in list L, that corresponds to the last stable configuration. This is taken to be the optimal halting point This concludes the first complete pass.
Control returns to step 304 where, starting again with the trivial segmentation where each pixel x forms its own region Oi, the pixels again are merged to form regions in the second pass.
With threshold Xtop determined, step 320 passes control to step 322 to determine if the merging has reached the stopping point. This is done by determining whether the merging cost ty corresponding to the regions Oi and Oj at the top of the heap T (entry has a value greater than the optimal halting point If the merging has reached a stopping point, then the method ends in step 330. Alternatively, control is returned to step 310 and steps 310 to 322 are repeated, merging the two regions with the lowest merging cost ty during every cycle, until the stopping point is reached.
Processing the segmented motionfield As described above, the largest segmented region may be deemed to be the static background under the TPRZ camera model viewing a scene with isolated foreground objects. The region model parameter vector a-bg of the largest region Obg is available in the lookup table pointed to by the label map where pixel x is a member of the largest region Obg. To convert the model parameter vector abg(x) of the largest region Obg to TPRZ model parameters Qy, it is simply a matter of inverting Equation (16) as follows: 575666.doe -23s abg(l) (29) abg(4)(30) D x f(s 1) -abg(3) (31) fy f(s 1) abg (2) z s (32) s+l The activity measure Act which is calculated in step 130 shown in Fig. 2, is defined as a quantitative measure of "activity" or "ambient motion", which may be derived from the size and relative (apparent) speed of the foreground objects in each frame, and used as an indexing field. For example, a pan over a static scene should yield a low activity measure Act, which should suddenly increase when an object starts to move 10 independently. A simple activity measure Act is defined based on the deviation of the model motion vectors g(x) from a background motion model gbg(x). This measure 0 balances apparent size of the foreground object with its relative speed, so that a large, o :O0 00 slow-moving object returns the same activity level as a small, fast-moving one.
To define the activity measure Act, the background model motion field gbg(X) is defined, substituting region model parameter vector abg in Equation (19):
O:
gbg(x) A(x) abg (33) 0 The activity measure Act is then the RMS deviation of the model motion field g(x) from the background motion field gbg(x) as follows: 2 Act gbg (x t (34) where M is the number of estimates in the motion field u(x).
Occasionally the region merging segmentation step 116 proceeds too far and "under-segments" the motion field so that inconsistent regions O, and O are 575666.doc -24erroneously merged. To detect this case, an error measure Err over the background, which can be compared with a fixed threshold, is required. The error measure Err should reflect the deviation between the estimated motion field u(x) and the model motion field g(x) over the background region Obg.
Following Equation an error measure Err'can be defined as a confidenceweighted, mean squared residual: Err' I gbg( (xx) [U bg nbg xeOb where nbg is the number of pixels in region Obg. This may be rewritten as Err' I T(x)u(x) T(x)gbg (X12 (36) nbg xeO,g to where C(x)
T
T(x) (37) That is, the confidence weighting can be achieved by pre-multiplying each vector by the "matrix square root" T(x) of the confidence matrix C(x) at each pixel x, then taking a simple mean squared difference. The matrix square root can be derived from the eigenvalues of C(x).
i The problem with this error measure Err' is that it is not normalised with respect to the magnitude of the background motion. That is, the error measure Err' increases in proportion to the magnitude of the background motion even if the segmentation is perfect.
This is unsatisfactory if the error measure Err' is to be compared with a single absolute 20 threshold.
It is possible to normalise the error measure Err' by dividing by the mean squared confidence-weighted background motion magnitude. However, when the background motion magnitude is near zero, this normalising factor becomes too small and 575666.doc hence the normalised error increases dramatically. A better way of measuring normalised vector differences without excessive amplification of near-zero vectors is needed.
For this purpose the angular difference between two motion vectors u and v, known in the art, is used, which is defined as follows: v) cos-' (38) where u (39) is a three-dimensional extension of the two-component motion vector u. A normalised error measure Err is then defined by analogy with Equation (36) as the mean (confidenceo0 weighted) angular difference between the estimated motion field u(x) and the model motion field g(x) over the background region Obg: Err 1 T(x)gb nbg xEObg S" If the normalised error Err exceeds a certain absolute threshold, the segmentation is deemed to have proceeded too far, thereby merging foreground and background regions. The TPRZ camera operation parameters Qx s) and activity measure Act may then be flagged as unreliable.
To guard against poor motion estimates, as can result for example if the camera moves too fast for the motion estimation technique to handle, this largest region Obg must be greater than a certain proportion of the motion field size. The preferred implementation uses a threshold of 10%. If this does not hold, the normalised error measure Err is set to a large number and the TPRZ camera operation parameters Qy, Qz, s) and activity measure Act are thus flagged as unreliable.
575666.doc -26- Apparatus The processing algorithm 100 described above may be practiced using a programmable device, and are preferably practiced using a conventional general-purpose computer system 400, such as that shown in Fig. 6 wherein the processing algorithm 100 may be implemented as software, such as an application program executing within the computer system 400. In particular, the steps of processing algorithm 100 are effected by instructions in the software that are carried out by the computer. The software may be divided into two separate parts; one part for carrying out the processing algorithm 100 and another part to manage the user interface between the latter and the user. The lo software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for identifying a background motion model of an image in accordance with the embodiments of the invention.
The computer system 400 comprises a computer module 401, input devices such as a keyboard 402 and mouse 403, output devices including a printer 415 and a display device 414. A Modulator-Demodulator (Modem) transceiver device 416 is used by the computer module 401 for communicating to and from a communications network 420, for example connectable via a telephone line 421 or other functional medium. The modem 416 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN).
The computer module 401 typically includes at least one processor unit 405, a memory unit 406, for example formed from semiconductor random access memory
I
575666.doc 27 (RAM) and read only memory (ROM), input/output interfaces including a video interface 407, and an I/O interface 413 for the keyboard402 and mouse403 and optionally a joystick (not illustrated), and an interface 408 for the modem 416. A storage device 409 is provided and typically includes a hard disk drive 410 and a floppy disk drive 411. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 412 is typically provided as a non-volatile source of data. The components 405 to 413 of the computer module 401, typically communicate via an interconnected bus 404 and in a manner which results in a conventional mode of operation of the computer system 400 known to those in the relevant art. Examples of computers on which the 1o embodiments can be practised include IBM-PC's and compatibles, Sun Sparcstations or alike computer systems evolved therefrom.
Typically, the application program of the preferred embodiment is resident on the hard disk drive 410 and read and controlled in its execution by the processor 405.
Intermediate storage of the program and any data fetched from the network 420 may be o* 15 accomplished using the semiconductor memory 406, possibly in concert with the hard disk drive 410. In some instances, the application program may be supplied to the user Sencoded on a CD-ROM or floppy disk and read via the corresponding drive 412 or 411, or alternatively may be read by the user from the network 420 via the modem device 416.
Still further, the software can also be loaded into the computer system 400 from other 0o 20 computer readable medium including magnetic tape, a ROM or integrated circuit, a oooo• magneto-optical disk, a radio or infra-red transmission channel between the computer module 401 and another device, a computer readable card such as a PCMCIA card, and oooo, Sthe Internet and Intranets including e-mail transmissions and information recorded on websites and the like. The foregoing is merely exemplary of relevant computer readable 575666.doc -28mediums. Other computer readable media may be practiced without departing from the scope and spirit of the invention.
The methods described may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions and for example incorporated in a digital video camera 420. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories. As seen, the camera 450 includes a display screen 452 which can be used to display the segmented frames of information regarding then same. In this fashion, a user of the camera may record a video sequence, and using the processing methods described above, create metadata that may be associate with the video sequence to conveniently describe the video sequence, thereby permitting the video sequence to be used or otherwise manipulated with a specific need for a user to view the video sequence.
A connection 448 to the computer module 401 may be utilised to transfer data to and/or S•from the computer module 401 for performing the video tracking process.
•0 000007 7 0Industrial Applicability 0 0, It is apparent from the above that the embodiment(s) of the invention are 7applicable to the video processing industries where video sequences may require 9 0*o cataloguing according to their content.
20 The foregoing describes only one embodiment/some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from 000.
the scope and spirit of the invention, the embodiment(s) being illustrative and not 9 9 restrictive.
In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only 575666.doc 29 of'. Variations of the word comprising, such as "comprise" and "comprises" have corresponding meanings.
575666.doc

Claims (4)

1. A method of identifying camera operation parameters prevailing at an image from a sequence of images, said image being formed by a plurality of pixels, said method comprising the steps of: receiving apparent motion vectors at known pixel coordinates of said image; segmenting said image into segmented regions using a tilt-pan-roll- zoom model; finding a largest segmented region; extracting model parameters of said tilt-pan-roll-zoom model from said largest segmented region; and computing said camera operation parameters from said model parameters of said tilt-pan-roll-zoom model extracted from said largest segmented region.
2. A method as claimed in claim 1, said segmenting step comprising the further steps of: (ba) for each said pixel, fitting said motion vector to said tilt-pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations; and (bb) statistically analysing the sets of model parameters and corresponding confidence representations to derive a merged segmentation of said image. o
3. A method according to claim 2 wherein sub-step (bb) comprises the sub-steps of: (bba) defining said pixels to each be initial regions of said image;
575666.doc -31 (bbb) merging said regions in a statistical order using said sets of model parameters and confidence representations to obtain a null segmentation of said image; (bbc) analysing a curve formed using said model parameters and corresponding confidence representations to determine an optimal halting criterion at which to cease the merging of said regions; and (bbd) repeating said merging of said initial regions to halt when said optimal halting criterion is reached.
4. A method according to claim 3 wherein sub-step (bbc) comprises identifying returns to monotonicity from local minima in said curve and selecting a predetermined said return approaching the null segmentation as said optimal halting criterion. A method according to claim 3 wherein said statistical order is determined by dividing a minimum covariance-normalised vector distance between adjacent regions of said segmentation by a length of a common boundary between adjacent regions, and ordering the resulting quotients. 6. A method according to claim 5 wherein each said quotient forms a test statistic, a record of which is retained at each merging step. V. 7. A method according to any one of claims 2 to 6, said method further comprising an initial step of receiving, for each said motion vector, a corresponding error covariance; and step (ba) fits said motion vector and the corresponding error covariance to said tilt- eoo f pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations. 575666.doc 32 8. A method of determining an activity measure of motion in an image, said method comprising the steps of: receiving motion vectors at known pixel coordinates of said image; segmenting said image into segmented regions using a tilt-pan-roll- zoom model; determining a model motion vector at said known pixel coordinates of said image; finding a largest segmented region; extracting dominant model parameters from said largest segmented region; determining a dominant model motion vector at said known pixel coordinates, derived from said dominant model parameters; and determining a deviation of said model motion vectors from said 15 dominant model motion vectors over said image. S 9. A method as claimed in claim 8, said segmenting step comprising the further steps of: (ba) for each said pixel, fitting said motion vector to said tilt-pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations; and (bb) statistically analysing the sets of model parameters and corresponding confidence representations to derive a merged segmentation of said image. 10. A method according to claim 9 wherein sub-step (bb) comprises the sub-steps of: 575666.doc -33- (bba) defining said pixels to each be initial regions of said image; (bbb) merging said regions in a statistical order using said sets of model parameters and confidence representations to obtain a null segmentation of said image; (bbc) analysing a curve formed using said model parameters and corresponding confidence representations to determine an optimal halting criterion at which to cease the merging of said regions; and (bbd) repeating said merging of said initial regions to halt when said optimal halting criterion is reached. ,o 11. A method according to claim 10 wherein sub-step (bbc) comprises identifying returns to monotonicity from local minima in said curve and selecting a predetermined said return approaching the null segmentation as said optimal halting criterion. 12. A method according to claim 10 wherein said statistical order is determined by dividing a minimum covariance-normalised vector distance between adjacent regions of said segmentation by a length of a common boundary between adjacent regions, and ordering the resulting quotients. o• 13. 
A method of determining an error measure of motion in an image, said method comprising the steps of: receiving motion vectors at known pixel coordinates of said image; segmenting said image into segmented regions using a tilt-pan-roll- zoom model; finding a largest segmented region; 575666.doc 34 determining a model motion vector at said known pixel coordinates over said largest segmented region of said image; and determining a mean angular difference between said motion vectors and said model motion vectors over said largest segmented region. 14. A method as claimed in claim 13 further comprising the step of receiving a confidence measure of said motion vectors, and wherein said mean angular difference is weighted with said confidence measure. 15. An apparatus for identifying camera operation parameters prevailing at an image from a sequence of images, said image being formed by a plurality of pixels, said apparatus comprising: means for receiving apparent motion vectors at known pixel coordinates of said image; means for segmenting said image into segmented regions using a tilt-pan-roll- zoom model; means for finding a largest segmented region; means for extracting model parameters of said tilt-pan-roll-zoom model from said largest segmented region; and means for computing said camera operation parameters from said model parameters of said tilt-pan-roll-zoom model extracted from said largest segmented region. i 16. An apparatus as claimed in claim 15, said means for segmenting further comprising: o 575666.doc means for, for each said pixel, fitting said motion vector to said tilt-pan-roll- zoom model to obtain a set of model parameters and corresponding confidence representations; and means for statistically analysing the sets of model parameters and corresponding confidence representations to derive a merged segmentation of said image. 17. An apparatus as claimed in claim 16, said apparatus further comprising means for receiving, for each said motion vector, a corresponding error covariance; and said means for fitting fits said motion vector and the corresponding error covariance to said lo tilt-pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations. 18. An apparatus for determining an activity measure of motion in an image, said apparatus comprising: means for receiving motion vectors at known pixel coordinates of said image; go means for segmenting said image into segmented regions using a tilt-pan-roll- zoom model; means for determining a model motion vector at said known pixel coordinates of said image; means for finding a largest segmented region; means for extracting dominant model parameters from said largest segmented region; means for determining a dominant model motion vector at said known pixel coordinates, derived from said dominant model parameters; and :coordinates, derived from said dominant model parameters; and o 575666.doc -36- means for determining a deviation of said model motion vectors from said dominant model motion vectors over said image. 19. An apparatus as claimed in claim 18, said means for segmenting further comprising: means for, for each said pixel, fitting said motion vector to said tilt-pan-roll- zoom model to obtain a set of model parameters and corresponding confidence representations; and means for statistically analysing the sets of model parameters and corresponding confidence representations to derive a merged segmentation of said image. 
20. An apparatus for determining an error measure of motion in an image, said apparatus comprising:
means for receiving motion vectors at known pixel coordinates of said image;
means for segmenting said image into segmented regions using a tilt-pan-roll-zoom model;
means for finding a largest segmented region;
means for determining a model motion vector at said known pixel coordinates over said largest segmented region of said image; and
means for determining a mean angular difference between said motion vectors and said model motion vectors over said largest segmented region.

21. An apparatus as claimed in claim 20 further comprising means for receiving a confidence measure of said motion vectors, and wherein said mean angular difference is weighted with said confidence measure.

22. A program stored in a memory medium for identifying camera operation parameters prevailing at an image from a sequence of images, said image being formed by a plurality of pixels, said program comprising:
code for receiving apparent motion vectors at known pixel coordinates of said image;
code for segmenting said image into segmented regions using a tilt-pan-roll-zoom model;
code for finding a largest segmented region;
code for extracting model parameters of said tilt-pan-roll-zoom model from said largest segmented region; and
code for computing said camera operation parameters from said model parameters of said tilt-pan-roll-zoom model extracted from said largest segmented region.

23. A program as claimed in claim 22, said code for segmenting further comprising:
code for, for each said pixel, fitting said motion vector to said tilt-pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations; and
code for statistically analysing the sets of model parameters and corresponding confidence representations to derive a merged segmentation of said image.

24. A program as claimed in claim 23, said program further comprising code for receiving, for each said motion vector, a corresponding error covariance; and said code for fitting fits said motion vector and the corresponding error covariance to said tilt-pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations.

25. A program stored in a memory medium for determining an activity measure of motion in an image, said program comprising:
code for receiving motion vectors at known pixel coordinates of said image;
code for segmenting said image into segmented regions using a tilt-pan-roll-zoom model;
code for determining a model motion vector at said known pixel coordinates of said image;
code for finding a largest segmented region;
code for extracting dominant model parameters from said largest segmented region;
code for determining a dominant model motion vector at said known pixel coordinates, derived from said dominant model parameters; and
code for determining a deviation of said model motion vectors from said dominant model motion vectors over said image.

26. A program as claimed in claim 25, said code for segmenting further comprising:
code for, for each said pixel, fitting said motion vector to said tilt-pan-roll-zoom model to obtain a set of model parameters and corresponding confidence representations; and
code for statistically analysing the sets of model parameters and corresponding confidence representations to derive a merged segmentation of said image.
27. A program stored in a memory medium for determining an error measure of motion in an image, said program comprising:
code for receiving motion vectors at known pixel coordinates of said image;
code for segmenting said image into segmented regions using a tilt-pan-roll-zoom model;
code for finding a largest segmented region;
code for determining a model motion vector at said known pixel coordinates over said largest segmented region of said image; and
code for determining a mean angular difference between said motion vectors and said model motion vectors over said largest segmented region.

28. A program as claimed in claim 27 further comprising code for receiving a confidence measure of said motion vectors, and wherein said mean angular difference is weighted with said confidence measure.

29. A method of identifying a background motion model of an image, said method being substantially as described herein with reference to the drawings.

30. Apparatus for identifying a background motion model of an image, said apparatus being substantially as described herein with reference to the drawings.

Dated this 15th day of OCTOBER 2003
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
SPRUSON & FERGUSON
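For readers approaching the claims from an implementation standpoint, the sketches below illustrate some of the computations the claims recite. They are minimal Python sketches under stated assumptions, not the implementation disclosed in the specification. The first assumes one common four-parameter form of a tilt-pan-roll-zoom model (pan and tilt as translation, zoom as isotropic scaling and roll as rotation about the coordinate origin) and a plain least-squares fit; the function names and the covariance-free fitting are illustrative assumptions, and the covariance-weighted fitting of claims 7, 17 and 24 is not reproduced here.

    import numpy as np

    def fit_tilt_pan_roll_zoom(coords, vectors):
        # coords:  (N, 2) pixel coordinates (x, y), assumed centred on the image
        #          centre so that zoom and roll act about the optical axis.
        # vectors: (N, 2) apparent motion vectors (dx, dy) at those pixels.
        # Assumed model (one common 4-parameter form, not necessarily the one
        # defined in the specification):
        #   dx = pan  + zoom * x - roll * y
        #   dy = tilt + roll * x + zoom * y
        coords = np.asarray(coords, dtype=float)
        vectors = np.asarray(vectors, dtype=float)
        x, y = coords[:, 0], coords[:, 1]
        ones, zeros = np.ones_like(x), np.zeros_like(x)
        a_dx = np.stack([ones, zeros, x, -y], axis=1)   # rows for the dx equations
        a_dy = np.stack([zeros, ones, y, x], axis=1)    # rows for the dy equations
        A = np.concatenate([a_dx, a_dy], axis=0)
        b = np.concatenate([vectors[:, 0], vectors[:, 1]])
        params, *_ = np.linalg.lstsq(A, b, rcond=None)  # [pan, tilt, zoom, roll]
        return params

    def model_motion_vectors(coords, params):
        # Evaluate the assumed model at the given pixel coordinates.
        coords = np.asarray(coords, dtype=float)
        pan, tilt, zoom, roll = params
        x, y = coords[:, 0], coords[:, 1]
        return np.stack([pan + zoom * x - roll * y,
                         tilt + roll * x + zoom * y], axis=1)

Fitting a model of this kind over the largest segmented region, as in claims 15 and 22, would yield the dominant parameters from which camera operation parameters can then be computed.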
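Claims 5, 6 and 12 order candidate region merges by a quotient: a minimum covariance-normalised vector distance between adjacent regions divided by the length of their common boundary, with each quotient retained as a test statistic. The following is a sketch of one plausible form of that statistic, assuming a pooled-covariance (Mahalanobis-style) normalisation of the model-parameter difference; the exact distance used in the specification may differ.

    import numpy as np

    def merge_statistic(params_a, cov_a, params_b, cov_b, boundary_length):
        # params_*: 4-vectors (pan, tilt, zoom, roll) fitted to two adjacent regions.
        # cov_*:    4x4 error covariances of those parameter estimates.
        # boundary_length: number of pixel edges shared by the two regions.
        diff = np.asarray(params_a, dtype=float) - np.asarray(params_b, dtype=float)
        pooled = np.asarray(cov_a, dtype=float) + np.asarray(cov_b, dtype=float)
        pooled = pooled + 1e-12 * np.eye(len(diff))  # guard against singularity
        # Covariance-normalised squared distance between the parameter vectors.
        dist = float(diff @ np.linalg.solve(pooled, diff))
        return dist / float(boundary_length)

Merging would then proceed by repeatedly fusing the adjacent pair with the smallest statistic and recording the value at each step, producing the curve whose local minima and returns to monotonicity drive the halting criterion of claims 4, 10 and 11.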
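Claims 8, 18 and 25 define an activity measure as the deviation, over the whole image, of the per-region model motion vectors from the dominant model motion vectors derived from the largest segmented region. The sketch below assumes the deviation is summarised as the mean Euclidean distance between the two model motion fields and reuses the four-parameter model form assumed above; both choices are illustrative rather than the specification's definition.

    import numpy as np

    def _model_vectors(coords, params):
        # Same assumed 4-parameter tilt-pan-roll-zoom model as in the first sketch.
        coords = np.asarray(coords, dtype=float)
        pan, tilt, zoom, roll = params
        x, y = coords[:, 0], coords[:, 1]
        return np.stack([pan + zoom * x - roll * y,
                         tilt + roll * x + zoom * y], axis=1)

    def activity_measure(coords, labels, region_params, dominant_params):
        # coords:          (N, 2) pixel coordinates at which motion vectors are known.
        # labels:          (N,) region label assigned to each pixel by the segmentation.
        # region_params:   dict mapping region label -> (pan, tilt, zoom, roll).
        # dominant_params: (pan, tilt, zoom, roll) of the largest segmented region.
        coords = np.asarray(coords, dtype=float)
        labels = np.asarray(labels)
        model_field = np.zeros((coords.shape[0], 2))
        for label, params in region_params.items():
            sel = labels == label
            model_field[sel] = _model_vectors(coords[sel], params)
        dominant_field = _model_vectors(coords, dominant_params)
        # Mean magnitude of the difference between the two model motion fields.
        return float(np.linalg.norm(model_field - dominant_field, axis=1).mean())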
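Claims 13, 14, 20, 21, 27 and 28 define an error measure as the mean angular difference between the received motion vectors and the model motion vectors over the largest segmented region, optionally weighted by a confidence measure. A minimal sketch, assuming the angle is taken in radians and that near-zero vectors are guarded by a small epsilon:

    import numpy as np

    def mean_angular_difference(measured, modelled, confidence=None, eps=1e-8):
        # measured, modelled: (N, 2) motion vectors and model motion vectors at the
        #                     pixels of the largest segmented region.
        # confidence:         optional (N,) weights; uniform weighting if None.
        measured = np.asarray(measured, dtype=float)
        modelled = np.asarray(modelled, dtype=float)
        cos_angle = np.sum(measured * modelled, axis=1) / (
            np.linalg.norm(measured, axis=1) * np.linalg.norm(modelled, axis=1) + eps)
        angles = np.arccos(np.clip(cos_angle, -1.0, 1.0))  # radians in [0, pi]
        if confidence is None:
            return float(angles.mean())
        confidence = np.asarray(confidence, dtype=float)
        return float(np.sum(confidence * angles) / (np.sum(confidence) + eps))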
AU97159/01A 2000-12-18 2001-12-10 A method for analysing apparent motion in digital video sequences Ceased AU768455B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU97159/01A AU768455B2 (en) 2000-12-18 2001-12-10 A method for analysing apparent motion in digital video sequences

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AUPR2125 2000-12-18
AUPR2125A AUPR212500A0 (en) 2000-12-18 2000-12-18 A method for analysing apparent motion in digital video sequences
AU97159/01A AU768455B2 (en) 2000-12-18 2001-12-10 A method for analysing apparent motion in digital video sequences

Publications (2)

Publication Number Publication Date
AU9715901A (en) 2002-06-20
AU768455B2 true AU768455B2 (en) 2003-12-11

Family

ID=25641805

Family Applications (1)

Application Number Title Priority Date Filing Date
AU97159/01A Ceased AU768455B2 (en) 2000-12-18 2001-12-10 A method for analysing apparent motion in digital video sequences

Country Status (1)

Country Link
AU (1) AU768455B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3111419A1 (en) * 2014-02-27 2017-01-04 Thomson Licensing Method and apparatus for determining an orientation of a video

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5844613A (en) * 1997-03-17 1998-12-01 Microsoft Corporation Global motion estimator for motion video signal encoding
EP1117251A1 (en) * 1999-12-29 2001-07-18 Eastman Kodak Company Automated stabilization method for digital image sequences


Also Published As

Publication number Publication date
AU9715901A (en) 2002-06-20

Similar Documents

Publication Publication Date Title
CN111462200B (en) Cross-video pedestrian positioning and tracking method, system and equipment
Ventura et al. Global localization from monocular slam on a mobile phone
EP2715667B1 (en) Planar mapping and tracking for mobile devices
US9661228B1 (en) Robust image feature based video stabilization and smoothing
Mur-Artal et al. ORB-SLAM: a versatile and accurate monocular SLAM system
US9135514B2 (en) Real time tracking/detection of multiple targets
US9761002B2 (en) Stereo-motion method of three-dimensional (3-D) structure information extraction from a video for fusion with 3-D point cloud data
CN111311684B (en) Method and equipment for initializing SLAM
US9635251B2 (en) Visual tracking using panoramas on mobile devices
US8254760B2 (en) Pixel analysis and frame alignment for background frames
Kim et al. Fisheye lens camera based surveillance system for wide field of view monitoring
EP3206163B1 (en) Image processing method, mobile device and method for generating a video image database
US20150199572A1 (en) Object tracking using occluding contours
CN111161347A (en) Method and equipment for initializing SLAM
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
AU2014277855A1 (en) Method, system and apparatus for processing an image
EP2237227A1 (en) Video sequence processing method and system
JP2002163647A (en) Device and method for calculating lens distortion coefficient and computer readable recording medium recording lens distortion coefficient calculation program
US9135715B1 (en) Local feature cameras for structure from motion (SFM) problems with generalized cameras
AU768455B2 (en) A method for analysing apparent motion in digital video sequences
Barhoumi et al. An online approach for multi-sprite generation based on camera parameters estimation
Herling et al. Random model variation for universal feature tracking
Rao et al. A heterogeneous feature-based image alignment method
Li et al. Vision-based indoor localization via a visual SLAM approach
CN117876427A (en) Text tracking method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
DA3 Amendments made section 104

Free format text: THE NATURE OF THE AMENDMENT IS: SUBSTITUTE PATENT REQUEST REGARDING ASSOCIATED DETAILS

FGA Letters patent sealed or granted (standard patent)