AU769487B2

AU769487B2 - A method for video region tracking through three-dimensional space-time segmentation

Info

Publication number: AU769487B2
Application number: AU97460/01A
Authority: AU
Inventors: Julian Frank Andrew Magarey; Brian John Parker
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-12-28
Filing date: 2001-12-24
Publication date: 2004-01-29
Anticipated expiration: 2021-12-24
Also published as: AU9746001A

Description

S&FRef: 582100

AUSTRALIA

PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT

ORIGINAL

Name and Address of Applicant: Actual Inventor(s): Address for Service: Canon Kabushiki Kaisha 30-2, Shimomaruko 3-chome, Ohta-ku Tokyo 146 Japan Brian John Parker, Julian Frank Andrew Magarey Spruson Ferguson St Martins Tower,Level 31 Market Street Sydney NSW 2000 (CCN 3710000177) o• o ooo, *go Invention Title: A Method for Video Region Tracking Through Three-dimensional Space-time Segmentation ASSOCIATED PROVISIONAL APPLICATION DETAILS [33] Country [31] Applic. No(s) AU PR2333 [32] Application Date 28 Dec 2000 The following statement is a full description of this invention, including the best method of performing it known to me/us:- 5815c riur~ w* -1- A METHOD FOR VIDEO REGION TRACKING THROUGH THREE- DIMENSIONAL SPACE-TIME SEGMENTATION Field of Invention The present invention relates to automatic scene analysis of a video signal and, in particular, to segmentation and tracking of regions in a video feed.

Background Video may be defined as a sequence of images. Each image is digitally 1o represented by image data in the form of a two-dimensional array of samples (known as pixels) of some measurable quantity. Examples of directly measurable image data include luminance and chrominance of reflected light (from optical cameras), range or distance from some reference point to the imaged points (from active range sensors), or density (from tomographic scanners).

Many quantities can be derived from the raw image data. Such quantities may be referred to as metadata, this being data that is used to describe other data. Examples of such "metadata" quantities include range from passive optical range sensors, and motion from multiple images of dynamic scenes.

Image segmentation is the decomposition of an image into homogeneous entities called regions. Because human expectation is that real world objects are in some sense compact and coherent, each segment of the partitioned image consists of a region of adjacent pixels over which some property of the data (image data, metadata, or both) is uniform. Image segmentation is an important process for many subsequent imageprocessing tasks.

However, in the processing of video for such tasks as metadata generation, not only are the significant two-dimensional regions in each image (or frame) required, but 582100.doc -2these regions also need to be tracked through time. This is known as the correspondence problem between frames.

Traditionally, the problem of segmentation and object tracking in a video sequence has been approached by segmenting separately each single frame in the sequence, followed by a region-matching step in order to relate every region in each frame with a region in following frames. Algorithms often used in the region-matching step include correlation, feature tracking, or explicitly estimating and modelling the motion parameters of regions between frames. These algorithms are complex and make it difficult to generalise other two-dimensional image processing results to the field of video processing.

Another approach is to treat the video sequence as a three-dimensional signal, with the three dimensions being two spatial dimensions plus time, and then to use threedimensional segmentation. As video data typically has a very long (potentially infinite) time axis, a limitation with this approach is the size of memory and processing power required to store and process such a large quantity of data. Further, the segmentation can only commence once all the frames are available. Accordingly, this approach is only suitable for post-processing, making it unsuitable for processing of video feeds.

To partially overcome these limitations, the sequence may be split into a number of smaller three-dimensional blocks, each having a given number of frames. Each of these smaller blocks is then segmented separately. However, as the video sequence is split into smaller blocks, this approach is unsuitable for object tracking purposes, because the regions belonging to the different blocks still need to be matched in a subsequent step.

Very few of the known techniques for two-dimensional segmentation are easily extendable to three dimensions. Unseeded variational region merging is an efficient 582100.doc .grf* V -3algorithm for segmentation which trades off region homogeneity ,With region compactness.

With automatic segmentation by variational region merging, a difficulty often experienced is deciding when to halt the merging process. Some implementations have required a predetermined "schedule" of thresholds to govern the merging process and converge to the segmentation which minimises a cost functional. Others have removed the need for a schedule, but still require an arbitrary threshold. The use of a predetermined arbitrary threshold means the segmentation algorithm is unable to adapt to different types of images in the video feed without substantial operator intervention.

Summary of the Invention It is an object of the present invention to substantially overcome or at least ameliorate one or more problems associated with existing arrangements.

In accordance with one aspect of the present invention there is disclosed a method for segmenting a sequence of two-dimensional frames, each frame being formed by a plurality of pixels, each said pixel being described by a vector having components each relating to a different measured image characteristic, said method comprising the steps of: forming a three-dimensional block of said pixels from the first predetermined number of said frames on said sequence of frames; segmenting said three-dimensional block of pixels using a threedimensional region-merging segmentor to form three-dimensional regions in a segmented block of pixels; concatenating a next frame in said sequence of frames to said segmented three-dimensional block ofpixels; 582100.doc -4outputting a segmented two-dimensional frame from said segmented three-dimensional block of pixels; and repeating steps to for each subsequent frame of said sequence.

Other aspects of the present invention are also disclosed.

Brief Description of the Drawings An embodiment of the present invention will now be described with reference to the drawings in which: Fig. 1 is a block diagram showing the formation of a three-dimensional block of data from a sequence of image frames; Fig. 2 is a schematic block diagram representing processing steps of a method of video tracking; Fig. 3 is a block diagram showing the data flow during segmentation; Fig. 4 is a graphical representation of a main data structure used in the three- dimensional segmentation; Fig. 5 is a graphical representation of a two-dimensional array used in the three- dimensional segmentation; Fig. 6A is a plot of the value of the merging cost as the algorithm proceeds in a typical case; Fig. 6B is a plot similar to Fig. 6A but simplified and shown over an entire segmentation; Fig. 7 is a schematic block diagram representing the segmentation processing steps used in the method of video tracking shown in Fig. 2; and Fig. 8A shows an example three-dimensional region which alwa'ys results in a single segment in each of the frames; 582100.doc Fig. 8B shows an example three-dimensional region which results in disconnected segments in some of the frames; Figs. 8C and 8D show an allowable configuration and a non-allowable configuration for merging respectively; Fig. 9 is a schematic block diagram representation of a computer system in which arrangements described may be implemented.

Detailed Description Some portions of the description which follows are explicitly or implicitly to presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that the above and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "calculating", "determining", "replacing", "generating", "initializing", "outputting", or the like, refer to the action and processes of a computer 582100.doc -6system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the registers and memories of the computer system into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

s The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general-purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general- .0 purpose computer will appear from the description below.

In addition, the present specification also discloses a computer readable medium 4444 4 Is comprising a computer program for performing the operations of the methods. The computer readable medium is taken herein to include any transmission medium for communicating the computer program between a source and a designation. The transmission medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general-purpose computer. The transmission medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the arrangements described herein.

582100.doc -7- Fig. 1 shows a sequence of two-dimensional frames 10z in a video feed. A pixel in the z-th frame 10z has a position where x is a spatial position vector within the z-th frame 10z. Each pixel has a component or value f(x,z) as a vector of measurements, which typically includes colour values from an image sensor, and are typically represented in the RGB or LUV colour space. The pixel components f(x,z) may also contain range data or motion data, or some combination thereof. Hereafter, let im denote the number of components in the pixel componentf(x,z).

Also referring to Fig. 2 where a method 200 of video region tracking is shown, in step 202 a three-dimensional block of pixels is formed, with two of the dimensions 0o being spatial, and the other one being related to time. The three-dimensional block is defined by a sliding window 20, which includes N frames from the sequence of frames. Initially, frames 101 to 10N are included in the block.

S

Accordingly, the block of pixels is a three-dimensional grid, with each interior pixel (xz) having six direct neighbours, those being pixels directly above, below, to the left and right, and in the preceding and following frames 10z-_ and 10z+1 of the pixel (x,z) in question. Pixels at the periphery of the block each have five neighbours. Following step 202, method 200 proceeds to step 204 where the threedimensional block of pixels is segmented into a set of three-dimensional connected regions so that every pixel (xz) in the block is related to one region Oi in which all pixels (xz) belonging to the same region Oi have pixel components f(x,z) that are homogeneous in some sense. A connected region Oi is one in which each pixel in the region Oi can be reached from every other pixel (xz) in that same region Oi via a neighbourhood relation. Such a requirement forbids disjoint regions The details of the segmentation follow below.

582100.doc Once the block of pixels is fully segmented into regions Oi, the sliding window is advanced by one frame in time in step 206. This has the effect of merging a next frame 10z into the newly formed block, while removing a fully segmented frame off the "back" of the block. The fully segmented frame is preferably the oldest frame in time of the block and is returned as output 210. More generally, any plane within the sliding window may be returned as output 210.

The newly formed block of pixels defined by the sliding window 20 now includes N-1 frames of previously segmented pixels and one new frame of unsegmented pixel data (termed "trivial segmentation") concatenated to the front of the block. Trivial segmentation is where every pixel (xz) constitutes its own region Oi. In step 208 the newly formed block is further segmented. This has the effect of segmenting the newly added frame as well as incorporating it into the existing segmentation. Once the block is segmented, the method returns to step 206, where again the sliding window is advanced by one frame in time.

Method 200 has the effect of, as the sliding window 20 moves along the oo sequence of frames, returning fully segmented frames as output 210. Each of the output 0% frames 10z is a two-dimensional slice through the three-dimensional block which induces •o00 a two-dimensional segmentation of the frame into a set of connected regions Pixels o• in the same and adjoining frames that belong to the same three-dimensional region Oi are automatically identified as those pixels having the same region label i in the twodimensional segmentation. Accordingly, tracking areas in the frames belonging to the S0 same three-dimensional region Oi provides an efficient method of tracking "objects" in a 0:video sequence, removing the need for a subsequent algorithm to solve the correspondence problem between regions of different frames.

0o.0 582100.doc -9- The segmentation of the three-dimensional block of pixels, as used In steps 204 and 208 of method 200, is now described in more detail. An assumption underlying the segmentation problem is that each pixel component f(x,z) is associated with a particular state. The model used to define the states is decided upon in advance. Each state i is defined by a unknown region model parameter vector ai, with each state i being assumed to be valid over the connected region Oi. The aim of segmentation is to identify these regions Oi and the model parameters a- for each region Oi.

The neighbourhood or adjacency rule for pixels (or "voxels" as the threedimensional analogue of pixels are known), known per se in the art, extends to the regions Oi. That is, a region Oi is said to be adjacent to another region Oj if some voxel in the first region Oi is a neighbour of any voxel in the second region Oj. Separating each pair of neighbouring voxels belonging to different regions Oi and O is a boundary element. A model vector of measurements g(x,z) over each region Oi is assumed to be a linear projection of the model parameter n-vector aL for that region Oi: g(x,z) A(x,z) ai, E Oi where A(x,z) is a known m by n matrix which relates the state of region Oi to the model measurements thereby encapsulating the nature of the predefined model.

Each vector of actual pixel components f(x,z) is subject to a random error e(x,z) such that f(x,z) g(x,z) (2) Further, the error e(x,z) may be assumed to be drawn from a zero-mean normal (Gaussian) distribution with covariance A(x,z): e(x,z) N(0, (3) 582100.doc il~iSVHSI^^"~.Y,~,n-i Fig. 3 is a block diagram showing the data flow 100 during segmentation which occurs in steps 204 and 208 shown in Fig. 2. Together with the pixel components f(x,z), the m by m covariance matrix A(x,z) at each pixel is an optional additional input to the algorithm. In traditional arrangements, it has been assumed that each component of the error e(x,z) is independently and identically distributed, i.e.: A(x,z) C2(x,z)I, (4) However, the preferred implementation generalises this to encompass disparate and possibly mutually dependent measurement error components. Variational segmentation requires that a cost function E be assigned to each possible segmentation. A partition into regions O; may be compactly described by a *o binary function K(d) on the boundary elements, in which the value one is assigned to each boundary element d bordering a region Oi. This function K(d) is referred to as a boundary map. It should be noted that because of the requirement of region connectedness, not every boundary map K(d) defines a valid segmentation. The cost function E used in the preferred implementation is one in which a model fitting error is balanced with an overall complexity of the model. The sum of the statistical residuals of each region Oi is used as the model fitting error. Combining Equations and the residual over region Oi as a function of the model parameters ai is given by Z

T

(x,z)eo, The model complexity is simply the number of region-bounding elements d.

Hence the overall cost functional E may be defined as K(d) (6) d 582100.doc -Ilwhere the (non-negative) parameter X controls the relative importance of rfiodel fitting error and model complexity. The contribution of the model fitting error to the cost functional E encourages a proliferation of regions, while the model complexity encourages few regions. The functional E must therefore balance the two components to achieve a reasonable result. The aim of variational segmentation is to find a minimising model measurement k and a minimising boundary map K(d) of the overall cost functional E, for a given parameter value.

Note that if the region boundaries d are given as a valid boundary map the minimising model parameters dii over each region Oi may be found by minimising the region residuals E. This may be evaluated using a simple weighted linear least squares calculation. Given this fact, any valid boundary map K(d) will fully and uniquely describe a segmentation. Therefore, the cost function E may be regarded as a function over the space of valid edge maps (K-space), whose minimisation yields an optimal region partition Kz for a given parameter k. The corresponding minimising model parameters i may then be assumed to be those which minimise the residuals E over each region O0. The corresponding minimum residual for region Oi will hereafter be written as Ei.

From the above it is clear that the parameter X is critical to the result. If parameter X is low, many boundaries are allowed, giving "fine" segmentation. As parameter X increases, the segmentation gets coarser. At one extreme, the optimal region partition Ko, where the model complexity is completely discounted, is the trivial segmentation, in which every pixel (xz) constitutes its own region O0, and which gives zero model fitting error e. On the other hand, the optimal region partition where the model fitting error e is completely discounted, is the null or empty segmentation in which 582100.doc -12the entire block is represented by a single region Oi. Somewhere between these two extremes lies the segmentation Kz which will appear ideal in that the regions Oi correspond to a semantically meaningful partition.

To find an approximate solution to the variational segmentation problem, a region merging strategy has been employed, wherein properties of neighbouring regions Oi and Oj are used to determine whether those regions come from the same model state, thus allowing the regions Oi and Oj to be merged as a single region Oi. The region residual E also increases after any 2 neighbouring regions Oi and Oj are merged.

Knowing that the trivial segmentation is the optimal region partition Kx for the smallest possible parameter value of 0, in region merging, each pixel (xz) in the block is initially labelled as its own unique region Adjacent regions Oi and Oj are then compared using some similarity criterion and merged if they are sufficiently similar. In this way, small regions take shape and are gradually built into larger ones. The segmentations Kz before and after the merger differ only in the two regions O; and O. Accordingly, in determining the effect on the total cost functional E after such a merger, a computation may be confined to those regions Oi and Oj. By examining Equations and a merging cost for the adjacent region pair {Oi,Oj} may be written as SEi (Ei Ej) tij 7)) 1(8 j) where 1(8ij) is the area of the common boundary between three-dimensional regions Oi and Oj. If the merging cost ty exceeds parameter X, regions Oi and Oj should be merged.

The key to efficient region merging is to compute the numerator of the merging cost tij as fast as possible. Firstly, Equation is rewritten as: 582100.doc i-M WE, VX7XA1 7 MRN M Ar A I 'A W I'll WW 117W &T ANOP A "A Nu~.*ay 13 Ej(ay) (F Hj aj)

T

f(F Hj a) (8) where: H is an (njm) by n matrix composed of the individual A(x,z) matrices stacked on top of one another as varies over region Oj; F is a column vector of length (n j m) composed of the individual pixel componentf(x,z) vectors stacked on top of one another; 1) is an (nim) by (n j m) block diagonal matrix, where each m by m diagonal block is the inverse of the covariance A(x,z) matrix at the pixel denoted by the corresponding rows in F. By weighted least square theory, the minimising model parameter vector aj for the region Oi is given by ai Hj IjF J J ii where Kj is the confidence in the model parameter estimate aj, defined as the inverse of its covariance:..

S* K AJ' Hj 1 -IH (10) The corresponding residual is given by E j =F T Hij'Hr )F (11) When merging two regions O; and Oj, the "merged" matrix Hj is obtained by concatenating matrix Hi with matrix H; likewise for matrices Fy and These facts may be used to show that the best fitting model parameter vector ai, for the merged region Oij is given by: ai =ai Kii' K(ai -a5) (12) 582100.doc -14where the merged confidence is Kij Ki+ Kj (13) and the merged residual is given by E Ei E+ E -a Ij KK (Kj I a-a (14) Combining Equations (13) and the merging cost t o in Equation may be computed as: ai-aj Kci(~K 1 l K j y i a -aj) from the model parameters and confidence of the regions Oi and Oj to be merged. The matrix to be inverted is always of size n by n, does not increase with region size). If lo the merge is allowed, Equations (12) and (13) give the model parameter a. and confidence Kcy of the merged region Oiy.

Note that under this strategy, only Equations and (15) need to be applied throughout the merging process. Only the model parameters a, and their confidences Ic for each region Oi are therefore required as segmentation proceeds. Further, neither the original pixel componentsf(x,z) nor the model structure itself the matrices are required.

Variational linear-model-based segmentation may thus be separated into two stages as seen in Fig. 3, those stages being an initial model fitting stage 106 where the model parameters d(x,z) and confidences K(x,z) are found for the pixel components f(x,z) and covariance A(x,z) at each pixel followed by a region merging stage 108.

In the case of a zero-order model e.g. RGB or LUV colour vectors used directly, the initial model-fitting stage 106 is trivial: 582100.doc n~a~i~ru ~r w lil~ll", 7q ,r~ (16) K(x,z) (17) In the case of higher-order models, e.g. piecewise planar fitting on range data, model parameters a(x,z) and confidences K(x,z) at each pixel may be obtained in any manner desired.

At the initialisation of the region merging stage 108, all adjacent region pairs Oi and Oj are determined and their corresponding merging cost ty evaluated. All the adjacent region pairs 0, and Oj are then "sorted" by a priority queue into ascending order of the merging costs ty. Region merging then involves popping a region pair O, and Oj off the 0o top of the priority queue the region pair Oi and Oj with the lowest merging cost tu), merging this region pair Oi and Oj to form a new region OU with data members updated as described above.

Similarly, any boundaries that are now duplicated, and hence will be forming a multi-graph, are replaced with a new merged boundary, that has the sum of the areas of the two constituent boundaries. The merging cost of all boundaries adjacent to O are then updated, rearranging their order in the priority queue. The priority queue may be an exact priority queue such as a Fibonacci heap, or an approximate priority queue such as a bucket sorted bounded heap.

Note that it is this global ordering of the boundary merging cost ty that ensures that newly added unsegmented frame data will be preferentially segmented and merged into the existing three-dimensional segmentation.

This region-growing segmentation stage 108 may be shown to have run-time complexity of order (M log where M is the number of pixels in a frame (or "slice" in the case where the three dimensions are spatial) assuming that the number of 582100.doc ~,u,*~i~n~~*ill"MTMMYTWMT.M1T7~.i -16neighbouring regions remains small relative to M, and provided the sorting and insertion can be done in "log time". This can be guaranteed if the list structure is maintained in computer memory in a structure called a priority queue.

To calculate a stopping point for this algorithm, note that this region-growing algorithm effectively provides a value at each merge operation, namely the merging cost ty of the pair being merged. It is therefore possible to build up a sequence of merging cost t. values as the algorithm proceeds. The algorithm halts if this merging cost value exceeds a threshold X,top at any time. If the algorithm were not halted, segmentation would continue until only one region remains. o In the preferred implementation, the value of the threshold X,top is automatically 009000 determined from the first block ofN frames in the sequence of frames it is applied to. To illustrate how the threshold ,,top is determined, it is noted that, as merging proceeds, the merging cost t

O

of the regions Oi and Oj being merged generally increases, i.

Fig. 6A shows a plot of merging cost t o during part of a segmentation of a real three- Is dimensional block of pixels, and Fig. 6B shows an artificial plot of the merging cost tij 9000 over time for an entire region merging process. As can be seen from Figs. 6A and 6B, the increase in merging cost t, is not purely monotonic. In fact, the overall rise in merging cost ty is punctuated by departures from monotonicity, which herein are termed local minima. A local minimum represents the collapse of what might be termed a selfsupporting group of adjacent regions. Such occurs if one boundary within the group is removed, and the merging costs ti, for adjacent boundaries then suddenly reduce. In effect, the hypothesis that the regions of the group are part of the same object is confirmed as more regions merge and so the merging cost ty decreases. The result is that all the boundaries in the group are removed in quick succession. These self-supporting groups tend to represent the internal structure of objects and background clutter. A 582100.doc C 1 liuh~-~Y~j~. ii.- -r ~i~L~iRB'in~l'. .s(Lr(ll~W~II r*i~'Y~lr*lr~, UIYlu)k~'*Ul.ull 'OLU~hr -17measure of merit such as the number of boundaries removed, or their total-area, or the maximum (absolute or relative) decrease in merging cost t o may be assigned to each local minimum.

The point immediately after a local minimum, being a return to substantial monotonicity, is termed herein a stable configuration. Visually, a stable configuration represents a point in the region merging process at which an area of internal object structure or background clutter has just been removed, and is thus a good potential halting point. Each stable configuration has an associated merging cost to. Fig. 6A also shows local minima and stable configurations. oooo If a complete pass is made through the segmentation, in which all regions are merged until only one region (the whole block) remains, all local minima and stable configurations for the block of frames may be found automatically by analysing the values of the merging cost tij. Significant local minima, being those whose measure of merit exceeds a certain threshold 4, are flagged. As can be seen in Fig. 6B, during the early stages of region merging, local minima are common, giving the plot an erratic behaviour. As the regions become more established and substantial, the local minima frequency reduces until the null segmentation is reached the block forms a single region). Those segmentations approaching the null will however be useful since the number of regions will be manageable computationally and most likely will be visually perceptible. As indicated above, a stable point is a desirable location to cease region merging, and Fig. 6B illustrates the identification of a limited number of candidate stopping locations, those being ,,,sol sto_2, ,stop_3, at significant stables near the null segmentation. The last significant stable configuration -stop_1 is typically chosen as the threshold kstp, although any of the limited number of candidate stopping locations may be selected depending on the particular video feed and/or application being processed.

582100.doc -18- Further, where the image has a large number of local minima hundreds, thousands or more), the limited number of candidate stopping positions may be limited (eg. in the "tens").

To achieve this stable configuration during the region-growing segmentation stage 108, the merging of regions must stop once the merging cost ty exceeds the threshold ,stop. To restore this state, the region merging segmentation stage 108 need only reverse its last few merging operations by restoring the algorithm state appropriately.

Alternatively, the region merging segmentation may be run again from the start, halting when the value of merging cost ty reaches the threshold ,,sop. 0o The latter, more expensive alternative is used wherein a first pass of the region merging segmentation is used to determine the threshold kstop, and subsequent passes halt once the merging cost ty reaches the threshold ,,sop.

Fig. 7 shows the region-growing segmentation stage 108, which forms part of the data flow 100 (Fig. 3) during segmentation in more detail. It is again noted that the data flow 100 occurs during the segmentation steps 204 and 208 shown in Fig. 2. In the preferred implementation, the first pass only occurs in step 204 as it is assumed that subsequent frames added to the block of frames have similar characteristics to those frames 101 to 10 N included in the initial block.

The region-growing segmentation stage 108 starts at step 302 and proceeds to step 304 which receives the model parameters a(x,z) and model confidences K(x,z) for each pixel in the block of N frames from the model fitting stage 106 (Fig. The region-growing segmentation stage 108 starts with the trivial segmentation where each pixel (xz) forms its own region Oi. Step 306 then determines all adjacent region pairs Oi and Oj, and computes the merging cost ty according to Equation (15) for each of the 582100.doc 19flli9HfVIIIIIIV1fllASI1191S9111HS41fQWYflHS1^ -19boundaries between adjacent region pairs Oi and Oj. Step 308 inserts the boundaries with merging cost ty into a priority queue T in priority order.

Step 310 takes the first entry from the priority queue T(1) and merges the corresponding region pair Oi and Oj the region pair Oi and Oj with the lowest merging cost ti) to form a new region Step 312 records the merging cost ty in a list L.

Step 314 identifies all boundaries between regions Oi adjoining either of the merged regions Oi and Oj and merges any duplicate boundaries, adding their areas. Step 318 follows by calculating a new merging cost tijl for each boundary between adjacent regions Oy and Oi. The new merging costs til effectively reorder the priority queue into the final sorted queue in step 318.

S

During the first pass, the region-growing segmentation stage 108 then proceeds to step 324 where it is determined whether null segmentation has been reached. This is done by determining whether more than one region remains. If the null segmentation has not been reached, then control returns to step 310. Steps 310 to 324 are repeated until null segmentation is reached. When all regions have been combined into the null segmentation, step 324 passes control to step 326 which then identifies the merging cost ty, stored in list L, corresponding to the last stable configuration. This is assigned the threshold Xstop. This concludes the first pass.

Control returns to step 304 where the segmentation is started again with the trivial segmentation where each pixel forms its own region Oi. Steps 306 to 318 follow as described above where the pixels are merged to form regions in the second pass.

With threshold stop determined, step 318 passes control to step 322 to determine if the merging has reached the stopping point. This is done by determining whether the merging cost ty corresponding to the regions Oi and Oj at the top of the priority queue T 582100.doc (entry has a value greater than the threshold Lstop. If the merging his reached a stopping point, then the region-growing segmentation stage 108 ends in step 330.

Alternatively, control is returned to step 310 and steps 310 to 322 are repeated, merging the two regions with the lowest merging cost ti, every cycle, until the stopping point is reached.

In order to apply the region-growing segmentation stage 108 described above to a wide variety of data sources and models, it is desirable to use an adaptive measure of merit to determine whether a significant local minimum has been reached when analysing, in step 326, the merging costs ty stored in list L. In the preferred implementation, the measure of merit used corresponds to the number of boundaries

S

removed during a local minimum. oo Fig. 4 shows a graphical representation of a data structure 500 for use with the 9*@S region-growing segmentation stage 108, which is essentially an adjacency list representation of the segmentation graph. The data structure 500 includes a region list 510, a boundary list 520 for each region in the region list 510, and a priority queue 530. The region list 510 contains for each region Oi a region identifier, co-ordinates of all pixels in that region Oi, the model parameter vector ai and the model confidence Ki. The priority queue 530 contains and orders boundaries Bij. These are data structures which store the merging costs t 0 of two regions Oi and Oj, the area l(8ij) of the common boundary between regions Oi and Oj, and pointers to the corresponding regions Oi and Oj.

Each boundary list 520 contains a list of pointers, with each pointer pointing to an entry in the priority queue corresponding to the boundary between adjacent regions Oi and Oj under consideration. This data structure 500 allows the region-growing segmentation stage 108 to efficiently store all the data and parameters required for the segmentation.

582100.doc S BvivS/ i ,''IN -21- Fig. 5 shows a graphical representation of a two-dimensional array'D, which is maintained during steps 206 and 208. It conceptually corresponds to the two-dimensional intersection of the segmented three-dimensional regions Oi with the frame at the "front" of the sliding window. This is the last frame that was added to the block of N frames.

s Each pixel position in the two-dimensional array D stores a pointer to the segmented three-dimensional region Oi that intersects it, allowing constant time lookup of the adjacent region, so that a new frame can be concatenated to the graph data structure 500 in constant time per pixel.

The region-merging segmentation stage 108 described above starts with the 1o trivial segmentation where each pixel forms its own region Oi. In order to lower memory cost and to decrease runtime, an initial region-splitting preprocessor may be used. The region-splitting preprocessor preprocesses a frame 10, into square regions of pixels before adding it to the three-dimensional block. Accordingly, the region merging segmentation stage 108 starts with fewer regions than in the above implementation.

The preprocessor utilises a tiled quadtree decomposition of the newly added frame 10. Rather than adding the frame 10, as individual pixels, the preprocessor attempts to add squares of MxM pixels in scan-line order, where M is some predetermined small power of two, such as 16. For each square of MxM pixels it is determined whether the pixel components f(x,z) of the pixels contained in the square all have values within a predetermined small tolerance. If this condition is met, the square is added in its entirety as a single region. Alternatively, the square is subdivided into 4 equally sized smaller blocks. Each of these smaller blocks are themselves tested for homogeneity, and added as a single region if so, otherwise the process is continued recursively until single pixel-sized regions are reached.

582100.doc 22 Note that by adding square (or rectangular) blocks, a frame 10, ma be tiled in scan-line order, and hence the neighbouring regions can be quickly determined and the regions added in constant time. This would not hold for arbitrarily shaped regions.

The method 200 for tracking regions in a sequence of frames relies on the fact that there exists some "connectivity" between corresponding regions in consecutive frames. This requirement may not be met when the sequence includes small, fast moving objects. An adequate temporal sampling rate is required for such objects to remain connected in the three-dimensional space-time domain. Typical sampling rates of 25 Hz (or 50 Hz for interlaced video) are, in fact, adequate for tracking most objects of interest, such as moving humans.

If the temporal sampling rate is fixed and small, and fast moving objects need to i'.

be tracked, then one of the following options may be used: 1. Filter the original analog video signal temporally before temporal i: sampling. This introduces motion blur, and method 200 takes advantage of such motion blur, as it assists in causing connectivity in the three dimensional space-time; 2. The effect of option 1 above may be approximated by digital processing by super-sampling, where frames are captured at a higher rate than is required. Super- sampling is then followed by down-sampling by averaging several frames together; or 3. Motion-guided frame interpolation, known per se in the art, can be used to generate intermediate frames between given frames to increase the effective temporal sampling rate of data passed into the algorithm of method 200. To perform motionguided frame interpolation, an estimate of the motion field is required, which may be calculated using any one of a number of known algorithms.

A problem that is often encountered when performing region merging on video as a three-dimensional block of voxels is that connected regions in the three-dimensions 582100.doc -23may be non-simple along the time axis, which results in disconnected regiong in the twodimensional segmented frame returned in step 210 (Fig. Fig. 8A shows a region A which always results in a single segment Sz in each of the frames 10z. However, the region B shown in Fig. 8B results in disconnected segments Szl and sz2 in frame 10z. This s may adversely affect later image processing algorithms using the segmented data of the frames A typical approach to this problem is to rectify the returned two-dimensional segmented frames by post-processing, whereby connectivity of each labelled region Oj of the segmented frame 10z is tested. Each disconnected segment is then either relabelled or to removed. This form of post-processing is slow and significantly complicates algorithms it is applied to.

A preferred approach is to intervene during the three-dimensional region- growing segmentation stage 108 (Fig. Regions Oj that are non-simple along the time axis with the sliding window 20 are prevented from being generated, where "simple" o means that a cross-section along the time axis will give a single two-dimensional connected component. Hence, disconnected segments such as szl and Sz2 shown in Fig. 8B cannot occur.

Before merging neighbouring regions Oi and Oj in step 310 (Fig. it is tested whether the resulting region O would comply with an "allowed configuration" defined below. If the resulting region Oi would not comply with the allowed configuration, then the merger of regions Oi and Oj is prevented by effectively setting the merging cost tij to infinity.

Referring to Fig. 8C, a minimum and maximum extent of a region A along the time axis is defined as RAtmin and RAtm,, respectively. Similarly, the minimum and maximum extent of a region B along the time axis is defined as RBtmin and RBtmax 582100.doc -24respectively. A three-dimensional boundary separating regions A and B has a"time extent (Boundarytmin, Boundarytmax). Let Wmin be the minimum extent of the sliding window in time and Wmax the maximum extent. An allowed configuration, as shown in Fig. 8C, is defined as configurations wherein: [max(RAtmin,Wmin)=max(Boundarytmin,Wmin) OR max(RBtmin,wmin)=max(Boundarytminwn)] AND [min(RBtmax,wmax)=min(Boundarytx,wmax) OR min(RAt,~x,wnax)=min(Boundarymax,wna)]. All other configurations are disallowed.

An example of such a disallowed configuration is shown in Fig. 8D. As can be seen from the illustration, a merger between regions A and B will not result in an to allowable configuration according to the above condition.

The calculation of the extent RBtmin and RBtmax of a region B along the time axis of all the regions, as well as the extent Boundarytmin and Boundarytmax of the boundaries is easily calculated in the initialisation step 304, as the extent of pixels or regions and their boundaries are trivially known. When merging two regions, the extent of the new region may be updated in constant time by assigning the maximum and minimum extents of the two constituent regions. Boundaries may be updated in a similar manner. All initial pixels are simple in time, and by maintaining the condition that all regions Oi remain simple at each merging step, then the final three-dimensional segmentation must be simple, i.e. can not result in disconnected two-dimensional segments in cross-section.

For isotropic three-dimensional data, the presumption that the number of neighbouring regions remains small is correct and the region-growing segmentation stage 108 has O(M log M) run-time. However, when concatenating unsegmented data onto the previously segmented three-dimensional sliding window, this presumption does not hold, as the data is anisotropic, and the number of neighbouring regions can grow to large sizes.

582100.doc To maintain a reasonable O(M log M) run-time, the boundedness of the number of neighbouring regions needs to be guaranteed. This is done by extending the regiongrowing segmentation stage 108 so that the merging cost ty is set to infinity if either of the two adjoining regions Oi and Oj has a number of neighbouring regions greater than some threshold. This effectively removes the affected regions Oj and Oj from merging until such time as their number of neighbouring regions has shrunk below some predetermined bound, ensuring an O(M log M) run-time.

The method 200 is not limited to analysing regions in a video feed, but may also be used for segmentation of other forms of three-dimensional data, such as data in three spatial dimensions. The method 200 is particularly useful where segmentation is required for large quantities of three-dimensional data, such as resulting from computed tomography (CT) or magnetic resonance imaging (MRI). Segmentation of the whole three-dimensional block using a traditional three-dimensional segmentor would require a very large data storage facility and would be slow. By moving the whole threedimensional block through a sliding window which defines a smaller block of data, segmentation is performed on the smaller block of data. Every time the sliding window is advanced, a "slice" of data from the whole three-dimensional block is added to the *smaller block and a fully segmented "slice" of data is produced as output. The sliding *o window is moved over the whole three-dimensional block, until a completely segmented block of data is formed.

The method 200 of video tracking described above may be practiced using a programmable device, and are preferably practiced using a conventional general-purpose computer system 400, such as that shown in Fig. 9 wherein methods 200 may be implemented as software, such as an application program executing within the computer 25 system 400. In particular, the steps of methods 200 are effected by instructions in the 582100.doc -26software that are carried out by the computer. The software may be divided into two separate parts; one part for carrying out the video tracking method and another part to manage the user interface between the latter and the user. The software may be stored in a computer readable medium, including the storage devices described below, for example.

The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for video tracking in accordance with the embodiments of the invention.

1o The computer system 400 comprises a computer module 401, input devices such as a keyboard 402 and mouse 403, output devices including a printer 415 and a display device 414. A Modulator-Demodulator (Modem) transceiver device 416 is used by the computer module 401 for communicating to and from a communications network 420, for example connectable via a telephone line421 or other functional medium. The modem 416 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN). The computer module 401 typically includes at least one processor unit 405, a o memory unit 406, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface407, and an I/O interface413 for the keyboard402 and mouse403 and optionally a joystick (not illustrated), and an interface 408 for the modem 416. A storage device 409 is provided and typically includes a hard disk drive 410 and a floppy disk drive411. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 412 is typically provided as a non-volatile source of data. The components 405 to 413 of the computer module 401, typically communicate via an interconnected bus 404 582100.doc -27and in a manner which results in a conventional mode of operation of tfie computer system 400 known to those in the relevant art. Examples of computers on which the embodiments can be practised include IBM-PC's and compatibles, Sun Sparcstations or alike computer systems evolved therefrom.

Typically, the application program of the preferred embodiment is resident on the hard disk drive 410 and read and controlled in its execution by the processor 405.

Intermediate storage of the program and any data fetched from the network 420 may be accomplished using the semiconductor memory 406, possibly in concert with the hard disk drive 410. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 412 or 411, or alternatively may be read by the user from the network 420 via the modem device 416. i'.

Still further, the software can also be loaded into the computer system 400 from other computer readable medium including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer module 401 and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including e-mail transmissions and information recorded on websites and the like. The foregoing is merely exemplary of relevant computer readable mediums. Other computer readable media may be practiced without departing from the scope and spirit of the invention.

The methods described may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions and for example incorporated in a digital video camera 450. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories. As seen, the camera 450 includes a display screen 452 which can be used to display the segmented frames of information regarding then same. In this 582100.doc -28fashion, a user of the camera may record a video sequence, and using the processing methods described above, create metadata that may be associate with the video sequence to conveniently describe the video sequence, thereby permitting the video sequence to be used or otherwise manipulated with a specific need for a user to view the video sequence.

A connection 448 to the computer module 401 may be utilised to transfer data to and/or from the computer module 401 for performing the video tracking process.

Industrial Applicability It is apparent from the above that the embodiment(s) of the invention are applicable to the video processing industries where video sequences may require cataloguing according to their content. i The foregoing describes only one embodiment/some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiment(s) being illustrative and not restrictive. In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only of'. Variations of the word comprising, such as "comprise" and "comprises" have corresponding meanings.

582100.doc

Claims

1. A method for segmenting a sequence of two-dimensional frames, each frame being formed by a plurality of pixels, each said pixel being described by a vector having components each relating to a different measured image characteristic, said method comprising the steps of: forming a three-dimensional block of said pixels from the first predetermined number of said frames on said sequence of frames; segmenting said three-dimensional block of pixels using a three- dimensional region-merging segmentor to form three-dimensional regions in a segmented block of pixels; concatenating a next frame in said sequence of frames to said segmented three-dimensional block of pixels; outputting a segmented two-dimensional frame from said segmented three-dimensional block of pixels; and repeating steps to for each subsequent frame of said sequence.

2. A method as claimed in claim 1 wherein said sequence of frames is from a video feed and said output frame is the oldest segmented frame in time.

3. A method as claimed in claim 2 comprising the further step of: (dl) tracking regions in said sequence of frames by tracking corresponding regions in said output frames.

582100.doc ^Siy[Wfffyf; ii-~u il^^ p

4. A method as claimed in any one of claims 1 to 3 wherein each pixel is further described by a corresponding error covariance, and step comprises the sub-steps of: (bl) for each said pixel, fitting each said component and the corresponding error covariance to a predetermined linear model to obtain a set of model parameters and corresponding confidence representations; (b2) statistically analysing said sets of model parameters and corresponding confidence representations to derive said segmented block of pixels that minimises a predetermined cost function.

5. A method according to claim 4 wherein step (b2) comprises the sub-steps of: merging neighboring regions in an order using said sets of model parameters and confidence representations until an optimal merging criterion is reached, said order being determined by a variational cost function.

6. A method according to claim 5 wherein said order is determined by dividing a minimum covariance-normalised vector distance between adjacent regions of said segmentation by an area of a common boundary between adjacent regions, and ordering the resulting quotients.

7. A method according to claim 5 whereby regions having more than a predetermined number of neighbouring regions are excluded from merging.

8. A method according to claim 4 wherein step (b2) comprises the sub-steps of: 582100.doc 1w -w y ~y j -31 arranging a sequence of neighbouring region pairs in an order using said sets of model parameters and confidence representations, said order being determined by a variational cost function; determining whether a merger of a first pair of neighbouring regions in said sequence would result in a configuration wherein every cross-section along the time axis provides a single two-dimensional connected component; and merging said pair of neighbouring regions if said merger would result in said configuration, or alternatively moving said pair of neighbouring regions to a substantially last position in said sequence.

9. A method according to any one of claims 1 to 8 comprising the further step of: maintaining a two-dimensional array of pointers to regions in said segmented block of frames, said array having entries corresponding with said pixels, thereby allowing a constant time look-up of neighbouring regions when concatenating a new frame in step

10. A method according to claim any one of claims 1 to 9 wherein step comprises I the sub-steps of: dividing said next frame into tiled square regions ofpixels; quadtree decomposing said tiled regions if said pixel components are not within a predetermined variance; and adding said tiled regions to said segmented block ofpixels.

11. A method according to any one of claims 1 to 10, wherein said plurality of vector components comprise at least two of colour, range and motion. vector components comprise at least two of colour, range and motion. 582100.doc y it i~"i-ii*~..?i^~iYiiIILI~;I~*IIIIV ".L1I i (~J~T~Xll~rr~WIIU~.~ ~Y~I.I(II(IIIY IU1IV* n*l~**lthY(.I lelr)iyirll nu -32-

12. An apparatus for segmenting a sequence of two-dimensional frames, each frame being formed by a plurality of pixels, each said pixel being described by a vector having components each relating to a different measured image characteristic, said apparatus comprising: means for forming a three-dimensional block of said pixels from the first predetermined number of said frames on said sequence of frames; segmenting means for segmenting said three-dimensional block of pixels using a three-dimensional region-merging segmentor to form three-dimensional regions in a segmented block of pixels; concatenating means for concatenating a next frame in said sequence of frames i'. to said segmented three-dimensional block of pixels; output means for outputting a segmented two-dimensional frame from said segmented three-dimensional block of pixels; and means for activating said segmenting means, said concatenating means and said oeoo output means for each subsequent frame of said sequence.

13. An apparatus as claimed in claim 12 wherein said sequence of frames is from a video feed and said output frame is the oldest segmented frame in time.

14. An apparatus as claimed in claim 13 further comprising: means for tracking regions in said sequence of frames by tracking corresponding regions in said output frames. 582100.doc -33- An apparatus as claimed in any one of claims 12 to 14 wherein each pixel is further described by a corresponding error covariance, and said segmenting means comprises: fitting means for, for each said pixel, fitting each said component and the corresponding error covariance to a predetermined linear model to obtain a set of model parameters and corresponding confidence representations; analysing means for statistically analysing said sets of model parameters and corresponding confidence representations to derive said segmented block of pixels that minimises a predetermined cost function. 16. An apparatus according to claim 15 wherein said analysing means comprises: ."o means for merging neighboring regions in an order using said sets of model ooo. parameters and confidence representations until an optimal merging criterion is reached, said order being determined by a variational cost function. 17. An apparatus according to claim 16 wherein said order is determined by dividing a minimum covariance-normalised vector distance between adjacent regions of said segmentation by an area of a common boundary between adjacent regions, and ordering the resulting quotients. 18. An apparatus according to claim 16 whereby regions having more than a. predetermined number of neighbouring regions are excluded from merging. 19. An apparatus according to claim 15 wherein said analysing means comprises: 582100.doc *&'xwwwsyw~~~~ s -34- means for arranging a sequence of neighboring region pairs in an'order using said sets of model parameters and confidence representations, said order being determined by a variational cost function; means for determining whether a merger of a first pair of neighboring regions in said sequence would result in a configuration wherein every cross-section along the time axis provides a single two-dimensional connected component; and means for merging said pair of neighboring regions if said merger would result in said configuration, or alternatively moving said pair of neighboring regions to a substantially last position in said sequence.. An apparatus according to claim any one of claims 12 to 19 wherein said concatenating means comprises: means for dividing said next frame into tiled square regions of pixels; means for quadtree decomposing said tiled regions if said pixel components are not within a predetermined variance; and means for adding said tiled regions to said segmented block of pixels. a C f 21. A program stored in a memory medium for segmenting a sequence of two- dimensional frames, each frame being formed by a plurality of pixels, each said pixel being described by a vector having components each relating to a different measured image characteristic, said program comprising: code for forming a three-dimensional block of said pixels from the first predetermined number of said frames on said sequence of frames; 582100.doc code for segmenting said three-dimensional block of pixels ifsTng a three- dimensional region-merging segmentor to form three-dimensional regions in a segmented block of pixels; code for concatenating a next frame in said sequence of frames to said segmented three-dimensional block of pixels; code for outputting a segmented two-dimensional frame from said segmented three-dimensional block of pixels; and code for repeating said code for segmenting, code for concatenating and code for outputting for each subsequent frame of said sequence. S 22. A method of segmenting an image, said method being substantially as described herein with reference to Figs. 1 to 9 of the drawings. *59S S 23. An apparatus for segmenting an image, said apparatus being substantially as described herein with reference to Figs. 1 to 9 of the drawings. 24. A program for segmenting an image, said program being substantially as described herein with reference to Figs. 1 to 9 of the drawings. Dated 24 December, 2001 CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant/Nominated Person SPRUSON FERGUSON 582100.doc i ~ti"n~rsw^^rr,^Y~