WO2008008046A1

WO2008008046A1 - Method and system for multi-object tracking

Info

Publication number: WO2008008046A1
Application number: PCT/SG2007/000206
Authority: WO
Inventors: Liyuan Li; Ruijiang Luo; Ruihua Ma; Karianto Leman; Pankaj Kumar; Beng Hai Lee; Welmin Huang
Original assignee: Agency For Science, Technology And Research
Priority date: 2006-07-11
Filing date: 2007-07-11
Publication date: 2008-01-17
Also published as: WO2008008045A1; SG150527A1

Abstract

A method and system for multi-object tracking in a video signal. The method comprises the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said ech child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

Description

Method And System For Mu lti -Object Tracking

FIELD OF INVENTION

The present invention relates broadly to method and system for multi-object tracking in a video signal, in particular a video signal captured by a fixed camera, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method for multi-object tracking in a video signal.

BACKGROUND

Real-time tracking objects of interest in image sequences is one of the challenging problems in video understanding. It is an essential part in many computer vision applications, such as video surveillance, media analysis, human computer interface, and video compression (e.g., object-based coding in MPEG-4). Many methods have been proposed for appearance-based visual object tracking in image sequences. Generally speaking, three components are included in the processing: target representation, motion prediction, and object matching. First, a model is used to characterize the distinctive appearance features of the target object. These appearance features of the target object should be consistent and discriminant from other objects through the image sequence. Existing models include blobs of homogeneous intensities or colors, feature points, contours, templates, color histograms and joint color-spatial distributions of the object regions.

Liter-frame motion of target objects is predicted by using dynamic models or motion models. The popular dynamic models are Kalman filter and particle filters. The motion of a rigid object can be estimated by using an explicit motion model such as geometric transformation, while motion of non-rigid objects can be computed with an implicit motion model, e.g., mean- shift. An observation model is used to evaluate the matching between the target and its candidate location in the coming frame. The new position of the target is determined as the mean location of the predicted candidates (EAP: expected a posterior) if a dynamic model is employed or the location of the maximum matching value {MAP: maximum a posterior) when a motion model is used.

The MAP methods are considered as deterministic tracking since a gradient descent algorithm, e.g., a mean-shift algorithm, is commonly used to find the maximum, and the EAP methods are classified as stochastic tracking as randomly sampling in a time series state space is required. Roughly speaking, stochastic tracking is more robust than deterministic tracking since it is less likely to be trapped in local extrema, however, it is more computational expensive.

While tracking a single object in cluttered environments has been largely successful, tracking a varying number of non-rigid objects in crowds remains a challenging task. Among many others, there are three major difficulties for solving this problem. First, when objects merge into a group, the visual features for each object become ambiguous and uncertain. The distant objects can be occluded partially or even completely by the close objects. Second, the appearance of target objects may change drastically when they are in crowds due to the changes of poses and scales. For example, one standing person may sit down when he is partially occluded by another person. Third, the motion modes of target objects may change significantly in crowds, e.g., several separated persons may gather into a group, stay together for a while and then separate in different directions.

There has been an increasing number of proposals on multi-object tracking in this decade. The existing methods can roughly be classified into three categories: multiple instantiations of single trackers, multi-camera cooperation, and extended particle filters. A simple solution to the problem is to build multiple instantiations trackers for individual objects. Such methods perform well in the presence of simple occlusions with the help of Kalman filter and specified strategies to interpret the overlaps. In some proposals explicit segmentation of objects in a group using template matching is performed before tracking. Object models templates or color blobs of head, upper and lower parts of human body are learned for each isolated object. The segmentation of individuals in a group becomes more difficult when large variations of poses, scales, and motion patterns of objects are involved in the interaction.

Being aware of the difficulties caused by occlusion and overlapping, multi-camera systems are proposed. With a proper placement of multiple cameras, occlusion can be alleviated assuming at least one camera may capture a better view of each target object. The challenge for a multi-camera system is the calibration and the cooperation of multiple cameras to achieve a consistent tracking of each object since the views from different cameras can be very different.

Over the last few years, particle filters have been shown to be powerful tools for single object tracking. Attempts have also been made to extend particle filters to multi-object tracking. Hue et al extended the state space by concatenating the state vectors of fixed M objects. The likelihood measure is estimated under the assumption of conditional independence. The association for each component is assigned by a Gibbs sampler. Vermaak et al introduced a mixture of particle filters (MPF) where each component (object) is described by a cluster of particle filters that form a part of the mixture. The filters in the mixture interact only through the computation of the importance weights. Okuma et al further developed a boosted particle filter (BPF) from MPF. In BPF, the proposal distribution for each object (hockey player) is estimated by integrating the object detection and the dynamic models.

To avoid the convergence of multiple modes of occluded objects into one cluster of the front object in the case of overlapping, MaccorMick and Blake proposed the exclusion principle in the likelihood estimation. In their work, the state is extended to a 2.1D model which contains a label of depth order. The likelihood measures for all the depth configurations between two overlapping objects are explicitly derived. A strategy of partition sampling is performed for all of the configurations. A similar approach has been used to estimate the likelihoods based on color histograms of two overlapping objects.

Considering the difficulties in deducing the occlusion relationship of multiple objects from images, introduction of 3D information about the background and target objects has been proposed. In one such proposal, a calibrated camera and a generalized-cylinder or 3 -ellipsoid model of a standing human object are used, and the likelihood for a hypothesis configuration of multiple persons in a group is evaluated according to the 2D projections of their 3D positions on the ground plane. These methods successfully tracked multiple persons walking in crowds without large pose variations. However, the sufficient sampling of all the possible 3D configurations of multiple objects may lead to a significant increase of particle filters to obtain a proper distribution. When multiple objects gather together with various complex occlusions, the likelihoods of observing different objects are not independent. A few methods have sought to address such problem under a general mathematical framework and almost all of them are stochastic-based methods. A main disadvantage of these stochastic methods is their intensive computations which make them difficult for real-time video surveillance applications.

A need therefore exists to provide a method and system for multi-object tracking in a video signal that seek to address at least one of the above mentioned disadvantages.

SUMMARY

In accordance with a first aspect of the present invention there is provided a method of multi-object tracking in a video signal; the method comprising the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

Step a) may comprise calculating visible portions of the respective corresponding objects in the first image to derive the estimated depths of the respective corresponding objects.

Step b) may comprise assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node based on a posterior probability evaluated by Bayes rule. A posterior probability evaluated by Bayes rule may be based on principle colour representations (PCRs) of the corresponding object and said one child node respectively.

Step c) may comprise removing the visual content of the assigned corresponding object from the visual data associated with said one child node based on PCRs of the assigned corresponding object and said one child node respectively.

Step d) may comprise calculating visible portions of the respective corresponding objects in the second image to derive the estimated depths of the respective corresponding objects.

Step e) may comprise applying the means-shift calculation to locate the corresponding object having the lowest depth in said each child node based on gravity centres of pixels of each principle colour components in a PCR of the corresponding object in the second image.

Step g) may comprise removing the updated visual content of the located corresponding object from the visual data associated with said each child node based on PCRs of the located corresponding object and said each child node respectively.

The method may further comprise storing tracking data including the updated status and visual content of each corresponding object for a series of consecutive frames and detecting an event in the video signal based on the stored tracking data.

The method may further comprise the step of for each parent node having no child node, deleting the corresponding object.

The method may further comprise the step of for each parent node having only one child node, assigning all corresponding objects to said one child node.

The method may further comprise the step of for each child node having no corresponding object assigned thereto, check whether the object is disappeared, and if not, set a new corresponding object to said each child node. The method may further comprise the step of for each child node having only one corresponding object assigned thereto, update the state and visual content of said one corresponding object from the visual data associated with said each child node.

In accordance with a second aspect of the present invention there is provided a multi- object tracking module for multi-object tracking in a video signal; the module comprising means for receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; means for generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and the means for generating, for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then, for each child node having only one corresponding object assigned thereto, updating a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said ech child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

In accordance with a third aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of multi-object tracking in a video signal; the method comprising the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

In accordance with a fourth aspect of the present invention there is provided a method of stationary object tracking in a video signal, the method comprising the steps of determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.

The generating of the template image may be based on image data within a bounding box in the at least one of the frames of the sequence for the tracked object.

The tracking of the state may comprise the steps of determining a first difference measure between the template image and a corresponding region in the current frame; determining a second difference measure between respective corresponding regions in the current frame and a preceding frame; determining a visibility measure of the stationary object from the corresponding region in the current frame.

The tracking of the state further.may comprise determining whether another tracked moving object overlaps the corresponding region in the current frame.

The tracking of the state further may comprise the steps of determining a motionless state if the first and second difference measures are below a first threshold value over a sequence of τ_p current frames; determining an occluded state if the first and second difference measures each exceed a second threshold value and the visibility measure falls below a third threshold value over the sequence of τ_p current frames, and another tracked moving object overlaps the corresponding region in the current frame; determining a removed state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure falls below the third threshold value over the sequence of τ_p current frames, and no tracked moving object overlaps the corresponding region in the current frame; determining an inner-motion state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure then falls below the first threshold value and the visibility measure is above a fourth threshold value; and determining the start moving state if the first and second difference measures exceed and remain above the second threshold value and the visibility measure exceeds and remains above the fourth threshold measure over the sequence of τ_p current frames, and a spatial shift of the tracked stationary object.

The visibility measure may be determined based on principle colour representation.

The first and second difference measures may be determined based on a knowledge base of human perceived semantic meanings, an evaluation from real-world videos, or both.

In accordance with a fifth aspect of the present invention there is provided a system for object tracking in a video signal, the system comprising means for determining that a tracked moving object has become stationary over a sequence of frames; means for generating a template image of the stationary object based on at least one of the frames of the sequence; means for tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and means for switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.

In accordance with a sixth aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of object tracking in a video signal, the method comprising the steps of determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state. BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:

Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.

Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.

Figure 3 shows a series of images and histograms illustrating principle colour representation (PCR) in the example embodiment.

Figure 4 shows a schematic drawing illustrating directed acrylic graphs (DAGs) for regions in consecutive frames in the example embodiment.

Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.

Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.

Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.

Figure 8 shows a graph illustrating a finite state machines (FSM) representation for event detection in the system implementation of Figure 7.

Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.

DETAILED DESCRIPTION

The described embodiment provides a novel 2'/4D method of multi-object tracking for real-time video surveillance. An appearance model, principal color representation (PCR), is applied to multi-object tracking. The PCR model characterizes the appearance of an object or a region with a few most significant colors. The likelihood of observing a tracked object in a foreground region is derived according to their PCRs. Based on the Bayesian estimation theory, multi-object tracking is formulated as a Maximum A Posterior (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location. By exploiting that the close and unoccluded objects have richer visual information than the distant or occluded ones, sequential solutions to the subproblems which process the objects in a group from the most visible to the least visible ones are derived according to the likelihoods estimated based on PCR. In the assignment step, each tracked object is assigned to a foreground region in the coming frame. When an object is assigned, its visual infoπnation will be excluded from the PCR of the region.

In the location step, multiple objects assigned to one region are located one-by-one according to their depth order. A two-phase mean-shift algorithm based on PCR is derived for locating objects. When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.

Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.

Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "calculating", "determining", "excluding", "generating", "assigning", "locating", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.

The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.

In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.

Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.

The invention may also be implemented as hardware modules. More particular, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.

The distinctive background objects (regions) in the example embodiment are classified into two categories:

Type-1 CBR: a facility for the public in the scene;

Tyρe-2 CBR: a large homogenous region.

Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions. The example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors. Let R'_b be the z-th CBR in the empty scene /(x), and G(x) and O(x) be the gradient and orientation images of /(x), respectively. If the orientation values are quantized into 12 bins each covering 30°, the orientation histogram for R'_b is defined as

where μ_τ () is a binary function on the threshold T and δ_k() is a delta function defined as

^{H k }>} \ 0, otherwise ,„ . v (2a)

The OHR H₆ is a simple and efficient variant of the robust local descriptor SIFT [1] for real-time processes. It is less sensitive to illumination changes and slight shift of object position.

By scanning the region R'_b , a table of the PCR for the region can be obtained. The PCR for R'_b is defined as

where /?,• is the size of R'_b, c^k _t is the k-th most significant color of R'_b and p^k _t is its significance value. The significance value is computed by

<5(cl, c2) is a delta function. It equals to 1 when the color distance d(cι, C₂) is smaller than a small threshold ε, otherwise, it is 0. The color distance used here is

2 < C₁₁₁ C₃ > d(ci,c₂) = l

;MI²+ IMI³ (5a) where <•,•> denotes the dot product [2, 3]. The principal color components E^k ,^• are sorted in descendent order according to their significance values /?*• . The first N; components which satisfy ' ' ∑_feoft ≥ 0^,95p, _{are use(}j _as ^_{6 pcR o}f ^_{6 re}gj_on $_{b ^} ^ich means the principal colors in PCR cover more than 95% colors from R'b ■ PCR is thus efficient to describe large regions of distinctive colors.

A type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OΗR and PCR are used to characterize the type-1 CBR. Let R' _b, be the z-th type-1 CBR in the scene. Its contextual descriptors are H M and T _b\- A type-1 CBR has just two states: occluded (occupied) or not. The likelihood of observing a type-1 CBR is evaluated on the whole region. Suppose the contextual descriptors of the region R₁(X) from the corresponding position of R' _bl in the current frame /,(x) are H, and T₁. The likelihood of R' _bl being exposed can be evaluated by matching R,(x) to R' _b\. Based on OΗR, the matching of R₁(X) and R' _bi is defined as

IfR' ii and R,(x) are similar, P_L(H,\R'_{b \}) is close to 1, otherwise, it is close to 0.

lUBlihood of R_t(ya) beloagϊiig to R\_a is

The second term in the sum is the weight of the principal color c^i,,- in the PCR of R' _bu Le., P(E^k _bλ,i\f _fei) =p^k _{b \}Jp_bu. The first term is the likelihood based on the partition evidence of principal color c* _b\,_ι- It is evaluated from the PCRs of R' _M and R₍(x) as

PCnIE₁V,! = 4- tnin I pi_{t l}. ∑ Λ(ci_u,, cDrT

Λ... I «. . _(Sa)

Then there is

Wⁱ- ^₁ I ^ϊ J (9a)

-P(T¹ ₆i I r,) can be obtained in a similar way. Now the matching of R' _bl and R,(x) based on PCR is defined as

P_S,(Α ITi₁ ) = min{ P(T_t ^₁) . P(Xi₁ \T_t)\

(10a)

Assuming that colors and the gradients are independent and different weights are used, the log likelihood of observing R'_bi at time t is

LU = a. logft(#lΛM)+(l-u>.) 1OgFi(TfK₁)

(Ha) where ω_s = 0.6 is chosen empirically.

The type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it. The likelihood of observing a type-2 CBR is evaluated loc- calty. LetT^ = {p^, {E^ = (c^p^) }£} be the PCR of the i-fk type-2 CBR β?_ι2. Suppose iϊ_t(x) is a small ssigbbø-iiαod centered at x, e.g., a 5 x 5 window. The likelihood of i?j(x) belonging to Ri₂ is defined as

where |R₍(x)| is the size of the window and 5(/,(s)|R'₆₂) is a Boolean function defined as

The log likelihood that the pixel x in the current frame belongs to R' _b2 is

Lg(X) = IOgP(JJ₁(X)IBi₂)

(14a)

The appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR. For the z-th type-1 CBR R' _M, the appearance model is defined as M_a(R'_b 0 = (H ^₁, 7^* ₆₁). The spatial model of R' _b\ is defined as its bounding box and the center point, i.e., M_s(R'b ₁)

= (5'₆ ,, x'^" _M)-

To adapt to lighting changes from day to night, besides the active appearance model M_a(R'_{b 1}), a model base which contains up to K_b appearance models of R' ^₁ is used. The models in the base are learned incrementally. The active appearance model is the one from the model base which best fits the current appearance of the CBR. The model base of

Rl is ATB(JlJ₁) = {M*(i?y ^4X₁ ≤ K_b_.

Natural lighting changes slowly and smoothly. Let D be a time duration of 3 to 5 minutes, not limiting, in the example embodiment (i.e. a long duration compared with the frame duration in the video signal). The times of observing the ι-th type-1 CBR during the period are accumulated as

*^eO (15a) and the average of the likelihood values is

ή^hι **^D (16a) where L¹'' _bl > T_LX means R' _b\ is visible at time L If sufficient samples of R' _b\ have been observed during the previous (last) duration (e.g., 2!'^p _bϊ /D > 25%) and the average likelihood value is approaching the threshold T_n (e.g., L_b'f < O.ST_Lι), a new appearance of R' _b\ may be observed. In the coming duration, a new appearance model M^c _a (R'_b 1) = /H'^c _M , T''⁰ _bι } is obtained from a frame in which R,(x) looks mostly like R' _bu i.e., t_c =

L_bl | J- _ If the average likelihood values are low in the two consecutive durations, the active appearance model is replaced. First, the new appearance model M⁰ _a (R' _b i) is compared with the ones in the model base according to (1 Ia). If one is sufficiently close to the new model (i.e. the similarity is larger than T_L/+ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.

Let T _b2 be the PCR descriptor of the j-th type-2 CBR R'_b2, the appearance model of R_b'₂ is then defined as M_a(R'_{b 2}) = (T _bi)- The spatial model of R' _b2 describes the range of the homogeneous region in the image. A binary mask If _b2(x) is used for it, i.e., M_s(R'_{b 2}) ⁼ (U *₂(x))- The spatial model may have to be adjusted in initialization duration when sufficient samples have been observed according to the likelihood values.

Again, a model base is employed to deal with the appearance variations of the type-2 CBRs from day to night. The model base of the z-th type-2 CBR Rib 2 is MB(Rb₂) =

, ■ ft"¹

Wa {^RK>))_k~v^Kiz % ^Aj&- . The models in the model base are learned incrementally through the time durations. First, at each time step t, the binary image of observed parts for R' _b2 is generated as V ''' _b2 (x) = μτ_L2 (L''' _bi W)- The overlapping ratio between the exposed parts and the spatial model for R' _b2 at time t is

U₂ U L_{62 (i?a)}

where ' /7' means intersection and 'u' means union. The larger the ratio is, the more parts of R' _bi are exposed and less pixels of other objects would be involved. At the end of each duration, the times of observing the large part of K _bj during the period is

and the average similarity value between the observed parts and its active model can be computed as

& = i∑% Pdftέ \fL) ' PTs (<&)

^Zk *^D (19a) where P_L(!T_b2 \ T_b\) is calculated according to (10a) with normalized PCRs and T_H = 75% is used. Like the operation for type-1 CBRs, if sufficient samples have been observed during the last duration (i.e., Z'^p _b2 /D > 25%) and the average similarity value is approaching the threshold Tu. (e-g-, S_b'£ ^< 0-8Tl₂), a new appearance model M^cc _a (R' _{b 2}) is generated from the current duration. If the average similarity values are low in the two consecutive durations, the active appearance model will be replaced. If there is a model in the base which is close enough to the new appearance model Ad¹⁰ _a (R! _{b 2}) (i.e. the similarity is larger than T_L2+έ), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.

Let (R' _b }^Nb ,=i be the CBRs of a scene. Given a coming frame /,(x) and a local region R,(x) centered at x in I_t(x), the posterior probability of R,(x) belonging to a CBR R'_b is

P(^^(X)J ₍₂0a)

The prior probability P(R_t(x)) is the same for every pixel in an image. Then the log posterior probability of R₁(X) belonging to R'_b in the current frame /,(x) is defined as

Ql(Rt(X)) = logF(/Mx)|iφ + logP(i?i|x) _{/n λ}

(21a)

The position of a type-1 CBR is already determined by its spatial model. The prior probability P(R! _b |x) is 1 for the position and 0 otherwise. Then, the log posterior probability is equivalent to the log likelihood at the position, i.e., Q'' _bl = L' _bι(R,(x'^J _b\ )) - L^ht _bi for R' _bl. A rate of occluded times over recent frames for each type-1 CBR is used. For R'_b i, the rate is computed as

where β is a smooth factor and β = 0.5 is chosen. A high rate value (close to 1) indicates that R' _bι has been occluded in recent frames.

From the spatial model Zf _i2(x) of the i-th type-2 CBR R'_b2, the prior probability of a pixel x belonging to the region R'_{b 2} can be defined as

Combining (21a), (14a), and (23 a), the log posterior probability of that x is an exposed point ofR'_b2 is

Q$(^χ)

+ logifi&W

(24a) A rate of occluded times over recent frames at each pixel for each type-2 CBR is used. First, to be robust to noise and effect of boundaries, an occluded pixel of a type-2 CBR is confirmed on the local neighborhood R_t(x). Let r_x be the proportion of pixels belonging to R'_b2 in the neighborhood region, i.e.,

s^eHt(x) (25a) and r₂ be the proportion of exposed pixels of R'_b2 in R₍(x) according to the posterior estimates, i.e.,

(26a) where T_Q is chosen as slightly lower than 2T_L2. Then, an occluded pixel of R'_b2 is confirmed if the majority of the pixels within its neighborhood are of R'_b2 and less of them are observed in the current frame. Now the rate is computed as

where T_H— 15% is chosen in the example embodiment.

According to the result a contextual interpretation, three learning rates can be applied at each pixel for different situations in the example embodiment;

Normal learning rate to exposed background pixels with small variations;

Low learning rate to occluded background pixels;

High learning rate to exposed background pixels with significant changes.

An image of control code C,(x) is used, where the value of C,(x) is 0, 1, 2, or 3 where the low, normal, or high learning rate is applied respectively at the pixel (here 0 is for normal learning rate for non-context pixels used for display). First, for the pixels not associated with any contextual background region, C,(x) = 0 is set. The rest of Q(x) is determined according to the results of contextual interpretation. For a pixel x within the Mh type-1 CBR R'_bϊ, if r '"' _b\ ≥ 0.7 that means the CBR is being blocked by a foreground object, C,(x) = 1 is set. Otherwise, if /,(x) is detected as a background point by background subtraction, C,(x) = 2 is set since the CBR is exposed and no significant appearance change is found, but if /₍(x) is detected as a foreground point by background subtraction, a high rate should be applied since it is detected as an exposed CBR point with significant appearance change, i.e., C,(x) = 2. For a pixel of the Mh type-2 CBR R'_bi, if /'₄₂ (x) ≥ 0.7 that means the patch of the CBR is being occluded by a foreground object, C,(x) = 1 is set. Otherwise, if /,(x) is detected as a background point by background subtraction, C,(x) = 2 is set for exposed part of the type-2 CBR with no significant appearance change. But if /₍(x) is detected as a foreground point by background subtraction, C,(x) = 3 is set for the an exposed neighborhood of the type-2 CBR with significant appearance change.

To smoothen the control code temporally at each pixel, four control code images are used. The first two are the previous and current control code images described above, i.e., C₁ i(x) and C,(x), and the second two images are the control codes which really applied for pixel-level background maintenance, i.e., C*, i(x) and C*, (x). The applied control code to the current frame at pixel x is determined by the votes from three other control codes C,-i(x), C,(x), and C*_t i(x). If at least two of the three codes are the same, the control code of high votes is selected. If the three codes are different from each other, the normal learning rate is used, i.e., C*, (x) = 2.

To evaluate the effect of context-controlled background maintenance on adaptive background subtraction, the example embodiment was applied to, two existing methods of ABS were implemented. They are the methods based on Mixture of Gaussian (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC MoG), PFR, and Context-Controlled PFR (CC PFR) were compared. In the test, the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to the double of the normal learning rate and the low learning rate was set to zero. In Figure 1, the leftmost image 102 is a snapshot with manually cropped out contextual background regions e.g. 104, which are type-2 CBRs in this example. In the snapshot image 102, the type-2 CBRs are surrounded by polygon boundaries e.g. 106 of different colors. The second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground. The rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120. The three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context- Controlled PFR) 126, and the corresponding control image 128. In the control images 120, 128, the black regions e.g 130 do not belong to any CBR, the gray regions e.g. 132 are exposed parts of the CBRs with no significant appearance changes, and the white regions e.g. 134 are occluded parts of the CBRs.

According to the example embodiment, for pixels in the regions of exposed parts of the CBRs with no significant appearance changes, the normal learning rate is applied, for pixels in regions of occluded parts of the CBRs, the low learning rate is used. For pixels in regions of exposed parts of CBRs with significant changes (not applicable in the scene shown in Figure 1), the high learning rate would be used as described above. The scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there was no overstaying objects or overcrowds. However, several people kept e.g. 138 moving around, staying somewhere for a while, and performing various activities. Therefore, the center parts of the scene were frequently occluded by persons. Using a constant learning rate in the unmodified ABS methods, some appearance features of the persons were learned into the background models, and then the background subtraction failed to extract the complete figures of the persons in the incoming frames (see images 116, 124). One example frame, Frame #102810, is displayed in Fig. 1.

By using context-controlled background maintenance of the example embodiment applied to the ABS methods, the persons were segmented satisfactorily (see images 118, 126). A quantitative evaluation on 12 frames sampled from the sequence every 200 frames started from Frame #101410 (empty frames were skipped) is listed in Table 1, where the metric value is defined as the ratio between the intersection and union of the ground truth and the segmented regions. According to [2], the performance is good if the metric value is larger than 0.5 and nearly perfect if the metric value is larger than 0.8. From Table 1, it can be seen that, by using the context-controlled background maintenance of the example embodiment applied to the existing ABS methods, the performance of adaptive background subtraction on situations of complex foreground activities can be improved significantly.

Table 1

The contextual features of the example embodiment capture the global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel- level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can be used to preferably achieve a precise segmentation of foreground objects.

The example embodiment exploits contextual interpretation to control the pixel-level background maintenance for adaptive background subtraction. Experimental results show that the example embodiment can improve the performance of adaptive background subtraction for at least situations of high foreground complexities. Figure 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment. At step 202, one or more contextual background representation types are defined. At step 204, an image of a scene in the video signal is segmented into foreground and background regions. At step 206, each background region is classified as belonging to one of the contextual background representation types. At step 208, an orientation histogram representation (OHR), a principle colour representation (PCR), or both, are determined of each background region. At step 210, a current image of the scene in the video signal is received. At step 212, it is determined whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed. At step 214, different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.

While the described example embodiment started from manually cropped out contextual background regions in a snapshot, image segmentation and background object recognition for automatic initialization of contextual models may be performed in different embodiments.

In the example embodiment, principal color representation (PCR) is applied for efficient appearance-based multi-object tracking. In a video surveillance system, object tracking may be applied to a sequence of segmented images generated by background subtraction. In such a case, each segmented image may contain one or several isolated foreground regions. Further, each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera view point). The example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as the segmented regions. For an image sequence captured from a natural public site, each image may contain one or several objects. These objects in the image may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap. It has been recognized by the inventors that these issues make the shape-based object tracking a rather challenging task. However, the inventors have recognized that it is much less likely that a a target object change colors in a sequence from a surveillance camera. Hence, using global color features of an individual object can provide a relatively stable and constant way for object appearance description. This can also lead to a better discrimination of multiple target objects in the scene.

In video surveillance, an object of interest (e.g., a person, vehicle, luggage, etc.) may render a few dominant colors which only span a small portion of the entire color space. Let the rath foreground region detected from the frame at time t be R" _t (x), where x = (x, y)^τ denotes the position of a pixel in the region. Then the corresponding principal color representation (PCR) can be defined as

IT = K, {££ = (4,4)}* J ₍₁₎ where S_n is the size of the region (or the total number of the pixels within the region), c' „ = ('"'« s'_n b' _n)^τ is the RGB values of the /th most significant color under the original color resolution (i.e., 256 levels for each channel), and s' „ is the significance of c' „ for the region. The components E_n are sorted in descending order according to the significance values of the principal colors. Let the current frame of input color images be I,(x), then the significance of fth principal color can be defined as

x€fi? (₂) where ω(x) is a weight function and δ(; ^•) is a delta function. In the example embodiment, ω(x) = 1 is chosen for isolated objects or regions. When locating an object in a group, co(x) may not be equal to 1. If necessary, other weight functions can be used, e.g. Gaussian kernel to suppress the noise around the object's boundary [5]. 5(C₁, C₂) equals to 1 when C₁ = C₂, otherwise it is equal 0. However, in the example embodiment a color distance is used which is not sensitive to noise and illumination changes

2 < C_1^ Gj >

J(c_{t f} C₂) = 1 - - NI² + IM²

(3) where < ^•, ^• > denotes the dot product. The color distance in (3) is then applied to compute the delta function in (2) as

8 = 0.005 is chosen in the example embodiment.

The PCR T _t contains the first N significant colors and their statistics for the region R" _t (x). Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number N to approximate the color features of the region, Le.,

N

(5)

In the example embodiments, using N= 50 in (5) lead to satisfactory results for almost all the regions containing one or a group of objects. Fig. 3 shows two examples of PCRs where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons. The PCRs for the foreground regions are generated through scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region R" , (x) (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.

TABLE 2 THE ALGORITHM TO GENERATE PCR FOR REGIONR", (x)

The aim of object tracking is to allocate a tracked object in the coming frame according to its previous appearance. To achieve this, the likelihood, or the conditional probability of observing the tracked object in a region of the current frame, has to be evaluated. In the following, the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is the extended to the scale-invariant likelihood.

Let O^m ,- ₁ be the mth. tracked object described by its PCR η-m — <_s <βi _ _(ct _si \\N i I

-^t-ι ^{i WU '}m ^\ -mⁱ m ^fSi=Λ f [obtained previously when the torched object was an isolated object, and R" , be the «th foreground region detected at time t. According to the law of the total probability, the likelihood of the object O^m _t- ₁ in the region R" , can be defined as

P(mθT-ι) = ∑ P(RIlEi_n)P(EUOZ₁)

(6) where each P(R" , [Ef_n ) is the likelihood of that the object O^m _t- 1 appearing in the region R" _t based on the partition evidence E'_m , and P(E? _m \O^m ,_ 0 is the conditional probability of the evidence E_n, given the object Oⁿ ,_ ,. Using the PCR T^m ,. i of the object O^m ,_ ,, the conditional probability P(E _m \O^m ,- χ) can be defined as the weight of the principal color c' _m for its appearance, P(EiJOZ₁) = ^

(7)

Using the PCRs J"V , of the object O^m ,_ , and T _t for the region R" , , the likelihood

"\*H. I -^w can be computed according to their significance values of the same color component c'_m

miu/l.^^^^^xU^min /^^^c^ciK}

Substituting (7) and (8) into (6) yields

ptmotύ = —

It is noted that the above likelihood (9) is evaluated under the assumption that the size variation of the object is small. However, if the size change is large the likelihood value will be affected. Therefore, a definition of likelihood based on the normalized PCRs is used in the example embodiment. Let

Tg₁

4_t = and TJ^ι = {l.

&i = 4./-V- The likelihood based on the normalized PCRs becomes

If the region R" , only contains a single object, the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs. However, if <9^m,_ i is one of the objects in the group R", , the likelihood on original PCRs is better. Hence, the scale-invariant likelihood of observing a given object O^m ,_ i in the region R", is defined as,

Heuristically, if the region R" is a single object, the likelihood computed from normalized PCRs appears more reliable. However, if OfI₁ is one of the objects in a group R" , the likelihood from un-normalized PCRs appears better since the object is smaller than the group. Equ (11) can provide a suitable measurement for these two cases in the example embodiment. Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene. When multiple target objects frequently merge and separate from one another in a public site, tracking one individual object is no longer an independent process. Multi-object tracking can be formulated as a global Maximum A Posterior (MAP) problem for all the tracked objects. With the segmented foreground regions provided by background subtraction, in the example embodiment the global MAP problem can be approximately decomposed as two subproblems: assignment and location. Using the principal color representation (PCR) and the associated likelihood function, the example embodiments uses sequential solutions to these two subproblems, as detailed below.

Let Ot-i — (O™ _i\^ι be the set of tracked target objects in the previous, frame I₍_i(x), and β(._i

be the set of state parameters describing their positions at time t - 1. The task of multi-object tracking is to estimate the states θ_s of tracked objects in the current frame 1« (x) given their previous appearance models O_t_i and states θt-i. This can be formulated as the Maximum A Posterior (MAP) estimation for the state parameters θ.\

©J = MgmaxP(θ_t|I,(x), C7_t_i_l θ,-i) θ< ^• (12)

When several objects overlap one another, they cannot be tracked as independent objects. With foreground regions K_t. —

provided by background subtraction. (12) can be simplified as θt = aignωx.PfθtlRt. CV.i. θt-i) θ* (13) where the tracked objects

are in the regions

from the previous frame. If a region (e.g., JR£__I) only contains one object (e.g., OfLj) i* is ²^ isolated object, otherwise the region is a group. Objects belonging to different regions in the previous frame may merge into a new group region (e.g., i2£) in the current frame. Also, the objects in a group region (eg., .Rf_-1) in the previous frame may separate to several regions in the current frame. For real-time processing with, moderate or high frame rate of image acquisition, the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in. the consecutive frames. Exploiting such a relation, the problem (13) can be further decomposed by using directed acyclic graphs (DAGs)- The directed acyclic graphes (DAGs) for the regions detected in the consecutive frames r_f_i(x) and I_f(x) are constructed ia the following way. Let the regions from the previous, and current frames be denoted as the nodes and be laid in two layers: the parent layer and the child layer. The parent layer consists of nodes representing the regions {-fijLil_jϊϊ¹ ^¹¹ *^ue previous^, frame It_i(x). and the child layer consists of nodes denoting the regions

ω the current frame

T₄ fx). Suppose R^₁ and R* are the jth and kth. regioas in the previous and current frames, respectively, then the directional link from Jf^_-1 to R^ can be defined as =^

(14)

This implies that there is a link only when the two regions have soaie overlap. A directed acyclic graph (DAG) is formed by a set of nodes in which every node connects to one or more nodes in the same group. A set of DAGs (graphs) can be generated. An example of graphs for two consecutive frames is illustrated in Figure 4. The notations for the DAGs are defined as follows. For the ith graph, the parent nodes are denoted as {n^}^' where each node rtjf represents one of the regions

and the chϋd nodes are denoted as {«]'⁹}j_!.i' where each node represents one of the regions {.fl* } tii- The ith DAG can thus be denoted as G. = (K'lSi - K*}£i', {&}>■ The objects in a parent node n{* are denoted as

If Miφ — 1 then the node is a single object, otherwise it is a group of &!,_# objects. The object o"l^!f is one of the objects

The objects in a child node n *. which may be newly generated objects or objects, tracked from ϊts parent nodes, are denoted as {o'j '^"'}^ ₌₁. After processing all graphs, the objects in the child nodes are reordered as {Oj^* J-J₁H₁. They are the set of tracked target objects in the current frame.

Since there is no link between different DAGs, the objects m the parent nodes of one graph can be tracked independently of the other graphs. Let O|__! represent the set of objects in the parent nodes of graph G_it Ls., Q^₁ = {{%1'f }^*=i}^i- Then the probability of the states for all the tracked objects in image I*(x) becomes

La P(Θ,|π_f, CV₁, θ_t_0 = J] P(&l\G_h C⁾^₁, Qt₁)

(15) where θ₈ = (ΘJ, - - - . θf⁰) and L_G is the number of DAGs. According to (15), (13) can be decomposed as finding a (β\)* for each graph such that

(^θjf = ^ar^grøP(θ{|G<,^θl_₁.^θ;_₁) θ' (16) If there are several parent and child nodes in a graph, and some parent nodes represent groups, (16) is still a nontrivial problem. The example embodiment solves the problem in two sequential steps from coarse to fine. The problem is decomposed approximately as two sub- problems: assignment and location. The coarse assignment process assigns each object in a parent node to one of its child nodes while the fine location process determines the new states of the objects assigned to each child node.

Assignment:

In tliis step, the tracked objects in each parent node is assigned to its child nodes based on the largest posterior probability. Let θ_t' be the parameter vector describing the assignment of the tracked objects O\__Λ to the child nodes { nψ \^~ _qt{. The posterior probability of tfiε assignment for graph Qi can be expressed as

where CJJf₁ = {o^f }%*_ml are the tracked objects in node rφ. and A^^p) = K* : V_pq = 1 } J₁ are die. child nodes of njf. The parameter vector is θ, = (&_t'^} . • • • , θ_f"^{Vθ i}) . The best assignment for the tracked objects

^SU(^^{i tua}* ^ results in the best observation of the objects in the corresponding child nodes, that is

(§<*)^• = arg max P(^[N^\ O^ , . ^f₁ )

^V (18)

Here ΘJ^{1 p} can be considered as the coarse tracking parameters indicating ia which child nodes (regions) the objects are observed without concerning the exact new positions of the tracked objects in the child regions.

Location:

In this step, the new staϊes of the tracked objects assigned to each child node (e.g., region Fή) are determined. Let Q)? = {o?*}*^ be the objects assigned to the child node nf from its parent nodes. That is. O/^{1 1} is a subset of C^_-1 according to the assignment parameters (©^«)* = ((6ip)*, • • • , (θ_t'^jVtu)'). After the assignment, objects in each child node can be tracked independently of objects in the other child nodes. Hence, the posterior probability of the new states for the tracked objects in the graph Gi can be evaluated as

(19) where ΘJ = (θf¹, ■ • • , θ^¹--'), and Q^l_x is the set of previous state parameters for objects Of¹ From (19), locating the objects in the child node

is expressed as

Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below.

Assuming that {ϋ^'}^._ι are the foreground regions and { B* J-^i_-1 are their bounding boxes detected at time t, then their PCRs can be obtained as {T/'}^_r Let (GJ^f₁ be the set of directed acyclic graphes (DAGs) for the foreground regions between the consecutive frames. If there is only one object in a graph G_if then the object will be tracked as an isolated object Otherwise, multi-object tracking will be performed according to Eq&. (IS) and (20). For tracking multiple objects in a group, the posterior probability of the new state for each object is determined on both spatial position and depth relationship. Hence. 2-D state is used for each object. The state vector for an object Of is θ% = (6|^l, rj¹) where i»" is the bounding box describing its spatial position and ι?|* is the likelihood λ?alue describing its depth position in the group.

Tracking Isolated Objects

If the Mi graph consists of only one child node (i.e., Oi = {n*'¹)), a new object appears and is initialized as a} in Gi with a new id number. Suppose the node n^{{¹ represents the region Rf, then the PCR and bounding box of

are set as T_t ^l = if and 6* = B^. Since ø* is an isolated object, it is not occluded by any other objects. The depth state is set as v] = F{A||o*) = 1. It's

If the ith graph, contains one parent node and one child node, and the parent node is associated with one object, tke graph represents the simple case of isolated object tracking. Let the graph be Gi — {κoiⁿi_> ^ii)_> ^tue object in the parent node be OfI₁, and the child node n\ represent the region jRf, then the object

(i.e., O™ _j and

have the same id number). Its state becomes θ\ = (1*1, vj) with bj = B_f and w| = 1. In addition, its PCR is updated as T_% = Tf to follow the gradual variation of the object.

If the Mi graph only contains; one parent node which has no child nodes, then the previous objects in the parent node are assumed to have disappeared in the αurent frame. Tracking is terminated for these objects. Tracking Multiple Objects in a Graph

If the zth graph G₁- contains multiple parent nodes or child nodes, the operations of assignment and location will be performed. In the following description of the operations for one graph, for the sake of notational convenience the index i for the graph G, is omitted below.

Assignment:

Let n^ be a parent node in the graph C?. Of_-1 = (σ™iK_?ii be the associated objects, and Av = {nf : fø = I)^₁ be its child nodes. If the parent node has more than one child node, the assignment of objects 0Jf_-1 ϊs determined by Eq. (18). However, with varying numbers of objects and child nodes, Eq. (18) is a nontrivial problem of optimal configuration. To make the problem tractable, a sequential solution is proposed based on their PCRs and the depth relations among the objects. hi each group, the close and non-occluded objects have richer visible information than the distant or occluded objects. This means that an occluded object less affects the tracking of the objects occluding it. Hence, the assignment can be solved sequentially from the most visible one to the least visible one. Let the objects

ώ the parent node rtζ be ordered according to their visible sizes. Assuming that the correct assignment of the object o™__λ is ff^ = q_m which assigns of__x to the child node ref* (uf^m €

and the child node nf¹ represents the region

F$^m, then the posterior probability of assignment for objects Of_-1 = (0Jl₁K_nL₁ is computed as

P(θf|Λ^/r°", O∑_,.θf_,)

where Λ^(m - 1) = B^ - ∑£^~? δ(θ_t ^ι = φn)4-i represents the region after excluding the objects previously assigned to it before σ™ j. Note that the assignments of the objects (0"I₁ } are not independent The assignment of one object is affected by the previous objects with higher ranks. This means the assignment of each object can be performed one-by-one sequentially from the most to the least visible ones. For each object, the posterior probability of assignment can be evaluated using Bayes rale,

P (8? = q_m\n!*^>(m - l)_f <C.i_'*£i) = P (3f"(m - IKl₁ J? = <?,„) P{θ? = (?,,.|Ci)

(22) The first tenia oil the rigiithand side of (22) is the likelihood of observing the object 0^L₁ in the region R¹I"¹ with die exclusion of previously assigned objects, while the second term is the prior probability of θ^¹ = ςf_TO given the previous state 0Jl₁. For assignment, (22) can be evaluated on PCR. Assuming that a child node rζ € M^ represents the region ij*. Let T* and T^₁ be the PCELs of B% and 0Jl₁, respectively. Using Eqs. (21) and (22), the best assignment of die objects {Oj^αK_πϋα ^¹ ώ^e parent node rig can be achieved one-by-one sequentially according to their depth order by

from m = 1 to M_p. where if (in - 1) = 7* - ∑™^ δφ{ = q)T_{l_i represents the PCEL of

The sequential solution to Eq. (18) using Eq. (23) is computed in two steps. First, the objects (⁰JIiI_m=_I ^^{11 me} parent node n^ are sorted in a list according to their visible parts. An iterative process is then performed from the most visible to the least visible object. In each iteration, die object in the top of the list is assigned to one of the child nodes according to (23). Once an object is assigned to a child node, it is removed from the list and its visual evidence is excluded from the PCR of the child region. Details are described below.

Let ^ I'Jli Y_nJLi be the likelihoods of objects {of_li}_m!lι ώ ^o computed in the previous time step. Let s_m be the size of 0Jl₁ from its PCR T^₁. The visible part of 0J¹I₁ in n£ is estimated as Q2i = υfliV.- We denote (o7-il_»*=ι ^⁸ being sorted in descendant order according to the values of {C£i }_m=i ^and P^{laced ia a} ϋst. Before the iteration. 2*(θ) = 'if is set for the regions associated with the child nodes .V₁ -

In each iteration, the top object in the list is pop out. Suppose it is the mth object 0^I₁. The likelihood in (23) is calculated as P*{fj*(m - Dtø-i) according to Eq. (11). The prior probability in (23) is evaluated based on the shape similarity and center distance between the bounding boxes. Let 6Jl₁ and B^ be the bounding boxes of 0"I₁ and R^ associated with the child node. n|, x£* and xj be their centers, and d_m and d_k be their diagonal lengthes. respectively. Then, the shape similarity between two boxes (center aligned) is defined as

where "n" denotes the intersection and "U" denotes the union. The center distance between two boxes is defined as with ά — (d_m + rf_fc)/2. Tlie prior probabilitv^; is defined as

P(θ™ = g|0£_i) = 0.5(m, +??J_of). The object 0^L₁ is assigned to the child node nf™ according to Eq. (23). The last operation in this iteration is exclusion that removes the visual information of tiie object o"_' _x from TJ™(m - 1) associated with the child node nf". Let T^₁ and f *^m(nι - 1) be the PCRs of

- T^₁ is updated from TJ?" (m — 1) for its principal colors one by one. For the jth element of principal color c{ and significance sj^, the following updating is performed.

^ = < - As ₍₂₄₎

When all the elements in T,*^m(m— 1) are updated, the generated PCR JS lf^m(m). For the regions associated WJtIi the rest of child aodes in

Tf (m) = If (m - 1) is set.

Location:

Let Of

are the objects assigned to a child node

in the graph G^*. The new states of the objects should be determined by solving Eq. (20).

Locating the objects in a region are not independent of each other, but the front ones with richer visible information are less affected by the occluded ones. Hence, in the example embodiment objects in the node are located one by one from the most visible to the least visible ones based on their visible parts. The posterior probability of new states for all the objects in the node can be expressed as

where θf = {θ]. - • • .$*), R* is the region associated with the node n'f, and kf(n — 1) = JR_J — YJS_I C{ represents the region in which the visual evidence of the first n — 1 objects (θ_j , - ■ > , o"~^Λ) have been excluded at the located positions. According to (25). locating objects Of according to (20) is equivalent to locating them one by oaε sequentially according to

{θ?y = argmaxP følφfn - 1 W-^-i)

where {o"}_ni_j are sorted in descendent order according to their visible sizes.

The sequential solution to the problem Eq. (20) and Eq. (26) contains two steps. In the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes. In the second step, an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.

Assuming that an object of in the child node ?ιf is the object o™^p _t assigned from the parent ng, then there is <%__λ

[J^₁ computed from the previous frame. The likelihood of observing object όξ in the child node iϊ{ (or region Rf) in the current frame can be evaluated as /"'(.RfIo") according to Eq. (11). Since the motion of object between the two consecutive frames is assumed small, the visible parts of the object of in region R^ can be estimated as

where s_n and S_k are the sizes of the object 0™ and the region R^. respectively. In Eq. (27), η is a weight to smooth the estimates from consecutive frames (r_/ = 0.5 is chosen in this study). (0"Jj₁J₁ are then sorted in descendent order according to the -values of {C-T_n !i ^an& *^βen placed in a list. To perform exclusion for J?<^fc (n — 1) = /?* — ∑JLT_I o[. a weight image ω_n(x) is used. If the pixel x is likely to belong to one of the previously located objects (of . - - ^■ . o"^"1), αv,-.(x) is low (ft* 0), otherwise, it is high (sa 1). For initialization, set u,'₀(x) = 1 for all the pixels belonging to the region R$. and ωp(x) = 0 otherwise.

In each iteration, the top object in the list is pop up. Assuming that it is the rath object of with the initial position represented by the previous bounding bos Bi^0*1 = Of_-1 centered at x,v ', and its PCR is Tf = {,s_B. {££ = (e£_; 4)})>L_j } = T^f coming from the object 0 "J₁. Locating

the object 0" in the region R% according to Eq. (26) is equivalent to finding a position where the maximum value of probability density occurs for observing the object. This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7]. A two-stage mean-shift procedure is proposed based on the evidence of the object^" s principal colors. In the first stage, the gravity center of the pixels of each principal color component is computed as

withy = 1, ^{■ • •} ,N, where r indicates a current step of mean-shift iteration. In the second stage, the new position of the object o", is generated as the weighted average of the gravity centers

performed as

The complete algorithm of multi-object tracking based on PCR in the example embodiment is summarized in Table 3.

TABLE 3

THE SUMMARY OF THE MULTI-OBJECT TRACKING ALGORITHM Input: color image I,(x) and segmented image S_t(x); Preprocessing: generate graphs (G₁ , i = 1, . . . L_G); For G_t, i = 1, , . . LQ , do: Assignment: for each parent node n^ in G_υ p = 1, . . . , M_iιPι do: a.l: if Π_Q' ^P has no child node, the objects in it are deleted; a.2: if n^' has only one child node n['^q , all the objects in it are assigned to n\'^q ; a.3: if n^' has multiple child nodes; a.3.1: sort the objects {o™__{} ^f₁ in Π_Q' ^P , and then assign them one-by-one from the first to the last as follows: a.3.1.1: assign o"__x to the child node rι{^{ι qm} according to (23); a.3.1.2: exclude the visual information of o™__x from the PCR of n['^qm

Location: for each child node n['^p in G_u q = 1, . . . , M_q , do:

1.1: if no object is assigned to the node, check if it is a disappeared object; if not, set it as a new object;

1.2: if only one object is assigned to the node, update the state and PCR of the object; 1.3: if multiple objects are assigned to the node:

1.3.1: sort the objects {o" } ^ ¹ ₁ in the node using (27), and then locate the objects one-by-one as follows:

1.3.1.1: apply mean-shift to locate o" using (28) and (29);

1.3.1.2: exclude the visual evidence of o" at the location in R₁ using (31)

1.3.2: if the likelihood of observing an object in the region is less than 0.1, set the object as disappeared.

Clearance: if an object disappeared for more than 50 frames, delete the object. End

The algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location. In the assignment phase, each parent node in the DAG is processed. In the location phase, assigned objects in each child node are tracked. To be robust to the separation of small parts from the tracked object due to segmentation errors, small objects in a group with likelihood values less than 0.1 are set as disappeared. To prevent the losing of small or heavily occluded objects in a group, the records of disappeared objects are kept for 50 frames. When a new object is detected, it is compared with disappeared objects according to their PCRs, sizes and distances. If it compares to a disappeared object the tracking will be restored, otherwise a new object is created.

In the example embodiment, segmenting individual persons in a group with domain knowledge will be preferred. For example, in the example embodiment knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.

Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment. At step 502, first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked. At step 504, one or more directed acrylic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node. At step 506, for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.

At step 508, for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image. At step 510, for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a means-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

When an object stops moving and stays in the same position in the scene for a while, the object would be absorbed into the background gradually with existing background updating techniques. That means the object would be lost in the segmented foreground images. On the other hand, in e.g. crowd scenes, if one can separate the moving objects from the stationary objects in the scene, one can reduce the overlapping of multiple foreground objects. This would make the tracking of each individual easier and more robust. In the described example embodiment, a layer tracking algorithm is designed to track stationary objects through even frequent occlusions. When the object starts moving, the objected is identified as moving object and tracked by a moving object tracking algorithm. In the example embodiment, the stationary objects include not only static non-living objects but also include motionless living objects, e.g. a standing or sitting person. Since the living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth with no change of identity in the example embodiment.

When an object stops moving and stays in a scene in frame of a video signal, the appearance variation of the object is typically small through a sequence of frames. A template image of the object is used to represent such a stationary object in the example embodiment.

Let [B₁]^_x. be a sequence of bounding boxes of the zth tracked object in the τ_b most current frames as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other. For a selected length parameter τ_b , if the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, τ_b is set as 10 frames, corresponding to about 1 second. To track the stationary object in e.g. a busy site in which the object may be occluded frequently by moving foreground objects, a layer representation based on the object's template image is built. The layer representation of the detected stationary object is defined as

where A\ is the template image of the object maintained at time t, T₁ is the Principal

Colour Representation (PCR) of the object stored when the object was detected as a stationary object. That is, the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object. d/ is the difference measure between the template Af and the frame J₇-(S) for the corresponding region of Aj , d_c ^J is the difference measure between the consecutive frames /^₁(S) and /_y (s) for the region of the template, d_p ^J is the visibility measure of the object from the corresponding region in the frame I • (s) , and s_k is an estimated state of the stationary object at time k. Measures in τ_d most current frames and states in τ_s most current frames are recorded. The details of calculation of these measures and estimating states from these measures for each layer object will be described below.

In e.g. a busy public site, if there are some objects staying in the scene, they will often merge with moving objects and the result in a high complexity for object tracking. In separating the layer or stationary objects from moving objects and track the stationary and moving objects separately, the example embodiment can greatly enhance object tracking much. Let c = I₁ (s) be the color of a foreground point in the region of /th template image. According to Bayesian rule, the probability of the point belonging to the background is

p_{(b] c) =} ^pmm _(2b)

/>(c)

where p(c \ b) can be obtained from the Principal Feature Representation (PFR) of the background. The PFR at each pixel is used to characterize the background. Let s = (x, y) be a pixel of the image. For each type of feature, a table which records the principal feature vectors and their statistics at S

is built, where p_v' (b) is the learned probability of S belonging to the background (P_s (b) ) based on the observation of the feature v and S_v' (i) records the statistics of the M_v most frequent feature vectors at s , Each S_v' (i) contains three components S_y' (i) = (4b)

where D_v is the dimension of the vector V. The S_v' (i) in the table are sorted in descending order with respect to the value p_v' . Hence, the first N_v elements are used as principal features. Three types of features are used in the example embodiment. They are a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence), respectively. Among them, color and gradient features are stable for static background parts and color co-occurrence features are suitable for dynamic background parts. Three tables are used to learn the possible principal features of the three types for the background. They are T_c (s) , T_e (s) , and T_cc (s) . The color vector is c = (R₁ , G₁ , B₁ ) from the input color frame. The gradient vector is e = (g_x , g ) obtained by Sobel operator. The color co-occurrence vector is cc = (R_1-1 , G_1-1 , B_t__\ , R₁ , G₁ , B₁) with 32 levels for each color component.

The probability of the pixel becoming a background point at the current frame can be calculated as

N_s(b)

P(b) = (5b) M.

where N_s (b) is the background points in a small window W_s centered at S in the previous frame, and M_s is the number of points within the window. Similarly, the probabilities of s belonging to the layer (stationery) object or moving foreground object are

p(c) p(c)

respectively. The probabilities p(c \ I) and p(c \ f) can be calculated with Gaussian kernels. Let c'_x be the color of point x in the template A$^~x within the window W_s . Then p(c I Z) can be calculated as p(c I /) = m xeatl'x. {k_c(c'_x - c)k_s(x - s)} (7b)

where the Gaussian kernels can be written as Λ: (v) = exp 1 ~2 with g=c or s

²< indicating the kernel for color or spatial vector, respectively. Again, let c^ be the color of a point x in the window W₃ and in the region of moving foreground objects from the last frame /_:_, (s) . The probability p(c \ f) can be calculated as

p(c I /) = m xeaffx. {k_c(cζ - c)k_s (x - s)} (8b)

The priors can be calculated as

where N₃(I) and N₃(f) are the number of points belonging to the layer object and moving objects within the window W_s in the previous frame.

Comparing Equ (2b) and (6b), it can be found that p(c) has become a common normalization factor. Hence, the likelihoods of S belonging to background, the layer object, or the moving object can be defined as

p\b I c) = p(c I b)p(b), p\l I c) = p(c I l)p(l), and p'(f | c) = p(c \ f)p(f) (10b)

respectively. The pixel s would be assigned according to the greatest likelihood value. The mask for the moving objects is used as the input for moving object tracking.

Stationary objects may also be involved in several changes and interactions with other objects through the sequence. For a non-living object, it may e.g. undergo illumination changes, be occluded and removed by other objects. For a living object, the object may change pose or move bodyparts, or start moving again. During tracking the stationary object, the object's states are estimated and the template image updated correspondently in the example embodiment. In the example embodiment, five states are used to describe the layer object, they are: motionless, occluded, removed, inner-motion, and start-moving. The state is estimated according to various change measures from a short sequence of most recent frames.

Let s be a point in template Aj (s), of the zth layer object. The difference between the template and a current frame at s can be evaluated as

K ^ ^{~ 7}< ^(s) < Th_Λ (l ib)

where Th_d is the threshold according to image noise. Then, the difference measure between the template and current frame for the layer object is defined as

where S_A' is the size of the template.

Similarly, for a point s in the template A\ (s), the difference between consecutive frames at the point is evaluated as

l^f I⁷'^{00 " 7}'-^{l(S)| < Tkd} (13b)

otherwise

The difference measure between consecutive frames for the layer object is defined as

The difference measures are calculated on color vectors.

If the changes over the region of the template are caused by motion of the object itself, even if the differences dj and dj. would be large, the visibility (visibility measure d_p ^J ) of the object in the current frame based on PCR would still be high since the PCR is a global representation not related to spatial information. On the other hand, if the changes are caused by occlusion of other objects, the visibility of the layer object in the current frame would be low. Let 7 000206

41

T. be the PCR of the layer object in H₁ that was stored when the object was detected as a stationary object, and T₁' be the PCR from the region overlapped by the template Aj in the current frame. Then the visibility measure of the layer object in the current can be evaluated as d_p' = P(T_j I T_J) . More particular, Let O_m'^~l be an object in /^₁(S) , and O_n' be a region in /,(s) .

According to Bayesian law, the probability of observing O'^~λ in O_n' can be computed as

P(O_π' I O_m'^~l) = ∑P(O_n' I EJ¹ PiEl I C) (15b)

1 = 1

From the definition of PCR, the significance of c'_m for O'^~l is P(E_1n \ O'^~l) = P[Jp_n , and the likelihood of observing O'^~l in O_n' according to the evidence of c'_m is

Pψ_n ^> | E;) (16b)

where p_c, is the significance of c'_m from the region O_n' . Let C(c'_m) is the subset of the

principal colors from O_n' which match c'_m . p can be calculated as

P_clln = ∑Pt ⁽17^b)

A:c* eC(c;,, )

Now the visibility measure becomes

Pφ = Pψ_n' I O'-^{) = min{p>_c, } (18b)

With the change measures in a short sequence of τ_d most current frames (i.e. image frames from I_t__τ (x) to /, (x) ) evaluated above, with τ_d normally set to 10 frames in the example embodiment, the states of the tracked layer object are estimated by heuristic rules in the example embodiment:

Rule 1: motionless: If both d{ and dj. are low through the sequence, it is motionless;

Rule 2: occluded: If both dj and d_c ^J turn to moderate or high and d_p ^J turns low through the sequence, as well as there are moving objects overlapping the region of the template A\ as determined from the bounding boxes of such moving objects in a moving object tracking algorithm applied, the layer object is occluded; Rule 3: removed: If both dj and dj. turn to high and d_p ^j turns low, and then d_c ^J turns low through the sequence with no moving object overlapping the region of the template, the layer object is removed;

Rule 4: inner-motion: If both d{ and d_c ^J turn to moderate and then d_c ^] turns low through the sequence, while d^J keeps being high, this means the layer object has changed its pose or moved part of its body but still stays there;

Rule 5: start moving: If both dj and dj. turn and keeps being moderate, and d_p ^J keeps being high through the sequence, as well as there is a shift of the layer object, this means the layer object starts moving again.

The parameters for the rules are determined according to a knowledge base of human perceived semantic meanings and an evaluation from real-world videos in the example embodiment. In the example embodiment, but not limiting, for the above rules, the difference measures for dj and d_c ^J are low if they are less than 0.25, they are of moderate if they are within (0.25, 0.75), and they are high if they are larger than 0.75. The visibility measure d_p ^j is low if it is less than 0.6, otherwise, it is high. The measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template Aj . If the number of expanded pixels is larger than 50% of the template size, the "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiment and as based on the relevant knowledge base.

To track the layer object more robustly in the example embodiment, the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene. The five most recent states for each layer object (τ_s = 5) are recorded. However, it will be appreciated that other values may be used in different embodiments. If one state has more than 3 supports, the state is confirmed. For the corresponding state, the following updating is performed.

If the layer object is confirmed as being motionless, a smooth operation is performed to the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current fame will replace the template. If the object is occluded, no updating will be performed. If the object is classified as start-moving, the object will be transformed as a moving object with the same ID and corresponding PCR, mask, and position for tracking by a moving object tracking algorithm. The layer representation of the object will be deleted. If the object is detected as removed, the object will be transformed as a disappeared object and its layer representation will be destroyed. With these operations, the target object moving around, staying somewhere for a while, and moving again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.

Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment. At step 602, it is detected that a tracked moving object has become stationary over a sequence of frames. At step 604, a template image of the stationary object is generated based at least one of the frames in the sequence. At step 606, a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.

Event detection:

The structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules, foreground segmentation module 701, moving object tracking module 702, stationary object tracking module 704, and event detection module 706.

The foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8]. The background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.

The moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702. As described above, to deal with great variations of target objects in shapes and scales as well as complex occlusions, moving objects are represented by the models of principal color representation which exploits a few most significant colors and their statistics to characterize the appearance of each tracked object. When a tracked object has been detected as stopping moving, a layer representation, or a template, for the object is established and will be tracked by the stationary object tracking module 704 using the method and system of the described example embodiment. At each time step, the states of templates for the objects are estimated with fuzzy reasoning. The template for one object may shift between five states: motionless, interior motion, occluded, starting moving, and removed. When a template for an object is detected as starting to move, the template for the object will be deleted and the object will be shifted as a moving object and then tracked by the moving object tracking module 702.

In the event detection module 706, semantic models based on Finite State Machines (FSM) are designed to detect suspected scenarios. In the system 700 of the example implementation, four types of unusual events are detected. They are unattended objects, theft, loitering persons, unattended vehicles or unconscious persons.

An "event" is an abstract symbolic concept of what has happened in the scene. It is the semantic level description of the spatio-temporal concatenation of movements and actions of interesting objects in the scene. Event detection in video understanding is a high level procedure which identifies specific events by interpreting the sequences of observed perceptual features from intermediate level processing. It is a step that bridges the numerical level and the symbolic level. The fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.

In implementations based on the described example embodiment, unusual events are described by the spatio-temporal evolution of object's states, movements, and actions. On a semantic level, each event can be defined as a sequential succession of a few well-defined states. An event could be started at one or more initial states, and then one state can transit to the next state when new conditions are met as the scene evolves in time. When a specific state is reached, the event is declared. State transition may also happen from an intermediate state back to a previous state if some conditions no more hold for the state. The semantic representation can be modelled based on Finite State Machines (FSM). The FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.

Using Finite State Machine, each specific event can be represented by a directed graph

G. = (S_i ,E^*), where Sf is the set of nodes representing the states and Ef is the set of edges representing the transitions. One example of a FSM 800 is described in Figure 8. Any new object is initiated to be state "0" 802 for all the events defined. This is the initial state. The FSM 800 is truly started only when some conditions are met and the active node transits to the next intermediate state, i.e., state "1" 804. There can be more than one intermediate state for the FSM 800 of an event, depending on the complexity of an event. The FSM 800 reaches the final state "End"406 when all the conditions are met and then the corresponding specific event is triggered. The FSM 800 is updated at each new frame. The FSM 800 could have the self-loop transition for each state. Although the FSM 800 could remain at the same state, some or all properties of the object may have changed. At least, a time counter is incremented for each frame.

The more complicated an event, the bigger is N, i.e. the number of intermediate states in the FSM 800, and the more is the chance to deliver an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.

The input of an FSM is the numerical perceptual features generated by moving and stationary object tracking modules (compare 702 and 704 in Figure 7). The visual cues of each tracked object can include shape, position, motion, and relations with others. The visual cues in the example implementation are:

- Object ID: the identity number of each tracked foreground object;

- Box: bounding box of the tracked object in current frame;

- Size: the area of the object in current frame;

- Status: indicates whether the tracked object is moving around or stationary,

- StayTime: indicates how long the object has stayed in the scene;

- InGroup: indicates whether the object is an isolated one or merged with others;

- Visibility: a measure within [0,1] indicates the degree of occlusion when overlapping with others;

- Motion: a measure within [0,1] indicates the degree of interior motion of a stationary object.

The general processing flow for event detection in the example implementation is shown in Table III.

An advantage of the tracking modules (compare 702, 704 in Figure 7) is the capability to resume tracking of some objects that are lost for a few frames. The two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation. Thus when an active object does not appear in the track records of the current frame, one preferably determines whether it is temporarily lost or whether there is a genuine disappearance. To achieve this, a first-in-first-out (FIFO) stack is built to contain the track records of N frames. O_Tracked are the track records of the previous N-th frame and the triggered event is delayed by N frames. As such, in the example implementation it is possible to 'look forward' to check the case of disappearance of an object, with N= 30 in the example implementation. With a processing rate of 8 frames/sec or above, this represents a delay of less than 4 seconds. It will be appreciated that the delay can be balanced against the accuracy of detection in different implementations.

Loitering Detection

Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene with the duration t > T_Loilering . The FSM is initialized for each new object. The FSM has one intermediate state: "Stay" which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay":

- The object is classified as human;

- The object moves in the scene (moving around or staying somewhere with frequent interior motion). hi state "Stay", a time counter / is continuously incremented as new frames are coming in. When t > T_Loltering , the FSM transits from state "Stay" to state "Loiter" and a loitering event is triggered.

Unconscious Person Detection

As defined in the example implementation, this event also involves one object, a person. It is defined as an object becoming complete static with the duration t > T_Sta[ic . The FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion. The second intermediate state of the FSM is "S", which indicates a person becoming and staying static, or complete motionless. There are two conditions for the transition from state "M" to state "S":

- The position of the person does not change compared to the previous frame;

- The interior motion of the person m < T_lnMoιion .

In state "S", a time counter t is continuously incremented as new frames are coming in. When / > T_Static , the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of unconscious person include a sleeping or faint person. It will be appreciated that similar condictions can be used to detetc e.g. a vehicle staying overtime in a zone for short stopping, in which case the object of interest is changed to vehicle instead of person.

Unattended Object Detection

This event as defined in the example implementation involves two objects. The FSM is initialized for each new object. When the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In the state, the object is associated with its owner. If the owner leaves the scene covered by the camera, the FSM transits from state "Station" to state "UO" and the 'Unattended Object' is declared.

Theft Detection

This event as defined in the example implementation involves three objects. The FSM is initialized for each new object. Similar to the event of unattended object, when the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In the state, the object is associated with its owner. However, when the object disappears and this happens with that another object got it and the owner still stays in the scene, the FSM transits from the state "Station" to the state "Theft" and a 'Theft' event is declared, meanwhile, the second person is identified as the potential thief.

The method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.

The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.

The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).

The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.

The components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.

The application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by the processor 918. Intermediate storage of program data maybe accomplished using RAM 920.

It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

References

[1] D. Lowe. Distinctive image features from scale-invariant key-points. Int'lJ. Computer Vision, 60(2):91-110, 2004.

[2] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian. Statistical modeling of complex background for foreground object detection. IEEE Trans. Image Processing, 13(11): 1459-1472, 2004.

[3] L. Li and M. K. H. Leung. Integrating intensity and texture differences for robust change detection. IEEE Trans. Image Processing, 11 (2): 105-112, 2002.

[4] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):747-757, August 2000.

[5] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564-577, 2003.

[6] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603- 619, 2002.

[7] Y. Cheng, "Mean Shift, Mode Seeking, and Clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790-799, 1995.

[8] Liyuan Li, et. al. IEEE T-IP, vol. 13, no. 11, pp. 1459-1472, 2004.

[9] C. Wren, A. Azarbaygaui, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.

Claims

1. A method of multi-object tracking in a video signal; the method comprising the steps of: receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said ech child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

2. The method as claimed in claim 1, wherein step a) comprises calculating visible portions of the respective corresponding objects in the first image to derive the estimated depths of the respective corresponding objects.

3. The method as claimed in claim 1, wherein step b) comprises assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node based on a posterior probability evaluated by Bayes rule.

4. The method as claimed in claim 3, wherein the a posterior probability evaluated by Bayes rule is based on principle colour representations (PCRs) of the corresponding object and said one child node respectively.

5. The method as claimed in claim 1, wherein step c) comprises removing the visual content of the assigned corresponding object from the visual data associated with said one child node based on PCRs of the assigned corresponding object and said one child node respectively.

6. The method as claimed in claim 1, wherein step d) comprises calculating visible portions of the respective corresponding objects in the second image to derive the estimated depths of the respective corresponding objects.

7. The method as claimed in claim 1, wherein step e) comprises applying the means-shift calculation to locate the corresponding object having the lowest depth in said each child node based on gravity centres of pixels of each principle colour components in a PCR of the corresponding object in the second image.

8. The method as claimed in claim 1, wherein step g) comprises removing the updated visual content of the located corresponding object from the visual data associated with said each child node based on PCRs of the located corresponding object and said each child node respectively.

9. The method as claimed in claim 1, further comprising storing tracking data including the updated status and visual content of each corresponding object for a series of 7 000206

51

consecutive frames and detecting an event in the video signal based on the stored tracking data.

10. The method as claimed in claim 1, further comprising the step of: for each parent node having no child node, deleting the corresponding object.

11. The method as claimed in claim 1, further comprising the step of: for each parent node having only one child node, assigning all corresponding objects to said one child node.

12. The method as claimed in claim 1, further comprising the step of: for each child node having no corresponding object assigned thereto, check whether the object is disappeared, and if not, set a new corresponding object to said each child node.

13. The method as claimed in claim 1, further comprising the step of: for each child node having only one corresponding object assigned thereto, update the state and visual content of said one corresponding object from the visual data associated with said each child node.

14. A multi-object tracking module for multi-object tracking in a video signal; the module comprising: means for receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; means for generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and the means for generating, for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; 06

52

c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then, for each child node having only one corresponding object assigned thereto, updating a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said ech child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

15. A data storage medium having stored thereon computer code means for instructing a computer system to execute a method of multi-object tracking in a video signal; the method comprising the steps of: receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acrylic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero to more child nodes in the second segmented images, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said ech child node in said second image; e) applying a means-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

16. A method of stationary object tracking in a video signal, the method comprising the steps of: determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.

17. The method as claimed in claim 16, wherein the generating of the template image is based on image data within a bounding box in the at least one of the frames of the sequence for the tracked object.

18. The method as claimed in claim 16, wherein the tracking of the state comprises steps of: determining a first difference measure between the template image and a corresponding region in the current frame; determining a second difference measure between respective corresponding regions in the current frame and a preceding frame; determining a visibility measure of the stationary object from the corresponding region in the current frame.

19. The method as claimed in claim 18, wherein the tracking of the state further comprises determining whether another tracked moving object overlaps the corresponding region in the current frame.

20. The method as claimed in claim 19, wherein the tracking of the state further comprises the steps of: determining a motionless state if the first and second difference measures are below a first threshold value over a sequence of τ_p current frames; determining an occluded state if the first and second difference measures each exceed a second threshold value and the visibility measure falls below a third threshold value over the sequence of τ_p current frames, and another tracked moving object overlaps the corresponding region in the current frame; determining a removed state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure falls below the third threshold value over the sequence of X_p current frames, and no tracked moving object overlaps the corresponding region in the current frame; determining an inner-motion state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure then falls below the first threshold value and the visibility measure is above a fourth threshold value; and determining the start moving state if the first and second difference measures exceed and remain above the second threshold value and the visibility measure exceeds and remains above the fourth threshold measure over the sequence of τ_p current frames, and a spatial shift of the tracked stationary object.

21. The method as claimed in claim 18, wherein the visibility measure is determined based on principle colour representation.

22. The method as claimed in claim 18, wherein the first and second difference measures are determined based on a knowledge base of human perceived semantic meanings, an evaluation from real-world videos, or both.

23. A system for object tracking in a video signal, the system comprising: means for determining that a tracked moving object has become stationary over a sequence of frames; means for generating a template image of the stationary object based on at least one of the frames of the sequence; means for tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and means for switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.

24. A data storage medium having stored thereon computer code means for instructing a computer system to execute a method of object tracking in a video signal, the method comprising the steps of: determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.