WO2008008046A1 - Method and system for multi-object tracking - Google Patents

Method and system for multi-object tracking

Info

Publication number
WO2008008046A1
Authority
WO
WIPO (PCT)
Prior art keywords
objects
child node
image
state
child
Prior art date
Application number
PCT/SG2007/000206
Other languages
French (fr)
Inventor
Liyuan Li
Ruijiang Luo
Ruihua Ma
Karianto Leman
Pankaj Kumar
Beng Hai Lee
Welmin Huang
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research
Publication of WO2008008046A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/174 Segmentation; Edge detection involving the use of two or more images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/254 Analysis of motion involving subtraction of images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking

Definitions

  • the present invention relates broadly to method and system for multi-object tracking in a video signal, in particular a video signal captured by a fixed camera, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method for multi-object tracking in a video signal.
  • Real-time tracking of objects of interest in image sequences is one of the challenging problems in video understanding. It is an essential part of many computer vision applications, such as video surveillance, media analysis, human-computer interfaces, and video compression (e.g., object-based coding in MPEG-4). Many methods have been proposed for appearance-based visual object tracking in image sequences. Generally speaking, three components are included in the processing: target representation, motion prediction, and object matching.
  • a model is used to characterize the distinctive appearance features of the target object. These appearance features of the target object should be consistent and discriminant from other objects through the image sequence.
  • Existing models include blobs of homogeneous intensities or colors, feature points, contours, templates, color histograms and joint color-spatial distributions of the object regions.
  • Inter-frame motion of target objects is predicted by using dynamic models or motion models.
  • the popular dynamic models are Kalman filter and particle filters.
  • the motion of a rigid object can be estimated by using an explicit motion model such as a geometric transformation, while the motion of non-rigid objects can be computed with an implicit motion model, e.g., mean-shift.
  • An observation model is used to evaluate the matching between the target and its candidate location in the coming frame. The new position of the target is determined as the mean location of the predicted candidates (EAP: expected a posteriori) if a dynamic model is employed, or the location of the maximum matching value (MAP: maximum a posteriori) when a motion model is used.
  • the MAP methods are considered as deterministic tracking since a gradient descent algorithm, e.g., a mean-shift algorithm, is commonly used to find the maximum, and the EAP methods are classified as stochastic tracking since random sampling in a time-series state space is required. Roughly speaking, stochastic tracking is more robust than deterministic tracking since it is less likely to be trapped in local extrema; however, it is more computationally expensive.
  • Being aware of the difficulties caused by occlusion and overlapping, multi-camera systems have been proposed. With a proper placement of multiple cameras, occlusion can be alleviated, assuming at least one camera may capture a better view of each target object.
  • the challenge for a multi-camera system is the calibration and the cooperation of multiple cameras to achieve a consistent tracking of each object since the views from different cameras can be very different.
  • MacCormick and Blake proposed the exclusion principle in the likelihood estimation.
  • the state is extended to a 2.1D model which contains a label of depth order.
  • the likelihood measures for all the depth configurations between two overlapping objects are explicitly derived.
  • a strategy of partition sampling is performed for all of the configurations.
  • a similar approach has been used to estimate the likelihoods based on color histograms of two overlapping objects.
  • a method of multi-object tracking in a video signal comprising the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
  • Step a) may comprise calculating visible portions of the respective corresponding objects in the first image to derive the estimated depths of the respective corresponding objects.
  • Step b) may comprise assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node based on a posterior probability evaluated by Bayes rule.
  • a posterior probability evaluated by Bayes rule may be based on principal colour representations (PCRs) of the corresponding object and said one child node respectively.
  • Step c) may comprise removing the visual content of the assigned corresponding object from the visual data associated with said one child node based on PCRs of the assigned corresponding object and said one child node respectively.
  • Step d) may comprise calculating visible portions of the respective corresponding objects in the second image to derive the estimated depths of the respective corresponding objects.
  • Step e) may comprise applying the mean-shift calculation to locate the corresponding object having the lowest depth in said each child node based on gravity centres of pixels of each principal colour component in a PCR of the corresponding object in the second image.
  • Step g) may comprise removing the updated visual content of the located corresponding object from the visual data associated with said each child node based on PCRs of the located corresponding object and said each child node respectively.
  • the method may further comprise storing tracking data including the updated status and visual content of each corresponding object for a series of consecutive frames and detecting an event in the video signal based on the stored tracking data.
  • the method may further comprise the step of for each parent node having no child node, deleting the corresponding object.
  • the method may further comprise the step of for each parent node having only one child node, assigning all corresponding objects to said one child node.
  • the method may further comprise the step of, for each child node having no corresponding object assigned thereto, checking whether the object has disappeared, and if not, setting a new corresponding object to said each child node.
  • the method may further comprise the step of, for each child node having only one corresponding object assigned thereto, updating the state and visual content of said one corresponding object from the visual data associated with said each child node.
  • a multi-object tracking module for multi-object tracking in a video signal; the module comprising means for receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; means for generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and the means for generating, for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
  • a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of multi-object tracking in a video signal; the method comprising the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
  • a method of stationary object tracking in a video signal comprising the steps of determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
  • the generating of the template image may be based on image data within a bounding box in the at least one of the frames of the sequence for the tracked object.
  • the tracking of the state may comprise the steps of determining a first difference measure between the template image and a corresponding region in the current frame; determining a second difference measure between respective corresponding regions in the current frame and a preceding frame; determining a visibility measure of the stationary object from the corresponding region in the current frame.
  • the tracking of the state further may comprise determining whether another tracked moving object overlaps the corresponding region in the current frame.
  • the tracking of the state further may comprise the steps of determining a motionless state if the first and second difference measures are below a first threshold value over a sequence of τ_p current frames; determining an occluded state if the first and second difference measures each exceed a second threshold value and the visibility measure falls below a third threshold value over the sequence of τ_p current frames, and another tracked moving object overlaps the corresponding region in the current frame; determining a removed state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure falls below the third threshold value over the sequence of τ_p current frames, and no tracked moving object overlaps the corresponding region in the current frame; determining an inner-motion state if the first and second difference measures each initially exceed the second threshold value and the second difference measure then falls below the first threshold value and the visibility measure is above a fourth threshold value; and determining the start moving state if the first and second difference measures exceed and remain above the second threshold value and the visibility measure exceeds and remains above the fourth threshold value.
  • the visibility measure may be determined based on principal colour representation.
  • the first and second difference measures may be determined based on a knowledge base of human perceived semantic meanings, an evaluation from real-world videos, or both.
  • a system for object tracking in a video signal comprising means for determining that a tracked moving object has become stationary over a sequence of frames; means for generating a template image of the stationary object based on at least one of the frames of the sequence; means for tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and means for switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
  • a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of object tracking in a video signal, the method comprising the steps of determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
  • Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.
  • Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.
  • Figure 3 shows a series of images and histograms illustrating principle colour representation (PCR) in the example embodiment.
  • FIG. 4 shows a schematic drawing illustrating directed acyclic graphs (DAGs) for regions in consecutive frames in the example embodiment.
  • Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.
  • Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.
  • Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.
  • Figure 8 shows a graph illustrating a finite state machine (FSM) representation for event detection in the system implementation of Figure 7.
  • Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.
  • the described embodiment provides a novel 2.5D method of multi-object tracking for real-time video surveillance.
  • An appearance model, the principal color representation (PCR), is used.
  • the PCR model characterizes the appearance of an object or a region with a few most significant colors.
  • the likelihood of observing a tracked object in a foreground region is derived according to their PCRs.
  • multi-object tracking is formulated as a Maximum A Posterior (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location.
  • each tracked object is assigned to a foreground region in the coming frame.
  • its visual information will be excluded from the PCR of the region.
  • multiple objects assigned to one region are located one-by-one according to their depth order.
  • a two-phase mean-shift algorithm based on PCR is derived for locating objects.
  • When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.
  • the present specification also discloses apparatus for performing the operations of the methods.
  • Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
  • the algorithms and displays presented herein are not inherently related to any particular computer or other apparatus.
  • Various general purpose machines may be used with programs in accordance with the teachings herein.
  • the construction of more specialized apparatus to perform the required method steps may be appropriate.
  • the structure of a conventional general purpose computer will appear from the description below.
  • the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code.
  • the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
  • the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
  • Such a computer program may be stored on any computer readable medium.
  • the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
  • the computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system.
  • the computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
  • the invention may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.
  • the distinctive background objects (regions) in the example embodiment are classified into two categories:
  • Type-1 CBR: a facility for the public in the scene.
  • Type-2 CBR: a large homogeneous region.
  • Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions.
  • the example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors.
  • the OHR H_b is a simple and efficient variant of the robust local descriptor SIFT [1] for real-time processing. It is less sensitive to illumination changes and slight shifts of object position.
  • the PCR for R'_b is defined as T_b = {n_i, (c_i^k, p_i^k), k = 1, ..., N}, where n_i is the size of R'_b, c_i^k is the k-th most significant color of R'_b, and p_i^k is its significance value.
  • the significance value is computed by counting, via the delta function below, the pixels of R'_b whose colors match c_i^k.
  • δ(c_1, c_2) is a delta function. It equals 1 when the color distance d(c_1, c_2) is smaller than a small threshold; otherwise, it is 0.
  • the color distance used here is one that is insensitive to noise and illumination changes.
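The delta-function test and the resulting significance values can be illustrated with a short sketch. The patent's exact color-distance formula is not legible in this text, so a simple normalized L1 distance in RGB is assumed as a stand-in; the function names, threshold value, and distance choice are illustrative only.

```python
import numpy as np

def color_distance(c1, c2):
    # Assumed stand-in: normalized L1 distance in RGB, in [0, 1].
    # The patent's actual distance (insensitive to noise/illumination) is not legible here.
    return np.abs(np.asarray(c1, float) - np.asarray(c2, float)).sum() / (3 * 255.0)

def delta(c1, c2, eps=0.05):
    # delta(c1, c2) = 1 when the color distance is below a small threshold, else 0.
    return 1.0 if color_distance(c1, c2) < eps else 0.0

def significance(region_pixels, principal_color, eps=0.05):
    # p^k = fraction of region pixels whose color matches the k-th principal color.
    n = len(region_pixels)
    matches = sum(delta(c, principal_color, eps) for c in region_pixels)
    return matches / max(n, 1)
```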
  • a type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OHR and PCR are used to characterize the type-1 CBR.
  • Let R'_b1 be the i-th type-1 CBR in the scene. Its contextual descriptors are H_b1 and T_b1.
  • a type-1 CBR has just two states: occluded (occupied) or not. The likelihood of observing a type-1 CBR is evaluated on the whole region.
  • the contextual descriptors of the region R_t(x) at the corresponding position of R'_b1 in the current frame I_t(x) are H_t and T_t.
  • the likelihood of R'_b1 being exposed can be evaluated by matching R_t(x) to R'_b1. Based on OHR, the matching of R_t(x) and R'_b1 is defined such that, if R'_b1 and R_t(x) are similar, P_L(H_t | R'_b1) is close to 1; otherwise, it is close to 0.
  • the first term is the likelihood based on the partition evidence of the principal color c^k_b1. It is evaluated from the PCRs of R'_b1 and R_t(x).
  • the type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it.
  • the likelihood of observing a type-2 CBR is evaluated locally.
  • N_t(x) is a small neighborhood centered at x, e.g., a 5 × 5 window.
  • the likelihood of R_t(x) belonging to R'_b2 is defined accordingly.
  • the appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR.
  • the spatial model of R'_b1 is defined as its bounding box and the center point, i.e., M_s(R'_b1).
  • a model base which contains up to K_b appearance models of R'_b1 is used.
  • the models in the base are learned incrementally.
  • the active appearance model is the one from the model base which best fits the current appearance of the CBR.
  • Let D be a time duration of 3 to 5 minutes, not limiting, in the example embodiment (i.e. a long duration compared with the frame duration in the video signal).
  • the number of times the i-th type-1 CBR is observed during the period is accumulated.
  • the active appearance model is replaced.
  • the new appearance model M^a(R'_b1) is compared with the ones in the model base according to (11a). If one is sufficiently close to the new model (i.e. the similarity is larger than T_L1 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
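The model-base maintenance described above can be sketched as follows. Appearance similarity is assumed to be a scalar in [0, 1] returned by a caller-supplied function, and the threshold stands in for the similarity threshold quoted above; class and field names are illustrative.

```python
from collections import deque

class ModelBase:
    """Keeps up to K_b appearance models of a type-1 CBR plus the active model."""
    def __init__(self, k_b=5, threshold=0.8):
        self.models = deque(maxlen=k_b)   # the oldest model is evicted when the base is full
        self.active = None
        self.threshold = threshold        # stands in for the similarity threshold (T_L1 + epsilon)

    def update(self, new_model, similarity):
        # 'similarity' compares two appearance models (OHR + PCR) and returns a value in [0, 1].
        if self.active is None:
            self.active = new_model
            self.models.append(new_model)
            return
        # Find the stored model closest to the newly observed appearance.
        best = max(self.models, key=lambda m: similarity(m, new_model), default=None)
        if best is not None and similarity(best, new_model) > self.threshold:
            self.active = best            # reuse a sufficiently close stored model
        else:
            self.active = new_model       # otherwise the new model becomes the active one
            self.models.append(new_model) # and is placed into the base
```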
  • T b2 be the PCR descriptor of the j-th type-2 CBR R' b2
  • the spatial model of R' b2 describes the range of the homogeneous region in the image.
  • the spatial model may have to be adjusted in initialization duration when sufficient samples have been observed according to the likelihood values.
  • the prior probability P(R_t(x)) is the same for every pixel in an image. The log posterior probability of R_t(x) belonging to R'_b in the current frame I_t(x) is then defined accordingly.
  • the position of a type-1 CBR is already determined by its spatial model.
  • the prior is 1 for that position and 0 otherwise.
  • the prior probability of a pixel x belonging to the region R'_b2 can be defined accordingly.
  • Let r_2 be the proportion of exposed pixels of R'_b2 in R_t(x) according to the posterior estimates.
  • T_O is chosen to be slightly lower than 2T_L2.
  • an occluded pixel of R'_b2 is confirmed if the majority of the pixels within its neighborhood belong to R'_b2 and fewer of them are observed in the current frame.
  • the rate is computed with T_H = 15% chosen in the example embodiment.
  • a control code C_t(x) is used, where the value of C_t(x) is 0, 1, 2, or 3: codes 1, 2 and 3 indicate that the low, normal, or high learning rate, respectively, is applied at the pixel, and code 0 indicates the normal learning rate for non-context pixels (used for display).
  • control code images are used.
  • the first two are the previous and current control code images described above, i.e., C_{t-1}(x) and C_t(x), and the second two are the control code images actually applied for pixel-level background maintenance, i.e., C*_{t-1}(x) and C*_t(x).
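The mapping from control codes to per-pixel learning rates can be sketched as below. The rate relationships (high = double the normal rate, low = zero) follow the description in this document; the numeric value of the normal rate is a placeholder.

```python
import numpy as np

NORMAL_RATE = 0.01            # placeholder value; the normal ABS learning rate
RATE_BY_CODE = {
    0: NORMAL_RATE,           # non-context pixel: normal learning rate (code kept for display)
    1: 0.0,                   # low learning rate: occluded CBR pixels, model is frozen
    2: NORMAL_RATE,           # normal learning rate: exposed CBR pixels, no significant change
    3: 2 * NORMAL_RATE,       # high learning rate: exposed CBR pixels with appearance change
}

def learning_rate_image(control_codes):
    """Map a control-code image C_t(x) to a per-pixel learning-rate image."""
    rates = np.zeros(control_codes.shape, dtype=float)
    for code, rate in RATE_BY_CODE.items():
        rates[control_codes == code] = rate
    return rates
```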
  • To evaluate the example embodiment, two existing methods of ABS were implemented. They are the methods based on Mixture of Gaussians (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC MoG), PFR, and Context-Controlled PFR (CC PFR), were compared.
  • the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to the double of the normal learning rate and the low learning rate was set to zero.
  • the leftmost image 102 is a snapshot with manually cropped-out contextual background regions.
  • the second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground.
  • the rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120.
  • the three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context- Controlled PFR) 126, and the corresponding control image 128.
  • the black regions e.g 130 do not belong to any CBR
  • the gray regions e.g. 132 are exposed parts of the CBRs with no significant appearance changes
  • the white regions e.g. 134 are occluded parts of the CBRs.
  • For pixels in exposed parts of the CBRs, the normal learning rate is applied; for pixels in occluded parts of the CBRs, the low learning rate is used.
  • the high learning rate would be used as described above.
  • the scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there were no overstaying objects or overcrowds. However, several people (e.g. 138) kept moving around, staying somewhere for a while, and performing various activities.
  • the contextual features of the example embodiment capture the global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel- level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can be used to preferably achieve a precise segmentation of foreground objects.
  • FIG. 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment.
  • one or more contextual background representation types are defined.
  • an image of a scene in the video signal is segmented into foreground and background regions.
  • each background region is classified as belonging to one of the contextual background representation types.
  • an orientation histogram representation (OHR), a principal colour representation (PCR), or both, are determined for each background region.
  • a current image of the scene in the video signal is received.
  • different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
  • principal color representation is applied for efficient appearance-based multi-object tracking.
  • object tracking may be applied to a sequence of segmented images generated by background subtraction.
  • each segmented image may contain one or several isolated foreground regions.
  • each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera view point).
  • the example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as the segmented regions.
  • each image may contain one or several objects. These objects in the image may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap.
  • an object of interest e.g., a person, vehicle, luggage, etc.
  • Let the n-th foreground region detected from the frame at time t be R^n_t(x), where x = (x, y)^T denotes the position of a pixel in the region.
  • the corresponding principal color representation (PCR) is defined accordingly.
  • ω(x) is a weight function and δ(·, ·) is a delta function.
  • a color distance is used which is not sensitive to noise and illumination changes.
  • the PCR T^n_t contains the first N significant colors and their statistics for the region R^n_t(x). Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number N to approximate the color features of the region.
  • Fig. 3 shows two examples of PCRs where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons.
  • the PCRs for the foreground regions are generated through scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region R^n_t(x) (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.
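Table 2 itself is not reproduced in this text, but the PCR extraction it summarizes can be sketched as below: scan the foreground region, accumulate counts per quantized color bin, and keep the N most significant colors. The bin size and N are illustrative choices, not the patent's parameters.

```python
import numpy as np
from collections import Counter

def extract_pcr(image, mask, n_colors=8, bin_size=16):
    """Compute a simple principal color representation (PCR) of a foreground region.

    image: HxWx3 uint8 frame; mask: HxW boolean foreground region.
    Returns (region_size, [(color, significance), ...]) with the N most significant colors.
    """
    pixels = image[mask]                              # colors of the region's pixels
    size = len(pixels)
    # Quantize colors into coarse bins so near-identical colors fall together.
    bins = (pixels // bin_size).astype(np.int32)
    counts = Counter(map(tuple, bins))
    principal = counts.most_common(n_colors)
    # Significance of each principal color = its share of the region's pixels.
    pcr = [(np.array(b) * bin_size + bin_size // 2, cnt / size) for b, cnt in principal]
    return size, pcr
```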
  • the aim of object tracking is to locate a tracked object in the coming frame according to its previous appearance.
  • the likelihood or the conditional probability of observing the tracked object in a region of the current frame, has to be evaluated.
  • the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is then extended to the scale-invariant likelihood.
  • Let O^m_{t-1} be the m-th tracked object, described by its PCR T^m_{t-1}.
  • the likelihood of the object O^m_{t-1} in the region R^n_t can be defined as follows.
  • each P(R^n_t | E^i_m) is the likelihood of the object O^m_{t-1} appearing in the region R^n_t based on the partition evidence E^i_m.
  • P(E^i_m | O^m_{t-1}) is the conditional probability of the evidence E^i_m given the object O^m_{t-1}.
  • the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs.
  • the likelihood on original PCRs is better.
  • the scale-invariant likelihood of observing a given object O^m_{t-1} in the region R^n_t is defined as follows.
  • Eq. (11) can provide a suitable measurement for these two cases in the example embodiment.
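Eq. (11) is not legible in this text. The sketch below assumes one plausible scale-invariant form: matched principal-color mass is accumulated and normalized by the smaller of the object and region sizes, so the score stays comparable whether the object fills the region or is only a small visible part of it. It is an illustration under that assumption, not the patent's exact formula.

```python
def pcr_likelihood(obj_size, obj_pcr, region_size, region_pcr, match):
    """Scale-invariant likelihood of observing an object in a region from their PCRs.

    obj_pcr / region_pcr: lists of (color, pixel_count) pairs.
    match(c1, c2): returns True if two principal colors are considered the same.
    """
    overlap = 0.0
    for c_obj, n_obj in obj_pcr:
        for c_reg, n_reg in region_pcr:
            if match(c_obj, c_reg):
                overlap += min(n_obj, n_reg)   # shared color evidence for this principal color
                break
    # Normalizing by the smaller size is what makes the score scale-invariant here.
    return overlap / max(1.0, min(obj_size, region_size))
```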
  • Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene. When multiple target objects frequently merge and separate from one another in a public site, tracking one individual object is no longer an independent process. Multi-object tracking can be formulated as a global Maximum A Posterior (MAP) problem for all the tracked objects.
  • MAP Maximum A Posterior
  • the global MAP problem can be approximately decomposed as two subproblems: assignment and location.
  • assignment and location Using the principal color representation (PCR) and the associated likelihood function, the example embodiments uses sequential solutions to these two subproblems, as detailed below.
  • Let O_{t-1} = {O^m_{t-1}} be the set of tracked target objects in the previous frame I_{t-1}(x), and Θ_{t-1} be the set of state parameters describing their positions at time t-1.
  • the task of multi-object tracking is to estimate the states Θ_t of the tracked objects in the current frame I_t(x) given their previous appearance models O_{t-1} and states Θ_{t-1}. This can be formulated as Maximum A Posteriori (MAP) estimation of the state parameters Θ_t.
  • MAP Maximum A Posterior
  • the objects in a group region (e.g., R^k_{t-1}) in the previous frame may separate into several regions in the current frame.
  • the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in the consecutive frames.
  • the problem (13) can be further decomposed by using directed acyclic graphs (DAGs).
  • the parent layer consists of nodes representing the regions in the previous frame I_{t-1}(x), and the child layer consists of nodes denoting the regions in the current frame I_t(x).
  • a directed acyclic graph is formed by a set of nodes in which every node connects to one or more nodes in the same group.
  • a set of DAGs can be generated. An example of graphs for two consecutive frames is illustrated in Figure 4. The notations for the DAGs are defined as follows.
  • the parent nodes are denoted as {n^p_i}, where each node n^p_i represents one of the regions in the previous frame, and the child nodes are denoted as {n^c_j}, where each node represents one of the regions in the current frame.
  • each DAG G_i consists of its parent nodes, its child nodes, and the edges connecting them.
  • the objects contributing to a parent node n^p_i are denoted as a set; if the set contains a single object, the node is a single object, otherwise it is a group of objects.
  • the objects in the child nodes are reordered as {O^m_t}; they are the set of tracked target objects in the current frame.
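Because inter-frame motion is small, a parent region and a child region can be linked whenever their pixel supports overlap, and connected groups of such links form the DAGs. A minimal sketch of building these graphs from two label images follows; function and variable names are illustrative.

```python
import numpy as np

def build_dags(prev_labels, curr_labels):
    """Link regions of consecutive frames that overlap spatially.

    prev_labels / curr_labels: integer label images (0 = background, >0 = region id).
    Returns a list of graphs, each as (parent_ids, child_ids, edges).
    """
    edges = set()
    overlap = (prev_labels > 0) & (curr_labels > 0)
    for p, c in zip(prev_labels[overlap], curr_labels[overlap]):
        edges.add((int(p), int(c)))

    # Group connected parents/children into one graph (union-find over the bipartite links).
    parent_of = {}
    def find(x):
        while parent_of.setdefault(x, x) != x:
            parent_of[x] = parent_of[parent_of[x]]
            x = parent_of[x]
        return x
    def union(a, b):
        parent_of[find(a)] = find(b)

    for p, c in edges:
        union(('p', p), ('c', c))
    # Regions with no overlap still form single-node graphs.
    for p in np.unique(prev_labels[prev_labels > 0]):
        find(('p', int(p)))
    for c in np.unique(curr_labels[curr_labels > 0]):
        find(('c', int(c)))

    groups = {}
    for node in parent_of:
        groups.setdefault(find(node), []).append(node)
    dags = []
    for nodes in groups.values():
        parents = sorted(n for kind, n in nodes if kind == 'p')
        children = sorted(n for kind, n in nodes if kind == 'c')
        dags.append((parents, children,
                     [(p, c) for p, c in edges if p in parents and c in children]))
    return dags
```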
  • (16) is still a nontrivial problem.
  • the example embodiment solves the problem in two sequential steps from coarse to fine.
  • the problem is decomposed approximately as two sub- problems: assignment and location.
  • the coarse assignment process assigns each object in a parent node to one of its child nodes while the fine location process determines the new states of the objects assigned to each child node.
  • In this step the tracked objects in each parent node are assigned to its child nodes based on the largest posterior probability.
  • Let a_t be the parameter vector describing the assignment of the tracked objects O_{t-1} to the child nodes of the graph.
  • the posterior probability of the assignment for graph G_i can be expressed as follows.
  • the best assignment for the tracked objects is the one that results in the best observation of the objects in the corresponding child nodes, that is,
  • the assignment parameters can be considered as coarse tracking parameters indicating in which child nodes (regions) the objects are observed, without concern for the exact new positions of the tracked objects in the child regions.
  • In the second step, the new states of the tracked objects assigned to each child node are determined.
  • objects in each child node can be tracked independently of objects in the other child nodes.
  • Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below.
  • Where a graph contains only a child node with no parent node, a new object appears and is initialized in G_i with a new id number.
  • the child node n^c_1 represents the region R^k_t.
  • the graph represents the simple case of isolated object tracking.
  • Let the graph G_i contain one parent node and one child node, and let the object in the parent node be O^m_{t-1}.
  • Let the child node n^c_1 represent the region R^k_t.
  • the object in the child node, i.e., O^m_t, keeps the same id number.
  • If the i-th graph only contains one parent node which has no child nodes, then the previous objects in the parent node are assumed to have disappeared in the current frame. Tracking is terminated for these objects.
Tracking Multiple Objects in a Graph
  • the operations of assignment and location will be performed.
  • the index i for the graph G is omitted below.
  • the assignment can be solved sequentially from the most visible one to the least visible one.
  • Let the objects in the parent node be ordered according to their visible sizes.
  • Eq. (22) can be evaluated on PCR. Assume that a child node represents the region R^k_t, and let T^k_t and T^m_{t-1} be the PCRs of R^k_t and O^m_{t-1}, respectively. Using Eqs. (21) and (22), the best assignment of the objects in the parent node can be achieved one-by-one sequentially according to their depth order.
  • the top object in the list is popped out.
  • Suppose it is the m-th object O^m_{t-1}.
  • the likelihood in (23) is calculated according to Eq. (11).
  • the prior probability in (23) is evaluated based on the shape similarity and center distance between the bounding boxes.
  • Let B^m_{t-1} and B^k_t be the bounding boxes of O^m_{t-1} and R^k_t associated with the child node.
  • Let x^m and x^k be their centers, and d_m and d_k be their diagonal lengths, respectively.
  • the shape similarity between two boxes (center aligned) is defined from their overlap, where "∩" denotes the intersection and "∪" denotes the union.
  • the prior is taken as 0.5 times the sum of the shape-similarity and center-distance terms.
  • the object O^m_{t-1} is assigned to the child node according to Eq. (23).
  • the last operation in this iteration is exclusion, which removes the visual information of the object O^m_{t-1} from the PCR T^k_t(m-1) associated with the child node.
  • Let T^m_{t-1} and T^k_t(m-1) be the PCRs of the object and of the remaining region, respectively. T^k_t(m) is updated from T^k_t(m-1) for its principal colors one by one. For the j-th principal color c_j and significance s_j, the following updating is performed.
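The assignment-with-exclusion loop can be sketched as follows. The likelihood is assumed to be a PCR-based score like the one shown earlier, the prior is assumed to combine a bounding-box shape-similarity and a center-distance term, and the exclusion step subtracts the assigned object's principal-color mass from the child node's PCR. All helper and field names are illustrative.

```python
def assign_objects(objects, children, likelihood, prior):
    """Assign each tracked object of a parent node to one child node, front-most first.

    objects:  list of dicts with 'id', 'visible_size', 'pcr' (color -> pixel count).
    children: list of dicts with 'id', 'pcr' (color -> pixel count); mutated by exclusion.
    likelihood(obj_pcr, child_pcr) and prior(obj, child) return scores in [0, 1].
    """
    assignments = {}
    # Depth order: the most visible object is assumed to be in front and is assigned first.
    for obj in sorted(objects, key=lambda o: o['visible_size'], reverse=True):
        best = max(children, key=lambda ch: likelihood(obj['pcr'], ch['pcr']) * prior(obj, ch))
        assignments[obj['id']] = best['id']
        # Exclusion: remove the object's color evidence from the child node's PCR so
        # later (more occluded) objects are not attracted to the same evidence.
        for color, count in obj['pcr'].items():
            if color in best['pcr']:
                best['pcr'][color] = max(0, best['pcr'][color] - count)
    return assignments
```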
  • Locating the objects in a region is not an independent process for each object; rather, the front objects with richer visible information are less affected by the occluded ones.
  • objects in the node are located one by one from the most visible to the least visible ones based on their visible parts.
  • the posterior probability of the new states for all the objects in the node can be expressed as follows.
  • the new state of each object is obtained by maximising this posterior, where the objects are sorted in descending order according to their visible sizes.
  • the sequential solution to the problem Eq. (20) and Eq. (26) contains two steps.
  • In the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes.
  • an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.
  • s_m and S_k are the sizes of the object O^m_t and the region R^k_t, respectively.
  • a weight image w_n(x) is used: if the pixel x is likely to belong to one of the previously located objects, w_n(x) is low (≈ 0); otherwise, it is high (≈ 1).
  • locating the object O^m_t in the region R^k_t according to Eq. (26) is equivalent to finding the position where the maximum probability density of observing the object occurs.
  • This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7].
  • a two-stage mean-shift procedure is proposed based on the evidence of the object's principal colors. In the first stage, the gravity center of the pixels of each principal color component is computed as follows.
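A minimal sketch of this location step under simplifying assumptions: in the first stage, the gravity centre of the pixels matching each principal colour inside the current search window is computed; in the second stage, the window centre is shifted to the significance-weighted combination of those centres, and the procedure repeats until it converges. The kernel, the weight image handling, and the function names are simplifications, not the patent's exact procedure.

```python
import numpy as np

def locate_by_mean_shift(image, weight, pcr, center, box_size, match, max_iter=20):
    """Two-stage mean-shift location of an object described by its PCR.

    image: HxWx3 frame; weight: HxW exclusion weights (close to 0 on already-located objects).
    pcr: list of (color, significance); center: (y, x) start; box_size: (h, w) window.
    """
    h, w = box_size
    cy, cx = center
    for _ in range(max_iter):
        y0, x0 = int(max(cy - h / 2, 0)), int(max(cx - w / 2, 0))
        win = image[y0:y0 + h, x0:x0 + w]
        wgt = weight[y0:y0 + h, x0:x0 + w]
        centers, masses = [], []
        # Stage 1: gravity center of the pixels of each principal color component.
        for color, sig in pcr:
            hits = np.apply_along_axis(lambda c: match(c, color), 2, win) & (wgt > 0.5)
            ys, xs = np.nonzero(hits)
            if len(ys) == 0:
                continue
            centers.append((y0 + ys.mean(), x0 + xs.mean()))
            masses.append(sig * len(ys))
        if not centers:
            break
        # Stage 2: shift the window to the mass-weighted mean of the per-color centers.
        masses = np.asarray(masses, float)
        new_cy = float(np.dot(masses, [c[0] for c in centers]) / masses.sum())
        new_cx = float(np.dot(masses, [c[1] for c in centers]) / masses.sum())
        if abs(new_cy - cy) < 0.5 and abs(new_cx - cx) < 0.5:
            cy, cx = new_cy, new_cx
            break
        cy, cx = new_cy, new_cx
    return cy, cx
```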
  • the algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location.
  • DAG Directed Acyclic Graph
  • assignment phase each parent node in the DAG is processed.
  • location phase assigned objects in each child node are tracked.
  • small objects in a group with likelihood values less than 0.1 are set as disappeared.
  • the records of disappeared objects are kept for 50 frames. When a new object is detected, it is compared with the disappeared objects according to their PCRs, sizes and distances. If it matches a disappeared object, the tracking is restored; otherwise a new object is created.
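A sketch of that recovery step: new detections are compared with the records of recently disappeared objects by PCR similarity, size ratio, and distance, and a match restores the old id. The record fields, thresholds, and similarity function are assumptions for illustration.

```python
def restore_or_create(new_obj, disappeared, frame_idx, pcr_similarity,
                      max_age=50, min_sim=0.5, max_dist=100.0, max_size_ratio=1.5):
    """Return the id to reuse for a newly detected object, or None if none matches.

    new_obj / records in 'disappeared': dicts with 'id', 'pcr', 'size', 'center', 'last_seen'.
    """
    candidates = [d for d in disappeared if frame_idx - d['last_seen'] <= max_age]
    for rec in candidates:
        dist = ((new_obj['center'][0] - rec['center'][0]) ** 2 +
                (new_obj['center'][1] - rec['center'][1]) ** 2) ** 0.5
        size_ratio = max(new_obj['size'], rec['size']) / max(1, min(new_obj['size'], rec['size']))
        if (pcr_similarity(new_obj['pcr'], rec['pcr']) >= min_sim
                and dist <= max_dist and size_ratio <= max_size_ratio):
            disappeared.remove(rec)          # tracking of the old object is restored
            return rec['id']
    return None                              # caller allocates a fresh id (new object)
```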
  • segmenting individual persons in a group with domain knowledge will be preferred.
  • knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.
  • Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment.
  • first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked.
  • one or more directed acyclic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node.
  • step 506 for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
  • step 508 for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image.
  • step 510 for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a mean-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
  • a layer tracking algorithm is designed to track stationary objects through even frequent occlusions.
  • the object is identified as a moving object and tracked by a moving object tracking algorithm.
  • the stationary objects include not only static non-living objects but also include motionless living objects, e.g. a standing or sitting person. Since the living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth with no change of identity in the example embodiment.
  • a template image of the object is used to represent such a stationary object in the example embodiment.
  • Let {B_i} be the sequence of bounding boxes of the i-th tracked object in the τ_b most recent frames as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other.
  • If the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, τ_b is set to 10 frames, corresponding to about 1 second.
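The stop test can be sketched in a few lines: if the intersection of the object's bounding boxes over the last τ_b = 10 frames is non-empty, the object is flagged as stationary. The (x0, y0, x1, y1) box format is an assumption.

```python
def is_stationary(boxes, tau_b=10):
    """True if the last tau_b bounding boxes (x0, y0, x1, y1) have a non-empty intersection."""
    if len(boxes) < tau_b:
        return False
    recent = boxes[-tau_b:]
    x0 = max(b[0] for b in recent)
    y0 = max(b[1] for b in recent)
    x1 = min(b[2] for b in recent)
    y1 = min(b[3] for b in recent)
    return x1 > x0 and y1 > y0     # non-empty spatial intersection: the object has stopped
```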
  • a layer representation based on the object's template image is built. The layer representation of the detected stationary object is defined as follows, where A^i_t is the template image of the object maintained at time t and T^i_t is the Principal Colour Representation (PCR) of the object.
  • the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object.
  • d^i_f is the difference measure between the template A^i and the frame I_t(s) for the corresponding region of A^i.
  • d^i_c is the difference measure between the consecutive frames I_{t-1}(s) and I_t(s) for the region of the template.
  • d^i_p is the visibility measure of the object from the corresponding region in the frame I_t(s).
  • s_k is the estimated state of the stationary object at time k. Measures in the τ_d most recent frames and states in the τ_s most recent frames are recorded. The details of calculating these measures and estimating states from them for each layer object are described below.
  • Let I_t(s) be the color of a foreground point in the region of the i-th template image. According to Bayes' rule, the probability of the point belonging to the background is evaluated as follows.
  • p(c | b) can be obtained from the Principal Feature Representation (PFR) of the background.
  • Let s = (x, y) be a pixel of the image.
  • p^v_s(b) is the learned probability of s belonging to the background based on the observation of the feature v.
  • S^v_s(i) records the statistics of the M_v most frequent feature vectors at s.
  • the first N_v elements are used as principal features.
  • Three types of features are used in the example embodiment. They are a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence), respectively. Among them, color and gradient features are stable for static background parts and color co-occurrence features are suitable for dynamic background parts.
  • Three tables are used to learn the possible principal features of the three types for the background. They are T_c(s), T_e(s), and T_cc(s).
  • N_s(b) is the number of background points in a small window W_s centered at s in the previous frame.
  • M_s is the number of points within the window.
  • the priors can be calculated as follows.
  • N_s(l) and N_s(f) are the numbers of points belonging to the layer object and to moving objects within the window W_s in the previous frame.
  • the pixel s would be assigned according to the greatest likelihood value.
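A sketch of this per-pixel competition among background, layer object, and moving foreground: the priors come from the label counts in a small window of the previous frame, the per-class likelihoods are supplied by the respective appearance models (e.g. the background's principal feature representation), and the pixel goes to the label with the greatest posterior. Window size and function names are assumptions.

```python
import numpy as np

def assign_pixel(s, color, prev_labels, p_bg, p_layer, p_fg, win=5):
    """Assign pixel s to 'background', 'layer', or 'moving' by the greatest posterior.

    prev_labels: HxW array with values 'b', 'l', 'f' from the previous frame.
    p_bg / p_layer / p_fg: likelihood functions p(color | class) at this pixel.
    """
    y, x = s
    half = win // 2
    window = prev_labels[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1]
    m = window.size
    priors = {
        'background': np.count_nonzero(window == 'b') / m,
        'layer':      np.count_nonzero(window == 'l') / m,
        'moving':     np.count_nonzero(window == 'f') / m,
    }
    likes = {'background': p_bg(color), 'layer': p_layer(color), 'moving': p_fg(color)}
    return max(priors, key=lambda k: priors[k] * likes[k])
```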
  • the mask for the moving objects is used as the input for moving object tracking.
  • Stationary objects may also be involved in several changes and interactions with other objects through the sequence.
  • For a non-living object, it may e.g. undergo illumination changes, or be occluded or removed by other objects.
  • For a living object, it may change pose or move body parts, or start moving again.
  • the object's states are estimated and the template image is updated correspondingly in the example embodiment.
  • five states are used to describe the layer object; they are: motionless, occluded, removed, inner-motion, and start-moving.
  • the state is estimated according to various change measures from a short sequence of most recent frames.
  • Let s be a point in the template A^i_t(s) of the i-th layer object.
  • the difference between the template and the current frame at s can be evaluated as follows.
  • Th_d is a threshold set according to image noise.
  • S^i_A is the size of the template.
  • the difference measure between consecutive frames for the layer object is defined as
  • the difference measures are calculated on color vectors.
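The two difference measures can be sketched as the fraction of template pixels whose colour changes by more than the noise threshold Th_d, computed once against the stored template (d_f) and once between consecutive frames over the same region (d_c). The L1 colour norm is an assumption.

```python
import numpy as np

def difference_measure(reference, frame_region, mask, th_d=30.0):
    """Fraction of masked pixels whose color differs from the reference by more than Th_d.

    reference / frame_region: HxWx3 arrays over the template's bounding box;
    mask: HxW boolean support of the template; th_d: noise threshold (assumed L1 units).
    """
    diff = np.abs(reference.astype(float) - frame_region.astype(float)).sum(axis=2)
    changed = (diff > th_d) & mask
    return changed.sum() / max(1, mask.sum())

# d_f: stored template vs. current frame; d_c: previous frame vs. current frame, same region.
# d_f = difference_measure(template, current_region, mask)
# d_c = difference_measure(previous_region, current_region, mask)
```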
  • Let T^i be the PCR of the layer object that was stored when the object was detected as a stationary object.
  • Let T^i_t be the PCR from the region overlapped by the template A^i in the current frame.
  • Let O^m_{t-1} be an object in I_{t-1}(s).
  • Let O^n_t be a region in I_t(s).
  • the states of the tracked layer object are estimated by heuristic rules in the example embodiment:
  • Rule 2 (occluded): If both d^i_f and d^i_c turn moderate or high and d^i_p turns low through the sequence, and there are moving objects overlapping the region of the template A^i (as determined from the bounding boxes of such moving objects in the applied moving object tracking algorithm), the layer object is occluded;
  • Rule 3 (removed): If both d^i_f and d^i_c turn high and d^i_p turns low, and then d^i_c turns low through the sequence with no moving object overlapping the region of the template, the layer object is removed;
  • the parameters for the rules are determined according to a knowledge base of human-perceived semantic meanings and an evaluation from real-world videos in the example embodiment.
  • the difference measures d^i_f and d^i_c are low if they are less than 0.25, moderate if they are within (0.25, 0.75), and high if they are larger than 0.75.
  • the visibility measure d^i_p is low if it is less than 0.6; otherwise, it is high.
  • the measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template A^i. If the number of expanded pixels is larger than 50% of the template size, a "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiments, based on the relevant knowledge base.
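A sketch of the rule-based state estimation using the bands quoted above (low below 0.25, moderate within (0.25, 0.75), high above 0.75 for the difference measures; 0.6 for visibility). Only the occluded and removed rules are spelled out in this text, so the remaining branches below are illustrative placeholders rather than the patent's rules.

```python
def band(x):
    return 'low' if x < 0.25 else ('high' if x > 0.75 else 'moderate')

def estimate_state(d_f, d_c_history, d_p, overlapped_by_mover):
    """Estimate the layer-object state from recent measures.

    d_f: template-vs-frame difference; d_c_history: recent frame-to-frame differences
    (oldest first); d_p: visibility measure; overlapped_by_mover: bool taken from the
    moving-object tracker's bounding boxes.
    """
    d_c = d_c_history[-1]
    visible = d_p >= 0.6
    # Rule 2 (occluded): differences moderate/high, visibility low, and a mover overlaps.
    if band(d_f) != 'low' and band(d_c) != 'low' and not visible and overlapped_by_mover:
        return 'occluded'
    # Rule 3 (removed): differences first high, frame-to-frame difference then settles low,
    # visibility low, and no mover overlaps the template region.
    if (band(d_f) == 'high' and band(max(d_c_history)) == 'high'
            and band(d_c) == 'low' and not visible and not overlapped_by_mover):
        return 'removed'
    # Remaining states (motionless, inner-motion, start-moving) follow analogous rules on
    # the same measures; a simple placeholder default is used here.
    if band(d_f) == 'low' and band(d_c) == 'low':
        return 'motionless'
    return 'unknown'
```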
  • the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene.
  • If the layer object is confirmed as being motionless, a smoothing operation is performed on the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current frame will replace the template. If the object is occluded, no updating will be performed. If the object is classified as start-moving, the object will be transformed into a moving object with the same ID and corresponding PCR, mask, and position for tracking by a moving object tracking algorithm, and the layer representation of the object will be deleted. If the object is detected as removed, the object will be transformed into a disappeared object and its layer representation will be destroyed. With these operations, a target object moving around, staying somewhere for a while, and moving again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.
  • Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment.
  • step 602 it is detected that a tracked moving object has become stationary over a sequence of frames.
  • a template image of the stationary object is generated based on at least one of the frames in the sequence.
  • step 606 a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.
  • The structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules: a foreground segmentation module 701, a moving object tracking module 702, a stationary object tracking module 704, and an event detection module 706.
  • the foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8].
  • the background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.
  • PFR Principal Feature Representation
  • the moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702.
  • moving objects are represented by the models of principal color representation which exploits a few most significant colors and their statistics to characterize the appearance of each tracked object.
  • a layer representation, or a template for the object is established and will be tracked by the stationary object tracking module 704 using the method and system of the described example embodiment.
  • the states of templates for the objects are estimated with fuzzy reasoning.
  • the template for one object may shift between five states: motionless, interior motion, occluded, starting moving, and removed.
  • An event is an abstract symbolic concept of what has happened in the scene. It is the semantic-level description of the spatio-temporal concatenation of movements and actions of objects of interest in the scene.
  • Event detection in video understanding is a high level procedure which identifies specific events by interpreting the sequences of observed perceptual features from intermediate level processing. It is a step that bridges the numerical level and the symbolic level.
  • the fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.
  • unusual events are described by the spatio-temporal evolution of object's states, movements, and actions.
  • each event can be defined as a sequential succession of a few well-defined states.
  • An event could be started at one or more initial states, and then one state can transit to the next state when new conditions are met as the scene evolves in time.
  • State transition may also happen from an intermediate state back to a previous state if some conditions no longer hold for that state.
  • the semantic representation can be modelled based on Finite State Machines (FSM).
  • the FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.
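Because the FSM representation is central to the event detection described here, a small generic sketch may help. The class below is illustrative only; the state names, the cue dictionary, and the single-accepting-state assumption are simplifications rather than the patent's implementation.

```python
from typing import Callable, Dict, List, Tuple


class EventFSM:
    """A directed graph of states; edges carry boolean transition conditions
    evaluated on the per-frame perceptual cues of the tracked objects."""

    def __init__(self, initial: str, accepting: str):
        self.state = initial
        self.accepting = accepting
        # edges[state] -> list of (condition, next_state)
        self.edges: Dict[str, List[Tuple[Callable[[dict], bool], str]]] = {}

    def add_edge(self, src: str, condition: Callable[[dict], bool], dst: str) -> None:
        self.edges.setdefault(src, []).append((condition, dst))

    def step(self, cues: dict) -> bool:
        """Feed one frame's cues; return True once the event state is reached."""
        for condition, dst in self.edges.get(self.state, []):
            if condition(cues):
                self.state = dst
                break
        return self.state == self.accepting
```

The event-specific sketches further below (loitering, static person, unattended object, theft) reuse this class.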
  • each specific event can be represented by a directed graph G_i = (S_i, E_i), where S_i is the set of nodes representing the states and E_i is the set of edges representing the transitions.
  • the more complex the event, the larger is N, i.e. the number of intermediate states in the FSM 800, and the greater is the chance of delivering an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.
  • the input of an FSM is the numerical perceptual features generated by moving and stationary object tracking modules (compare 702 and 704 in Figure 7).
  • the visual cues of each tracked object can include shape, position, motion, and relations with others.
  • the visual cues in the example implementation are:
  • - InGroup: indicates whether the object is an isolated one or merged with others
  • - Occlusion: a measure within [0,1] indicating the degree of occlusion when overlapping with others
  • - Motion: a measure within [0,1] indicating the degree of interior motion of a stationary object.
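A minimal record of such per-object cues might look as follows; the field names and types are illustrative assumptions rather than the implementation's actual track-record format.

```python
from dataclasses import dataclass


@dataclass
class ObjectCues:
    """Per-object perceptual features passed from the tracking modules to the event FSMs."""
    obj_id: int
    bbox: tuple        # (x, y, w, h): shape and position cue
    speed: float       # magnitude of frame-to-frame motion
    is_human: bool     # simple object-type classification
    in_group: bool     # InGroup: isolated or merged with other objects
    occlusion: float   # within [0, 1]: degree of occlusion when overlapping others
    motion: float      # within [0, 1]: degree of interior motion of a stationary object
```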
  • An advantage of the tracking modules is the capability to resume tracking of some objects that are lost for a few frames.
  • the two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation.
  • a first-in-first-out (FIFO) stack is built to contain the track records of N frames.
  • the inputs O_Tracked to the FSMs are the track records of the previous N-th frame, so the triggered event is delayed by N frames.
  • N = 30 in the example implementation.
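One plausible way to realise the FIFO stack of track records with the N-frame delay is sketched below; the interface is an assumption, and only the N = 30 delay comes from the text.

```python
from collections import deque
from typing import Optional


class TrackRecordBuffer:
    """FIFO stack of per-frame track records.

    Events are evaluated on the records of the previous N-th frame, so that
    objects lost for a few frames can be resumed before a decision is made.
    """

    def __init__(self, n_frames: int = 30):
        self.buffer = deque(maxlen=n_frames)

    def push(self, records: dict) -> Optional[dict]:
        """Push this frame's track records; return the N-frame-old records
        once the buffer is full, otherwise None."""
        delayed = self.buffer[0] if len(self.buffer) == self.buffer.maxlen else None
        self.buffer.append(records)
        return delayed
```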
  • Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene with a duration t > T_Loitering.
  • the FSM is initialized for each new object.
  • the FSM has one intermediate state: "Stay" which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay”:
  • the object is classified as human
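Reusing the EventFSM sketch above, a loitering detector could be wired up as follows. The cue keys, the frame-rate-based timing, and the T_LOITERING value are illustrative assumptions; only the INIT to "Stay" to loitering structure follows the description.

```python
T_LOITERING = 60.0  # seconds; the actual threshold is application dependent


def make_loitering_fsm(fps: float) -> "EventFSM":
    """INIT -> Stay -> LOITERING; built on the EventFSM sketch above."""
    fsm = EventFSM(initial="INIT", accepting="LOITERING")
    elapsed = {"t": 0.0}

    def enters_stay(cues: dict) -> bool:
        return cues["is_human"]            # person present in the monitored scene

    def stays_too_long(cues: dict) -> bool:
        elapsed["t"] += 1.0 / fps          # accumulate time while the person stays
        return elapsed["t"] > T_LOITERING

    fsm.add_edge("INIT", enters_stay, "Stay")
    fsm.add_edge("Stay", stays_too_long, "LOITERING")
    return fsm
```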
  • this event also involves one object, a person. It is defined as an object becoming completely static with a duration t > T_Static.
  • the FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion.
  • the second intermediate state of the FSM is "S", which indicates a person becoming and staying static, or completely motionless. There are two conditions for the transition from state "M" to state "S":
  • in state "S", a time counter t is continuously incremented as new frames come in.
  • the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of an unconscious person include a sleeping or fainted person.
  • similar conditions can be used to detect e.g. a vehicle staying beyond the allowed time in a zone for short stopping, in which case the object of interest is a vehicle instead of a person.
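A corresponding sketch for the static-object event (unconscious person or overstaying vehicle), again reusing the EventFSM class above; the motion and speed thresholds and the T_STATIC value are illustrative assumptions.

```python
T_STATIC = 120.0  # seconds; illustrative value


def make_static_object_fsm(fps: float) -> "EventFSM":
    """INIT -> M (moving) -> S (static) -> UP (unconscious person / overstaying vehicle)."""
    fsm = EventFSM(initial="INIT", accepting="UP")
    static_time = {"t": 0.0}

    def recognised(cues: dict) -> bool:
        return cues["is_human"]            # swap in a vehicle classifier for the vehicle case

    def becomes_static(cues: dict) -> bool:
        if cues["motion"] < 0.1 and cues["speed"] < 0.5:   # thresholds are illustrative
            static_time["t"] = 0.0
            return True
        return False

    def static_too_long(cues: dict) -> bool:
        static_time["t"] += 1.0 / fps
        return static_time["t"] > T_STATIC

    fsm.add_edge("INIT", recognised, "M")
    fsm.add_edge("M", becomes_static, "S")
    fsm.add_edge("S", static_too_long, "UP")
    return fsm
```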
  • This event as defined in the example implementation involves two objects.
  • the FSM is initialized for each new object.
  • when the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects.
  • the FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. If the owner leaves the scene covered by the camera, the FSM transits from state "Station" to state "UO" and an 'Unattended Object' event is declared.
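An illustrative FSM for the unattended-object event, reusing the EventFSM sketch above; the cue keys (separated_from_owner, owner_in_scene) are assumed names for the perceptual features described in the text.

```python
def make_unattended_object_fsm() -> "EventFSM":
    """INIT -> Station (deposited object bound to its owner) -> UO (unattended object)."""
    fsm = EventFSM(initial="INIT", accepting="UO")

    def deposited(cues: dict) -> bool:
        # a small object separated from a larger moving object and now static
        return cues["separated_from_owner"] and cues["motion"] < 0.1

    def owner_left_scene(cues: dict) -> bool:
        return not cues["owner_in_scene"]

    fsm.add_edge("INIT", deposited, "Station")
    fsm.add_edge("Station", owner_left_scene, "UO")
    return fsm
```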
  • This event as defined in the example implementation involves three objects.
  • the FSM is initialized for each new object. Similar to the event of unattended object, when the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects.
  • the FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. However, when the object disappears because another object has taken it while the owner still stays in the scene, the FSM transits from state "Station" to state "Theft" and a 'Theft' event is declared; meanwhile, the second person is identified as the potential thief.
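A corresponding sketch for the theft event; again the cue keys are assumed names, and only the INIT to "Station" to "Theft" structure follows the description.

```python
def make_theft_fsm() -> "EventFSM":
    """INIT -> Station -> Theft: the object disappears because another object took it
    while its owner is still in the scene."""
    fsm = EventFSM(initial="INIT", accepting="Theft")

    def deposited(cues: dict) -> bool:
        return cues["separated_from_owner"] and cues["motion"] < 0.1

    def taken_by_another(cues: dict) -> bool:
        return (cues["object_disappeared"]
                and cues["owner_in_scene"]
                and cues["taker_id"] is not None)

    fsm.add_edge("INIT", deposited, "Station")
    fsm.add_edge("Station", taken_by_another, "Theft")
    return fsm
```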
  • the method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.
  • the computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
  • the computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
  • the computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922.
  • the computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
  • the components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
  • the application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930.
  • the application program is read and controlled in its execution by the processor 918.
  • Intermediate storage of program data may be accomplished using RAM 920.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

A method and system for multi-object tracking in a video signal. The method comprises the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.

Description

Method And System For Multi-Object Tracking
FIELD OF INVENTION
The present invention relates broadly to a method and system for multi-object tracking in a video signal, in particular a video signal captured by a fixed camera, and to a data storage medium having stored thereon computer code means for instructing a computer system to execute a method for multi-object tracking in a video signal.
BACKGROUND
Real-time tracking of objects of interest in image sequences is one of the challenging problems in video understanding. It is an essential part of many computer vision applications, such as video surveillance, media analysis, human computer interfaces, and video compression (e.g., object-based coding in MPEG-4). Many methods have been proposed for appearance-based visual object tracking in image sequences. Generally speaking, three components are included in the processing: target representation, motion prediction, and object matching. First, a model is used to characterize the distinctive appearance features of the target object. These appearance features of the target object should be consistent and discriminative with respect to other objects through the image sequence. Existing models include blobs of homogeneous intensities or colors, feature points, contours, templates, color histograms and joint color-spatial distributions of the object regions.
Inter-frame motion of target objects is predicted by using dynamic models or motion models. The popular dynamic models are Kalman filters and particle filters. The motion of a rigid object can be estimated by using an explicit motion model such as a geometric transformation, while the motion of non-rigid objects can be computed with an implicit motion model, e.g., mean-shift. An observation model is used to evaluate the matching between the target and its candidate location in the coming frame. The new position of the target is determined as the mean location of the predicted candidates (EAP: expected a posteriori) if a dynamic model is employed, or the location of the maximum matching value (MAP: maximum a posteriori) when a motion model is used.
The MAP methods are considered as deterministic tracking since a gradient descent algorithm, e.g., a mean-shift algorithm, is commonly used to find the maximum, and the EAP methods are classified as stochastic tracking since random sampling in a time series state space is required. Roughly speaking, stochastic tracking is more robust than deterministic tracking since it is less likely to be trapped in local extrema; however, it is more computationally expensive.
While tracking a single object in cluttered environments has been largely successful, tracking a varying number of non-rigid objects in crowds remains a challenging task. Among many others, there are three major difficulties for solving this problem. First, when objects merge into a group, the visual features for each object become ambiguous and uncertain. The distant objects can be occluded partially or even completely by the close objects. Second, the appearance of target objects may change drastically when they are in crowds due to the changes of poses and scales. For example, one standing person may sit down when he is partially occluded by another person. Third, the motion modes of target objects may change significantly in crowds, e.g., several separated persons may gather into a group, stay together for a while and then separate in different directions.
There has been an increasing number of proposals on multi-object tracking in this decade. The existing methods can roughly be classified into three categories: multiple instantiations of single trackers, multi-camera cooperation, and extended particle filters. A simple solution to the problem is to build multiple instantiations of single trackers for individual objects. Such methods perform well in the presence of simple occlusions with the help of a Kalman filter and specified strategies to interpret the overlaps. In some proposals explicit segmentation of objects in a group using template matching is performed before tracking. Object models, such as templates or color blobs of the head and the upper and lower parts of the human body, are learned for each isolated object. The segmentation of individuals in a group becomes more difficult when large variations of poses, scales, and motion patterns of objects are involved in the interaction.
Being aware of the difficulties caused by occlusion and overlapping, multi-camera systems are proposed. With a proper placement of multiple cameras, occlusion can be alleviated assuming at least one camera may capture a better view of each target object. The challenge for a multi-camera system is the calibration and the cooperation of multiple cameras to achieve a consistent tracking of each object since the views from different cameras can be very different.
Over the last few years, particle filters have been shown to be powerful tools for single object tracking. Attempts have also been made to extend particle filters to multi-object tracking. Hue et al extended the state space by concatenating the state vectors of fixed M objects. The likelihood measure is estimated under the assumption of conditional independence. The association for each component is assigned by a Gibbs sampler. Vermaak et al introduced a mixture of particle filters (MPF) where each component (object) is described by a cluster of particle filters that form a part of the mixture. The filters in the mixture interact only through the computation of the importance weights. Okuma et al further developed a boosted particle filter (BPF) from MPF. In BPF, the proposal distribution for each object (hockey player) is estimated by integrating the object detection and the dynamic models.
To avoid the convergence of multiple modes of occluded objects into one cluster of the front object in the case of overlapping, MacCormick and Blake proposed the exclusion principle in the likelihood estimation. In their work, the state is extended to a 2.1D model which contains a label of depth order. The likelihood measures for all the depth configurations between two overlapping objects are explicitly derived. A strategy of partition sampling is performed for all of the configurations. A similar approach has been used to estimate the likelihoods based on color histograms of two overlapping objects.
Considering the difficulties in deducing the occlusion relationship of multiple objects from images, the introduction of 3D information about the background and target objects has been proposed. In one such proposal, a calibrated camera and a generalized-cylinder or 3D ellipsoid model of a standing human object are used, and the likelihood for a hypothesis configuration of multiple persons in a group is evaluated according to the 2D projections of their 3D positions on the ground plane. These methods successfully tracked multiple persons walking in crowds without large pose variations. However, the sufficient sampling of all the possible 3D configurations of multiple objects may lead to a significant increase in the number of particle filters needed to obtain a proper distribution. When multiple objects gather together with various complex occlusions, the likelihoods of observing different objects are not independent. A few methods have sought to address this problem under a general mathematical framework and almost all of them are stochastic-based methods. A main disadvantage of these stochastic methods is their intensive computation which makes them difficult to use for real-time video surveillance applications.
A need therefore exists to provide a method and system for multi-object tracking in a video signal that seek to address at least one of the above mentioned disadvantages.
SUMMARY
In accordance with a first aspect of the present invention there is provided a method of multi-object tracking in a video signal; the method comprising the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
Step a) may comprise calculating visible portions of the respective corresponding objects in the first image to derive the estimated depths of the respective corresponding objects.
Step b) may comprise assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node based on a posterior probability evaluated by Bayes rule. A posterior probability evaluated by Bayes rule may be based on principal colour representations (PCRs) of the corresponding object and said one child node respectively.
Step c) may comprise removing the visual content of the assigned corresponding object from the visual data associated with said one child node based on PCRs of the assigned corresponding object and said one child node respectively.
Step d) may comprise calculating visible portions of the respective corresponding objects in the second image to derive the estimated depths of the respective corresponding objects.
Step e) may comprise applying the mean-shift calculation to locate the corresponding object having the lowest depth in said each child node based on gravity centres of pixels of each principal colour component in a PCR of the corresponding object in the second image.
Step g) may comprise removing the updated visual content of the located corresponding object from the visual data associated with said each child node based on PCRs of the located corresponding object and said each child node respectively.
The method may further comprise storing tracking data including the updated status and visual content of each corresponding object for a series of consecutive frames and detecting an event in the video signal based on the stored tracking data.
The method may further comprise the step of for each parent node having no child node, deleting the corresponding object.
The method may further comprise the step of for each parent node having only one child node, assigning all corresponding objects to said one child node.
The method may further comprise the step of for each child node having no corresponding object assigned thereto, check whether the object has disappeared, and if not, set a new corresponding object to said each child node. The method may further comprise the step of for each child node having only one corresponding object assigned thereto, update the state and visual content of said one corresponding object from the visual data associated with said each child node.
In accordance with a second aspect of the present invention there is provided a multi-object tracking module for multi-object tracking in a video signal; the module comprising means for receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; means for generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and the means for generating, for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then, for each child node having only one corresponding object assigned thereto, updating a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
In accordance with a third aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of multi-object tracking in a video signal; the method comprising the steps of receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, update a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
In accordance with a fourth aspect of the present invention there is provided a method of stationary object tracking in a video signal, the method comprising the steps of determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
The generating of the template image may be based on image data within a bounding box in the at least one of the frames of the sequence for the tracked object.
The tracking of the state may comprise the steps of determining a first difference measure between the template image and a corresponding region in the current frame; determining a second difference measure between respective corresponding regions in the current frame and a preceding frame; determining a visibility measure of the stationary object from the corresponding region in the current frame.
The tracking of the state may further comprise determining whether another tracked moving object overlaps the corresponding region in the current frame.
The tracking of the state may further comprise the steps of determining a motionless state if the first and second difference measures are below a first threshold value over a sequence of τp current frames; determining an occluded state if the first and second difference measures each exceed a second threshold value and the visibility measure falls below a third threshold value over the sequence of τp current frames, and another tracked moving object overlaps the corresponding region in the current frame; determining a removed state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure falls below the third threshold value over the sequence of τp current frames, and no tracked moving object overlaps the corresponding region in the current frame; determining an inner-motion state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure is above a fourth threshold value; and determining the start moving state if the first and second difference measures exceed and remain above the second threshold value and the visibility measure exceeds and remains above the fourth threshold value over the sequence of τp current frames, and a spatial shift of the tracked stationary object is detected.
The visibility measure may be determined based on principal colour representation.
The first and second difference measures may be determined based on a knowledge base of human perceived semantic meanings, an evaluation from real-world videos, or both.
In accordance with a fifth aspect of the present invention there is provided a system for object tracking in a video signal, the system comprising means for determining that a tracked moving object has become stationary over a sequence of frames; means for generating a template image of the stationary object based on at least one of the frames of the sequence; means for tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and means for switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
In accordance with a sixth aspect of the present invention there is provided a data storage medium having stored thereon computer code means for instructing a computer system to execute a method of object tracking in a video signal, the method comprising the steps of determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.

BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
Figure 1 shows a series of images illustrating adaptive background subtraction using the background updating method and system of the example embodiments.
Figure 2 shows a flow chart illustrating a method of context-based background updating for adaptive background subtraction in the example embodiment.
Figure 3 shows a series of images and histograms illustrating principal colour representation (PCR) in the example embodiment.
Figure 4 shows a schematic drawing illustrating directed acyclic graphs (DAGs) for regions in consecutive frames in the example embodiment.
Figure 5 shows a flow chart illustrating a method of multi-object tracking in a video signal in the example embodiment.
Figure 6 shows a flow chart illustrating a method of stationary object tracking in a video signal in the example embodiment.
Figure 7 shows a schematic drawing of an event detection system implementation using the example embodiment.
Figure 8 shows a graph illustrating a finite state machines (FSM) representation for event detection in the system implementation of Figure 7.
Figure 9 shows a schematic drawing of a computer system for implementing the example embodiment.
DETAILED DESCRIPTION
The described embodiment provides a novel 2.5D method of multi-object tracking for real-time video surveillance. An appearance model, principal color representation (PCR), is applied to multi-object tracking. The PCR model characterizes the appearance of an object or a region with a few most significant colors. The likelihood of observing a tracked object in a foreground region is derived according to their PCRs. Based on Bayesian estimation theory, multi-object tracking is formulated as a Maximum A Posteriori (MAP) problem over all the tracked objects. With the foreground regions provided by background subtraction, the problem of multi-object tracking is decomposed into two subproblems: assignment and location. By exploiting the fact that close and unoccluded objects have richer visual information than distant or occluded ones, sequential solutions to the subproblems which process the objects in a group from the most visible to the least visible ones are derived according to the likelihoods estimated based on PCR. In the assignment step, each tracked object is assigned to a foreground region in the coming frame. When an object is assigned, its visual information will be excluded from the PCR of the region.
In the location step, multiple objects assigned to one region are located one-by-one according to their depth order. A two-phase mean-shift algorithm based on PCR is derived for locating objects. When an object is located, its visual information is excluded from the new position in the region. The operation of exclusion at the end of each iteration for assignment and location in the example embodiment can avoid multiple objects being trapped into the same region or position.
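The assignment step with depth ordering and exclusion can be illustrated with a deliberately simplified sketch in which a PCR is reduced to a multiset of quantized colors and the PCR-based posterior is replaced by a histogram-intersection score; this illustrates the ordering-and-exclusion idea rather than the patent's likelihood model.

```python
from collections import Counter


def exclude(region_pcr: Counter, obj_pcr: Counter) -> None:
    """Remove an object's principal-color evidence from a region's PCR."""
    for color, count in obj_pcr.items():
        region_pcr[color] = max(0, region_pcr[color] - count)


def likelihood(obj_pcr: Counter, region_pcr: Counter) -> float:
    """Fraction of the object's color evidence still present in the region."""
    total = sum(obj_pcr.values()) or 1
    matched = sum(min(count, region_pcr[color]) for color, count in obj_pcr.items())
    return matched / total


def assign_objects(objects: list, region_pcrs: dict) -> dict:
    """Assign tracked objects to foreground regions from the most visible
    (lowest depth) to the least visible, excluding each assigned object's
    colors from the chosen region before handling the next object."""
    assignment = {}
    for obj in sorted(objects, key=lambda o: o["depth"]):
        best = max(region_pcrs, key=lambda r: likelihood(obj["pcr"], region_pcrs[r]))
        assignment[obj["id"]] = best
        exclude(region_pcrs[best], obj["pcr"])
    return assignment
```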
Some portions of the description which follows are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as "calculating", "determining", "excluding", "generating", "assigning", "locating", or the like, refer to the action and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
The present specification also discloses apparatus for performing the operations of the methods. Such apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate. The structure of a conventional general purpose computer will appear from the description below.
In addition, the present specification also implicitly discloses a computer program, in that it would be apparent to the person skilled in the art that the individual steps of the method described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
Furthermore, one or more of the steps of the computer program may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer readable medium may also include a hard-wired medium such as exemplified in the Internet system, or wireless medium such as exemplified in the GSM mobile telephone system. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the preferred method.
The invention may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the system can also be implemented as a combination of hardware and software modules.
The distinctive background objects (regions) in the example embodiment are classified into two categories:
Type-1 CBR: a facility for the public in the scene;
Type-2 CBR: a large homogeneous region.
Contextual descriptors are developed to characterize the distinctive appearances of CBRs and evaluate the likelihoods of observing them. Different contextual background regions may have different appearance features. Some manifest significant structural features, while others may have homogeneous color distributions. The example embodiment employs Orientation Histogram Representation (OHR) to describe the structural features of a region and Principal Color Representation (PCR) to describe the distribution of dominant colors. Let R_b^i be the i-th CBR in the empty scene I(x), and G(x) and O(x) be the gradient and orientation images of I(x), respectively. If the orientation values are quantized into 12 bins each covering 30°, the orientation histogram for R_b^i is defined as

H_b^i(k) = Σ_{x∈R_b^i} μ_T(G(x)) · δ_k(O(x)), k = 1, ..., 12    (1a)

where μ_T() is a binary function on the threshold T and δ_k() is a delta function defined as

δ_k(O(x)) = 1 if O(x) falls within the k-th orientation bin, and 0 otherwise    (2a)
The OHR H6 is a simple and efficient variant of the robust local descriptor SIFT [1] for real-time processes. It is less sensitive to illumination changes and slight shift of object position.
By scanning the region R_b^i, a table of the PCR for the region can be obtained. The PCR for R_b^i is defined as

T_b^i = { ρ_i, { E_i^k = (c_i^k, p_i^k) }_{k=1}^{N_i} }    (3a)

where ρ_i is the size of R_b^i, c_i^k is the k-th most significant color of R_b^i and p_i^k is its significance value. The significance value is computed by

p_i^k = Σ_{x∈R_b^i} δ(c_i^k, I(x))    (4a)

where δ(c1, c2) is a delta function. It equals 1 when the color distance d(c1, c2) is smaller than a small threshold ε, otherwise it is 0. The color distance used here is

d(c1, c2) = 1 - 2⟨c1, c2⟩ / (‖c1‖² + ‖c2‖²)    (5a)

where ⟨·,·⟩ denotes the dot product [2, 3]. The principal color components E_i^k are sorted in descending order according to their significance values p_i^k. The first N_i components which satisfy Σ_{k=1}^{N_i} p_i^k ≥ 0.95ρ_i are used as the PCR of the region R_b^i, which means the principal colors in the PCR cover more than 95% of the colors from R_b^i. PCR is thus efficient to describe large regions of distinctive colors.
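A rough sketch of building a PCR table for a region is given below; quantizing the colors stands in for the color-distance threshold ε of equations (4a) and (5a), so this is an approximation rather than the exact construction.

```python
import numpy as np


def principal_color_representation(pixels: np.ndarray, quant: int = 32,
                                   coverage: float = 0.95) -> dict:
    """Build a PCR-like table for a region given its pixels as an (n, 3) RGB array:
    the most significant quantized colors, with their counts, covering at least
    `coverage` of the region's pixels."""
    q = (pixels // quant) * quant + quant // 2           # quantize each channel
    colors, counts = np.unique(q, axis=0, return_counts=True)
    order = np.argsort(-counts)                          # sort by significance, descending
    colors, counts = colors[order], counts[order]
    total = int(counts.sum())
    keep = int(np.searchsorted(np.cumsum(counts), coverage * total)) + 1
    return {"size": total, "colors": colors[:keep], "significance": counts[:keep]}
```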
A type-1 CBR in the example embodiment is associated with a facility which has a distinctive structure and colors in the image. Both OHR and PCR are used to characterize the type-1 CBR. Let R_b1^i be the i-th type-1 CBR in the scene. Its contextual descriptors are H_b1^i and T_b1^i. A type-1 CBR has just two states: occluded (occupied) or not. The likelihood of observing a type-1 CBR is evaluated on the whole region. Suppose the contextual descriptors of the region R_t(x) from the corresponding position of R_b1^i in the current frame I_t(x) are H_t and T_t. The likelihood of R_b1^i being exposed can be evaluated by matching R_t(x) to R_b1^i. Based on OHR, the matching of R_t(x) and R_b1^i is defined as

P_L(H_t | H_b1^i) = Σ_k min{ H_t(k), H_b1^i(k) } / Σ_k H_b1^i(k)    (6a)

If R_b1^i and R_t(x) are similar, P_L(H_t | H_b1^i) is close to 1; otherwise, it is close to 0.

Based on the PCR descriptors, the likelihood of R_t(x) belonging to R_b1^i is

P(T_t | T_b1^i) = Σ_k P(T_t | E_b1,i^k) · P(E_b1,i^k | T_b1^i)    (7a)

The second term in the sum is the weight of the principal color c_b1,i^k in the PCR of R_b1^i, i.e., P(E_b1,i^k | T_b1^i) = p_b1,i^k / ρ_b1,i. The first term is the likelihood based on the partition evidence of principal color c_b1,i^k. It is evaluated from the PCRs of R_b1^i and R_t(x) as

P(T_t | E_b1,i^k) = (1 / p_b1,i^k) · min{ p_b1,i^k, Σ_l δ(c_b1,i^k, c_t^l) p_t^l }    (8a)

Then there is

P(T_t | T_b1^i) = (1 / ρ_b1,i) Σ_k min{ p_b1,i^k, Σ_l δ(c_b1,i^k, c_t^l) p_t^l }    (9a)

P(T_b1^i | T_t) can be obtained in a similar way. Now the matching of R_b1^i and R_t(x) based on PCR is defined as

P_L(T_t | T_b1^i) = min{ P(T_t | T_b1^i), P(T_b1^i | T_t) }    (10a)

Assuming that the colors and the gradients are independent and different weights are used, the log likelihood of observing R_b1^i at time t is

L_b1^i,t = ω_s · log P_L(H_t | H_b1^i) + (1 - ω_s) · log P_L(T_t | T_b1^i)    (11a)

where ω_s = 0.6 is chosen empirically.
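The matching of an observed region against a type-1 CBR, combining OHR and PCR as in (11a), might be sketched as follows; the directional PCR likelihood here uses exact quantized-color matches instead of the color distance d(c1, c2), and the OHR matching follows the histogram-intersection form reconstructed in (6a), so both are simplifications.

```python
import numpy as np


def ohr_match(h_obs: np.ndarray, h_model: np.ndarray) -> float:
    """Histogram-intersection matching of two 12-bin orientation histograms."""
    denom = float(h_model.sum()) or 1.0
    return float(np.minimum(h_obs, h_model).sum()) / denom


def pcr_match(pcr_obs: dict, pcr_model: dict) -> float:
    """Symmetric PCR matching: the smaller of the two directional likelihoods."""
    def directed(a: dict, b: dict) -> float:
        lut = {tuple(c): int(s) for c, s in zip(b["colors"], b["significance"])}
        matched = sum(min(int(s), lut.get(tuple(c), 0))
                      for c, s in zip(a["colors"], a["significance"]))
        return matched / a["size"]
    return min(directed(pcr_obs, pcr_model), directed(pcr_model, pcr_obs))


def log_likelihood_type1(h_obs, h_model, pcr_obs, pcr_model,
                         w_s: float = 0.6, eps: float = 1e-6) -> float:
    """Weighted log likelihood of observing a type-1 CBR, in the spirit of (11a)."""
    return (w_s * np.log(ohr_match(h_obs, h_model) + eps)
            + (1.0 - w_s) * np.log(pcr_match(pcr_obs, pcr_model) + eps))
```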
The type-2 CBRs in the example embodiment are large homogeneous regions. Only the PCR descriptor is used for each of them. Usually only part of a type-2 CBR is occluded when a foreground object overlaps it. The likelihood of observing a type-2 CBR is therefore evaluated locally. Let T_b2^i = { ρ_b2,i, { E_b2,i^k = (c_b2,i^k, p_b2,i^k) }_k } be the PCR of the i-th type-2 CBR R_b2^i. Suppose R_t(x) is a small neighborhood centered at x, e.g., a 5 × 5 window. The likelihood of R_t(x) belonging to R_b2^i is defined as

P(R_t(x) | R_b2^i) = (1 / |R_t(x)|) Σ_{s∈R_t(x)} B(I_t(s) | R_b2^i)    (12a)

where |R_t(x)| is the size of the window and B(I_t(s) | R_b2^i) is a Boolean function defined as

B(I_t(s) | R_b2^i) = 1 if the color I_t(s) matches one of the principal colors c_b2,i^k in T_b2^i (i.e., the color distance to it is smaller than ε), and 0 otherwise    (13a)

The log likelihood that the pixel x in the current frame belongs to R_b2^i is

L_b2^i,t(x) = log P(R_t(x) | R_b2^i)    (14a)
The appearance model of a type-1 CBR in the example embodiment consists of its OHR and PCR. For the i-th type-1 CBR R_b1^i, the appearance model is defined as M_a(R_b1^i) = (H_b1^i, T_b1^i). The spatial model of R_b1^i is defined as its bounding box and the center point, i.e., M_s(R_b1^i) = (B_b1^i, x_b1^c,i).

To adapt to lighting changes from day to night, besides the active appearance model M_a(R_b1^i), a model base which contains up to K_b appearance models of R_b1^i is used. The models in the base are learned incrementally. The active appearance model is the one from the model base which best fits the current appearance of the CBR. The model base of R_b1^i is MB(R_b1^i) = { M_a^k(R_b1^i) }, k = 1, ..., K_i, with K_i ≤ K_b.
Natural lighting changes slowly and smoothly. Let D be a time duration of 3 to 5 minutes, not limiting, in the example embodiment (i.e. a long duration compared with the frame duration in the video signal). The times of observing the i-th type-1 CBR during the period are accumulated as

Z_b1^i,p = Σ_{t∈D} μ_TL1(L_b1^i,t)    (15a)

and the average of the likelihood values is

L̄_b1^i = (1 / Z_b1^i,p) Σ_{t∈D} L_b1^i,t · μ_TL1(L_b1^i,t)    (16a)

where L_b1^i,t > T_L1 means R_b1^i is visible at time t. If sufficient samples of R_b1^i have been observed during the previous (last) duration (e.g., Z_b1^i,p / D > 25%) and the average likelihood value is approaching the threshold T_L1 (e.g., L̄_b1^i < 0.8T_L1), a new appearance of R_b1^i may be observed. In the coming duration, a new appearance model M_a^c(R_b1^i) = (H_b1^c,i, T_b1^c,i) is obtained from a frame in which R_t(x) looks most like R_b1^i, i.e., t_c = arg max_{t∈D} L_b1^i,t. If the average likelihood values are low in two consecutive durations, the active appearance model is replaced. First, the new appearance model M_a^c(R_b1^i) is compared with the ones in the model base according to (11a). If one is sufficiently close to the new model (i.e. the similarity is larger than T_L1 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
Let T_b2^i be the PCR descriptor of the i-th type-2 CBR R_b2^i; the appearance model of R_b2^i is then defined as M_a(R_b2^i) = (T_b2^i). The spatial model of R_b2^i describes the range of the homogeneous region in the image. A binary mask U_b2^i(x) is used for it, i.e., M_s(R_b2^i) = (U_b2^i(x)). The spatial model may have to be adjusted in the initialization duration when sufficient samples have been observed according to the likelihood values.

Again, a model base is employed to deal with the appearance variations of the type-2 CBRs from day to night. The model base of the i-th type-2 CBR R_b2^i is MB(R_b2^i) = { M_a^k(R_b2^i) }, k = 1, ..., K_i, with K_i ≤ K_b. The models in the model base are learned incrementally through the time durations. First, at each time step t, the binary image of observed parts for R_b2^i is generated as V_b2^i,t(x) = μ_TL2(L_b2^i,t(x)). The overlapping ratio between the exposed parts and the spatial model for R_b2^i at time t is

r_b2^i,t = |V_b2^i,t ∩ U_b2^i| / |V_b2^i,t ∪ U_b2^i|    (17a)

where '∩' means intersection and '∪' means union. The larger the ratio is, the more parts of R_b2^i are exposed and the fewer pixels of other objects would be involved. At the end of each duration, the times of observing the large part of R_b2^i during the period is

Z_b2^i,p = Σ_{t∈D} μ_TH(r_b2^i,t)    (18a)

and the average similarity value between the observed parts and its active model can be computed as

S̄_b2^i = (1 / Z_b2^i,p) Σ_{t∈D} P_L(T_b2^t | T_b2^i) · μ_TH(r_b2^i,t)    (19a)

where P_L(T_b2^t | T_b2^i) is calculated according to (10a) with normalized PCRs and T_H = 75% is used. Like the operation for type-1 CBRs, if sufficient samples have been observed during the last duration (i.e., Z_b2^i,p / D > 25%) and the average similarity value is approaching the threshold T_L2 (e.g., S̄_b2^i < 0.8T_L2), a new appearance model M_a^c(R_b2^i) is generated from the current duration. If the average similarity values are low in two consecutive durations, the active appearance model will be replaced. If there is a model in the base which is close enough to the new appearance model M_a^c(R_b2^i) (i.e. the similarity is larger than T_L2 + ε), it will be used to replace the active model. Otherwise, the active model will be replaced by the new model. Meanwhile, the new model is also placed into the base. If the model base is already full, the oldest one is then replaced by the new model.
Let { R_b^i }, i = 1, ..., N_b, be the CBRs of a scene. Given a coming frame I_t(x) and a local region R_t(x) centered at x in I_t(x), the posterior probability of R_t(x) belonging to a CBR R_b^i is

P(R_b^i | R_t(x)) = P(R_t(x) | R_b^i) · P(R_b^i | x) / P(R_t(x))    (20a)

The prior probability P(R_t(x)) is the same for every pixel in an image. Then the log posterior probability of R_t(x) belonging to R_b^i in the current frame I_t(x) is defined as

Q_b^i(R_t(x)) = log P(R_t(x) | R_b^i) + log P(R_b^i | x)    (21a)
The position of a type-1 CBR is already determined by its spatial model. The prior probability P(R_b1^i | x) is 1 for the position and 0 otherwise. Then, the log posterior probability is equivalent to the log likelihood at the position, i.e., Q_b1^i,t = L_b1^i,t(R_t(x_b1^c,i)) = L_b1^i,t for R_b1^i. A rate of occluded times over recent frames for each type-1 CBR is used. For R_b1^i, the rate is computed as

r_b1^i,t = (1 - β) · r_b1^i,t-1 + β · o_b1^i,t    (22a)

where o_b1^i,t = 1 if R_b1^i is not observed at time t (i.e., Q_b1^i,t < T_L1) and 0 otherwise, β is a smooth factor, and β = 0.5 is chosen. A high rate value (close to 1) indicates that R_b1^i has been occluded in recent frames.
From the spatial model U_b2^i(x) of the i-th type-2 CBR R_b2^i, the prior probability of a pixel x belonging to the region R_b2^i can be defined as

P(R_b2^i | x) = U_b2^i(x)    (23a)

Combining (21a), (14a), and (23a), the log posterior probability that x is an exposed point of R_b2^i is

Q_b2^i,t(x) = L_b2^i,t(x) + log U_b2^i(x)    (24a)

A rate of occluded times over recent frames at each pixel for each type-2 CBR is used. First, to be robust to noise and the effect of boundaries, an occluded pixel of a type-2 CBR is confirmed on the local neighborhood R_t(x). Let r1 be the proportion of pixels belonging to R_b2^i in the neighborhood region, i.e.,

r1 = (1 / |R_t(x)|) Σ_{s∈R_t(x)} U_b2^i(s)    (25a)

and r2 be the proportion of exposed pixels of R_b2^i in R_t(x) according to the posterior estimates, i.e.,

r2 = (1 / |R_t(x)|) Σ_{s∈R_t(x)} μ_TQ(Q_b2^i,t(s))    (26a)

where T_Q is chosen as slightly lower than 2T_L2. Then, an occluded pixel of R_b2^i is confirmed if the majority of the pixels within its neighborhood are of R_b2^i and few of them are observed in the current frame. Now the rate is computed as

r_b2^i,t(x) = (1 - β) · r_b2^i,t-1(x) + β · o_b2^i,t(x)    (27a)

where o_b2^i,t(x) = 1 if the occluded pixel is confirmed (r1 is large and r2 ≤ T_H) and 0 otherwise, and where T_H = 15% is chosen in the example embodiment.
According to the result of the contextual interpretation, three learning rates can be applied at each pixel for different situations in the example embodiment:
Normal learning rate to exposed background pixels with small variations;
Low learning rate to occluded background pixels;
High learning rate to exposed background pixels with significant changes.
An image of control codes C_t(x) is used, where the value of C_t(x) is 0, 1, 2, or 3: values 1, 2, and 3 indicate that the low, normal, or high learning rate is applied respectively at the pixel, and 0 indicates the normal learning rate for non-context pixels (used for display). First, for the pixels not associated with any contextual background region, C_t(x) = 0 is set. The rest of C_t(x) is determined according to the results of contextual interpretation. For a pixel x within the i-th type-1 CBR R_b1^i, if r_b1^i,t ≥ 0.7, which means the CBR is being blocked by a foreground object, C_t(x) = 1 is set. Otherwise, if I_t(x) is detected as a background point by background subtraction, C_t(x) = 2 is set since the CBR is exposed and no significant appearance change is found; but if I_t(x) is detected as a foreground point by background subtraction, a high rate should be applied since the pixel is an exposed CBR point with significant appearance change, i.e., C_t(x) = 3. For a pixel of the i-th type-2 CBR R_b2^i, if r_b2^i,t(x) ≥ 0.7, which means the patch of the CBR is being occluded by a foreground object, C_t(x) = 1 is set. Otherwise, if I_t(x) is detected as a background point by background subtraction, C_t(x) = 2 is set for an exposed part of the type-2 CBR with no significant appearance change; but if I_t(x) is detected as a foreground point by background subtraction, C_t(x) = 3 is set for an exposed neighborhood of the type-2 CBR with significant appearance change.
To smooth the control code temporally at each pixel, four control code images are used. The first two are the previous and current control code images described above, i.e., C_t-1(x) and C_t(x), and the second two are the control codes actually applied for pixel-level background maintenance, i.e., C*_t-1(x) and C*_t(x). The control code applied to the current frame at pixel x is determined by the votes from the three other control codes C_t-1(x), C_t(x), and C*_t-1(x). If at least two of the three codes are the same, the control code with the most votes is selected. If the three codes are all different from each other, the normal learning rate is used, i.e., C*_t(x) = 2.
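A compact sketch of the per-pixel control codes and learning rates is shown below; the occlusion-rate threshold of 0.7 follows the text, while the numeric learning-rate values and the single occlusion-rate map used for both CBR types are illustrative simplifications.

```python
import numpy as np

LOW, NORMAL, HIGH = 0.0, 0.005, 0.01   # illustrative learning-rate values


def control_codes(in_cbr: np.ndarray, occluded_rate: np.ndarray,
                  is_foreground: np.ndarray) -> np.ndarray:
    """Per-pixel control code: 0 non-CBR (normal rate), 1 low, 2 normal, 3 high."""
    code = np.zeros(in_cbr.shape, dtype=np.uint8)                 # 0: pixel not in any CBR
    code[in_cbr & (occluded_rate >= 0.7)] = 1                     # CBR blocked by a foreground object
    code[in_cbr & (occluded_rate < 0.7) & ~is_foreground] = 2     # exposed CBR, no significant change
    code[in_cbr & (occluded_rate < 0.7) & is_foreground] = 3      # exposed CBR, significant change
    return code


def learning_rate(code: np.ndarray) -> np.ndarray:
    """Map control codes to the learning rate applied at each pixel."""
    return np.array([NORMAL, LOW, NORMAL, HIGH])[code]
```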
To evaluate the effect of context-controlled background maintenance on adaptive background subtraction, the example embodiment was applied to two existing methods of ABS. They are the methods based on Mixture of Gaussian (MoG) [4] and Principal Feature Representation (PFR) [2]. Hence, four methods, MoG, Context-Controlled MoG (CC MoG), PFR, and Context-Controlled PFR (CC PFR), were compared. In the test, the normal learning rate of the example embodiment as described above was set to the constant learning rate used for the existing methods of ABS. The high learning rate was set to double the normal learning rate and the low learning rate was set to zero. In Figure 1, the leftmost image 102 is a snapshot with manually cropped out contextual background regions e.g. 104, which are type-2 CBRs in this example. In the snapshot image 102, the type-2 CBRs are surrounded by polygon boundaries e.g. 106 of different colors. The second column 108 shows a sample frame from the sequence 110 and the corresponding ground truth 112 of the foreground. The rest of the images in the upper row 114 are: the segmented results by MoG 116, CC MoG (Context-Controlled MoG) 118, and the corresponding control image 120. The three images in the lower row 122 are the segmented results of PFR 124 and CC PFR (Context-Controlled PFR) 126, and the corresponding control image 128. In the control images 120, 128, the black regions e.g. 130 do not belong to any CBR, the gray regions e.g. 132 are exposed parts of the CBRs, and the white regions e.g. 134 are occluded parts of the CBRs.
According to the example embodiment, for pixels in the regions of exposed parts of the CBRs with no significant appearance changes, the normal learning rate is applied; for pixels in regions of occluded parts of the CBRs, the low learning rate is used. For pixels in regions of exposed parts of CBRs with significant changes (not applicable in the scene shown in Figure 1), the high learning rate would be used as described above. The scene in the image 102 is a meeting room with four marked type-2 CBRs for the table surface, the ground surface, wall surfaces, and the chair. In this sequence of 5250 frames, there were no overstaying objects or overcrowds. However, several people e.g. 138 kept moving around, staying somewhere for a while, and performing various activities. Therefore, the center parts of the scene were frequently occluded by persons. Using a constant learning rate in the unmodified ABS methods, some appearance features of the persons were learned into the background models, and then the background subtraction failed to extract the complete figures of the persons in the incoming frames (see images 116, 124). One example frame, Frame #102810, is displayed in Fig. 1.
By using context-controlled background maintenance of the example embodiment applied to the ABS methods, the persons were segmented satisfactorily (see images 118, 126). A quantitative evaluation on 12 frames sampled from the sequence every 200 frames started from Frame #101410 (empty frames were skipped) is listed in Table 1, where the metric value is defined as the ratio between the intersection and union of the ground truth and the segmented regions. According to [2], the performance is good if the metric value is larger than 0.5 and nearly perfect if the metric value is larger than 0.8. From Table 1, it can be seen that, by using the context-controlled background maintenance of the example embodiment applied to the existing ABS methods, the performance of adaptive background subtraction on situations of complex foreground activities can be improved significantly.
Table 1
[Table 1: quantitative evaluation (ratio of intersection to union of the ground truth and the segmented regions) for MoG, CC MoG, PFR, and CC PFR on the 12 sampled frames; the table is reproduced as an image in the original document.]
The contextual features of the example embodiment capture the global information. Such global information may not always lead to a precise segmentation in position, especially along boundary regions of objects. However, if fed with correct samples continuously, the pixel- level statistical models can be tuned to characterize the background appearance accurately at each pixel. Then the pixel-level background models can be used to preferably achieve a precise segmentation of foreground objects.
The example embodiment exploits contextual interpretation to control the pixel-level background maintenance for adaptive background subtraction. Experimental results show that the example embodiment can improve the performance of adaptive background subtraction at least in situations of high foreground complexity. Figure 2 shows a flow chart 200 illustrating a method of background updating for adaptive background subtraction in a video signal according to the example embodiment. At step 202, one or more contextual background representation types are defined. At step 204, an image of a scene in the video signal is segmented into foreground and background regions. At step 206, each background region is classified as belonging to one of the contextual background representation types. At step 208, an orientation histogram representation (OHR), a principal colour representation (PCR), or both, are determined for each background region. At step 210, a current image of the scene in the video signal is received. At step 212, it is determined whether respective pixels in image regions of the current image spatially corresponding to the background regions are occluded or exposed. At step 214, different learning rates are set for the adaptive background subtraction for pixels that are occluded and for pixels that are exposed respectively.
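As a rough illustration of step 214, a per-pixel learning-rate map could be derived from the control image along the following lines. This is a minimal sketch under assumed label values and an assumed normal learning rate; none of the names or constants come from the patent.

```python
import numpy as np

# Hypothetical labels for the control image (values chosen for illustration).
NOT_CBR, EXPOSED_UNCHANGED, EXPOSED_CHANGED, OCCLUDED = 0, 1, 2, 3

def learning_rate_map(control_image, normal_rate=0.005):
    """Map each pixel's contextual label to a learning rate for ABS updating."""
    rates = np.full(control_image.shape, normal_rate, dtype=float)
    rates[control_image == OCCLUDED] = 0.0                      # low rate: freeze the model
    rates[control_image == EXPOSED_CHANGED] = 2.0 * normal_rate  # high rate: adapt quickly
    # NOT_CBR and EXPOSED_UNCHANGED keep the normal rate.
    return rates

control = np.random.randint(0, 4, size=(4, 6))
print(learning_rate_map(control))
```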
While the described example embodiment started from manually cropped out contextual background regions in a snapshot, image segmentation and background object recognition for automatic initialization of contextual models may be performed in different embodiments.
In the example embodiment, principal color representation (PCR) is applied for efficient appearance-based multi-object tracking. In a video surveillance system, object tracking may be applied to a sequence of segmented images generated by background subtraction. In such a case, each segmented image may contain one or several isolated foreground regions. Further, each region may consist of one target object (e.g., a walking person) or a group of target objects (when objects overlap from the camera viewpoint). The example embodiment uses the principal color representation (PCR) for modeling and characterizing the appearance of target objects as well as of the segmented regions. For an image sequence captured at a natural public site, each image may contain one or several objects. These objects may overlap on some occasions. Further, the poses, scales, and motion modes of objects can change significantly during the overlap. It has been recognized by the inventors that these issues make shape-based object tracking a rather challenging task. However, the inventors have recognized that it is much less likely that a target object changes colors in a sequence from a surveillance camera. Hence, using global color features of an individual object can provide a relatively stable and constant way of describing object appearance. This can also lead to a better discrimination of multiple target objects in the scene.
In video surveillance, an object of interest (e.g., a person, vehicle, luggage, etc.) may render a few dominant colors which only span a small portion of the entire color space. Let the $n$-th foreground region detected from the frame at time $t$ be $R_t^n(\mathbf{x})$, where $\mathbf{x} = (x, y)^T$ denotes the position of a pixel in the region. Then the corresponding principal color representation (PCR) can be defined as

$T_t^n = \left\{ S_n,\; \{ E_n^i = (\mathbf{c}_n^i, s_n^i) \}_{i=1}^{N} \right\}$   (1)

where $S_n$ is the size of the region (or the total number of pixels within the region), $\mathbf{c}_n^i = (r_n^i, g_n^i, b_n^i)^T$ is the RGB value of the $i$-th most significant color under the original color resolution (i.e., 256 levels for each channel), and $s_n^i$ is the significance of $\mathbf{c}_n^i$ for the region. The components $E_n^i$ are sorted in descending order according to the significance values of the principal colors. Let the current frame of input color images be $\mathbf{I}_t(\mathbf{x})$; then the significance of the $i$-th principal color can be defined as

$s_n^i = \sum_{\mathbf{x} \in R_t^n} \omega(\mathbf{x})\, \delta(\mathbf{I}_t(\mathbf{x}), \mathbf{c}_n^i)$   (2)

where $\omega(\mathbf{x})$ is a weight function and $\delta(\cdot,\cdot)$ is a delta function. In the example embodiment, $\omega(\mathbf{x}) = 1$ is chosen for isolated objects or regions. When locating an object in a group, $\omega(\mathbf{x})$ may not be equal to 1. If necessary, other weight functions can be used, e.g. a Gaussian kernel to suppress the noise around the object's boundary [5]. $\delta(\mathbf{c}_1, \mathbf{c}_2)$ equals 1 when $\mathbf{c}_1 = \mathbf{c}_2$, and 0 otherwise. However, in the example embodiment a color distance is used which is not sensitive to noise and illumination changes:
$d(\mathbf{c}_1, \mathbf{c}_2) = 1 - \dfrac{2\,\langle \mathbf{c}_1, \mathbf{c}_2 \rangle}{\|\mathbf{c}_1\|^2 + \|\mathbf{c}_2\|^2}$   (3)

where $\langle\cdot,\cdot\rangle$ denotes the dot product. The color distance in (3) is then applied to compute the delta function in (2) as

$\delta(\mathbf{I}_t(\mathbf{x}), \mathbf{c}_n^i) = \begin{cases} 1, & \text{if } d(\mathbf{I}_t(\mathbf{x}), \mathbf{c}_n^i) < \varepsilon \\ 0, & \text{otherwise} \end{cases}$   (4)

where $\varepsilon = 0.005$ is chosen in the example embodiment.
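A minimal sketch of the color distance of Eq. (3) and the thresholded delta function of Eq. (4) follows; the helper names are illustrative and the colors are assumed to be RGB triplets.

```python
import numpy as np

def color_distance(c1, c2):
    """Normalized color distance of Eq. (3): small for similar colors."""
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    denom = np.dot(c1, c1) + np.dot(c2, c2)
    if denom == 0:
        return 0.0  # both colors are black
    return 1.0 - 2.0 * np.dot(c1, c2) / denom

def delta(c1, c2, eps=0.005):
    """Delta function of Eq. (4): 1 if the two colors match, 0 otherwise."""
    return 1 if color_distance(c1, c2) < eps else 0

print(color_distance([200, 40, 40], [205, 42, 38]), delta([200, 40, 40], [205, 42, 38]))
```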
The PCR $T_t^n$ contains the first $N$ significant colors and their statistics for the region $R_t^n(\mathbf{x})$. Since a region of one or a few objects manifests only a few dominant colors, it is possible to find a small number $N$ to approximate the color features of the region, i.e.,

$\sum_{i=1}^{N} s_n^i \approx S_n$   (5)
In the example embodiment, using $N = 50$ in (5) led to satisfactory results for almost all the regions containing one object or a group of objects. Figure 3 shows two examples of PCRs, where one image 300 contains two isolated individuals and another image 302 contains a group of 5 persons. The PCRs for the foreground regions are generated by scanning the respective regions, and are shown in the histograms 312, 314 respectively. Details of the algorithm for the foreground region $R_t^n(\mathbf{x})$ (see white areas e.g. 304, 306 in the segmented images 308, 310 respectively) are summarized in Table 2.
TABLE 2: THE ALGORITHM TO GENERATE THE PCR FOR REGION $R_t^n(\mathbf{x})$
(The table is reproduced as an image in the original document.)
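Because Table 2 is reproduced only as an image, the following is a hedged sketch of a PCR-building routine in the spirit described above: the region is scanned once, each pixel is matched against the colors collected so far using the distance of Eq. (3), and the N most significant colors are kept. Variable names and the dictionary layout are assumptions.

```python
import numpy as np

def _color_distance(c1, c2):
    """Color distance of Eq. (3)."""
    d = np.dot(c1, c1) + np.dot(c2, c2)
    return 0.0 if d == 0 else 1.0 - 2.0 * np.dot(c1, c2) / d

def build_pcr(image, mask, eps=0.005, n_colors=50):
    """Build a principal color representation for the region given by a boolean
    mask: region size plus the N most significant (color, significance) pairs."""
    colors, counts = [], []
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        c = image[y, x].astype(float)
        for i, ci in enumerate(colors):
            if _color_distance(ci, c) < eps:
                counts[i] += 1
                break
        else:                                    # no existing principal color matched
            colors.append(c)
            counts.append(1)
    order = np.argsort(counts)[::-1][:n_colors]  # keep the N most significant colors
    return {"size": int(mask.sum()),
            "colors": [colors[i] for i in order],
            "signif": [int(counts[i]) for i in order]}

# Toy usage: a patch with two dominant colors.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (200, 30, 30); img[:, 2:] = (20, 20, 180)
print(build_pcr(img, np.ones((4, 4), dtype=bool)))
```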
The aim of object tracking is to locate a tracked object in the coming frame according to its previous appearance. To achieve this, the likelihood, or the conditional probability, of observing the tracked object in a region of the current frame has to be evaluated. In the following, the likelihoods are first defined based on the original and normalized PCRs of the tracked object and a region. This is then extended to the scale-invariant likelihood.
Let $O_{t-1}^m$ be the $m$-th tracked object described by its PCR $T_{t-1}^m = \{ S_m, \{ E_m^i = (\mathbf{c}_m^i, s_m^i) \}_{i=1}^{N} \}$, obtained previously when the tracked object was an isolated object, and let $R_t^n$ be the $n$-th foreground region detected at time $t$. According to the law of total probability, the likelihood of the object $O_{t-1}^m$ in the region $R_t^n$ can be defined as

$P(R_t^n \mid O_{t-1}^m) = \sum_{i=1}^{N} P(R_t^n \mid E_m^i)\, P(E_m^i \mid O_{t-1}^m)$   (6)

where each $P(R_t^n \mid E_m^i)$ is the likelihood of the object $O_{t-1}^m$ appearing in the region $R_t^n$ based on the partial evidence $E_m^i$, and $P(E_m^i \mid O_{t-1}^m)$ is the conditional probability of the evidence $E_m^i$ given the object $O_{t-1}^m$. Using the PCR $T_{t-1}^m$ of the object $O_{t-1}^m$, the conditional probability $P(E_m^i \mid O_{t-1}^m)$ can be defined as the weight of the principal color $\mathbf{c}_m^i$ in its appearance,

$P(E_m^i \mid O_{t-1}^m) = \dfrac{s_m^i}{S_m}$   (7)
Using the PCR $T_{t-1}^m$ of the object $O_{t-1}^m$ and the PCR $T_t^n$ of the region $R_t^n$, the likelihood $P(R_t^n \mid E_m^i)$ can be computed from the significance values of the same color component $\mathbf{c}_m^i$ in the two PCRs,

$P(R_t^n \mid E_m^i) = \min\!\left( 1,\; \dfrac{\hat{s}_n^i}{s_m^i} \right), \qquad \hat{s}_n^i = \sum_{j:\, d(\mathbf{c}_n^j,\, \mathbf{c}_m^i) < \varepsilon} s_n^j$   (8)

where $\hat{s}_n^i$ is the accumulated significance of the principal colors of the region $R_t^n$ that match the color $\mathbf{c}_m^i$.
Substituting (7) and (8) into (6) yields
$P(R_t^n \mid O_{t-1}^m) = \dfrac{1}{S_m} \sum_{i=1}^{N} \min\left( s_m^i,\; \hat{s}_n^i \right)$   (9)
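A hedged sketch of evaluating the likelihood of Eq. (9) from two PCRs stored as above follows (an optional flag gives the normalized variant discussed next); the matching rule for color components is an assumption.

```python
import numpy as np

def _dist(c1, c2):
    d = np.dot(c1, c1) + np.dot(c2, c2)
    return 0.0 if d == 0 else 1.0 - 2.0 * np.dot(c1, c2) / d

def pcr_likelihood(obj_pcr, reg_pcr, eps=0.005, normalized=False):
    """Likelihood of observing the object in the region from their PCRs:
    for each object color, accumulate the matched significance in the region
    and cap it by the object's own significance (Eq. (9) / normalized variant)."""
    s_obj, s_reg = obj_pcr["size"], reg_pcr["size"]
    total = 0.0
    for c_m, s_m in zip(obj_pcr["colors"], obj_pcr["signif"]):
        matched = sum(s_n for c_n, s_n in zip(reg_pcr["colors"], reg_pcr["signif"])
                      if _dist(np.asarray(c_m, float), np.asarray(c_n, float)) < eps)
        if normalized:
            total += min(s_m / s_obj, matched / s_reg)
        else:
            total += min(s_m, matched)
    return total if normalized else total / s_obj
```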
It is noted that the above likelihood (9) is evaluated under the assumption that the size variation of the object is small. However, if the size change is large the likelihood value will be affected. Therefore, a definition of likelihood based on the normalized PCRs is used in the example embodiment. Let
the normalized PCRs be $\bar{T}_{t-1}^m = \left\{ 1, \{ \bar{E}_m^i = (\mathbf{c}_m^i, \bar{s}_m^i) \}_{i=1}^{N} \right\}$ and $\bar{T}_t^n = \left\{ 1, \{ \bar{E}_n^i = (\mathbf{c}_n^i, \bar{s}_n^i) \}_{i=1}^{N} \right\}$, where $\bar{s}_m^i = s_m^i / S_m$ and $\bar{s}_n^i = s_n^i / S_n$. The likelihood based on the normalized PCRs becomes

$\bar{P}(R_t^n \mid O_{t-1}^m) = \sum_{i=1}^{N} \min\left( \bar{s}_m^i,\; \hat{\bar{s}}_n^i \right)$   (10)

where $\hat{\bar{s}}_n^i$ is the accumulated normalized significance of the principal colors of $R_t^n$ that match $\mathbf{c}_m^i$.
If the region $R_t^n$ only contains a single object, the likelihood based on the normalized PCRs is more accurate than that based on the original PCRs. However, if $O_{t-1}^m$ is one of the objects in the group $R_t^n$, the likelihood based on the original PCRs is better. Hence, the scale-invariant likelihood of observing a given object $O_{t-1}^m$ in the region $R_t^n$ is defined as

$P^s(R_t^n \mid O_{t-1}^m) = \max\left\{ P(R_t^n \mid O_{t-1}^m),\; \bar{P}(R_t^n \mid O_{t-1}^m) \right\}$   (11)
Heuristically, if the region $R_t^n$ is a single object, the likelihood computed from the normalized PCRs appears more reliable. However, if $O_{t-1}^m$ is one of the objects in a group $R_t^n$, the likelihood from the un-normalized PCRs appears better since the object is smaller than the group. Equation (11) can provide a suitable measurement for both of these cases in the example embodiment. Object tracking in video surveillance aims at maintaining a unique identification (id) for each target object and providing its time-dependent positions in the scene. When multiple target objects frequently merge and separate from one another at a public site, tracking one individual object is no longer an independent process. Multi-object tracking can be formulated as a global Maximum A Posteriori (MAP) problem for all the tracked objects. With the segmented foreground regions provided by background subtraction, in the example embodiment the global MAP problem can be approximately decomposed into two subproblems: assignment and location. Using the principal color representation (PCR) and the associated likelihood function, the example embodiment uses sequential solutions to these two subproblems, as detailed below.
Let $O_{t-1} = \{ O_{t-1}^m \}_{m=1}^{M}$ be the set of tracked target objects in the previous frame $\mathbf{I}_{t-1}(\mathbf{x})$, and $\Theta_{t-1} = \{ \theta_{t-1}^m \}_{m=1}^{M}$ be the set of state parameters describing their positions at time $t-1$. The task of multi-object tracking is to estimate the states $\Theta_t$ of the tracked objects in the current frame $\mathbf{I}_t(\mathbf{x})$ given their previous appearance models $O_{t-1}$ and states $\Theta_{t-1}$. This can be formulated as the Maximum A Posteriori (MAP) estimation of the state parameters $\Theta_t$:

$\Theta_t^* = \arg\max_{\Theta_t} P(\Theta_t \mid \mathbf{I}_t(\mathbf{x}), O_{t-1}, \Theta_{t-1})$   (12)
When several objects overlap one another, they cannot be tracked as independent objects. With the foreground regions $\mathbf{R}_t = \{ R_t^n \}_{n=1}^{L_t}$ provided by background subtraction, (12) can be simplified as

$\Theta_t^* = \arg\max_{\Theta_t} P(\Theta_t \mid \mathbf{R}_t, O_{t-1}, \Theta_{t-1})$   (13)

where the tracked objects $\{ O_{t-1}^m \}_{m=1}^{M}$ are in the regions $\{ R_{t-1}^n \}_{n=1}^{L_{t-1}}$
from the previous frame. If a region (e.g., $R_{t-1}^n$) only contains one object (e.g., $O_{t-1}^m$) it is an isolated object, otherwise the region is a group. Objects belonging to different regions in the previous frame may merge into a new group region (e.g., $R_t^k$) in the current frame. Also, the objects in a group region (e.g., $R_{t-1}^n$) in the previous frame may separate into several regions in the current frame. For real-time processing with a moderate or high frame rate of image acquisition, the inter-frame movements of target objects are usually small. This implies that there is always an overlap between the regions of the same object in consecutive frames. Exploiting such a relation, the problem (13) can be further decomposed by using directed acyclic graphs (DAGs). The directed acyclic graphs (DAGs) for the regions detected in the consecutive frames $\mathbf{I}_{t-1}(\mathbf{x})$ and $\mathbf{I}_t(\mathbf{x})$ are constructed in the following way. Let the regions from the previous and current frames be denoted as nodes and be laid out in two layers: the parent layer and the child layer. The parent layer consists of nodes representing the regions $\{ R_{t-1}^j \}_{j=1}^{L_{t-1}}$ in the previous frame $\mathbf{I}_{t-1}(\mathbf{x})$, and the child layer consists of nodes denoting the regions $\{ R_t^k \}_{k=1}^{L_t}$ in the current frame $\mathbf{I}_t(\mathbf{x})$. Suppose $R_{t-1}^j$ and $R_t^k$ are the $j$-th and $k$-th regions in the previous and current frames, respectively; then the directional link from $R_{t-1}^j$ to $R_t^k$ can be defined as

$l_{jk} = \begin{cases} 1, & \text{if } R_{t-1}^j \cap R_t^k \neq \emptyset \\ 0, & \text{otherwise} \end{cases}$   (14)
This implies that there is a link only when the two regions have some overlap. A directed acyclic graph (DAG) is formed by a set of nodes in which every node connects to one or more nodes in the same group. A set of DAGs (graphs) can be generated. An example of the graphs for two consecutive frames is illustrated in Figure 4. The notations for the DAGs are defined as follows. For the $i$-th graph, the parent nodes are denoted as $\{ n_0^{i,p} \}_{p=1}^{M_{i,0}}$, where each node $n_0^{i,p}$ represents one of the regions $\{ R_{t-1}^j \}_{j=1}^{L_{t-1}}$, and the child nodes are denoted as $\{ n_1^{i,q} \}_{q=1}^{M_{i,1}}$, where each node represents one of the regions $\{ R_t^k \}_{k=1}^{L_t}$. The $i$-th DAG can thus be denoted as $G_i = \left( \{ n_0^{i,p} \}_{p=1}^{M_{i,0}}, \{ n_1^{i,q} \}_{q=1}^{M_{i,1}}, \{ l_{pq} \} \right)$. The objects in a parent node $n_0^{i,p}$ are denoted as $\{ o_{t-1}^{i,p,m} \}_{m=1}^{M_{i,p}}$. If $M_{i,p} = 1$ the node is a single object, otherwise it is a group of $M_{i,p}$ objects. Each object $o_{t-1}^{i,p,m}$ is one of the objects $\{ O_{t-1}^m \}_{m=1}^{M}$. The objects in a child node $n_1^{i,q}$, which may be newly generated objects or objects tracked from its parent nodes, are denoted as $\{ o_t^{i,q,n} \}_{n=1}^{M_{i,q}}$. After processing all graphs, the objects in the child nodes are reordered as $\{ O_t^m \}_{m=1}^{M_t}$. They are the set of tracked target objects in the current frame.
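As one way to realize the graph construction described above, the sketch below links previous-frame (parent) and current-frame (child) regions whose bounding boxes overlap, in the spirit of Eq. (14), and groups connected nodes into one DAG each. The union-find grouping and the data layout are illustrative assumptions.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes (x0, y0, x1, y1) -- link rule of Eq. (14)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def build_dags(prev_boxes, curr_boxes):
    """Group previous-frame (parent) and current-frame (child) regions into
    connected components of the bipartite overlap graph; one DAG per component."""
    links = [(j, k) for j, a in enumerate(prev_boxes)
                    for k, b in enumerate(curr_boxes) if boxes_overlap(a, b)]
    # Union-find over nodes ('p', j) and ('c', k).
    parent = {('p', j): ('p', j) for j in range(len(prev_boxes))}
    parent.update({('c', k): ('c', k) for k in range(len(curr_boxes))})
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for j, k in links:
        parent[find(('p', j))] = find(('c', k))
    dags = {}
    for n in parent:
        dags.setdefault(find(n), {'parents': [], 'children': [], 'links': []})
        dags[find(n)]['parents' if n[0] == 'p' else 'children'].append(n[1])
    for j, k in links:
        dags[find(('p', j))]['links'].append((j, k))
    return list(dags.values())

# Two previous regions merging into one current region -> a single DAG.
print(build_dags([(0, 0, 10, 10), (12, 0, 20, 10)], [(5, 0, 18, 10)]))
```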
Since there is no link between different DAGs, the objects in the parent nodes of one graph can be tracked independently of the other graphs. Let $O_{t-1}^i$ represent the set of objects in the parent nodes of graph $G_i$, i.e., $O_{t-1}^i = \{ \{ o_{t-1}^{i,p,m} \}_{m=1}^{M_{i,p}} \}_{p=1}^{M_{i,0}}$. Then the probability of the states for all the tracked objects in image $\mathbf{I}_t(\mathbf{x})$ becomes

$P(\Theta_t \mid \mathbf{R}_t, O_{t-1}, \Theta_{t-1}) = \prod_{i=1}^{L_G} P(\Theta_t^i \mid G_i, O_{t-1}^i, \Theta_{t-1}^i)$   (15)

where $\Theta_t = (\Theta_t^1, \ldots, \Theta_t^{L_G})$ and $L_G$ is the number of DAGs. According to (15), (13) can be decomposed as finding a $(\Theta_t^i)^*$ for each graph such that
$(\Theta_t^i)^* = \arg\max_{\Theta_t^i} P(\Theta_t^i \mid G_i, O_{t-1}^i, \Theta_{t-1}^i)$   (16)

If there are several parent and child nodes in a graph, and some parent nodes represent groups, (16) is still a nontrivial problem. The example embodiment solves the problem in two sequential steps from coarse to fine. The problem is decomposed approximately into two sub-problems: assignment and location. The coarse assignment process assigns each object in a parent node to one of its child nodes, while the fine location process determines the new states of the objects assigned to each child node.
Assignment:
In this step, the tracked objects in each parent node are assigned to its child nodes based on the largest posterior probability. Let $\tilde{\Theta}_t^i$ be the parameter vector describing the assignment of the tracked objects $O_{t-1}^i$ to the child nodes $\{ n_1^{i,q} \}_{q=1}^{M_{i,1}}$. The posterior probability of the assignment for graph $G_i$ can be expressed as

$P(\tilde{\Theta}_t^i \mid G_i, O_{t-1}^i, \Theta_{t-1}^i) = \prod_{p=1}^{M_{i,0}} P(\tilde{\theta}_t^{i,p} \mid \mathcal{N}_c^{i,p}, O_{t-1}^{i,p}, \theta_{t-1}^{i,p})$   (17)

where $O_{t-1}^{i,p} = \{ o_{t-1}^{i,p,m} \}_{m=1}^{M_{i,p}}$ are the tracked objects in node $n_0^{i,p}$ and $\mathcal{N}_c^{i,p} = \{ n_1^{i,q} : l_{pq} = 1 \}$ are the child nodes of $n_0^{i,p}$. The parameter vector is $\tilde{\Theta}_t^i = (\tilde{\theta}_t^{i,1}, \ldots, \tilde{\theta}_t^{i,M_{i,0}})$. The best assignment for the tracked objects $O_{t-1}^{i,p}$ is the one that results in the best observation of the objects in the corresponding child nodes, that is

$(\tilde{\theta}_t^{i,p})^* = \arg\max_{\tilde{\theta}_t^{i,p}} P(\tilde{\theta}_t^{i,p} \mid \mathcal{N}_c^{i,p}, O_{t-1}^{i,p}, \theta_{t-1}^{i,p})$   (18)
Here $\tilde{\theta}_t^{i,p}$ can be considered as the coarse tracking parameters indicating in which child nodes (regions) the objects are observed, without concern for the exact new positions of the tracked objects in the child regions.
Location:
In this step, the new states of the tracked objects assigned to each child node (e.g., region $R_t^q$) are determined. Let $O_t^{i,q} = \{ o_t^{i,q,n} \}_{n=1}^{M_{i,q}}$ be the objects assigned to the child node $n_1^{i,q}$ from its parent nodes. That is, $O_t^{i,q}$ is a subset of $O_{t-1}^i$ according to the assignment parameters $(\tilde{\Theta}_t^i)^* = ((\tilde{\theta}_t^{i,1})^*, \ldots, (\tilde{\theta}_t^{i,M_{i,0}})^*)$. After the assignment, objects in each child node can be tracked independently of objects in the other child nodes. Hence, the posterior probability of the new states for the tracked objects in the graph $G_i$ can be evaluated as

$P(\Theta_t^i \mid G_i, O_{t-1}^i, (\tilde{\Theta}_t^i)^*, \Theta_{t-1}^i) = \prod_{q=1}^{M_{i,1}} P(\theta_t^{i,q} \mid n_1^{i,q}, O_t^{i,q}, \theta_{t-1}^{i,q})$   (19)

where $\Theta_t^i = (\theta_t^{i,1}, \ldots, \theta_t^{i,M_{i,1}})$ and $\theta_{t-1}^{i,q}$ is the set of previous state parameters for the objects $O_t^{i,q}$. From (19), locating the objects in the child node $n_1^{i,q}$ is expressed as

$(\theta_t^{i,q})^* = \arg\max_{\theta_t^{i,q}} P(\theta_t^{i,q} \mid n_1^{i,q}, O_t^{i,q}, \theta_{t-1}^{i,q})$   (20)
Multi-object tracking thus becomes finding the solutions for Eqs. (18) and (20) in the example embodiment. Further sequential solutions for (18) and (20) based on PCR are used and described below.
Assuming that $\{ R_t^k \}_{k=1}^{L_t}$ are the foreground regions and $\{ B_t^k \}_{k=1}^{L_t}$ are their bounding boxes detected at time $t$, their PCRs can be obtained as $\{ \tilde{T}_t^k \}_{k=1}^{L_t}$. Let $\{ G_i \}_{i=1}^{L_G}$ be the set of directed acyclic graphs (DAGs) for the foreground regions between the consecutive frames. If there is only one object in a graph $G_i$, the object will be tracked as an isolated object. Otherwise, multi-object tracking will be performed according to Eqs. (18) and (20). For tracking multiple objects in a group, the posterior probability of the new state for each object is determined from both the spatial position and the depth relationship. Hence, a 2.5-D state is used for each object. The state vector for an object $o_t^n$ is $\theta_t^n = (b_t^n, v_t^n)$, where $b_t^n$ is the bounding box describing its spatial position and $v_t^n$ is the likelihood value describing its depth position in the group.
Tracking Isolated Objects
If the $i$-th graph consists of only one child node (i.e., $G_i = \{ n_1^{i,1} \}$), a new object appears and is initialized as $o_t^1$ in $G_i$ with a new id number. Suppose the node $n_1^{i,1}$ represents the region $R_t^k$; then the PCR and bounding box of $o_t^1$ are set as $T_t^1 = \tilde{T}_t^k$ and $b_t^1 = B_t^k$. Since $o_t^1$ is an isolated object, it is not occluded by any other objects, and its depth state is set as $v_t^1 = P(R_t^k \mid o_t^1) = 1$.

If the $i$-th graph contains one parent node and one child node, and the parent node is associated with one object, the graph represents the simple case of isolated object tracking. Let the graph be $G_i = \{ n_0^{i,1}, n_1^{i,1} \}$, the object in the parent node be $O_{t-1}^m$, and the child node $n_1^{i,1}$ represent the region $R_t^k$; then the object $o_t^1$ in the child node is the continuation of $O_{t-1}^m$ (i.e., $O_{t-1}^m$ and $o_t^1$ have the same id number). Its state becomes $\theta_t^1 = (b_t^1, v_t^1)$ with $b_t^1 = B_t^k$ and $v_t^1 = 1$. In addition, its PCR is updated as $T_t^1 = \tilde{T}_t^k$ to follow the gradual variation of the object.
If the $i$-th graph only contains one parent node which has no child nodes, the objects previously in the parent node are assumed to have disappeared in the current frame. Tracking is terminated for these objects.

Tracking Multiple Objects in a Graph
If the $i$-th graph $G_i$ contains multiple parent nodes or child nodes, the operations of assignment and location will be performed. In the following description of the operations for one graph, the index $i$ for the graph $G_i$ is omitted for notational convenience.
Assignment:
Let $n_0^p$ be a parent node in the graph $G$, $O_{t-1}^p = \{ o_{t-1}^m \}_{m=1}^{M_p}$ be the associated objects, and $\mathcal{N}_c^p = \{ n_1^q : l_{pq} = 1 \}$ be its child nodes. If the parent node has more than one child node, the assignment of the objects $O_{t-1}^p$ is determined by Eq. (18). However, with varying numbers of objects and child nodes, Eq. (18) is a nontrivial problem of optimal configuration. To make the problem tractable, a sequential solution is proposed based on the PCRs and the depth relations among the objects. In each group, the close and non-occluded objects have richer visible information than the distant or occluded objects. This means that an occluded object affects the tracking of the objects occluding it less. Hence, the assignment can be solved sequentially from the most visible object to the least visible one. Let the objects $\{ o_{t-1}^m \}_{m=1}^{M_p}$ in the parent node $n_0^p$ be ordered according to their visible sizes. Assuming that the correct assignment of the object $o_{t-1}^m$ is $\tilde{\theta}_t^m = q_m$, which assigns $o_{t-1}^m$ to the child node $n_1^{q_m}$ ($n_1^{q_m} \in \mathcal{N}_c^p$), and that the child node $n_1^{q_m}$ represents the region $R_t^{q_m}$, then the posterior probability of assignment for the objects $O_{t-1}^p = \{ o_{t-1}^m \}_{m=1}^{M_p}$ is computed as

$P(\tilde{\theta}_t^p \mid \mathcal{N}_c^p, O_{t-1}^p, \theta_{t-1}^p) = \prod_{m=1}^{M_p} P(\tilde{\theta}_t^m = q_m \mid R_t^{q_m}(m-1), o_{t-1}^m, \theta_{t-1}^m)$   (21)
where $R_t^{q_m}(m-1) = R_t^{q_m} - \sum_{l=1}^{m-1} \delta(\tilde{\theta}_t^l = q_m)\, o_{t-1}^l$ represents the region after excluding the objects previously assigned to it before $o_{t-1}^m$. Note that the assignments of the objects $\{ o_{t-1}^m \}$ are not independent. The assignment of one object is affected by the previously assigned objects with higher ranks. This means the assignment of each object can be performed one-by-one sequentially from the most to the least visible one. For each object, the posterior probability of assignment can be evaluated using Bayes' rule,

$P(\tilde{\theta}_t^m = q_m \mid R_t^{q_m}(m-1), o_{t-1}^m, \theta_{t-1}^m) = P(R_t^{q_m}(m-1) \mid o_{t-1}^m, \tilde{\theta}_t^m = q_m)\, P(\tilde{\theta}_t^m = q_m \mid \theta_{t-1}^m)$   (22)

The first term on the right-hand side of (22) is the likelihood of observing the object $o_{t-1}^m$ in the region $R_t^{q_m}$ with the exclusion of previously assigned objects, while the second term is the prior probability of $\tilde{\theta}_t^m = q_m$ given the previous state $\theta_{t-1}^m$. For assignment, (22) can be evaluated on the PCRs. Assume that a child node $n_1^q \in \mathcal{N}_c^p$ represents the region $R_t^q$, and let $\tilde{T}_t^q$ and $T_{t-1}^m$ be the PCRs of $R_t^q$ and $o_{t-1}^m$, respectively. Using Eqs. (21) and (22), the best assignment of the objects $\{ o_{t-1}^m \}_{m=1}^{M_p}$ in the parent node $n_0^p$ can be achieved one-by-one sequentially according to their depth order by

$(\tilde{\theta}_t^m)^* = \arg\max_{q:\, n_1^q \in \mathcal{N}_c^p} P^s(\tilde{T}_t^q(m-1) \mid o_{t-1}^m)\, P(\tilde{\theta}_t^m = q \mid \theta_{t-1}^m)$   (23)

from $m = 1$ to $M_p$, where $\tilde{T}_t^q(m-1) = \tilde{T}_t^q - \sum_{l=1}^{m-1} \delta(\tilde{\theta}_t^l = q)\, T_{t-1}^l$ represents the PCR of $R_t^q(m-1)$.
The sequential solution to Eq. (18) using Eq. (23) is computed in two steps. First, the objects $\{ o_{t-1}^m \}_{m=1}^{M_p}$ in the parent node $n_0^p$ are sorted in a list according to their visible parts. An iterative process is then performed from the most visible to the least visible object. In each iteration, the object at the top of the list is assigned to one of the child nodes according to (23). Once an object is assigned to a child node, it is removed from the list and its visual evidence is excluded from the PCR of the child region. Details are described below.
Let $\{ v_{t-1}^m \}_{m=1}^{M_p}$ be the likelihoods of the objects $\{ o_{t-1}^m \}_{m=1}^{M_p}$ in $n_0^p$ computed in the previous time step, and let $s_m$ be the size of $o_{t-1}^m$ from its PCR $T_{t-1}^m$. The visible part of $o_{t-1}^m$ in $n_0^p$ is estimated as $\zeta_{t-1}^m = v_{t-1}^m s_m$. The objects $\{ o_{t-1}^m \}_{m=1}^{M_p}$ are sorted in descending order according to the values of $\{ \zeta_{t-1}^m \}_{m=1}^{M_p}$ and placed in a list. Before the iteration, $\tilde{T}_t^q(0) = \tilde{T}_t^q$ is set for the regions associated with the child nodes $\mathcal{N}_c^p$.
In each iteration, the top object in the list is popped out. Suppose it is the $m$-th object $o_{t-1}^m$. The likelihood in (23) is calculated as $P^s(\tilde{T}_t^q(m-1) \mid o_{t-1}^m)$ according to Eq. (11). The prior probability in (23) is evaluated based on the shape similarity and center distance between the bounding boxes. Let $b_{t-1}^m$ and $B_t^k$ be the bounding boxes of $o_{t-1}^m$ and of the region $R_t^k$ associated with the child node $n_1^q$, let $\mathbf{x}_{t-1}^m$ and $\mathbf{x}_t^k$ be their centers, and let $d_m$ and $d_k$ be their diagonal lengths, respectively. Then the shape similarity between the two boxes (center aligned) is defined as

$\eta_{sh} = \dfrac{\left| b_{t-1}^m \cap B_t^k \right|}{\left| b_{t-1}^m \cup B_t^k \right|}$

where "$\cap$" denotes the intersection and "$\cup$" denotes the union. The center distance between the two boxes is normalized by $\bar{d} = (d_m + d_k)/2$ to give a distance-based measure $\eta_d$. The prior probability is then defined as $P(\tilde{\theta}_t^m = q \mid \theta_{t-1}^m) = 0.5\,(\eta_{sh} + \eta_d)$. The object $o_{t-1}^m$ is assigned to the child node $n_1^{q_m}$ according to Eq. (23). The last operation in this iteration is exclusion, which removes the visual information of the object $o_{t-1}^m$ from $\tilde{T}_t^{q_m}(m-1)$ associated with the child node $n_1^{q_m}$. Let $T_{t-1}^m$ and $\tilde{T}_t^{q_m}(m-1)$ be the PCRs of the object and of the (partially excluded) child region; $\tilde{T}_t^{q_m}(m)$ is generated from $\tilde{T}_t^{q_m}(m-1)$ by updating its principal colors one by one. For the $j$-th element with principal color $\mathbf{c}_q^j$ and significance $s_q^j$, the following updating is performed:

$\Delta s = \sum_{i:\, d(\mathbf{c}_m^i,\, \mathbf{c}_q^j) < \varepsilon} s_m^i, \qquad s_q^j \leftarrow \max\left( 0,\; s_q^j - \Delta s \right)$   (24)

When all the elements in $\tilde{T}_t^{q_m}(m-1)$ have been updated, the generated PCR is $\tilde{T}_t^{q_m}(m)$. For the regions associated with the rest of the child nodes in $\mathcal{N}_c^p$, $\tilde{T}_t^q(m) = \tilde{T}_t^q(m-1)$ is set.
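A small sketch of the exclusion step of Eq. (24): once an object has been assigned to a child node, its significances are subtracted from the matching principal colors of that child region's PCR. The matching rule, the clipping at zero, and the size update are assumptions.

```python
import numpy as np

def _dist(c1, c2):
    d = np.dot(c1, c1) + np.dot(c2, c2)
    return 0.0 if d == 0 else 1.0 - 2.0 * np.dot(c1, c2) / d

def exclude_object_from_pcr(region_pcr, object_pcr, eps=0.005):
    """Remove the assigned object's visual evidence from the child region's PCR
    (Eq. (24)): for each principal color of the region, subtract the significance
    of the object's matching colors, clipping at zero."""
    new_signif = []
    for c_q, s_q in zip(region_pcr["colors"], region_pcr["signif"]):
        delta_s = sum(s_m for c_m, s_m in zip(object_pcr["colors"], object_pcr["signif"])
                      if _dist(np.asarray(c_q, float), np.asarray(c_m, float)) < eps)
        new_signif.append(max(0, s_q - delta_s))
    return {"size": max(0, region_pcr["size"] - object_pcr["size"]),  # assumed size update
            "colors": list(region_pcr["colors"]),
            "signif": new_signif}
```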
Location:
Let $O_t^q = \{ o_t^n \}_{n=1}^{M_q}$ be the objects assigned to a child node $n_1^q$ in the graph $G_i$. The new states of these objects should be determined by solving Eq. (20).
Locating the objects in a region is not an independent process for each object, but the front objects with richer visible information are less affected by the occluded ones. Hence, in the example embodiment the objects in the node are located one by one from the most visible to the least visible based on their visible parts. The posterior probability of the new states for all the objects in the node can be expressed as

$P(\theta_t^q \mid n_1^q, O_t^q, \theta_{t-1}^q) = \prod_{n=1}^{M_q} P(\theta_t^n \mid R_t^q(n-1), o_t^n, \theta_{t-1}^n)$   (25)

where $\theta_t^q = (\theta_t^1, \ldots, \theta_t^{M_q})$, $R_t^q$ is the region associated with the node $n_1^q$, and $R_t^q(n-1) = R_t^q - \sum_{l=1}^{n-1} o_t^l$ represents the region in which the visual evidence of the first $n-1$ objects $(o_t^1, \ldots, o_t^{n-1})$ has been excluded at the located positions. According to (25), locating the objects $O_t^q$ according to (20) is equivalent to locating them one by one sequentially according to

$(\theta_t^n)^* = \arg\max_{\theta_t^n} P(\theta_t^n \mid R_t^q(n-1), o_t^n, \theta_{t-1}^n)$   (26)

where $\{ o_t^n \}_{n=1}^{M_q}$ are sorted in descending order according to their visible sizes.
The sequential solution to the problem of Eqs. (20) and (26) contains two steps. In the first step, the visible parts of the objects in the node are estimated, and the objects are sorted according to their visible sizes. In the second step, an iterative process is applied to locate the objects one-by-one in the region with a mean-shift algorithm based on PCR. When an object is located, its visual evidence is excluded from its position in the region. The details are described in the following.
Assume that an object $o_t^n$ in the child node $n_1^q$ is the object assigned from the parent node $n_0^p$; then there is a likelihood $v_{t-1}^n$ computed from the previous frame. The likelihood of observing the object $o_t^n$ in the child node $n_1^q$ (or region $R_t^k$) in the current frame can be evaluated as $P^s(R_t^k \mid o_t^n)$ according to Eq. (11). Since the motion of an object between two consecutive frames is assumed small, the visible part of the object $o_t^n$ in region $R_t^k$ can be estimated as

$\zeta_t^n = \eta\, \zeta_{t-1}^n + (1 - \eta)\, P^s(R_t^k \mid o_t^n)\, \min(s_n, S_k)$   (27)

where $s_n$ and $S_k$ are the sizes of the object $o_t^n$ and the region $R_t^k$, respectively. In Eq. (27), $\eta$ is a weight to smooth the estimates from consecutive frames ($\eta = 0.5$ is chosen in this study). The objects $\{ o_t^n \}_{n=1}^{M_q}$ are then sorted in descending order according to the values of $\{ \zeta_t^n \}_{n=1}^{M_q}$ and placed in a list. To perform the exclusion for $R_t^q(n-1) = R_t^q - \sum_{l=1}^{n-1} o_t^l$, a weight image $\omega_n(\mathbf{x})$ is used. If the pixel $\mathbf{x}$ is likely to belong to one of the previously located objects $(o_t^1, \ldots, o_t^{n-1})$, $\omega_n(\mathbf{x})$ is low ($\approx 0$); otherwise, it is high ($\approx 1$). For initialization, $\omega_0(\mathbf{x}) = 1$ is set for all the pixels belonging to the region $R_t^q$, and $\omega_0(\mathbf{x}) = 0$ otherwise.
In each iteration, the top object in the list is popped out. Assume that it is the $n$-th object $o_t^n$, with its initial position represented by the previous bounding box $B^{(0)} = b_{t-1}^n$ centered at $\mathbf{x}^{(0)}$, and its PCR $T_t^n = \{ s_n, \{ E_n^j = (\mathbf{c}_n^j, s_n^j) \}_{j=1}^{N} \} = T_{t-1}^m$ coming from the object $o_{t-1}^m$. Locating the object $o_t^n$ in the region $R_t^q$ according to Eq. (26) is equivalent to finding the position where the maximum value of the probability density of observing the object occurs. This density maximum can be found by employing a mean-shift procedure with a weight mode which can reveal the probability density of observing the object in the neighborhood [5], [6], [7]. A two-stage mean-shift procedure is proposed based on the evidence of the object's principal colors. In the first stage, the gravity center of the pixels of each principal color component is computed as

$\bar{\mathbf{x}}_j^{(r)} = \dfrac{\sum_{\mathbf{x} \in B^{(r)}} \mathbf{x}\, \omega_n(\mathbf{x})\, \delta(\mathbf{I}_t(\mathbf{x}), \mathbf{c}_n^j)}{\sum_{\mathbf{x} \in B^{(r)}} \omega_n(\mathbf{x})\, \delta(\mathbf{I}_t(\mathbf{x}), \mathbf{c}_n^j)}$   (28)

with $j = 1, \ldots, N$, where $r$ indicates the current step of the mean-shift iteration. In the second stage, the new position of the object $o_t^n$ is generated as the weighted average of the gravity centers,

$\mathbf{x}^{(r+1)} = \dfrac{\sum_{j=1}^{N} \hat{s}_n^j\, \bar{\mathbf{x}}_j^{(r)}}{\sum_{j=1}^{N} \hat{s}_n^j}$   (29)

where $\hat{s}_n^j$ is the visible evidence (the accumulated, exclusion-weighted significance) of the principal color $\mathbf{c}_n^j$ within the current box $B^{(r)}$. When the object $o_t^n$ has been located, the exclusion of its visual evidence for the subsequent objects is performed by updating the weight image so that it is lowered at the pixels within the located box that are likely to belong to $o_t^n$ (Eq. (31)).
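A compact sketch of the two-stage mean-shift step of Eqs. (28)–(29): per-color gravity centers are computed inside the current search box using the exclusion weight image, and the new center is their evidence-weighted average. The loop structure, the stopping test, and the PCR layout are assumptions carried over from the earlier sketches.

```python
import numpy as np

def mean_shift_locate(image, weight, pcr, center, box_hw, eps=0.005, iters=10):
    """Locate one object in a region by iterating the two-stage mean-shift:
    (28) gravity center per principal color, (29) weighted average of centers."""
    cy, cx = center
    hh, hw = box_hw
    for _ in range(iters):
        y0, y1 = int(max(0, cy - hh)), int(min(image.shape[0], cy + hh))
        x0, x1 = int(max(0, cx - hw)), int(min(image.shape[1], cx + hw))
        patch = image[y0:y1, x0:x1].astype(float)
        w = weight[y0:y1, x0:x1]
        ys, xs = np.mgrid[y0:y1, x0:x1]
        centers, masses = [], []
        for c in pcr["colors"]:
            c = np.asarray(c, float)
            # Color distance of Eq. (3) against every pixel of the patch.
            d = 1 - 2 * (patch @ c) / (np.sum(patch**2, axis=-1) + c @ c + 1e-9)
            m = (d < eps) * w                       # matching pixels, exclusion-weighted
            if m.sum() > 0:
                centers.append((np.sum(ys * m) / m.sum(), np.sum(xs * m) / m.sum()))
                masses.append(m.sum())
        if not masses:
            break
        new_cy = np.sum([m * c[0] for m, c in zip(masses, centers)]) / np.sum(masses)
        new_cx = np.sum([m * c[1] for m, c in zip(masses, centers)]) / np.sum(masses)
        if abs(new_cy - cy) < 0.5 and abs(new_cx - cx) < 0.5:
            break                                    # converged
        cy, cx = new_cy, new_cx
    return cy, cx
```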
The complete algorithm of multi-object tracking based on PCR in the example embodiment is summarized in Table 3.
TABLE 3
THE SUMMARY OF THE MULTI-OBJECT TRACKING ALGORITHM

Input: color image $\mathbf{I}_t(\mathbf{x})$ and segmented image $S_t(\mathbf{x})$;
Preprocessing: generate graphs $G_i$, $i = 1, \ldots, L_G$;
For $G_i$, $i = 1, \ldots, L_G$, do:
  Assignment: for each parent node $n_0^{i,p}$ in $G_i$, $p = 1, \ldots, M_{i,0}$, do:
    a.1: if $n_0^{i,p}$ has no child node, the objects in it are deleted;
    a.2: if $n_0^{i,p}$ has only one child node $n_1^{i,q}$, all the objects in it are assigned to $n_1^{i,q}$;
    a.3: if $n_0^{i,p}$ has multiple child nodes:
      a.3.1: sort the objects $\{ o_{t-1}^m \}_{m=1}^{M_p}$ in $n_0^{i,p}$, and then assign them one-by-one from the first to the last as follows:
        a.3.1.1: assign $o_{t-1}^m$ to the child node $n_1^{i,q_m}$ according to (23);
        a.3.1.2: exclude the visual information of $o_{t-1}^m$ from the PCR of $n_1^{i,q_m}$;
  Location: for each child node $n_1^{i,q}$ in $G_i$, $q = 1, \ldots, M_{i,1}$, do:
    l.1: if no object is assigned to the node, check whether it corresponds to a previously disappeared object; if not, set it as a new object;
    l.2: if only one object is assigned to the node, update the state and PCR of the object;
    l.3: if multiple objects are assigned to the node:
      l.3.1: sort the objects $\{ o_t^n \}_{n=1}^{M_q}$ in the node using (27), and then locate the objects one-by-one as follows:
        l.3.1.1: apply mean-shift to locate $o_t^n$ using (28) and (29);
        l.3.1.2: exclude the visual evidence of $o_t^n$ at the located position in $R_t^q$ using (31);
      l.3.2: if the likelihood of observing an object in the region is less than 0.1, set the object as disappeared.
Clearance: if an object has disappeared for more than 50 frames, delete the object.
End
The algorithm in the example embodiment includes two phases of processing for each DAG (Directed Acyclic Graph): assignment and location. In the assignment phase, each parent node in the DAG is processed. In the location phase, the assigned objects in each child node are tracked. To be robust to the separation of small parts from a tracked object due to segmentation errors, small objects in a group with likelihood values less than 0.1 are set as disappeared. To prevent losing small or heavily occluded objects in a group, the records of disappeared objects are kept for 50 frames. When a new object is detected, it is compared with the disappeared objects according to their PCRs, sizes, and distances. If it matches a disappeared object, the tracking of that object is resumed; otherwise a new object is created.
In the example embodiment, segmenting individual persons in a group is preferably guided by domain knowledge. For example, in the example embodiment knowledge about the sizes and aspect ratios of persons in the scene is used to adapt to segmentation errors.
Figure 5 shows a flow chart 500 illustrating a method of multi-object tracking in a video signal in the example embodiment. At step 502, first and second segmented images of two consecutive frames of the video signal respectively are received, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked. At step 504, one or more directed acyclic graphs (DAGs) are generated for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node. At step 506, for each parent node having two or more child nodes, a) the corresponding objects of the foreground region contributing to said each parent node are sorted according to estimated depth in said first image; b) the corresponding object having the lowest depth is assigned to one of the child nodes of said each parent node; c) a visual content of the assigned corresponding object is removed from the visual data associated with said one child node; and steps b) to c) are iterated in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes.
At step 508, for each child node having only one corresponding object assigned thereto, a state and the visual content of said one object are updated based on the second image. At step 510, for each child node having two or more corresponding objects assigned thereto, d) the corresponding objects are sorted according to estimated depth in said each child node in said second image; e) a mean-shift calculation is applied to locate the corresponding object having the lowest depth in said each child node; f) the state and the visual content of the located corresponding object are updated based on the second image; g) the updated visual content of the located corresponding object is removed from the visual data associated with said each child node; and steps e) to g) are iterated in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
When an object stops moving and stays in the same position in the scene for a while, the object would gradually be absorbed into the background with existing background updating techniques. That means the object would be lost in the segmented foreground images. On the other hand, in e.g. crowded scenes, if one can separate the moving objects from the stationary objects in the scene, one can reduce the overlapping of multiple foreground objects. This makes the tracking of each individual easier and more robust. In the described example embodiment, a layer tracking algorithm is designed to track stationary objects even through frequent occlusions. When the object starts moving again, the object is identified as a moving object and tracked by a moving object tracking algorithm. In the example embodiment, the stationary objects include not only static non-living objects but also motionless living objects, e.g. a standing or sitting person. Since living objects may move again, the switching between moving object tracking and stationary object tracking for the target object is preferably smooth, with no change of identity, in the example embodiment.
When an object stops moving and stays in the scene over a number of frames of a video signal, the appearance variation of the object is typically small through the sequence of frames. A template image of the object is used to represent such a stationary object in the example embodiment.
Let $\{ B_k^i \}_{k=t-\tau_b+1}^{t}$ be the sequence of bounding boxes of the $i$-th tracked object in the $\tau_b$ most recent frames, as tracked by a moving object tracking algorithm. If the object has stopped moving, the bounding boxes will overlap each other. For a selected length parameter $\tau_b$, if the spatial intersection of all the boxes is not empty, the object is detected as a stationary object in the example embodiment. In the example embodiment, but not limiting, $\tau_b$ is set to 10 frames, corresponding to about 1 second. To track the stationary object in e.g. a busy site in which the object may be occluded frequently by moving foreground objects, a layer representation based on the object's template image is built. The layer representation of the detected stationary object is defined as
$L_t^j = \left\{ A_t^j,\; T^j,\; \{ d_f^j,\, d_c^j,\, d_p^j \}_{t-\tau_d+1}^{t},\; \{ s_k \}_{k=t-\tau_s+1}^{t} \right\}$   (1b)

where $A_t^j$ is the template image of the object maintained at time $t$, and $T^j$ is the Principal Colour Representation (PCR) of the object stored when the object was detected as a stationary object. That is, the template image is based at least on the last frame of the sequence used in detecting the object as a stationary object. $d_f^j$ is the difference measure between the template $A_t^j$ and the frame $I_t(\mathbf{s})$ over the region corresponding to $A_t^j$, $d_c^j$ is the difference measure between the consecutive frames $I_{t-1}(\mathbf{s})$ and $I_t(\mathbf{s})$ over the region of the template, $d_p^j$ is the visibility measure of the object from the corresponding region in the frame $I_t(\mathbf{s})$, and $s_k$ is the estimated state of the stationary object at time $k$. The measures in the $\tau_d$ most recent frames and the states in the $\tau_s$ most recent frames are recorded. The calculation of these measures and the estimation of states from them for each layer object are described below.
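One possible way to hold the layer representation in code is sketched below; the field names mirror the description above, while the container choices (fixed-length deques for the τd and τs histories) are assumptions.

```python
from dataclasses import dataclass, field
from collections import deque
from typing import Deque, Tuple
import numpy as np

@dataclass
class LayerObject:
    """Layer representation of a stationary object."""
    template: np.ndarray                 # template image A_t of the object
    pcr: dict                            # PCR T stored when the object became stationary
    bbox: Tuple[int, int, int, int]      # position of the template in the frame
    d_f: Deque[float] = field(default_factory=lambda: deque(maxlen=10))  # template vs frame
    d_c: Deque[float] = field(default_factory=lambda: deque(maxlen=10))  # frame vs previous frame
    d_p: Deque[float] = field(default_factory=lambda: deque(maxlen=10))  # visibility from PCR
    states: Deque[str] = field(default_factory=lambda: deque(maxlen=5))  # last tau_s states
```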
In e.g. a busy public site, if there are some objects staying in the scene, they will often merge with moving objects, which results in a high complexity for object tracking. By separating the layer (stationary) objects from moving objects and tracking the stationary and moving objects separately, the example embodiment can greatly enhance object tracking. Let $\mathbf{c} = I_t(\mathbf{s})$ be the color of a foreground point in the region of the $j$-th template image. According to Bayes' rule, the probability of the point belonging to the background is

$p(b \mid \mathbf{c}) = \dfrac{p(\mathbf{c} \mid b)\, p(b)}{p(\mathbf{c})}$   (2b)
where $p(\mathbf{c} \mid b)$ can be obtained from the Principal Feature Representation (PFR) of the background. The PFR at each pixel is used to characterize the background. Let $\mathbf{s} = (x, y)$ be a pixel of the image. For each type of feature, a table which records the principal feature vectors and their statistics at $\mathbf{s}$,

$T_v(\mathbf{s}) = \left\{ p_v^s(b),\; \{ S_v^s(i) \}_{i=1}^{M_v} \right\}$   (3b)

is built, where $p_v^s(b)$ is the learned probability of $\mathbf{s}$ belonging to the background based on the observation of the feature $v$, and $S_v^s(i)$ records the statistics of the $M_v$ most frequent feature vectors at $\mathbf{s}$. Each $S_v^s(i)$ contains three components,

$S_v^s(i) = \left\{ \mathbf{v}_i,\; p_{v_i},\; p_{v_i}(b) \right\}, \qquad \mathbf{v}_i \in \mathbb{R}^{D_v}$   (4b)

where $D_v$ is the dimension of the feature vector $\mathbf{v}$. The $S_v^s(i)$ in the table are sorted in descending order with respect to the value $p_{v_i}$. Hence, the first $N_v$ elements are used as the principal features. Three types of features are used in the example embodiment: a spectral feature (color), a spatial feature (gradient), and a temporal feature (color co-occurrence). Among them, the color and gradient features are stable for static background parts, and the color co-occurrence features are suitable for dynamic background parts. Three tables are used to learn the possible principal features of the three types for the background: $T_c(\mathbf{s})$, $T_e(\mathbf{s})$, and $T_{cc}(\mathbf{s})$. The color vector is $\mathbf{c} = (R_t, G_t, B_t)$ from the input color frame. The gradient vector is $\mathbf{e} = (g_x, g_y)$ obtained by the Sobel operator. The color co-occurrence vector is $\mathbf{cc} = (R_{t-1}, G_{t-1}, B_{t-1}, R_t, G_t, B_t)$ with 32 levels for each color component.
The probability of the pixel becoming a background point in the current frame can be calculated as

$p(b) = \dfrac{N_s(b)}{M_s}$   (5b)

where $N_s(b)$ is the number of background points in a small window $W_s$ centered at $\mathbf{s}$ in the previous frame, and $M_s$ is the number of points within the window. Similarly, the probabilities of $\mathbf{s}$ belonging to the layer (stationary) object or to a moving foreground object are

$p(l \mid \mathbf{c}) = \dfrac{p(\mathbf{c} \mid l)\, p(l)}{p(\mathbf{c})}, \qquad p(f \mid \mathbf{c}) = \dfrac{p(\mathbf{c} \mid f)\, p(f)}{p(\mathbf{c})}$   (6b)
respectively. The probabilities $p(\mathbf{c} \mid l)$ and $p(\mathbf{c} \mid f)$ can be calculated with Gaussian kernels. Let $\mathbf{c}_x^l$ be the color of a point $\mathbf{x}$ in the template $A_{t-1}^j$ within the window $W_s$. Then $p(\mathbf{c} \mid l)$ can be calculated as

$p(\mathbf{c} \mid l) = \max_{\mathbf{x} \in W_s} \left\{ k_c(\mathbf{c}_x^l - \mathbf{c})\, k_s(\mathbf{x} - \mathbf{s}) \right\}$   (7b)

where the Gaussian kernels can be written as $k_g(\mathbf{v}) = \exp\left( -\dfrac{\|\mathbf{v}\|^2}{2\sigma_g^2} \right)$ with $g = c$ or $s$ indicating the kernel for the color or spatial vector, respectively. Again, let $\mathbf{c}_x^f$ be the color of a point $\mathbf{x}$ in the window $W_s$ and in the region of moving foreground objects from the last frame $I_{t-1}(\mathbf{s})$. The probability $p(\mathbf{c} \mid f)$ can be calculated as

$p(\mathbf{c} \mid f) = \max_{\mathbf{x} \in W_s} \left\{ k_c(\mathbf{c}_x^f - \mathbf{c})\, k_s(\mathbf{x} - \mathbf{s}) \right\}$   (8b)
The priors can be calculated as

$p(l) = \dfrac{N_s(l)}{M_s}, \qquad p(f) = \dfrac{N_s(f)}{M_s}$   (9b)

where $N_s(l)$ and $N_s(f)$ are the numbers of points belonging to the layer object and to moving objects, respectively, within the window $W_s$ in the previous frame.
Comparing Eqs. (2b) and (6b), it can be seen that $p(\mathbf{c})$ is a common normalization factor. Hence, the likelihoods of $\mathbf{s}$ belonging to the background, the layer object, or a moving object can be defined as

$p'(b \mid \mathbf{c}) = p(\mathbf{c} \mid b)\, p(b), \qquad p'(l \mid \mathbf{c}) = p(\mathbf{c} \mid l)\, p(l), \qquad p'(f \mid \mathbf{c}) = p(\mathbf{c} \mid f)\, p(f)$   (10b)

respectively. The pixel $\mathbf{s}$ is assigned according to the greatest likelihood value. The mask for the moving objects is used as the input for moving object tracking.
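A hedged sketch of the per-pixel competition of Eq. (10b): a pixel inside a template region is assigned to the background, the layer object, or a moving object according to the greatest of the three likelihoods. The kernel bandwidth, the omission of the spatial kernel, and the toy inputs are assumptions.

```python
import numpy as np

def classify_pixel(c, window_colors_layer, window_colors_moving,
                   p_c_given_b, p_b, p_l, p_f, sigma_c=10.0):
    """Return 'background', 'layer', or 'moving' for a pixel of color c,
    following the likelihood competition of Eq. (10b)."""
    k = lambda cx: np.exp(-np.sum((np.asarray(cx, float) - np.asarray(c, float)) ** 2)
                          / (2 * sigma_c ** 2))
    # p(c|l), p(c|f): best color-kernel match inside the local window (Eqs. (7b)-(8b),
    # with the spatial kernel omitted here for brevity).
    p_c_given_l = max((k(cx) for cx in window_colors_layer), default=0.0)
    p_c_given_f = max((k(cx) for cx in window_colors_moving), default=0.0)
    scores = {"background": p_c_given_b * p_b,
              "layer":      p_c_given_l * p_l,
              "moving":     p_c_given_f * p_f}
    return max(scores, key=scores.get)

print(classify_pixel((200, 40, 40), [(198, 42, 41)], [(20, 20, 180)],
                     p_c_given_b=0.01, p_b=0.3, p_l=0.4, p_f=0.3))
```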
Stationary objects may also be involved in several changes and interactions with other objects through the sequence. A non-living object may e.g. undergo illumination changes, or be occluded or removed by other objects. A living object may change pose or move body parts, or start moving again. While tracking the stationary object, the object's states are estimated and the template image is updated correspondingly in the example embodiment. In the example embodiment, five states are used to describe the layer object: motionless, occluded, removed, inner-motion, and start-moving. The state is estimated according to various change measures computed from a short sequence of the most recent frames.
Let $\mathbf{s}$ be a point in the template $A_t^j(\mathbf{s})$ of the $j$-th layer object. The difference between the template and the current frame at $\mathbf{s}$ can be evaluated as

$d_f^j(\mathbf{s}) = \begin{cases} 0, & \text{if } \left| A_t^j(\mathbf{s}) - I_t(\mathbf{s}) \right| < Th_d \\ 1, & \text{otherwise} \end{cases}$   (11b)

where $Th_d$ is a threshold chosen according to the image noise. Then, the difference measure between the template and the current frame for the layer object is defined as

$d_f^j = \dfrac{1}{S_A^j} \sum_{\mathbf{s} \in A_t^j} d_f^j(\mathbf{s})$   (12b)

where $S_A^j$ is the size of the template.
Similarly, for a point $\mathbf{s}$ in the template $A_t^j(\mathbf{s})$, the difference between consecutive frames at the point is evaluated as

$d_c^j(\mathbf{s}) = \begin{cases} 0, & \text{if } \left| I_t(\mathbf{s}) - I_{t-1}(\mathbf{s}) \right| < Th_d \\ 1, & \text{otherwise} \end{cases}$   (13b)

The difference measure between consecutive frames for the layer object is defined as

$d_c^j = \dfrac{1}{S_A^j} \sum_{\mathbf{s} \in A_t^j} d_c^j(\mathbf{s})$   (14b)

The difference measures are calculated on the color vectors.
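A short sketch of the difference measures of Eqs. (11b)–(14b) over the template region; the threshold value and the per-pixel color comparison by Euclidean norm are assumptions.

```python
import numpy as np

def difference_measures(template, frame, prev_frame, mask, thr=30.0):
    """Return (d_f, d_c): fraction of template pixels that differ from the current
    frame (Eqs. (11b)-(12b)) and between consecutive frames (Eqs. (13b)-(14b))."""
    diff_tf = np.linalg.norm(template.astype(float) - frame.astype(float), axis=-1)
    diff_cc = np.linalg.norm(frame.astype(float) - prev_frame.astype(float), axis=-1)
    n = max(1, int(mask.sum()))           # template size S_A
    d_f = np.sum((diff_tf >= thr) & mask) / n
    d_c = np.sum((diff_cc >= thr) & mask) / n
    return d_f, d_c
```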
If the changes over the region of the template are caused by motion of the object itself, then even if the differences $d_f^j$ and $d_c^j$ are large, the visibility (visibility measure $d_p^j$) of the object in the current frame based on PCR will still be high, since the PCR is a global representation not related to spatial information. On the other hand, if the changes are caused by occlusion by other objects, the visibility of the layer object in the current frame will be low. Let $T^j$ be the PCR of the layer object that was stored when the object was detected as a stationary object, and let $T_t^j$ be the PCR computed from the region overlapped by the template $A_t^j$ in the current frame. Then the visibility measure of the layer object in the current frame can be evaluated as $d_p^j = P(T_t^j \mid T^j)$. More particularly, let $O_m^{t-1}$ be an object in $I_{t-1}(\mathbf{s})$ and $O_n^t$ be a region in $I_t(\mathbf{s})$.
According to Bayes' law, the probability of observing $O_m^{t-1}$ in $O_n^t$ can be computed as

$P(O_n^t \mid O_m^{t-1}) = \sum_{i=1}^{N} P(O_n^t \mid E_m^i)\, P(E_m^i \mid O_m^{t-1})$   (15b)

From the definition of the PCR, the significance of $\mathbf{c}_m^i$ for $O_m^{t-1}$ is $P(E_m^i \mid O_m^{t-1}) = p_{c_m^i}$, and the likelihood of observing $O_m^{t-1}$ in $O_n^t$ according to the evidence of $\mathbf{c}_m^i$ is

$P(O_n^t \mid E_m^i) = \min\!\left( 1,\; \dfrac{p_{c_m^i}^n}{p_{c_m^i}} \right)$   (16b)

where $p_{c_m^i}^n$ is the significance of $\mathbf{c}_m^i$ in the region $O_n^t$. Let $C(\mathbf{c}_m^i)$ be the subset of the principal colors of $O_n^t$ which match $\mathbf{c}_m^i$; $p_{c_m^i}^n$ can be calculated as

$p_{c_m^i}^n = \sum_{k:\, \mathbf{c}_n^k \in C(\mathbf{c}_m^i)} p_{c_n^k}$   (17b)

Now the visibility measure becomes

$d_p^j = P(O_n^t \mid O_m^{t-1}) = \sum_{i=1}^{N} \min\left( p_{c_m^i},\; p_{c_m^i}^n \right)$   (18b)
With the change measures evaluated above over a short sequence of the $\tau_d$ most recent frames (i.e. image frames from $I_{t-\tau_d}(\mathbf{x})$ to $I_t(\mathbf{x})$), with $\tau_d$ normally set to 10 frames in the example embodiment, the states of the tracked layer object are estimated by heuristic rules in the example embodiment:
Rule 1: motionless: If both $d_f^j$ and $d_c^j$ are low through the sequence, the object is motionless;

Rule 2: occluded: If both $d_f^j$ and $d_c^j$ turn moderate or high and $d_p^j$ turns low through the sequence, and there are moving objects overlapping the region of the template $A_t^j$, as determined from the bounding boxes of such moving objects in the moving object tracking algorithm applied, the layer object is occluded;

Rule 3: removed: If both $d_f^j$ and $d_c^j$ turn high and $d_p^j$ turns low, and then $d_c^j$ turns low through the sequence with no moving object overlapping the region of the template, the layer object has been removed;

Rule 4: inner-motion: If both $d_f^j$ and $d_c^j$ turn moderate and then $d_c^j$ turns low through the sequence, while $d_p^j$ remains high, the layer object has changed its pose or moved part of its body but still stays there;

Rule 5: start-moving: If both $d_f^j$ and $d_c^j$ turn and remain moderate, $d_p^j$ remains high through the sequence, and there is a shift of the layer object, the layer object has started moving again.
The parameters for the rules are determined according to a knowledge base of human-perceived semantic meanings and an evaluation on real-world videos in the example embodiment. In the example embodiment, but not limiting, for the above rules the difference measures $d_f^j$ and $d_c^j$ are low if they are less than 0.25, moderate if they are within (0.25, 0.75), and high if they are larger than 0.75. The visibility measure $d_p^j$ is low if it is less than 0.6, otherwise it is high. The measure of shape shift is calculated by checking the expanding foreground pixels along the boundary of the template $A_t^j$. If the number of expanded pixels is larger than 50% of the template size, a "shift" of the object is detected. It will be appreciated that for some videos from specific cameras, e.g. cameras with unstable signals, adjustment of the thresholds may be required in different embodiments, based on the relevant knowledge base.
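A rough, single-snapshot encoding of Rules 1–5 is sketched below; the 0.25/0.75/0.6 thresholds follow the values just given, while the boolean inputs for the overlap and shift tests, the rule ordering, and the fallback label are assumptions that simplify the temporal ("through the sequence") conditions.

```python
def estimate_layer_state(d_f, d_c, d_p, occluding_motion, shifted):
    """Heuristic state of a layer object from the latest change measures.
    d_f, d_c: template / inter-frame difference; d_p: PCR visibility;
    occluding_motion: a moving object currently overlaps the template;
    shifted: the boundary-expansion test for a shape shift fired."""
    low  = lambda v: v < 0.25
    mod  = lambda v: 0.25 <= v <= 0.75
    high = lambda v: v > 0.75
    visible = d_p >= 0.6
    if low(d_f) and low(d_c):
        return "motionless"
    if visible and mod(d_f) and mod(d_c) and shifted:
        return "start-moving"
    if visible and low(d_c):
        return "inner-motion"            # pose change finished, object still there
    if not visible and occluding_motion:
        return "occluded"
    if not visible and high(d_f) and low(d_c) and not occluding_motion:
        return "removed"
    return "undetermined"                # no rule fired for this snapshot

print(estimate_layer_state(0.1, 0.05, 0.9, False, False))  # -> motionless
```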
To track the layer object more robustly in the example embodiment, the layer model is maintained to adapt to real variations of the object without being affected by other objects in the scene. The five most recent states for each layer object (τs = 5) are recorded. However, it will be appreciated that other values may be used in different embodiments. If one state has more than 3 supports, the state is confirmed. For the corresponding state, the following updating is performed.
If the layer object is confirmed as being motionless, a smoothing operation is performed on the template image. If the object is recognized as being in the inner-motion state, the new image of the object in the current frame replaces the template. If the object is occluded, no updating is performed. If the object is classified as start-moving, the object is transformed into a moving object with the same ID and the corresponding PCR, mask, and position for tracking by the moving object tracking algorithm, and the layer representation of the object is deleted. If the object is detected as removed, the object is transformed into a disappeared object and its layer representation is destroyed. With these operations, a target object that moves around, stays somewhere for a while, and moves again can be tracked continuously and seamlessly by combining the example embodiment with the moving object tracking algorithm described for the example embodiment.
Figure 6 shows a flow chart 600 illustrating a method of object tracking in a video signal according to the example embodiment. At step 602, it is detected that a tracked moving object has become stationary over a sequence of frames. At step 604, a template image of the stationary object is generated based on at least one of the frames in the sequence. At step 606, a state of the stationary object is tracked based on a comparison of the template image with a current frame of the video signal.
Event detection:
The structure diagram of an event detection system 700 implementation incorporating the described example embodiment is shown in Figure 7. It contains four fundamental modules, foreground segmentation module 701, moving object tracking module 702, stationary object tracking module 704, and event detection module 706.
The foreground segmentation module 701 performs the background subtraction and learning and includes the method and system for background updating of the example embodiment described above, applied to e.g. the adaptive background subtraction method proposed in [8]. The background model used in the example implementations employs Principal Feature Representation (PFR) at each pixel to characterize background appearance.
The moving objects are tracked with the deterministic 2.5D multi-object tracking algorithm of the described example embodiment in the moving object tracking module 702. As described above, to deal with large variations of target objects in shape and scale as well as complex occlusions, moving objects are represented by principal color representation models, which exploit a few of the most significant colors and their statistics to characterize the appearance of each tracked object. When a tracked object is detected as having stopped moving, a layer representation, or template, for the object is established and the object is then tracked by the stationary object tracking module 704 using the method and system of the described example embodiment. At each time step, the states of the templates for the objects are estimated with fuzzy reasoning. The template for an object may shift between five states: motionless, interior motion, occluded, starting moving, and removed. When a template for an object is detected as starting to move, the template for the object is deleted and the object is switched back to a moving object and then tracked by the moving object tracking module 702.
In the event detection module 706, semantic models based on Finite State Machines (FSM) are designed to detect suspected scenarios. In the system 700 of the example implementation, four types of unusual events are detected. They are unattended objects, theft, loitering persons, unattended vehicles or unconscious persons.
An "event" is an abstract symbolic concept of what has happened in the scene. It is the semantic level description of the spatio-temporal concatenation of movements and actions of interesting objects in the scene. Event detection in video understanding is a high level procedure which identifies specific events by interpreting the sequences of observed perceptual features from intermediate level processing. It is a step that bridges the numerical level and the symbolic level. The fundamental part of event detection is event modeling. For an event, the model is determined by the task and the different instantiations. There are generally two issues for event modeling. One is to select an appropriate representation model, or formal language, and the other is to derive the descriptors for the interesting events with the model.
In implementations based on the described example embodiment, unusual events are described by the spatio-temporal evolution of object's states, movements, and actions. On a semantic level, each event can be defined as a sequential succession of a few well-defined states. An event could be started at one or more initial states, and then one state can transit to the next state when new conditions are met as the scene evolves in time. When a specific state is reached, the event is declared. State transition may also happen from an intermediate state back to a previous state if some conditions no more hold for the state. The semantic representation can be modelled based on Finite State Machines (FSM). The FSM has at least two advantages: (1) it is explicit and natural for semantic description; (2) FSM can readily and flexibly incorporate a variety of context information from intermediate-level processing.
Using Finite State Machine, each specific event can be represented by a directed graph
$G_i = (S_i, E_i)$, where $S_i$ is the set of nodes representing the states and $E_i$ is the set of edges representing the transitions. One example of an FSM 800 is shown in Figure 8. Any new object is initiated to state "0" 802 for all the events defined. This is the initial state. The FSM 800 is truly started only when some conditions are met and the active node transits to the next intermediate state, i.e., state "1" 804. There can be more than one intermediate state in the FSM 800 of an event, depending on the complexity of the event. The FSM 800 reaches the final state "End" 406 when all the conditions are met, and then the corresponding specific event is triggered. The FSM 800 is updated at each new frame. The FSM 800 can have a self-loop transition for each state. Although the FSM 800 may remain in the same state, some or all properties of the object may have changed. At least, a time counter is incremented for each frame.
The more complicated an event, the larger is N, the number of intermediate states in the FSM 800, and the greater is the chance of delivering an unreliable detection result. Therefore, an important task in event modeling is to trim any unnecessary states by careful analysis and to identify the simplest event model.
The input of an FSM is the numerical perceptual features generated by moving and stationary object tracking modules (compare 702 and 704 in Figure 7). The visual cues of each tracked object can include shape, position, motion, and relations with others. The visual cues in the example implementation are:
- Object ID: the identity number of each tracked foreground object;
- Box: bounding box of the tracked object in current frame;
- Size: the area of the object in current frame;
- Status: indicates whether the tracked object is moving around or stationary,
- StayTime: indicates how long the object has stayed in the scene;
- InGroup: indicates whether the object is an isolated one or merged with others;
- Visibility: a measure within [0,1] indicates the degree of occlusion when overlapping with others;
- Motion: a measure within [0,1] indicates the degree of interior motion of a stationary object.
The general processing flow for event detection in the example implementation is shown in Table III.
An advantage of the tracking modules (compare 702, 704 in Figure 7) is the capability to resume tracking of some objects that are lost for a few frames. The two events, UNATTENDED OBJECT and THEFT, are directly concerned with object disappearance in the example implementation. Thus when an active object does not appear in the track records of the current frame, one preferably determines whether it is temporarily lost or whether there is a genuine disappearance. To achieve this, a first-in-first-out (FIFO) stack is built to contain the track records of N frames. OTracked are the track records of the previous N-th frame and the triggered event is delayed by N frames. As such, in the example implementation it is possible to 'look forward' to check the case of disappearance of an object, with N= 30 in the example implementation. With a processing rate of 8 frames/sec or above, this represents a delay of less than 4 seconds. It will be appreciated that the delay can be balanced against the accuracy of detection in different implementations.
Loitering Detection
Loitering as defined in the example implementation involves one object. It is defined as a person wandering in the observed scene for a duration $t > T_{Loitering}$. The FSM is initialized for each new object. The FSM has one intermediate state, "Stay", which indicates that the tracked person is staying in the scene, whether moving around or stationary. There are two conditions for the transition from state "INIT" to state "Stay":
- The object is classified as human;
- The object moves in the scene (moving around or staying somewhere with frequent interior motion).

In state "Stay", a time counter $t$ is continuously incremented as new frames come in. When $t > T_{Loitering}$, the FSM transits from state "Stay" to state "Loiter" and a loitering event is triggered.
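As one illustration of how such an FSM might be coded for the loitering event (state names follow the text; the frame-rate constant, the reset behaviour, and the classification inputs are assumptions):

```python
class LoiteringFSM:
    """Minimal finite state machine for the loitering event: INIT -> Stay -> Loiter."""
    def __init__(self, t_loitering_frames=8 * 60):   # e.g. 60 s at 8 frames/sec (assumed)
        self.state = "INIT"
        self.timer = 0
        self.t_loitering = t_loitering_frames

    def update(self, is_human, in_scene):
        if self.state == "INIT" and is_human and in_scene:
            self.state, self.timer = "Stay", 0
        elif self.state == "Stay":
            if not in_scene:
                self.state = "INIT"        # track lost: reset (assumed behaviour)
            else:
                self.timer += 1
                if self.timer > self.t_loitering:
                    self.state = "Loiter"  # loitering event triggered
        return self.state

fsm = LoiteringFSM(t_loitering_frames=3)
for _ in range(5):
    print(fsm.update(is_human=True, in_scene=True))
```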
Unconscious Person Detection
As defined in the example implementation, this event also involves one object, a person. It is defined as an object becoming completely static for a duration $t > T_{Static}$. The FSM is initialized for each new object. When the tracked object is recognised as a person, the FSM transits to state "M", which indicates a person who is moving around or has significant interior motion. The second intermediate state of the FSM is "S", which indicates a person becoming and staying static, or completely motionless. There are two conditions for the transition from state "M" to state "S":
- The position of the person does not change compared to the previous frame;
- The interior motion of the person satisfies $m < T_{InMotion}$.
In state "S", a time counter t is continuously incremented as new frames are coming in. When / > TStatic , the FSM transits from state "S" to state "UP", indicating that an unconscious person is detected. Examples of unconscious person include a sleeping or faint person. It will be appreciated that similar condictions can be used to detetc e.g. a vehicle staying overtime in a zone for short stopping, in which case the object of interest is changed to vehicle instead of person.
Unattended Object Detection
This event as defined in the example implementation involves two objects. The FSM is initialized for each new object. When the new small object is identified as being separated from another large moving object, and it stays static, a deposited object is detected and the ownership is established between the two objects. The FSM transits from state "INIT" to state "Station". In the state, the object is associated with its owner. If the owner leaves the scene covered by the camera, the FSM transits from state "Station" to state "UO" and the 'Unattended Object' is declared.
Theft Detection
This event as defined in the example implementation involves three objects. The FSM is initialized for each new object. Similarly to the unattended object event, when a new small object is identified as having separated from another large moving object, and it stays static, a deposited object is detected and the ownership relation is established between the two objects. The FSM transits from state "INIT" to state "Station". In this state, the object is associated with its owner. However, when the object disappears because another object has taken it while the owner still stays in the scene, the FSM transits from the state "Station" to the state "Theft" and a 'Theft' event is declared; meanwhile, the second person is identified as the potential thief.
The method and system of the example embodiment can be implemented on a computer system 900, schematically shown in Figure 9. It may be implemented as software, such as a computer program being executed within the computer system 900, and instructing the computer system 900 to conduct the method of the example embodiment.
The computer system 900 comprises a computer module 902, input modules such as a keyboard 904 and mouse 906 and a plurality of output devices such as a display 908, and printer 910.
The computer module 902 is connected to a computer network 912 via a suitable transceiver device 914, to enable access to e.g. the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
The computer module 902 in the example includes a processor 918, a Random Access Memory (RAM) 920 and a Read Only Memory (ROM) 922. The computer module 902 also includes a number of Input/Output (I/O) interfaces, for example I/O interface 924 to the display 908, and I/O interface 926 to the keyboard 904.
The components of the computer module 902 typically communicate via an interconnected bus 928 and in a manner known to the person skilled in the relevant art.
The application program is typically supplied to the user of the computer system 900 encoded on a data storage medium such as a CD-ROM or flash memory carrier and read utilising a corresponding data storage medium drive of a data storage device 930. The application program is read and controlled in its execution by the processor 918. Intermediate storage of program data may be accomplished using RAM 920.
It will be appreciated by a person skilled in the art that numerous variations and/or modifications may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.

Claims

1. A method of multi-object tracking in a video signal; the method comprising the steps of: receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, updating a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
2. The method as claimed in claim 1, wherein step a) comprises calculating visible portions of the respective corresponding objects in the first image to derive the estimated depths of the respective corresponding objects.
3. The method as claimed in claim 1, wherein step b) comprises assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node based on a posterior probability evaluated by Bayes rule.
4. The method as claimed in claim 3, wherein the posterior probability evaluated by Bayes rule is based on principal colour representations (PCRs) of the corresponding object and said one child node respectively.
5. The method as claimed in claim 1, wherein step c) comprises removing the visual content of the assigned corresponding object from the visual data associated with said one child node based on PCRs of the assigned corresponding object and said one child node respectively.
6. The method as claimed in claim 1, wherein step d) comprises calculating visible portions of the respective corresponding objects in the second image to derive the estimated depths of the respective corresponding objects.
7. The method as claimed in claim 1, wherein step e) comprises applying the mean-shift calculation to locate the corresponding object having the lowest depth in said each child node based on gravity centres of pixels of each principal colour component in a PCR of the corresponding object in the second image.
8. The method as claimed in claim 1, wherein step g) comprises removing the updated visual content of the located corresponding object from the visual data associated with said each child node based on PCRs of the located corresponding object and said each child node respectively.
9. The method as claimed in claim 1, further comprising storing tracking data including the updated status and visual content of each corresponding object for a series of consecutive frames and detecting an event in the video signal based on the stored tracking data.
10. The method as claimed in claim 1, further comprising the step of: for each parent node having no child node, deleting the corresponding object.
11. The method as claimed in claim 1, further comprising the step of: for each parent node having only one child node, assigning all corresponding objects to said one child node.
12. The method as claimed in claim 1, further comprising the step of: for each child node having no corresponding object assigned thereto, checking whether the object has disappeared, and if not, setting a new corresponding object to said each child node.
13. The method as claimed in claim 1, further comprising the step of: for each child node having only one corresponding object assigned thereto, updating the state and visual content of said one corresponding object from the visual data associated with said each child node.
14. A multi-object tracking module for multi-object tracking in a video signal; the module comprising: means for receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; means for generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and the means for generating, for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then, for each child node having only one corresponding object assigned thereto, updating a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
15. A data storage medium having stored thereon computer code means for instructing a computer system to execute a method of multi-object tracking in a video signal; the method comprising the steps of: receiving first and second segmented images of two consecutive frames of the video signal respectively, at least one of the first and second segmented images including one or more foreground regions, each foreground region corresponding to one or more objects to be tracked; generating one or more directed acyclic graphs (DAGs) for zero or more parent nodes in the first segmented image and zero or more child nodes in the second segmented image, each DAG including at least one parent or child node; and for each parent node having two or more child nodes, a) sorting the corresponding objects of the foreground region contributing to said each parent node according to estimated depth in said first image; b) assigning the corresponding object having the lowest depth to one of the child nodes of said each parent node; c) removing a visual content of the assigned corresponding object from the visual data associated with said one child node; and iterating steps b) to c) in order of increasing depth of the corresponding objects for assigning all corresponding objects to the two or more child nodes; and then for each child node having only one corresponding object assigned thereto, updating a state and the visual content of said one object based on the second image; for each child node having two or more corresponding objects assigned thereto, d) sorting the corresponding objects according to estimated depth in said each child node in said second image; e) applying a mean-shift calculation to locate the corresponding object having the lowest depth in said each child node; f) updating the state and the visual content of the located corresponding object based on the second image; g) removing the updated visual content of the located corresponding object from the visual data associated with said each child node; and iterating steps e) to g) in order of increasing depth of the corresponding objects for locating all corresponding objects in a corresponding region of said each child node.
16. A method of stationary object tracking in a video signal, the method comprising the steps of: determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
17. The method as claimed in claim 16, wherein the generating of the template image is based on image data within a bounding box in the at least one of the frames of the sequence for the tracked object.
18. The method as claimed in claim 16, wherein the tracking of the state comprises steps of: determining a first difference measure between the template image and a corresponding region in the current frame; determining a second difference measure between respective corresponding regions in the current frame and a preceding frame; determining a visibility measure of the stationary object from the corresponding region in the current frame.
19. The method as claimed in claim 18, wherein the tracking of the state further comprises determining whether another tracked moving object overlaps the corresponding region in the current frame.
20. The method as claimed in claim 19, wherein the tracking of the state further comprises the steps of: determining a motionless state if the first and second difference measures are below a first threshold value over a sequence of τp current frames; determining an occluded state if the first and second difference measures each exceed a second threshold value and the visibility measure falls below a third threshold value over the sequence of τp current frames, and another tracked moving object overlaps the corresponding region in the current frame; determining a removed state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure falls below the third threshold value over the sequence of τp current frames, and no tracked moving object overlaps the corresponding region in the current frame; determining an inner-motion state if the first and second difference measures each initially exceed the second threshold value and then the second difference measure falls below the first threshold value and the visibility measure is above a fourth threshold value; and determining the start moving state if the first and second difference measures exceed and remain above the second threshold value and the visibility measure exceeds and remains above the fourth threshold value over the sequence of τp current frames, and a spatial shift of the tracked stationary object is detected.
21. The method as claimed in claim 18, wherein the visibility measure is determined based on principal colour representation.
22. The method as claimed in claim 18, wherein the first and second difference measures are determined based on a knowledge base of human perceived semantic meanings, an evaluation from real-world videos, or both.
23. A system for object tracking in a video signal, the system comprising: means for determining that a tracked moving object has become stationary over a sequence of frames; means for generating a template image of the stationary object based on at least one of the frames of the sequence; means for tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and means for switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
24. A data storage medium having stored thereon computer code means for instructing a computer system to execute a method of object tracking in a video signal, the method comprising the steps of: determining that a tracked moving object has become stationary over a sequence of frames; generating a template image of the stationary object based on at least one of the frames of the sequence; tracking a state of the stationary object based on a comparison of the template image with a current frame of the video signal; and switching to a moving object tracking algorithm using a same object ID if the state of the stationary object is determined as a start moving state.
PCT/SG2007/000206 2006-07-11 2007-07-11 Method and system for multi-object tracking WO2008008046A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US80696406P 2006-07-11 2006-07-11
US60/806,964 2006-07-11

Publications (1)

Publication Number Publication Date
WO2008008046A1 true WO2008008046A1 (en) 2008-01-17

Family

ID=38923513

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/SG2007/000206 WO2008008046A1 (en) 2006-07-11 2007-07-11 Method and system for multi-object tracking
PCT/SG2007/000205 WO2008008045A1 (en) 2006-07-11 2007-07-11 Method and system for context-controlled background updating

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/SG2007/000205 WO2008008045A1 (en) 2006-07-11 2007-07-11 Method and system for context-controlled background updating

Country Status (2)

Country Link
SG (1) SG150527A1 (en)
WO (2) WO2008008046A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8284249B2 (en) 2008-03-25 2012-10-09 International Business Machines Corporation Real time processing of video frames for triggering an alert
GB2459701B (en) * 2008-05-01 2010-03-31 Pips Technology Ltd A video camera system
US8483481B2 (en) 2010-07-27 2013-07-09 International Business Machines Corporation Foreground analysis based on tracking information
WO2014038924A2 (en) * 2012-09-06 2014-03-13 Mimos Berhad A method for producing a background model
EP3246874B1 (en) * 2016-05-16 2018-03-14 Axis AB Method and apparatus for updating a background model used for background subtraction of an image
CN107368784A (en) * 2017-06-15 2017-11-21 西安理工大学 A kind of novel background subtraction moving target detecting method based on wavelet blocks
CN110121034B (en) * 2019-05-09 2021-09-07 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for implanting information into video
CN117953015B (en) * 2024-03-26 2024-07-09 武汉工程大学 Multi-row person tracking method, system, equipment and medium based on video super-resolution

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7418134B2 (en) * 2003-05-12 2008-08-26 Princeton University Method and apparatus for foreground segmentation of video sequences
US7224735B2 (en) * 2003-05-21 2007-05-29 Mitsubishi Electronic Research Laboratories, Inc. Adaptive background image updating

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6542621B1 (en) * 1998-08-31 2003-04-01 Texas Instruments Incorporated Method of dealing with occlusion when tracking multiple objects and people in video sequences
US6879705B1 (en) * 1999-07-14 2005-04-12 Sarnoff Corporation Method and apparatus for tracking multiple objects in a video sequence
US6826292B1 (en) * 2000-06-23 2004-11-30 Sarnoff Corporation Method and apparatus for tracking moving objects in a sequence of two-dimensional images using a dynamic layered representation
US20040146183A1 (en) * 2001-02-26 2004-07-29 Yair Shimoni Method and system for tracking an object
JP2004348303A (en) * 2003-05-21 2004-12-09 Fujitsu Ltd Object detection system and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PATENT ABSTRACTS OF JAPAN *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8572740B2 (en) 2009-10-01 2013-10-29 Kaspersky Lab, Zao Method and system for detection of previously unknown malware
US9633265B2 (en) 2013-10-10 2017-04-25 Canon Kabushiki Kaisha Method for improving tracking in crowded situations using rival compensation
AU2013242830B2 (en) * 2013-10-10 2016-11-24 Canon Kabushiki Kaisha A method for improving tracking in crowded situations using rival compensation
CN103729861A (en) * 2014-01-03 2014-04-16 天津大学 Multiple object tracking method
CN103729861B (en) * 2014-01-03 2016-06-22 天津大学 A kind of multi-object tracking method
US9922262B2 (en) 2014-12-10 2018-03-20 Samsung Electronics Co., Ltd. Method and apparatus for tracking target object
US10319095B2 (en) 2016-05-26 2019-06-11 Nokia Technologies Oy Method, an apparatus and a computer program product for video object segmentation
GB2550858A (en) * 2016-05-26 2017-12-06 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation
CN109643452A (en) * 2016-08-12 2019-04-16 高通股份有限公司 The method and system of lost objects tracker is maintained in video analysis
CN110785775A (en) * 2017-07-07 2020-02-11 三星电子株式会社 System and method for optical tracking
CN110785775B (en) * 2017-07-07 2023-12-01 三星电子株式会社 System and method for optical tracking
CN108399411A (en) * 2018-02-26 2018-08-14 北京三快在线科技有限公司 A kind of multi-cam recognition methods and device
CN109143222A (en) * 2018-07-27 2019-01-04 中国科学院半导体研究所 Based on the three dimensional maneuvering object tracking for sampling particle filter of dividing and ruling
CN111179304B (en) * 2018-11-09 2024-04-05 北京京东尚科信息技术有限公司 Target association method, apparatus and computer readable storage medium
CN111179304A (en) * 2018-11-09 2020-05-19 北京京东尚科信息技术有限公司 Object association method, device and computer-readable storage medium
US11604254B2 (en) 2019-08-16 2023-03-14 Fujitsu Limited Radar-based posture recognition apparatus and method and electronic device
CN110889864A (en) * 2019-09-03 2020-03-17 河南理工大学 Target tracking method based on double-layer depth feature perception
CN110889864B (en) * 2019-09-03 2023-04-18 河南理工大学 Target tracking method based on double-layer depth feature perception
CN112991382A (en) * 2019-12-02 2021-06-18 中国科学院国家空间科学中心 PYNQ frame-based heterogeneous visual target tracking system and method
CN112991382B (en) * 2019-12-02 2024-04-09 中国科学院国家空间科学中心 Heterogeneous visual target tracking system and method based on PYNQ framework
CN111178218B (en) * 2019-12-23 2023-07-04 北京中广上洋科技股份有限公司 Multi-feature joint video tracking method and system based on face recognition
CN111178218A (en) * 2019-12-23 2020-05-19 北京中广上洋科技股份有限公司 Multi-feature combined video tracking method and system based on face recognition
CN111340846B (en) * 2020-02-25 2023-02-17 重庆邮电大学 Multi-feature fusion anti-occlusion target tracking method
CN111340846A (en) * 2020-02-25 2020-06-26 重庆邮电大学 Multi-feature fusion anti-occlusion target tracking method
CN111726264B (en) * 2020-06-18 2021-11-19 中国电子科技集团公司第三十六研究所 Network protocol variation detection method, device, electronic equipment and storage medium
CN111726264A (en) * 2020-06-18 2020-09-29 中国电子科技集团公司第三十六研究所 Network protocol variation detection method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2008008045A1 (en) 2008-01-17
SG150527A1 (en) 2009-03-30

Similar Documents

Publication Publication Date Title
WO2008008046A1 (en) Method and system for multi-object tracking
EP1836683B1 (en) Method for tracking moving object in video acquired of scene with camera
Camplani et al. Background foreground segmentation with RGB-D Kinect data: An efficient combination of classifiers
Portmann et al. People detection and tracking from aerial thermal views
Cristani et al. Background subtraction for automated multisensor surveillance: a comprehensive review
Zhang et al. Mining semantic context information for intelligent video surveillance of traffic scenes
US8233676B2 (en) Real-time body segmentation system
Lu et al. A nonparametric treatment for location/segmentation based visual tracking
Vu et al. Audio-video event recognition system for public transport security
Yun et al. Unsupervised moving object detection through background models for ptz camera
Lu et al. Detecting unattended packages through human activity recognition and object association
Cheng et al. Segmentation of aerial surveillance video using a mixture of experts
Tavakkoli et al. A support vector data description approach for background modeling in videos with quasi-stationary backgrounds
Marcenaro et al. Self-organizing shape description for tracking and classifying multiple interacting objects
Khatri et al. Video analytics based identification and tracking in smart spaces
Abdel-Gawad et al. Vulnerable road users detection and tracking using yolov4 and deep sort
Cuevas et al. Tracking-based non-parametric background-foreground classification in a chromaticity-gradient space
Michael et al. Automatic vehicle detection and tracking in aerial surveillances using SVM
Ghahremannezhad Advanced Traffic Video Analytics for Robust Traffic Accident Detection
Garibotto et al. Object detection and tracking from fixed and mobile platforms
Ali Feature-based tracking of multiple people for intelligent video surveillance.
Harasse et al. Multiple faces tracking using local statistics
Tavakkoli et al. Background Learning with Support Vectors: Efficient Foreground Detection and Tracking for Automated Visual Surveillance
Huang et al. Region-level motion-based foreground detection with shadow removal using MRFs
Jeyabharathi et al. Background Subtraction and Object Tracking via Key Frame-Based Rotational Symmetry Dynamic Texture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07794225

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 07794225

Country of ref document: EP

Kind code of ref document: A1