CA2667066A1 - Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images - Google Patents

Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images

Info

Publication number
CA2667066A1
Authority
CA
Canada
Prior art keywords
image
infrared
color
foreground
video signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA2667066A
Other languages
French (fr)
Inventor
Pierre Benoit Boulanger
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CA2667066A priority Critical patent/CA2667066A1/en
Publication of CA2667066A1 publication Critical patent/CA2667066A1/en
Abandoned legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/143 Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

An apparatus and method is provided for the automatic, real-time, bi-layer segmentation of the foreground and background portions of an image using color and infrared images of the scene. The method includes illuminating an object with infrared and visible light to produce infrared and color images of the object. An infrared mask is produced from the infrared image to predict the foreground and background portions of the image. A pentamap is produced from the color image to partition the color image into five distinct regions. An iterative relaxation-labeling algorithm is applied to the images to determine the foreground and background portions of the image. The algorithm converges when a compatibility coefficient between adjacent pixels in the image reaches a predetermined threshold. A Gaussian filter or matting technique can then be applied to smooth the transition between the foreground and background portions of the image.

Description

TITLE: APPARATUS AND METHOD FOR AUTOMATIC REAL-TIME BI-LAYER
SEGMENTATION USING COLOR AND INFRARED IMAGES

INVENTOR:
Pierre Benoit Boulanger

TECHNICAL FIELD:

The present disclosure is related to the field of bi-layer video segmentation of foreground and background images using the fusion of self-registered color and infrared ("IR") images, in particular, a sensor fusion based on a real-time implementation of a relaxation labelling algorithm computed on a graphics processing unit ("GPU").

BACKGROUND:
Many tasks in computer vision involve bi-layer video segmentation. One important application is in teleconferencing, where there is a need to substitute the original background with a new one. A large number of papers have been published on bi-layer video segmentation. For example, background subtraction techniques try to solve this problem by using adaptive thresholding with a background model [1].

One of the most well-known techniques is chroma keying, which uses blue or green backgrounds to separate the foreground objects. Because of its low cost, it is heavily used in photography and cinema studios around the world. On the other hand, these techniques are difficult to implement in real office environments or outdoors, as the segmentation results depend heavily on constant lighting and on access to a blue or green background. To remedy this problem, some techniques learn the background from frames in which the foreground object is not present. Again, those techniques are plagued by ambient lighting fluctuations as well as by shadows.
Other techniques perform segmentation based on a stereo disparity map computed from two or more cameras [2, 3]. These methods have several limitations: they are not robust to illumination changes, and scene features often make a dense stereo map difficult to obtain. They also suffer from low computational efficiency and segmentation accuracy.
Recently, several researchers have used active depth-cameras in combination with a regular camera to acquire depth data to assist in foreground segmentation [4, 5]. The way they combine the two cameras, however, involves scaling, re-sampling and dealing with synchronization problems. There are some special video cameras available today that produce both depth and red-green-blue ("RGB") signals using time-of-flight, e.g. ZCam [6], but this is a very complex technology that requires the development of new miniaturized streak cameras, which are hard to produce at low cost.

It is, therefore, desirable to provide a system and method for the bi-layer video segmentation of foreground and background images that overcomes the shortcomings in the prior art.

SUMMARY:
A new solution to the problem of bi-layer video segmentation is provided in terms of both the hardware design and the algorithmic solution. At the data acquisition stage, infrared video can be used, which is robust to illumination changes and provides an automatic initialization of a bitmap for foreground-background segmentation. A contrast-preserving relaxation labelling algorithm can then be used to finalize the segmentation process using the color information. The parallel characteristics of this new relaxation-labelling algorithm allow it to be implemented on graphics processing unit ("GPU") hardware to achieve real-time (30 Hz) performance on commodity hardware.

Broadly stated, an apparatus is provided for automatic real-time bi-layer segmentation of foreground and background images, comprising: an infrared ("IR") light source configured to illuminate an object with IR light, the object located in a foreground portion of an image, the image further comprising a background portion; a color camera configured to produce a color video signal; an IR camera configured to produce an infrared video signal; a beam splitter operatively disposed between the color camera and the IR camera whereby a portion of light reflecting off of the object passes through the beam splitter to the color camera and another portion of light reflecting off of the object reflects off of the beam splitter and passes through an interference filter to the IR camera, the interference filter configured to allow IR light to pass through; and a video processor operatively coupled to the color camera and the IR camera and configured to receive the color video signal and the IR video signal, the video processor further comprising algorithm means configured to enable the video processor to process the color and IR video signals to separate the foreground portion from the background portion of the image and produce an output video signal that contains only the foreground portion of the image.

Broadly stated, a method is provided for the automatic real-time bi-layer segmentation of foreground and background images, the method comprising the steps of: illuminating an object in an image with infrared and visible light, the image further comprising a foreground portion and a background portion; producing a color video signal of the image; producing an infrared video signal of the image; processing the color video signal to produce a color image array comprising a plurality of color vectors, each color vector representing a color pixel in a color image of the image; processing the infrared video signal to produce an infrared image array comprising a plurality of infrared components, each infrared component representing a gray scale value of an infrared pixel in an infrared image of the image; comparing the infrared image array against a predetermined threshold to produce an infrared mask to predict the foreground and background portions of the image; using the color image array, producing a pentamap to partition the color image into a plurality of regions; applying a Gaussian Mixture Model algorithm to the infrared mask and the pentamap to determine an initial segmentation of the foreground and background portions of the image; applying an iterative relaxation labelling algorithm to the infrared mask and the pentamap until a compatibility coefficient between adjacent pixels in the image is equal to or less than a predetermined threshold, thereby finalizing the definition of the foreground and background portions of the image; and smoothing the transition between the foreground and background portions of the image.

Broadly stated, an apparatus is provided for automatic real-time bi-layer segmentation of foreground and background images, comprising: means for illuminating an object in an image with infrared and visible light, the image further comprising a foreground portion and a background portion; means for producing a color video signal of the image; means for producing an infrared video signal of the image; means for processing the color video signal to produce a color image array comprising a plurality of color vectors, each color vector representing a color pixel in a color image of the image; means for processing the infrared video signal to produce an infrared image array comprising a plurality of infrared components, each infrared component representing a gray scale value of an infrared pixel in an infrared image of the image; means for comparing the infrared image array against a predetermined threshold to produce an infrared mask to predict the foreground and background portions of the image; means for producing a pentamap using the color image array to partition the color image into a plurality of regions; means for applying a Gaussian Mixture Model algorithm to the infrared mask and the pentamap to determine an initial segmentation of the foreground and background portions of the image; means for applying an iterative relaxation labelling algorithm to the infrared mask and the pentamap until a compatibility coefficient between adjacent pixels in the image is equal to or less than a predetermined threshold, thereby finalizing the definition of the foreground and background portions of the image; and means for smoothing the transition between the foreground and background portions of the image.

BRIEF DESCRIPTION OF THE DRAWINGS:

Figure 1 is a block diagram depicting an apparatus for the bi-layer video segmentation of foreground and background images.

Figure 2 is a pair of images depicting synchronized and registered infrared and color images.

Figure 3 is a pair of images depicting images as viewed through an infrared mask and a pentamap.

Figure 4 is a set of twelve images depicting an image converging to a final image after successive iterations of a relaxation algorithm applied to the image.


Figure 5 is a pair of images depicting the effect of Gaussian border blending applied to the image.

Figure 6 is a block diagram depicting a system for processing the bi-layer video segmentation of an image.

Figure 7 is a flowchart depicting a method for bi-layer video segmentation of foreground and background images.

DETAILED DESCRIPTION OF EMBODIMENTS

Referring to Figure 1, a block diagram of an embodiment of a data acquisition apparatus for a system for the bi-layer video segmentation of foreground and background images is shown. In this embodiment, the foreground can be illuminated by an invisible infrared ("IR") light source having a wavelength ranging between 850 nm and 1500 nm that can be captured by an infrared camera tuned to the wavelength selected, using a narrow-band (±25 nm) optical filter to reject all light except that produced by the IR light source. In a representative embodiment, an 850 nm IR light source can be used, but other embodiments can use other IR wavelengths, as well known to those skilled in the art, depending on the application requirements. The IR camera and color camera can produce a mirrored video pair that is synchronized both in time and space, using a genlock mechanism for temporal synchronization and an optical beam splitter for spatial registration. With this system, there is no need to align the images using complex calibration algorithms since they are guaranteed to be coplanar and coaxial.

An example of a video frame captured by the apparatus of Figure 1 is shown in Figure 2. As one can see, the IR image captured using the apparatus of Figure 1 is a mirror version of the color image captured by this apparatus. This is due to the reflection imparted on the IR image by reflecting off of the beam splitter. The mirrored IR image can be easily corrected using image transposition, as well known to those skilled in the art.

In one embodiment, the apparatus can automatically produce synchronized IR and color video pairs, which can reduce or eliminate problems arising from synchronizing the IR and color images. In another embodiment, the IR information captured by the apparatus can be independent of illumination changes; hence, a bitmap of the foreground/background can be made to produce an initial image. In a further embodiment, the IR illuminator can add flexibility to the foreground definition, by moving the IR illuminator to any object to be segmented from the rest of the image. In so doing, the foreground can be defined by the object within a certain distance from the IR illuminator rather than from the camera.

One advantage of the IR image is that it can be used to predict foreground and background areas in the image. The IR image is a gray scale image in which brighter parts indicate the foreground (as illuminated by the IR source). Missing foreground parts must be within a certain distance of the illuminated parts.

For the foreground/background segmentation, in one embodiment, one can determine a binary label vector:

f = (f_1, f_2, ..., f_|P|), where f_n ∈ {0, 1},

with 0 being the background label, 1 being the foreground label, and |P| being the number of pixels.


After registration in the image plane of both the IR and color images, each IR image can be represented as an array:

I = (I_1, I_2, ..., I_|P|),

and each color image can be represented as an array:

Z = (Z_1, Z_2, ..., Z_|P|),

where I_n is a gray scale value and Z_n is an RGB color vector.

In one embodiment, an estimate of the foreground area can be found by comparing the IR image against a predetermined threshold to produce an IR-MASK defined as:

IR-MASK = {pixels p such that the IR value I_p > T}

where T can be determined automatically using the Otsu algorithm [11].
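As an illustration, this thresholding step maps directly onto OpenCV's Otsu mode. The sketch below is a minimal example; the variable names and the synthetic frame are assumptions for demonstration, not part of the disclosure:

```python
import cv2
import numpy as np

# 'ir' stands in for one 8-bit, single-channel IR frame (here synthetic);
# Otsu's method picks the threshold T automatically from the histogram.
ir = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
T, ir_mask = cv2.threshold(ir, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(f"Otsu threshold T = {T}")  # ir_mask is 255 where I_p > T, else 0
```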

Using this binary image, one can further define a pentamap P that can partition the color image into five regions. These regions can be:

1. certain foreground ("CF");
2. certain background ("CB");
3. local foreground ("LF");
4. local background ("LB"); and
5. uncertain region ("U").

In one embodiment, the pentamap can be formed from the initial segmentation produced by the thresholding of the IR image, where the bitmap IR-MASK can be processed by various morphological operators [10] such as erosion and dilation.

In another embodiment, a pentamap can be composed of five binary maps:

P = {T_CF, T_CB, T_LF, T_LB, T_U}, with:

1. T_CF = {pixels that are members of IR-MASK.erosion(s)}
2. T_LF = {pixels that are members of IR-MASK .xor. T_CF}
3. T_CB = {pixels that are members of .not.(IR-MASK.dilation(r+s))}
4. T_LB = {pixels that are members of IR-MASK.dilation(r+s) .xor. IR-MASK.dilation(r)}
5. T_U = {pixels that are members of P .xor. T_CF .xor. T_CB .xor. T_LF .xor. T_LB}

where s is the size of the structuring element for the foreground and r is the size of the structuring element for the background.

Examples of the effect of the IR-MASK and the pentamap on an image are shown in Figure 3. Sample foreground/background cues can be drawn from T_LF/T_LB, which form an interior/exterior narrow strip of width s around the foreground. In one embodiment, it can be assumed that the gray level pixels in the unknown area T_U are consistent with the neighbouring regions. In this embodiment, the size of the structuring element s (in the definitions of T_CB and T_LB above) does not change the segmentation result much as long as s ∈ [15, 25].
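A minimal sketch of this pentamap construction using standard morphological operators [10] is given below. The square structuring elements and the default sizes are assumptions; the function name is illustrative:

```python
import cv2
import numpy as np

def pentamap(ir_mask, s=20, r=20):
    """Partition the frame into the five pentamap regions from a binary
    IR mask, following definitions 1-5 above (s, r sizes are assumptions)."""
    m = (ir_mask > 0).astype(np.uint8)
    t_cf = cv2.erode(m, np.ones((s, s), np.uint8)).astype(bool)       # def. 1
    dil_rs = cv2.dilate(m, np.ones((r + s, r + s), np.uint8)).astype(bool)
    dil_r = cv2.dilate(m, np.ones((r, r), np.uint8)).astype(bool)
    mb = m.astype(bool)
    t_lf = mb & ~t_cf                 # def. 2: mask minus its erosion
    t_cb = ~dil_rs                    # def. 3: outside the (r+s)-dilation
    t_lb = dil_rs & ~dil_r            # def. 4: band between the dilations
    t_u = ~(t_cf | t_lf | t_cb | t_lb)  # def. 5: whatever remains
    return t_cf, t_lf, t_cb, t_lb, t_u
```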

In this algorithm, a Gaussian Mixture Model ("GMM") can be used to represent the foreground/background color model. Each GMM can be represented by a full-covariance Gaussian mixture with M components (M = 10). In one embodiment, the GMM for the foreground can be represented as:

K_F = {K_F1, K_F2, ..., K_FM};

and similarly for the background as:

K_B = {K_B1, K_B2, ..., K_BM}.
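As an illustration only, such full-covariance GMMs can be fitted with scikit-learn, and the per-component maximum density used in equation (1) below can be evaluated with SciPy. Function names and the choice of libraries are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_color_model(pixels, m=10):
    """Fit a full-covariance GMM with M = 10 components to an (N, 3) array
    of RGB samples, e.g. drawn from T_LF (foreground) or T_LB (background)."""
    return GaussianMixture(n_components=m, covariance_type="full").fit(pixels)

def max_component_density(gmm, pixels):
    """max_{i=1..M} Pr(Z_p | K_i): the best single-component density for
    each pixel, as used in equation (1) below."""
    dens = np.stack([multivariate_normal.pdf(pixels, mean=mu, cov=cov)
                     for mu, cov in zip(gmm.means_, gmm.covariances_)])
    return dens.max(axis=0)
```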

Following this initial segmentation, relaxation labelling [7] can be used to reduce ambiguities and noise based on the parallel use of local constraints between labels. In this embodiment, each pixel can first be assigned a probability vector and a label based on the color information. The probability vector can be updated iteratively based on the local constraints between labels [8]. The relaxation labelling process can be composed of four main steps:

Step 1: Initialization: For each pixel p, compute a probability vector Pr^0(p) = [Pr_1^0(p), Pr_0^0(p)], where Pr_1 is the probability of pixel p being foreground (f_p = 1) and Pr_0 is the probability of it being background (f_p = 0).

Based on the pentamap, the probability vector can be defined according to the following scheme:

Pr_1^0(p) = 1, for p ∈ T_CF ∪ T_LF;
Pr_1^0(p) = max_{i=1..M} Pr(Z_p | K_Fi) / [max_{i=1..M} Pr(Z_p | K_Fi) + max_{i=1..M} Pr(Z_p | K_Bi)], for p ∈ T_U;    (1)
Pr_1^0(p) = 0, for p ∈ T_CB ∪ T_LB;

Pr_0^0(p) = 1 - Pr_1^0(p).    (2)

Step 2: Iteration: At the n-th iteration, the probability vector Pr^n(p) for pixel p can be updated based on the previous vector Pr^{n-1}(p) and the neighbourhood probability vectors Pr^{n-1}(q), q ∈ N(p), where N(p) is the 8-connected neighbourhood about pixel p and q is a neighbouring pixel of p:

Pr_i^n(p) = Pr_i^{n-1}(p)(1 + Q_i^{n-1}(p)) / Σ_{j=0}^{1} Pr_j^{n-1}(p)(1 + Q_j^{n-1}(p))    (3)

where

Q_i^{n-1}(p) = (1 / card(N(p))) Σ_{q ∈ N(p)} C(p, q) · Pr_i^{n-1}(q)    (4)

with i ∈ {0, 1}.

C(p, q) is the compatibility coefficient between pixel p and its neighbour q and can be defined by:

C(p, q) = 1 if ||Z_p - Z_q||² < ε, and C(p, q) = -1 otherwise    (5)

where Z_p and Z_q are the pixel colors of neighbouring pixels p and q, and ε is the threshold on the color difference.
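As a concrete sketch, the update of equations (3)-(5) can be vectorized with NumPy. This is a minimal, unoptimized CPU version (the disclosure targets a GPU); the function and variable names are illustrative, and np.roll's wrap-around at the image borders is a simplification of the 8-neighbourhood:

```python
import numpy as np

def relax_step(pr_fg, img, eps):
    """One relaxation-labelling iteration, equations (3)-(5).
    pr_fg : HxW foreground probabilities; img : HxWx3 color frame."""
    z = img.astype(np.float32)
    pr = np.stack([1.0 - pr_fg, pr_fg])              # i = 0: bg, i = 1: fg
    q = np.zeros_like(pr)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]       # 8-connected N(p)
    for dy, dx in shifts:
        z_q = np.roll(z, (dy, dx), axis=(0, 1))
        # compatibility C(p, q): +1 for similar colors, -1 otherwise (eq. 5)
        c = np.where(((z - z_q) ** 2).sum(axis=-1) < eps, 1.0, -1.0)
        q += c * np.roll(pr, (dy, dx), axis=(1, 2))  # accumulate eq. (4)
    q /= len(shifts)                                 # card(N(p)) = 8
    upd = pr * (1.0 + q)                             # numerator of eq. (3)
    return upd[1] / (upd.sum(axis=0) + 1e-12)        # renormalized fg plane

# Usage (Step 3): iterate a fixed number of times, then take the label
# with the largest probability:
#   for _ in range(10): pr_fg = relax_step(pr_fg, color_img, eps=900.0)
#   labels = pr_fg > 0.5
```

Note that pixels initialized to 0 or 1 remain fixed under equation (3), so the certain regions of the pentamap are preserved throughout the iterations.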

For the purposes of this specification, the embodiments presented herein are presumed to comprise a rectangular arrangement of pixels aligned both vertically and horizontally, and that the pixels comprise a rectangular or square configuration. It is obvious to those skilled in the art that other physical arrangements of pixels that are not aligned vertically or horizontally can be used with the apparatuses and methods described herein, as can pixels whose configuration is not rectangular or square, and that such arrangements and configurations are to be included with the embodiments discussed herein.

Step 3: Convergence and final labelling: After running for a fixed number of iterations, each pixel can be assigned the label with the largest probability. As shown in Figure 4, one can see the convergence rate of the pentamap as a function of the number of iterations. In a representative embodiment, 10 iterations were found to be sufficient in most cases.

Step 4: Border Blending: In one embodiment, border blending can begin with the final segmentation produced by the relaxation labeling algorithm. A Gaussian filter or matting can then be applied to the boundary contour, so that there is a smooth transition between foreground and background, eliminating obvious artifacts at the boundaries between the foreground and background portions of the image. Figure 5 illustrates the segmentation results after border blending.
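A minimal sketch of the Gaussian variant of this border blending follows, assuming a binary label map and a replacement background of the same size; the function name and the sigma value are assumptions:

```python
import cv2
import numpy as np

def blend_border(fg_frame, new_bg, labels, sigma=3.0):
    """Soften the binary foreground labels with a Gaussian filter and
    alpha-composite the foreground onto a replacement background."""
    alpha = cv2.GaussianBlur(labels.astype(np.float32), (0, 0), sigma)
    alpha = alpha[..., None]                       # broadcast over RGB
    out = alpha * fg_frame + (1.0 - alpha) * new_bg
    return out.astype(np.uint8)
```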

Referring to Figure 7, one embodiment of the method described herein can include the following steps:

1. taking color and infrared images of an object in an image.

2. comparing the infrared image to a predetermined threshold to produce an IR mask.

3. creating a tri-map of the color image to partition the color image into three distinct regions.

4. creating a pentamap from the tri-map to partition the color image into five distinct regions.

5. performing an iterative relaxation labeling algorithm on the pentamap until there is convergence in defining the foreground and background portions of the image, meaning that a compatibility coefficient between adjacent pixels in the image is equal to or less than a predetermined threshold.

6. performing a Gaussian filter or matting process on the converged image to smooth the transition between the foreground and background portions of the image, whereby a video image of the foreground portion can be produced without the background portion, thereby enabling the foreground portion to be superimposed onto any desired background imagery using video processing techniques well known to those skilled in the art. An end-to-end sketch combining these steps is given below.
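The sketch below strings the helper sketches above (Otsu mask, pentamap, GMM densities, relaxation step) into one hypothetical pipeline; the eps and n_iter values are assumptions:

```python
import cv2
import numpy as np

def segment_frame(color_img, ir_img, kf, kb, eps=900.0, n_iter=10):
    """End-to-end sketch of the Figure 7 pipeline, reusing the helper
    sketches above; parameter values are assumptions."""
    _, ir_mask = cv2.threshold(ir_img, 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    t_cf, t_lf, t_cb, t_lb, t_u = pentamap(ir_mask)
    pr_fg = np.zeros(ir_mask.shape, np.float32)
    pr_fg[t_cf | t_lf] = 1.0                       # equation (1), certain fg
    z_u = color_img[t_u].astype(np.float64)        # colors of unknown pixels
    fg = max_component_density(kf, z_u)
    bg = max_component_density(kb, z_u)
    pr_fg[t_u] = (fg / (fg + bg + 1e-12)).astype(np.float32)
    for _ in range(n_iter):                        # relaxation labelling
        pr_fg = relax_step(pr_fg, color_img, eps)
    return pr_fg > 0.5                             # final binary labels
```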
Hardware Implementation

In order to provide real-time (30 Hz) performance on commodity computing hardware, the algorithm can be implemented using GPU processing. Figure 6 illustrates one embodiment of an apparatus that can carry out the algorithm. The two cameras (color and IR) can be synchronized or "genlocked" together using the color camera as the source of a master clock. The video signals from the IR and color cameras can then be combined using a side-by-side multiplexer to ensure perfect synchronization of the frames of the two video signals. A high-speed digitizer can then convert the video signals from the multiplexer into digital form, where each pixel of the multiplexed video signals can be converted into a 24-bit integer corresponding to RGB. In the case of the IR signal, the integer can be set such that R = G = B. The digitizer can then directly transfer each digitized pixel into the main memory of a host computer using Direct Memory Access ("DMA") transfer to obtain a frame transfer rate of at least 30 Hz. The CPU can process each frame by separating the IR images from the color images and then correct the mirror transformation of the IR images through a transposition of the image. The CPU can then transfer the color and IR images through the PCI-E bus to the GPU texture memory, where the IR and color pixels can be processed using the algorithm described above. Since the GPU can be configured to operate much faster than the CPU, this processing can be performed at 120 Hz. The resulting segmented color image can then be transferred back to the system memory through the PCI-E bus. The CPU can then change the state of the device driver, indicating to the application program that a new image is ready.
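For illustration only, the CPU-side demultiplexing and mirror correction described above might look like the following; the side-by-side layout, the R = G = B channel packing, and the function name are assumptions drawn from the description:

```python
import numpy as np

def split_mux_frame(mux_frame, width):
    """Separate one side-by-side multiplexed frame into the color image and
    the mirror-corrected IR image (left/right layout is an assumption)."""
    color = mux_frame[:, :width]
    ir = mux_frame[:, width:, 0]   # IR stored with R = G = B; keep one channel
    ir = ir[:, ::-1]               # undo the beam-splitter mirror reflection
    return color, np.ascontiguousarray(ir)
```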

In one embodiment, the method described herein can be Microsoft DirectX compatible, which can make the image transfer and processing directly accessible to various programs as a virtual camera. The concept of a virtual camera can be useful, as applications such as Skype, H.323 video conferencing systems or simply video recording utilities can connect to the camera as if it were a standard webcam.

Although a few embodiments have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention. The terms and expressions used in the preceding specification have been used herein as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding equivalents of the features shown and described or portions thereof, it being recognized that the scope of the invention is defined and limited only by the claims that follow.

References:
[1] N. Friedman, S. Russell, "Image Segmentation in Video Sequences: a Probabilistic Approach", Proc. 13th Conf. on Uncertainty in Artificial Intelligence, Aug 1997, pp. 175-181.

[2] C. Eveland, K. Konolige, and R.C. Bolles, "Background modeling for segmentation of video-rate stereo sequences", Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Santa Barbara, CA, USA, Jun 1998, pp. 266-271.


[3] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother, "Bi-layer Segmentation of Binocular Video", Proc. CVPR, San Diego, CA, USA, 2005, pp. 407-414.

[4] N. Santrac, G. Friedland, R. Rojas, "High Resolution Segmentation with a Time-of-Flight 3D-Camera using the Example of a Lecture Scene", Fachbereich Mathematik und Informatik, Sep 2006.

[5] O. Wang, J. Finger, Q. Yang, J. Davis, and R. Yang, "Automatic Natural Video Matting with Depth", Pacific Conference on Computer Graphics and Applications (Pacific Graphics), 2007.

[6] G. Iddan and G. Yahav, "3D Imaging in the Studio (and Elsewhere)", Proc. SPIE, 2001, pp. 48-55.

[7] R.A. Hummel and S.W. Zucker, "On the Foundations of Relaxation Labeling Processes", IEEE Trans. Pattern Analysis and Machine Intelligence, May 1983, pp. 267-287.

[8] M.W. Hansen and W.E. Higgins, "Relaxation Methods for Supervised Image Segmentation", IEEE Trans. Pattern Analysis and Machine Intelligence, Sep 1997, pp. 949-962.

[9] Y. Boykov and M.-P. Jolly, "Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images", Proc. IEEE Int. Conf. on Computer Vision, 2001, CD-ROM.

[10] http://en.wikipedia.org/wiki/Morphological_image_processing

[11] http://en.wikipedia.org/wiki/Otsu's_method

Claims (3)

WE CLAIM:
1. An apparatus for automatic real-time bi-layer segmentation of foreground and background images, comprising:

a) an infrared ("IR") light source configured to illuminate an object with IR light, the object located in a foreground portion of an image, the image further comprising a background portion;

b) a color camera configured to produce a color video signal;

c) an IR camera configured to produce an infrared video signal;

d) a beam splitter operatively disposed between the color camera and the IR camera whereby a portion of light reflecting off of the object passes through the beam splitter to the color camera and another portion of light reflecting off of the object reflects off of the beam splitter and passes through an interference filter to the IR camera, the interference filter configured to allow IR light to pass through; and

e) a video processor operatively coupled to the color camera and the IR camera and configured to receive the color video signal and the IR video signal, the video processor further comprising algorithm means configured to enable the video processor to process the color and IR video signals to separate the foreground portion from the background portion of the image and produce an output video signal that contains only the foreground portion of the image.
2. A method for the automatic real-time bi-layer segmentation of foreground and background images, the method comprising the steps of:

a) illuminating an object in an image with infrared and visible light, the image further comprising a foreground portion and a background portion;

b) producing a color video signal of the image;

c) producing an infrared video signal of the image;

d) processing the color video signal to produce a color image array comprising a plurality of color vectors, each color vector representing a color pixel in a color image of the image;

e) processing the infrared video signal to produce an infrared image array comprising a plurality of infrared components, each infrared component representing a gray scale value of an infrared pixel in an infrared image of the image;

f) comparing the infrared image array against a predetermined threshold to produce an infrared mask to predict the foreground and background portions of the image;

g) using the color image array, producing a pentamap to partition the color image into a plurality of regions;

h) applying a Gaussian Mixture Model algorithm to the infrared mask and the pentamap to determine an initial segmentation of the foreground and background portions of the image;

i) applying an iterative relaxation labelling algorithm to the infrared mask and the pentamap until a compatibility coefficient between adjacent pixels in the image is equal to or less than a predetermined threshold, thereby finalizing the definition of the foreground and background portions of the image; and

j) smoothing the transition between the foreground and background portions of the image.
3. An apparatus for automatic real-time bi-layer segmentation of foreground and background images, comprising:

a) means for illuminating an object in an image with infrared and visible light, the image further comprising a foreground portion and a background portion;

b) means for producing a color video signal of the image;

c) means for producing an infrared video signal of the image;

d) means for processing the color video signal to produce a color image array comprising a plurality of color vectors, each color vector representing a color pixel in a color image of the image;

e) means for processing the infrared video signal to produce an infrared image array comprising a plurality of infrared components, each infrared component representing a gray scale value of an infrared pixel in an infrared image of the image;

f) means for comparing the infrared image array against a predetermined threshold to produce an infrared mask to predict the foreground and background portions of the image;

g) means for producing a pentamap using the color image array to partition the color image into a plurality of regions;

h) means for applying a Gaussian Mixture Model algorithm to the infrared mask and the pentamap to determine an initial segmentation of the foreground and background portions of the image;

i) means for applying an iterative relaxation labelling algorithm to the infrared mask and the pentamap until a compatibility coefficient between adjacent pixels in the image is equal to or less than a predetermined threshold, thereby finalizing the definition of the foreground and background portions of the image; and

j) means for smoothing the transition between the foreground and background portions of the image.
CA2667066A 2009-05-27 2009-05-27 Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images Abandoned CA2667066A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA2667066A CA2667066A1 (en) 2009-05-27 2009-05-27 Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA2667066A CA2667066A1 (en) 2009-05-27 2009-05-27 Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images

Publications (1)

Publication Number Publication Date
CA2667066A1 true CA2667066A1 (en) 2010-11-27

Family

ID=43242684

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2667066A Abandoned CA2667066A1 (en) 2009-05-27 2009-05-27 Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images

Country Status (1)

Country Link
CA (1) CA2667066A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679679A (en) * 2012-08-31 2014-03-26 西门子公司 Image segmentation method and device
US9894255B2 (en) 2013-06-17 2018-02-13 Industrial Technology Research Institute Method and system for depth selective segmentation of object
US9955140B2 2015-03-11 2018-04-24 Microsoft Technology Licensing, Llc Distinguishing foreground and background with infrared imaging
CN111989711A (en) * 2018-04-20 2020-11-24 索尼公司 Object segmentation in a sequence of color image frames based on adaptive foreground mask upsampling
CN111989711B (en) * 2018-04-20 2024-04-19 索尼公司 Object segmentation in color image frame sequences based on adaptive foreground mask upsampling
CN112150500A (en) * 2020-09-17 2020-12-29 西安工程大学 Insulator segmentation and extraction method based on joint component graying
CN112150500B (en) * 2020-09-17 2023-12-26 西安工程大学 Insulator segmentation extraction method based on joint component graying
CN114037633A (en) * 2021-11-18 2022-02-11 南京智谱科技有限公司 Infrared image processing method and device
CN114037633B (en) * 2021-11-18 2022-07-15 南京智谱科技有限公司 Infrared image processing method and device

Similar Documents

Publication Publication Date Title
US20100302376A1 (en) System and method for high-quality real-time foreground/background separation in tele-conferencing using self-registered color/infrared input images and closed-form natural image matting techniques
JP4777433B2 (en) Split video foreground
JP4610411B2 (en) Method for generating a stylized image of a scene containing objects
CN110998669B (en) Image processing apparatus and method
CA2771018C (en) Real-time image and video matting
US9042662B2 (en) Method and system for segmenting an image
US20160321817A1 (en) Networked capture and 3d display of localized, segmented images
US20040212725A1 (en) Stylized rendering using a multi-flash camera
CA2667066A1 (en) Apparatus and method for automatic real-time bi-layer segmentation using color and infrared images
JP2009074836A (en) Image processing device, image processing method, and image processing program
WO2009151755A2 (en) Video processing
Hernandez et al. Near laser-scan quality 3-D face reconstruction from a low-quality depth stream
Kaushik et al. ADAADepth: Adapting data augmentation and attention for self-supervised monocular depth estimation
Granados et al. Background estimation from non-time sequence images
Cho et al. Dynamic 3D human actor generation method using a time-of-flight depth camera
Benveniste et al. Nary coded structured light-based range scanners using color invariants
Abraham et al. A survey on video inpainting
KR102327304B1 (en) A method of improving the quality of 3D images acquired from RGB-depth camera
Yu et al. Plenoptic depth map in the case of occlusions
Wu et al. Robust real-time bi-layer video segmentation using infrared video
Kim et al. Compensated visual hull for defective segmentation and occlusion
Kim et al. Dehazing using Non-local Regularization with Iso-depth Neighbor-Fields.
Wu et al. Automatic bi-layer video segmentation based on sensor fusion
Wu et al. Bi-layer video segmentation with foreground and background infrared illumination
Haque et al. Gaussian-Hermite moment-based depth estimation from single still image for stereo vision

Legal Events

Date Code Title Description
FZDE Dead

Effective date: 20150527