Depth-map improvement
The invention relates to a method of converting an input depth map into an output depth map, the input depth map comprising input depth samples and the output depth map comprising output depth samples, each depth sample having a respective depth value. The invention further relates to a depth-map conversion unit for converting an input depth map into an output depth map. The invention further relates to an image processing apparatus comprising: receiving means for receiving a signal representing an input depth map; and a depth-map conversion unit for converting the input depth map into an output depth map. The invention further relates to a computer program product to be loaded by a computer arrangement, comprising instructions to convert an input depth map into an output depth map.
The ability to record accurate depth information is a key requirement for three-dimensional (3D) television systems and other systems that use 3D video, such as mobile phones and game devices. Two approaches can be taken for recording 3D content: - direct recording of a depth-related signal. For example an infrared (IR) camera may be used in a radar-like approach to record a depth image, also called a depth map; or - recording two or more video signals from different directions and calculating depth from disparity. This is the more traditional approach to acquire a depth image. A depth map is an array, typically a two-dimensional array, of values corresponding to depth, i.e. distance from a viewer. Both approaches have their advantages and drawbacks. For instance the first approach provides noisy depth measurements for materials that have a low intrinsic infrared reflectance. A recurring problem is that depth samples cannot accurately be obtained for dark hair and other dark materials. The result is that the depth map comprises values for which the reliability value is relatively low. In fact there might even be "missing data" points. Another reason why the depth map might comprise samples with a label "missing data" is that the corresponding scene points fall outside a predetermined depth measurement window. The second approach requires accurate disparity estimation, which is a hard problem. Both methods unavoidably cause errors in the depth maps. These errors may range from small regions that are noisy, i.e. a few pixels close together, to larger regions where entire objects have a large depth error. Noise and other errors in the depth map decrease the rendering quality for stereoscopic viewing. Other applications of depth measurements, such as compression, will also suffer.
It is an object of the invention to provide a method of the kind described in the opening paragraph which is relatively robust. This object of the invention is achieved in that the method comprises: testing for a first one of the input depth samples, having a relatively low reliability value, whether a new depth value should be assigned on basis of a difference between a first control value of a control signal representing human visible information and a second control value of the control signal, the first control value corresponding to the first one of the input depth samples and the second control value related to a second one of the input depth samples being located in a neighborhood of the first one of the input depth samples and having a relatively high reliability value; optionally establishing the new depth value on basis of a second depth value of the second one of the input depth samples; and assigning the new depth value to a first one of the output depth samples corresponding to the first one of the input depth samples. Typically the new depth value is established on basis of the second depth value of the second one of the depth samples if the difference is below a predetermined threshold. The basic assumption underlying the inventive method is that there cannot be a significant step in the depth map if there is not also a significant step in the co-registered visual image. When depth maps are used in combination with visual images, e.g. for stereoscopic (3D) television, then spatial coherence in co-registered visual images is exploited to fill in the depth samples with a relatively low reliability value. Even for "missing" data points in the depth map new depth values are determined on basis of depth samples with a relatively high reliability value. Typically the human visible information comprises one of luminance and color. 
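The basic test above can be illustrated with a minimal sketch. All names and the threshold value are illustrative assumptions, not part of the claimed method:

```python
def new_output_depth(d_low, d_high, c_low, c_high, threshold=10.0):
    """Illustrative sketch of the basic test.

    d_low/c_low:   depth and control (e.g. luminance) values of the
                   input depth sample with a relatively low reliability.
    d_high/c_high: depth and control values of a neighbouring sample
                   with a relatively high reliability.
    The threshold of 10.0 is an arbitrary illustrative choice.
    """
    if abs(c_low - c_high) < threshold:
        # No significant step in the human visible information: the
        # depth is assumed continuous, so the reliable depth is adopted.
        return d_high
    # A visual edge may coincide with a depth discontinuity: keep the
    # original (unreliable) depth value.
    return d_low
```

The same comparison can be applied per colour channel instead of per luminance value.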
In an embodiment of the method according to the invention, the first one of the input depth samples belongs to a first set comprising first input depth samples to which non-determined depth values have been assigned and the second one of the input depth samples belongs to a second set comprising second input depth samples to which respective determined depth values have been assigned, the first input depth samples and the second input depth samples being located around the first one of the input depth samples, and whereby the new depth value is established on basis of the second depth value of the second one of the depth samples if the ratio between a first number of input depth samples of the second set and a total number of input depth samples of the first set and the second set is above a further predetermined threshold. In other words, starting from the set of "valid" measurements, i.e. depth samples, a dilation (growing) is performed outwards until there is a significant step in the visual image. In addition, the dilation is stopped when the object boundary becomes too irregular, since objects in the real world, and their images taken by a camera, often have smooth boundaries. The test on boundary regularity is based on the ratio of the different input depth samples. If the first number of input depth samples is relatively high then there are relatively many input depth samples having a determined depth value, i.e. a depth value for which the reliability is relatively high. A non-determined depth value means that the corresponding reliability is relatively low. In that case, a particular default depth value might have been assigned, e.g. representing infinity. In an embodiment of the method according to the invention, the second control value corresponds to the second one of the input depth samples. An advantage of this embodiment is that the second control value can directly be fetched from the signal representing human visible information.
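The ratio test on boundary regularity described above can be sketched as follows; the window size and threshold value are illustrative assumptions:

```python
import numpy as np

def passes_ratio_test(valid, x, y, half=2, alpha=0.6):
    """Return True when the fraction of determined (reliable) depth
    samples in a (2*half+1)^2 window around (x, y) exceeds alpha."""
    window = valid[y - half:y + half + 1, x - half:x + half + 1]
    # Ratio between the number of determined samples and the total
    # number of samples of both sets combined.
    ratio = window.sum() / window.size
    return bool(ratio > alpha)
```

With 15 determined samples in a 5x5 window the ratio is 15/25 = 0.6; the candidate sample is only considered for filling when this ratio exceeds the threshold.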
Alternatively, the second control value is based on a third control value corresponding to the second one of the input depth samples and a fourth control value corresponding to a third one of the input depth samples, being located in the neighborhood of the first one of the input depth samples and having a further relatively high reliability value. An advantage of this embodiment is an improved robustness. In an embodiment of the method according to the invention, the new depth value is established by computing an average or median of the second depth value and further depth values belonging to further input depth samples being located in the neighborhood of the first one of the input depth samples and having relatively high reliability values. Several types of linear or non-linear processing operations are possible to compute an appropriate new depth value, but a preferred one is the average of multiple depth samples having a relatively high reliability value. An order-statistical operation, like a median, also results in appropriate interpolated output depth maps. Alternatively, surface fitting is applied to take
into account that there is a depth trend in the neighborhood of the first one of the input depth samples. It is a further object of the invention to provide a depth-map conversion unit of the kind described in the opening paragraph which is relatively robust. This object of the invention is achieved in that the depth-map conversion unit comprises: testing means for testing for a first one of the input depth samples, having a relatively low reliability value, whether a new depth value should be assigned on basis of a difference between a first control value of a control signal representing human visible information and a second control value of the control signal, the first control value corresponding to the first one of the input depth samples and the second control value related to a second one of the input depth samples being located in a neighborhood of the first one of the input depth samples and having a relatively high reliability value; establishing means for optionally establishing the new depth value on basis of a second depth value of the second one of the input depth samples; and assigning means for assigning the new depth value to a first one of the output depth samples corresponding to the first one of the input depth samples. It is a further object of the invention to provide an image processing apparatus of the kind described in the opening paragraph of which the depth-map conversion unit is relatively robust. 
This object of the invention is achieved in that the depth-map conversion unit comprises: testing means for testing for a first one of the input depth samples, having a relatively low reliability value, whether a new depth value should be assigned on basis of a difference between a first control value of a control signal representing human visible information and a second control value of the control signal, the first control value corresponding to the first one of the input depth samples and the second control value related to a second one of the input depth samples being located in a neighborhood of the first one of the input depth samples and having a relatively high reliability value; - establishing means for optionally establishing the new depth value on basis of a second depth value of the second one of the input depth samples; and assigning means for assigning the new depth value to a first one of the output depth samples corresponding to the first one of the input depth samples.
It is a further object of the invention to provide a computer program product of the kind described in the opening paragraph which is relatively robust. This object of the invention is achieved in that, the computer program product, after being loaded, provides said processing means with the capability to carry out: - testing for a first one of the input depth samples, having a relatively low reliability value, whether a new depth value should be assigned on basis of a difference between a first control value of a control signal representing human visible information and a second control value of the control signal, the first control value corresponding to the first one of the input depth samples and the second control value related to a second one of the input depth samples being located in a neighborhood of the first one of the input depth samples and having a relatively high reliability value; optionally establishing the new depth value on basis of a second depth value of the second one of the input depth samples; and assigning the new depth value to a first one of the output depth samples corresponding to the first one of the input depth samples. Modifications of the depth-map conversion unit and variations thereof may correspond to modifications and variations thereof of the image processing apparatus, the method and the computer program product, being described.
These and other aspects of the depth-map conversion unit, of the image processing apparatus, of the method and of the computer program product, according to the invention will become apparent from and will be elucidated with respect to the implementations and embodiments described hereinafter and with reference to the accompanying drawings, wherein: Fig. 1A-1C schematically show the working of an IR depth camera; Fig. 2A schematically shows a visual image; Fig. 2B schematically shows an input depth map corresponding to the visual image of Fig. 2A; Fig. 3A schematically shows a first part of a single iteration of the method according to the invention; Fig. 3B schematically shows a second part of a single iteration of the method according to the invention; Fig. 4 schematically shows a structuring set;
Fig. 5A schematically shows an input depth map; Fig. 5B schematically shows an output depth map corresponding to the input depth map of Fig. 5A; Fig. 6 schematically shows a depth-map conversion unit according to the invention; and Fig. 7 schematically shows an embodiment of the image processing apparatus according to the invention. Same reference numerals are used to denote similar parts throughout the figures.
Figs. 1A-C schematically show the working of an IR depth camera 108 which is based on time-of-flight. Fig. 1A schematically shows a light wall 100 moving from the camera 108 to the scene 106. Fig. 1B schematically shows an imprinted light wall 102 returning to the camera 108 and Fig. 1C schematically shows a truncated light wall 104 containing depth information from the scene 106. Depth is extracted from the reflected, deformed infrared light wall 102 by deploying a fast image shutter in front of a CCD chip and blocking the incoming light. The collected light at each of the pixels is inversely proportional to the depth of the specific pixel. Since reflecting objects may have any reflection coefficient, there exists a need to compensate for this effect. Hence, a normalized depth is calculated per pixel by simply dividing the front-portion pixel intensity by the corresponding portion of the total intensity. In this set-up, the reflected IR light passes through the same lenses as the visual light. Behind the lenses, the IR and visual light are separated and recorded with different sensors. There are no angular differences between the colour camera and the depth sensor, so each pixel of the colour camera is assigned a corresponding depth value. Camera zoom is accounted for in a very natural way as both IR and visual light pass through the same optical path. The specific operation of the IR depth camera results in two types of "missing" data points, i.e. samples of which the reliability value is relatively low: - Points where the infrared reflectance is low due to the specific properties of the reflecting object. Some materials have a low infrared reflectance. Also, smooth surfaces at large grazing angles are problematic since most of the illumination energy is scattered away from the sensor. Typically a threshold operation is done in hardware associated with the depth camera; and
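The per-pixel normalization step described above can be sketched as follows; the array names are assumptions, and a real sensor pipeline would additionally need clipping and noise handling:

```python
import numpy as np

def normalized_depth(front_intensity, total_intensity, eps=1e-6):
    """Cancel the unknown per-pixel reflection coefficient by dividing
    the front-portion intensity by the total collected intensity."""
    front = np.asarray(front_intensity, dtype=float)
    total = np.asarray(total_intensity, dtype=float)
    # eps guards against division by zero at pixels with no return.
    return front / np.maximum(total, eps)
```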
- Points where the depth falls outside a predetermined depth measurement window. The light collected at a pixel is inversely proportional to the depth of the specific pixel. Due to transmit-power limitations there is a fixed measurement window for which accurate depth can be recorded. Outside this range, observations are reported as "missing". The scene is usually arranged in such a way that either all objects fall inside the measurement window, or that objects that fall outside this window are always behind the measurement window. This last situation makes it possible to assign an arbitrarily large depth to these pixels. The output of the IR camera 108 consists of an RGB signal and a depth signal. Typically the information concerning the original IR reflection is lost. This makes the depth reconstruction problem difficult, because then the level of IR reflection can no longer be used. Fig. 2A schematically shows a visual image 200 representing a scene with three actors. Two of them are sitting at a desk and the third one is standing upright. Fig. 2B schematically shows an input depth map 202 corresponding to the visual image of Fig. 2A. This input depth map 202 comprises depth samples 210-214 for which no appropriate depth value has been determined, e.g. because of one of the causes described above in connection with the Figs. 1A-C. For instance a first inappropriate depth sample 210 corresponds to the dark hair 204 of one of the actors. A second inappropriate depth sample 212 corresponds to an office device 206 which is located outside the predetermined depth measurement window. A third inappropriate depth sample 214 corresponds to a surface 208 of an office chair which is substantially oriented in the transfer direction of the light wall of the IR camera. Fig. 3A schematically shows a first part of a single iteration of the method according to the invention and Fig. 3B schematically shows a second part of the single iteration of the method according to the invention.
These Figs. 3A and 3B show one discrete time step t, i.e. one iteration, in the method for restoration of an incomplete input depth map using a co-registered visual image R,G,B. Fig. 3A shows the extension of the set of "valid" data points X_t to a new set of "valid" data points X_{t+1}, using a conditional dilation, i.e. a growing that depends on the visual image. Fig. 3B shows a depth interpolation step in which a depth is estimated for the set of new data points X_{t+1} \ X_t. Using set notation, let X denote the set of points (x,y) for which a "valid" measurement is known and let X^c, the complement of X, be the set of "missing" data points. Let t denote a discrete time step, i.e. one iteration. A single dilation step grows the "valid" data point set X_t by using points from the "missing" data set X_t^c. A first condition for extension is that the ratio between a first number of "valid" data points in a structuring set S_{+(x,y)} and the total number of data points in the structuring set S_{+(x,y)} is above a predetermined threshold α, where S_{+(x,y)} is a structuring set S translated to image coordinates (x,y). Using set notation:

X_{t+1} = X_t ∪ { (x,y) ∈ X_t^c : |S_{+(x,y)} ∩ X_t| / |S_{+(x,y)}| > α }    (1)

Note that |S_{+(x,y)} ∩ X_t| denotes the number of "valid" data points within the translated structuring set. The use of structuring sets is well known from mathematical morphology; see e.g. "An Introduction to Morphological Image Processing" by E.R. Dougherty, SPIE Optical Engineering Press, 1992. Parameter α controls the shape evolution of the set X_t as a function of the iteration number t. It is the fractional area that "valid" data points occupy in S translated to position (x,y). Setting α large, i.e. close to one, prevents the "valid" set of points X_t from growing without bounds when the iteration number t → ∞. When starting from a perfectly linear shaped boundary, dilation is only possible for α < 0.5. Fig. 4 schematically shows an example of a structuring set S_{+(x,y)} of 5x5 pixels. The structuring set S_{+(x,y)} contains 15 "valid" data points and 10 "missing" data points. According to Equation (1), the "missing" data point at position (x,y) is added to the "valid" set only if the fraction of "valid" points that occupy the square window is higher than α. In a preferred case α = 3/5. The visual image denoted by R,G,B is used to further constrain the dilation at each iteration number t. Let f denote a function of R,G,B. Equation (1) is now extended by requiring that the functional value of f is less than a further predetermined threshold β:

X_{t+1} = X_t ∪ { (x,y) ∈ X_t^c : |S_{+(x,y)} ∩ X_t| / |S_{+(x,y)}| > α, f < β }    (2)
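A single conditional dilation step of this kind can be sketched in a few lines. The parameter values are illustrative, and the colour constraint is computed here as the minimum absolute colour difference to a "valid" neighbour, i.e. one possible choice of the function f:

```python
import numpy as np

def conditional_dilation_step(valid, rgb, alpha=0.6, beta=60, half=2):
    """One conditional dilation iteration: grow the set of valid depth
    samples where (a) enough valid neighbours exist and (b) the colour
    difference to a valid neighbour stays below beta."""
    h, w = valid.shape
    new_valid = valid.copy()
    for y in range(half, h - half):
        for x in range(half, w - half):
            if valid[y, x]:
                continue                      # already in the valid set
            win = valid[y-half:y+half+1, x-half:x+half+1]
            # Shape condition: fraction of valid points in the
            # structuring set must exceed alpha.
            if win.sum() / win.size <= alpha:
                continue
            # Colour condition: minimum absolute RGB difference to a
            # valid neighbour must stay below beta.
            patch = rgb[y-half:y+half+1, x-half:x+half+1].astype(float)
            diffs = np.abs(patch - rgb[y, x].astype(float)).sum(axis=2)
            if diffs[win].min() < beta:
                new_valid[y, x] = True
    return new_valid
```

Iterating this step, combined with depth interpolation for the newly added points, yields the restoration method described here.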
A possible choice for f is a linear combination of the gradient magnitudes of R,G,B at location (x,y). This is logical, since dilation across a luminance or color edge should be prevented: such edges may correspond with depth discontinuities. However, accurate evaluation of the gradient magnitudes requires that the visual image is pre-filtered
with derivatives of a Gaussian using large kernel sizes to avoid effects of noise. This is computationally expensive. A preferred alternative functional value of f is specified in Equation (3). The computation of that alternative functional value of f only requires taking the absolute value of differences and finding the minimum:

f = min_{(u,v) ∈ S_{+(x,y)} ∩ X_t} ( |R(x,y) − R(u,v)| + |G(x,y) − G(u,v)| + |B(x,y) − B(u,v)| )    (3)

It will be clear that there are alternative functional values of f, e.g. based on an order-statistical operation on control values followed by a computation of a difference between the output of said operation and the control value under consideration. After each dilation step a new depth value has to be interpolated at each "missing" data point (x,y) for which it has been determined that it should be added to the other set. An appropriate way of computing the new depth value is based on calculating the mean depth value of a part of the data inside the structuring set:

d(x,y) = (1/N) Σ_{(u,v)} d(u,v)    (4)
where (u,v) ∈ S_{+(x,y)} ∩ X_t, which is the subset of "valid" data points in the structuring set, and N = |S_{+(x,y)} ∩ X_t| is the total number of "valid" data points in the structuring set. Fig. 5A schematically shows an input depth map 202 and Fig. 5B schematically shows the corresponding output depth map 502. It can be seen that measurement errors for the chair 214, the table 504 and a part 210 of one of the persons have been removed and interpolated. Note that the valid measurements in the input depth map are not smoothed. Note also that restoration of the depth map is at the expense of smoother object boundaries. This trade-off between improvement based on colour and deformation of object boundaries can be controlled by varying the predetermined thresholds and other parameters: - the dimension of the structuring set S_{+(x,y)}. A square window of 11x11 pixels was used for the output depth map 502 as depicted in Fig. 5B; - the predetermined threshold α ∈ [0,1] controls the amount of growing based on shape. A predetermined threshold α = 0.6 was used for the output depth map 502 as depicted in Fig. 5B;
- the further predetermined threshold β ∈ [0,255] controls the amount of growing based on colour distance. A further predetermined threshold β = 60 was used for the output depth map 502 as depicted in Fig. 5B; and - the total number of iterations. A total number of 30 iterations was used for the output depth map 502 as depicted in Fig. 5B. Fig. 6 schematically shows a depth-map conversion unit 600 according to the invention. The depth-map conversion unit 600 is arranged to convert an input depth map 202, which is provided at the input connector 608, into an output depth map 502 which it provides at the output connector 614. The input depth map 202 comprises input depth samples and the output depth map 502 comprises output depth samples, each depth sample having a respective depth value. The depth-map conversion unit 600 comprises: a testing unit 602 for testing for a first one of the input depth samples, having a relatively low reliability value, whether a new depth value should be assigned on basis of a difference between a first control value of a control signal representing human visible information and a second control value of the control signal, the first control value corresponding to the first one of the input depth samples and the second control value related to a second one of the input depth samples being located in a neighborhood of the first one of the input depth samples and having a relatively high reliability value; an establishing unit 604 for optionally establishing the new depth value on basis of a second depth value of the second one of the input depth samples; and an assigning unit 606 for assigning the new depth value to a first one of the output depth samples corresponding to the first one of the input depth samples. The working of the depth-map conversion unit 600 is described in connection with the Figs. 3A-3B and 4.
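Putting the pieces together, the full iterative restoration with the parameter values reported above (an 11x11 window, α = 0.6, β = 60, 30 iterations as defaults) might look like the following sketch. Function and array names are assumptions; `valid` marks the reliable depth samples and `rgb` is the co-registered visual image:

```python
import numpy as np

def restore_depth_map(depth, valid, rgb, alpha=0.6, beta=60,
                      half=5, iterations=30):
    """Sketch of the full restoration loop: conditional dilation of the
    valid set followed by mean-depth interpolation of the added points."""
    depth = depth.astype(float).copy()
    valid = valid.copy()
    h, w = valid.shape
    for _ in range(iterations):
        added = []
        for y in range(half, h - half):
            for x in range(half, w - half):
                if valid[y, x]:
                    continue
                v = valid[y-half:y+half+1, x-half:x+half+1]
                if v.sum() / v.size <= alpha:         # shape condition
                    continue
                p = rgb[y-half:y+half+1, x-half:x+half+1].astype(float)
                f = np.abs(p - rgb[y, x].astype(float)).sum(axis=2)
                if f[v].min() >= beta:                # colour condition
                    continue
                d = depth[y-half:y+half+1, x-half:x+half+1]
                added.append((y, x, d[v].mean()))     # mean interpolation
        if not added:
            break                                     # nothing left to grow
        for y, x, value in added:                     # X_t -> X_{t+1}
            valid[y, x] = True
            depth[y, x] = value
    return depth, valid
```

Note that the valid input measurements are never modified, matching the observation that they are not smoothed.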
The depth-map conversion unit 600 comprises a control interface 608 for providing the depth-map conversion unit 600 with a video image and optionally comprises a further control interface 612 for providing the depth-map conversion unit 600 with reliability values corresponding to the respective input depth values. Alternatively, the reliability values are integrated in the input depth map, e.g. by means of the usage of default depth values which do not correspond to physically determined values. The reliability values optionally correspond to an IR signal from the depth camera which has not been processed by means of a threshold operation. The testing unit 602, the establishing unit 604 and the assigning unit 606 may be implemented using one processor. Normally, these functions are performed under control of a software program product. During execution, the software program product is loaded into a memory, like a RAM, and executed from there. The program may be loaded from a background memory, like a ROM, hard disk, or magnetic and/or optical storage, or may be loaded via a network like the Internet. Optionally an application specific integrated circuit provides the disclosed functionality. Fig. 7 schematically shows an embodiment of the image processing apparatus
700 according to the invention, comprising: - receiving means 702 for receiving a signal representing visual images and co-registered input depth maps; - the depth-map conversion unit 600 as described in connection with Fig. 6; and - optionally, a rendering device 706 for rendering 3D images on basis of the received visual images and the output depth map of the depth-map conversion unit 600. The image processing apparatus 700 might be a depth camera. Alternatively, the image processing apparatus 700 is a display apparatus or storage apparatus. In that case the signal may be a broadcast signal received via an antenna or cable but may also be a signal from a storage device like a VCR (Video Cassette Recorder) or Digital Versatile Disk (DVD). The signal is provided at the input connector 710. The image processing apparatus 700 optionally comprises a display device, not depicted, for displaying the output images of the rendering device 706. In that case the image processing apparatus 700 is e.g. a TV. Alternatively the image processing apparatus 700 does not comprise the optional display device but provides the output images to an apparatus that does comprise a display device. Then the image processing apparatus 700 might be e.g. a set top box, a satellite tuner, a VCR player, a DVD player or recorder. Optionally the image processing apparatus 700 comprises storage means, like a hard disk or means for storage on removable media, e.g. optical disks. The image processing apparatus 700 might also be a system being applied by a film studio or broadcaster. It should be noted that the above-mentioned embodiments illustrate rather than limit the invention and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.
The word 'comprising' does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera, does not indicate any ordering. These words are to be interpreted as names.