WO2013124312A1 - Object-aligned filtering method and apparatus - Google Patents

Object-aligned filtering method and apparatus

Info

Publication number
WO2013124312A1
Authority
WO
WIPO (PCT)
Prior art keywords
filtering
feature
subset
values
image
Prior art date
Application number
PCT/EP2013/053370
Other languages
French (fr)
Inventor
Christiaan Varekamp
Patrick Luc Els VANDEWELLE
Original Assignee
Tp Vision Holding B.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tp Vision Holding B.V. filed Critical Tp Vision Holding B.V.
Publication of WO2013124312A1 publication Critical patent/WO2013124312A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/128Adjusting depth or disparity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20004Adaptive image processing
    • G06T2207/20012Locally adaptive
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20028Bilateral filtering

Definitions

  • the invention relates to an object-aligned filtering method for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising obtaining multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, and a first filtering of the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by the multiple feature maps.
  • In stereoscopic displays, the eyes of the viewer are generally assisted by e.g. glasses or foils positioned between the viewer's eyes and the display to direct, in a time-multiplexed or simultaneous manner (e.g. through spectral separation), a left eye image to the left eye of the viewer and a right eye image to the right eye of the viewer.
  • autostereoscopic displays, which do not require viewers to wear such glasses, have likewise received considerable attention.
  • Autostereoscopic displays, often multi-view displays, generally allow the visualization of multiple, e.g. 5-9 or more, images or views which are multiplexed in a viewing cone directed at the viewers. By observing separate views from a cone with the left and right eye respectively, a stereoscopic effect may be obtained by the unaided eye.
  • Depth information may be obtained from monocular images, or monocular image sequences. A variety of applications of such algorithms may be envisaged ranging from fully automated conversion, to user assisted 2D to 3D conversion for high quality content. In case of user assisted 2D to 3D conversion computer assisted depth map generation may represent a substantial time saving. Depth maps generated from monocular images typically suffer from sparsity, i.e., depth information is not available for all image points; moreover depth information may be noisy.
  • Depth information may also be acquired through analysis of e.g. disparity of stereo images.
  • disparity analysis of two images requires that any particular location on a first image is located on the second image of a stereoscopic pair of images. For many image points this may be difficult, e.g., for points located in relatively featureless parts of the image, such as a wall of a uniform color. This may result in sparsity, i.e., the determination of depth information may fail for some points, and in noise, e.g., if a point in the second image is located slightly off.
  • WO2010/052632A1 discloses a method of generating a depth map for a monocular image, using a global depth profile in combination with color and luminance values of the image.
  • Two recent object aligned filtering methods are bilateral filtering and guided filtering. Guided filtering is described in a paper 'Guided image filtering' by He, Kaiming and Sun, Jian and Tang, Xiaoou; published in the Proceedings of the 11th European conference on Computer vision: Part I, ECCV'10, 2010, incorporated herein by reference.
  • An efficient bilateral filtering method is described in 'Real-time edge-aware image processing with the bilateral grid' by Chen, Jiawen and Paris, Sylvain and Durand, Fredo, published in ACM SIGGRAPH 2007 papers.
  • if a filtering is guided by additional data (feature maps) that relates to the objects, the filtering can be adapted such that it is unlikely to cross object boundaries as indicated by the additional data.
  • a cross-bilateral filter applied to smoothen a disparity map after an initial estimation often suffers from over-smoothing, causing leakage of foreground depth into the background and vice versa.
  • One approach to this problem could be to reduce the filter size. Unfortunately, this will also reduce the amount of information available within the filter support, and therefore this will create problems in other parts of the image.
  • Another option is to enlarge the feature space, and add features to create a better separation between objects. This can be done in a bilateral grid by adding dimensions to the bilateral grid histogram. However, the computational complexity of the algorithm rapidly increases when adding features; this is why often only the image luminance is used.
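  • to make this growth concrete, the sketch below (Python; the bin count of 32 per dimension is a hypothetical choice) compares the size of one joint grid over all features with the combined size of one two-dimensional grid per feature pair, the route taken by the method described below:

```python
bins = 32  # hypothetical number of bins per feature dimension

for n_features in range(1, 7):
    joint = bins ** n_features  # one grid indexed by all features: exponential growth
    n_pairs = n_features * (n_features - 1) // 2
    pairwise = n_pairs * bins ** 2  # one small 2-D grid per feature pair: polynomial growth
    print(f"{n_features} features: joint bins = {joint:,}, pairwise bins = {pairwise:,}")
```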
  • the invention provides an object-aligned filtering method for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising obtaining multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, a first filtering of the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps, a second filtering of the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset, and deriving a single filtered object value for the selected pixel from at least the first and second filtered object value.
  • for at least part of the image pixels, object values are available representing some property of an object corresponding to that image pixel.
  • An object value relates to a spatial point of the image depicting the object.
  • the object values represent depth data, however the filtering method is applicable to other types of data for which object aligned filtering is needed, for example, motion data.
  • a feature map assigns a value to most, very preferably all, of the pixels in the image.
  • a feature map may be obtained by processing image pixel values. For example, a feature map may be obtained for each of three components of an RGB color indication of the picture. A feature map may however also be obtained independent from the image. For example, a coordinate in a coordinate system for the image may be used, as it gives information on proximity.
  • in a first filtering, a first subset of the feature maps is taken into account
  • in a second filtering, a second subset of the multiple feature maps is taken into account.
  • the two filterings are combined to obtain a single filtering. For example, one could average the two filterings, or use other more advanced methods as explained below.
  • At least one, but preferably both, of the first and second filtering are guided by a proper subset of the multiple feature maps, i.e., having a cardinality lower than that of the full set of the multiple feature maps.
  • the method allows an object-aligned filtering that can take into account a larger amount of feature data in the same amount of time.
  • the method comprises multiple filtering steps, each filtering step being guided by a corresponding different subset of the multiple feature maps, the single filtered object value being derived from the multiple filtering steps.
  • the amount of work increases linearly with the number of filtering steps, not exponentially with the number of feature maps.
  • the number of filtering steps may be 2, 2 or more, 3, 3 or more, 4 or more, 6 or more, 10 or more etc.
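  • a minimal sketch of this multi-subset structure is given below (Python; the helper name `grid_filter` is a hypothetical placeholder for any guided filtering that returns per-pixel filtered values and confidences, anticipating the confidence-weighted combination described further on):

```python
import numpy as np

def filter_with_subsets(object_values, feature_maps, subsets, grid_filter):
    """Run one guided filtering per feature-map subset and combine the results.

    object_values : (H, W) array of values to filter (e.g. sparse depth)
    feature_maps  : list of (H, W) feature arrays
    subsets       : list of index tuples, e.g. [(0, 1), (1, 2), (0, 2)]
    grid_filter   : callable (object_values, guides) -> (filtered, confidence)
    """
    num = np.zeros(object_values.shape, dtype=np.float64)
    den = np.zeros(object_values.shape, dtype=np.float64)
    for subset in subsets:  # cost grows linearly with the number of subsets
        guides = [feature_maps[i] for i in subset]
        filtered, confidence = grid_filter(object_values, guides)
        num += confidence * filtered  # confidence-weighted combination
        den += confidence
    return num / np.maximum(den, 1e-9)  # single filtered value per pixel
```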
  • the first and second filtering use the same filtering method, but this is not necessary.
  • the object values may be estimated data associated with image locations of the image.
  • the estimated data may be estimated from a monocular image, or from disparity data.
  • the object values may be sparse, i.e., only available for part of the image. The sparsity may even be severe, i.e., an object value may be available for less than 1% of the number of pixels.
  • the object values may be dense but suffer from noise.
  • the object values are typically numeric values, typically real numbers.
  • the object values may be part of an object value vector.
  • the object values represent depth data, either directly, or indirectly, say in the form of disparity data.
  • the invention also applies to other types of object values than depth data, say motion data.
  • the invention may be used to convert motion estimates from sparse to dense.
  • the invention may also be used to convert block-based motion estimates to pixel-based motion estimates.
  • the latter may be implemented by having the same block-based motion estimate for all pixels in the block of pixels; alternatively, a selected point in the block, say a center point may correspond with the motion estimate; etc.
  • the block-based motion estimates may be represented as a sub-sampled motion vector field.
  • One may also convert sparse annotations of some variable, or segment label maps, to a dense map of the same variable.
  • the method may be applied for a single selected point of the image but preferably, the method is applied for all points of the image to obtain dense object values for an image.
  • the method may be applied to a sparse depth map (some image pixels do not have a depth estimate) to produce a dense depth map (having a depth value for each image pixel).
  • the filtering may be directly applied to the object value.
  • one may convert sparse data to dense data before filtering by assigning a fixed value, say 0, to pixels having no object value. This way of upscaling sparse to dense data may give acceptable results if followed by an edge aligned filtering step.
  • a feature map assigns values to pixel points.
  • At least one feature map in the first subset and/or the second subset is derived from image values of the multiple pixels.
  • the image values of a pixel typically represent its color and/or luminosity.
  • a feature map may comprise any one of the components of RGB data for a pixel, a value indicating texture data, a value indicating luminosity. Texture and luminosity values may be derived from the image.
  • Feature maps that contain this type of information are related to objects in the image because they are derived from an image depicting them. Filtering guided by such a feature map relies on the heuristic that objects of a similar color/texture etc. are likely to be at a similar distance, i.e., have similar object values.
  • feature maps may also be derived independent from image values of the multiple pixels.
  • a feature map may assign a coordinate value to a pixel, say the x or y coordinate.
  • Different coordinate systems may be used, say polar coordinates, coordinates having different axes, etc.
  • Using such a feature map relies on the heuristic that objects that have similar coordinates, i.e., are close to each other, or lie in the same direction, are likely to be at a similar distance, i.e., have similar object values.
  • the multiple feature values assigned to a pixel by the multiple feature maps may be collected in a feature vector.
  • One may then represent the filtering step as filtering guided by a feature vector reduced in dimension, e.g., by projection.
  • Some of the feature maps may give reliable information when regarded on their own. For example, if the image contains a single red object at a particular distance, a filtering guided only by the red component of RGB pixel values will likely give good filtering results, at least around the red object. However, two feature maps in combination may reveal more of the object boundaries than they can in isolation. In an embodiment, the first subset and the second subset share at least one feature map. Filtering guided by the same feature maps in different combinations makes it less likely that information is thrown away by not considering all feature maps simultaneously.
  • the first subset and/or the second subset contain exactly two feature maps.
  • it was found that if an object boundary could be determined from the full collection of feature maps, i.e., by computationally expensive methods, then often there was some combination of two feature maps from which the same boundary could also be derived.
  • the number of multiple feature maps is at least 3 since this will allow at least 2 filtering steps each guided by two feature maps (having an overlap of one feature map).
  • any one of the first and/or second filtering may use any object aligned filtering method for filtering object values guided by additional information such as feature maps.
  • one especially convenient filtering method is bilateral filtering. Bilateral filtering using additional data is sometimes referred to as 'joint' or 'cross' bilateral filtering.
  • At least one and preferably both, of the first and second filtering is edge aligned. This may be accomplished by including image data, say luminance or chroma values, as a feature map in the feature maps that guide the filtering.
  • a bilateral grid is a multi-dimensional space; typically, each dimension is indexed with values of a particular feature map.
  • in a classic bilateral grid, each dimension represents one of the x-coordinate, the y-coordinate and a color intensity.
  • An object value may be mapped to a grid by looking up the feature values of the feature maps used for that filtering. The combination of the feature values indicates the location in the grid where the object value is to be processed.
  • grids used in bilateral filtering are at least 3 dimensional, in order to represent some combination of proximity (x and y coordinates) and color (intensity).
  • this is not needed since multiple filtering steps are available.
  • the first and/or second filtering step may be guided by 2 feature maps. Even using 1 feature map for one or more of the filterings is feasible.
  • the grid is divided in multiple bins, a bin representing multiple combinations of feature values.
  • These types of grids are also referred to as histograms. Binning increases the number of object values that are mapped to the same location in the grid. This allows better statistical analysis of the data mapped to a particular grid location. Without binning, only data values having the exact same combination of feature values in the chosen subset of feature maps would be mapped to the same location.
  • the low variance in object value confirms the heuristic that objects sharing the particular features have similar object values (e.g. are at the same depth). Thus if a low variance is found together with a number of mapped object values above a threshold, then it is reasonable to give high confidence to filtering done on the basis of it.
  • Object values mapped to the same location in the grid may be processed in various ways. For example, one may average them and use the averaged object values as the basis for filtering over the grid. In a practical implementation, running counts are kept of object values during the mapping process. For example, for object values corresponding to pixels of the image, one may keep a count of the number of mapped object values per bin and the sum of the mapped object values per bin. After the mapping, the average may be derived by dividing the sum by the count (if the count is 1 or more).
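  • a minimal sketch of this bookkeeping is shown below (Python; the function name is hypothetical, two feature maps are assumed, and the integer bin indices per pixel are assumed to be precomputed):

```python
import numpy as np

def splat_to_grid(values, bin_a, bin_b, n_bins):
    """Map object values into a 2-D feature grid, keeping a running
    count and a running sum per bin, then derive the per-bin average."""
    count = np.zeros((n_bins, n_bins))
    total = np.zeros((n_bins, n_bins))
    np.add.at(count, (bin_a, bin_b), 1.0)     # number of mapped values per bin
    np.add.at(total, (bin_a, bin_b), values)  # running sum of values per bin
    mean = total / np.maximum(count, 1.0)     # average; empty bins remain 0
    return count, total, mean
```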
  • a filter may then be applied over the grid. This filtering need not be edge aligned. For example, a Gaussian filter may be applied to the grid.
  • a first filtered object value for a selected pixel may be obtained by: looking up the feature values that the first subset of feature maps assigns to the selected pixel, binning them if needed, and reading out the filtered grid at the location indicated by the (binned) feature values.
  • a single filtered object value may be derived from the multiple filtered object values obtained from multiple filtering steps.
  • the multiple filtered object values, say the first and second filtered object values, may be averaged.
  • the averaging may be weighted with a confidence value indicating the confidence of the filtered object values. Note that a confidence value need not be the same for all bins. In fact, it is likely that some feature combination works well for some objects, but other combinations may work better for other objects.
  • the method comprises determining a first confidence indicative of the object alignment of the first filtering and a second confidence indicative of the object alignment of the second filtering, and wherein the deriving of the single filtered object value further depends on the first and second confidence.
  • a confidence, say the first and second confidence, is expressed as a confidence value.
  • the object alignment may be obtained by having at least one of the first and second filtering result in object alignment through the guidance by the respective feature maps, at least one of which is related to the objects. By basing the deriving of the single filtered object value on the first and second confidence, the object alignment may also be obtained or improved. Different combinations of feature maps guide the filtering in different ways. Some of these filterings will be better than others. By using, from the multiple filterings, those filterings (or parts thereof) that are best aligned with the objects, the object alignment of the end result is improved.
  • Determining if a filtering has a good object alignment may be done in various ways.
  • a filtering method that selects a set of object values and fits it to a model may determine a measure, e.g. value, indicating how close the fit is. If the model is a constant value, then the variance of the selected set of object values is a measure for object alignment; a low variance indicating that a constant value model gives a good fit. For example, if the filtering uses binning, such as bilateral filtering, then a bin is likely to have good object alignment if the object values are close together.
  • the method comprising a filtering of the object values to obtain a filtered object value for a selected pixel of the multiple pixels, the filtering being guided by a subset of the multiple feature maps, wherein the filtering comprises: selecting a set of object values dependent upon the subset of the multiple feature maps, fitting the set of object values to a model, determining a measure indicating how close the fit is, and using the fitted model to obtain a filtered object value and using the measure indicating how close the fit is to obtain a confidence indicative of the object alignment of the filtering, at least for the selected pixel.
  • Obtaining or improving object alignment through an after-filtering selection of the best filterings gives better results if the object values themselves are correlated with the object; for example, if two object values corresponding to the same object are closer together than two object values corresponding to different objects (at least on average).
  • Object properties such as depth values, motion vectors or components thereof, segment label maps, etc, all satisfy this criterion.
  • the method may even be applied to object values obtained from the image, i.e., chroma values, e.g., to obtain object-based denoising.
  • in an embodiment, the grid is initially non-empty, i.e., not reset to 0, but filled with the results for a previous image, possibly multiplied by a factor, say ½.
  • a further aspect of the invention concerns an object-aligned filtering apparatus for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the apparatus comprising a feature map obtaining module configured to obtain multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, a first filtering module configured to filter the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps, a second filtering module configured to filter the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset, and a combination module configured to derive a single filtered object value for the selected pixel from at least the first and second filtered object value.
  • the object-aligned filtering apparatus is an electronic device.
  • the object-aligned filtering apparatus may be comprised in a display, e.g. a television, especially a 3D display, more especially, an autostereoscopic display.
  • the filtered object values may be a filtered depth map.
  • a filtered depth map may be used to compute intermediate display images.
  • the object-aligned filtering apparatus may be comprised in a mobile electronic device, especially a gaming device or mobile phone.
  • the object-aligned filtering apparatus may be comprised in a set-top box, computer, etc.
  • mapping object values to a grid may be done using one or more integrated circuits.
  • the mapping is done in parallel, mapping multiple pixels at the same time.
  • a method according to the invention may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both.
  • Executable code for a method according to the invention may be stored on a computer program product.
  • Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, etc.
  • the computer program product comprises non-transitory program code means stored on a computer readable medium for performing a method according to the invention when said program product is executed on a computer.
  • the computer program comprises computer program code means adapted to perform all the steps of a method according to the invention when the computer program is run on a computer.
  • the computer program is embodied on a computer readable medium.
  • Figure 1 is a flow chart illustrating an object-aligned filtering method according to the invention
  • Figure 2 is a block diagram illustrating an object-aligned filtering system according to the invention
  • Figure 3 is a flow chart illustrating an object-aligned filtering method for use in the invention
  • Figure 4a is an exemplary picture and disparity data
  • Figure 4b shows three feature grids with different subsets of feature maps for the picture of figure 4a
  • Figure 5 is a diagram illustrating an embodiment of feature grids
  • Figure 6 is a diagram illustrating a further embodiment of feature grids
  • Figures 7a and 7b are gray-scale representations of a stereoscopic image pair
  • Figure 7c is a gray-scale representation of a depth map obtained without planar fitting
  • Figure 7d is a gray-scale representation of a depth map obtained with planar fitting
  • Fig. 2 illustrates in the form of a block diagram a system 200 according to the invention.
  • the system may be used for the method of Fig. 1.
  • Fig. 2 shows an image source 220.
  • the image source could be a content device, e.g., a disc player, an internet connection etc.
  • System 200 comprises an image based feature map module 222 for deriving feature maps from the image.
  • System 200 comprises a non-image based feature map source 224.
  • the non-image based feature maps may be retrieved from disc or computed. Below, feature maps are described more fully.
  • System 200 comprises a selector 230 for selecting subsets of the feature maps obtained by feature map module 222 and feature map source 224.
  • Feature map selector 230 selects at least two different subsets.
  • System 200 comprises an object aligned filtering module 240 for filtering object values guided by feature maps selected by feature map selector 230.
  • Filtering module 240 will be used at least twice for the same image and object values.
  • Filtering module 240 may be implemented twice in filtering system 200 to allow parallelization.
  • Filtering module 240 receives object values from some source 210.
  • object values source 210 comprises a disparity analyzer.
  • filtering system 200 comprises a confidence module 250. Confidence module 250 computes a confidence value for the filtering estimating how well the filtering is object aligned.
  • System 200 comprises a filtering combiner 260 for combining the at least two filterings of filtering module 240.
  • the filtered object values may be used for a variety of purposes indicated by sink 270.
  • for example, filtering combiner 260 may use the filtered object values to calculate intermediate images for an autostereoscopic display.
  • Depth data may be used to compute a second image.
  • Feature map module 222, feature map selector 230, filtering module 240, (optionally) confidence module 250 and filtering combiner 260 may be combined to form an object aligned filtering apparatus.
  • Fig. 1 illustrates in the form of a flow chart a method according to the invention. It is noted that Fig. 1 both includes and omits some optional steps.
  • a set of object values is obtained 110, e.g. from source 210.
  • the object values may, e.g., represent depth data for pixels in an image.
  • the depth data may have been obtained by an analysis of a monocular image, or from a comparison of a stereoscopic image pair. Since obtaining depth data is hard, it is expected that the data is sparse or noisy, and probably both.
  • the method aims to filter the data so that the sparse available data is extended over the image and noise is reduced. It is disadvantageous however if the filtering would extend beyond the edge of objects. Especially for 3D applications and filtering of depth maps, this leads to an unsatisfactory viewing experience. To this end feature maps are obtained 120.
  • the feature maps may be obtained from an analysis of the image to which the depth data relates 122, e.g. by module 222. Examples of this include color components, e.g., of an RGB encoding of the pixel color value, or components of another color space, e.g. luminosity values, U and V values, etc.
  • Obtaining a feature map from the image may involve computations. For example one or more values may be obtained for pixels indicating texture. A texture value for a pixel may be obtained by computing the variance in pixel values in a region around the pixel.
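  • a minimal sketch of such a texture feature map is shown below (Python with SciPy; the function name and the 7x7 window are arbitrary choices), using the identity Var = E[x²] − E[x]² over a local window:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(luma, size=7):
    """Texture feature map: the variance of pixel values in a
    size x size region around each pixel, via box filters."""
    x = luma.astype(np.float64)
    mean = uniform_filter(x, size)         # local mean E[x]
    mean_sq = uniform_filter(x * x, size)  # local mean of squares E[x^2]
    return np.maximum(mean_sq - mean * mean, 0.0)  # clamp tiny negatives from rounding
```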
  • Feature maps may also be obtained 124 without using the image itself, e.g., from a feature map source 224.
  • Example features are a coordinate in a coordinate system; Cartesian coordinates possibly having skewed axes, polar coordinates etc.
  • a further example includes a global depth map obtained from a scene type. These feature maps may be obtained from a source, e.g., a memory, or may be computed on the fly.
  • a feature map assigns a feature value to (preferably each) pixel of the image.
  • the number of feature maps may be chosen larger with the invention than with known object aligned systems.
  • feature maps may even be derived from other image modalities for the same scene.
  • a feature map may comprise depth information for the imaged scene, e.g. obtained from a rangefinder.
  • At least two selections 130 are made, e.g., by a feature map selector 230; a first and second subset of the multiple feature maps is selected.
  • the selection will typically be pre-determined but this is not necessary, for example, the feature map selection may be dependent upon the image type, e.g. a scene type obtained from some other source, e.g., color information. For example, for a black and white image, using color features is not likely to add useful filtering results.
  • steps 122 and 124 produce multiple feature maps.
  • an object aligned filtering is performed, e.g., by an object aligned filtering module 240.
  • a first filtering 142 is done guided by a first subset of the multiple feature maps.
  • a second filtering 144 is done guided by a second subset of the multiple feature maps.
  • At least one feature map of the first and at least one feature map of the second subset is image based.
  • for example, the first filtering could be guided by image-based feature maps, say color components,
  • while the second filtering could be guided by only non-image based features, say coordinates.
  • the latter would filter well in regions where a large number of object values are available. To avoid leaking problems, the latter could have relatively small bins, and high confidence requirements (e.g., confidence values are multiplied by a factor smaller than 1).
  • the object-based filtering method is improved if the multiple feature maps comprise feature maps that correlate with the objects, e.g., have a correlation above a correlation threshold.
  • the first and second filtering may be done by module 240 and uses an object aligned filtering method for filtering a set of object values that allows guidance with feature maps.
  • the filtering could be based on bilateral filtering or on guided filtering.
  • a local, typically linear, model is constructed from a guidance image (color image), and the data to be filtered (the object values, e.g. the depth image).
  • Such a guided filter could be adapted to construct a model from a subset of feature maps to the object values.
  • the linear model is then applied to the object values to obtain a local filtering.
  • the bilateral filtering preferably uses a grid, sometimes referred to as a bilateral grid. We will sometimes refer to the grid as a feature grid.
  • the feature grid is also referred to as a histogram. Below, with Fig. 3, we will exemplify object aligned filtering using a grid based filter.
  • a confidence could be pre-determined, for example, based on how well a particular subset performed in the past; preferably, the confidence value is dependent upon the filtering.
  • a confidence value may be obtained for a whole image, but preferably, a confidence is obtained on a per pixel or per bin basis. The latter allows that different combinations of filtering are made for different parts of the image. For example, the subset containing a feature map based on the y coordinate and one based on the blue color component will likely do well to filter object values in the 'sky' object. A blue color combined with a high y value is likely to be from the same object. Thus a high confidence will be attached to this part of the filtering.
  • the filtering based on this subset may perform poorly to discriminate a red object against a green background.
  • a subset based on a red component feature map and a green component feature map will likely do much better.
  • Confidence may be based on variance and number of object values. If a large number of object values are mapped and have a low variance, the confidence is likely high. If the combination of filtering is weighted by confidence, the filtering around the red object will favor the red/green filtering and the filtering in the sky the blue/y filtering.
  • the multiple filterings are combined 160 to a single filtering, e.g., by filtering combiner 260.
  • the filtering combiner may simply take the average of the filterings. Many edges will be detected by a number of feature combinations, even though some individual filterings may miss an object. The average is thus much more robust than using a single filtering would be.
  • a more refined approach is to weigh the average with confidence values, high confidence filtering receiving more weight. The weighting could be different for different parts of the image.
  • the final filtering may be used in a filtered object values sink 270, e.g., for stereoscopic displays. It is noted that the method may be applied to any data that needs object aligned (edge aligned) filtering.
  • Fig. 3 illustrates in a block diagram an object-aligned filtering method based on bilateral grid-based filtering that may be applied in steps 142 and 144 and/or in filtering module 240.
  • the set of object values corresponding to an image depicting said objects is obtained, e.g. in step 110, e.g., from source 210; an object value of the set of object values corresponds to a pixel of the multiple pixels of the image.
  • a subset of feature maps is available for the image.
  • An object value is selected 310 along with the pixel that the object value corresponds to.
  • for each feature map in the subset, the feature value that the feature map assigns to the corresponding pixel is looked up.
  • This set is sometimes referred to as a feature vector.
  • the set of all feature maps may be interpreted as a single map assigning a feature vector to each pixel; a subset of the feature maps may then mathematically be described as a projection.
  • the range of the feature values, i.e., the range a feature can attain, is divided into multiple smaller ranges (bins).
  • the feature value may then be replaced, e.g., by a center point of the range in which it lies, or a serial number of the range. Binning reduces the accuracy of a feature but decreases relative sparsity of object values.
  • the object value is mapped to a place 326 in a feature grid indicated by the (binned) feature values.
  • the feature grid has the same number of dimensions as there are feature maps in the subset of feature maps. Each feature value indexes one dimension of the grid.
  • 'Placing' an object value in the grid may be done simply by adding the object value on a list kept at that location of the grid.
  • in an embodiment, the object value is placed by modifying running counters kept at that location of the grid. For example, one may keep running counters of the number of object values, their sum, and the sum of their squares. This allows easy later processing.
  • the grid may go through a processing step 340 before filtering. For example, in an implementation of placing step 326 that lists all object values in the grid, processing step 340 may be used to ensure that all locations in the grid have a single object value. This may be done by averaging those locations that have multiple object values placed on them. Locations that do not have any object values placed on them are set to a pre-determined value, say 0.
  • One may derive a confidence value in step 340 as well. For example, one may compute the number of object values placed in a bin (high is good) and the variance of the object values in the bin (low is good) to obtain a confidence indication.
  • any non-object aligned filtering method may be used for this purpose.
  • a Gaussian filtering step may be applied to the grid.
  • working in the feature grid gives a resulting object alignment for the filtering.
  • One preferred way of placing is to maintain, e.g., accumulate, a number of sums for each bin. For example, the number of object values (S1) and the sum thereof (SD) may be maintained for each bin. Filter 350 may then be applied to S1 and to SD. This has the advantage that bins that did not receive data end up with a value S1 > 0. Accordingly, SD/S1 then does not imply a division by zero.
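  • a minimal sketch of this blur-both-then-divide step is given below (Python with SciPy; the function name and the Gaussian sigma are arbitrary choices, and `count`/`total` are the S1 and SD accumulators from the mapping stage):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_feature_grid(count, total, sigma=1.0):
    """Blur the S1 (count) and SD (sum) accumulators with the same
    Gaussian, then divide: empty bins inherit neighbouring mass, so
    SD/S1 avoids a hard division by zero."""
    s1 = gaussian_filter(count, sigma)
    sd = gaussian_filter(total, sigma)
    return sd / np.maximum(s1, 1e-9)  # per-bin filtered object value

# Read-out ("slicing"): a pixel whose binned feature values are (i, j)
# receives filtered_grid[i, j] as its filtered object value.
```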
  • fitting of object values to a model may use this approach.
  • the coefficients representing the fitted model, say a and b of a particular vertically sloping plane, may be obtained as a function of the maintained sums. Each of these sums may be separately filtered.
  • the grid may be read out to obtain a filtering for the object values.
  • a dense set of object values is obtained for the entire image in which noise has been reduced.
  • a pixel is selected 362. Again the feature values for this pixel in the subset of feature maps are looked up 364. If needed the values are binned 366 as in step 324. The location indicated by the feature values is looked up 368 in the filtered feature grid; the filtered value there may be taken as the filtered object value corresponding to the selected pixel. The read out of the filtered feature grid may be repeated for all pixels in the image, so as to obtain a dense object value map.
  • the grid may represent values in various ways, for example, the particular bin may have a fitted model, say a plane, fitted to the object values, and in that case the fitted model may be applied to the feature values to obtain the filtered object value.
  • an object aligned filtering apparatus comprises a microprocessor (not shown) which executes appropriate software stored at the apparatus; e.g., the software may have been downloaded and stored in a corresponding memory, e.g. RAM (not shown).
  • the software may execute the method of Fig. 1. It is advantageous to implement the method, in whole or in part, in hardware. For example, mapping steps 322, 324, 326 and 330 may well be done in hardware. A feature map need not be made explicit before step 322, but could be computed just in time.
  • steps 322, 324, 326 may be executed, at least partially, in parallel for multiple pixels. Moreover, a given step may not have finished completely before a next step is started.
  • a method according to the invention may be executed using software, which comprises instructions for causing a processor system to perform method 100 or 300.
  • Software may only include those steps taken by a particular sub-entity of the system.
  • the software may be stored in a suitable storage medium, such as a hard disk, a floppy, a memory etc.
  • the software may be sent as a signal along a wire, or wireless, or using a data network, e.g., the Internet.
  • the software may be made available for download and/or for remote usage on a server.
  • the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice.
  • the program may be in the form of source code, object code, a code intermediate source and object code such as partially compiled form, or in any other form suitable for use in the implementation of the method according to the invention.
  • An embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the processing steps of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or be stored in one or more files that may be linked statically or dynamically.
  • Another embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the means of at least one of the systems and/or products set forth.
  • the multiple feature maps may be denoted as one multidimensional feature vector F(x,y) comprising features as components.
  • F has at least three features f1, f2 and f3 as components: F(x,y) = (f1(x,y), f2(x,y), f3(x,y)).
  • when using feature vectors, we will also refer to a feature grid as a feature-space projection histogram. We use the term histogram since it is convenient to implement the method using running sums, but this is not necessary.
  • An advantageous method comprises mapping the measurements d_m(x_m,y_m) and related data into at least two different sub-spaces having a lower dimensionality (preferably planes), the sub-spaces comprising information from at least one feature, conducting filtering of the measurements d_m(x_m,y_m) and related data in these at least two different sub-spaces, calculating filtered estimates of d(x,y) in each sub-space, and combining the at least two filtered estimates in an inverse mapping onto the dense grid.
  • the object values to be filtered, i.e., the measurements d_m(x_m,y_m), need not be one of the features.
  • the filtering is done in multiple 2D feature spaces.
  • the multiple (e.g. all 10) estimates are combined into a single estimate, e.g., using a weighting that depends on the variance of the variable (e.g. D) that is being filtered in each of the particular spaces.
  • other subspaces are also possible. For instance, as an alternative one may project the 5D space onto a number of 3D spaces. Those filtered values can then be combined using again a weighting, e.g., that depends on the variance.
  • the fitting may assume a constant model, or one may produce a, for instance linear, disparity model, which allows us to fit a linear model to disparity and use that in the slicing to produce smooth surfaces.
  • Multiple models can be weighted, again based on the variance/fit of each of the separate models.
  • consider a variable D(x,y) that we wish to filter using extra information that is available to us at pixel location (x,y).
  • This extra information is represented as a multi-dimensional vector f(x,y) that is known for each pixel location (x,y). Its components can for instance consist of visual color information, texture information and/or location information.
  • to filter D(x,y), we project it into multiple feature-space projection histograms of f(x,y). Each of the projections of f(x,y) has a lower dimension than the feature vector f(x,y) and is a projection onto fewer, say 2, of the N dimensions.
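  • enumerating these projections is straightforward; the sketch below (Python; N = 5 is just an example) lists all projections onto 2 of the N dimensions, each of which indexes its own 2-D histogram:

```python
from itertools import combinations

n = 5  # dimension of the feature vector f(x, y)
# All projections onto 2 of the N dimensions; each pair gets its own
# 2-D feature-space projection histogram (10 pairs for N = 5).
pairs = list(combinations(range(n), 2))
print(len(pairs), pairs)
```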
  • a signal model m is now estimated for each bin in each feature-space projection.
  • This model describes how D(x,y) behaves in a certain spatial domain (x,y) ∈ Ω.
  • Model m can just be a constant level but also a sloping function (for example a function of the image coordinates). So a given pixel (x,y) is part of one or more spatial domains and each spatial domain has an associated signal model.
  • the final estimate is a weighted combination of all estimates for all domains where the weights depend on the model error E.
  • the model error can for instance be the variance that D(x,y) has in each of the feature-space projection histograms.
  • An embodiment of the idea is illustrated by Fig. 5.
  • let F² denote the set of all 2D projections of f: F² = {(r,g), (r,b), (g,b)}.
  • the three projected spaces are: rg-space, rb-space and gb-space.
  • As a next step we visit all pixels in the image and produce three accumulation arrays per feature space.
  • One array just accumulates the value 1 in a given bin, indicated by s(r,g) in the first row of Fig. 5. This is a histogram.
  • a second array accumulates the disparities D (second row in Fig. 5) and a third array accumulates the squared disparities D² (third row in Fig. 5).
  • Each variance provides information on the fit of the data (disparity values) when assuming a constant as a model.
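  • as a concrete identity (standard algebra, stated here for reference rather than quoted from the original text): with per-bin accumulators $S_1 = \sum_i 1$, $S_D = \sum_i D_i$ and $S_{D^2} = \sum_i D_i^2$, the per-bin mean and variance follow directly as

$$\bar{D} = \frac{S_D}{S_1}, \qquad \operatorname{Var}(D) = \frac{S_{D^2}}{S_1} - \bar{D}^2 .$$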
  • the final filter that is applied to the data is a weighted combination of the mean disparities over all 2D projections of feature space, wherein E denotes the mean squared deviation from the mean disparity. The higher the error E, the lower the weight of the corresponding disparity estimate in the weighted combination. The effect of this filter is illustrated in Figs. 4a and 4b.
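  • one plausible concrete form of such a combination is sketched below; the specific inverse-error weight is an assumption consistent with "higher E, lower weight", not a formula quoted from the original text:

$$\hat{D}(x,y) = \frac{\sum_{s} w_s(x,y)\, \bar{D}_s(x,y)}{\sum_{s} w_s(x,y)}, \qquad w_s(x,y) = \frac{1}{E_s(x,y) + \varepsilon},$$

where $\bar{D}_s$ is the mean disparity in the bin that pixel (x,y) maps to in projection space s, $E_s$ the corresponding mean squared deviation, and $\varepsilon$ a small constant guarding against division by zero.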
  • Fig. 4a is a picture displayed together with sparse disparity measurements.
  • the figure in the foreground contains long disparity vectors whereas the background contains short disparity vectors.
  • Fig. 4a is a grayscale representation of the actual picture.
  • the background of Fig. 4a is blue, the foreground figure is black.
  • Fig. 4b illustrates three feature space projection histograms for the example image in Fig. 4a. Since all disparities map into a single bin in rg-space, the variance of disparity is high in that space. Both the rb- and the gb-space give a good separation of figure and background and therefore provide reliable estimates. The filter operation just averages all three estimates but gives the estimates with lower variance a higher weight.
  • Using RGB features typically works well if the constant color object assumption holds.
  • application of the above filter can be useful when objects are partly illuminated and partly in the shadow. This is a frequently occurring situation. For instance, a house can cast a shadow on an otherwise brightly illuminated grass field. In this case it would be beneficial that there is at least a single domain Ω in the image that selects all the disparity measurements of the entire grass field together, ignoring the shadow, to fit a planar model to the data. A single shadow cast over both objects decreases the color intensity in both the green and blue channels. To separate objects based on color we could use the color channel ratios as features.
  • Such a feature vector gives six 2D feature space projections.
  • pixels are grouped based on their color, their intensity and both these aspects together.
  • another possible feature is a texture measure; this could e.g. be local variance. Such a feature would allow us for example to distinguish between a textured and a homogeneous object, even though they have the same mean color and mean intensity values (and are also similar for other features).
  • the space can be extended with many other features, where distinctive features for the specific data set can be chosen.
  • the filtering step in the feature grid assumes that pixels sharing the same feature combination have the same object value, e.g., are at the same depth.
  • alternatively, the object values of pixels sharing the same feature combination are related, e.g., have a linear relation.
  • for a plane, say a grass field, separation on the basis of color is of limited use, since a constant value does not fit well as a disparity model for a plane with varying depth.
  • the invention extends well to more general models. As an example, consider the following disparity model for the grass field: D = a·y + b.
  • the disparity is then a linear function of the image y-coordinate.
  • the parameters a and b can be determined using linear regression. For instance, Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, 1998, gives the equations for linear regression. Using the already introduced notation, these equations can be expressed in terms of the accumulated sums, as sketched below.
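  • for reference, the standard least-squares solution, written in per-bin accumulators $S_1 = \sum_i 1$, $S_y = \sum_i y_i$, $S_D = \sum_i D_i$, $S_{yy} = \sum_i y_i^2$ and $S_{yD} = \sum_i y_i D_i$, is (standard regression algebra, given here as a sketch of the equations the text refers to):

$$a = \frac{S_1 S_{yD} - S_y S_D}{S_1 S_{yy} - S_y^2}, \qquad b = \frac{S_D - a\, S_y}{S_1}.$$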
  • Fig. 6 shows additional feature space histograms with sums that are needed in case a linear model of the form D ≈ a·y + b is required to describe the disparity in part of the image.
  • we also need E, the mean squared error of the model fit. This error can be expressed in terms of the sums accumulated so far and the coefficients a and b: E = (1/n) · Σᵢ (Dᵢ − (a·yᵢ + b))².
  • E_k(f_i, f_j) is the corresponding error measure for disparity model k
  • w_k(s(f_i, f_j)) is a function that allows giving more or less weight to a model depending on the model complexity and/or the number of measurements in a particular bin and feature space projection. In fact, this can be generalized even further, and we can combine projection spaces of varying dimensionality. We could for example use a 1D projection of the image onto luminance space, and combine that with the above two-dimensional color projections. Similarly, we can also use multiple bin sizes, allowing large objects with more internal variation to be grouped at a larger bin size level, and selecting a narrow separation between objects at a smaller bin size level. This could even be worked out in a multi-scale manner to improve efficiency.
  • Figs. 7a and 7b show left and right images of a stereoscopic pair.
  • the images show an office corridor.
  • the image shows two walls, one straight ahead from the camera and one slanting away in the background.
  • a carpet leads up from the bottom end of the image towards the wall.
  • the image is difficult for estimating a depth map, since there are many homogeneous planes. Furthermore, the ground plane should show a smooth transition from foreground to background.
  • Fig. 7c shows a depth map constructed with a bilateral filter as described above but without allowing fitting to a linear ground plane (darker colors have a higher depth value).
  • Fig. 7d shows a depth map constructed with a bilateral filter as described above while allowing fitting to a linear ground plane. As can be seen, allowing a planar fit to the data gives a much smoother result for the ground plane.
  • Another improvement is, prior to applying the filter, to select a subset of all feature-space projection histograms based on some global confidence measure of each feature-space projection. For instance, if the variance of D is high in all bins of a given feature space projection histogram, then we can conclude that the particular projection provides little useful separation and may be left out.
  • One approach is to sort all histograms in order of increasing minimum model error, where the minimum is taken of all bins in each particular accumulation array.
  • the top N arrays in the resulting sorted list can then be applied in the filter.
  • We can for instance use the top 10 accumulation arrays from a total of 100, thereby reducing the complexity of the weighted averaging by a factor of 10.
  • prior models can be used to steer the solution. For instance, for all spaces that involve the image y-coordinate we can apply a prior on a slanted model. Also the (b,y)-space can have a large prior on large depth for bins that have a large y-coordinate and a large b-coordinate at the same time, due to the fact that 'sky' is always far away. These prior models can also be determined experimentally by averaging the feature-space projection histograms over various stereo video sequences.
  • various filtering operations may be applied to smoothen the histogram data.
  • Another option is to perform a splatting or interpolation operation when filling the histogram bins and when extracting a filtered image, respectively.
  • the data are then not entered into or extracted from a single bin in each histogram, but into or from each of the surrounding bins, with weights relative to the distance to the respective bin centers. This will also have a smoothing effect on the filtered image.
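  • a minimal sketch of such a bilinear splat is given below (Python; the function name is hypothetical and a 2-D grid is assumed; the same weights can be reused at extraction time):

```python
import numpy as np

def splat_bilinear(grid, weight, u, v, value):
    """Spread one value over the four bins surrounding the continuous
    bin coordinates (u, v), weighted by distance to the bin centers."""
    i0, j0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - i0, v - j0
    for i, wi in ((i0, 1.0 - du), (i0 + 1, du)):
        for j, wj in ((j0, 1.0 - dv), (j0 + 1, dv)):
            if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]:
                grid[i, j] += wi * wj * value  # weighted contribution
                weight[i, j] += wi * wj        # total splatted mass per bin
```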
  • the method may also generate an overall confidence map for the filtered image, without much additional effort.
  • This confidence map can be used for post-processing the filtered image, as it indicates the total filter weight for each part of the image.
  • a type of post-processing could be to replace the most uncertain pixels by their more certain neighbors.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

An object-aligned filtering method (100) for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising obtaining multiple feature maps (120) for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, a first filtering (142) of the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps, a second filtering (144) of the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset, deriving a single filtered object value (160) for the selected pixel from at least the first and second filtered object value.

Description

Object-aligned filtering method and apparatus
FIELD OF THE INVENTION
The invention relates to an object-aligned filtering method for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising obtaining multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, and a first filtering of the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by the multiple feature maps. BACKGROUND OF THE INVENTION
Over the last decade a substantial amount of research has been directed at the realization of 3D display technology for use in and around the home. As a result there has been a flurry of both stereoscopic as well as autostereoscopic displays. In stereoscopic displays, the eyes of the viewer are generally assisted by e.g. glasses or foils positioned between the viewer's eyes and the display to direct, in a time-multiplexed or simultaneous manner (e.g. through spectral separation), a left eye image to the left eye of the viewer and a right eye image to the right eye of the viewer. As generally users find wearing glasses bothersome, autostereoscopic displays, which do not require viewers to wear such glasses, have likewise received considerable attention. Autostereoscopic displays, often multi-view displays, generally allow the visualization of multiple, e.g. 5-9 or more, images or views which are multiplexed in a viewing cone directed at the viewers. By observing separate views from a cone with the left and right eye respectively, a stereoscopic effect may be obtained by the unaided eye.
An important issue for both stereoscopic displays and autostereoscopic displays is the delivery of content. Various approaches are known to deliver three-dimensional content to a display. Some of these approaches explicitly encode all views, whereas others encode one or some views and additional depth and/or disparity information for one or all of these views. The advantage of providing depth information is that it facilitates manipulation of the three-dimensional content e.g. when rendering additional views based on the images provided. Depth information is an example of object values.
Depth information may be obtained from monocular images, or monocular image sequences. A variety of applications of such algorithms may be envisaged ranging from fully automated conversion, to user assisted 2D to 3D conversion for high quality content. In case of user assisted 2D to 3D conversion computer assisted depth map generation may represent a substantial time saving. Depth maps generated from monocular images typically suffer from sparsity, i.e., depth information is not available for all image points; moreover depth information may be noisy.
Depth information may also be acquired through analysis of e.g. disparity of stereo images. However, disparity analysis of two images requires that any particular location in a first image be located in the second image of a stereoscopic pair of images. For many image points this may be difficult, e.g., for points located in relatively featureless parts of the image, such as a wall of uniform color. This may result in sparsity, i.e., the determination of depth information may fail for some points, and in noise, e.g., if a point in the second image is matched slightly off.
An example of an approach to obtain a depth map from a monocular image is presented in "Depth Map Generation by Image Classification", by S. Battiato, et al, published in Proceedings of SPIE Electronic Imaging 2004, Three-Dimensional Image Capture and Applications VI - Vol. 5302-13. In the above paper a depth map based on an estimated global depth profile of an image is combined with a further depth map which comprises image structure. The resulting combination however does not always provide for a satisfactory depth perception in the resulting image.
International patent application PCT/IB2009/054857, published as WO2010/052632A1, with title 'Method and device for generating a depth map', incorporated herein by reference, discloses a method of generating a depth map for a monocular image, using a global depth profile in combination with color and luminance values of the image.
SUMMARY OF THE INVENTION
It is a problem of the prior art that depth information for an image is often sparse and/or noisy, whereas dense information with low noise is desired. One way to reduce noise and to create a dense depth map from a sparse one is to filter the available depth information. Ordinary filtering methods have the disadvantage that they spread depth data across object boundaries visible in the image. A first improvement is to use an object aligned filtering method to filter the depth data.
Two recent object aligned filtering methods are bilateral filtering and guided filtering. Guided filtering is described in the paper 'Guided image filtering' by He, Kaiming and Sun, Jian and Tang, Xiaoou, published in the Proceedings of the 11th European Conference on Computer Vision: Part I, ECCV'10, 2010, incorporated herein by reference. An efficient bilateral filtering method is described in 'Real-time edge-aware image processing with the bilateral grid' by Chen, Jiawen and Paris, Sylvain and Durand, Fredo, published in ACM SIGGRAPH 2007 papers.
In these methods a filtering is guided by additional data (feature maps) that relates to the objects; the filtering is adapted such that it is unlikely to cross object boundaries as indicated by the additional data.
It is noted that the papers cited above do not disclose applications to depth data.
A cross-bilateral filter applied to smoothen a disparity map after an initial estimation often suffers from over-smoothening, causing leakage of foreground depth into the background and vice versa. One approach to this problem could be to reduce the filter size. Unfortunately, this will also reduce the amount of information available within the filter support, and therefore this will create problems in other parts of the image. Another option is to enlarge the feature space, and add features to create a better separation between objects. This can be done in a bilateral grid by adding dimensions to the bilateral grid histogram. However, the computational complexity of the algorithm rapidly increases when adding features. This is why often only the image luminance is used. When going from luminance only to RGB colors, the typical bilateral grid would grow from a 3D histogram to a 5D histogram. A second problem with higher-dimensional histograms is the increasing sparseness of the available data. By adding dimensions to the histogram, the distance between the available data points rapidly increases.
The invention provides an object-aligned filtering method for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising obtaining multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, a first filtering of the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps, a second filtering of the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset, deriving a single filtered object value for the selected pixel from at least the first and second filtered object value.
Given an image representing a number of objects, for some of the image pixels object values are available representing some property of an object corresponding to that image pixel. An object value relates to a spatial point of the image depicting the object. In preferred embodiments, the object values represent depth data, however the filtering method is applicable to other types of data for which object aligned filtering is needed, for example, motion data.
For the image a number of feature maps are available. A feature map assigns a value to most, very preferably all, of the pixels in the image. A feature map may be obtained by processing image pixel values. For example, a feature map may be obtained for each of the three components of an RGB color indication of the picture. A feature map may, however, also be obtained independently of the image. For example, a coordinate in a coordinate system for the image may be used, as it gives information on proximity.
Ideally, one may prefer to take into account all feature maps that contain any information regarding the edges of objects. However, the amount of processing that needs to be done in a bilateral method rises exponentially with the number of features.
In the invention (at least) two filtering steps are performed. In a first filtering a first subset of the feature maps is taken into account, in a second filtering a second subset of the multiple feature maps is taken into account. Next the two filterings are combined to obtain a single filtering. For example, one could average the two filterings, or use other more advanced methods as explained below. At least one, but preferably both, of the first and second filtering are guided by a proper subset of the multiple feature maps, i.e., a subset having a cardinality lower than that of the full set of the multiple feature maps.
The method allows an object-aligned filtering that can take into account a larger amount of feature data in the same amount of time. In an embodiment, the method comprises multiple filtering steps, each filtering step being guided by a corresponding different subset of the multiple feature maps, the single filtered object value being derived from the multiple filtering steps. The amount of work increases linearly with the number of filtering steps, not exponentially with the number of feature maps. The number of filtering steps may be 2, 2 or more, 3, 3 or more, 4 or more, 6 or more, 10 or more etc.
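To make the linear growth concrete, the following sketch (hypothetical Python/NumPy code, not part of the disclosed embodiments; names such as bin_average_filter and the 5% sampling rate are illustrative assumptions) performs one simple bin-average filtering per pair of feature maps and averages the results into a single estimate:

```python
import numpy as np
from itertools import combinations

def bin_average_filter(values, mask, feats, n_bins=32):
    # Quantize each guiding feature map (assumed in [0,1]) into bins.
    idx = tuple(np.clip((f * n_bins).astype(int), 0, n_bins - 1)
                for f in feats)
    count = np.zeros((n_bins,) * len(feats))
    total = np.zeros_like(count)
    # Map the known object values into the grid (running count and sum).
    np.add.at(count, tuple(i[mask] for i in idx), 1)
    np.add.at(total, tuple(i[mask] for i in idx), values[mask])
    avg = total / np.maximum(count, 1)   # per-bin mean, 0 where empty
    return avg[idx]                      # 'slice' the mean back per pixel

# Hypothetical inputs: an RGB image, sparse depth and a validity mask.
rng = np.random.default_rng(0)
image = rng.random((120, 160, 3))
depth = rng.random((120, 160))
mask = rng.random((120, 160)) < 0.05     # only ~5% of pixels measured

feature_maps = [image[..., c] for c in range(3)]   # r, g and b maps
# One filtering per pair of feature maps: the work grows with the number
# of filterings, not exponentially with the number of features.
filtered = [bin_average_filter(depth, mask, pair)
            for pair in combinations(feature_maps, 2)]
dense_depth = np.mean(filtered, axis=0)  # combine into a single filtering
```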
Preferably, the first and second filtering use the same filtering method, but this is not necessary.
The object values may be estimated data associated with image locations of the image. For example, the estimated data may be estimated from a monocular image, or from disparity data. The object values may be sparse, i.e., only available for part of the image. The sparsity may even be severe, i.e., an object value may be available for less than 1% of the number of pixels. The object values may be dense but suffer from noise. The object values are typically numeric values, typically real numbers. The object values may be part of an object value vector. In preferred embodiments, the object values represent depth data, either directly, or indirectly, say in the form of disparity data.
The invention also applies to other types of object values than depth data, say motion data. The invention may be used to convert motion estimates from sparse to dense. The invention may also be used to convert block-based motion estimates to pixel-based motion estimates. The latter may be implemented by having the same block-based motion estimate for all pixels in the block of pixels; alternatively, a selected point in the block, say a center point may correspond with the motion estimate; etc. The block-based motion estimates may be represented as a sub-sampled motion vector field. One may also convert sparse annotations of some variable, or segment label maps, to a dense map of the same variable.
The method may be applied for a single selected point of the image but preferably, the method is applied for all points of the image to obtain dense object values for an image. For example, the method may be applied to a sparse depth map (some image pixels do not have a depth estimate) to produce a dense depth map (having a depth value for each image pixel). If the first and/or second filtering is configured to allow a sparse input, the filtering may be applied directly to the object values. However, if needed one may convert sparse data to dense data before filtering by assigning a fixed value, say 0, to pixels having no object value. This way of upscaling sparse to dense data may give acceptable results if followed by an edge aligned filtering step.
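A minimal sketch of this pre-fill (hypothetical Python/NumPy code; the array sizes and ~5% coverage are assumptions for the example) could be:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = np.full((120, 160), np.nan)      # sparse map: NaN marks 'no value'
known = rng.random((120, 160)) < 0.05    # hypothetical ~5% coverage
depth[known] = rng.random((120, 160))[known]

# Pre-fill missing pixels with a fixed value (0) to obtain dense input;
# a subsequent edge-aligned filtering step smooths out this crude guess.
dense_input = np.where(np.isnan(depth), 0.0, depth)
```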
Like a depth map, a feature map assigns values to pixel points. In an embodiment, at least one feature map in the first subset and/or the second subset is derived from image values of the multiple pixels. The image values of a pixel typically represent its color and/or luminosity. For example, a feature map may comprise any one of the components of RGB data for a pixel, a value indicating texture data, or a value indicating luminosity. Texture and luminosity values may be derived from the image.
Feature maps that contain this type of information are related to objects in the image because they are derived from an image depicting them. Filtering guided by such a feature map relies on the heuristic that objects of a similar color, texture, etc., are likely to be at a similar distance, i.e., have similar object values.
However, feature maps may also be derived independently of the image values of the multiple pixels. For example, a feature map may assign a coordinate value to a pixel, say the x or y coordinate. Different coordinate systems may be used, say polar coordinates, coordinates having different axes, etc. Using such a feature map relies on the heuristic that objects that have similar coordinates, i.e., are close to each other, or lie in the same direction, are likely to be at a similar distance, i.e., have similar object values.
The multiple feature values assigned to a pixel by the multiple feature maps may be collected in a feature vector. One may then represent the filtering step as filtering guided by a feature vector reduced in dimension, e.g., by projection.
Some of the feature maps may give reliable information when regarded on their own. For example, if the image contains a single red object at a particular distance, a filtering guided only by the red component of RGB pixel values will likely give good filtering results, at least around the red object. However, two feature maps in combination may reveal more of the object boundaries than they can in isolation. In an embodiment, the first subset and the second subset share at least one feature map. Filtering guided by the same feature maps in different combinations makes it less likely that information is thrown away by not considering all feature maps simultaneously.
Indeed, in an embodiment, the first subset and/or the second subset contain exactly two feature maps. Empirically it appears that if an object boundary could be determined from the full collection of feature maps, i.e., by computationally expensive methods, then often there was some combination of two feature maps from which the same boundary could also be derived.
Preferably, the number of multiple feature maps is at least 3 since this will allow at least 2 filtering steps each guided by two feature maps (having an overlap of one feature map).
Note that one could do a filtering step for each pair of feature maps from the multiple feature maps. Even then the amount of work would only grow quadratically with the number of feature maps. For example, for 3, 4, 5, 6, 7 feature maps one could do 3, 6, 10, 15, 21 filtering steps, respectively, one filtering step for any possible combination of two feature maps. Not all combinations of two feature maps need to be used. For example, some combinations of two feature maps are expected to give better than average results. For example, the combination of a feature map representing the y-coordinate and the intensity of the color blue works very well to identify skies. When images including skies are expected it is advantageous to include this combination, perhaps at the expense of other combinations.
In principle any one of the first and/or second filtering may use any object aligned filtering method for filtering object values guided by additional information such as feature maps. For example, one could use guided filtering. However, one especially convenient filtering method is bilateral filtering. Bilateral filtering using additional data is sometimes referred to as 'joint' or 'cross' bilateral filtering.
Preferably, at least one and preferably both, of the first and second filtering is edge aligned. This may be accomplished by including image data, say luminance or chroma values, as a feature map in the feature maps that guide the filtering.
When using bilateral filtering, it is especially advantageous to make use of a so-called bilateral grid. A bilateral grid is a multi-dimensional space, typically implemented as an array, wherein each dimension is indexed with values of a particular feature. For example, given the features {x coordinate, y coordinate, color intensity}, one could construct a 3-dimensional grid, wherein each point represents a combination of an x-coordinate, a y-coordinate and a color intensity. An object value may be mapped to a grid by looking up the feature values of the feature maps used for that filtering. The combination of the feature values indicates the location in the grid where the object value is to be processed.
Typically, grids used in bilateral filtering are at least 3 dimensional, in order to represent some combination of proximity (x and y coordinates) and color (intensity). However in the method according to the invention this is not needed since multiple filtering steps are available. Accordingly, the first and/or second filtering step may be guided by 2 feature maps. Even using 1 feature map for one or more of the filterings is feasible.
Using two feature maps instead of 3 or more has the advantage of not increasing sparsity. If the grid has 3 dimensions the sparse object values are projected in a space of even larger dimension increasing the sparsity problem. With more than 3 feature maps this problem is even more severe.
Preferably, the grid is divided in multiple bins, a bin representing multiple combinations of feature values. These types of grids are also referred to as histograms. Binning increases the number of object values that are mapped to the same location in the grid. This allows better statistical analysis on the data mapped to a particular grid. Without binning only data values having the exact same combination of feature values in the chosen subset of feature maps would be mapped to the same location.
For example, one may determine a variance of the object values mapped to a bin, and derive a confidence therefrom. For example, a low variance corresponds to a high confidence for that bin. The low variance in object value confirms the heuristic that objects sharing the particular features have similar object values (e.g. are at the same depth). Thus if a low variance is found together with a number of mapped object values above a threshold, then it is reasonable to give high confidence to filtering done on the basis of it.
Object values mapped to the same location in the grid (same bin/point) may be processed in various ways. For example, one may average them and use the averaged object values as the basis for filtering over the grid. In a practical implementation, running counts are kept of object values during the mapping process. For example, for object values corresponding to pixels of the image, one may keep a count of the number of mapped object values per bin and the sum of the mapped object values per bin. After the mapping, the average may be derived by dividing the sum by the count (if the count is 1 or more).
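A minimal sketch of such running counts (hypothetical Python/NumPy code; the grid size, bin indices and the thresholds in the last line are assumptions) could look as follows; it also derives a per-bin variance and a simple confidence as discussed above:

```python
import numpy as np

def accumulate(grid_shape, bins, values):
    # Running sums per bin: S1 (count), SD (sum) and SD2 (sum of squares).
    s1 = np.zeros(grid_shape)
    sd = np.zeros(grid_shape)
    sd2 = np.zeros(grid_shape)
    np.add.at(s1, bins, 1)
    np.add.at(sd, bins, values)
    np.add.at(sd2, bins, values ** 2)
    return s1, sd, sd2

# Hypothetical 16x16 two-feature grid and 1000 mapped object values.
rng = np.random.default_rng(0)
bins = (rng.integers(0, 16, 1000), rng.integers(0, 16, 1000))
values = rng.random(1000)
s1, sd, sd2 = accumulate((16, 16), bins, values)

mean = sd / np.maximum(s1, 1)                # per-bin average
var = sd2 / np.maximum(s1, 1) - mean ** 2    # variance: E[D^2] - E[D]^2
confidence = (s1 >= 5) / (var + 1e-6)        # many samples, low variance
```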
After mapping data values to the grid, and possible other processing (such as averaging, or computing variances), a filter may be done over the grid. This filtering need not be edge aligned. For example, a Gaussian filter may be applied to the grid.
Next a first filtered object value for a selected pixel may be obtained by obtaining the feature values of the selected pixel and looking up the filtered value in the grid. This process is sometimes referred to as slicing.
A single filtered object value may be derived from the multiple filtered object values obtained from multiple filtering steps. For example, the multiple filtered object values, say the first and second filtered object values, may be averaged. The averaging may be weighted with a confidence value indicating the confidence of the filtered object values. Note that a confidence value need not be the same for all bins. In fact, it is likely that some feature combination works well for some objects, but other combinations may work better for other objects.
In an embodiment, the method comprises determining a first confidence indicative of the object alignment of the first filtering and a second confidence indicative of the object alignment of the second filtering, and wherein the deriving of the single filtered object value further depends on the first and second confidence.
Preferably a confidence, say the first and second confidence, is expressed as a confidence value.
The object alignment may be obtained by having at least one of the first and second filtering resulting in object alignment through the guidance by the respective feature maps at least one of which is related to the objects. By basing the deriving of the single filtered object value on the first and second confidence the object alignment may also be obtained or improved. Different combinations of feature maps guide the filtering in different ways. Some of these filtering will be better than others. By using from the multiple filtering, those filterings (or parts thereof) that are best aligned with the objects the object alignment of the end-results is improved.
Determining if a filtering has a good object alignment may be done in various ways.
A filtering method that selects a set of object values and fits it to a model may determine a measure, e.g. a value, indicating how close the fit is. If the model is a constant value, then the variance of the selected set of object values is a measure for object alignment; a low variance indicating that a constant value model gives a good fit. For example, if the filtering uses binning, such as bilateral filtering, then a bin is likely to have good object alignment if the object values are close together.
For example, the following is an advantageous filtering method which may be used for the first and/or second filtering: A filtering method for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising a filtering of the object values to obtain a filtered object value for a selected pixel of the multiple pixels, the filtering being guided by a subset of the multiple feature maps, wherein the filtering comprises: selecting a set of object values dependent upon the subset of the multiple feature maps, fitting the set of object values to a model, determining a measure indicating how close the fit is, and using the fitted model to obtain a filtered object value and using the measure indicating how close the fit is to obtain a confidence indicative of the object alignment of the filtering, at least for the selected pixel.
Obtaining or improving object alignment through an after-filtering selection of the best filterings gives better results if the object values themselves are correlated with the object; for example, if two object values corresponding to the same object are closer together than two object values corresponding to different objects (at least on average). Object properties, such as depth values, motion vectors or components thereof, segment label maps, etc., all satisfy this criterion. The method may even be applied to object values obtained from the image, i.e., chroma values, e.g., to obtain object-based denoising.
In an embodiment, the grid is initially not reset to 0 but filled with the results for a previous image, possibly multiplied by a factor, say ½. This has the advantage that the object information from a previous image is used for a next image. This in turn may improve the temporal stability of the filter output when converting shots of image sequences.
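A sketch of this temporal reuse, using the three running sums from the sketch above and a hypothetical decay factor of ½ (assumed Python code, not part of the disclosed embodiments), could be:

```python
import numpy as np

def init_grid_from_previous(prev_sums=None, shape=(16, 16), decay=0.5):
    # First frame: empty sums. Later frames: previous sums scaled by 1/2
    # so earlier object information still contributes, with less weight.
    if prev_sums is None:
        return tuple(np.zeros(shape) for _ in range(3))   # S1, SD, SD2
    return tuple(decay * s for s in prev_sums)

s1, sd, sd2 = init_grid_from_previous()                   # frame 0
s1, sd, sd2 = init_grid_from_previous((s1, sd, sd2))      # frame 1
```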
A further aspect of the invention concerns an object-aligned filtering apparatus for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the apparatus comprising a feature map obtaining module configured to obtain multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects, a first filtering module configured to filter the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps, a second filtering module configured to filter the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset, a combination module configured to derive a single filtered object value for the selected pixel from at least the first and second filtered object value.
The object-aligned filtering apparatus is an electronic device. For example, the object-aligned filtering apparatus may be comprised in a display, e.g. a television, especially a 3D display, more especially, an autostereoscopic display. The filtered object values may be a filtered depth map. A filtered depth map may be used to compute intermediate display images.
The object-aligned filtering apparatus may be comprised in a mobile electronic device, especially a gaming device or mobile phone. The object-aligned filtering apparatus may be comprised in a set-top box, computer, etc.
The method according to the invention is particularly suitable for efficient hardware implementation. Mapping object values to a grid may be done using one or more integrated circuits. Preferably, the mapping is done in parallel, mapping multiple pixels at the same time.
A method according to the invention may be implemented on a computer as a computer implemented method, or in dedicated hardware, or in a combination of both. Executable code for a method according to the invention may be stored on a computer program product. Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, etc. Preferably, the computer program product comprises non-transitory program code means stored on a computer readable medium for performing a method according to the invention when said program product is executed on a computer. In a preferred embodiment, the computer program comprises computer program code means adapted to perform all the steps of a method according to the invention when the computer program is run on a computer. Preferably, the computer program is embodied on a computer readable medium.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is explained in further detail by way of example and with reference to the accompanying drawings, wherein:
Figure 1 is a flow chart illustrating an object-aligned filtering method according to the invention,
Figure 2 is a block diagram illustrating an object-aligned filtering system according to the invention,
Figure 3 is flow chart illustrating an object-aligned filtering method for use in the invention,
Figure 4a is an exemplary picture and disparity data,
Figure 4b shows three feature grids with different subsets of feature maps for the picture of figure 4a,
Figure 5 is a diagram illustrating an embodiment of feature grids,
Figure 6 is a diagram illustrating a further embodiment of feature grids,
Figure 7a and 7b are gray-scale representations of a stereoscopic image pair,
Figure 7c is a gray-scale representation of a depth map,
Figure 7d is a gray-scale representation of a depth map.
Throughout the Figures, similar or corresponding features are indicated by same reference numerals.
DETAILED EMBODIMENTS
While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail one or more specific embodiments, with the understanding that the present disclosure is to be considered as exemplary of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described.
Fig. 2 illustrates in the form of a block diagram a system 200 according to the invention. The system may be used for the method of Fig. 1. Fig. 2 shows an image source 220. For example, the image source could be a content device, e.g., a disc player, an internet connection etc. System 200 comprises an image based feature map module 222 for deriving feature maps from the image.
System 200 comprises a non-image based feature map source 224. The non-image based feature maps may be retrieved from disc or computed. Feature maps are described more fully below.
System 200 comprises a selector 230 for selecting subsets of the feature maps obtained by feature map module 222 and feature map source 224. Feature map selector 230 selects at least two different subsets. System 200 comprises an object aligned filtering module 240 for filtering object values guided by feature maps selected by feature map selector 230. Filtering module 240 will be used at least twice for the same image and object values. Filtering module 240 may be implemented twice in filtering system 200 to allow parallelization. Filtering module 240 receives object values from some source 210. For example, object values source 210 comprises a disparity analyzer. Optionally, filtering system 200 comprises a confidence module 250. Confidence module 250 computes a confidence value for the filtering, estimating how well the filtering is object aligned. System 200 comprises a filtering combiner 260 for combining the at least two filterings of filtering module 240. The filtered object values may be used for a variety of purposes indicated by sink 270. For example, filtering combiner 260 may use the filtered object values to calculate intermediate images for an autostereoscopic display. Depth data may be used to compute a second image corresponding to the initial image to obtain a stereoscopic pair, etc.
Feature map module 222, feature map selector 230, filtering module 240, (optionally) confidence module 250 and filtering combiner 260 may be combined to form an object aligned filtering apparatus.
Fig. 1 illustrates in the form of a flow chart a method according to the invention. It is noted that Fig. 1 includes some optional steps and omits others.
A set of object values is obtained 110, e.g. from source 210. The object values may, e.g., represent depth data for pixels in an image. The depth data may have been obtained by an analysis of a monocular image, or from a comparison of a stereoscopic image. Since obtaining depth data is hard, it is expected that the data is sparse or noisy, and probably both. The method aims to filter the data so that the sparse available data is extended over the image and that noise is reduced. It is disadvantageous however if the filtering would extend beyond the edge of objects. Especially for 3D applications and filtering of depth maps, this leads to an unsatisfactory viewing experience. To this end feature maps are obtained 120. The feature maps may be obtained from an analysis of the image to which the depth data relates 122, e.g. by module 222. Examples of this include color components, e.g., of RGB encoding of pixel color value, or components of other color spaces, e.g. luminosity values, U and V values, etc. Obtaining a feature map from the image may involve computations. For example, one or more values indicating texture may be obtained for pixels. A texture value for a pixel may be obtained by computing the variance in pixel values in a region around the pixel.
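For instance, a texture feature map based on local variance could be computed as in the following sketch (hypothetical Python code using SciPy's uniform_filter; the window size of 7 is an arbitrary assumption):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_variance(gray, size=7):
    # Variance of pixel values in a size x size window around each pixel,
    # computed as E[x^2] - E[x]^2 with box averaging.
    mean = uniform_filter(gray, size)
    mean_sq = uniform_filter(gray ** 2, size)
    return np.maximum(mean_sq - mean ** 2, 0.0)  # clamp rounding noise

rng = np.random.default_rng(0)
gray = rng.random((120, 160))
texture_map = local_variance(gray)               # one extra feature map
```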
Feature maps may also be obtained 124 without using the image itself, e.g., from a feature map source 224. Example features are a coordinate in a coordinate system; Cartesian coordinates possibly having skewed axes, polar coordinates etc. A further example includes a global depth map obtained from a scene type. These feature maps may be obtained from a source, e.g., a memory, or may be computed on the fly.
A feature map assigns a feature value to (preferably each) pixel of the image. The number of feature maps may be chosen larger with the invention than with known object aligned systems. As an example, one may use three feature maps to represent the 3 color components (r, g, and b). This would allow for a much more color sensitive analysis than only a luminosity value would. As a further example, one may use five feature maps to represent the 3 color components (r, g, and b) and x and y coordinates of a pixel. This allows inclusion of proximity information in addition to color information.
In advanced embodiments feature maps may even be derived from other image modalities for the same scene. For example, when filtering motion estimations a feature map may comprise depth information for the imaged scene, e.g. obtained from a rangefinder.
From the feature maps that are used, at least two selections 130 are made, e.g., by a feature map selector 230; a first and second subset of the multiple feature maps is selected. The selection will typically be pre-determined but this is not necessary, for example, the feature map selection may be dependent upon the image type, e.g. a scene type obtained from some other source, e.g., color information. For example, for a black and white image, using color features is not likely to add useful filtering results.
Together steps 122 and 124 produce multiple feature maps. One could omit non-image based feature maps, i.e., step 124 and source 224. For each selected subset of the multiple feature maps an object aligned filtering is performed, e.g., by an object aligned filtering module 240. A first filtering 142 is done guided by a first subset of the multiple feature maps. A second filtering 144 is done by a second subset of the multiple feature maps.
Preferably, at least one feature map of the first subset and at least one feature map of the second subset is image based. But this is not necessary; for example, the first subset could contain image-based features such as color, whereas the second subset could contain only non-image based features, say coordinates. The latter would filter well in regions where a large number of object values are available. To avoid leaking problems, the latter could have relatively small bins, and high confidence requirements (e.g., confidence values are multiplied by a factor smaller than 1). The object-based filtering method is improved if the multiple feature maps comprise feature maps that correlate with the objects, e.g., have a correlation above a correlation threshold.
The first and second filtering may be done by module 240 and uses an object aligned filtering method for filtering a set of object values that allows guidance with feature maps. For example, the filtering could be based on bilateral filtering or on guided filtering. In the so-called 'Guided filters' (see paper in background) a local, typically linear, model is constructed from a guidance image (color image), and the data to be filtered (object values e.g. the depth image). Such a guided filter could be adapted to construct a model from a subset of feature maps to the object values. The linear model is then applied to the object values to obtain a local filtering. The bilateral filtering preferably uses a grid, sometimes referred to as a bilateral grid. We will sometimes refer to the grid as a feature grid. The feature grid is also referred to as a histogram. Below, with Fig. 3, we will exemplify object aligned filtering using a grid based filter.
It is advantageous to also determine confidence values for the filtering, e.g., the first and second filtering. Although a confidence could be pre-determined, for example, based on how well a particular subset performed in the past, preferably the confidence value is dependent upon the filtering. A confidence value may be obtained for a whole image, but preferably, a confidence is obtained on a per pixel or per bin basis. The latter allows that different combinations of filtering are made for different parts of the image. For example, the subset containing a feature map based on y coordinate, and blue color will likely do well to filter object values in the 'sky' object. A color blue with a high y value is likely to be from the same object. Thus a high confidence will be attached to this part of the filtering. However, the filtering based on this subset may perform poorly to discriminate a red object against a green background. A subset based on a red component feature map and a green component feature map will likely do much better. Confidence may be based on variance and number of object values. If a large number of object values are mapped and have a low variance, the confidence is likely high. If the combination of filtering is weighted by confidence, the filtering around the red object will favor the red/green filtering and the filtering in the sky the blue/y filtering.
The multiple filterings are combined 160 to a single filtering, e.g., by filtering combiner 260. The filtering combiner may simply take the average of the filterings. Many edges will be detected by a number of feature combinations, even though some individual filterings may miss an object. The average is thus much more robust than using a single filtering would be. A more refined approach is to weigh the average with confidence values, high-confidence filterings receiving more weight. The weighting could be different for different parts of the image.
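A sketch of such a confidence-weighted combination, assuming per-pixel confidence maps are available for each filtering (hypothetical NumPy code with random placeholder data), could be:

```python
import numpy as np

rng = np.random.default_rng(0)
estimates = rng.random((3, 120, 160))    # one filtered map per subset
confidences = rng.random((3, 120, 160))  # e.g. derived from bin variance

# Normalize per pixel so weights sum to one, then average the filterings;
# different parts of the image can thus favor different feature subsets.
weights = confidences / confidences.sum(axis=0, keepdims=True)
combined = (weights * estimates).sum(axis=0)
```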
The final filtering may be used in a filtered object values sink 270, e.g., for stereoscopic displays. It is noted that the method may be applied to any data that needs object aligned (edge aligned) filtering.
Fig. 3 illustrates in a block diagram an object-aligned filtering method based on bilateral grid-based filtering that may be applied in steps 142 and 144 and/or in filtering module 240.
The set of object values corresponding to an image depicting said objects is obtained, e.g. in step 110, e.g., from source 210; an object value of the set of object values corresponds to a pixel of the multiple pixels of the image. A subset of feature maps is available for the image.
An object value is selected 310 along with the pixel that the object value corresponds to. For each of the subset of feature maps, the feature value is looked up, that the feature map assigns to the corresponding pixel. In this way a set of feature values is obtained for the pixel. This set is sometimes referred to as a feature vector. Indeed, the set of all feature maps may be interpreted as single map assigning a feature vector to each pixel, a subset of the feature map may then mathematically be described as a projection.
It is preferred to bin 324 the feature values, i.e., the range a feature can attain is divided in multiple smaller ranges (bins). The feature value may then be replaced, e.g., by a center point of the range in which it lies, or a serial number of the range. Binning reduces the accuracy of a feature but decreases relative sparsity of object values.
Although binning is preferred, it is not necessary.
Next the object value is mapped to a place 326 in a feature grid indicated by the (binned) feature values. The feature grid has the same number of dimensions as there are feature maps in the subset of feature maps. Each feature value indexes one dimension of the grid. 'Placing' an object value in the grid may be done simply by appending the object value to a list kept at that location of the grid. In more advanced implementations, the object value is placed by modifying running counters kept at that location of the grid. For example, one may keep running counters of the number of object values, their sum, and the sums of the squares. This allows easy later determination of the average and the (standard) variance. This optimization not only reduces the memory requirements, but also makes the memory requirement more predictable, making it better suited to hardware implementations.
In this manner all object values are placed in the grid 330.
It may be desired to include a processing step 340 before filtering. For example, in an implementation of placing step 326 that lists all object values in the corresponding bin, processing step 340 may be used to ensure that all locations in the grid have a single object value. This may be done by averaging those locations that have multiple object values placed on them. Locations that do not have any object values placed on them are set to a pre-determined value, say 0. One may derive a confidence value in step 340 as well. For example, one may compute the number of object values placed in a bin (high is good) and the variance of the object values in the bin (low is good) to obtain a confidence indication.
Next the grid is filtered 350; any non-object aligned filtering method may be used for this purpose. For example, a Gaussian filtering step may be applied to the grid. Depending on the level of object alignment of the feature maps used, working in the feature grid gives a resulting object alignment for the filtering. One may also fit the object values to a model, especially if a sufficient number of object values are available in a bin. For example, one may try to construct a linear model from the features to the object values, say using linear regression. If the number of object values is above a threshold and the error of the regression is low, this allows fitting of planes to object values.
One preferred way of placing is to maintain, e.g., accumulate, a number of sums for each bin. For example, the number of object values (S1) and the sum thereof (SD) may be maintained for each bin. Filter 350 may then be applied to the sums S1 and SD. This has the advantage that bins that did not receive data end up with a value S1 > 0. Accordingly SD/S1 then does not imply division by zero.
Also fitting of object values to a model, say a linear model, may use this approach. The coefficients representing the fitted model, say a and b of a particular vertical sloping plane, may be obtained as a function of maintained sums. Each of these sums may be separately filtered.
Using a type of slicing the grid may be read out to obtain a filtering for the object values. A dense set of object values is obtained for the entire image in which noise has been reduced.
A pixel is selected 362. Again the feature values for this pixel in the subset of feature maps are looked up 364. If needed the values are binned 366 as in step 324. The location indicated by the feature values is looked up 368 in the filtered feature grid; the filtered value there may be taken as the filtered object value corresponding to the selected pixel. The read out of the filtered feature grid may be repeated for all pixels in the image, so as to obtain a dense object value map. The grid may represent values in various ways, for example, the particular bin may have a fitted model, say a plane, fitted to the object values, and in that case the fitted model may be applied to the feature values to obtain the filtered object value.
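Putting the filtering and slicing steps together, a minimal sketch (hypothetical Python code; the Gaussian smoothing via SciPy and the random placeholder sums stand in for a real accumulation pass) could read:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
s1 = rng.random((16, 16)) * 10           # per-bin counts (placeholder)
sd = s1 * rng.random((16, 16))           # per-bin sums (placeholder)
bx = rng.integers(0, 16, (120, 160))     # per-pixel bin index, feature 1
by = rng.integers(0, 16, (120, 160))     # per-pixel bin index, feature 2

# Filter the sums, not their ratio: empty bins receive mass from their
# neighbours, so S1 > 0 (almost) everywhere and SD/S1 is well defined.
s1_f = gaussian_filter(s1, sigma=1.0)
sd_f = gaussian_filter(sd, sigma=1.0)
filtered_grid = sd_f / s1_f

# 'Slicing': each pixel reads the filtered value back from its own bin.
filtered_object_values = filtered_grid[bx, by]
```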
Possibly, an object aligned filtering apparatus comprises a microprocessor (not shown) which executes appropriate software stored at the apparatus; e.g. the software may have been downloaded and stored in a corresponding memory, e.g. RAM (not shown). The software may execute the method of Fig. 1. It is advantageous to implement the method, in whole or in part, in hardware. For example, mapping steps 322, 324, 326, 330 may well be done in hardware. A feature map need not be made explicit before step 322, but could be computed just in time.
Many different ways of executing a method according to the invention are possible, as will be apparent to a person skilled in the art. For example, the order of the steps can be varied or some steps may be executed in parallel. Moreover, in between steps other method steps may be inserted. The inserted steps may represent refinements of the method such as described herein, or may be unrelated to the method. For example, steps 322, 324, 326 may be executed, at least partially, in parallel for multiple pixels. Moreover, a given step may not have finished completely before a next step is started.
A method according to the invention may be executed using software, which comprises instructions for causing a processor system to perform method 100 or 300. Software may include only those steps taken by a particular sub-entity of the system. The software may be stored in a suitable storage medium, such as a hard disk, a floppy disk, a memory, etc. The software may be sent as a signal along a wire, or wireless, or using a data network, e.g., the Internet. The software may be made available for download and/or for remote usage on a server.
It will be appreciated that the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source code, object code, a code intermediate source and object code such as partially compiled form, or in any other form suitable for use in the implementation of the method according to the invention. An embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the processing steps of at least one of the methods set forth. These instructions may be subdivided into subroutines and/or be stored in one or more files that may be linked statically or dynamically. Another embodiment relating to a computer program product comprises computer executable instructions corresponding to each of the means of at least one of the systems and/or products set forth.
Below, various further implementation possibilities are discussed following Figs. 4-7. These implementation possibilities are exemplifying, not limiting.
As noted above, the multiple feature maps may be denoted as one multidimensional feature vector F(x,y) comprising features as components. Preferably, F has at least three features f1, f2 and f3 as components: F(x,y) = (f1(x,y), f2(x,y), f3(x,y)). We have a sparse and/or noisy set of measurements dm(xm,ym) of an unknown dense quantity d(x,y), for which we want an estimate using the information available in feature vector F.
Using the terminology of feature vectors we will also refer to a feature grid as a feature-space projection histogram. We use the term histogram since it is convenient to implement the method using running sums, but this is not necessary.
An advantageous method comprises mapping the measurements dm(xm,ym) and related data into at least two different sub-spaces having a lower dimensionality (preferably planes), the sub-spaces comprising information from at least one feature, conducting filtering of the measurements dm(xm,ym) and related data in these at least two different sub-spaces, calculating filtered estimates of d(x,y) in each sub-space and combining the at least two filtered estimates in an inverse mapping onto the dense grid.
Notably, the object values to be filtered, i.e. the measurements dm(xm,ym), need not be one of the features. Preferably, the filtering is done in multiple 2D feature spaces.
So for 5 features, one could do 10 two-dimensional filterings. Each of these 10 spaces gives a separate estimate of the filtered value.
Furthermore, the multiple (e.g. all 10) estimates are combined into a single estimate, e.g., using a weighting that depends on the variance of the variable (e.g. D) that is being filtered in each of the particular spaces. Note that other subspaces are also possible. For instance, as an alternative one may project the 5D space onto a number of 3D spaces. Those filtered values can then be combined using again a weighting, e.g., one that depends on the variance.
The fitting may assume a constant model, or one may produce a, for instance linear, disparity model, which allows us to fit a linear model to disparity and use that in the slicing to produce smooth surfaces. Multiple models can be weighted, again based on the variance/fit of each of the separate models. For example, this can be done by accumulating additional sums such as, for instance, Sy, Sy2 and SDy.

We will denote the object values as variable D(x,y) that we wish to filter using extra information that is available to us at pixel location (x,y). This extra information is represented as a multi-dimensional feature vector f(x,y) that is known for each pixel location (x,y). Its components can for instance consist of visual color information, texture information and/or location information. To filter D(x,y), we project D(x,y) into multiple feature-space projection histograms of f(x,y). Each of the projections of f(x,y) has a lower dimension than feature vector f(x,y) and is a projection onto fewer, say 2, of the N dimensions. A signal model m is now estimated for each bin in each feature-space projection. This model describes how D(x,y) behaves in a certain spatial domain (x,y) ∈ Ω. Model m can just be a constant level but also a sloping function (for example a function of the image coordinates). So a given pixel (x,y) is part of one or more spatial domains and each spatial domain has an associated signal model. The final estimate is a weighted combination of all estimates for all domains, where the weights depend on the model error E. The model error can for instance be the variance that D(x,y) has in each of the feature-space projection histograms.
An embodiment of the idea is illustrated by Fig. 5. To filter disparity D(x,y) we produce as an example three 2D feature-space projections of the 3D feature vector f = (r,g,b). Let F2 denote the set of all 2D projections of f: F2 = {(r,g),(r,b),(g,b)}. The three projected spaces are: rg-space, rb-space and gb-space. As a next step we visit all pixels in the image and produce three accumulation arrays per feature space. One array just accumulates the value 1 in a given bin, indicated by S1(r,g) in the first row of Fig. 5. This is a histogram. A second array accumulates the disparity D (second row in Fig. 5) and a third array accumulates the squared disparities D2 (third row in Fig. 5).
We can now calculate the mean squared deviation from the mean disparity in each histogram bin for the projections using the equation for the variance:
$$E(f_i,f_j) = \frac{S_{D^2}(f_i,f_j)}{S_1(f_i,f_j)} - \left(\frac{S_D(f_i,f_j)}{S_1(f_i,f_j)}\right)^2 \qquad (1)$$
In this example, we have three such variances since (fi, fj) ∈ F2 = {(r,g),(r,b),(g,b)}.
Each variance provides information on the fit of the data (disparity values) when assuming a constant as a model. The final filter that is applied to the data is a weighted combination of the mean disparities over all 2D projections of feature space:
$$\hat{D}(x,y) = \frac{\sum_{(f_i,f_j)\in F^2} \frac{1}{E(f_i,f_j)} \, \frac{S_D(f_i,f_j)}{S_1(f_i,f_j)}}{\sum_{(f_i,f_j)\in F^2} \frac{1}{E(f_i,f_j)}} \qquad (2)$$
wherein E denotes the mean squared deviation from the mean disparity. The higher the error E, the lower the weight of the corresponding disparity estimate in the weighted combination. The effect of this filter is illustrated in Figs. 4a and 4b.
Fig. 4a is a picture displayed together with sparse disparity measurements. The figure in the foreground contains long disparity vectors whereas the background contains short disparity vectors. We wish to obtain a dense (i.e. per pixel) disparity map from these sparse measurements. Fig. 4a is a grayscale representation of the actual picture. The background of Fig. 4a is blue, the foreground figure is black.
Fig. 4b illustrates three feature space projection histograms for the example image in Fig. 4a. Since all disparities map into a single bin in rg-space, the variance of disparity is high in that space. Both the rb- and the gb-space give a good separation of figure and background and therefore provide reliable estimates. The filter operation just averages all three estimates but gives the estimates with lower variance a higher weight.
Using RGB features typically works well if the constant color object assumption holds. For example, application of the above filter can be useful when objects are partly illuminated and partly in the shadow. This is a frequently occurring situation. For instance a house can cast a shadow on an otherwise brightly illuminated grass field. In this case it would be beneficial that there is at least a single domain Ω in the image that selects all the disparity measurements of the entire grass field together, ignoring the shadow, to fit a planar model to the data. A single shadow cast over both objects has the effect that the color intensity decreases in, e.g., both the green and blue channel. To separate objects based on color we could use the color channel ratios as features:
$$f = (r', g', b', I) \quad \text{with} \quad r' = \frac{r}{I}, \quad g' = \frac{g}{I}, \quad b' = \frac{b}{I}, \quad I = r + g + b$$
This feature vector gives us the six 2D feature space projections:
F2 = {(r',g'), (r',b'), (r',I), (g',b'), (g',I), (b',I)}
In these projections, pixels are grouped based on their color, their intensity and both these aspects together.
Other features may also be used. The above approach can be extended to other features than only the color values. First of all, one could add the horizontal and vertical image coordinates. In such a case, it is beneficial to perform some type of filtering on the histogram, to avoid visibility of the grid pattern of the histogram.
Moreover, one could add not only horizontal and vertical image coordinates, but also differently oriented coordinates. This allows fitting of models not only to horizontal and vertical structures, but also differently oriented objects.
Another valuable addition to the feature space is a texture measure. This could e.g. be local variance. Such a feature would allow us for example to distinguish between a textured and a homogeneous object, even though they have the same mean color and mean intensity values (and are also similar for other features).
The possibilities are not limited to the ones described above: the space can be extended with many other features, where distinctive features for the specific data set can be chosen. One could think for example of distance from the center, or from another coordinate in the image, orientation, any type of geometric transformation, multiscale color and texture values, etc.
As opposed to bilateral grids, where an additional feature adds a dimension to the filtering space and causes the problems described above, in our approach the set of projections is extended, but the projections still have the same dimensions. Generally, if for example all 2D projections of a space of N features are used, we get N(N-1)/2 projections (the division by 2 is because the order of the features in a projection plane is unimportant). So the number of 2D spaces grows quadratically with the number of features. Note that one could also use higher dimensional projections, e.g., 3D projections.
In a simple embodiment the filtering step in the feature grid assumes that pixels sharing the same feature combination have the same object value, e.g., are at the same depth. In a more advanced implementation, it is assumed that the object values of pixels sharing the same feature combination are related, e.g., have a linear relation. Consider as an example a plane, say a grass field, treated as a whole irrespective of illumination, using color only. In this case separation on the basis of color is limited, since a constant value does not fit well as disparity model for a plane with varying depth. The invention extends well to more general models. As an example consider the following disparity model for the grass field: D(x,y) = a·y + b.

The disparity is then a linear function of the image y-coordinate. The parameters a and b can be determined using linear regression. For instance, Press, H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, 1998, gives the equations for linear regression. Using the already introduced notation these equations become:
$$a = \frac{S_1 S_{Dy} - S_y S_D}{S_1 S_{y^2} - S_y^2}, \qquad b = \frac{S_{y^2} S_D - S_y S_{Dy}}{S_1 S_{y^2} - S_y^2}$$
Estimating these parameters is possible by accumulating the additional sums Sy, Sy2 and SDy, all per bin. Again these sums are needed for each of the feature space projections. This is illustrated in Fig. 6.
Fig. 6 shows additional feature space histograms with sums that are needed in case a linear model of the form D = a·y + b is required to describe the disparity in part of the image.
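A sketch of the per-bin regression (hypothetical Python code; the sums may be scalars, as in the single-bin check below, or arrays over the feature grid, and the small epsilon against degenerate bins is an assumption) could be:

```python
import numpy as np

def fit_plane_per_bin(s1, sy, sy2, sd, sdy, eps=1e-9):
    # Least-squares solution of D = a*y + b from the per-bin sums
    # S1, Sy, Sy2, SD and SDy (scalars or arrays over the feature grid).
    det = s1 * sy2 - sy ** 2
    a = (s1 * sdy - sy * sd) / (det + eps)
    b = (sy2 * sd - sy * sdy) / (det + eps)
    return a, b

# Single-bin check on synthetic data drawn from D = 2*y + 0.5 plus noise.
rng = np.random.default_rng(0)
y = rng.random(200)
d = 2.0 * y + 0.5 + 0.01 * rng.standard_normal(200)
a, b = fit_plane_per_bin(len(y), y.sum(), (y ** 2).sum(),
                         d.sum(), (d * y).sum())
```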
Let E denote the mean squared error of the model fit. This error can be expressed in terms of the sums accumulated so far and the coefficients a and b:

$$E = \frac{1}{N} \sum_{i=1}^{N} \left( D_i - (a y_i + b) \right)^2$$

$$E = \frac{1}{N} \sum_{i=1}^{N} \left( D_i^2 - 2 a D_i y_i - 2 b D_i + a^2 y_i^2 + 2 a b y_i + b^2 \right)$$

$$E = \frac{S_{D^2} - 2 a S_{Dy} - 2 b S_D + a^2 S_{y^2} + 2 a b S_y}{S_1} + b^2$$
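Expressed in code, the last formula becomes a one-line function; the sketch below (hypothetical Python, not part of the disclosed embodiments) checks it on a perfect linear relation, where the error should vanish:

```python
import numpy as np

def model_error(s1, sd, sd2, sy, sy2, sdy, a, b):
    # Mean squared error of D = a*y + b, evaluated from per-bin sums.
    return (sd2 - 2*a*sdy - 2*b*sd + a**2*sy2 + 2*a*b*sy) / s1 + b**2

# Sanity check: an exact linear relation gives (numerically) zero error.
y = np.linspace(0.0, 1.0, 100)
d = 2.0 * y + 0.5
err = model_error(len(y), d.sum(), (d ** 2).sum(), y.sum(),
                  (y ** 2).sum(), (d * y).sum(), a=2.0, b=0.5)
```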
Note also that using the above error-based weighting, models of different complexity can be straightforwardly combined. In this way, for our grass field example, we could calculate the error measure for both the constant and the planar model. The error for the planar model would be smaller than for the constant model, and the planar model would therefore get a stronger weight in the depth calculation. Denoting the disparity estimate for model k as Dk (e.g., D0 = SD(fi,fj)/S1(fi,fj) for the constant model), we can extend (2) to get
$$\hat{D}(x,y) = \frac{\sum_{k} \sum_{(f_i,f_j)\in F^2} \frac{w_k(S(f_i,f_j))}{E_k(f_i,f_j)} \, D_k(f_i,f_j)}{\sum_{k} \sum_{(f_i,f_j)\in F^2} \frac{w_k(S(f_i,f_j))}{E_k(f_i,f_j)}} \qquad (3)$$
where Ek(fi,fj) is the corresponding error measure for disparity model k and wk(S(fi,fj)) is a function that allows giving more or less weight to a model depending on the model complexity and/or the number of measurements in a particular bin and feature space projection. In fact, this can be generalized even further, and we can combine projection spaces of varying dimensionality. We could for example use a 1D projection of the image onto luminance space, and combine that with the above two-dimensional color projections. Similarly, we can also use multiple bin sizes, allowing large objects with more internal variation to be grouped at a larger bin size level, and selecting a narrow separation between objects at a smaller bin size level. This could even be worked out in a multi-scale manner to improve efficiency.

Figs. 7a and 7b show left and right images of a stereoscopic pair. The images show an office corridor. The image shows two walls, one straight ahead from the camera and one slanting away in the background. A carpet leads up from the bottom end of the image towards the wall.
The image is difficult for depth map estimation, since it contains many homogeneous planes. Furthermore, the ground plane should show a smooth transition from foreground to background.
Fig. 7c shows a depth map constructed with a bilateral filter as described above but without allowing a fit to a linear ground plane (darker colors have a higher depth value). Fig. 7d shows a depth map constructed with a bilateral filter as described above while allowing a fit to a linear ground plane. As can be seen, allowing a planar fit to the data gives a much smoother result for the ground plane.
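Returning to the error-based combination of disparity models introduced above, the following is a minimal sketch of that fusion step. It is illustrative only: the function and array names are hypothetical, and the exp(−E) error weighting is one plausible choice; the text above only requires that the weight decrease with the model error.

    import numpy as np

    def combine_models(D_models, E_models, w_models):
        """Fuse per-pixel disparity estimates from several models/projections.

        D_models, E_models: lists of HxW arrays (estimate and its fit error,
        already looked up per pixel from the feature-space histograms).
        w_models: list of scalar weights, e.g. penalizing model complexity.
        """
        num = np.zeros_like(D_models[0], dtype=float)
        den = np.zeros_like(D_models[0], dtype=float)
        for D_k, E_k, w_k in zip(D_models, E_models, w_models):
            wt = w_k * np.exp(-E_k)   # weight shrinks as the fit error grows
            num += wt * D_k
            den += wt
        return num / np.maximum(den, 1e-12)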
Another improvement is, prior to applying the filter, to select a subset of all feature-space projection histograms based on some global confidence measure of each feature-space projection. For instance, if the variance of D is high in all bins of a given feature-space projection histogram, then we can conclude that that particular combination of features does not model one single object well in terms of disparity. In that case, the entire histogram can be omitted from the weighting operation for each image pixel, since its contribution would be small anyway. Such an analysis can significantly reduce computational complexity.
One approach is to sort all histograms in order of increasing minimum model error, where the minimum is taken over all bins in each particular accumulation array. The first N arrays in the resulting sorted list, i.e., those with the lowest minimum error, can then be applied in the filter. We can for instance use the top 10 accumulation arrays out of a total of 100, thereby reducing the complexity of the weighted averaging by a factor of 10.
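A minimal sketch of this selection step, assuming each accumulation array already stores a per-bin model error (all names are illustrative):

    import numpy as np

    def select_histograms(errors_per_array, top_n=10):
        """Keep only the accumulation arrays whose best (minimum) bin error
        is lowest; the remaining arrays are skipped during the weighted
        averaging, reducing its cost proportionally.

        errors_per_array: list of arrays, one per feature-space projection,
        holding the per-bin model fit error.
        """
        min_err = [float(np.nanmin(e)) for e in errors_per_array]
        order = np.argsort(min_err)              # increasing minimum error
        return sorted(order[:top_n].tolist())    # indices of retained arrays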
Prior models can also be used to steer the solution. For instance, for all projection spaces that involve the image y-coordinate we can apply a prior favoring a slanted model. Likewise, a space combining the blue chroma b and the y-coordinate can have a strong prior towards large depth for bins that have a large y-coordinate and a large b-coordinate at the same time, due to the fact that 'sky' is always far away. These prior models can also be determined experimentally by averaging the feature-space projection histograms over various stereo video sequences.
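One plausible way to realize such a prior, sketched here under the assumption that the constant model stores per-bin disparity sums S_D and counts S_1, is to inject the prior as pseudo-observations; the names and the pseudo-count mechanism are illustrative, not prescribed by the text above:

    import numpy as np

    def apply_prior(S_D, S_1, prior_D, prior_weight):
        """Blend a per-bin prior disparity into the accumulated sums.

        prior_D: per-bin prior disparity, e.g. averaged over training
        sequences; prior_weight: pseudo-count controlling its influence.
        The per-bin mean S_D / S_1 is then pulled towards prior_D in
        bins with few real measurements.
        """
        S_D = S_D + prior_weight * prior_D
        S_1 = S_1 + prior_weight
        return S_D, S_1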
After construction of the different feature-space histograms, various filtering operations may be applied to smoothen the histogram data. One could for example apply an averaging or Gaussian smoothing filter to the histograms, which could differ depending on the specific subset of features considered in each projection.
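A minimal sketch using scipy's Gaussian filter; smoothing the numerator and denominator arrays with the same kernel keeps the per-bin average a proper weighted mean (the function name is illustrative):

    from scipy.ndimage import gaussian_filter

    def smooth_histogram(S_D, S_1, sigma=1.0):
        # Smooth sums and counts identically so S_D / S_1 remains
        # a weighted mean over the (blurred) bin neighborhood.
        return (gaussian_filter(S_D.astype(float), sigma),
                gaussian_filter(S_1.astype(float), sigma))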
Another option is to perform a splatting or interpolation operation when filling the histogram bins and when extracting a filtered image, respectively. In this way, the data are not entered into or extracted from a single bin in each histogram, but spread over each of the surrounding bins with weights depending on the distance to the respective bin centers. This also has a smoothing effect on the filtered image.
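A minimal sketch of bilinear splatting into a 2D feature grid (names are illustrative; the extraction side would interpolate with the same weights):

    import numpy as np

    def splat_bilinear(grid, counts, u, v, value):
        """Distribute one measurement over the four surrounding bins of a
        2D feature grid, weighted by distance to the bin centers.

        u, v: continuous (non-integer) bin coordinates of the measurement.
        """
        u0, v0 = int(np.floor(u)), int(np.floor(v))
        du, dv = u - u0, v - v0
        for i, wu in ((u0, 1 - du), (u0 + 1, du)):
            for j, wv in ((v0, 1 - dv), (v0 + 1, dv)):
                if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]:
                    grid[i, j] += wu * wv * value
                    counts[i, j] += wu * wv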
Besides the processing described above, which can generally be seen as an averaging (or regression) operation in which sums of variables are gathered in the histograms, other options are also available. We could for example store for each bin the minimum and/or maximum value, and in a second pass perform outlier detection on these values to remove outliers.
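A sketch of the first pass under the assumption of flat bin indices; min_arr and max_arr are initialized to +inf and −inf, and a second pass would then flag, for example, measurements near an extreme that lies far from the bin mean:

    import numpy as np

    def track_min_max(min_arr, max_arr, bin_idx, D):
        # np.minimum.at / np.maximum.at scatter element-wise extremes
        # into the per-bin arrays in place.
        np.minimum.at(min_arr, bin_idx, D)
        np.maximum.at(max_arr, bin_idx, D)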
If confidence values are used, one may also generate an overall confidence map for the filtered image without much additional effort. This confidence map can be used for post-processing the filtered image, as it indicates the total filter weight for each part of the image. One type of post-processing could be to replace the most uncertain pixels by their more certain neighbors.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

CLAIMS:
1. An object-aligned filtering method (100) for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the method comprising
obtaining multiple feature maps (120) for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects,
- a first filtering (142) of the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps,
a second filtering (144) of the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset,
deriving a single filtered object value (160) for the selected pixel from at least the first and second filtered object value.
2. A filtering method as in Claim 1, wherein the first subset and the second subset share at least one feature map.
3. A filtering method as in any one of the preceding claims, further comprising determining a first confidence (152) indicative of the object alignment of the first filtering and a second confidence (154) indicative of the object alignment of the second filtering, and wherein
the deriving of the single filtered object value further depends on the first and second confidence.
4. A filtering method as in any one of the preceding claims wherein the object values represent one of
depth data, or
motion data.
5. A filtering method as in any one of the preceding claims wherein the set of object values is sparse, and wherein a single filtered object value is derived for all the multiple pixels.
6. A filtering method as in any one of the preceding claims wherein the first subset and/or the second subset contain exactly two feature maps.
7. A filtering method as in any one of the preceding claims wherein at least one feature map in the first subset and/or the second subset is derived (122) from image values of the multiple pixels.
8. A filtering method as in any one of the preceding claims wherein at least one feature map in the first subset and/or the second subset is independent from the image values of the multiple pixels (124).
9. A filtering method as in any one of the preceding claims wherein the first and/or second filtering comprises
mapping the object values to a grid (326) indexed by the feature value represented by the first and/or second subset of the multiple feature maps, the dimension of the grid being the same as the number of feature maps in the subset of the multiple feature maps.
10. A filtering method as in Claim 9 comprising filtering over the grid (350).
11. A filtering method as in Claim 9 or 10, wherein the grid is divided into multiple bins (324), a bin representing multiple combinations of feature values, the method comprising determining a variance of the object values in a bin, and deriving a confidence therefrom.
12. An object-aligned filtering apparatus for filtering a set of object values corresponding to an image depicting said objects, the image having multiple pixels, each object value of the set of object values corresponds to a pixel of the multiple pixels so that at least part of the pixels of the multiple pixels have a corresponding object value of the set of object values, the apparatus comprising a feature map obtaining module (222, 224) configured to obtain multiple feature maps for the image, a feature map assigns a feature value to a pixel of the multiple pixels, at least some of the feature maps being related to the objects,
a filtering module (240) configured to filter the object values to obtain a first filtered object value for a selected pixel of the multiple pixels, the first filtering being guided by a first subset of the multiple feature maps, and configured to filter the object values to obtain a second filtered object value for the selected pixel, the second filtering being guided by a second subset of the multiple feature maps, the second subset being different from the first subset,
- a combination module (260) configured to derive a single filtered object value for the selected pixel from at least the first and second filtered object value.
13. A computer program comprising computer program code means adapted to perform all the steps of any one of the method claims 1 to 11 when the computer program is run on a computer.
14. A computer program as claimed in claim 13 embodied on a computer readable medium.
PCT/EP2013/053370 2012-02-21 2013-02-20 Object-aligned filtering method and apparatus WO2013124312A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261601082P 2012-02-21 2012-02-21
US61/601,082 2012-02-21

Publications (1)

Publication Number Publication Date
WO2013124312A1 true WO2013124312A1 (en) 2013-08-29

Family

ID=47843251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2013/053370 WO2013124312A1 (en) 2012-02-21 2013-02-20 Object-aligned filtering method and apparatus

Country Status (1)

Country Link
WO (1) WO2013124312A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463819A (en) * 2013-09-20 2015-03-25 汤姆逊许可公司 Method and apparatus for filtering an image
EP2887309A1 (en) * 2013-12-20 2015-06-24 Thomson Licensing Method and apparatus for filtering an image
US9384538B2 (en) 2013-07-26 2016-07-05 Li-Cor, Inc. Adaptive noise filter
US9953404B2 (en) 2013-07-26 2018-04-24 Li-Cor, Inc. Systems and methods for setting initial display settings
US10395350B2 (en) 2013-07-26 2019-08-27 Li-Cor, Inc. Adaptive background detection and signal quantification systems and methods
EP3707674A4 (en) * 2018-03-09 2020-12-02 Samsung Electronics Co., Ltd. Method and apparatus for performing depth estimation of object

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008062351A1 (en) * 2006-11-21 2008-05-29 Koninklijke Philips Electronics N.V. Generation of depth map for an image
US20100110070A1 (en) * 2008-11-06 2010-05-06 Samsung Electronics Co., Ltd. 3d image generation apparatus and method
WO2010052632A1 (en) 2008-11-04 2010-05-14 Koninklijke Philips Electronics N.V. Method and device for generating a depth map
US20110285910A1 (en) * 2006-06-01 2011-11-24 Canesta, Inc. Video manipulation of red, green, blue, distance (RGB-Z) data including segmentation, up-sampling, and background substitution techniques

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110285910A1 (en) * 2006-06-01 2011-11-24 Canesta, Inc. Video manipulation of red, green, blue, distance (RGB-Z) data including segmentation, up-sampling, and background substitution techniques
WO2008062351A1 (en) * 2006-11-21 2008-05-29 Koninklijke Philips Electronics N.V. Generation of depth map for an image
WO2010052632A1 (en) 2008-11-04 2010-05-14 Koninklijke Philips Electronics N.V. Method and device for generating a depth map
US20100110070A1 (en) * 2008-11-06 2010-05-06 Samsung Electronics Co., Ltd. 3d image generation apparatus and method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHEN, JIAWEN; PARIS, SYLVAIN; DURAND, FREDO: "Real-time edge-aware image processing with the bilateral grid", ACM SIGGRAPH, 2007
FREDERIC GARCIA ET AL: "A new multi-lateral filter for real-time depth enhancement", ADVANCED VIDEO AND SIGNAL-BASED SURVEILLANCE (AVSS), 2011 8TH IEEE INTERNATIONAL CONFERENCE ON, IEEE, 30 August 2011 (2011-08-30), pages 42 - 47, XP032053721, ISBN: 978-1-4577-0844-2, DOI: 10.1109/AVSS.2011.6027291 *
HE, KAIMING; SUN, JIAN; TANG, XIAOOU: "Guided image filtering", PROCEEDINGS OF THE 11TH EUROPEAN CONFERENCE ON COMPUTER VISION: PART I, ECCV'10, 2010
PRESS, H.; TEUKOLSKY, S.A.; VETTERLING, W.T.; FLANNERY, B.P.: "Numerical Recipes in C: The Art of Scientific Computing", 1998, CAMBRIDGE UNIVERSITY PRESS
S. BATTIATO ET AL.: "Depth Map Generation by Image Classification", PROCEEDINGS OF SPIE ELECTRONIC IMAGING, vol. 5302-13, 2004

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9384538B2 (en) 2013-07-26 2016-07-05 Li-Cor, Inc. Adaptive noise filter
US9747676B2 (en) 2013-07-26 2017-08-29 Li-Cor, Inc. Adaptive noise filter
US9953404B2 (en) 2013-07-26 2018-04-24 Li-Cor, Inc. Systems and methods for setting initial display settings
US10395350B2 (en) 2013-07-26 2019-08-27 Li-Cor, Inc. Adaptive background detection and signal quantification systems and methods
CN104463819A (en) * 2013-09-20 2015-03-25 汤姆逊许可公司 Method and apparatus for filtering an image
JP2015060593A (en) * 2013-09-20 2015-03-30 トムソン ライセンシングThomson Licensing Method and apparatus for filtering image
EP2851867A3 (en) * 2013-09-20 2015-06-03 Thomson Licensing Method and apparatus for filtering an image
US9286663B2 (en) 2013-09-20 2016-03-15 Thomson Licensing Method and apparatus for filtering an image using a guidance image
CN104463819B (en) * 2013-09-20 2019-03-08 汤姆逊许可公司 Image filtering method and device
EP2887309A1 (en) * 2013-12-20 2015-06-24 Thomson Licensing Method and apparatus for filtering an image
EP3707674A4 (en) * 2018-03-09 2020-12-02 Samsung Electronics Co., Ltd. Method and apparatus for performing depth estimation of object
US11244464B2 (en) 2018-03-09 2022-02-08 Samsung Electronics Co., Ltd Method and apparatus for performing depth estimation of object

Similar Documents

Publication Publication Date Title
US11562498B2 (en) Systems and methods for hybrid depth regularization
JP6563453B2 (en) Generation of a depth map for an input image using an exemplary approximate depth map associated with an exemplary similar image
EP3869797B1 (en) Method for depth detection in images captured using array cameras
CN107277491B (en) Generate the method and corresponding medium of the depth map of image
US8447141B2 (en) Method and device for generating a depth map
Hong et al. A local stereo matching algorithm based on weighted guided image filtering for improving the generation of depth range images
WO2013124312A1 (en) Object-aligned filtering method and apparatus
RU2423018C2 (en) Method and system to convert stereo content
JP2015522198A (en) Depth map generation for images
Ko et al. 2D to 3D stereoscopic conversion: depth-map estimation in a 2D single-view image
WO2016202837A1 (en) Method and apparatus for determining a depth map for an image
TWI678098B (en) Processing of disparity of a three dimensional image
Ji et al. An automatic 2D to 3D conversion algorithm using multi-depth cues
CN103380624A (en) Processing depth data of a three-dimensional scene
EP2680224B1 (en) Method and device for determining a depth image
JP2013172214A (en) Image processing device and image processing method and program
CN107610070B (en) Free stereo matching method based on three-camera collection
JP5545059B2 (en) Moving image processing method, moving image processing apparatus, and moving image processing program
Lee et al. Disparity Refinement with Guided Filtering of Soft 3D Cost Function in Multi-view Stereo System
Gherardi Confidence-based cost modulation for stereo matching
Raviya et al. Depth and Disparity Extraction Structure for Multi View Images-Video Frame-A Review
Fauzan et al. New Stereo Vision Algorithm Composition Using Weighted Adaptive Histogram Equalization and Gamma Correction.
Bauza et al. A multi-resolution multi-size-windows disparity estimation approach
Chuang et al. Single image depth estimation based on dark channel prior
Nguyen Three-dimensional propagation algorithm for depth image-based rendering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13708110

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13708110

Country of ref document: EP

Kind code of ref document: A1