WO2022003740A1 - Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof - Google Patents
- Publication number
- WO2022003740A1 (PCT/IT2021/050193)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- the present invention relates to a method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof.
- the invention relates to a method and a sensor system of the mentioned type, designed in particular for determining the confidence of disparity maps inferred by a stereo algorithm or a network through a neural network capable of self-adapting, but which can be used for any type of image acquisition system, in which it is necessary to estimate the confidence, thus determining the level of certainty or uncertainty of each pixel of said image.
- the description will be directed to a self-supervised confidence estimation in a constrained setting, but it is clear that the same should not be considered limited to this specific use.
- stereo is one of the most popular strategies to accurately perceive the 3D structure of the scene, through two synchronized cameras and several algorithms, either hand-designed or based on deep neural networks. In many practical applications, alongside with disparity inference, confidence estimation is often performed as well. Purposely, a wide range of methods based either on hand-crafted measures or learning-based strategies have been proposed.
- Such a feature is highly desirable, since it potentially paves the way for learning confidence estimation for any stereo camera, even without any knowledge about the stereo algorithm/network deployed.
- a drawback of the technical solution according to the prior art is that it needs access to the cost volume, rarely exposed in the case of off-the-shelf stereo sensors mentioned above or not defined at all in most modern neural networks.
- Confidence measures can be divided into two main categories: hand-made and learned measures.
- the former category consists of conventional methods, typically computed from cost volume analysis, such as the ratio between the two minima (as in the so-called Peak-Ratio or PKR) or, as recently proposed, from local properties of the disparity map, like the number of pixels with the same disparity hypothesis.
- hand-made cues are usually combined and fed as input to a random forest classifier or to a Convolutional Neural Network (CNN), appropriately trained by deploying depth labels.
- Learned methods may require:
- CNNs have replaced single steps in the stereo pipeline, such as cost computation, rapidly converging towards end-to-end solutions estimating dense disparity maps by means of 2D or 3D networks.
- the latest trend in the field consists of casting disparity estimation as a continuous learning problem, thanks to the self-supervision enabled by image re-projection.
- Another object of the present invention is that of providing a method for self-adapting a confidence measure unconstrained to the stereo system deployed.
- a further object of the invention is that of providing a novel loss function built upon cues available from the input stereo pair and the output disparity only, needing no additional information to learn/adapt to the sensed environment.
- Another object of the present invention is that of providing a method and a system of high reliability, easy to implement, and competitive in terms of costs when compared to the known technique.
- object of the present invention is to provide the tools necessary for the execution of the method and the apparatuses to perform such method.
- said at least one self-supervising criterion extracted in said step C may comprise at least one of the following criteria: a self-supervising criterion related to an image reprojection error between said at least one digital image I L , I R ; a self-supervising criterion related to a disparity agreement between pixels of said disparity map D L ; and/or a self-supervising criterion related to the uniqueness of any pixel in I L and I R respectively.
- said self-supervising criterion related to an image reprojection error between said at least one digital image I L , I R may be calculated according to the following equation:
T(x,y) = α · (1 − SSIM(I L , Ĩ L )(x,y)) / 2 + (1 − α) · |I L (x,y) − Ĩ L (x,y)|
wherein Ĩ L is a reprojection of I R on reference image coordinates, obtained by warping I R according to said disparity map D L , SSIM is the Structural SIMilarity index and α is a parameter ranging from 0 to 1, preferably tuned to 0.85.
- said self-supervising criterion related to a disparity agreement between pixels of said disparity map D L may be calculated according to the following equation:
A DA (x,y) = H NxN (x, y, D L (x,y)) / N²
with the criterion being met when A DA > 0.5, wherein H NxN is a histogram encoding, for each pixel (x,y) of said disparity map D L , the number of neighbours in a NxN window having the same disparity d.
- said self-supervising criterion related to the uniqueness of any pixel in I L and I R respectively may be calculated by counting, through said disparity map D L , how many pixels of I L match the same pixel of I R , the criterion being met for a pixel (x,y) when it is the only one matching its corresponding pixel (x − D L (x,y), y) in I R .
- said loss signal L MBCE is a Multi-modal Binary Cross Entropy loss signal which may be calculated according to the following equation:
L MBCE = − Σ Pk∈P Σ (x,y)∈Pk log o(x,y) − Σ Qk∈Q Σ (x,y)∈Qk log (1 − o(x,y))
where o ∈ [0,1] is the output of said neural network (14), P and Q are two sets of proxy labels derived respectively by a self-supervising criterion comprised in said at least one self-supervising criteria being met or not.
- said step B may be carried out by means of a network S.
- said step A may be carried out by an image detecting unit comprising at least one image detecting device for detecting said at least one digital image I L , I R
- said step B may be carried out by first processing means, connected to said image detecting device
- said step C may be carried out by a filter, connected to said image detecting device and said first processing means
- said steps E and F may be carried out by a second processing means, connected to said filter and said neural network.
- said step A may be carried out by a stereo matching technique, so as to detect a reference image I L and a target image I R of said scene.
- the processing unit may comprise processing means, connected to said image detecting unit, a filter, connected to said image detecting unit and said processing means, and configured for extracting at least one self-supervising criteria from said at least one digital image I L , I R and said disparity map D L , and a neural network, connected to said processing means, configured for producing a confidence map CM from said disparity map D L , wherein said processing means are configured for determining a disparity map D L from said at least one digital image I L , I R , and for calculating a loss signal from said confidence map CM and said at least one self-supervising criteria.
- a sensor system for determining the confidence of a disparity map D L from at least one digital image I L , I R of a scene, comprising an image detection unit configured for acquiring said at least one digital image I L , I R of said scene, and a processing unit connected to said image detecting unit.
- It is also object of the present invention a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the present method steps.
- Fig. 1 illustrates a block diagram of an embodiment of the sensor system for determining the confidence of a disparity map by self-adaptive learning of a neural network, according to the present invention
- Fig. 2 illustrates a flowchart concerning the steps of the method for determining the confidence of a disparity map by self-adaptive learning of a neural network, according to the present invention
- Fig. 3 illustrates, given a highlighted region, a set of inliers and a set of outliers, which are determined by using different configurations of self-supervising criteria, according to the present invention
- Fig. 4 illustrates a table which reports AUC scores for networks trained on a first set of test images and tested on an unseen set of test images
- Fig. 5 illustrates from left: a reference image, a disparity map and confidence maps by existing self-supervised approaches [2], [1], the proposed technique and the proposed technique during online adaptation;
- Fig. 6 illustrates two examples of reference image and disparity map acquired with an iPhone XS, followed by the estimated confidence map after a few iterations of on-the-fly learning;
- Fig. 7 illustrates a reference image, disparity maps from various algorithms and the confidence estimated by self-supervised frameworks and the present method.
- a sensor system for determining the confidence of a disparity map through a self-adaptive learning of a neural network which comprises an image detecting unit 10 and a processing unit U, connected to said image detecting unit 10.
- said processing unit U comprises first processing means 11 connected to said image detecting unit 10, a filter 12 connected to said image detecting unit 10 and said first processing means 11 , second processing means 13, connected to said filter 12, and a neural network or confidence network 14, connected to said first processing means 11 and to said second processing means 13.
- said first processing means 11 and said second processing means 13 are two different processing means.
- said first processing means 11 and said second processing means 13 can be considered as the same processing means or integrated, for instance in a same microprocessor.
- said image detecting unit 10 is a stereoscopic vision system.
- said image detecting unit 10 can be any other system even according to the prior art capable of obtaining disparity or distance maps from digital images or other methods.
- said image detecting unit 10 comprises a first image detecting device 100 and a second image detecting device 101 , such as a video camera, a photo camera or a sensor, arranged at a predetermined fixed distance from each other.
- the image detecting unit 10 can comprise a number of detecting devices other than two, for example, one, as in monocular systems for depth estimation from images.
- each of said image detecting devices 100, 101 detects a respective image of the object or the scene observed.
- the image acquired by means of said image detecting device 100, i.e. the left image, will be considered as the reference image I L , while the image acquired through said image detecting device 101, i.e. the right image, will be considered as the target image I R .
- each image acquired by the respective detection device 100, 101 can be considered as reference image I L or target image I R .
- said first processing means 11 are connected to said image detecting devices 100, 101 .
- said first processing means 11 are configured to process said images I L and I R in order to generate a disparity map D L .
- the output disparity map D L is computed assuming I L as the reference image.
- the output disparity map can be computed assuming I R as the reference image.
- said first processing means 11 generates said disparity map D L by means of a stereo algorithm S.
- said first processing means 11 may employ additional algorithms, networks, programs or other computer sensors, capable of generating disparity maps.
- said filter 12 is capable of extracting a plurality of self-supervising criteria from said disparity map D L and said images I L and I R , in order to provide a self-adaptive learning of said confidence network 14, as better explained below.
- the extracted self-supervising criteria are three, herein referred to as T, A and U, related to image re-projection error, disparity agreement between nearby pixels of an image and uniqueness constraint between pixels of different images, respectively.
- Said second processing means 13 are then configured to determine an evaluation of the loss based on said three self-supervising criteria T, A and U, in order to evaluate the output of the neural network 14, so as to train the same online, namely during its operation, without any external data for training the same.
- the second processing means 13 calculate a Multimodal Binary Cross Entropy (MBCE) loss signal from a combination of the outcomes of said three self-supervising criteria T, A and U, and a confidence map CM is computed by said confidence network 14.
- said confidence network 14 is connected to said first processing means 11 and to said second processing means 13.
- said confidence network 14 is configured to determine said confidence map CM from said disparity map D L .
- said confidence map CM ranks pixels of the disparity map D L from less to more reliable (from black to white).
- said confidence network 14 is capable of updating its own knowledge of the surrounding environment by means of the evaluation of the Multi-modal Binary Cross Entropy (MBCE) loss signal computed by said second processing means 13.
- the first processing means 11 , the second processing means 13, the filter 12, and the neural network or confidence network 14, can be integrated in a single processing unit U, properly programmed.
- FIG. 2 a flowchart of the method according to the present invention is shown, which can be executed also by the system of Fig. 1.
- the step of acquiring images provides the acquisition of a reference image I L and a target image I R related to an object or a scene observed by means of said image detecting unit 10.
- step B said first processing means 11 process said images I L and I R , in order to generate a disparity map D L by means of said stereo algorithm S.
- the present method provides an image processing using a stereo algorithm S.
- step C said filter 12 extracts said three self- supervising criteria T, A and U from the two images I L and I R and the disparity map D L .
- step D said confidence network 14 determines said confidence map CM from said disparity map D L .
- step E said second processing means 13 compute the MBCE loss signal from said confidence map CM and a combination of one or more of said self-supervising criteria T, A and U. It is noted that also other self-supervising criteria from said disparity map D L can be used, alternatively or in addition to the three self-supervising criteria T, A and U specified above, without departing from the scope of protection of the invention herein disclosed.
- step F said confidence network 14 is updated based on said MBCE loss signal computed in said step E.
- the parameters of said neural network 14 are continuously updated in order to adapt said neural network 14 itself to the environment related to the scene observed.
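The continuous update of steps D-F can be pictured with a deliberately tiny stand-in for the confidence network 14: a per-pixel logistic model over a hypothetical hand-picked feature, adapted online by gradient descent on a binary cross entropy over proxy labels. The feature, its value ranges and the learning rate are illustrative assumptions only; in the invention the model is a neural network and the labels come from the criteria T, A and U.

```python
import numpy as np

def confidence(feats, w):
    """Step D: per-pixel confidence via a logistic model (toy network)."""
    return 1.0 / (1.0 + np.exp(-(feats @ w)))

def adapt_step(feats, w, pos, neg, lr=0.5):
    """Steps E-F: gradient of the BCE loss on proxy labels, one update of w."""
    o = confidence(feats, w)
    grad = np.zeros_like(w)
    grad -= ((1 - o[pos])[:, None] * feats[pos]).sum(0)   # -log(o) term
    grad += (o[neg][:, None] * feats[neg]).sum(0)         # -log(1-o) term
    return w - lr * grad / max(pos.sum() + neg.sum(), 1)

rng = np.random.default_rng(0)
# one hypothetical scalar feature per pixel (plus a bias column):
# proxy-positive pixels score high, proxy-negative pixels score low
score = np.concatenate([rng.uniform(0.7, 1.0, 50), rng.uniform(0.0, 0.3, 50)])
feats = np.stack([score, np.ones(100)], axis=1)
pos = np.arange(100) < 50                 # labels from criteria being met
neg = ~pos
w = np.zeros(2)
for _ in range(500):                      # online adaptation over frames
    w = adapt_step(feats, w, pos, neg)
o = confidence(feats, w)
```

After a few hundred updates the model assigns high confidence to proxy-positive pixels and low confidence to proxy-negative ones, mirroring how the network's parameters track the observed scene.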
- the present invention aims at proposing a self-supervised paradigm suited for learning a confidence measure, unconstrained from the specific stereo method deployed and capable of self-adaptation.
- three main broad categories of stereo matching solutions are herein defined, each one characterized by different data made available during deployment. It is clear that the stereo matching systems herein disclosed are just possible embodiments and also other stereo matching systems can be available and implemented.
- a generic rectified stereo pair will be referred to as (I L , I R ), respectively made of left and right images, and a generic stereo algorithm or deep network will be referred to as S. Furthermore, in the remainder of the description, in order to simplify notation, (x,y) coordinates will be omitted if not strictly necessary.
- This image triplet is the minimum amount of data available out of any stereo method, and all the systems making available only such cues are here defined as “black-box systems”.
- black-box systems are highly representative of off-the-shelf stereo cameras (e.g., Stereolabs ZED 2) or stereo methods implemented in consumer devices (e.g., Apple iPhones).
- while black-box systems provide cues available in any stereo system, when explicit calls to the algorithm APIs are exposed, additional cues can be retrieved.
- a second family of systems can be implemented, for which, although it is given no access to the algorithm implementation or its intermediate data, explicit calls to the method itself are possible (e.g. stereo algorithms provided by pre-compiled libraries).
- with such access, a Left to Right Consistency (LRC) check can be performed: a second disparity map D R is computed assuming I R as the reference image, and each pixel of D L is compared with the disparity sampled from D R at its matching location:
LRC(x,y) = |D L (x,y) − π((x − D L (x,y), y), D R )|
wherein π(a,b) is a sampling operator, collecting values at coordinate a from b, and a threshold value (usually 1) is set, above which D L and D R are considered inconsistent.
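The LRC check can be sketched as follows; the disparity convention (a left pixel (x, y) matches the right pixel (x − D L (x,y), y)), the rounding of disparities and the clipping at image borders are illustrative assumptions.

```python
import numpy as np

def lrc_mask(d_left, d_right, tau=1.0):
    """Left-right consistency: True where the disparity sampled from D_R
    at the matching location agrees with D_L within a threshold tau."""
    h, w = d_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    ys = np.tile(np.arange(h)[:, None], (1, w))
    match_x = np.clip(xs - np.round(d_left).astype(int), 0, w - 1)
    sampled = d_right[ys, match_x]          # the sampling operator pi
    return np.abs(d_left - sampled) <= tau
```

Pixels failing the check (e.g. occlusions or mismatches between the two views) are marked inconsistent and treated as likely outliers.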
- Peak-Ratio (PKR) and Left-Right Difference (LRD) are defined in terms of d1 and d2, respectively the disparity hypotheses corresponding to the minimum cost and the second local minimum (see for example [3]).
- as regards LRD, given the cost volume V R computed assuming I R as the reference image, for any pixel (x,y) costs are sampled at (x − d1, y), i.e., from the estimated matching pixel.
- the method comprises a general-purpose strategy enabling self-supervised confidence estimation in such constrained settings.
- the method can be used even for state-of-the-art CNNs.
- out-of-the-box learning of confidence estimation with any stereo setup and self-adaptation in any environment is available.
Determination of the three self-supervising criteria
- the data available comprise (I L , I R ) and D L only.
- an image re-projection error is considered.
- α is a parameter ranging from 0 to 1, preferably tuned to 0.85. The higher the image reprojection error is, the more likely D L is wrong.
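A sketch of this criterion follows: the right image is warped onto the left view through D L and the photometric error mixes an SSIM term with an L1 term weighted by α. The 3 x 3 local statistics, the SSIM constants, the rounding of disparities and the border clipping are simplifying assumptions.

```python
import numpy as np

def local_mean(img, win=3):
    """Mean over a win x win neighbourhood (edge padding)."""
    pad = win // 2
    p = np.pad(img, pad, mode='edge')
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(win):
        for dx in range(win):
            out += p[dy:dy + h, dx:dx + w]
    return out / win ** 2

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural similarity from local means and (co)variances."""
    mx, my = local_mean(x), local_mean(y)
    sx = local_mean(x * x) - mx * mx
    sy = local_mean(y * y) - my * my
    sxy = local_mean(x * y) - mx * my
    return ((2 * mx * my + c1) * (2 * sxy + c2)) / (
        (mx * mx + my * my + c1) * (sx + sy + c2))

def reprojection_error(i_l, i_r, d_l, alpha=0.85):
    """alpha * (1 - SSIM)/2 + (1 - alpha) * L1 between I_L and warped I_R."""
    h, w = i_l.shape
    xs = np.clip(np.arange(w)[None, :] - np.round(d_l).astype(int), 0, w - 1)
    ys = np.tile(np.arange(h)[:, None], (1, w))
    warped = i_r[ys, xs]                     # reprojection of I_R on I_L
    return alpha * (1 - ssim_map(i_l, warped)) / 2 \
        + (1 - alpha) * np.abs(i_l - warped)
```

With a correct disparity the warped image matches I L and the error vanishes; wrong disparities warp mismatched content and raise it.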
- the present invention aims at detecting regions with rich texture, being more likely to be correctly estimated by S, by comparing the disparity computed between (I L , I R ) with the one obtained after reprojection.
- D L itself allows for the extraction of meaningful cues to assess the quality of disparity assignments.
- H NxN is a histogram encoding, for each pixel (x,y), the number of neighbours in a NxN window having the same disparity d (in case of subpixel precision, within 1 pixel).
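A sketch of the agreement criterion: for each pixel, count the fraction of its N x N neighbourhood sharing its disparity. The rounding to the nearest integer (an approximation of the within-1-pixel rule) and the edge padding are illustrative assumptions.

```python
import numpy as np

def disparity_agreement(d_l, n=5):
    """Fraction of the n x n neighbourhood sharing each pixel's disparity."""
    h, w = d_l.shape
    d = np.round(d_l).astype(int)   # approximates the within-1-pixel rule
    p = np.pad(d, n // 2, mode='edge')
    agree = np.zeros((h, w))
    for dy in range(n):
        for dx in range(n):
            agree += (p[dy:dy + h, dx:dx + w] == d)
    return agree / n ** 2
```

Isolated outliers score near 1/N² (only the pixel itself agrees), while pixels inside locally consistent regions score near 1.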
- the uniqueness constraint is considered.
- the uniqueness for any pixel in I L holds if it does not collide in the target image with any other pixel, i.e., if it does not match the same pixel in I R matched by any other pixel.
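A sketch of the uniqueness test: for each row, count how many left-image pixels map (through D L ) onto each right-image column; a pixel meets the criterion only when no other pixel shares its match. Integer rounding and border clipping are illustrative assumptions.

```python
import numpy as np

def uniqueness_mask(d_l):
    """True where a pixel of I_L is the only one matching its pixel of I_R."""
    h, w = d_l.shape
    xs = np.arange(w)
    tgt = np.clip(xs[None, :] - np.round(d_l).astype(int), 0, w - 1)
    unique = np.zeros((h, w), dtype=bool)
    for y in range(h):
        counts = np.bincount(tgt[y], minlength=w)  # matches per right column
        unique[y] = counts[tgt[y]] == 1            # exactly one source pixel
    return unique
```

Collisions typically arise at occlusions, where several foreground and background pixels compete for the same target pixel.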
- a Multi-modal Binary Cross Entropy (MBCE) loss is defined for each pixel of the acquired image as:
L MBCE = − Σ Pk∈P Σ (x,y)∈Pk log o(x,y) − Σ Qk∈Q Σ (x,y)∈Qk log (1 − o(x,y))
where o ∈ [0,1] is the output of the neural network, i.e. passed through a sigmoid activation, and P and Q are two sets of proxy labels, derived respectively by a self-supervising criterion being met or not.
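The multi-modal supervision can be sketched as follows: each criterion contributes a mask of proxy-positive pixels (pushing the network output towards 1) and a mask of proxy-negative pixels (towards 0), while unvoted pixels contribute nothing. The per-set summation is a plausible reading of the multi-modal combination, stated here as an assumption.

```python
import numpy as np

def mbce_loss(o, pos_masks, neg_masks, eps=1e-7):
    """Binary cross entropy summed over the proxy-label sets P and Q."""
    o = np.clip(o, eps, 1 - eps)          # numerical safety for the logs
    loss = 0.0
    for p in pos_masks:                   # criteria met: push o towards 1
        loss -= np.log(o[p]).sum()
    for q in neg_masks:                   # criteria violated: towards 0
        loss -= np.log(1 - o[q]).sum()
    return loss
```

With empty masks the loss is zero, which is exactly the "no guesses" behaviour for pixels where no criterion votes.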
- Fig. 3 illustrates, given a highlighted region, a set of inliers (also shown in green colour) and a set of outliers (also shown in red colour), which are determined by using the following configurations of self-supervising criteria in the multi-modal binary cross-entropy loss signal: a) Tp, Tn b) Ap, An c) Up, Un d) Tp, Ap, Up, Tn e) Tp, Ap, Up, Tn, An, Un, while for black pixels, the considered configuration gives no guesses.
- Fig. 3 highlights how, when combining multiple guesses as in cases d) and e), no supervision is given for some pixels when the self-supervising criteria do not agree.
- the system and the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be used to realize a depth estimation sensor, capable of providing a machine-learning based estimate of confidence without having to acquire datasets for learning, which is highly expensive and complicated to perform with techniques belonging to the state of the art.
- Possible applications of the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be:
- estimated confidence map CM ranks pixels from less to more reliable (from black to white). It can be used to extract a subset of reliable points to be used by “guided stereo” and “real-time self-adaptive” technologies, and filter out less reliable pixels and replace them with better estimates;
- evaluation is carried out Out-of-The-Box (OTB) by means of the Area Under the Curve (AUC) metric: pixels are sorted in increasing order of confidence and gradually removed (e.g., 5% each time) from the disparity map.
- the error rate is computed over the sparse disparity map as the percentage of pixels having absolute error larger than t.
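The sparsification procedure can be sketched as follows; a discrete average over the retained fractions stands in for the integral, and the 5% step and the boolean error mask are assumptions consistent with the text.

```python
import numpy as np

def auc_sparsification(conf, err_mask, step=0.05):
    """Average error rate while keeping growing fractions of the most
    confident pixels (a discrete area under the sparsification curve)."""
    order = np.argsort(conf.ravel())[::-1]     # most confident first
    errs = err_mask.ravel()[order].astype(float)
    n = errs.size
    rates = []
    for frac in np.arange(step, 1.0 + 1e-9, step):
        kept = errs[: max(int(round(frac * n)), 1)]
        rates.append(kept.mean())              # error rate of the sparse map
    return float(np.mean(rates))
```

A confidence measure that ranks wrong pixels last yields a much lower score than one that ranks them first, so lower AUC means better confidence.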
- comparisons are reported against ConfNet [11], SELF [1] and WILD [2].
- Fig. 5 shows qualitative examples for the SGM algorithm.
- Fig. 6 shows examples of acquired disparity and estimated confidence maps by ConfNet adapted online. More specifically, the very few frames collected are sufficient to learn how to detect gross errors like those on the turtle's shell.
- the present invention is suited for continuous online adaptation on any black-box framework. Furthermore, experimental results proved that the present method shows high performance compared with existing self-supervised approaches and, unlike them, allows further improvements during deployment by leveraging the online self-adaptation process.
- An advantage of the method according to the present invention is that of allowing a self-adapting confidence estimation agnostic to the stereo algorithm or network.
- Another advantage of the present invention is that of learning an effective confidence measure only based on the minimum information available in any stereo setup (i.e., the input stereo pair of images and the output disparity map).
Abstract
Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof The present invention relates to a method for determining the confidence of a disparity map by training a neural network (14), wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map, from at least one digital image of a scene, comprising the following steps: A. acquiring said at least one digital image of said scene; B. calculating said disparity map for each pixel of said at least one digital image; C. extracting at least one self-supervising criterion from said at least one digital image and said disparity map; D. calculating a confidence map from said disparity map, by means of said neural network (14); E. calculating a loss signal from said confidence map and said at least one self-supervising criterion; and F. optimizing said neural network (14) by training said neural network (14) with the information associated to said loss signal. The present invention also relates to a sensor system for determining the confidence of a disparity map from at least one digital image of a scene.
Description
Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof
The present invention relates to a method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof.
Field of the invention
More specifically, the invention relates to a method and a sensor system of the mentioned type, designed in particular for determining the confidence of disparity maps inferred by a stereo algorithm or a network through a neural network capable of self-adapting, but which can be used for any type of image acquisition system, in which it is necessary to estimate the confidence, thus determining the level of certainty or uncertainty of each pixel of said image. In the following, the description will be directed to a self-supervised confidence estimation in a constrained setting, but it is clear that the same should not be considered limited to this specific use.
Background
There are on the market several systems for acquiring images in 3D, in order to determine the depth of an image.
Currently, stereo is one of the most popular strategies to accurately perceive the 3D structure of the scene, through two synchronized cameras and several algorithms, either hand-designed or based on deep neural networks. In many practical applications, alongside with disparity inference, confidence estimation is often performed as well. Purposely, a wide range of methods based either on hand-crafted measures or learning-based strategies have been proposed.
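As an illustration of the hand-designed family of algorithms mentioned above, the following minimal sketch computes a disparity map by winner-takes-all selection over a Sum of Absolute Differences (SAD) cost volume. It is a toy example for intuition only, not the algorithm S of the invention; the window size, disparity range and box-filter aggregation are arbitrary choices.

```python
import numpy as np

def box_sum(img, win):
    """Sum over a win x win window around each pixel (zero padding)."""
    pad = win // 2
    p = np.pad(img, pad)
    c = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))  # integral image
    h, w = img.shape
    return (c[win:win + h, win:win + w] - c[:h, win:win + w]
            - c[win:win + h, :w] + c[:h, :w])

def sad_disparity(left, right, max_disp=8, win=3):
    """Winner-takes-all disparity from a SAD cost volume, left as reference."""
    h, w = left.shape
    cost = np.empty((h, w, max_disp))
    for d in range(max_disp):
        diff = np.full((h, w), 1e3)            # penalise out-of-view columns
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        cost[:, :, d] = box_sum(diff, win)     # aggregate costs locally
    return cost.argmin(axis=2)                 # pick the cheapest hypothesis
```

For a synthetic pair where the right image is the left one shifted by two pixels, the map converges to a constant disparity of 2 away from the image borders.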
Recently, it has been shown how state-of-the-art networks processing cues available from any stereo setup (i.e. the input stereo pair and the output disparity map) are substantially equivalent to those processing the entire cost volume, further supporting the evidence that the disparity map itself contains sufficient clues to identify outliers.
Such a feature is highly desirable, since it potentially paves the way for learning confidence estimation for any stereo camera, even without any knowledge about the stereo algorithm/network deployed.
This fact is very appealing since it frequently occurs with most industrial/off-the-shelf (e.g. Stereolabs ZED 2) or consumer devices (e.g. smartphones).
Nonetheless, this opportunity was investigated only partially in the literature. Moreover, all the above-mentioned methods are strongly constrained to the need for ground truth depth labels acquired in the target domain.
However, since achieving such labels is cumbersome and time- consuming, some self-supervised methods have been proposed in the prior art. Although these methods proved that confidence estimation could be learned without needing active sensors, they have various drawbacks.
One of the drawbacks of the known technical solutions is that static stereo sequences are required.
Moreover, a drawback of the technical solution according to the prior art is that it needs access to the cost volume, rarely exposed in the case of off-the-shelf stereo sensors mentioned above or not defined at all in most modern neural networks.
As a consequence, the solutions available in the prior art are not thought to handle adaptation, required to soften domain-shift issues. Thus, a solution for out-of-the-box deployment of self-adaptive confidence estimation would be highly desirable for many practical applications.
A notable example concerns smartphones, nowadays equipped with multiple cameras and stereo algorithms/networks deployed for augmented reality or other applications in unpredictable environments.
A short review of the prior art follows, concerning the literature on confidence measures and recent trends in stereo matching.
Confidence measures can be divided into two main categories: hand-made and learned measures.
The former category consists of conventional methods, typically computed from cost volume analysis, such as the ratio between the two minima (as in the so-called Peak-Ratio or PKR) or, as recently proposed, from local properties of the disparity map, like the number of pixels with the same disparity hypothesis.
As regards learned measures, hand-made cues are usually combined and fed as input to a random forest classifier or to a Convolutional Neural Network (CNN), appropriately trained by deploying depth labels.
Learned methods may require:
1) full access to the cost volume to extract hand-made features or process the volume itself;
2) disparity maps for both left and right viewpoint; or
3) only the input image and its corresponding disparity map.
The above-mentioned three requirements translate into harder to softer constraints at deployment, most of them usually not met by off-the-shelf stereo cameras, since these expose only the input stereo pair and the output disparity map to the user.
Recently, it has been shown that, although a CNN with access to the full cost volume can perform better than networks processing disparity and reference image only, the margin between the two approaches is small and, in most cases, negligible, at the cost of a much lower versatility of the former.
As to the applications of confidence measures, in addition to the traditional outliers filtering task, many higher-level applications exploit such cue for different purposes.
In particular, confidence has been estimated to detect ground control points to improve global optimization. A confidence-based modulation of the cost volume, applied before Semi-Global Matching (SGM) optimization, has also been proposed. Moreover, the streaking effects of the SGM stereo algorithm have been reduced by using a weighted sum of the scanlines according to a confidence measure.
Similarly, other approaches provide for fusing multiple scanlines of SGM using a random forest classifier.
Also, methods acting outside the stereo algorithms have been proposed for stereo algorithm fusion, sensor fusion and unsupervised adaptation of deep models for stereo matching.
Self-supervised learning has been barely investigated for confidence estimation.
According to some approaches (Mostegel et al., see [1]), stereo videos have been leveraged, looking at consistencies and contradictions between the different viewpoints of a static scene, in order to obtain correct and wrong candidates from a given stereo algorithm.
In other approaches (see Tosi et al. [2]), traditional confidence measures are instead relied upon to obtain these two sets according to a consensus among them.
In addition, at first, CNNs have replaced single steps in the stereo pipeline, such as cost computation, rapidly converging towards end-to-end solutions estimating dense disparity maps by means of 2D or 3D networks.
The latest trend in the field consists of casting disparity estimation as a continuous learning problem, thanks to the self-supervision enabled by image re-projection.
Scope of the invention
In light of the above, it is therefore an object of the present invention to overcome the drawbacks of the self-supervised methods proposed in the prior art, by providing a method for determining the confidence of a disparity map through a self-adaptive learning of a neural network.
Another object of the present invention is that of providing a method for self-adapting a confidence measure unconstrained to the stereo system deployed.
A further object of the invention is that of providing a novel loss function built upon cues available from the input stereo pair and the output disparity only, needing no additional information to learn/adapt to the sensed environment.
Another object of the present invention is that of providing a method and a system of high reliability, easy to implement, and competitive in terms of costs when compared to the known technique.
Still, object of the present invention is to provide the tools necessary for the execution of the method and the apparatuses to perform such method.
Object of the invention
It is therefore specific object of the present invention a method for determining the confidence of a disparity map DL by training a neural network, wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL, from at least one digital image IL, IR of a scene, comprising the following steps: A. acquiring said at least one digital image IL, IR of said scene; B. calculating said disparity map DL for each pixel of said at least one digital image IL, IR; C. extracting at least one self-supervising criterion from said at least one digital image IL, IR and said disparity map DL; D. calculating a confidence map CM from said disparity map DL by means of said neural network; E. calculating a loss signal LMBCE from said confidence map CM and said at least one self-supervising criterion; and F. optimizing said neural network by training said neural network with the information associated to said loss signal LMBCE.
Still according to the invention, said at least one self-supervising criterion extracted in said step C may comprise at least one of the following criteria: a self-supervising criterion related to an image reprojection error between said at least one digital image IL, IR; a self-supervising criterion related to a disparity agreement between pixels of said disparity map DL; and/or a self-supervising criterion related to the uniqueness of any pixel in IL and IR respectively.
Advantageously according to the invention, said self-supervising criterion related to an image reprojection error between said at least one digital image IL, IR may be calculated according to the following equation:
e(IL, ĨR) = α · (1 − SSIM(IL, ĨR))/2 + (1 − α) · |IL − ĨR|
wherein ĨR = π(DL, IR) is a reprojection of IR on reference image coordinates, SSIM is the Structural SIMilarity index and α is a parameter ranging from 0 to 1, preferably tuned to 0.85.
Conveniently according to the invention, said self-supervising criterion related to a disparity agreement between pixels of said disparity map DL may be calculated according to the following equations:
DA(x,y) = H_NxN(x, y, DL(x,y)) / N²
A = DA > 0.5
wherein H_NxN is a histogram encoding, for each pixel (x,y) of said disparity map DL, the number of neighbours in a N×N window having the same disparity d.
Always according to the invention, said self-supervising criterion related to the uniqueness of any pixel in IL and IR respectively may be calculated as:
U = UC
wherein UC(x,y) holds if and only if no other pixel of IL matches the same pixel (x − DL(x,y), y) in IR.
Still according to the invention, said loss signal LMBCE is a Multi-modal Binary Cross Entropy loss signal which may be calculated according to the following equation:
LMBCE = −(Π_{p∈P} p) · log(o) − (Π_{q∈Q} q) · log(1 − o)
where o ∈ [0,1] is the output of said neural network (14), P and Q are two sets of proxy labels derived respectively by a self-supervising criterion comprised in said at least one self-supervising criteria being met or not.
Always according to the invention, said step B may be carried out according to the following formula DL = S(IL,IR).
Conveniently according to the invention, said step B may be carried out by means of a network S. Advantageously according to the invention, said step A may be carried out by an image detecting unit comprising at least one image detecting device for detecting said at least one digital image IL, IR , said step B may be carried out by first processing means, connected to said image detecting device, said step C may be carried out by a filter, connected to said image detecting device and said first processing means, and said steps E and F may be carried out by a second processing means, connected to said filter and said neural network.
Still according to the invention, said step A may be carried out by a stereo matching technique, so as to detect a reference image IL and a target image IR of said scene.
It is also object of the present invention a processing unit for determining the confidence of a disparity map DL, wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL,
from at least one digital image IL, IR of a scene, wherein the disparity map DL is achieved through a network S, and wherein the processing unit is configured to execute the steps B-F of said method.
Conveniently according to the invention, the processing unit may comprise processing means, connected to said image detecting unit, a filter, connected to said image detecting unit and said processing means, and configured for extracting at least one self-supervising criteria from said at least one digital image IL, IR and said disparity map DL, and a neural network, connected to said processing means, configured for producing a confidence map CM from said disparity map DL, wherein said processing means are configured for determining a disparity map DL from said at least one digital image IL, IR , and for calculating a loss signal from said confidence map CM and said at least one self-supervising criteria.
It is also object of the present invention a sensor system for determining the confidence of a disparity map DL from at least one digital image IL, IR of a scene, comprising an image detection unit configured for acquiring said at least one digital image IL , IR of said scene, and a processing unit connected to said image detecting unit.
It is further object of the present invention a computer program comprising instructions which, when the program is executed by a computer, cause the computer to perform the present method steps.
It is also object of the present invention a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the present method steps.
Brief description of the drawings
The present invention will be now described, for illustrative but not limitative purposes, according to its preferred embodiments, with particular reference to the figures of the enclosed drawings, wherein:
Fig. 1 illustrates a block diagram of an embodiment of the sensor system for determining the confidence of a disparity map by self-adaptive
learning of a neural network, according to the present invention;
Fig. 2 illustrates a flowchart concerning the steps of the method for determining the confidence of a disparity map by self-adaptive learning of a neural network, according to the present invention;
Fig. 3 illustrates, given a highlighted region, a set of inliers and a set of outliers, which are determined by using different configurations of self-supervising criteria, according to the present invention;
Fig. 4 illustrates a table which reports AUC scores for networks trained on a first set of test images and tested on an unseen set of test images;
Fig. 5 illustrates from left: a reference image, a disparity map and confidence maps by existing self-supervised approaches [2], [1], the proposed technique and the proposed technique during online adaptation;
Fig. 6 illustrates two examples of reference image and disparity map acquired with an iPhone XS, followed by estimated confidence map after few iterations of on the-fly learning; and
Fig. 7 illustrates reference images, disparity maps from various algorithms and confidence maps estimated by self-supervised frameworks and by the present method.
Detailed description
In the various figures, similar parts will be indicated by the same reference numbers.
With reference to the aforementioned Fig. 1 , a sensor system for determining the confidence of a disparity map through a self-adaptive learning of a neural network, indicated as a whole with the reference number 1 , is shown, which comprises an image detecting unit 10 and a processing unit U, connected to said image detecting unit 10.
In the present embodiment, said processing unit U comprises first processing means 11 connected to said image detecting unit 10, a filter 12 connected to said image detecting unit 10 and said first processing means
11 , second processing means 13, connected to said filter 12, and a neural network or confidence network 14, connected to said first processing means 11 and to said second processing means 13.
In the embodiment according to the present invention, said first processing means 11 and said second processing means 13 are two different processing means.
However, in other embodiments of the present invention, said first processing means 11 and said second processing means 13 can be considered as the same processing means or integrated, for instance in a same microprocessor.
Moreover, in the embodiment at issue, said image detecting unit 10 is a stereoscopic vision system.
However, in other embodiments of the present invention, said image detecting unit 10 can be any other system even according to the prior art capable of obtaining disparity or distance maps from digital images or other methods.
In particular, said image detecting unit 10 comprises a first image detecting device 100 and a second image detecting device 101 , such as a video camera, a photo camera or a sensor, arranged at a predetermined fixed distance from each other.
In other embodiments according to the present invention, the image detecting unit 10 can comprise a number of detecting devices other than two, for example, one, as in monocular systems for depth estimation from images.
More specifically, each of said image detecting devices 100, 101 detects a respective image of the object or the scene observed.
As it will be better explained below, the image acquired by means of said image detecting device 100, i.e. the left image, will be considered as the reference image or reference IL, while the image acquired through said image detecting device 101 , i.e. the right image, will be considered as the
target image or target IR.
However, each image acquired by the respective detection device 100, 101 can be considered as reference image IL or target image IR.
Still referring to Fig. 1 , said first processing means 11 are connected to said image detecting devices 100, 101 . In particular, said first processing means 11 are configured to process said images IL and IR in order to generate a disparity map DL.
In the embodiment according to the present invention, the output disparity map DL is computed assuming IL as the reference image. However, in another embodiment of the present invention, the output disparity map can be computed assuming IR as the reference image.
Moreover, in the embodiment schematically illustrated in Fig. 1 , said first processing means 11 generates said disparity map DL by means of a stereo algorithm S.
However, in further embodiments of the present invention, said first processing means 11 provide the use of additional algorithms, networks, programs or other computer sensors, capable of generating disparity maps.
As will be better described below, said filter 12 is capable of extracting a plurality of self-supervising criteria from said disparity map DL and said images IL and IR , in order to provide a self-adaptive learning of said confidence network 14, as better explained below.
In the embodiment at issue, as it will be better explained below, the extracted self-supervising criteria are three, herein referred to as T, A and U, related to image re-projection error, disparity agreement between nearby pixels of an image and uniqueness constraint between pixels of different images, respectively.
However, in other embodiments, it is possible to extract or calculate a different number of said self-supervising criteria, such as one, two or more than three criteria, respect to those ones described above.
Moreover, in other embodiments of the present invention, it is
possible to extract different criteria from the self-supervising criteria described for the present invention.
Said second processing means 13 are then configured to determine an evaluation of the loss based on said three self-supervising criteria T, A and U, in order to evaluate the output of the neural network 14, so as to train the same online, namely during its operation, without any external data for training the same.
More specifically, the second processing means 13 calculate a Multimodal Binary Cross Entropy (MBCE) loss signal from a combination of the outcomes of said three self-supervising criteria T, A and U, and a confidence map CM is computed by said confidence network 14.
As said above, said confidence network 14 is connected to said first processing means 11 and to said second processing means 13.
In particular, said confidence network 14 is configured to determine said confidence map CM from said disparity map DL.
More specifically, said confidence map CM ranks pixels of the disparity map DL from less to more reliable (from black to white).
As it will be described in more detail herein, said confidence network 14 is capable of updating its own knowledge of the surrounding environment by means of the evaluation of the Multi-modal Binary Cross Entropy (MBCE) loss signal computed by said second processing means 13.
As mentioned above, in some embodiments, the first processing means 11 , the second processing means 13, the filter 12, and the neural network or confidence network 14, can be integrated in a single processing unit U, properly programmed.
Referring now to Fig. 2, a flowchart of the method according to the present invention is shown, which can be executed also by the system of Fig. 1.
At first, the step of acquiring images, indicated with the reference letter A, provides the acquisition of a reference image IL and a target image
IR related to an object or a scene observed by means of said image detecting unit 10.
In step B, said first processing means 11 process said images IL and IR, in order to generate a disparity map DL by means of said stereo algorithm S.
As said above, in the embodiment at issue, the present method provides an image processing using a stereo algorithm S.
However, in further embodiments of the present invention, it is possible to use additional algorithms, programs or other computer sensors capable of generating disparity maps.
Subsequently, in step C, said filter 12 extracts said three self-supervising criteria T, A and U from the two images IL and IR and the disparity map DL.
In step D, said confidence network 14 determines said confidence map CM from said disparity map DL.
Then, in step E, said second processing means 13 compute the MBCE loss signal from said confidence map CM and a combination of one or more of said self-supervising criteria T, A and U. It is noted that also other self-supervising criteria from said disparity map DL can be used, alternatively or in addition to the three self-supervising criteria T, A and U specified above, without departing from the scope of protection of the invention herein disclosed.
Finally, in step F, said confidence network 14 is updated based on said MBCE loss signal computed in said step E.
In particular, the parameters of said neural network 14 are continuously updated in order to adapt said neural network 14 itself to the environment related to scene observed.
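The adaptation loop of steps B-F described above can be sketched in a few lines; all component callables (stereo, confidence_net, extract_criteria, mbce_loss, update) are illustrative placeholders for the blocks 11-14 of Fig. 1, not an actual implementation:

```python
def self_adaptive_step(il, ir, stereo, confidence_net,
                       extract_criteria, mbce_loss, update):
    """One online iteration of the method (steps B-F)."""
    dl = stereo(il, ir)                      # step B: disparity map DL = S(IL, IR)
    criteria = extract_criteria(il, ir, dl)  # step C: self-supervising criteria
    cm = confidence_net(dl)                  # step D: confidence map CM
    loss = mbce_loss(cm, criteria)           # step E: MBCE loss signal
    update(loss)                             # step F: online update of the network
    return dl, cm, loss
```

Each newly acquired stereo pair triggers one such step, so the confidence network keeps adapting to the observed environment without any ground-truth labels.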
As already said, the present invention aims at proposing a self-supervised paradigm suited for learning a confidence measure, unconstrained from the specific stereo method deployed and capable of self-adaptation.
Therefore, at first stereo systems are classified into different categories according to the data they make available, and then a strategy compatible with all of them is introduced.
Stereo matching systems
Three main broad categories of stereo matching solutions are herein defined, each one characterized by different data made available during deployment. It is clear that the stereo matching systems herein disclosed are just possible embodiments, and also other stereo matching systems can be available and implemented.
A generic rectified stereo pair will be referred to as (IL, IR), respectively made of left and right images, and a generic stereo algorithm or deep network will be referred to as S. Furthermore, in the remainder of the description, in order to simplify notation, (x,y) coordinates will be omitted if not strictly necessary.
Given any stereo algorithm processing a stereo pair (IL, IR), the output disparity map, computed assuming IL as the reference image, is defined as DL = S(IL, IR).
This image triplet is the minimum amount of data available out of any stereo method, and all the systems making available only such cues are here defined as "black-box systems". Such systems are highly representative of off-the-shelf stereo cameras (e.g., Stereolabs ZED 2) or stereo methods implemented in consumer devices (e.g., Apple iPhones).
In particular, they neither allow end-users to access the implementation nor provide explicit ways (Application Programming Interfaces or APIs) to call for it.
For each (IL, IR) acquired in the field by the device, they provide the corresponding disparity map, typically with undisclosed approaches based either on conventional stereo algorithms or deep networks.
Although black-box systems provide cues available in any stereo system, when explicit calls to the algorithm APIs are exposed, additional cues can be retrieved. Hence, a second family of systems can be implemented, for which, although no access is given to the algorithm implementation or its intermediate data, explicit calls to the method itself are possible (e.g. stereo algorithms provided by pre-compiled libraries).
The systems belonging to this class are defined as "gray-box systems", since multiple calls to S allow for retrieving additional cues. For instance, it is straightforward to compute the Left to Right Consistency (LRC) of the disparity maps, a popular strategy to obtain a confidence estimator, even if not explicitly provided by S itself in its original implementation.
Given the possibility to call S two times, consistency checking can be performed analysing DL and a second disparity map, namely DR, obtained by assuming IR as the reference image. Defining ← the horizontal flipping operator, DR is obtained as follows:
DR = ←S(←IR, ←IL) (1)
Once obtained DR, the consistency between the two disparity maps can be checked as:
LRC = |DL − π(DL, DR)| > ε
where π(a,b) is a sampling operator, collecting values at coordinate a from b, and ε is a threshold value (usually 1) above which DL and DR are considered inconsistent.
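Under gray-box assumptions, the LRC check above can be sketched in a few lines of NumPy. This is a minimal sketch: disparities are rounded to integers and out-of-image matches are clamped to the image border.

```python
import numpy as np

def lrc_mask(dl, dr, eps=1.0):
    """True where DL and DR are inconsistent: |DL - pi(DL, DR)| > eps."""
    h, w = dl.shape
    xs = np.tile(np.arange(w), (h, 1))
    xr = np.clip(np.rint(xs - dl).astype(int), 0, w - 1)  # column matched in DR
    sampled = np.take_along_axis(dr, xr, axis=1)          # pi(DL, DR)
    return np.abs(dl - sampled) > eps
```

Pixels flagged by the mask (typically occlusions and mismatches) are those for which the two views disagree beyond the tolerance ε.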
If the implementation of S is accessible, additional cues can be sourced by processing intermediate data structures, if meaningful. The preferred one is the cost volume V, containing matching costs V(x,y,d) for pixels at coordinates (x,y) and any disparity hypothesis d ∈ [0, dmax].
This class of systems, referred to as “white-box systems”, enables computation of any confidence measure, either conventional or learning- based. Popular traditional confidence measures obtained from V are the
Peak-Ratio (PKR) and Left-Right Difference (LRD), defined respectively as:
PKR = V(x,y, d2m) / V(x,y, d1)
LRD = (V(x,y, d2m) − V(x,y, d1)) / |V(x,y, d1) − min_d VR(x − d1, y, d)|
where d1 and d2m, respectively, are the disparity hypotheses corresponding to the minimum cost and the second local minimum (see for example [3]). Regarding LRD, given the cost volume VR computed assuming IR as the reference image, for any pixel (x,y) costs are sampled at (x − d1, y), i.e., from the estimated matching pixel.
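For white-box systems, PKR can be sketched directly from a cost volume. In this sketch the second local minimum is approximated by masking out the winning disparity and its direct neighbours — an assumption for illustration, not the exact definition of [3]:

```python
import numpy as np

def peak_ratio(volume):
    """PKR per pixel: (approximate) second local minimum cost over minimum cost.

    volume: float array of shape (H, W, dmax) with matching costs V(x, y, d).
    Higher ratios indicate a more distinctive (hence more confident) minimum.
    """
    h, w, dmax = volume.shape
    d1 = volume.argmin(axis=2)                 # winning disparity hypothesis
    c1 = volume.min(axis=2)                    # minimum cost
    masked = volume.astype(float).copy()
    for off in (-1, 0, 1):                     # mask winner and direct neighbours
        np.put_along_axis(masked, np.clip(d1 + off, 0, dmax - 1)[..., None],
                          np.inf, axis=2)
    c2m = masked.min(axis=2)                   # approximate second local minimum
    return c2m / np.maximum(c1, 1e-6)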
Black-box models represent the most challenging, yet general target when dealing with confidence estimation, since their constraints prevent the deployment of most state-of-the-art measures, as well as of the self-supervised strategies existing in the literature.
In the embodiment herein disclosed, the method comprises a general-purpose strategy enabling self-supervised confidence estimation in such constrained settings. However, in further embodiments, the method can be used even for state-of-the-art CNNs. Furthermore, out-of-the-box learning of confidence estimation with any stereo setup and self-adaptation in any environment is available.
Determination of the three self-supervising criteria
In order to develop a self-supervised strategy suited for any stereo system, it is required to identify cues that are effective to source a robust supervision signal.
According to the previous discussion, in the case for example of black-box models, the data available comprise (IL, IR) and DL only.
In this circumstance, although relevant information is not available compared to other models, the above mentioned three self-supervising criteria are introduced to obtain the desired self-supervised signal from the meagre cues available.
As a first self-supervising criterion implemented in the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network according to the present invention, an image re-projection error is considered. Indeed, the reprojection across the two viewpoints available in a rectified stereo pair is a powerful source of supervision, either for monocular (see [4, 5, 6]) or stereo (see [7, 8]) depth estimation.
Specifically, IR is reprojected on the reference image coordinates as fR = n(DL,IR). Then, the difference between IL and warped right view fR appearance encodes how correct the reprojection is.
To this aim, the most popular choice is a weighted sum between two terms, respectively SSIM (see [9]) and absolute difference:
e(IL, ĨR) = α · (1 − SSIM(IL, ĨR))/2 + (1 − α) · |IL − ĨR|
wherein α is a parameter ranging from 0 to 1, preferably tuned to 0.85. The higher the image reprojection error is, the more likely DL is wrong.
By definition, matching pixels is particularly challenging in ambiguous regions, such as textureless portions of the image.
To this aim, the present invention aims at detecting regions with rich texture, being more likely to be correctly estimated by S, by comparing e computed between (IL, IR) with the one after reprojection, i.e. T = e(IL, ĨR) < e(IL, IR).
In large ambiguous regions, e(IL, IR) will result equal (or even lower) than the reprojection error, thus identifying pixels on which stereo is prone to errors.
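The T criterion can be sketched as follows. The per-pixel SSIM maps are assumed to be supplied by an external routine (e.g. a windowed SSIM implementation), so only the weighted combination and the comparison are shown:

```python
import numpy as np

ALPHA = 0.85  # weight between the SSIM term and the absolute difference

def photometric_error(il, image, ssim_map, alpha=ALPHA):
    """e = alpha * (1 - SSIM)/2 + (1 - alpha) * |IL - image| (per pixel)."""
    return alpha * (1.0 - ssim_map) / 2.0 + (1.0 - alpha) * np.abs(il - image)

def texture_criterion(il, ir, ir_warped, ssim_warped, ssim_plain):
    """Criterion T: warping IR with DL must lower the photometric error
    with respect to the raw (unwarped) right image."""
    return (photometric_error(il, ir_warped, ssim_warped)
            < photometric_error(il, ir, ssim_plain))
```

In textureless regions the unwarped error is already low, so the comparison fails and the pixel receives no positive label.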
As a second self-supervising criterion, the disparity agreement or agreement among neighbouring matches is considered.
In particular, considering that most regions of a disparity map should be smooth, variations in nearby pixels should be small except at depth boundaries. DL itself allows for the extraction of meaningful cues to assess the quality of disparity assignments. Purposely, the disparity agreement between neighbouring pixels is defined as:
DA(x,y) = H_NxN(x, y, DL(x,y)) / N²
where H_NxN is a histogram encoding, for each pixel (x,y), the number of neighbours in a N×N window having the same disparity d (in case of subpixel precision, within 1 pixel).
In the absence of depth discontinuities, the majority of pixels in the neighbourhood should share the same, or very similar, disparity hypothesis.
Hence, this second self-supervising criterion is defined to identify reliable stereo correspondences as A = DA > 0.5, assuming that more than half of the pixels in the neighbourhood share the same disparity.
It is worth noting that this second self-supervising criterion is often not met in the presence of depth boundaries, even in case of correct disparities.
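A naive (unoptimized) sketch of the DA computation and of criterion A, using integer-rounded disparities:

```python
import numpy as np

def disparity_agreement(dl, n=5):
    """Criterion A: for each pixel, the fraction of pixels in the NxN window
    sharing the same rounded disparity as the centre must exceed 0.5."""
    h, w = dl.shape
    r = n // 2
    d = np.rint(dl).astype(int)
    agree = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            win = d[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            agree[y, x] = (win == d[y, x]).sum() / float(n * n) > 0.5
    return agree
```

Note that, consistently with the remark above, isolated correct pixels at depth boundaries may fail the test, since less than half of their neighbours share their disparity.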
As a third self-supervising criterion, the uniqueness constraint is
considered.
In an ideal frontal-parallel scene observed by a stereo camera in standard form, for each pixel in IL there exists at most one match in IR and vice versa. Leveraging this property, known as uniqueness, is particularly useful to detect outliers in occluded regions and represents a reliable alternative to LRC and LRD measures, not usable when dealing with black-box models.
In other words, the uniqueness for any pixel in IL holds if it does not collide in the target image with any other pixel, i.e., if it does not match the same pixel in IR matched by any other pixel.
This property is exploited in order to define a third self-supervising criterion as U = UC.
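A per-scanline sketch of the uniqueness check: the (rounded) matching columns are counted, and U keeps only pixels whose target column is claimed exactly once.

```python
import numpy as np

def uniqueness_criterion(dl):
    """Criterion U: a pixel survives only if no other pixel on the same
    scanline matches the same rounded column x - DL(x, y) in IR."""
    h, w = dl.shape
    xs = np.arange(w)[None, :]
    target = np.rint(xs - dl).astype(int)   # matched column in the target image
    u = np.zeros((h, w), dtype=bool)
    for y in range(h):
        cols, counts = np.unique(target[y], return_counts=True)
        hits = dict(zip(cols.tolist(), counts.tolist()))
        u[y] = [hits[c] == 1 for c in target[y].tolist()]
    return u
```

Colliding pixels, typically belonging to occluded regions, are flagged as unreliable.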
Although effective at detecting mostly occlusions, the uniqueness constraint is often violated in the presence of slanted surfaces.
Multi-modal Binary Cross Entropy calculation
Given one or more of the three self-supervising criteria T, A and U disclosed above, a measure of binary entropy loss is calculated, to take into account multiple label hypotheses.
In particular, a Multi-modal Binary Cross Entropy (MBCE) loss is defined for each pixel of the acquired image as:
LMBCE = −(Π_{p∈P} p) · log(o) − (Π_{q∈Q} q) · log(1 − o)
where o ∈ [0,1] is the output of the neural network, i.e. passed through a sigmoid activation, and P and Q are two sets of proxy labels, derived respectively by a self-supervising criterion being met or not.
For instance, it is considered that the self-supervising criteria are calculated for each pixel, based on said disparity map DL and said images IL and IR. Pixels satisfying the first self-supervising criterion on image reprojection will have labels Tp = 1, Tq = 0 and vice versa when they do not.
Therefore, unlike traditional binary cross entropy, where a single label y and its counterpart (1 - y) are used, disjoint sets of proxies are defined allowing for a flexible configuration of the loss function according to the three self-supervising criteria described so far.
For instance, by setting P = {Tp, Ap} and Q = {Tq}, the network will be trained to detect good matches using image reprojection error plus disparity agreement, and outliers using the image reprojection error only.
Adding elements to the sets P and Q reduces progressively the number of pixels considered correct or wrong, respectively.
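The loss can be sketched as follows; the product form below is one possible reading of the multi-modal formulation reconstructed from the description above (a pixel counts as "correct" only when all labels in P are 1, and as "wrong" only when all labels in Q are 1), so it is an assumption for illustration:

```python
import numpy as np

def mbce_loss(o, p_labels, q_labels, eps=1e-7):
    """Per-pixel multi-modal binary cross entropy.

    o: network output in [0, 1]; p_labels / q_labels: lists of binary maps,
    one per self-supervising criterion in P and Q respectively.
    """
    p = np.prod(np.stack(p_labels), axis=0)  # 1 only if ALL criteria in P are met
    q = np.prod(np.stack(q_labels), axis=0)  # 1 only if ALL criteria in Q are met
    return -(p * np.log(o + eps) + q * np.log(1.0 - o + eps))
```

With this formulation, adding criteria to P or Q shrinks the corresponding supervised set, exactly as stated above; pixels with p = q = 0 receive no gradient at all.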
It is noted that Fig. 3 illustrates, given a highlighted region, a set of inliers (also shown in green colour) and a set of outliers (also shown in red colour), which are determined by using the following configurations of self-supervising criteria in the multi-modal binary cross-entropy loss signal: a) Tp, Tq; b) Ap, Aq; c) Up, Uq; d) Tp, Ap, Up, Tq; e) Tp, Ap, Up, Tq, Aq, Uq; while for black pixels, the considered configuration gives no guesses.
In particular, Fig. 3 highlights how, when combining multiple guesses as in cases d) and e), no supervision is given for some pixels when the self-supervising criteria do not match.
The system and the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be operable to realize a depth estimation sensor, capable of providing an estimate of confidence based on machine learning without having to acquire datasets for learning, which is highly expensive and complicated to perform with techniques belonging to the state of the art.
Possible applications of the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be:
1 ) assessing, in general, the quality of a stereo algorithm by finding situations/patterns where it usually fails. For example, several conventional algorithms fail near occlusions;
2) the estimated confidence map CM ranks pixels from less to more reliable (from black to white). It can be used to extract a subset of reliable points to be used by "guided stereo" and "real-time self-adaptive" technologies, and to filter out less reliable pixels and replace them with better estimates;
3) fusion of stereo disparity with Time of Flight (ToF) depth maps; and
4) fusion of multiple stereo algorithms.
Experimental results
In this section, the outcome of experiments aimed at assessing the effectiveness of the proposed invention, referred to as Out-of-The-Box (OTB), is reported.
To measure the effectiveness of the learned confidence measures, the Area Under Curve (AUC) of the sparsification plots (see [3], [10], [11], [12]) has been computed.
In particular, given a disparity map, pixels are sorted in increasing order of confidence and gradually removed (e.g., 5% each time) from the disparity map. At each iteration, the error rate is computed over the sparse disparity map as the percentage of pixels having absolute error larger than t.
Plotting the error rate results in a sparsification curve, whose AUC quantitatively assesses the confidence effectiveness (the lower, the better).
The optimal AUC is obtained by sampling the pixels in decreasing order of absolute error.
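The sparsification AUC described above can be sketched as follows (rectangle-rule integration; the removal step and the error threshold are the parameters mentioned in the text):

```python
import numpy as np

def sparsification_auc(confidence, abs_err, t=3.0, step=0.05):
    """AUC of the sparsification plot: pixels are removed in increasing order
    of confidence and the bad-pixel rate (|error| > t) is re-computed at each
    step; the lower the area, the better the confidence measure."""
    order = np.argsort(confidence.ravel())          # least confident first
    bad = (np.abs(abs_err.ravel()) > t)[order]
    n = bad.size
    rates = []
    for frac in np.arange(0.0, 1.0, step):
        kept = bad[int(frac * n):]                  # drop the least confident
        rates.append(kept.mean() if kept.size else 0.0)
    return float(np.sum(rates) * step)
```

A confidence map that ranks erroneous pixels last (lowest confidence) removes them early and yields a small AUC, approaching the optimal curve obtained by sorting pixels by decreasing absolute error.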
Self-adapting in-the-wild
Experiments aimed at assessing how effective the present method is for self-adaptation of a confidence measure in unseen environments have been conducted by selecting a sequence from the DrivingStereo ([19]) dataset. The sequence 2018-10-25-07-37, containing 6905 stereo pairs acquired in unconstrained (i.e., dynamic) environment has been used.
In particular, for this evaluation, Census-SGM ([22]) and MADNet ([8]) have been chosen. The former because it represents the preferred choice for hardware implementation on custom stereo cameras. The latter because it well represents the category of modern end-to-end CNNs characterized by a good trade-off between accuracy and speed.
For confidence networks, ConfNet ([11]) has been selected. In this experiment, it has been assumed to have versions of ConfNet pre-trained with the different self-supervision paradigms, respectively SELF ([1]) and WILD ([2]), on the first 20 images of the KITTI 2012 training set ([10]).
For OTB, the [Tp, Ap, Up, Tq, Aq, Uq] configuration has been used. When performing online adaptation (online entry), for each stereo pair the confidence is estimated and evaluated before loss computation (thus, supervision only acts on the upcoming frames).
This way, ConfNet runs at 0.08 seconds per frame (12 FPS) against 0.02 seconds (50 FPS) without adaptation on a Titan Xp. The fourth table shown in Fig. 4 collects the outcome of this evaluation. It is pointed out that WILD cannot be deployed for MADNet, since a meaningful cost volume is not available. On the other hand, SELF would require (DL, DR) for supervision, while MADNet provides only the former.
Assuming this network as a gray-box model, this issue has been circumvented at training time by obtaining DR as shown in equation (1). Concerning SGM, OTB performs in between WILD and SELF. Nevertheless, keeping continuous adaptation active on the whole sequence makes it outperform both by a good margin. Concerning MADNet, SELF results more effective than OTB.
Again, performing online adaptation makes OTB the best solution in this case as well. Finally, Fig. 5 shows qualitative examples for the SGM algorithm.
On-the-fly learning with black-box sensors
Finally, it is reported, as qualitative results, the outcome obtained by learning on-the-fly a confidence measure on disparity maps sourced by an Apple iPhone® XS, without any pre-training.
Purposely, a sequence of about 100 pairs on which ConfNet has been trained on-the-fly has been collected.
In particular, Fig. 6 shows examples of acquired disparity and confidence maps estimated by ConfNet adapted online. More specifically, the very few frames collected are sufficient to learn how to detect gross errors, like those on the turtle's shell.
Qualitative results on a variety of algorithms
Furthermore, as shown in Fig. 7 on a variety of algorithms, the present technical solution performs better with respect to known strategies requiring full access to the cost volume (see [2]) or static scenes for training (see [1]).
Conclusion
In light of the above, it has been introduced a novel self-supervised paradigm aimed at learning from scratch a confidence measure for stereo.
In particular, few, principled cues from the input stereo pair and the estimated disparity have been used in order to source supervision signals in place of disparity ground truth labels.
Being such cues available during deployment in-the-wild, the present invention is suited for continuous online adaptation on any black-box framework.
Furthermore, experimental results proved that the present method is shows a high performance if compared with existing self-supervised approaches and, conversely to them, allow to further improvements during deployment by leveraging the online self-adaptation process.
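The core of the paradigm, sourcing proxy labels from self-supervising cues and scoring the confidence map against them with a Multi-modal Binary Cross Entropy, can be illustrated with a minimal sketch. The `mbce_loss` helper and the toy data below are assumptions for illustration and stand in for the loss over the proxy-label sets P and Q described in the claims.

```python
import numpy as np

def mbce_loss(o, P, Q, eps=1e-7):
    """Multi-modal Binary Cross Entropy over proxy-label sets:
    pixels in P satisfied the self-supervising criteria (treated as confident),
    pixels in Q violated them (treated as uncertain)."""
    o = np.clip(o, eps, 1 - eps)
    return -(np.log(o[P]).mean() + np.log(1 - o[Q]).mean())

# Toy per-pixel confidence map and proxy labels from a hypothetical criterion.
conf = np.array([0.9, 0.8, 0.2, 0.1, 0.5])
criterion_met = np.array([True, True, False, False, True])
P = np.flatnonzero(criterion_met)     # proxy positives
Q = np.flatnonzero(~criterion_met)    # proxy negatives

loss_good = mbce_loss(conf, P, Q)     # confidence agrees with the cues
loss_bad = mbce_loss(1 - conf, P, Q)  # inverted confidence is penalized more
```

Minimizing this signal requires no disparity ground truth, which is what makes the measure deployable on any black-box stereo source.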
Advantages
An advantage of the method according to the present invention is that of allowing a self-adapting confidence estimation agnostic to the stereo algorithm or network.
Another advantage of the present invention is that of learning an effective confidence measure based only on the minimum information available in any stereo setup (i.e., the input stereo pair of images and the output disparity map).
The present invention has been described for illustrative but not limitative purposes, according to its preferred embodiments, but it is to be understood that modifications and/or changes can be introduced by those skilled in the art without departing from the relevant scope as defined in the enclosed claims.
References
[1] Mostegel, C., Rumpler, M., Fraundorfer, F., Bischof, H.: “Using self-contradiction to learn confidence measures in stereo vision.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016);
[2] Tosi, F., Poggi, M., Tonioni, A., Di Stefano, L., Mattoccia, S.: “Learning confidence measures in the wild.” In: BMVC (Sept 2017);
[3] Hu, X., Mordohai, P.: “A quantitative evaluation of confidence measures for stereo vision.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2121-2133 (2012);
[4] Godard, C., Mac Aodha, O., Brostow, G.J.: “Unsupervised monocular depth estimation with left-right consistency.” In: CVPR (2017);
[5] Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: “Towards real-time unsupervised monocular depth estimation on CPU.” In: IEEE/RSJ Conference on Intelligent Robots and Systems (IROS) (2018);
[6] Godard, C., Mac Aodha, O., Brostow, G.J.: “Digging into self-supervised monocular depth estimation.” In: ICCV (2019);
[7] Zhang, Z., Cui, Z., Xu, C., Jie, Z., Li, X., Yang, J.: “Joint task-recursive learning for semantic segmentation and depth estimation.” In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 235-251 (2018);
[8] Tonioni, A., Tosi, F., Poggi, M., Mattoccia, S., Di Stefano, L.: “Real-time self-adaptive deep stereo.” In: CVPR (June 2019);
[9] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: “Image quality assessment: from error visibility to structural similarity.” IEEE transactions on image processing 13(4), 600-612 (2004);
[10] Poggi, M., Tosi, F., Mattoccia, S.: “Quantitative evaluation of confidence measures in a machine learning world.” pp. 5228-5237 (2017);
[11] Tosi, F., Poggi, M., Benincasa, A., Mattoccia, S.: “Beyond local reasoning for stereo confidence estimation with deep learning.” pp. 319-334 (2018);
[12] Kim, S., Kim, S., Min, D., Sohn, K.: “LAF-Net: locally adaptive fusion networks for stereo confidence estimation.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019);
[13] Poggi, M., Mattoccia, S.: “Learning from scratch a confidence measure.” In: BMVC (2016);
[14] Gul, M.S.K., Batz, M., Keinert, J.: “Pixel-wise confidences for stereo disparities using recurrent neural networks.” In: BMVC (2019);
[15] Geiger, A., Lenz, P., Urtasun, R.: “Are we ready for autonomous driving? The KITTI vision benchmark suite.” In: CVPR (2012);
[16] Menze, M., Geiger, A.: “Object scene flow for autonomous vehicles.” In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015);
[17] Scharstein, D., Hirschmuller, H., Kitajima, Y., Krathwohl, G., Nesic, N., Wang, X., Westling, P.: “High-resolution stereo datasets with subpixel-accurate ground truth.” In: German Conference on Pattern Recognition, pp. 31-42. Springer (2014);
[18] Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: “A multi-view stereo benchmark with high-resolution images and multi-camera videos.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3260-3269 (2017);
[19] Yang, G., Song, X., Huang, C., Deng, Z., Shi, J., Zhou, B.: “DrivingStereo: a large-scale dataset for stereo matching in autonomous driving scenarios.” In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019);
[20] Zbontar, J., LeCun, Y.: “Stereo matching by training a convolutional neural network to compare image patches.” Journal of Machine Learning Research 17(1-32), 2 (2016);
[21] Zhang, K., Lu, J., Lafruit, G.: “Cross-based local stereo matching using orthogonal integral images.” IEEE transactions on circuits and systems for video technology 19(7), 1073-1079 (2009).
[22] Hirschmuller, H.: “Accurate and efficient stereo processing by semi-global matching and mutual information.” In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 807-814. IEEE (2005);
[23] Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
Claims
1. Method for determining the confidence of a disparity map DL by training a neural network (14), wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL, from at least one digital image IL, IR of a scene, comprising the following steps:
A. acquiring said at least one digital image IL , IR of said scene;
B. calculating said disparity map DL for each pixel of said at least one digital image IL, IR ;
C. extracting at least one self-supervising criterion from said at least one digital image IL , IR and said disparity map DL;
D. calculating a confidence map CM, from said disparity map DL, by means of said neural network (14);
E. calculating a loss signal LMBCE from said confidence map CM and said at least one self-supervising criterion; and
F. optimizing said neural network (14) by training said neural network (14) with the information associated with said loss signal LMBCE.
2. Method according to the preceding claim, characterized in that said at least one self-supervising criterion extracted in said step C comprises at least one of the following criteria: a self-supervising criterion related to an image reprojection error between said at least one digital image IL, IR ; a self-supervising criterion related to a disparity agreement between pixels of said disparity map DL; and/or a self-supervising criterion related to the uniqueness of any pixel in IL and IR, respectively.
3. Method according to claim 2, characterized in that said self-supervising criterion related to an image reprojection error between said at least one digital image IL , IR , is calculated according to the following equation:
R = r < θr
wherein r = α · (1 − SSIM(IL, ĨL))/2 + (1 − α) · |IL − ĨL| and ĨL(x, y) = IR(x − DL(x, y), y), being ĨL a reprojection of IR on reference image coordinates, SSIM is the Structural SIMilarity index, θr is a threshold, and α is a parameter ranging from 0 to 1, preferably tuned to 0.85.
4. Method according to any one of claims 2 or 3, characterized in that said self-supervising criterion related to a disparity agreement between pixels of said disparity map DL, is calculated according to the following equation:
A = DA > 0.5
wherein DA is derived from HN×N, a histogram encoding, for each pixel (x, y) of said disparity map DL, the number of neighbours in an N×N window having the same disparity d.
6. Method according to any one of the preceding claims, characterized in that said loss signal LMBCE is a Multi-modal Binary Cross Entropy loss signal calculated according to the following equation:
LMBCE = −(1/|P|) Σp∈P log(op) − (1/|Q|) Σq∈Q log(1 − oq)
where o ∈ [0, 1] is the output of said neural network (14), and P and Q are two sets of proxy labels derived, respectively, from a self-supervising criterion comprised in said at least one self-supervising criterion being met or not.
7. Method according to any one of the preceding claims, characterized in that said step B is carried out according to the following formula DL = S(IL,IR).
8. Method according to any one of the preceding claims, characterized in that said step B is carried out by means of a network S.
9. Method according to any one of the preceding claims, characterized in that said step A is carried out by an image detecting unit (10) comprising at least one image detecting device (100, 101) for detecting said at least one digital image IL, IR, in that said step B is carried out by first processing means (11), connected to said image detecting device (10), in that said step C is carried out by a filter (12), connected to said image detecting device (10) and said first processing means (11), and in that said steps E and F are carried out by second processing means (13), connected to said filter (12) and said neural network (14).
10. Method according to any one of the preceding claims, characterized in that said step A is carried out by a stereo matching technique, so as to detect a reference image IL and a target image IR of said scene.
11. Processing unit (U) for determining the confidence of a disparity map DL, wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL, from at least one digital image IL, IR of a scene, wherein the disparity map DL is achieved through a network S, and wherein the processing unit (U) is configured to execute the steps B-F of said method, according to any one of claims 1-9.
12. Processing unit (U) according to claim 11, characterized in that it comprises: processing means (11, 13), connected to said image detecting unit (10), a filter (12), connected to said image detecting unit (10) and said processing means (11, 13), and configured for extracting at least one self-supervising criterion from said at least one digital image IL , IR and said disparity map DL, and a neural network (14), connected to said processing means (11, 13), configured for producing a confidence map CM from said disparity map DL, wherein said processing means (11, 13) are configured for determining a disparity map DL from said at least one digital image IL , IR , and for calculating a loss signal from said confidence map CM and said at least one self-supervising criterion.
13. Sensor system (1) for determining the confidence of a disparity
map DL from at least one digital image IL, IR of a scene, comprising an image detection unit (10) configured for acquiring said at least one digital image IL, IR of said scene, and a processing unit (U) according to any one of claims 11 or 12, connected to said image detecting unit (10).
14. Computer program comprising instructions which, when the program is executed by a computer, cause the processor to perform the method steps according to any one of claims 1-10.
15. Computer-readable storage medium comprising instructions which, when executed by a computer, cause the processor to perform the method steps according to any one of claims 1-10.
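As an illustration of the disparity-agreement cue recited in claim 4, a minimal sketch follows. The direct neighbour count below stands in for the histogram HN×N (counting neighbours with the same disparity is equivalent to reading the histogram bin of the central pixel's disparity); the window size, the normalization over N×N − 1 neighbours and the integer-disparity assumption are illustrative.

```python
import numpy as np

def disparity_agreement(D, N=5):
    """For each pixel, the fraction of neighbours in an N x N window sharing
    the same (integer) disparity - a proxy for local consistency."""
    H, W = D.shape
    r = N // 2
    Dp = np.pad(D, r, mode="edge")          # replicate borders
    same = np.zeros((H, W), dtype=np.float64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue                    # skip the pixel itself
            shifted = Dp[r + dy : r + dy + H, r + dx : r + dx + W]
            same += (shifted == D)
    return same / (N * N - 1)

# A flat disparity region agrees everywhere; an isolated outlier does not.
D = np.full((9, 9), 10, dtype=np.int32)
D[4, 4] = 42                                # spurious disparity value
DA = disparity_agreement(D)
A = DA > 0.5                                # the agreement criterion A = DA > 0.5
```

Pixels where A is False can then feed the proxy-negative set used by the loss signal, without any ground-truth disparity.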
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IT102020000016054A IT202000016054A1 (en) | 2020-07-02 | 2020-07-02 | METHOD FOR DETERMINING THE CONFIDENCE OF A DISPARITY MAP BY SELF-ADAPTIVE LEARNING OF A NEURAL NETWORK, AND RELATED SENSOR SYSTEM |
IT102020000016054 | 2020-07-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022003740A1 true WO2022003740A1 (en) | 2022-01-06 |
Family
ID=72644653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IT2021/050193 WO2022003740A1 (en) | 2020-07-02 | 2021-06-21 | Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof |
Country Status (2)
Country | Link |
---|---|
IT (1) | IT202000016054A1 (en) |
WO (1) | WO2022003740A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023146882A1 (en) * | 2022-01-26 | 2023-08-03 | Meta Platforms Technologies, Llc | Display system with machine learning (ml) based stereoscopic view synthesis over a wide field of view |
WO2023201903A1 (en) * | 2022-04-18 | 2023-10-26 | 清华大学 | Occlusion-aware-based unsupervised light field disparity estimation system and method |
CN117907242A (en) * | 2024-03-15 | 2024-04-19 | 贵州省第一测绘院(贵州省北斗导航位置服务中心) | Homeland mapping method, system and storage medium based on dynamic remote sensing technology |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116433588B (en) * | 2023-02-21 | 2023-10-03 | 广东劢智医疗科技有限公司 | Multi-category classification and confidence discrimination method based on cervical cells |
- 2020-07-02: IT — application IT102020000016054A filed (publication IT202000016054A1, status unknown)
- 2021-06-21: WO — application PCT/IT2021/050193 filed (publication WO2022003740A1, active Application Filing)
Also Published As
Publication number | Publication date |
---|---|
IT202000016054A1 (en) | 2022-01-02 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21745455; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 21745455; Country of ref document: EP; Kind code of ref document: A1 |