WO2022003740A1 - Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof - Google Patents
- Publication number
- WO2022003740A1 (PCT/IT2021/050193)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- the present invention relates to a method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof.
- the invention relates to a method and a sensor system of the mentioned type, designed in particular for determining the confidence of disparity maps inferred by a stereo algorithm or a network through a neural network capable of self-adapting, but which can be used for any type of image acquisition system, in which it is necessary to estimate the confidence, thus determining the level of certainty or uncertainty of each pixel of said image.
- the description will be directed to a self-supervised confidence estimation in a constrained setting, but it is clear that the same should not be considered limited to this specific use.
- stereo is one of the most popular strategies to accurately perceive the 3D structure of the scene, through two synchronized cameras and several algorithms, either hand-designed or based on deep neural networks. In many practical applications, alongside with disparity inference, confidence estimation is often performed as well. Purposely, a wide range of methods based either on hand-crafted measures or learning-based strategies have been proposed.
- Such a feature is highly desirable, since it potentially paves the way for learning confidence estimation for any stereo camera, even without any knowledge about the stereo algorithm/network deployed.
- a drawback of the technical solution according to the prior art is that it needs access to the cost volume, rarely exposed in the case of off-the-shelf stereo sensors mentioned above or not defined at all in most modern neural networks.
- Confidence measures can be divided into two main categories: hand-made and learned measures.
- the former category consists of conventional methods, typically computed from cost volume analysis, such as the ratio between the two minima (as in the so-called Peak-Ratio or PKR) or, as recently proposed, from local properties of the disparity map, like the number of pixels with the same disparity hypothesis.
- hand-made cues are usually combined and fed as input to a random forest classifier or to a Convolutional Neural Network (CNN), appropriately trained by deploying depth labels.
- Learned methods may require:
- CNNs have replaced single steps in the stereo pipeline, such as cost computation, rapidly converging towards end-to-end solutions estimating dense disparity maps by means of 2D or 3D networks.
- the latest trend in the field consists of casting disparity estimation as a continuous learning problem, thanks to the self-supervision enabled by image re-projection.
- Another object of the present invention is that of providing a method for self-adapting a confidence measure unconstrained to the stereo system deployed.
- a further object of the invention is that of providing a novel loss function built upon cues available from the input stereo pair and the output disparity only, needing no additional information to learn/adapt to the sensed environment.
- Another object of the present invention is that of providing a method and a system of high reliability, easy to implement, and competitive in terms of costs when compared to the known technique.
- object of the present invention is to provide the tools necessary for the execution of the method and the apparatuses to perform such method.
- said at least one self-supervising criterion extracted in said step C may comprise at least one of the following criteria: a self-supervising criterion related to an image reprojection error between said at least one digital image I L , I R ; a self-supervising criterion related to a disparity agreement between pixels of said disparity map D L ; and/or a self-supervising criterion related to the uniqueness of any pixel in I L and I R respectively.
- said self-supervising criterion related to an image reprojection error between said at least one digital image I L , I R may be calculated according to the following equation:
T(x,y) = α · (1 − SSIM(I L , Ĩ L )(x,y)) / 2 + (1 − α) · |I L (x,y) − Ĩ L (x,y)|
wherein Ĩ L is a reprojection of I R on reference image coordinates, obtained by warping I R according to said disparity map D L , SSIM is the Structural SIMilarity index and α is a parameter ranging from 0 to 1, preferably tuned to 0.85.
- said self-supervising criterion related to a disparity agreement between pixels of said disparity map D L may be calculated according to the following equation:
A DA (x,y) = H NxN (x, y, D L (x,y)) / N²
with the criterion being met when A DA > 0.5, wherein H NxN is a histogram encoding, for each pixel (x,y) of said disparity map D L , the number of neighbours in a NxN window having the same disparity d.
- said self-supervising criterion related to the uniqueness of any pixel in I L and I R respectively may be calculated by counting, through said disparity map D L , how many pixels of I L match the same pixel of I R , the criterion being met for a pixel (x,y) when it is the only one matching its corresponding pixel (x − D L (x,y), y) in I R .
- said loss signal L MBCE is a Multi-modal Binary Cross Entropy loss signal which may be calculated according to the following equation:
L MBCE = − Σ Pk∈P Σ (x,y)∈Pk log o(x,y) − Σ Qk∈Q Σ (x,y)∈Qk log (1 − o(x,y))
where o ∈ [0,1] is the output of said neural network (14), P and Q are two sets of proxy labels derived respectively by a self-supervising criterion comprised in said at least one self-supervising criteria being met or not.
- said step B may be carried out by means of a network S.
- said step A may be carried out by an image detecting unit comprising at least one image detecting device for detecting said at least one digital image I L , I R
- said step B may be carried out by first processing means, connected to said image detecting device
- said step C may be carried out by a filter, connected to said image detecting device and said first processing means
- said steps E and F may be carried out by a second processing means, connected to said filter and said neural network.
- said step A may be carried out by a stereo matching technique, so as to detect a reference image I L and a target image I R of said scene.
- the processing unit may comprise processing means, connected to said image detecting unit, a filter, connected to said image detecting unit and said processing means, and configured for extracting at least one self-supervising criteria from said at least one digital image I L , I R and said disparity map D L , and a neural network, connected to said processing means, configured for producing a confidence map CM from said disparity map D L , wherein said processing means are configured for determining a disparity map D L from said at least one digital image I L , I R , and for calculating a loss signal from said confidence map CM and said at least one self-supervising criteria.
- a sensor system for determining the confidence of a disparity map D L from at least one digital image I L , I R of a scene, comprising an image detection unit configured for acquiring said at least one digital image I L , I R of said scene, and a processing unit connected to said image detecting unit.
- It is also object of the present invention a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the present method steps.
- Fig. 1 illustrates a block diagram of an embodiment of the sensor system for determining the confidence of a disparity map by self-adaptive learning of a neural network, according to the present invention
- Fig. 2 illustrates a flowchart concerning the steps of the method for determining the confidence of a disparity map by self-adaptive learning of a neural network, according to the present invention
- Fig. 3 illustrates, given a highlighted region, a set of inliers and a set of outliers, which are determined by using different configurations of self-supervising criteria, according to the present invention
- Fig. 4 illustrates a table which reports AUC scores for networks trained on a first set of test images and tested on an unseen set of test images
- Fig. 5 illustrates from left: a reference image, a disparity map and confidence maps by existing self-supervised approaches [2], [1], the proposed technique and the proposed technique during online adaptation;
- Fig. 6 illustrates two examples of reference image and disparity map acquired with an iPhone XS, followed by the estimated confidence map after a few iterations of on-the-fly learning;
- Fig. 7 illustrates a reference image, disparity maps from various algorithms and the confidence estimated by self-supervised frameworks and the present method.
- a sensor system for determining the confidence of a disparity map through a self-adaptive learning of a neural network which comprises an image detecting unit 10 and a processing unit U, connected to said image detecting unit 10.
- said processing unit U comprises first processing means 11 connected to said image detecting unit 10, a filter 12 connected to said image detecting unit 10 and said first processing means 11 , second processing means 13, connected to said filter 12, and a neural network or confidence network 14, connected to said first processing means 11 and to said second processing means 13.
- said first processing means 11 and said second processing means 13 are two different processing means.
- said first processing means 11 and said second processing means 13 can be considered as the same processing means or integrated, for instance in a same microprocessor.
- said image detecting unit 10 is a stereoscopic vision system.
- said image detecting unit 10 can be any other system even according to the prior art capable of obtaining disparity or distance maps from digital images or other methods.
- said image detecting unit 10 comprises a first image detecting device 100 and a second image detecting device 101 , such as a video camera, a photo camera or a sensor, arranged at a predetermined fixed distance from each other.
- the image detecting unit 10 can comprise a number of detecting devices other than two, for example, one, as in monocular systems for depth estimation from images.
- each of said image detecting devices 100, 101 detects a respective image of the object or the scene observed.
- the image acquired by means of said image detecting device 100, i.e. the left image, will be considered as the reference image I L , while the image acquired through said image detecting device 101, i.e. the right image, will be considered as the target image I R .
- each image acquired by the respective detection device 100, 101 can be considered as reference image I L or target image I R .
- said first processing means 11 are connected to said image detecting devices 100, 101 .
- said first processing means 11 are configured to process said images I L and I R in order to generate a disparity map D L .
- the output disparity map D L is computed assuming I L as the reference image.
- the output disparity map can be computed assuming I R as the reference image.
- said first processing means 11 generates said disparity map D L by means of a stereo algorithm S.
- said first processing means 11 may employ additional algorithms, networks, programs or other computer sensors, capable of generating disparity maps.
- said filter 12 is capable of extracting a plurality of self-supervising criteria from said disparity map D L and said images I L and I R , in order to provide a self-adaptive learning of said confidence network 14, as better explained below.
- the extracted self-supervising criteria are three, herein referred to as T, A and U, related to image re-projection error, disparity agreement between nearby pixels of an image and uniqueness constraint between pixels of different images, respectively.
- Said second processing means 13 are then configured to determine an evaluation of the loss based on said three self-supervising criteria T, A and U, in order to evaluate the output of the neural network 14, so as to train the same online, namely during its operation, without any external data for training the same.
- the second processing means 13 calculate a Multimodal Binary Cross Entropy (MBCE) loss signal from a combination of the outcomes of said three self-supervising criteria T, A and U, and a confidence map CM is computed by said confidence network 14.
- said confidence network 14 is connected to said first processing means 11 and to said second processing means 13.
- said confidence network 14 is configured to determine said confidence map CM from said disparity map D L .
- said confidence map CM ranks pixels of the disparity map D L from less to more reliable (from black to white).
- said confidence network 14 is capable of updating its own knowledge of the surrounding environment by means of the evaluation of the Multi-modal Binary Cross Entropy (MBCE) loss signal computed by said second processing means 13.
- the first processing means 11 , the second processing means 13, the filter 12, and the neural network or confidence network 14, can be integrated in a single processing unit U, properly programmed.
- FIG. 2 a flowchart of the method according to the present invention is shown, which can be executed also by the system of Fig. 1.
- the step of acquiring images provides the acquisition of a reference image I L and a target image I R related to an object or a scene observed by means of said image detecting unit 10.
- step B said first processing means 11 process said images I L and I R , in order to generate a disparity map D L by means of said stereo algorithm S.
- the present method provides an image processing using a stereo algorithm S.
- step C said filter 12 extracts said three self- supervising criteria T, A and U from the two images I L and I R and the disparity map D L .
- step D said confidence network 14 determines said confidence map CM from said disparity map D L .
- step E said second processing means 13 compute the MBCE loss signal from said confidence map CM and a combination of one or more of said self-supervising criteria T, A and U. It is noted that also other self-supervising criteria from said disparity map D L can be used, alternatively or in addition to the three self-supervising criteria T, A and U specified above, without departing from the scope of protection of the invention herein disclosed.
- step F said confidence network 14 is updated based on said MBCE loss signal computed in said step E.
- the parameters of said neural network 14 are continuously updated in order to adapt said neural network 14 itself to the environment related to the scene observed.
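The continuous update of steps D-F can be pictured with a deliberately tiny stand-in for the confidence network 14: a per-pixel logistic model over a hypothetical hand-picked feature, adapted online by gradient descent on a binary cross entropy over proxy labels. The feature, its value ranges and the learning rate are illustrative assumptions only; in the invention the model is a neural network and the labels come from the criteria T, A and U.

```python
import numpy as np

def confidence(feats, w):
    """Step D: per-pixel confidence via a logistic model (toy network)."""
    return 1.0 / (1.0 + np.exp(-(feats @ w)))

def adapt_step(feats, w, pos, neg, lr=0.5):
    """Steps E-F: gradient of the BCE loss on proxy labels, one update of w."""
    o = confidence(feats, w)
    grad = np.zeros_like(w)
    grad -= ((1 - o[pos])[:, None] * feats[pos]).sum(0)   # -log(o) term
    grad += (o[neg][:, None] * feats[neg]).sum(0)         # -log(1-o) term
    return w - lr * grad / max(pos.sum() + neg.sum(), 1)

rng = np.random.default_rng(0)
# one hypothetical scalar feature per pixel (plus a bias column):
# proxy-positive pixels score high, proxy-negative pixels score low
score = np.concatenate([rng.uniform(0.7, 1.0, 50), rng.uniform(0.0, 0.3, 50)])
feats = np.stack([score, np.ones(100)], axis=1)
pos = np.arange(100) < 50                 # labels from criteria being met
neg = ~pos
w = np.zeros(2)
for _ in range(500):                      # online adaptation over frames
    w = adapt_step(feats, w, pos, neg)
o = confidence(feats, w)
```

After a few hundred updates the model assigns high confidence to proxy-positive pixels and low confidence to proxy-negative ones, mirroring how the network's parameters track the observed scene.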
- the present invention aims at proposing a self-supervised paradigm suited for learning a confidence measure, unconstrained from the specific stereo method deployed and capable of self-adaptation.
- three main broad categories of stereo matching solutions are herein defined, each one characterized by different data made available during deployment. It is clear that the stereo matching systems herein disclosed are just possible embodiments and also other stereo matching systems can be available and implemented.
- a generic rectified stereo pair will be referred to as (I L , I R ), respectively made of left and right images, and a generic stereo algorithm or deep network will be referred to as S. Furthermore, in the remainder of the description, in order to simplify notation, (x,y) coordinates will be omitted if not strictly necessary.
- This image triplet is the minimum amount of data available out of any stereo method, and all the systems making available only such cues are here defined as “black-box systems”.
- black-box systems are highly representative of off-the-shelf stereo cameras (e.g., Stereolabs ZED 2) or stereo methods implemented in consumer devices (e.g., Apple iPhones).
- while black-box systems provide cues available in any stereo system, when explicit calls to the algorithm APIs are exposed, additional cues can be retrieved.
- a second family of systems can be implemented, for which, although it is given no access to the algorithm implementation or its intermediate data, explicit calls to the method itself are possible (e.g. stereo algorithms provided by pre-compiled libraries).
- with such access, a Left to Right Consistency (LRC) check can be performed: a second disparity map D R is computed assuming I R as the reference image, and each pixel of D L is compared with the disparity sampled from D R at its matching location:
LRC(x,y) = |D L (x,y) − π((x − D L (x,y), y), D R )|
wherein π(a,b) is a sampling operator, collecting values at coordinate a from b, and a threshold value (usually 1) is set, above which D L and D R are considered inconsistent.
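The LRC check can be sketched as follows; the disparity convention (a left pixel (x, y) matches the right pixel (x − D L (x,y), y)), the rounding of disparities and the clipping at image borders are illustrative assumptions.

```python
import numpy as np

def lrc_mask(d_left, d_right, tau=1.0):
    """Left-right consistency: True where the disparity sampled from D_R
    at the matching location agrees with D_L within a threshold tau."""
    h, w = d_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    ys = np.tile(np.arange(h)[:, None], (1, w))
    match_x = np.clip(xs - np.round(d_left).astype(int), 0, w - 1)
    sampled = d_right[ys, match_x]          # the sampling operator pi
    return np.abs(d_left - sampled) <= tau
```

Pixels failing the check (e.g. occlusions or mismatches between the two views) are marked inconsistent and treated as likely outliers.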
- Peak-Ratio (PKR) and Left-Right Difference (LRD) are defined in terms of d1 and d2, respectively the disparity hypotheses corresponding to the minimum cost and the second local minimum (see for example [3]).
- as regards LRD, given the cost volume V R computed assuming I R as the reference image, for any pixel (x,y) costs are sampled at (x − d1, y), i.e., from the estimated matching pixel.
- the method comprises a general-purpose strategy enabling self-supervised confidence estimation in such constrained settings.
- the method can be used even for state-of-the-art CNNs.
- out-of-the-box learning of confidence estimation with any stereo setup and self-adaptation in any environment is available.
Determination of the three self-supervising criteria
- the data available comprise (I L , I R ) and D L only.
- an image re-projection error is considered.
- α is a parameter ranging from 0 to 1, preferably tuned to 0.85. The higher the image reprojection error is, the more likely D L is wrong.
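A sketch of this criterion follows: the right image is warped onto the left view through D L and the photometric error mixes an SSIM term with an L1 term weighted by α. The 3 x 3 local statistics, the SSIM constants, the rounding of disparities and the border clipping are simplifying assumptions.

```python
import numpy as np

def local_mean(img, win=3):
    """Mean over a win x win neighbourhood (edge padding)."""
    pad = win // 2
    p = np.pad(img, pad, mode='edge')
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(win):
        for dx in range(win):
            out += p[dy:dy + h, dx:dx + w]
    return out / win ** 2

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural similarity from local means and (co)variances."""
    mx, my = local_mean(x), local_mean(y)
    sx = local_mean(x * x) - mx * mx
    sy = local_mean(y * y) - my * my
    sxy = local_mean(x * y) - mx * my
    return ((2 * mx * my + c1) * (2 * sxy + c2)) / (
        (mx * mx + my * my + c1) * (sx + sy + c2))

def reprojection_error(i_l, i_r, d_l, alpha=0.85):
    """alpha * (1 - SSIM)/2 + (1 - alpha) * L1 between I_L and warped I_R."""
    h, w = i_l.shape
    xs = np.clip(np.arange(w)[None, :] - np.round(d_l).astype(int), 0, w - 1)
    ys = np.tile(np.arange(h)[:, None], (1, w))
    warped = i_r[ys, xs]                     # reprojection of I_R on I_L
    return alpha * (1 - ssim_map(i_l, warped)) / 2 \
        + (1 - alpha) * np.abs(i_l - warped)
```

With a correct disparity the warped image matches I L and the error vanishes; wrong disparities warp mismatched content and raise it.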
- the present invention aims at detecting regions with rich texture, being more likely to be correctly estimated by S, by comparing the disparity computed between (I L , I R ) with the one obtained after reprojection.
- D L itself allows for the extraction of meaningful cues to assess the quality of disparity assignments.
- H NxN is a histogram encoding, for each pixel (x,y), the number of neighbours in a NxN window having the same disparity d (in case of subpixel precision, within 1 pixel).
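A sketch of the agreement criterion: for each pixel, count the fraction of its N x N neighbourhood sharing its disparity. The rounding to the nearest integer (an approximation of the within-1-pixel rule) and the edge padding are illustrative assumptions.

```python
import numpy as np

def disparity_agreement(d_l, n=5):
    """Fraction of the n x n neighbourhood sharing each pixel's disparity."""
    h, w = d_l.shape
    d = np.round(d_l).astype(int)   # approximates the within-1-pixel rule
    p = np.pad(d, n // 2, mode='edge')
    agree = np.zeros((h, w))
    for dy in range(n):
        for dx in range(n):
            agree += (p[dy:dy + h, dx:dx + w] == d)
    return agree / n ** 2
```

Isolated outliers score near 1/N² (only the pixel itself agrees), while pixels inside locally consistent regions score near 1.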
- the uniqueness constraint is considered.
- the uniqueness for any pixel in I L holds if it does not collide in the target image with any other pixel, i.e., if it does not match the same pixel in I R matched by any other pixel.
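A sketch of the uniqueness test: for each row, count how many left-image pixels map (through D L ) onto each right-image column; a pixel meets the criterion only when no other pixel shares its match. Integer rounding and border clipping are illustrative assumptions.

```python
import numpy as np

def uniqueness_mask(d_l):
    """True where a pixel of I_L is the only one matching its pixel of I_R."""
    h, w = d_l.shape
    xs = np.arange(w)
    tgt = np.clip(xs[None, :] - np.round(d_l).astype(int), 0, w - 1)
    unique = np.zeros((h, w), dtype=bool)
    for y in range(h):
        counts = np.bincount(tgt[y], minlength=w)  # matches per right column
        unique[y] = counts[tgt[y]] == 1            # exactly one source pixel
    return unique
```

Collisions typically arise at occlusions, where several foreground and background pixels compete for the same target pixel.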
- a Multi-modal Binary Cross Entropy (MBCE) loss is defined for each pixel of the acquired image as:
L MBCE = − Σ Pk∈P Σ (x,y)∈Pk log o(x,y) − Σ Qk∈Q Σ (x,y)∈Qk log (1 − o(x,y))
where o ∈ [0,1] is the output of the neural network, i.e. passed through a sigmoid activation, and P and Q are two sets of proxy labels, derived respectively by a self-supervising criterion being met or not.
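The multi-modal supervision can be sketched as follows: each criterion contributes a mask of proxy-positive pixels (pushing the network output towards 1) and a mask of proxy-negative pixels (towards 0), while unvoted pixels contribute nothing. The per-set summation is a plausible reading of the multi-modal combination, stated here as an assumption.

```python
import numpy as np

def mbce_loss(o, pos_masks, neg_masks, eps=1e-7):
    """Binary cross entropy summed over the proxy-label sets P and Q."""
    o = np.clip(o, eps, 1 - eps)          # numerical safety for the logs
    loss = 0.0
    for p in pos_masks:                   # criteria met: push o towards 1
        loss -= np.log(o[p]).sum()
    for q in neg_masks:                   # criteria violated: towards 0
        loss -= np.log(1 - o[q]).sum()
    return loss
```

With empty masks the loss is zero, which is exactly the "no guesses" behaviour for pixels where no criterion votes.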
- Fig. 3 illustrates, given a highlighted region, a set of inliers (also shown in green colour) and a set of outliers (also shown in red colour), which are determined by using the following configurations of self-supervising criteria in the multi-modal binary cross-entropy loss signal: a) Tp, Tn b) Ap, An c) Up, Un d) Tp, Ap, Up, Tn e) Tp, Ap, Up, Tn, An, Un, while for black pixels, the considered configuration gives no guesses.
- Fig. 3 highlights how, when combining multiple guesses as in cases d) and e), no supervision is given for some pixels when the self-supervising criteria do not agree.
- the system and the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be used to realize a depth estimation sensor, capable of providing a machine-learning based estimate of confidence without having to acquire datasets for learning, which is highly expensive and complicated to perform with techniques belonging to the state of the art.
- Possible applications of the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be:
- estimated confidence map CM ranks pixels from less to more reliable (from black to white). It can be used to extract a subset of reliable points to be used by “guided stereo” and “real-time self-adaptive” technologies, and filter out less reliable pixels and replace them with better estimates;
- evaluation is carried out Out-of-The-Box (OTB) by means of the Area Under the Curve (AUC) metric: pixels are sorted in increasing order of confidence and gradually removed (e.g., 5% each time) from the disparity map.
- the error rate is computed over the sparse disparity map as the percentage of pixels having absolute error larger than t.
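The sparsification procedure can be sketched as follows; a discrete average over the retained fractions stands in for the integral, and the 5% step and the boolean error mask are assumptions consistent with the text.

```python
import numpy as np

def auc_sparsification(conf, err_mask, step=0.05):
    """Average error rate while keeping growing fractions of the most
    confident pixels (a discrete area under the sparsification curve)."""
    order = np.argsort(conf.ravel())[::-1]     # most confident first
    errs = err_mask.ravel()[order].astype(float)
    n = errs.size
    rates = []
    for frac in np.arange(step, 1.0 + 1e-9, step):
        kept = errs[: max(int(round(frac * n)), 1)]
        rates.append(kept.mean())              # error rate of the sparse map
    return float(np.mean(rates))
```

A confidence measure that ranks wrong pixels last yields a much lower score than one that ranks them first, so lower AUC means better confidence.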
- comparisons are reported against ConfNet [11], SELF [1] and WILD [2].
- Fig. 5 shows qualitative examples for the SGM algorithm.
- Fig. 6 shows examples of acquired disparity and estimated confidence maps by ConfNet adapted online. More specifically, the very few frames collected are sufficient to learn how to detect gross errors like those on the turtle's shell.
- the present invention is suited for continuous online adaptation on any black-box framework. Furthermore, experimental results proved that the present method shows high performance compared with existing self-supervised approaches and, unlike them, allows further improvements during deployment by leveraging the online self-adaptation process.
- An advantage of the method according to the present invention is that of allowing a self-adapting confidence estimation agnostic to the stereo algorithm or network.
- Another advantage of the present invention is that of learning an effective confidence measure only based on the minimum information available in any stereo setup (i.e., the input stereo pair of images and the output disparity map).
Abstract
Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof The present invention relates to a method for determining the confidence of a disparity map by training a neural network (14), wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map, from at least one digital image of a scene, comprising the following steps: A. acquiring said at least one digital image of said scene; B. calculating said disparity map for each pixel of said at least one digital image; C. extracting at least one self-supervising criterion from said at least one digital image and said disparity map; D. calculating a confidence map from said disparity map, by means of said neural network (14); E. calculating a loss signal from said confidence map and said at least one self-supervising criterion; and F. optimizing said neural network (14) by training said neural network (14) with the information associated to said loss signal. The present invention also relates to a sensor system for determining the confidence of a disparity map from at least one digital image of a scene.
Description
Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof
The present invention relates to a method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof.
Field of the invention
More specifically, the invention relates to a method and a sensor system of the mentioned type, designed in particular for determining the confidence of disparity maps inferred by a stereo algorithm or a network through a neural network capable of self-adapting, but which can be used for any type of image acquisition system, in which it is necessary to estimate the confidence, thus determining the level of certainty or uncertainty of each pixel of said image. In the following, the description will be directed to a self-supervised confidence estimation in a constrained setting, but it is clear that the same should not be considered limited to this specific use.
Background
There are on the market several systems for acquiring images in 3D, in order to determine the depth of an image.
Currently, stereo is one of the most popular strategies to accurately perceive the 3D structure of the scene, through two synchronized cameras and several algorithms, either hand-designed or based on deep neural networks. In many practical applications, alongside with disparity inference, confidence estimation is often performed as well. Purposely, a wide range of methods based either on hand-crafted measures or learning-based strategies have been proposed.
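As an illustration of the hand-designed family of algorithms mentioned above, the following minimal sketch computes a disparity map by winner-takes-all selection over a Sum of Absolute Differences (SAD) cost volume. It is a toy example for intuition only, not the algorithm S of the invention; the window size, disparity range and box-filter aggregation are arbitrary choices.

```python
import numpy as np

def box_sum(img, win):
    """Sum over a win x win window around each pixel (zero padding)."""
    pad = win // 2
    p = np.pad(img, pad)
    c = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))  # integral image
    h, w = img.shape
    return (c[win:win + h, win:win + w] - c[:h, win:win + w]
            - c[win:win + h, :w] + c[:h, :w])

def sad_disparity(left, right, max_disp=8, win=3):
    """Winner-takes-all disparity from a SAD cost volume, left as reference."""
    h, w = left.shape
    cost = np.empty((h, w, max_disp))
    for d in range(max_disp):
        diff = np.full((h, w), 1e3)            # penalise out-of-view columns
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        cost[:, :, d] = box_sum(diff, win)     # aggregate costs locally
    return cost.argmin(axis=2)                 # pick the cheapest hypothesis
```

For a synthetic pair where the right image is the left one shifted by two pixels, the map converges to a constant disparity of 2 away from the image borders.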
Recently, it has been shown how state-of-the-art networks processing cues available from any stereo setup (i.e. the input stereo pair and the output disparity map) are substantially equivalent to those processing the entire cost volume, further supporting the evidence that the disparity map itself contains sufficient clues to identify outliers.
Such a feature is highly desirable, since it potentially paves the way for learning confidence estimation for any stereo camera, even without any knowledge about the stereo algorithm/network deployed.
This fact is very appealing since it frequently occurs with most industrial/off-the-shelf (e.g. Stereolabs ZED 2) or consumer devices (e.g. smartphones).
Nonetheless, this opportunity was investigated only partially in the literature. Moreover, all the above-mentioned methods are strongly constrained to the need for ground truth depth labels acquired in the target domain.
However, since achieving such labels is cumbersome and time- consuming, some self-supervised methods have been proposed in the prior art. Although these methods proved that confidence estimation could be learned without needing active sensors, they have various drawbacks.
One of the drawbacks of the known technical solutions is that static stereo sequences are required.
Moreover, a drawback of the technical solution according to the prior art is that it needs access to the cost volume, rarely exposed in the case of off-the-shelf stereo sensors mentioned above or not defined at all in most modern neural networks.
As a consequence, the solutions available in the prior art are not thought to handle adaptation, required to soften domain-shift issues. Thus, a solution for out-of-the-box deployment of self-adaptive confidence estimation would be highly desirable for many practical applications.
A notable example concerns smartphones, nowadays equipped with multiple cameras and stereo algorithms/networks deployed for augmented reality or other applications in unpredictable environments.
A short review of the prior art follows, concerning the literature on confidence measures and recent trends in stereo matching.
Confidence measures can be divided into two main categories: hand-made and learned measures.
The former category consists of conventional methods, typically computed from cost volume analysis, such as the ratio between the two minima (as in the so-called Peak-Ratio or PKR) or, as recently proposed, from local properties of the disparity map, like the number of pixels with the same disparity hypothesis.
As regards learned measures, hand-made cues are usually combined and fed as input to a random forest classifier or to a Convolutional Neural Network (CNN), appropriately trained by deploying depth labels.
Learned methods may require:
1) full access to the cost volume to extract hand-made features or process the volume itself;
2) disparity maps for both left and right viewpoint; or
3) only the input image and its corresponding disparity map.
The above-mentioned three requirements translate into harder to softer constraints at deployment, most of them usually not met by off-the-shelf stereo cameras, since these expose only the input stereo pair and the output disparity map to the user.
Recently, it has been shown that, although a CNN with access to the full cost volume can perform better than networks processing disparity and reference image only, the margin between the two approaches is small and, in most cases, negligible, at the cost of a much lower versatility of the former.
As to the applications of confidence measures, in addition to the traditional outliers filtering task, many higher-level applications exploit such cue for different purposes.
In particular, confidence has been estimated to detect ground control points to improve global optimization. A confidence-based modulation of the cost volume, applied before Semi-Global Matching (SGM) optimization, has also been proposed. Moreover, the streaking effects of the SGM stereo algorithm have been reduced by using a weighted sum of the scanlines according to a confidence measure.
Similarly, other approaches provide for fusing multiple scanlines of SGM using a random forest classifier.
Also, methods acting outside the stereo algorithms have been proposed for stereo algorithm fusion, sensor fusion and unsupervised adaptation of deep models for stereo matching.
Self-supervised learning has been barely investigated for confidence estimation.
According to some approaches (Mostegel et al., see [1]), stereo videos have been leveraged, looking at consistencies and contradictions between the different viewpoints of a static scene, in order to obtain correct and wrong candidates from a given stereo algorithm.
In other approaches (see Tosi et al. [2]), traditional confidence measures are instead relied upon to obtain these two sets according to a consensus among them.
In addition, at first, CNNs have replaced single steps in the stereo pipeline, such as cost computation, rapidly converging towards end-to-end solutions estimating dense disparity maps by means of 2D or 3D networks.
The latest trend in the field consists of casting disparity estimation as a continuous learning problem, thanks to the self-supervision enabled by image re-projection.
Scope of the invention
In light of the above, it is therefore an object of the present invention to overcome the drawbacks of the self-supervised methods proposed in the prior art, by providing a method for determining the confidence of a disparity map through a self-adaptive learning of a neural network.
Another object of the present invention is that of providing a method for self-adapting a confidence measure unconstrained to the stereo system deployed.
A further object of the invention is that of providing a novel loss function built upon cues available from the input stereo pair and the output disparity only, needing no additional information to learn/adapt to the sensed environment.
Another object of the present invention is that of providing a method and a system of high reliability, easy to implement, and competitive in terms of costs when compared to the known technique.
Still, object of the present invention is to provide the tools necessary for the execution of the method and the apparatuses to perform such method.
Object of the invention
It is therefore specific object of the present invention a method for determining the confidence of a disparity map DL by training a neural network, wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL, from at least one digital image IL, IR of a scene, comprising the following steps: A. acquiring said at least one digital image IL, IR of said scene; B. calculating said disparity map DL for each pixel of said at least one digital image IL, IR; C. extracting at least one self-supervising criterion from said at least one digital image IL, IR and said disparity map DL; D. calculating a confidence map CM from said disparity map DL by means of said neural network; E. calculating a loss signal LMBCE from said confidence map CM and said at least one self-supervising criterion; and F. optimizing said neural network by training said neural network with the information associated to said loss signal LMBCE.
Still according to the invention, said at least one self-supervising criterion extracted in said step C may comprise at least one of the following criteria: a self-supervising criterion related to an image reprojection error between said at least one digital image IL, IR; a self-supervising criterion related to a disparity agreement between pixels of said disparity map DL; and/or a self-supervising criterion related to the uniqueness of any pixel in IL and IR respectively.
Advantageously according to the invention, said self-supervising criterion related to an image reprojection error between said at least one digital image IL, IR may be calculated according to the following equation:
e(IL, ĨR) = α · (1 − SSIM(IL, ĨR))/2 + (1 − α) · |IL − ĨR|
wherein ĨR = π(DL, IR) is a reprojection of IR on reference image coordinates, SSIM is the Structural SIMilarity index and α is a parameter ranging from 0 to 1, preferably tuned to 0.85.
Conveniently according to the invention, said self-supervising criterion related to a disparity agreement between pixels of said disparity map DL may be calculated according to the following equations:
DA(x,y) = H_NxN(x, y, DL(x,y)) / N²
A = DA > 0.5
wherein H_NxN is a histogram encoding, for each pixel (x,y) of said disparity map DL, the number of neighbours in a N×N window having the same disparity d.
Always according to the invention, said self-supervising criterion related to the uniqueness of any pixel in IL and IR respectively may be calculated as:
U = UC
wherein UC(x,y) holds if and only if no other pixel of IL matches the same pixel (x − DL(x,y), y) in IR.
Still according to the invention, said loss signal LMBCE is a Multi-modal Binary Cross Entropy loss signal which may be calculated according to the following equation:
LMBCE = −(Π_{p∈P} p) · log(o) − (Π_{q∈Q} q) · log(1 − o)
where o ∈ [0,1] is the output of said neural network (14), P and Q are two sets of proxy labels derived respectively by a self-supervising criterion comprised in said at least one self-supervising criteria being met or not.
Always according to the invention, said step B may be carried out according to the following formula DL = S(IL,IR).
Conveniently according to the invention, said step B may be carried out by means of a network S. Advantageously according to the invention, said step A may be carried out by an image detecting unit comprising at least one image detecting device for detecting said at least one digital image IL, IR , said step B may be carried out by first processing means, connected to said image detecting device, said step C may be carried out by a filter, connected to said image detecting device and said first processing means, and said steps E and F may be carried out by a second processing means, connected to said filter and said neural network.
Still according to the invention, said step A may be carried out by a stereo matching technique, so as to detect a reference image IL and a target image IR of said scene.
It is also object of the present invention a processing unit for determining the confidence of a disparity map DL, wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL,
from at least one digital image IL, IR of a scene, wherein the disparity map DL is achieved through a network S, and wherein the processing unit is configured to execute the steps B-F of said method.
Conveniently according to the invention, the processing unit may comprise processing means, connected to said image detecting unit, a filter, connected to said image detecting unit and said processing means, and configured for extracting at least one self-supervising criteria from said at least one digital image IL, IR and said disparity map DL, and a neural network, connected to said processing means, configured for producing a confidence map CM from said disparity map DL, wherein said processing means are configured for determining a disparity map DL from said at least one digital image IL, IR , and for calculating a loss signal from said confidence map CM and said at least one self-supervising criteria.
It is also object of the present invention a sensor system for determining the confidence of a disparity map DL from at least one digital image IL, IR of a scene, comprising an image detection unit configured for acquiring said at least one digital image IL , IR of said scene, and a processing unit connected to said image detecting unit.
It is further object of the present invention a computer program comprising instructions which, when the program is executed by a computer, cause the computer to perform the present method steps.
It is also object of the present invention a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to perform the present method steps.
Brief description of the drawings
The present invention will be now described, for illustrative but not limitative purposes, according to its preferred embodiments, with particular reference to the figures of the enclosed drawings, wherein:
Fig. 1 illustrates a block diagram of an embodiment of the sensor system for determining the confidence of a disparity map by self-adaptive
learning of a neural network, according to the present invention;
Fig. 2 illustrates a flowchart concerning the steps of the method for determining the confidence of a disparity map by self-adaptive learning of a neural network, according to the present invention;
Fig. 3 illustrates, given a highlighted region, a set of inliers and a set of outliers, which are determined by using different configurations of self-supervising criteria, according to the present invention;
Fig. 4 illustrates a table which reports AUC scores for networks trained on a first set of test images and tested on an unseen set of test images;
Fig. 5 illustrates from left: a reference image, a disparity map and confidence maps by existing self-supervised approaches [2], [1], the proposed technique and the proposed technique during online adaptation;
Fig. 6 illustrates two examples of reference image and disparity map acquired with an iPhone XS, followed by estimated confidence map after few iterations of on the-fly learning; and
Fig. 7 illustrates reference images, disparity maps from various algorithms and confidence maps estimated by self-supervised frameworks and by the present method.
Detailed description
In the various figures, similar parts will be indicated by the same reference numbers.
With reference to the aforementioned Fig. 1 , a sensor system for determining the confidence of a disparity map through a self-adaptive learning of a neural network, indicated as a whole with the reference number 1 , is shown, which comprises an image detecting unit 10 and a processing unit U, connected to said image detecting unit 10.
In the present embodiment, said processing unit U comprises first processing means 11 connected to said image detecting unit 10, a filter 12 connected to said image detecting unit 10 and said first processing means
11 , second processing means 13, connected to said filter 12, and a neural network or confidence network 14, connected to said first processing means 11 and to said second processing means 13.
In the embodiment according to the present invention, said first processing means 11 and said second processing means 13 are two different processing means.
However, in other embodiments of the present invention, said first processing means 11 and said second processing means 13 can be considered as the same processing means or integrated, for instance in a same microprocessor.
Moreover, in the embodiment at issue, said image detecting unit 10 is a stereoscopic vision system.
However, in other embodiments of the present invention, said image detecting unit 10 can be any other system even according to the prior art capable of obtaining disparity or distance maps from digital images or other methods.
In particular, said image detecting unit 10 comprises a first image detecting device 100 and a second image detecting device 101 , such as a video camera, a photo camera or a sensor, arranged at a predetermined fixed distance from each other.
In other embodiments according to the present invention, the image detecting unit 10 can comprise a number of detecting devices other than two, for example, one, as in monocular systems for depth estimation from images.
More specifically, each of said image detecting devices 100, 101 detects a respective image of the object or the scene observed.
As it will be better explained below, the image acquired by means of said image detecting device 100, i.e. the left image, will be considered as the reference image or reference IL, while the image acquired through said image detecting device 101 , i.e. the right image, will be considered as the
target image or target IR.
However, each image acquired by the respective detection device 100, 101 can be considered as reference image IL or target image IR.
Still referring to Fig. 1 , said first processing means 11 are connected to said image detecting devices 100, 101 . In particular, said first processing means 11 are configured to process said images IL and IR in order to generate a disparity map DL.
In the embodiment according to the present invention, the output disparity map DL is computed assuming IL as the reference image. However, in another embodiment of the present invention, the output disparity map can be computed assuming IR as the reference image.
Moreover, in the embodiment schematically illustrated in Fig. 1 , said first processing means 11 generates said disparity map DL by means of a stereo algorithm S.
However, in further embodiments of the present invention, said first processing means 11 provide the use of additional algorithms, networks, programs or other computer sensors, capable of generating disparity maps.
As will be better described below, said filter 12 is capable of extracting a plurality of self-supervising criteria from said disparity map DL and said images IL and IR , in order to provide a self-adaptive learning of said confidence network 14, as better explained below.
In the embodiment at issue, as it will be better explained below, the extracted self-supervising criteria are three, herein referred to as T, A and U, related to image re-projection error, disparity agreement between nearby pixels of an image and uniqueness constraint between pixels of different images, respectively.
However, in other embodiments, it is possible to extract or calculate a different number of said self-supervising criteria, such as one, two or more than three criteria, respect to those ones described above.
Moreover, in other embodiments of the present invention, it is
possible to extract different criteria from the self-supervising criteria described for the present invention.
Said second processing means 13 are then configured to determine an evaluation of the loss based on said three self-supervising criteria T, A and U, in order to evaluate the output of the neural network 14, so as to train the same online, namely during its operation, without any external data for training the same.
More specifically, the second processing means 13 calculate a Multimodal Binary Cross Entropy (MBCE) loss signal from a combination of the outcomes of said three self-supervising criteria T, A and U, and a confidence map CM is computed by said confidence network 14.
As said above, said confidence network 14 is connected to said first processing means 11 and to said second processing means 13.
In particular, said confidence network 14 is configured to determine said confidence map CM from said disparity map DL.
More specifically, said confidence map CM ranks pixels of the disparity map DL from less to more reliable (from black to white).
As it will be described in more detail herein, said confidence network 14 is capable of updating its own knowledge of the surrounding environment by means of the evaluation of the Multi-modal Binary Cross Entropy (MBCE) loss signal computed by said second processing means 13.
As mentioned above, in some embodiments, the first processing means 11 , the second processing means 13, the filter 12, and the neural network or confidence network 14, can be integrated in a single processing unit U, properly programmed.
Referring now to Fig. 2, a flowchart of the method according to the present invention is shown, which can be executed also by the system of Fig. 1.
At first, the step of acquiring images, indicated with the reference letter A, provides the acquisition of a reference image IL and a target image
IR related to an object or a scene observed by means of said image detecting unit 10.
In step B, said first processing means 11 process said images IL and IR, in order to generate a disparity map DL by means of said stereo algorithm S.
As said above, in the embodiment at issue, the present method provides an image processing using a stereo algorithm S.
However, in further embodiments of the present invention, it is possible to use additional algorithms, programs or other computer sensors capable of generating disparity maps.
Subsequently, in step C, said filter 12 extracts said three self-supervising criteria T, A and U from the two images IL and IR and the disparity map DL.
In step D, said confidence network 14 determines said confidence map CM from said disparity map DL.
Then, in step E, said second processing means 13 compute the MBCE loss signal from said confidence map CM and a combination of one or more of said self-supervising criteria T, A and U. It is noted that also other self-supervising criteria from said disparity map DL can be used, alternatively or in addition to the three self-supervising criteria T, A and U specified above, without departing from the scope of protection of the invention herein disclosed.
Finally, in step F, said confidence network 14 is updated based on said MBCE loss signal computed in said step E.
In particular, the parameters of said neural network 14 are continuously updated in order to adapt said neural network 14 itself to the environment related to scene observed.
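The adaptation loop of steps B-F described above can be sketched in a few lines; all component callables (stereo, confidence_net, extract_criteria, mbce_loss, update) are illustrative placeholders for the blocks 11-14 of Fig. 1, not an actual implementation:

```python
def self_adaptive_step(il, ir, stereo, confidence_net,
                       extract_criteria, mbce_loss, update):
    """One online iteration of the method (steps B-F)."""
    dl = stereo(il, ir)                      # step B: disparity map DL = S(IL, IR)
    criteria = extract_criteria(il, ir, dl)  # step C: self-supervising criteria
    cm = confidence_net(dl)                  # step D: confidence map CM
    loss = mbce_loss(cm, criteria)           # step E: MBCE loss signal
    update(loss)                             # step F: online update of the network
    return dl, cm, loss
```

Each newly acquired stereo pair triggers one such step, so the confidence network keeps adapting to the observed environment without any ground-truth labels.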
As already said, the present invention aims at proposing a self-supervised paradigm suited for learning a confidence measure, unconstrained from the specific stereo method deployed and capable of self-adaptation.
Therefore, at first stereo systems are classified into different categories according to the data they make available, and then a strategy compatible with all of them is introduced.
Stereo matching systems
Three main broad categories of stereo matching solutions are herein defined, each one characterized by different data made available during deployment. It is clear that the stereo matching systems herein disclosed are just possible embodiments, and also other stereo matching systems can be available and implemented.
A generic rectified stereo pair will be referred to as (IL, IR), respectively made of left and right images, and a generic stereo algorithm or deep network will be referred to as S. Furthermore, in the remainder of the description, in order to simplify notation, (x,y) coordinates will be omitted if not strictly necessary.
Given any stereo algorithm processing a stereo pair (IL, IR), the output disparity map, computed assuming IL as the reference image, is defined as DL = S(IL, IR).
This image triplet is the minimum amount of data available out of any stereo method, and all the systems making available only such cues are here defined as "black-box systems". Such systems are highly representative of off-the-shelf stereo cameras (e.g., Stereolabs ZED 2) or stereo methods implemented in consumer devices (e.g., Apple iPhones).
In particular, they neither allow end-users to access the implementation nor provide explicit ways (Application Programming Interfaces or APIs) to call for it.
For each (IL, IR) acquired in the field by the device, they provide the corresponding disparity map, typically with undisclosed approaches based either on conventional stereo algorithms or deep networks.
Although black-box systems provide cues available in any stereo system, when explicit calls to the algorithm APIs are exposed, additional cues can be retrieved. Hence, a second family of systems can be implemented, for which, although no access is given to the algorithm implementation or its intermediate data, explicit calls to the method itself are possible (e.g. stereo algorithms provided by pre-compiled libraries).
The systems belonging to this class are defined as "gray-box systems", since multiple calls to S allow for retrieving additional cues. For instance, it is straightforward to compute the Left to Right Consistency (LRC) of the disparity maps, a popular strategy to obtain a confidence estimator, even if not explicitly provided by S itself in its original implementation.
Given the possibility to call S two times, consistency checking can be performed analysing DL and a second disparity map, namely DR, obtained by assuming IR as the reference image. Defining ← the horizontal flipping operator, DR is obtained as follows:
DR = ←S(←IR, ←IL) (1)
Once obtained DR, the consistency between the two disparity maps can be checked as:
LRC = |DL − π(DL, DR)| > ε
where π(a,b) is a sampling operator, collecting values at coordinate a from b, and ε is a threshold value (usually 1) above which DL and DR are considered inconsistent.
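Under gray-box assumptions, the LRC check above can be sketched in a few lines of NumPy. This is a minimal sketch: disparities are rounded to integers and out-of-image matches are clamped to the image border.

```python
import numpy as np

def lrc_mask(dl, dr, eps=1.0):
    """True where DL and DR are inconsistent: |DL - pi(DL, DR)| > eps."""
    h, w = dl.shape
    xs = np.tile(np.arange(w), (h, 1))
    xr = np.clip(np.rint(xs - dl).astype(int), 0, w - 1)  # column matched in DR
    sampled = np.take_along_axis(dr, xr, axis=1)          # pi(DL, DR)
    return np.abs(dl - sampled) > eps
```

Pixels flagged by the mask (typically occlusions and mismatches) are those for which the two views disagree beyond the tolerance ε.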
If the implementation of S is accessible, additional cues can be sourced by processing intermediate data structures, if meaningful. The preferred one is the cost volume V, containing matching costs V(x,y,d) for pixels at coordinates (x,y) and any disparity hypothesis d ∈ [0, dmax].
This class of systems, referred to as “white-box systems”, enables computation of any confidence measure, either conventional or learning- based. Popular traditional confidence measures obtained from V are the
Peak-Ratio (PKR) and Left-Right Difference (LRD), defined respectively as:
PKR = V(x,y, d2m) / V(x,y, d1)
LRD = (V(x,y, d2m) − V(x,y, d1)) / |V(x,y, d1) − min_d VR(x − d1, y, d)|
where d1 and d2m, respectively, are the disparity hypotheses corresponding to the minimum cost and the second local minimum (see for example [3]). Regarding LRD, given the cost volume VR computed assuming IR as the reference image, for any pixel (x,y) costs are sampled at (x − d1, y), i.e., from the estimated matching pixel.
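For white-box systems, PKR can be sketched directly from a cost volume. In this sketch the second local minimum is approximated by masking out the winning disparity and its direct neighbours — an assumption for illustration, not the exact definition of [3]:

```python
import numpy as np

def peak_ratio(volume):
    """PKR per pixel: (approximate) second local minimum cost over minimum cost.

    volume: float array of shape (H, W, dmax) with matching costs V(x, y, d).
    Higher ratios indicate a more distinctive (hence more confident) minimum.
    """
    h, w, dmax = volume.shape
    d1 = volume.argmin(axis=2)                 # winning disparity hypothesis
    c1 = volume.min(axis=2)                    # minimum cost
    masked = volume.astype(float).copy()
    for off in (-1, 0, 1):                     # mask winner and direct neighbours
        np.put_along_axis(masked, np.clip(d1 + off, 0, dmax - 1)[..., None],
                          np.inf, axis=2)
    c2m = masked.min(axis=2)                   # approximate second local minimum
    return c2m / np.maximum(c1, 1e-6)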
Black-box models represent the most challenging, yet general target when dealing with confidence estimation, since their constraints prevent the deployment of most state-of-the-art measures, as well as of the self-supervised strategies existing in the literature.
In the embodiment herein disclosed, the method comprises a general-purpose strategy enabling self-supervised confidence estimation in such constrained settings. However, in further embodiments, the method can be used even for state-of-the-art CNNs. Furthermore, out-of-the-box learning of confidence estimation with any stereo setup and self-adaptation in any environment is available.
Determination of the three self-supervising criteria
In order to develop a self-supervised strategy suited for any stereo system, it is required to identify cues that are effective to source a robust supervision signal.
According to the previous discussion, in the case for example of black-box models, the data available comprise (IL, IR) and DL only.
In this circumstance, although relevant information is not available compared to other models, the above mentioned three self-supervising criteria are introduced to obtain the desired self-supervised signal from the meagre cues available.
As a first self-supervising criterion implemented in the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network according to the present invention, an image re-projection error is considered. Indeed, the reprojection across the two viewpoints available in a rectified stereo pair is a powerful source of supervision, either for monocular (see [4, 5, 6]) or stereo (see [7, 8]) depth estimation.
Specifically, IR is reprojected on the reference image coordinates as fR = n(DL,IR). Then, the difference between IL and warped right view fR appearance encodes how correct the reprojection is.
To this aim, the most popular choice is a weighted sum between two terms, respectively SSIM (see [9]) and absolute difference:
e(IL, ĨR) = α · (1 − SSIM(IL, ĨR))/2 + (1 − α) · |IL − ĨR|
wherein α is a parameter ranging from 0 to 1, preferably tuned to 0.85. The higher the image reprojection error is, the more likely DL is wrong.
By definition, matching pixels is particularly challenging in ambiguous regions, such as textureless portions of the image.
To this aim, the present invention aims at detecting regions with rich texture, being more likely to be correctly estimated by S, by comparing e computed between (IL, IR) with the one after reprojection, i.e. T = e(IL, ĨR) < e(IL, IR).
In large ambiguous regions, e(IL, IR) will result equal (or even lower) than the reprojection error, thus identifying pixels on which stereo is prone to errors.
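The T criterion can be sketched as follows. The per-pixel SSIM maps are assumed to be supplied by an external routine (e.g. a windowed SSIM implementation), so only the weighted combination and the comparison are shown:

```python
import numpy as np

ALPHA = 0.85  # weight between the SSIM term and the absolute difference

def photometric_error(il, image, ssim_map, alpha=ALPHA):
    """e = alpha * (1 - SSIM)/2 + (1 - alpha) * |IL - image| (per pixel)."""
    return alpha * (1.0 - ssim_map) / 2.0 + (1.0 - alpha) * np.abs(il - image)

def texture_criterion(il, ir, ir_warped, ssim_warped, ssim_plain):
    """Criterion T: warping IR with DL must lower the photometric error
    with respect to the raw (unwarped) right image."""
    return (photometric_error(il, ir_warped, ssim_warped)
            < photometric_error(il, ir, ssim_plain))
```

In textureless regions the unwarped error is already low, so the comparison fails and the pixel receives no positive label.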
As a second self-supervising criterion, the disparity agreement or agreement among neighbouring matches is considered.
In particular, considering that most regions of a disparity map should be smooth, variations in nearby pixels should be small except at depth boundaries. DL itself allows for the extraction of meaningful cues to assess the quality of disparity assignments. Purposely, the disparity agreement between neighbouring pixels is defined as:
DA(x,y) = H_NxN(x, y, DL(x,y)) / N²
where H_NxN is a histogram encoding, for each pixel (x,y), the number of neighbours in a N×N window having the same disparity d (in case of subpixel precision, within 1 pixel).
In the absence of depth discontinuities, the majority of pixels in the neighbourhood should share the same, or very similar, disparity hypothesis.
Hence, this second self-supervising criterion is defined to identify reliable stereo correspondences as A = DA > 0.5, assuming that more than half of the pixels in the neighbourhood share the same disparity.
It is worth noting that this second self-supervising criterion is often not met in the presence of depth boundaries, even in case of correct disparities.
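A naive (unoptimized) sketch of the DA computation and of criterion A, using integer-rounded disparities:

```python
import numpy as np

def disparity_agreement(dl, n=5):
    """Criterion A: for each pixel, the fraction of pixels in the NxN window
    sharing the same rounded disparity as the centre must exceed 0.5."""
    h, w = dl.shape
    r = n // 2
    d = np.rint(dl).astype(int)
    agree = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            win = d[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            agree[y, x] = (win == d[y, x]).sum() / float(n * n) > 0.5
    return agree
```

Note that, consistently with the remark above, isolated correct pixels at depth boundaries may fail the test, since less than half of their neighbours share their disparity.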
As a third self-supervising criterion, the uniqueness constraint is
considered.
In an ideal frontal-parallel scene observed by a stereo camera in standard form, for each pixel in IL there exists at most one match in IR and vice versa. Leveraging this property, known as uniqueness, is particularly useful to detect outliers in occluded regions and represents a reliable alternative to LRC and LRD measures, not usable when dealing with black-box models.
In other words, the uniqueness for any pixel in IL holds if it does not collide in the target image with any other pixel, i.e., if it does not match the same pixel in IR matched by any other pixel.
This property is exploited in order to define a third self-supervising criterion as U = UC.
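A per-scanline sketch of the uniqueness check: the (rounded) matching columns are counted, and U keeps only pixels whose target column is claimed exactly once.

```python
import numpy as np

def uniqueness_criterion(dl):
    """Criterion U: a pixel survives only if no other pixel on the same
    scanline matches the same rounded column x - DL(x, y) in IR."""
    h, w = dl.shape
    xs = np.arange(w)[None, :]
    target = np.rint(xs - dl).astype(int)   # matched column in the target image
    u = np.zeros((h, w), dtype=bool)
    for y in range(h):
        cols, counts = np.unique(target[y], return_counts=True)
        hits = dict(zip(cols.tolist(), counts.tolist()))
        u[y] = [hits[c] == 1 for c in target[y].tolist()]
    return u
```

Colliding pixels, typically belonging to occluded regions, are flagged as unreliable.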
Although effective at detecting mostly occlusions, the uniqueness constraint is often violated in the presence of slanted surfaces.
Multi-modal Binary Cross Entropy calculation
Given one or more of the three self-supervising criteria T, A and U disclosed above, a measure of binary entropy loss is calculated, to take into account multiple label hypotheses.
In particular, a Multi-modal Binary Cross Entropy (MBCE) loss is defined for each pixel of the acquired image as:
LMBCE = −(Π_{p∈P} p) · log(o) − (Π_{q∈Q} q) · log(1 − o)
where o ∈ [0,1] is the output of the neural network, i.e. passed through a sigmoid activation, and P and Q are two sets of proxy labels, derived respectively by a self-supervising criterion being met or not.
For instance, it is considered that the self-supervising criteria are calculated for each pixel, based on said disparity map DL and said images IL and IR. Pixels satisfying the first self-supervising criterion on image reprojection will have labels Tp = 1, Tq = 0 and vice versa when they do not.
Therefore, unlike traditional binary cross entropy, where a single label y and its counterpart (1 - y) are used, disjoint sets of proxies are defined allowing for a flexible configuration of the loss function according to the three self-supervising criteria described so far.
For instance, by setting P = {Tp, Ap} and Q = {Tq}, the network will be trained to detect good matches using image reprojection error plus disparity agreement, and outliers using the image reprojection error only.
Adding elements to the sets P and Q reduces progressively the number of pixels considered correct or wrong, respectively.
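The loss can be sketched as follows; the product form below is one possible reading of the multi-modal formulation reconstructed from the description above (a pixel counts as "correct" only when all labels in P are 1, and as "wrong" only when all labels in Q are 1), so it is an assumption for illustration:

```python
import numpy as np

def mbce_loss(o, p_labels, q_labels, eps=1e-7):
    """Per-pixel multi-modal binary cross entropy.

    o: network output in [0, 1]; p_labels / q_labels: lists of binary maps,
    one per self-supervising criterion in P and Q respectively.
    """
    p = np.prod(np.stack(p_labels), axis=0)  # 1 only if ALL criteria in P are met
    q = np.prod(np.stack(q_labels), axis=0)  # 1 only if ALL criteria in Q are met
    return -(p * np.log(o + eps) + q * np.log(1.0 - o + eps))
```

With this formulation, adding criteria to P or Q shrinks the corresponding supervised set, exactly as stated above; pixels with p = q = 0 receive no gradient at all.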
It is noted that Fig. 3 illustrates, given a highlighted region, a set of inliers (also shown in green colour) and a set of outliers (also shown in red colour), which are determined by using the following configurations of self-supervising criteria in the multi-modal binary cross-entropy loss signal: a) Tp, Tq; b) Ap, Aq; c) Up, Uq; d) Tp, Ap, Up, Tq; e) Tp, Ap, Up, Tq, Aq, Uq; while for black pixels, the considered configuration gives no guesses.
In particular, Fig. 3 highlights how, when combining multiple guesses as in cases d) and e), no supervision is given for some pixels when the self-supervising criteria do not match.
The system and the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be operable to realize a depth estimation sensor, capable of providing an estimate of confidence based on machine learning without having to acquire datasets for learning, which is highly expensive and complicated to perform with techniques belonging to the state of the art.
Possible applications of the method for determining the confidence of a disparity map through a self-adaptive learning of a neural network can be:
1 ) assessing, in general, the quality of a stereo algorithm by finding situations/patterns where it usually fails. For example, several conventional algorithms fail near occlusions;
2) the estimated confidence map CM ranks pixels from less to more reliable (from black to white). It can be used to extract a subset of reliable points to be used by "guided stereo" and "real-time self-adaptive" technologies, and to filter out less reliable pixels and replace them with better estimates;
3) fusion of stereo disparity with Time of Flight (ToF) depth maps; and
4) fusion of multiple stereo algorithms.
Experimental results
In this section, the outcome of experiments aimed at assessing the effectiveness of the proposed invention, referred to as Out-of-The-Box (OTB), is reported.
To measure the effectiveness of the learned confidence measures, the Area Under Curve (AUC) of the sparsification plots (see [3], [10], [11], [12]) has been computed.
In particular, given a disparity map, pixels are sorted in increasing order of confidence and gradually removed (e.g., 5% each time) from the disparity map. At each iteration, the error rate is computed over the sparse disparity map as the percentage of pixels having absolute error larger than t.
Plotting the error rate results in a sparsification curve, whose AUC quantitatively assesses the confidence effectiveness (the lower, the better).
The optimal AUC is obtained by sampling the pixels in decreasing order of absolute error.
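The sparsification AUC described above can be sketched as follows (rectangle-rule integration; the removal step and the error threshold are the parameters mentioned in the text):

```python
import numpy as np

def sparsification_auc(confidence, abs_err, t=3.0, step=0.05):
    """AUC of the sparsification plot: pixels are removed in increasing order
    of confidence and the bad-pixel rate (|error| > t) is re-computed at each
    step; the lower the area, the better the confidence measure."""
    order = np.argsort(confidence.ravel())          # least confident first
    bad = (np.abs(abs_err.ravel()) > t)[order]
    n = bad.size
    rates = []
    for frac in np.arange(0.0, 1.0, step):
        kept = bad[int(frac * n):]                  # drop the least confident
        rates.append(kept.mean() if kept.size else 0.0)
    return float(np.sum(rates) * step)
```

A confidence map that ranks erroneous pixels last (lowest confidence) removes them early and yields a small AUC, approaching the optimal curve obtained by sorting pixels by decreasing absolute error.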
Self-adapting in-the-wild
Experiments aimed at assessing how effective the present method is for self-adaptation of a confidence measure in unseen environments have been conducted by selecting a sequence from the DrivingStereo ([19]) dataset. The sequence 2018-10-25-07-37, containing 6905 stereo pairs acquired in unconstrained (i.e., dynamic) environment has been used.
In particular, for this evaluation, Census-SGM ([22]) and MADNet ([8]) have been chosen. The former because it represents the preferred choice for hardware implementation on custom stereo cameras. The latter because it well represents the category of modern end-to-end CNNs characterized by a good trade-off between accuracy and speed.
For confidence networks, ConfNet ([11]) has been selected. In this experiment, it has been assumed to have versions of ConfNet pre-trained with the different self-supervision paradigms, respectively SELF ([1]) and WILD ([2]), on the first 20 images of the KITTI 2012 training set ([10]).
For OTB, the [Tp, Ap, Up, Tq, Aq, Uq] configuration has been used. When performing online adaptation (online entry), for each stereo pair the confidence is estimated and evaluated before loss computation (thus, supervision only acts on the upcoming frames).
This way, ConfNet runs at 0.08 seconds per frame (12 FPS) against 0.02 seconds (50 FPS) without adaptation on a Titan Xp. The fourth table shown in Fig. 4 collects the outcome of this evaluation. It is pointed out that WILD cannot be deployed for MADNet, since a meaningful cost volume is not available. On the other hand, SELF would require (DL, DR) for supervision, while MADNet provides only the former.
Assuming this network as a gray-box model, this issue has been circumvented at training time by obtaining DR as shown in equation (1). Concerning SGM, OTB performs in between WILD and SELF. Nevertheless, keeping continuous adaptation active on the whole sequence makes it outperform both by a good margin. Concerning MADNet, SELF results more effective than OTB.
Again, performing online adaptation makes OTB the best solution in this case as well. Finally, Fig. 5 shows qualitative examples for the SGM algorithm.
On-the-fly learning with black-box sensors
Finally, it is reported, as qualitative results, the outcome obtained by learning on-the-fly a confidence measure on disparity maps sourced by an Apple iPhone® XS, without any pre-training.
Purposely, a sequence of about 100 pairs on which ConfNet has been trained on-the-fly has been collected.
In particular, Fig. 6 shows examples of acquired disparity and confidence maps estimated by ConfNet adapted online. More specifically, the very few frames collected are sufficient to learn how to detect gross errors, like those on the turtle's shell.
Qualitative results on a variety of algorithms
Furthermore, as shown in Fig. 7 on a variety of algorithms, the present technical solution performs better with respect to known strategies requiring full access to the cost volume (see [2]) or static scenes for training (see [1]).
Conclusion
In light of the above, it has been introduced a novel self-supervised paradigm aimed at learning from scratch a confidence measure for stereo.
In particular, few, principled cues from the input stereo pair and the estimated disparity have been used in order to source supervision signals in place of disparity ground truth labels.
Being such cues available during deployment in-the-wild, the present invention is suited for continuous online adaptation on any black-box framework.
Furthermore, experimental results proved that the present method is shows a high performance if compared with existing self-supervised approaches and, conversely to them, allow to further improvements during deployment by leveraging the online self-adaptation process.
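The core of the paradigm, sourcing proxy labels from self-supervising cues and scoring the confidence map against them with a Multi-modal Binary Cross Entropy, can be illustrated with a minimal sketch. The `mbce_loss` helper and the toy data below are assumptions for illustration and stand in for the loss over the proxy-label sets P and Q described in the claims.

```python
import numpy as np

def mbce_loss(o, P, Q, eps=1e-7):
    """Multi-modal Binary Cross Entropy over proxy-label sets:
    pixels in P satisfied the self-supervising criteria (treated as confident),
    pixels in Q violated them (treated as uncertain)."""
    o = np.clip(o, eps, 1 - eps)
    return -(np.log(o[P]).mean() + np.log(1 - o[Q]).mean())

# Toy per-pixel confidence map and proxy labels from a hypothetical criterion.
conf = np.array([0.9, 0.8, 0.2, 0.1, 0.5])
criterion_met = np.array([True, True, False, False, True])
P = np.flatnonzero(criterion_met)     # proxy positives
Q = np.flatnonzero(~criterion_met)    # proxy negatives

loss_good = mbce_loss(conf, P, Q)     # confidence agrees with the cues
loss_bad = mbce_loss(1 - conf, P, Q)  # inverted confidence is penalized more
```

Minimizing this signal requires no disparity ground truth, which is what makes the measure deployable on any black-box stereo source.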
Advantages
An advantage of the method according to the present invention is that of allowing a self-adapting confidence estimation agnostic to the stereo algorithm or network.
Another advantage of the present invention is that of learning an effective confidence measure based only on the minimum information available in any stereo setup (i.e., the input stereo pair of images and the output disparity map).
The present invention has been described for illustrative but not limitative purposes, according to its preferred embodiments, but it is to be understood that modifications and/or changes can be introduced by those skilled in the art without departing from the relevant scope as defined in the enclosed claims.
References
[1] Mostegel, C., Rumpler, M., Fraundorfer, F., Bischof, H.: “Using self-contradiction to learn confidence measures in stereo vision.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016);
[2] Tosi, F., Poggi, M., Tonioni, A., Di Stefano, L., Mattoccia, S.: “Learning confidence measures in the wild.” In: BMVC (Sept 2017);
[3] Hu, X., Mordohai, P.: “A quantitative evaluation of confidence measures for stereo vision.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2121-2133 (2012);
[4] Godard, C., Mac Aodha, O., Brostow, G.J.: “Unsupervised monocular depth estimation with left-right consistency.” In: CVPR (2017);
[5] Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: “Towards real-time unsupervised monocular depth estimation on CPU.” In: IEEE/RSJ Conference on Intelligent Robots and Systems (IROS) (2018);
[6] Godard, C., Mac Aodha, O., Brostow, G.J.: “Digging into self-supervised monocular depth estimation.” In: ICCV (2019);
[7] Zhang, Z., Cui, Z., Xu, C., Jie, Z., Li, X., Yang, J.: “Joint task-recursive learning for semantic segmentation and depth estimation.” In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 235-251 (2018);
[8] Tonioni, A., Tosi, F., Poggi, M., Mattoccia, S., Di Stefano, L.: “Real-time self-adaptive deep stereo.” In: CVPR (June 2019);
[9] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: “Image quality assessment: from error visibility to structural similarity.” IEEE transactions on image processing 13(4), 600-612 (2004);
[10] Poggi, M., Tosi, F., Mattoccia, S.: “Quantitative evaluation of confidence measures in a machine learning world.” pp. 5228-5237 (2017);
[11] Tosi, F., Poggi, M., Benincasa, A., Mattoccia, S.: “Beyond local reasoning for stereo confidence estimation with deep learning.” pp. 319-334 (2018);
[12] Kim, S., Kim, S., Min, D., Sohn, K.: “LAF-Net: locally adaptive fusion networks for stereo confidence estimation.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019);
[13] Poggi, M., Mattoccia, S.: “Learning from scratch a confidence measure.” In: BMVC (2016);
[14] Gul, M.S.K., Batz, M., Keinert, J.: “Pixel-wise confidences for stereo disparities using recurrent neural networks.” In: BMVC (2019);
[15] Geiger, A., Lenz, P., Urtasun, R.: “Are we ready for autonomous driving? The KITTI vision benchmark suite.” In: CVPR (2012);
[16] Menze, M., Geiger, A.: “Object scene flow for autonomous vehicles.” In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015);
[17] Scharstein, D., Hirschmuller, H., Kitajima, Y., Krathwohl, G., Nesic, N., Wang, X., Westling, P.: “High-resolution stereo datasets with subpixel-accurate ground truth.” In: German Conference on Pattern Recognition, pp. 31-42. Springer (2014);
[18] Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: “A multi-view stereo benchmark with high-resolution images and multi-camera videos.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3260-3269 (2017);
[19] Yang, G., Song, X., Huang, C., Deng, Z., Shi, J., Zhou, B.: “DrivingStereo: a large-scale dataset for stereo matching in autonomous driving scenarios.” In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019);
[20] Zbontar, J., LeCun, Y.: “Stereo matching by training a convolutional neural network to compare image patches.” Journal of Machine Learning Research 17(1-32), 2 (2016);
[21] Zhang, K., Lu, J., Lafruit, G.: “Cross-based local stereo matching using orthogonal integral images.” IEEE transactions on circuits and systems for video technology 19(7), 1073-1079 (2009).
[22] Hirschmuller, H.: “Accurate and efficient stereo processing by semi-global matching and mutual information.” In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 807-814. IEEE (2005);
[23] Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation.” In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
Claims
1. Method for determining the confidence of a disparity map DL by training a neural network (14), wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL, from at least one digital image IL, IR of a scene, comprising the following steps:
A. acquiring said at least one digital image IL , IR of said scene;
B. calculating said disparity map DL for each pixel of said at least one digital image IL, IR ;
C. extracting at least one self-supervising criterion from said at least one digital image IL , IR and said disparity map DL;
D. calculating a confidence map CM, from said disparity map DL, by means of said neural network (14);
E. calculating a loss signal LMBCE from said confidence map CM and said at least one self-supervising criterion; and
F. optimizing said neural network (14) by training said neural network (14) with the information associated with said loss signal LMBCE.
2. Method according to the preceding claim, characterized in that said at least one self-supervising criterion extracted in said step C comprises at least one of the following criteria: a self-supervising criterion related to an image reprojection error between said at least one digital image IL, IR ; a self-supervising criterion related to a disparity agreement between pixels of said disparity map DL; and/or a self-supervising criterion related to the uniqueness of any pixel in IL and IR, respectively.
3. Method according to claim 2, characterized in that said self-supervising criterion related to an image reprojection error between said at least one digital image IL , IR , is calculated according to the following equation:
R = r < θr
wherein r = α · (1 − SSIM(IL, ĨL))/2 + (1 − α) · |IL − ĨL| and ĨL(x, y) = IR(x − DL(x, y), y), being ĨL a reprojection of IR on reference image coordinates, SSIM is the Structural SIMilarity index, θr is a threshold, and α is a parameter ranging from 0 to 1, preferably tuned to 0.85.
4. Method according to any one of claims 2 or 3, characterized in that said self-supervising criterion related to a disparity agreement between pixels of said disparity map DL, is calculated according to the following equation:
A = DA > 0.5
wherein DA is derived from HN×N, a histogram encoding, for each pixel (x, y) of said disparity map DL, the number of neighbours in an N×N window having the same disparity d.
6. Method according to any one of the preceding claims, characterized in that said loss signal LMBCE is a Multi-modal Binary Cross Entropy loss signal calculated according to the following equation:
LMBCE = −(1/|P|) Σp∈P log(op) − (1/|Q|) Σq∈Q log(1 − oq)
where o ∈ [0, 1] is the output of said neural network (14), and P and Q are two sets of proxy labels derived, respectively, from a self-supervising criterion comprised in said at least one self-supervising criterion being met or not.
7. Method according to any one of the preceding claims, characterized in that said step B is carried out according to the following formula DL = S(IL,IR).
8. Method according to any one of the preceding claims, characterized in that said step B is carried out by means of a network S.
9. Method according to any one of the preceding claims, characterized in that said step A is carried out by an image detecting unit (10) comprising at least one image detecting device (100, 101) for detecting said at least one digital image IL, IR, in that said step B is carried out by first processing means (11), connected to said image detecting device (10), in that said step C is carried out by a filter (12), connected to said image detecting device (10) and said first processing means (11), and in that said steps E and F are carried out by second processing means (13), connected to said filter (12) and said neural network (14).
10. Method according to any one of the preceding claims, characterized in that said step A is carried out by a stereo matching technique, so as to detect a reference image IL and a target image IR of said scene.
11. Processing unit (U) for determining the confidence of a disparity map DL, wherein the confidence is the level of certainty or uncertainty of each pixel of said disparity map DL, from at least one digital image IL, IR of a scene, wherein the disparity map DL is achieved through a network S, and wherein the processing unit (U) is configured to execute the steps B-F of said method, according to any one of claims 1-9.
12. Processing unit (U) according to claim 11, characterized in that it comprises: processing means (11, 13), connected to said image detecting unit (10), a filter (12), connected to said image detecting unit (10) and said processing means (11, 13), and configured for extracting at least one self-supervising criterion from said at least one digital image IL , IR and said disparity map DL, and a neural network (14), connected to said processing means (11, 13), configured for producing a confidence map CM from said disparity map DL, wherein said processing means (11, 13) are configured for determining a disparity map DL from said at least one digital image IL , IR , and for calculating a loss signal from said confidence map CM and said at least one self-supervising criterion.
13. Sensor system (1) for determining the confidence of a disparity
map DL from at least one digital image IL, IR of a scene, comprising an image detection unit (10) configured for acquiring said at least one digital image IL, IR of said scene, and a processing unit (U) according to any one of claims 11 or 12, connected to said image detecting unit (10).
14. Computer program comprising instructions which, when the program is executed by a computer, cause the processor to perform the method steps according to any one of claims 1-10.
15. Computer-readable storage medium comprising instructions which, when executed by a computer, cause the processor to perform the method steps according to any one of claims 1-10.
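As an illustration of the disparity-agreement cue recited in claim 4, a minimal sketch follows. The direct neighbour count below stands in for the histogram HN×N (counting neighbours with the same disparity is equivalent to reading the histogram bin of the central pixel's disparity); the window size, the normalization over N×N − 1 neighbours and the integer-disparity assumption are illustrative.

```python
import numpy as np

def disparity_agreement(D, N=5):
    """For each pixel, the fraction of neighbours in an N x N window sharing
    the same (integer) disparity - a proxy for local consistency."""
    H, W = D.shape
    r = N // 2
    Dp = np.pad(D, r, mode="edge")          # replicate borders
    same = np.zeros((H, W), dtype=np.float64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue                    # skip the pixel itself
            shifted = Dp[r + dy : r + dy + H, r + dx : r + dx + W]
            same += (shifted == D)
    return same / (N * N - 1)

# A flat disparity region agrees everywhere; an isolated outlier does not.
D = np.full((9, 9), 10, dtype=np.int32)
D[4, 4] = 42                                # spurious disparity value
DA = disparity_agreement(D)
A = DA > 0.5                                # the agreement criterion A = DA > 0.5
```

Pixels where A is False can then feed the proxy-negative set used by the loss signal, without any ground-truth disparity.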
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IT102020000016054A IT202000016054A1 (en) | 2020-07-02 | 2020-07-02 | METHOD FOR DETERMINING THE CONFIDENCE OF A DISPARITY MAP BY SELF-ADAPTIVE LEARNING OF A NEURAL NETWORK, AND RELATED SENSOR SYSTEM |
IT102020000016054 | 2020-07-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022003740A1 true WO2022003740A1 (en) | 2022-01-06 |
Family
ID=72644653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IT2021/050193 WO2022003740A1 (en) | 2020-07-02 | 2021-06-21 | Method for determining the confidence of a disparity map through a self-adaptive learning of a neural network, and sensor system thereof |
Country Status (2)
Country | Link |
---|---|
IT (1) | IT202000016054A1 (en) |
WO (1) | WO2022003740A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023146882A1 (en) * | 2022-01-26 | 2023-08-03 | Meta Platforms Technologies, Llc | Display system with machine learning (ml) based stereoscopic view synthesis over a wide field of view |
WO2023201903A1 (en) * | 2022-04-18 | 2023-10-26 | 清华大学 | Occlusion-aware-based unsupervised light field disparity estimation system and method |
CN117907242A (en) * | 2024-03-15 | 2024-04-19 | 贵州省第一测绘院(贵州省北斗导航位置服务中心) | Homeland mapping method, system and storage medium based on dynamic remote sensing technology |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116433588B (en) * | 2023-02-21 | 2023-10-03 | 广东劢智医疗科技有限公司 | Multi-category classification and confidence discrimination method based on cervical cells |
- 2020-07-02: IT — application IT102020000016054A filed (publication IT202000016054A1, status unknown)
- 2021-06-21: WO — application PCT/IT2021/050193 filed (publication WO2022003740A1, active Application Filing)
Also Published As
Publication number | Publication date |
---|---|
IT202000016054A1 (en) | 2022-01-02 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21745455; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 21745455; Country of ref document: EP; Kind code of ref document: A1 |