CN111192306A - System for disparity estimation and method for disparity estimation of system - Google Patents

System for disparity estimation and method for disparity estimation of system

Info

Publication number
CN111192306A
Authority
CN
China
Prior art keywords
module
feature
semantic information
scale
disparity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911120591.5A
Other languages
Chinese (zh)
Inventor
杜宪志
伊尔哈米·穆斯塔法
李正元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/451,524 external-priority patent/US11024037B2/en
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN111192306A publication Critical patent/CN111192306A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20228 Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A system for disparity estimation and a method for disparity estimation for a system are disclosed. The system for disparity estimation comprises: one or more feature extractor modules configured to extract one or more feature maps from one or more input images; and one or more semantic information modules connected at one or more outputs of the one or more feature extractor modules, wherein the one or more semantic information modules are configured to: generating one or more foreground semantic information to be provided to the one or more feature extractor modules for disparity estimation at a next training epoch.

Description

System for disparity estimation and method for disparity estimation of system
This application claims priority to and the benefit of U.S. provisional patent application serial No. 62/768,055, entitled "foreground-background aware hole multi-scale network for disparity estimation," filed on November 15, 2018, which is expressly incorporated herein by reference in its entirety.
Technical Field
One or more aspects according to embodiments of the present disclosure relate to a foreground-background aware hole (atrous) multi-scale network (FBA-AMNet) for disparity estimation.
Background
Depth estimation is a fundamental computer vision problem aimed at predicting the measure of distance of each point in a captured scene. This has many applications, such as the ability to separate foreground (near) objects from background (far) objects. Accurate depth estimation allows separating the background from the foreground objects of interest in a scene and allows processing images from non-professional photographers or cameras with smaller lenses to obtain a more aesthetically pleasing image focused on the subject.
The above information in this background section is provided only to enhance understanding of the background of the technology, and therefore it should not be construed as an admission that any of the above constitutes prior art.
Disclosure of Invention
This summary is provided to introduce a selection of features and concepts of embodiments of the disclosure that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. One or more of the described features may be combined with one or more other described features to provide a viable device.
Aspects of example embodiments of the present disclosure relate to a foreground-background aware hole multi-scale network for disparity estimation.
According to some example embodiments of the present disclosure, a system for disparity estimation comprises: one or more feature extractor modules configured to extract one or more feature maps from one or more input images; and one or more semantic information modules connected at one or more outputs of the one or more feature extractor modules, wherein the one or more semantic information modules are configured to: generating one or more foreground semantic information to be provided to the one or more feature extractor modules for disparity estimation at a next training epoch.
In some embodiments, the system further comprises: an Extended Cost Volume (ECV) module connected at the one or more outputs of the one or more feature extractor modules, the ECV module configured to calculate matching cost information between the one or more feature maps; a stacked hole multi-scale (AM) module connected at an output of the ECV module and configured to process matching cost information between the one or more feature maps from the ECV module to aggregate multi-scale context information, the stacked AM module comprising a plurality of AM modules stacked together; and a regression module connected at an output of the stacked AM module and configured to estimate a disparity of the system based on the aggregated multi-scale context information and the one or more foreground semantic information from the stacked AM module.
In some embodiments, the ECV module comprises: a disparity level feature distance subvolume module configured to determine a pixel-by-pixel absolute difference between the first feature map and the second feature map; a disparity level depth correlation sub-body module configured to determine a correlation between the first feature map and the second feature map; and a disparity level feature cascade subvolume module configured to cascade the first feature map and the second feature map shifted by d at each disparity level d. In some embodiments, the size of the disparity level feature distance subvolume module is H × W × (D +1) × C, where H, W and C represent height, width and feature size, and D represents the maximum disparity the system is able to predict; the size of the disparity level depth-related subvolume module is H x W x (D +1) x C; and the size of the disparity level feature cascade subvolume module is H × W × (D +1) × 2C.
In some embodiments, the system is further configured to determine the size of the ECV module by cascading the disparity level feature distance subvolume module, the disparity level depth correlation subvolume module, and the disparity level feature cascade subvolume module along the depth dimension, wherein the ECV module has a size of H × W × (D+1) × 4C. In some embodiments, the stacked AM module comprises a plurality of AM modules stacked together with shortcut connections within the stacked AM module, wherein an AM module of the plurality of AM modules of the stacked AM module is configured to: process matching cost information between the one or more feature maps from the ECV module using k pairs of 3 × 3 hole convolution layers and two 1 × 1 convolution layers. In some embodiments, the k pairs of 3 × 3 hole convolution layers have dilation factors [1, 2, 2, 4, 4, …, k/2, k/2, k], wherein two 1 × 1 convolution layers with a dilation factor of one are added at the ends of the AM modules of the plurality of AM modules for feature refinement and feature resizing.
In some embodiments, the one or more feature extractor modules comprise: a first depth separable residual network (D-ResNet) module configured to receive a first input image and first foreground semantic information; a second D-ResNet module configured to receive a second input image and second foreground semantic information; a first AM module connected at the output of the first D-ResNet module; and a second AM module connected at an output of the second D-ResNet module. In some embodiments, the first D-ResNet module and the second D-ResNet module have shared weights, the first AM module and the second AM module have shared weights, wherein each of the first AM module and the second AM module is configured as a scene understanding module for capturing deep global context information and local details, wherein the ECV module is connected at an output of the first AM module and at an output of the second AM module.
In some embodiments, the one or more semantic information modules comprise: a first semantic information module connected at an output of the first AM module, wherein the first semantic information module is configured to generate first foreground semantic information, wherein the first foreground semantic information is provided to the first D-ResNet module via a first feedback loop as additional input to the system for a next training epoch of the system; and a second semantic information module connected at an output of the second AM module, wherein the second semantic information module is configured to generate second foreground semantic information, wherein the second foreground semantic information is provided to the second D-ResNet module via a second feedback loop as additional input to the system for a next training epoch of the system.
In some embodiments, the first semantic information module comprises: a first Convolutional Neural Network (CNN) module connected at an output of the first AM module; a first upsampling module connected at an output of the first CNN module; and a first prediction module connected at an output of the first upsampling module and configured to generate first foreground semantic information. In some embodiments, the second semantic information module comprises: a second Convolutional Neural Network (CNN) module connected at an output of the second AM module; a second upsampling module connected at an output of the second CNN module; and a second prediction module connected at an output of the second upsampling module and configured to generate second foreground semantic information. In some embodiments, the system is a multitasking module configured to perform two tasks, wherein the two tasks are disparity estimation and foreground semantic information generation, wherein the loss of the system is a weighted sum of two losses from the two tasks.
According to some example embodiments of the present disclosure, a method for disparity estimation for a system comprising one or more feature extractor modules, one or more semantic information modules, an Extended Cost Volume (ECV) module, a stacked hole multi-scale (AM) module, and a regression module, the method comprising: extracting, by the one or more feature extractor modules, one or more feature maps from one or more input images; generating one or more foreground semantic information by the one or more semantic information modules connected at one or more outputs of the one or more feature extractor modules, wherein the one or more foreground semantic information is provided to the one or more feature extractor modules; calculating matching cost information between the one or more feature maps by an ECV module connected at the one or more outputs of the one or more feature extractor modules; processing, by a stacked AM module connected at an output of the ECV module, matching cost information between the one or more feature maps from the ECV module to aggregate multi-scale context information for disparity regression; estimating, by a regression module connected at an output of a stacked AM module, a disparity of the system based on the aggregated multi-scale context information and foreground semantic information; and recursively training the system using the one or more feature maps and the one or more foreground semantic information until convergence.
In some embodiments, the one or more foreground semantic information for the current epoch is computed by the one or more semantic information modules in a previous epoch, wherein the one or more input images comprise a first input image and a second input image, wherein the one or more feature maps extracted from the one or more input images comprise a first feature map extracted from the first input image and a second feature map extracted from the second input image, and wherein the method further comprises: determining, by a disparity level feature distance subvolume module of the ECV module, a pixel-by-pixel absolute difference between the first feature map and the second feature map; determining, by a disparity level depth correlation subvolume module of the ECV module, a correlation between the first feature map and the second feature map; and cascading, by a disparity level feature cascade subvolume module, the first feature map and the second feature map shifted by d at each disparity level d.
In some embodiments, the size of the disparity level feature distance subvolume module is H × W × (D +1) × C, where H, W and C represent height, width and feature size, and D represents the maximum disparity the system is able to predict; the size of the disparity level depth-related subvolume module is H x W x (D +1) x C; and the size of the disparity level feature cascade subvolume module is H × W × (D +1) × 2C.
In some embodiments, the method further comprises: determining a size of the ECV module by cascading the disparity level feature distance subvolume module, the disparity level depth correlation subvolume module, and the disparity level feature cascade subvolume module along the depth dimension, wherein the size of the ECV module is H × W × (D+1) × 4C.
In some embodiments, the method further comprises: generating first foreground semantic information by a first semantic information module of the one or more semantic information modules; receiving, by a first depth separable residual network (D-ResNet) module of the one or more feature extractor modules, a first input image and the first foreground semantic information, wherein the first foreground semantic information is provided to the first D-ResNet module via a first feedback loop as additional input for a next training epoch of the system; generating second foreground semantic information by a second semantic information module of the one or more semantic information modules; receiving, by a second D-ResNet module of the one or more feature extractor modules, a second input image and the second foreground semantic information, wherein the second foreground semantic information is provided to the second D-ResNet module via a second feedback loop as additional input for a next training epoch of the system; and capturing, by a first AM module and a second AM module of the one or more feature extractor modules, deep global context information and local details for scene understanding.
In some embodiments, the stacked AM module comprises a plurality of AM modules stacked together with a shortcut connection within the stacked AM module, wherein the method further comprises: processing, by an AM module of the plurality of AM modules of the stacked AM module, matching cost information between the one or more feature maps from the ECV with k pairs of 3 x 3 hole convolution layers and two 1 x 1 convolution layers, wherein k pairs of 3 x 3 hole convolution layers have an expansion factor [1,2,2,4,4, …, k/2, k/2, k ], wherein two 1 x 1 convolution layers with an expansion factor of one are added at the end of the AM module for feature refinement and feature resizing.
Drawings
These and other features of some example embodiments of the present disclosure will be appreciated and understood with reference to the specification, claims, and drawings, wherein:
FIG. 1 illustrates a block diagram of a hole multiscale network (AMNet), according to some embodiments of the present disclosure;
fig. 2A illustrates a block diagram of a residual network (ResNet) block, according to some embodiments of the present disclosure;
FIG. 2B illustrates a block diagram of a depth separable ResNet (D-ResNet) block, according to some embodiments of the present disclosure;
FIG. 3 illustrates development of an Extended Cost Volume (ECV) module in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates the structure and specification of a hole multi-scale (AM) module and a stacked AM module according to some embodiments of the present disclosure;
FIG. 5A illustrates a multitasking network according to some embodiments of the present disclosure;
FIG. 5B illustrates another multitasking network according to some embodiments of the present disclosure;
FIG. 6 illustrates an FBA-AMNet system according to some embodiments of the present disclosure;
fig. 7 illustrates a method for disparity estimation for FBA-AMNet systems, in accordance with some embodiments of the present disclosure;
fig. 8 illustrates disparity estimation results for two foreground objects for AMNet and FBA-AMNet according to some embodiments of the present disclosure; and
fig. 9 illustrates one image and the results of the foreground-background segmentation from coarse to fine generated by FBA-AMNet, according to some embodiments of the present disclosure.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of some example embodiments of a foreground-background perceptual hole multi-scale network for disparity estimation provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be interpreted or utilized. This description sets forth features of the disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As shown elsewhere herein, like element numbers are intended to indicate like elements or features.
Depth estimation is a fundamental computer vision problem aimed at predicting a measure of distance for each point in a captured scene. There has been recent interest in estimating the real-world depth of elements in captured scenes. Accurate depth estimation has many applications, such as the ability to separate foreground (near) objects from background (far) objects. Accurate depth estimation allows the background to be separated from the foreground objects of interest in the scene. Accurate foreground-background separation allows the captured image to be processed to simulate effects such as the bokeh effect. Bokeh is a soft out-of-focus blur of the background, traditionally achieved by using the right settings on an expensive camera with a fast lens and wide aperture, and by bringing the camera closer to the subject and moving the subject farther from the background, to simulate a shallow depth of field. Thus, accurate depth estimation allows images from non-professional photographers or from cameras with smaller lenses (such as mobile phone cameras) to be processed into more aesthetically pleasing images with a bokeh effect focused on the subject. Other applications of accurate depth estimation include three-dimensional (3D) object reconstruction and virtual reality applications, where it is desired to change the background or the subject and render them according to the desired virtual reality. Accurate depth estimation from captured scenes is also used in vehicle automation, surveillance cameras, and autonomous driving applications, and can increase safety by improving object detection accuracy and by estimating the distance of objects from the camera, using depth estimates either from camera inputs alone or from multiple sensors.
Given a rectified stereo image pair, depth estimation can be converted into disparity estimation using the camera calibration. For each pixel in one image, disparity estimation finds the offset between that pixel and its corresponding pixel on the same horizontal line in the other image, such that the two pixels are projections of the same 3D location.
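For illustration, the relationship between disparity and depth for a rectified pair can be sketched as follows; the function name and the calibration values in the example are assumptions chosen for illustration, not values from this disclosure.

import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map (in meters).

    For a rectified stereo pair, depth = focal_length * baseline / disparity,
    so larger disparities correspond to closer (foreground) points."""
    return focal_length_px * baseline_m / np.maximum(disparity_px, eps)

# Example with assumed calibration values (illustrative only).
disparity = np.random.uniform(1.0, 192.0, size=(376, 1248)).astype(np.float32)
depth = disparity_to_depth(disparity, focal_length_px=721.5, baseline_m=0.54)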
Some embodiments of the present disclosure provide a system and method for estimating the real-world depths of elements in a scene captured by two stereo cameras. Two stereo-rectified images are captured and processed to accurately calculate the disparity between pixels in the two images. The pipeline of the stereo matching disparity estimation system comprises three parts: feature extraction from the left and right images captured by the stereo cameras, matching cost calculation between the left and right feature maps, and post-processing and disparity regression by a scene understanding module.
Some embodiments of the present disclosure provide designs for the components of a disparity estimation system. For example, an extended cost volume module for matching cost calculation is provided. The extended cost volume module comprises an original cost volume module, a disparity level correlation module, and a disparity level difference module. A stacked hole convolution module with internal skip connections for post-processing is also provided. Soft classification based regression is used for disparity regression. Some embodiments of the present disclosure may be applicable to object segmentation.
Some embodiments may utilize semantic information (in particular, semantic segmentation or semantic boundary information of foreground objects) to enhance the disparity estimation task. The foreground semantic information may be used in two ways: for example, adding one or more tasks to the disparity estimation network for foreground semantic segmentation/boundary prediction, and using foreground semantic segmentation or boundary maps as additional input features (RGB-S input) in addition to the red-green-blue (RGB) images. The two methods may also be combined within one network. A multitask network with RGB-S inputs is recursively trained, wherein the input semantic map for the current epoch is computed by the multitask network in the previous epoch.
Some embodiments of the present disclosure provide a deep learning architecture for disparity estimation, with designs for its first three components (e.g., the feature extractor, the cost volume, and the second stage processing module). In some embodiments, an example network employs a feature extractor based on depth separable convolutions, a cost volume that computes matching costs using different similarity measures, and a scene understanding module that aggregates rich multi-scale context information with hole convolutions. The cost volume for calculating the matching cost using different similarity measures is described in detail with respect to FIG. 3. Some embodiments of the present disclosure also provide a multitask network that utilizes foreground-background segmentation information to enhance itself with better foreground-background perception. In some embodiments, the example network is trained end-to-end using an iterative training method. In some embodiments, the present training method outperforms prior methods by significant margins and achieves state-of-the-art results on the three most popular disparity estimation benchmarks (KITTI stereo 2015, KITTI stereo 2012, and Scene Flow).
A Convolutional Neural Network (CNN) based disparity estimation system may include feature extraction, matching cost estimation, second stage processing, and disparity regression. The feature extractor extracts discriminative high-level feature maps from the left input image and the right input image; a typical choice is a residual network. The cost volume calculates the matching cost between the left feature map and the right feature map; correlation, pixel-by-pixel difference, or simple concatenation may be used for the cost volume computation. The post-processing module utilizes a set of convolutional layers or a scene understanding module to further process and refine the output from the cost volume. Finally, a disparity regressor or a disparity classifier performs pixel-by-pixel disparity prediction.
For example, in a CNN-based disparity estimation system, depth features are first extracted from the rectified left and right images using a CNN-based feature extractor, such as a residual network (ResNet)-50 or VGG-16. Then, a cost volume (CV) is formed by measuring the matching cost between the left and right feature maps. The choices of matching cost may include correlation, absolute distance, and/or feature concatenation. The CV is further processed and refined by a second stage processing module for disparity regression. Furthermore, information from other low-level vision tasks (such as semantic segmentation or edge detection) can be used to enhance the disparity estimation system.
Fig. 1 illustrates a block diagram of a hole multi-scale network (AMNet) 100, according to some embodiments of the present disclosure. The AMNet 100 is a CNN-based disparity estimation system. The AMNet 100 of FIG. 1 includes a first depth separable ResNet (D-ResNet) 106, a second D-ResNet 108, a first hole multi-scale (AM) module 110, a second AM module 112, an Extended Cost Volume (ECV) module 114, a stacked AM module 118, an upscaling module 120, and a regression module 122.
In the example embodiment of fig. 1, the standard ResNet-50 backbone is modified into the D-ResNet (e.g., 106, 108) used as the backbone of the AMNet 100. In the AMNet 100 of fig. 1, the first D-ResNet 106 receives as input the first input image 102 and the second D-ResNet 108 receives as input the second input image 104. The first D-ResNet 106 and the second D-ResNet 108 may have shared weights. The first AM module 110 is connected to the output of the first D-ResNet 106 and the second AM module 112 is connected to the output of the second D-ResNet 108. The AM module 110 and the AM module 112 may also have shared weights.
Each of the AM modules 110 and 112 is designed as a scene understanding module that captures deep global context information as well as local details. In some embodiments, the D-ResNet modules (e.g., the first D-ResNet 106 and the second D-ResNet 108) followed by the AM modules (e.g., 110, 112) may function as feature extractors. Thus, the combination of the first D-ResNet 106 and the first AM module 110 may be used to extract features of the first input image 102, and the combination of the second D-ResNet 108 and the second AM module 112 may be used to extract features of the second input image 104.
The ECV module 114 is connected to the outputs of the AM modules 110 and 112. The ECV module 114 is a combination of a disparity level depth correlation sub-volume, a disparity level feature distance sub-volume, and a disparity level feature cascade sub-volume. The ECV module 114 carries rich information about the matching costs under different similarity measures. The output of the ECV module 114 can be processed by the stacked AM module 118 (connected at the output of the ECV module 114), and disparity can be predicted using soft argmin based disparity regression. In some embodiments, the argmin is the argument that minimizes a cost function.
In some embodiments, foreground-background segmentation information may be used to enhance disparity estimation. The AMNet 100 can be extended to a foreground-background aware AMNet (FBA-AMNet) that utilizes foreground-background segmentation to improve disparity estimation. The foreground-background segmentation map may be provided to the AMNet 100 as an additional input feature (RGB-S input), and the AMNet 100 is extended into a multitask network in which the primary task is disparity estimation and the auxiliary task is foreground-background segmentation. In some embodiments, the multitask network is referred to as FBA-AMNet. The auxiliary task helps the network (e.g., FBA-AMNet) develop better foreground-background perception, further improving disparity estimation.
Aspects of the various components of the AMNet100 will now be described in more detail in the following sections.
Fig. 2A shows a block diagram of a ResNet 201 block, and fig. 2B shows a block diagram of a D-ResNet203 block, according to some embodiments of the present disclosure. The D-ResNet203 block may be the first D-ResNet106 or the second D-ResNet108 of the AMNet100 of fig. 1.
The depth separable convolution decomposes the standard convolution into a depth convolution (e.g., 206 or 210) followed by a 1 x 1 convolution (e.g., 208 or 212). Depth separable convolution offers great potential in image classification tasks and has been further developed as a network backbone for other computer vision tasks. In an example embodiment of D-ResNet203, in a standard ResNet-50 (e.g., ResNet 201) backbone network, the standard convolution is replaced with a custom deep separable convolution.
A standard 3 × 3 convolutional layer in the ResNet-50 (e.g., ResNet 201) backbone network contains 9 × D_in × D_out parameters, while a depth separable convolutional layer (e.g., in D-ResNet 203) contains D_in × (9 + D_out) parameters, where D_in and D_out denote the sizes of the input feature map and the output feature map, respectively. Because D_out in the ResNet model may be, for example, 32 or more, directly replacing the standard convolution (e.g., 202 or 204) with a depth separable convolution (e.g., depth convolution 206 followed by 1 × 1 convolution 208, or depth convolution 210 followed by 1 × 1 convolution 212) would result in a model with very low complexity. In an example embodiment of D-ResNet 203, D_out in the depth separable convolutional layers (depth convolution 206 followed by 1 × 1 convolution 208, or depth convolution 210 followed by 1 × 1 convolution 212) is further increased so that the number of parameters in D-ResNet 203 approaches that of ResNet-50 (e.g., ResNet 201).
For each D-ResNet block (e.g., 203) modified from the ResNet block (e.g., 201), a 1 x 1 convolution (e.g., 214) is implemented into the input feature map in a shortcut (shortcut) connection for feature size matching.
Fig. 2A and 2B show a comparison between a standard ResNet block (e.g., ResNet 201) and a D-ResNet block (e.g., D-ResNet 203). The network specifications for the D-ResNet (e.g., D-ResNet 203) backbone network are listed in Table 1. Rectified linear units (ReLUs) and batch normalization are used after each layer. After the D-ResNet (e.g., D-ResNet 203) backbone, the height and width of the feature map are 1/4 of those of the input image.
In some embodiments, a 50-layer residual network (e.g., ResNet 201) can be modified, as in PSMNet, into a feature extractor consisting of 4 groups of residual blocks, where each residual block consists of two convolutional layers with 3 × 3 convolution kernels. The numbers of residual blocks in the 4 groups are {3, 16, 3, 3}. In the ResNet of PSMNet, the numbers of output feature maps for the four residual groups are D_out = {32, 64, 128, 128}, where D_in = D_out for all residual blocks. Because D_out is 32 or greater, directly replacing the standard convolution with a depth separable convolution would result in a model with a very small number of parameters. However, in the present D-ResNet (e.g., D-ResNet 203), D_out for the depth separable convolutional layers in the four residual groups is increased to D_out = {96, 256, 256, 256}, with D_in = 32 for the first residual group, so that the number of parameters in the present D-ResNet (e.g., D-ResNet 203) approaches the number of parameters of PSMNet. Thus, the present D-ResNet (e.g., D-ResNet 203) may learn deeper features than ResNet (e.g., ResNet 201) while having similar complexity. Since a depth separable residual block (e.g., 206, 210) may not have the same number of input features and output features, a point-wise 1 × 1 projection filter with D_out filters can be deployed on the shortcut (residual) connection to project the D_in input features onto D_out features (e.g., FIGS. 2A and 2B show a comparison between the standard ResNet (FIG. 2A) and the D-ResNet (FIG. 2B)). ReLU and batch normalization are used after each layer. After the D-ResNet backbone, the width and height of the output feature map are 1/4 of the width and height of the input image. The network specifications for the D-ResNet backbone are listed in Table 1.
Table 1: Detailed layer specification of D-ResNet. "Repeat" means that the current layer or block is repeated the indicated number of times. "str." and "dil." denote the stride and the dilation factor (e.g., in some embodiments, the stride is the step size or amount of movement as the filter slides, and the dilation factor is the upscaling or upsampling ratio of the filter after zero insertion), and "Sepconv." denotes a separable convolution.
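As an illustration of the depth separable residual block described above, the following is a minimal PyTorch-style sketch; the class names, the exact layer ordering, and the example channel widths are assumptions for illustration, not the layer specification of Table 1.

import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution:
    D_in * 9 + D_in * D_out = D_in * (9 + D_out) parameters."""
    def __init__(self, d_in, d_out, stride=1, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(d_in, d_in, 3, stride=stride, padding=dilation,
                                   dilation=dilation, groups=d_in, bias=False)
        self.pointwise = nn.Conv2d(d_in, d_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(d_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class DResNetBlock(nn.Module):
    """Residual block built from two separable convolutions, with a pointwise
    1x1 projection on the shortcut so D_in inputs can be added to D_out outputs."""
    def __init__(self, d_in, d_out, stride=1):
        super().__init__()
        self.conv1 = SeparableConv2d(d_in, d_out, stride=stride)
        self.conv2 = SeparableConv2d(d_out, d_out)
        self.project = nn.Conv2d(d_in, d_out, 1, stride=stride, bias=False)

    def forward(self, x):
        return self.conv2(self.conv1(x)) + self.project(x)

# Example: a block of the first residual group with D_in = 32, D_out = 96 (illustrative).
block = DResNetBlock(d_in=32, d_out=96)
out = block(torch.randn(1, 32, 64, 128))   # -> shape (1, 96, 64, 128)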
Fig. 3 illustrates the development of an ECV module for calculating a matching cost between feature maps of a pair of stereoscopic images, according to some embodiments of the present disclosure. The ECV module may be the ECV module 114 of fig. 1.
The ECV module 114 is developed to summarize three different matching costs between the left feature map (F_l) and the right feature map (F_r) across all disparity levels. The ECV module 114 may include three sub-volumes, e.g., a disparity level feature distance sub-volume, a disparity level depth correlation sub-volume, and a disparity level feature cascade sub-volume.
Given that the maximum disparity the AMNet (e.g., AMNet 100) can predict is D, a disparity level represents a disparity value d from 0 to D, at which F_r is shifted d pixels to the right and aligned with F_l using the necessary trimming and zero padding. For example, for a disparity level d, the right feature map F_r is first shifted d pixels to the right (e.g., 301) and then concatenated with the left feature map F_l (e.g., 302). The overhanging parts of the two feature maps F_l and F_r are trimmed, and zero padding (e.g., 303) is performed to pad (e.g., 304) the new feature map to the same width as the original left feature map F_l. This process is shown in FIG. 3.
In some embodiments, in the disparity level feature distance sub-volume, the point-wise (e.g., pixel-by-pixel) absolute difference between F_l and F_r is calculated over all disparity levels. Given that the size of F_l is H × W × C, where H, W, and C represent the height, width, and feature size, all D+1 difference maps are packed together to form a sub-volume of size H × W × (D+1) × C. In some embodiments, the 1 in (D+1) corresponds to the difference map between the left feature map F_l and the right feature map F_r without any shift. The process is repeated D times until the maximum disparity shift D of the left image relative to the right image is reached.
For example, at each disparity level d, after aligning the right feature map F_r shifted by d with the left feature map F_l, the pixel-by-pixel absolute difference is calculated. All (D+1) difference maps are packed together to form a sub-volume of size H × W × (D+1) × C.
In the disparity level depth correlation sub-volume, for square patches of size 2t+1 (e.g., "t" is a parameter), the correlation between a patch p_l centered at x_1 in F_l and a patch p_2 centered at x_2 in F_r is defined as in equation (1):

c(x_1, x_2) = Σ_{o ∈ [-t, t] × [-t, t]} ⟨F_l(x_1 + o), F_r(x_2 + o)⟩.    (1)
thus, instead of calculating plAnd with x1All other blocks centered on a value within the neighborhood of size D (e.g., extending along a horizontal line), spread over p of all disparity levels (e.g., extending along a disparity level)lWith alignment of FrP in (1)lIs calculated. This results in a daughter of size H × W × (D +1) × C. In order to make the size of the output feature map comparable to other subvolumes, a depth correlation may be implemented. At each disparity level, the depth correlation of two aligned blocks is calculated and packed across all depth channels, as in equations 2 and 3 (below).
Figure BDA0002275359030000121
c(x1,x1)=[c0(x1,x1),c1(x1,x1),...,cC(x1,x1)], (3)
Depth correlation is calculated for all blocks throughout all disparity levels, which results in a subvolume of size H × W × (D +1) × C.
In the disparity level feature cascade sub-volume, at each disparity level d, the right feature map F_r shifted by d and the left feature map F_l are directly concatenated. All D+1 concatenated feature maps are packed together to form a sub-volume of size H × W × (D+1) × 2C. In some embodiments, a concatenated map is formed for each shift of the right feature map relative to the left feature map F_l.
Finally, all sub-volumes of the ECV are concatenated along the depth dimension, which results in an ECV of size H × W × (D+1) × 4C. The ECV can provide the subsequent modules in the system (e.g., the stacked AM module and the regression module of AMNet 100) with rich information describing the matching costs between the left feature map F_l and the right feature map F_r under different measures at all disparity levels. In some embodiments, by introducing the disparity dimension into the ECV module (e.g., ECV module 114), 3D convolutions are implemented in the subsequent convolutional layers (e.g., the stacked AM module 118) of the network (e.g., AMNet 100).
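One possible way to assemble such an extended cost volume from a pair of feature maps is sketched below in PyTorch-style code; the depth correlation is taken as the per-channel product of the aligned features (the t = 0 case used in the training settings described later), and the function names and tensor layout are assumptions for illustration.

import torch
import torch.nn.functional as F

def shift_right_feature(f_r, d):
    """Shift the right feature map d pixels to the right: trim the overhanging
    columns and zero-pad on the left so it stays aligned with the left feature
    map. f_r has shape (B, C, H, W)."""
    if d == 0:
        return f_r
    return F.pad(f_r[:, :, :, :-d], (d, 0, 0, 0))  # pad order: (left, right, top, bottom)

def extended_cost_volume(f_l, f_r, max_disp):
    """Build an ECV of shape (B, 4C, D + 1, H, W) from three sub-volumes packed
    over all disparity levels: per-channel absolute difference (C), depthwise
    correlation with t = 0 (C), and feature concatenation (2C)."""
    levels = []
    for d in range(max_disp + 1):
        f_r_d = shift_right_feature(f_r, d)
        distance = torch.abs(f_l - f_r_d)        # disparity level feature distance
        correlation = f_l * f_r_d                # disparity level depth correlation (t = 0)
        concat = torch.cat([f_l, f_r_d], dim=1)  # disparity level feature cascade
        levels.append(torch.cat([distance, correlation, concat], dim=1))
    # Stack along a new disparity dimension, ready for 3D convolutions.
    return torch.stack(levels, dim=2)

# Example with small illustrative sizes: B=1, C=8, H=16, W=32, D=4.
f_l = torch.randn(1, 8, 16, 32)
f_r = torch.randn(1, 8, 16, 32)
ecv = extended_cost_volume(f_l, f_r, max_disp=4)   # -> (1, 32, 5, 16, 32)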
Fig. 4 illustrates the structure and specifications of the AM modules and the stacked AM module according to some embodiments of the present disclosure. In FIG. 4, "dil." indicates the dilation factor of each hole convolution layer.
The stacked AM module 402 of fig. 4 is a second level processing module connected at the output of the ECV module. The stacked AM module 402 processes the output from the ECV module (e.g., the matching cost information between the left feature map F_l and the right feature map F_r at all disparity levels) to aggregate multi-scale context information for disparity regression.
In CNN-based low-level disparity systems, a hole convolution module and a scene understanding module may be used to aggregate multi-scale context information for dense disparity prediction. The stacked AM module of fig. 4 may be used as a scene understanding module. The stacked AM module 402 of fig. 4 includes three AM blocks 404, 406, and 408. Three AM blocks 404, 406, and 408 are stacked together with internal short-cut connections to form stacked AM module 402. In some embodiments, the shortcut connection is a connection between neural network layers that skips some intermediate layers.
An AM block is a block of convolutional layers having an encoder-decoder structure. An AM block can process the input feature map using k pairs of a 2-stride convolution layer and a 1-stride convolution layer as the encoder section. The choice of k may be, for example, 3, 4, or 5. A set of deconvolution layers is then implemented as the decoder section to further process the feature map and upscale it back to its original size.
In some embodiments, in the stacked AM module 402, all of the convolutional layers in the encoder of each of the AM blocks (e.g., 404, 406, and 408) are hole convolutional layers (e.g., in fig. 4, only AM block 406 is shown expanded; however, all of the convolutional layers in each of AM blocks 404 and 408 are also hole convolutional layers, similar to AM block 406). The dilation factor in each AM block (e.g., 406) may increase by a factor of 2. Because hole convolution naturally carries more information and preserves the size of the feature map, the decoder section can be eliminated and two additional convolution layers can be added at the end for feature refinement. In one example, an AM block (e.g., 404, 406, or 408) processes the matching cost information between the feature maps from the ECV module using k pairs of 3 × 3 hole convolution layers and two 1 × 1 convolution layers, where k is an integer power of 2 and greater than 0. For example, AM block 406 is designed as a set of 3 × 3 hole convolutions with different dilation factors [1, 2, 2, 4, 4, …, k/2, k/2, k]. Without loss of spatial resolution, the dilation factor increases as the AM block 406 gets deeper, to capture dense multi-scale context information. In the AM block 406, two 1 × 1 convolutions with a dilation factor of 1 are added at the end for feature refinement and feature resizing.
To gather more coarse-to-fine context information, a cascade of three AM modules (e.g., 404, 406, and 408) is implemented with internal shortcut connections to form stacked AM module 402.
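A minimal PyTorch-style sketch of an AM block and of three AM blocks stacked with shortcut connections is given below; it uses 2D convolutions for brevity (the stacked AM module of the disclosure operates on the ECV with 3D convolutions), and the class names, channel width, and batch-normalization placement are assumptions for illustration.

import torch
import torch.nn as nn

def am_dilation_factors(k):
    """Dilation series [1, 2, 2, 4, 4, ..., k/2, k/2, k] for the 3x3 hole convolutions."""
    factors, d = [1], 2
    while d < k:
        factors += [d, d]
        d *= 2
    return factors + [k]

class AMBlock(nn.Module):
    """Hole (atrous) multi-scale block: 3x3 dilated convolutions with growing
    dilation factors, then two 1x1 convolutions for refinement and resizing."""
    def __init__(self, channels, k):
        super().__init__()
        layers = []
        for dil in am_dilation_factors(k):
            layers += [nn.Conv2d(channels, channels, 3, padding=dil, dilation=dil),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
                   nn.Conv2d(channels, channels, 1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class StackedAM(nn.Module):
    """Three AM blocks chained with shortcut connections; each block's output is
    kept so that a separate loss can be computed from each during training."""
    def __init__(self, channels, k):
        super().__init__()
        self.blocks = nn.ModuleList([AMBlock(channels, k) for _ in range(3)])

    def forward(self, x):
        outputs = []
        for block in self.blocks:
            x = block(x) + x   # internal shortcut connection
            outputs.append(x)
        return outputs

# Example: k = 8 gives dilation factors [1, 2, 2, 4, 4, 8] (illustrative sizes).
module = StackedAM(channels=32, k=8)
outs = module(torch.randn(1, 32, 64, 128))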
As discussed with respect to fig. 1, AM modules (e.g., 110, 112) following the D-ResNet (e.g., 106, 108) backbone may be used to form the feature extractor, and stacked AM modules (e.g., 118) following the ECV (e.g., 114) may be used as second level processing modules. The 3D convolution is implemented in a stacked AM module (e.g., 118).
In some embodiments, the upscaling module 120 may be an upsampling module that renders the input features to a higher resolution.
In some embodiments, a soft argmin operation may be employed for disparity regression (e.g., in the regression module 122). For one output layer, the expectation over the D+1 disparities is computed as the final disparity prediction, as shown in equation (4):

d_i = Σ_{j=0}^{D} j · p_{i,j},    (4)

where p_{i,j} is the softmax probability of disparity j at pixel i, and D is the maximum disparity value.
For example, the predicted disparity d_i may be based on the aggregated multi-scale context information from the stacked AM module (e.g., 118, 402). In some embodiments, the stacked AM module (e.g., 118, 402) generates higher-level features based on the cascaded feature maps in the ECV module (e.g., 114), which are then used to estimate the disparity d_i. This is referred to as cost aggregation and disparity computation.
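The soft argmin regression of equation (4) can be sketched as follows; negating the aggregated cost so that a low matching cost maps to a high probability is an assumption that follows common practice.

import torch
import torch.nn.functional as F

def soft_argmin_disparity(cost):
    """Soft argmin disparity regression.

    cost: aggregated matching cost of shape (B, D + 1, H, W), where index j
    along dim 1 corresponds to disparity j. Returns the expected disparity
    per pixel, shape (B, H, W)."""
    prob = F.softmax(-cost, dim=1)   # p_{i,j}: softmax over the disparity dimension
    disparities = torch.arange(cost.size(1), dtype=cost.dtype, device=cost.device)
    return torch.sum(prob * disparities.view(1, -1, 1, 1), dim=1)

# Example with D = 192 and a feature map at 1/4 of a 376 x 1248 input (illustrative).
cost = torch.randn(2, 193, 94, 312)
disp = soft_argmin_disparity(cost)   # -> (2, 94, 312), values in [0, 192]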
A smooth L1 loss can be used to measure the difference between the predicted disparity d_i and the ground truth disparity d̂_i. The loss is calculated as the average smooth L1 loss over all labeled pixels. During training, three losses are calculated separately from the three AM blocks (e.g., 404, 406, and 408) in the stacked AM module (e.g., 402), each as shown in equations (5) and (6), and the three losses are summed to obtain the final loss:

L_disp = (1/N) Σ_{i=1}^{N} smooth_L1(d_i − d̂_i),    (5)

smooth_L1(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise,    (6)

where N is the total number of labeled pixels. In some embodiments, the AMNet (e.g., AMNet 100) may be trained with this final loss, and the AMNet (e.g., AMNet 100) applies the knowledge from such training when calculating the final disparity d_i or generating the disparity map 124.
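The training loss of equations (5) and (6) can be sketched as follows; the masking of labeled pixels and the use of torch.nn.functional.smooth_l1_loss are assumptions for illustration.

import torch
import torch.nn.functional as F

def disparity_loss(predictions, gt_disparity, valid_mask):
    """Sum of the average smooth L1 losses over labeled pixels for the disparity
    maps regressed from the three AM blocks.

    predictions: list of three tensors of shape (B, H, W)
    gt_disparity: ground truth disparity, shape (B, H, W)
    valid_mask: boolean mask of pixels with ground truth labels."""
    total = 0.0
    for pred in predictions:
        total = total + F.smooth_l1_loss(pred[valid_mask], gt_disparity[valid_mask])
    return total

# Example with sparse ground truth, where unlabeled pixels are marked as 0 (illustrative).
preds = [torch.rand(2, 94, 312) * 192 for _ in range(3)]
gt = torch.rand(2, 94, 312) * 192
mask = gt > 0
loss = disparity_loss(preds, gt, mask)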
During testing, only the output from the final AM block (e.g., 408) is used for disparity regression. Based on the disparity regression, the final disparity d_i may be predicted (e.g., the disparity map 124 may be generated).
Some embodiments of the present disclosure provide a system that takes disparity estimation as a primary task while utilizing semantic information to assist disparity estimation. Semantic information, such as semantic segmentation maps and semantic boundaries, define the category and location of each object in an image. It helps the system to understand the image better and further helps the disparity estimation. For a disparity estimation network, the goal is to predict disparity values for all pixels. Given the fact that disparity changes dramatically at the position where the foreground object appears, adding a priori information describing the foreground object will facilitate accurate disparity estimation by the system. Furthermore, better perception of foreground objects may lead to better disparity estimation. In some embodiments, in an outdoor driving scenario (such as a KITTI), foreground objects are defined as vehicles and people. In some embodiments, foreground-background segmentation maps may be used to improve disparity estimation. In some embodiments, only foreground and background pixels are distinguished. The exact class of foreground object is not of interest. Furthermore, in some embodiments, only semantic information of the foreground object is considered, regardless of the exact class of the foreground object.
There are two common methods of using foreground-background segmentation information or foreground semantic information into a disparity estimation system (e.g., AMNet 100).
The first approach is to extend the network into a multitasking network where the primary task is disparity estimation and the secondary task is foreground segmentation or foreground semantic boundary prediction. Fig. 5A illustrates a multitasking network 501 according to some embodiments of the present disclosure. The multitasking network 501 is designed to have a shared root (base) and a different head (head) for both tasks. For example, in the multitasking network 501, both tasks share the same root at the beginning of the network fabric and have separate headers at the end of the network fabric. The shared root may include input images (e.g., Img1 and Img2)502 and 504, feature extractors 506 and 508, 1 × 1 convolution modules 510 and 512, ECV module 514, stacked AM module 518, upsampling module 520, and regression module 522. In the multitasking network 501, the shared root may be the AMNet100 of fig. 1. For example, feature extractors 506 and 508 may be D-ResNet106 and 108, 1 × 1 convolution modules 510 and 512 may be AM modules 110 and 112, ECV module 514 may be ECV module 114, stacked AM module 518 may be stacked AM module 118, upsampling module 520 may be upscaling module 120, and regression module 522 may be regression module 122 of fig. 1. The shared root may be used to generate a disparity output 524 (e.g., disparity estimate). The CNN module 526 (connected at the output of the 1 x 1 convolution module 510) followed by the upsampling module 528 and the prediction module 530 may be used to generate a semantic output 532 (e.g., foreground segmentation or foreground semantic boundary prediction).
The overall loss function of the multitasking network 501 is a weighted sum of the losses of the two tasks. Due to data imbalance in the subtasks (e.g., foreground segmentation or foreground semantic boundary prediction), the weight of foreground pixels or foreground semantic boundaries in the total loss function increases. By optimizing the multitasking network 501 for both tasks, the shared root is trained to implicitly have a better perception of foreground objects, which leads to better disparity estimation.
The second approach is to directly provide additional foreground-background segmentation information as an additional input in addition to the RGB image (RGB-S input) to guide the disparity estimation network. This requires accurate segmentation maps in both the training phase and the testing phase. Fig. 5B illustrates another multitasking network 503 according to some embodiments of the present disclosure for a second method. The network 503 may be the AMNet100 of fig. 1 with additional foreground-background segmentation information as an additional input. For example, feature extractors 506 and 508 may be D-ResNet106 and 108, 1 × 1 convolution modules 510 and 512 may be AM modules 110 and 112, ECV module 514 may be ECV module 114, stacked AM module 518 may be stacked AM module 118, upsampling module 520 may be upscaling module 120, and regression module 522 may be regression module 122 of AMNet100 of fig. 1. In some embodiments, the foreground segmentation map or the foreground semantic boundary map may be used as another input feature (e.g., 534, 536) to the network 503 in addition to the RGB images (e.g., 502, 504). This forms the RGB-S input to the network 503. Additional input signals (e.g., 534, 536) send a priori knowledge to the network 503 for better image understanding. In this way, features of better image understanding come from additional inputs (534, 536). The network 503 is not trained to better understand the images.
Because both techniques (as discussed with respect to fig. 5A and 5B) improve the performance of the disparity estimation task, the two techniques can be combined together in a single system.
Fig. 6 illustrates an FBA-AMNet 600 system according to some embodiments of the present disclosure. The AMNet100 of fig. 1 can be extended to FBA-AMNet 600. The FBA-AMNet 600 is designed as a multitasking network (as discussed with respect to fig. 5A and 5B) with RGB-S inputs.
The FBA-AMNet 600 comprises a first D-ResNet606 configured to receive a first input image 602 and first foreground semantic information 632a, and a second D-ResNet608 configured to receive a second input image 604 and second foreground semantic information 632 b. The FBA-AMNet 600 also includes a first AM module 610 connected at the output of the first D-ResNet606, a second AM module 612 connected at the output of the second D-ResNet608, an ECV module 614, a stacked AM module 618, an upscaling module 620, and a regression module 622.
The first D-ResNet606 and the second D-ResNet608 may have shared weights. The first AM module 610 and the second AM module 612 may also have shared weights. Each of the first AM module 610 and the second AM module 612 is designed as a scene understanding module that captures deep global context information as well as local details. In some embodiments, a combination of a first D-ResNet606 and a first AM module 610 may be used to extract features of the first input image 602, and a combination of a second D-ResNet608 and a second AM module 612 may be used to extract features of the second input image 604.
The ECV module 614 is connected to the outputs of the first AM module 610 and the second AM module 612. The ECV module 614 is a combination of a disparity level depth correlation sub-volume, a disparity level feature distance sub-volume, and a disparity level feature cascade sub-volume. The ECV module 614 carries rich information about the matching costs under different similarity measures. The stacked AM module 618 is a second level processing module connected at the output of the ECV module 614. The stacked AM module 618 processes the output from the ECV module 614 (e.g., the matching cost information between the left feature map F_l and the right feature map F_r at all disparity levels) to aggregate multi-scale context information for disparity regression. A disparity regression may be calculated at the regression module 622 and, based on the disparity regression, the final disparity d_i may be predicted (e.g., a disparity map 624 may be generated).
A first CNN module 626a (connected at the output of the first AM module 610) followed by a first upsampling module 628a and a first prediction module 630a may be used to generate a first semantic output 632a (e.g., foreground-background segmentation map or foreground semantic information), and a second CNN module 626b (connected at the output of the second AM module 612) followed by a second upsampling module 628b and a second prediction module 630b may be used to generate a second semantic output 632b (e.g., foreground-background segmentation map or foreground semantic information). In some embodiments, CNN modules 626a and 626b may be a multi-layer neural network. In some embodiments, the first upsampling module 628a and the second upsampling module 628b may be used to render the input features to a higher resolution.
The FBA-AMNet 600 is iteratively (or recursively) trained by providing semantic outputs 632a and 632b (e.g., foreground-background segmentation maps) via feedback loops 616a and 616b at the current epoch as additional segmentation inputs to the FBA-AMNet 600 at the next epoch.
For example, at epoch 0, the input foreground semantic information to the multitasking network is initialized to zero.
In epoch K, the input foreground semantic information (e.g., 632a, 632b) for an image to the multitasking network is the output from the multitasking network for that image in epoch K-1.
This process is repeated until convergence.
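The iterative training procedure can be summarized with the following sketch; the model interface, the data loader fields, and the per-image cache are assumptions for illustration, and the essential point is that the segmentation maps predicted at epoch K-1 are fed back as the additional input at epoch K, starting from all-zero maps at epoch 0.

import torch

def train_fba_amnet(model, loader, optimizer, num_epochs, device="cuda"):
    """Recursive training: segmentation maps predicted in epoch K - 1 are used
    as the extra 'S' input channel in epoch K. `model(...)` is assumed to return
    the disparity map, the two segmentation predictions, and the total loss."""
    seg_cache = {}  # image_id -> (left seg map, right seg map) from the previous epoch
    for epoch in range(num_epochs):
        for image_id, left, right, gt_disp, gt_seg in loader:  # image_id: hashable id per stereo pair
            left, right = left.to(device), right.to(device)
            if image_id in seg_cache:                      # epoch K: reuse epoch K-1 output
                seg_l, seg_r = seg_cache[image_id]
            else:                                          # epoch 0: initialize to zero
                seg_l = torch.zeros_like(left[:, :1])
                seg_r = torch.zeros_like(right[:, :1])
            disp, pred_seg_l, pred_seg_r, loss = model(left, right, seg_l, seg_r,
                                                       gt_disp.to(device), gt_seg.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Cache the refined segmentation predictions for the next epoch.
            seg_cache[image_id] = (pred_seg_l.detach(), pred_seg_r.detach())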
In some embodiments, during the inference phase (e.g., at the first prediction module 630a and the second prediction module 630b), the segmentation task (e.g., foreground-background segmentation) may be omitted, and a null map is provided as an additional input representing the foreground-background segmentation maps (e.g., 632a and 632 b). Although the segmentation task (e.g., foreground-background segmentation) is ignored, the performance of the FBA-amnt system 600 has improved because the multitasking network (e.g., including the first CNN module 626a followed by the first upsampling module 628a and the first prediction module 630a or including the second CNN module 626b followed by the second upsampling module 628b and the second prediction module 630b) estimates the foreground-background segmentation map (e.g., 632a and 632b) and implicitly learns the foreground object boundaries.
In some embodiments, two inference iterations are run, where a null graph is provided at the first iteration and the foreground-background segmentation graphs (e.g., 632a and 632b) output from the first iteration are used as additional inputs for the second iteration.
All layers (e.g., 606, 608, 610, and 612) in the feature extractor are shared between the two tasks (e.g., disparity estimation 624 and foreground-background segmentation prediction (e.g., foreground semantic information 632a and 632b)). On top of the feature extractor, a binary classification layer (626a, 626b), an upsampling layer (628a, 628b), and a softmax layer (630a, 630b) are added for foreground-background segmentation (e.g., 632a, 632b).
During training, FBA-AMNet 600 continually refines and utilizes its foreground-background segmentation predictions (e.g., 632a, 632b) to learn a better perception of foreground objects. The loss of the FBA-AMNet 600 is calculated as a weighted sum of two losses from two tasks:
L = L_disp + λ · L_seg, where λ is the weight of L_seg, L_disp represents the loss of the disparity estimation (e.g., 624), and L_seg represents the loss of the foreground-background segmentation prediction (e.g., 632a, 632b). During testing, the segmentation task is ignored and a null map is used as the additional input.
The weight of foreground pixels or foreground semantic boundaries in the loss function increases due to data imbalance in the subtasks (e.g., foreground-background segmentation prediction). By optimizing the multitasking FBA-AMNet 600 for both tasks (e.g., foreground-background segmentation and disparity prediction), the shared roots (606, 608, 610, 612, 614, 618, 620, 622) are trained to implicitly have a better perception of foreground objects, which leads to a better disparity estimation.
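A sketch of such a weighted two-task loss with an upweighted foreground class is given below; λ = 0.5 follows the training settings described later, while the function name and the 1:5 class-weight ratio are assumptions for illustration.

import torch
import torch.nn.functional as F

def fba_amnet_loss(pred_disp, gt_disp, valid_mask, seg_logits, gt_seg, lam=0.5):
    """Total loss L = L_disp + lambda * L_seg, with the foreground class upweighted.

    pred_disp, gt_disp: (B, H, W); valid_mask: mask of labeled pixels.
    seg_logits: (B, 2, H, W) background/foreground logits; gt_seg: (B, H, W) in {0, 1}."""
    l_disp = F.smooth_l1_loss(pred_disp[valid_mask], gt_disp[valid_mask])
    # Upweight the rarer foreground class; the 1:5 ratio is illustrative only.
    class_weights = torch.tensor([1.0, 5.0], device=seg_logits.device)
    l_seg = F.cross_entropy(seg_logits, gt_seg.long(), weight=class_weights)
    return l_disp + lam * l_seg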
Fig. 7 illustrates a method for disparity estimation for FBA-AMNet systems, according to some embodiments of the present disclosure. The method 700 of fig. 7 can be implemented in the FBA-AMNet system 600 of fig. 6.
At 702, input images 602 and 604 are received at D-ResNet606 and D-ResNet 608.
At 704, feature maps are extracted from the input images (e.g., 602, 604) by feature extractor modules (e.g., a combination of 606 and 610 and a combination of 608 and 612).
At 706, foreground semantic information (e.g., 632a, 632b) is generated by a semantic information module (e.g., a combination of 626a, 628a, and 630a and a combination of 626b, 628b, and 630b) connected at an output of the feature extractor module (in particular, AM module 610, AM module 612 of the feature extractor module). The foreground semantic information (e.g., 632a, 632b) is provided to one or more feature extractor modules (specifically, D-ResNet606 and D-ResNet608 of the one or more feature extractor modules) for use in estimating disparity at the next training epoch.
At 708, matching cost information between the one or more feature maps is computed by an ECV module (e.g., 614) connected at the output of the feature extractor module (specifically, AM module 610, 612 of the feature extractor module).
At 710, matching cost information between one or more feature maps from the ECV module 614 is processed by a stacked AM module 618 connected at the output of the ECV module 614 to aggregate multi-scale context information for disparity regression.
At 712, the disparity of the FBA-AMNet system 600 is estimated based on the aggregated multi-scale context information and foreground semantic information (e.g., 632a, 632b) by a regression module 622 connected at the output of the stacked AM module 618.
The FBA-AMNet system 600 is recursively trained using one or more feature maps and one or more foreground semantic information (e.g., 632a, 632b) until convergence.
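The following sketch illustrates how such recursive training could be organized, with the foreground-background maps predicted in one epoch cached and reused as the additional input in the next epoch; the model interface, the per-batch caching scheme, and the assumption of a non-shuffled loader are all illustrative simplifications:

import torch

def train_fba_amnet(model, loader, optimizer, loss_fn, num_epochs):
    # Foreground-background maps are initialized as null (zero) maps at
    # epoch 0 and refined across epochs (steps 702-712).
    cached_seg = {}
    for epoch in range(num_epochs):
        for idx, (left, right, gt_disp, gt_seg) in enumerate(loader):
            seg_in = cached_seg.get(idx, torch.zeros_like(left[:, :1]))
            disparity, seg_logits = model(left, right, seg_in, seg_in)
            loss = loss_fn(disparity, gt_disp, seg_logits, gt_seg)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Cache the new foreground prediction for the next epoch.
            cached_seg[idx] = seg_logits.softmax(dim=1)[:, 1:2].detach()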
The method 700 can be evaluated on the three most popular disparity estimation benchmarks: KITTI stereo 2015, KITTI stereo 2012, and Scene Flow.
The KITTI benchmarks provide images of size 376 × 1248 captured by a pair of stereo cameras in real-world driving scenes. KITTI stereo 2015 includes 200 training stereo image pairs and 200 test stereo image pairs. A sparse ground truth disparity map may be provided with the training data. The D1-all error can be used as the primary evaluation metric; it measures the percentage of pixels whose estimation error is ≥ 3 px and ≥ 5% of the ground truth disparity.
KITTI stereo 2012 includes 194 training stereo image pairs and 195 test stereo image pairs. The Out-Noc error may be used as the primary evaluation metric; it measures, over all non-occluded pixels, the percentage of pixels whose estimation error is ≥ 3 px.
The Scene Flow benchmark is a set of synthetic data sets containing approximately 39000 stereo image pairs of size 540 × 960 rendered from various synthetic sequences. Three subsets containing about 35000 stereo image pairs are available for training (FlyingThings3D training, Monkaa, and Driving), and one subset containing about 4000 stereo image pairs is available for testing (FlyingThings3D testing). Scene Flow provides complete ground truth disparity maps for all images. The end-point error (EPE) can be used as the evaluation metric.
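For reference, the three evaluation metrics can be computed as in the following sketch; tensor shapes and valid-pixel masking conventions are assumptions, and the thresholds follow the descriptions above:

import torch

def epe(pred, gt, valid):
    # Mean absolute disparity error over valid ground-truth pixels (Scene Flow).
    return (pred[valid] - gt[valid]).abs().mean()

def d1_all(pred, gt, valid):
    # Percentage of valid pixels whose error is >= 3 px and >= 5% of the
    # ground truth disparity (KITTI stereo 2015).
    err = (pred[valid] - gt[valid]).abs()
    bad = (err >= 3.0) & (err >= 0.05 * gt[valid])
    return 100.0 * bad.float().mean()

def out_noc(pred, gt, noc_mask, threshold=3.0):
    # Percentage of non-occluded pixels whose error is >= the error threshold
    # (KITTI stereo 2012).
    err = (pred[noc_mask] - gt[noc_mask]).abs()
    return 100.0 * (err >= threshold).float().mean()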
In some embodiments, AMNet-8 and AMNet-32 are first trained from scratch on the Scene Flow training set. For the two models, the dilation factors of the hole convolution layers in the AM module are set to [1,2,2,4,4,8,1] and [1,2,2,4,4,8,8,16,16,32,1], respectively. The maximum disparity D is set to 192. The parameter t in the ECV is set to 0. The weight λ of the segmentation loss is set to 0.5. For a pair of input images, two patches of size 256 × 512 at the same random position are cropped as the input to the network. All pixels with ground truth disparities greater than D are excluded from training. The model (e.g., 600) is trained end-to-end with an Adam optimizer at a batch size of 16 over 15 epochs. The learning rate is initially set to 10^-3 and reduced to 10^-4 after 10 epochs. FBA-AMNet is not trained on Scene Flow because the segmentation labels in Scene Flow are not consistent across scenes or objects.
In some embodiments, four models (e.g., AMNet-8, AMNet-32 (e.g., AMNet100), FBA-AMNet-8, and FBA-AMNet-32 (e.g., FBA-AMNet 600)) are then fine-tuned on KITTI, starting from the pre-trained AMNet-8 and AMNet-32 (e.g., AMNet100) models. To train the FBA-AMNet models, the first layer in the AMNet model is modified to have 4 channels for the RGB-S input, and a binary classification layer for the foreground-background segmentation task, a bilinear upsampling layer, and a softmax layer may be added. The models are trained using the iterative training method described for fig. 6 and 7, with an Adam optimizer at a batch size of 12 over 1000 epochs. The learning rate is initially set to 10^-3 and reduced to 10^-4 after 600 epochs. The learning rate for the newly added layers can be set up to 10 times higher. Other settings are the same as in the Scene Flow training process. The foreground-background segmentation map is initialized to zero at epoch 0.
All models were implemented using PyTorch on NVIDIA Titan-Xp GPUs.
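A sketch of the KITTI fine-tuning optimizer configuration described above is given below; the way the newly added layers are identified by name is an assumption for illustration:

import torch

def build_optimizer_and_scheduler(model, new_layer_names):
    # Separate the newly added layers (e.g., the segmentation head and the
    # modified first layer) so their learning rate can be set up to 10x higher.
    new_params, base_params = [], []
    for name, p in model.named_parameters():
        (new_params if any(n in name for n in new_layer_names) else base_params).append(p)
    optimizer = torch.optim.Adam([
        {'params': base_params, 'lr': 1e-3},
        {'params': new_params, 'lr': 1e-2},
    ])
    # Drop the learning rate by 10x after epoch 600.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[600], gamma=0.1)
    return optimizer, scheduler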
Table 2: the performance of the model of the example embodiment on the KITTI stereo 2015 test set was compared to the most published methods. D1-bg represents the evaluation of static background pixels. D1-fg indicates that the dynamic foreground pixels are evaluated. D1-all indicates that all pixels are evaluated.
Table 3: performance comparisons on the KITTI stereo 2012 test set. The error threshold is set to 3.
Method    GC-Net[15]    DispNetC[20]    PSMNet[2]    AMNet-8    AMNet-32
EPE       2.51          1.68            1.09         0.81       0.74
Table 4: Performance comparison on the Scene Flow test set. All results are reported in EPE.
Results on the KITTI stereo 2015 test set: Four models (AMNet-8, AMNet-32 (e.g., AMNet100), FBA-AMNet-8, and FBA-AMNet-32 (e.g., FBA-AMNet 600)) were compared with all published methods on the KITTI stereo 2015 test set under all evaluation settings. The results are shown in table 2. All four models outperform the previous methods by a significant margin on D1-all. Compared to the previous best result of 2.25%, the FBA-AMNet-32 (e.g., FBA-AMNet 600) model pushes D1-all (all pixels) down to 1.93%, a relative gain of 14.2%.
Results on the KITTI stereo 2012 test set: A performance comparison on the KITTI stereo 2012 test set is shown in table 3. Consistent with the KITTI stereo 2015 results, the four models are significantly superior to all other published methods under all evaluation settings. Compared to the previous best result of 1.49%, the FBA-AMNet-32 (e.g., FBA-AMNet 600) model reduces Out-Noc to 1.32%, a relative gain of 11.4%. Only the results for error threshold 3 are reported here; the results for the other error thresholds are consistent with those for error threshold 3.
Results on the Scene Flow test set: The AMNet-8 model and the AMNet-32 (e.g., AMNet100) model are compared with all published methods on the Scene Flow test set. Both models outperform the others by large margins. The results, reported in EPE, are shown in table 4. Compared to the previous best result of 1.09, the AMNet-32 (e.g., AMNet100) model pushes the EPE down to 0.74, a relative gain of 32.1%.
The following section analyzes the effectiveness of each component of the architecture in detail. Since KITTI only allows a limited number of evaluations on the test set per month, most of the analysis is performed on the Scene Flow test set.
This section explores how modifying the network backbone from the standard ResNet-50 to D-ResNet changes performance and complexity. The following three models are compared: an AMNet-32 (e.g., AMNet100) model using ResNet-50 as the network backbone, an AMNet-32 (e.g., AMNet100) model using a ResNet-50 modified by directly replacing the standard convolutions with depthwise separable convolutions as the network backbone, and the example AMNet-32 (e.g., AMNet100) model. The results on the Scene Flow test set and the number of parameters in each model are shown in table 5. D-ResNet is preferred over the standard ResNet-50 as the network backbone while having a smaller number of parameters.
Table 5: performance comparison and complexity comparison of three models using different network backbones. And reporting the result on the scene flow test set.
An ablation study was performed for the ECV with seven models modified from the AMNet-32 (e.g., AMNet100) model by using different combinations of the three sub-volumes (the disparity level feature distance sub-volume, the disparity level depth correlation sub-volume, and the disparity level feature cascade sub-volume). The result comparison and feature-size comparison on the Scene Flow test set are shown in table 6. The results show that the disparity level feature distance sub-volume is more effective than the other two, and that the combination of the three sub-volumes (e.g., ECV module 614) yields the best performance.
Table 6: and comparing the performance and the characteristic size of the models using different cost bodies. "dist.", "corr." and "FC" denote the disparity level feature distance, disparity level depth correlation and disparity level feature cascade, respectively. All results are reported in EPE on the scene flow test set.
Table 7: performance comparison and run-time comparison for each image. Reporting all results on the scene flow test set. The size of the test image was 540 x 960.
In some embodiments, a deeper structure allows the AM module (e.g., 610, 612) to gather more multi-scale context information, resulting in a finer feature representation, while being more computationally expensive. The effect of different AM module structures on the performance and speed of the AMNet-32 (e.g., AMNet100) model can be analyzed by setting the maximum dilation factor k of the AM module to 4, 8, 16, and 32. The performance and speed comparison of the four models on the Scene Flow test set is shown in table 7. All test images are of size 540 × 960.
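A simplified sketch of an AM module with maximum dilation factor k is shown below; the channel count, the residual form of the shortcut connection, and the exact pairing of 3 x 3 and 1 x 1 layers are assumptions for illustration:

import torch.nn as nn

def dilation_pattern(k):
    # [1, 2, 2, 4, 4, ..., k/2, k/2, k]; e.g., k=8 gives [1, 2, 2, 4, 4, 8]
    # and k=32 gives [1, 2, 2, 4, 4, 8, 8, 16, 16, 32].
    rates, r = [1], 2
    while r < k:
        rates += [r, r]
        r *= 2
    return rates + [k]

class AtrousMultiscaleModule(nn.Module):
    def __init__(self, channels, k):
        super().__init__()
        layers = []
        for rate in dilation_pattern(k):
            layers += [nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=rate, dilation=rate, bias=False),
                       nn.ReLU(inplace=True)]
        # Two 1x1 convolutions (dilation 1) for feature refinement and resizing.
        layers += [nn.Conv2d(channels, channels, kernel_size=1),
                   nn.ReLU(inplace=True),
                   nn.Conv2d(channels, channels, kernel_size=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)  # shortcut connection (assumed residual form)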
FBA-AMNet (e.g., 600) is designed and trained to generate smoother and more accurate shapes for foreground objects than AMNet, which results in a finer disparity map. Fig. 8 shows the disparity estimation results of the AMNet model (e.g., 100) and the FBA-AMNet (e.g., 600) model for two challenging foreground objects from KITTI test images. The visualization in fig. 8 supports the observation that FBA-AMNet (e.g., 600) can generate finer details for foreground objects.
Fig. 9 shows one image from the KITTI stereo 2015 test set and the coarse-to-fine foreground-background segmentation results generated by the FBA-AMNet-32 (e.g., FBA-AMNet 600) model at training epochs 10, 300, 600, and 1000. The visualization shows that, during the training process, the multitask network gradually learns a better perception of foreground objects. Since the optimization process of the multitask network is biased towards the disparity estimation task, a segmentation task that generates a suitable segmentation map is desirable.
Example embodiments of the present disclosure provide an end-to-end deep learning architecture with a dedicated design for each major component used for disparity estimation. The model (e.g., 600) can extract deep and discriminative features, compute rich matching costs using three different similarity metrics, and aggregate multi-scale context information for dense disparity estimation. How each component contributes to the final result is analyzed and visualized in detail. The example FBA-AMNet (e.g., 600) outperforms all other published methods on the three most popular disparity estimation benchmarks.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section without departing from the spirit and scope of the inventive concept.
For ease of description, spatially relative terms such as "beneath," "below," "above," and the like may be used herein to describe one element or feature's relationship to another element or feature as illustrated in the figures. It will be understood that such spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the example terms "below" and "beneath" can encompass both an orientation of above and below. The device may be otherwise oriented (e.g., rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly. In addition, it will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms "substantially," "about," and the like are used as terms of approximation and not as terms of degree, and are intended to account for inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. When a statement such as "at least one of … …" follows a list of elements, that statement modifies the entire list of elements rather than a single element in the list. Furthermore, the use of "may" in describing embodiments of the inventive concept refers to "one or more embodiments of the present disclosure." Moreover, the term "exemplary" is intended to refer to an example or illustration. As used herein, the terms "use" and "used" may be considered synonymous with the terms "utilize" and "utilized," respectively.
It will be understood that when an element or layer is referred to as being "on," "connected to," "coupled to" or "adjacent to" another element or layer, it can be directly on, connected to, coupled to or directly adjacent to the other element or layer or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on," "directly connected to," "directly coupled to" or "directly adjacent to" another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges subsumed within that range with the same numerical precision. For example, a range of "1.0 to 10.0" is intended to include all sub-ranges between the recited minimum value of 1.0 and the recited maximum value of 10.0 (and including the minimum value of 1.0 and the maximum value of 10.0), i.e., all sub-ranges having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0 (such as, for example, 2.4 to 7.6). Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.
In some embodiments, one or more outputs of different embodiments of the methods and systems of the present disclosure may be transmitted to an electronic device coupled to or having a display device for displaying one or more outputs or information about one or more outputs of different embodiments of the methods and systems of the present disclosure.
Electronic or electrical devices and/or any other related devices or components according to embodiments of the disclosure described herein may be implemented using any suitable hardware, firmware (e.g., application specific integrated circuits), software, or combination of software, firmware, and hardware. For example, various components of these devices may be formed on one Integrated Circuit (IC) chip or on separate IC chips. In addition, various components of these devices may be implemented on a flexible printed circuit film, a Tape Carrier Package (TCP), a Printed Circuit Board (PCB), or formed on one substrate. Further, various components of these devices may be processes or threads running on one or more processors in one or more computing devices executing computer program instructions and interacting with other system components for performing the various functions described herein. The computer program instructions are stored in a memory, which may be implemented in a computing device using standard memory devices, such as, for example, Random Access Memory (RAM). The computer program instructions may also be stored in other non-transitory computer readable media, such as, for example, CD-ROM, flash drives, etc. Furthermore, one skilled in the art will recognize that: the functionality of the various computing devices may be combined or integrated into a single computing device, or the functionality of a particular computing device may be distributed across one or more other computing devices, without departing from the spirit and scope of the exemplary embodiments of the present disclosure.
Although exemplary embodiments of a foreground-background aware hole multi-scale network for disparity estimation have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Thus, it will be understood that a foreground-background aware hole multi-scale network for disparity estimation constructed in accordance with the principles of the present disclosure may be embodied otherwise than as specifically described herein. The inventive concept is also defined in the claims and equivalents thereof.

Claims (20)

1. A system for disparity estimation, the system comprising:
one or more feature extractor modules configured to extract one or more feature maps from one or more input images; and
one or more semantic information modules connected at one or more outputs of the one or more feature extractor modules,
wherein the one or more semantic information modules are configured to: generating one or more foreground semantic information to be provided to the one or more feature extractor modules for disparity estimation at a next training epoch.
2. The system of claim 1, wherein the one or more input images comprise a first input image and a second input image, and wherein the one or more feature maps extracted from the one or more input images comprise a first feature map extracted from the first input image and a second feature map extracted from the second input image.
3. The system of claim 2, further comprising:
an extended cost body module connected at the one or more outputs of the one or more feature extractor modules, the extended cost body module configured to compute matching cost information between a first feature map and a second feature map;
a stacked hole multi-scale module connected at an output of the extended cost body module and configured to process matching cost information between the first feature map and the second feature map from the extended cost body module to aggregate multi-scale contextual information, the stacked hole multi-scale module comprising a plurality of hole multi-scale modules stacked together; and
a regression module connected at an output of the stacked-hole multi-scale module and configured to estimate disparity based on the aggregated multi-scale context information and the one or more foreground semantic information from the stacked-hole multi-scale module.
4. The system of claim 3, wherein the extended cost module comprises:
a disparity level feature distance subvolume module configured to determine a pixel-by-pixel absolute difference between the first feature map and the second feature map;
a disparity level depth correlation sub-body module configured to determine a correlation between the first feature map and the second feature map; and
a disparity level feature cascade subvolume module configured to cascade the first feature map and the second feature map shifted by d at each disparity level d.
5. The system of claim 4, wherein:
the size of the disparity level feature distance subvolume module is H x W x (D +1) x C, wherein H, W and C represent height, width and feature size, and D represents the maximum disparity the system can predict;
the size of the disparity level depth-related subvolume module is H x W x (D +1) x C; and is
The size of the disparity level feature cascade subvolume module is H x W x (D +1) x 2C.
6. The system of claim 5, wherein the size of the extended cost volume module is determined by cascading a disparity level feature distance sub-volume module, a disparity level depth correlation sub-volume module, and a disparity level feature cascade sub-volume module along a depth dimension, wherein the size of the extended cost volume module is H x W x (D +1) x 4C.
7. The system of claim 3, wherein the plurality of hole multi-scale modules are stacked together with a shortcut connection within a stacked hole multi-scale module, wherein a hole multi-scale module of the plurality of hole multi-scale modules of the stacked hole multi-scale module is configured to: matching cost information between a first feature map and a second feature map from an extended cost volume module is processed using k pairs of a 3 x 3 hole convolution layer and two 1 x 1 convolution layers, where k is an integer power of 2 and is greater than 0.
8. The system of claim 7, wherein k pairs of 3 x 3 hole convolution layers have a dilation factor [1,2,2,4,4, …, k/2, k/2, k ], wherein two 1 x 1 convolution layers with a dilation factor of one are added at the ends of the hole multi-scale modules of the plurality of hole multi-scale modules for feature refinement and feature resizing.
9. The system of claim 3, wherein the one or more feature extractor modules comprise:
a first depth separable residual network module configured to receive a first input image and first foreground semantic information;
a second depth separable residual network module configured to receive a second input image and second foreground semantic information;
a first hole multi-scale module connected at the output of the first depth separable residual network module; and
a second hole multi-scale module connected at the output of the second depth separable residual network module.
10. The system of claim 9, wherein the first depth separable residual network module and the second depth separable residual network module have shared weights and the first hole multi-scale module and the second hole multi-scale module have shared weights, wherein each of the first hole multi-scale module and the second hole multi-scale module is configured as a scene understanding module for capturing depth global context information and local details, wherein the extended cost body module is connected at an output of the first hole multi-scale module and at an output of the second hole multi-scale module.
11. The system of claim 9, wherein the one or more semantic information modules comprise:
a first semantic information module connected at an output of the first hole multi-scale module, wherein the first semantic information module is configured to generate first foreground semantic information, wherein the first foreground semantic information is provided to the first depth separable residual network module via a first feedback loop as an additional input to the system for a next training epoch of the system; and
a second semantic information module connected at an output of the second hole multi-scale module, wherein the second semantic information module is configured to generate second foreground semantic information, wherein the second foreground semantic information is provided to the second depth separable residual network module via a second feedback loop as an additional input to the system for a next training epoch of the system.
12. The system of claim 11, wherein the first semantic information module comprises:
a first convolutional neural network module connected at an output of the first hole multi-scale module;
a first up-sampling module connected at the output of the first convolutional neural network module; and
a first prediction module connected at an output of the first upsampling module and configured to generate first foreground semantic information.
13. The system of claim 11, wherein the second semantic information module comprises:
a second convolutional neural network module connected at an output of the second hole multi-scale module;
a second up-sampling module connected at an output of the second convolutional neural network module; and
a second prediction module connected at an output of the second upsampling module and configured to generate second foreground semantic information.
14. The system of claim 1, wherein the system is a multitasking module configured to perform two tasks, wherein the two tasks are disparity estimation and foreground semantic information generation, wherein the loss of the system is a weighted sum of the two losses from the two tasks.
15. A method for disparity estimation for a system comprising one or more feature extractor modules, one or more semantic information modules, an extended cost body module, a stacked hole multiscale module, and a regression module, the method comprising:
extracting, by the one or more feature extractor modules, one or more feature maps from one or more input images;
generating one or more foreground semantic information by the one or more semantic information modules connected at one or more outputs of the one or more feature extractor modules, wherein the one or more foreground semantic information is provided to the one or more feature extractor modules;
calculating matching cost information between the one or more feature maps by an extended cost body module connected at the one or more outputs of the one or more feature extractor modules;
processing, by a stacked hole multi-scale module connected at an output of an extended cost body module, matching cost information between the one or more feature maps from the extended cost body module to aggregate multi-scale context information for disparity regression;
estimating, by a regression module connected at an output of the stacked-hole multi-scale module, a disparity based on the aggregated multi-scale context information and the one or more foreground semantic information; and
recursively training the system with the one or more feature maps and the one or more foreground semantic information until convergence.
16. The method of claim 15, wherein the one or more foreground semantic information for a current time period is computed by the one or more semantic information modules for a previous time period, wherein the one or more input images comprise a first input image and a second input image, wherein the one or more feature maps extracted from the one or more input images comprise a first feature map extracted from the first input image and a second feature map extracted from the second input image, and wherein the method further comprises:
determining a pixel-by-pixel absolute difference between the first feature map and the second feature map by a disparity level feature distance subvolume module of the extended cost volume module;
determining a correlation between the first feature map and the second feature map by a disparity level depth correlation sub-volume module of the extended cost volume module; and
cascading, by a disparity level feature cascade sub-volume module of the extended cost volume module, the first feature map and the second feature map shifted by d at each disparity level d.
17. The method of claim 16, wherein:
the size of the disparity level feature distance subvolume module is H x W x (D +1) x C, wherein H, W and C represent height, width and feature size, and D represents the maximum disparity the system can predict;
the size of the disparity level depth-related subvolume module is H x W x (D +1) x C; and is
The size of the disparity level feature cascade subvolume module is H x W x (D +1) x 2C.
18. The method of claim 17, further comprising:
determining a size of an extended cost volume module by cascading a disparity level feature distance sub-volume module, a disparity level depth correlation sub-volume module, and a disparity level feature cascade sub-volume module along a depth dimension, wherein the size of the extended cost volume module is H x W x (D +1) x 4C.
19. The method of claim 17, further comprising:
generating first foreground semantic information by a first semantic information module of the one or more semantic information modules;
receiving, by a first depth separable residual network module of the one or more feature extractor modules, a first input image and first foreground semantic information, wherein the first foreground semantic information is provided to the first depth separable residual network module via a first feedback loop as additional input for a next training session of the system;
generating second foreground semantic information by a second semantic information module of the one or more semantic information modules;
receiving, by a second depth separable residual network module of the one or more feature extractor modules, a second input image and second foreground semantic information, wherein the second foreground semantic information is provided to the second depth separable residual network module via a second feedback loop as additional input for a next training session of the system; and
capturing, by a first hole multi-scale module and a second hole multi-scale module of the one or more feature extractor modules, deep global context information and local details for scene understanding.
20. The method of claim 16, wherein the stacked hole multi-scale module comprises a plurality of hole multi-scale modules stacked together with shortcut connections within the stacked hole multi-scale module, wherein the method further comprises: processing, by a hole multi-scale module of the plurality of hole multi-scale modules of the stacked hole multi-scale module, matching cost information between the one or more feature maps from an extended cost volume module with k pairs of 3 x 3 hole convolution layers and two 1 x 1 convolution layers, wherein k pairs of 3 x 3 hole convolution layers have an expansion factor [1,2,2,4,4, …, k/2, k/2, k ], wherein two 1 x 1 convolution layers with an expansion factor of one are added at the end of the hole multi-scale module for feature refinement and feature resizing, wherein k is an integer power of 2 and greater than 0.
CN201911120591.5A 2018-11-15 2019-11-15 System for disparity estimation and method for disparity estimation of system Pending CN111192306A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201862768055P 2018-11-15 2018-11-15
US62/768,055 2018-11-15
US16/451,524 US11024037B2 (en) 2018-11-15 2019-06-25 Foreground-background-aware atrous multiscale network for disparity estimation
US16/451,524 2019-06-25

Publications (1)

Publication Number Publication Date
CN111192306A true CN111192306A (en) 2020-05-22

Family

ID=70709131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911120591.5A Pending CN111192306A (en) 2018-11-15 2019-11-15 System for disparity estimation and method for disparity estimation of system

Country Status (1)

Country Link
CN (1) CN111192306A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11893668B2 (en) 2021-03-31 2024-02-06 Leica Camera Ag Imaging system and method for generating a final digital image via applying a profile to image information
WO2024114175A1 (en) * 2022-11-30 2024-06-06 微智医疗器械有限公司 Binocular disparity estimation method, and visual prosthesis and computer-readable storage medium


Similar Documents

Publication Publication Date Title
US11720798B2 (en) Foreground-background-aware atrous multiscale network for disparity estimation
CN108961327B (en) Monocular depth estimation method and device, equipment and storage medium thereof
AU2017324923B2 (en) Predicting depth from image data using a statistical model
US20220178688A1 (en) Method and apparatus for binocular ranging
Fischer et al. Flownet: Learning optical flow with convolutional networks
Dosovitskiy et al. Flownet: Learning optical flow with convolutional networks
US20210150747A1 (en) Depth image generation method and device
US8385630B2 (en) System and method of processing stereo images
US20210241495A1 (en) Method and system for reconstructing colour and depth information of a scene
EP3905194A1 (en) Pose estimation method and apparatus
CN109300151B (en) Image processing method and device and electronic equipment
US10706326B2 (en) Learning apparatus, image identification apparatus, learning method, image identification method, and storage medium
CN112288790A (en) Video depth estimation based on temporal attention
CN109661815B (en) Robust disparity estimation in the presence of significant intensity variations of the camera array
CN112991254A (en) Disparity estimation system, method, electronic device, and computer-readable storage medium
CN111192306A (en) System for disparity estimation and method for disparity estimation of system
CN110443228B (en) Pedestrian matching method and device, electronic equipment and storage medium
KR20220014678A (en) Method and apparatus for estimating depth of images
Huang et al. ES-Net: An efficient stereo matching network
Ceruso et al. Relative multiscale deep depth from focus
Srikakulapu et al. Depth estimation from single image using defocus and texture cues
KR101592087B1 (en) Method for generating saliency map based background location and medium for recording the same
US11961249B2 (en) Generating stereo-based dense depth images
Anantrasirichai et al. Fast depth estimation for view synthesis
CN109961083A (en) For convolutional neural networks to be applied to the method and image procossing entity of image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination