CN114170311A - Binocular stereo matching method - Google Patents

Binocular stereo matching method

Info

Publication number
CN114170311A
Authority
CN
China
Prior art keywords
matching cost
cost volume
scale
network
resolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111479987.6A
Other languages
Chinese (zh)
Inventor
杨戈
廖雨婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Campus Of Beijing Normal University
Original Assignee
Zhuhai Campus Of Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Campus Of Beijing Normal University filed Critical Zhuhai Campus Of Beijing Normal University
Priority to CN202111479987.6A priority Critical patent/CN114170311A/en
Publication of CN114170311A publication Critical patent/CN114170311A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85 Stereo camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10012 Stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a binocular stereo matching method, which comprises the following steps: 1) a feature extraction network extracts features from the left and right images to be matched, obtaining feature maps of the left image at N resolutions and corresponding feature maps of the right image at N resolutions; 2) a correlation operation is applied to the left- and right-image feature maps of the same resolution to form a 4D matching cost volume; a local cost aggregation operation is then performed on each 4D matching cost volume by an intra-scale aggregation module to obtain a new matching cost volume with the same resolution as the original one; 3) the N new 4D matching cost volumes obtained in step 2) are fused by an inter-scale aggregation module to obtain the final matching cost volumes; 4) disparity maps corresponding to the N different resolutions are obtained from the final matching cost volumes; each obtained disparity map is then up-sampled and fed into StereoDRNet to obtain the final predicted disparity map.

Description

Binocular stereo matching method
Technical Field
The invention relates to the field of computer vision, in particular to a binocular stereo matching method.
Background
Binocular stereoscopic vision is inspired by human vision: without contacting the target object, two cameras imitate the human eyes and capture images from different angles; three-dimensional information about the object is obtained according to the parallax principle, and the three-dimensional contour and position information of the object are reconstructed.
The key technologies for realizing binocular vision are camera calibration, stereo rectification, stereo matching and three-dimensional reconstruction, among which stereo matching is one of the core steps of binocular vision technology and a research focus in the field of computer vision. Stereo matching is the process of finding the correspondence of a spatial object in the acquired left and right images, i.e. finding the corresponding points of the binocular image pair and computing the disparity; the depth is then estimated from the computed disparity using the similar-triangle principle. Stereo matching is the most important link in stereo vision, and the disparity information it produces directly affects the obtained three-dimensional information.
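As a minimal illustrative sketch of the similar-triangle relation just mentioned (the focal length and baseline values in the comment are assumptions used only for the example), the conversion from a disparity map to a depth map for a rectified stereo pair can be written as:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map to a depth map via the similar-triangle relation
    depth = focal_length * baseline / disparity (rectified stereo assumed)."""
    disparity = np.asarray(disparity, dtype=np.float64)
    return focal_length_px * baseline_m / np.maximum(disparity, eps)

# Example: a pixel with 25 px disparity, a 700 px focal length and a 0.54 m
# baseline lies at roughly 700 * 0.54 / 25 = 15.12 m.
```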
As a key part of binocular vision technology, stereo matching is both a research focus and a difficulty. The subject has been studied for decades; traditionally, the stereo matching process is treated as a multi-stage optimization problem, generally divided into four steps: matching cost calculation, matching cost aggregation, disparity calculation and disparity optimization (refer to Scharstein D, Szeliski R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms [J]. International Journal of Computer Vision, 2002, 47(1-3): 7-42).
In recent years, deep learning has shown excellent performance in fields such as image processing, speech and natural language processing. A Convolutional Neural Network (CNN) can effectively reduce a large data volume to a small one while retaining the image features needed for feature extraction, and a stereo matching algorithm with high precision and high operating efficiency can be obtained by making reasonable use of convolutional neural networks. Large-scale synthetic binocular datasets also make stereo matching algorithms based on deep learning possible.
Document "n.mayer, e.ilg, P.
Figure BDA0003394919560000011
P.Fischer,D.Cremers,A.Dosovitskiy and T.Brox.A Large Dataset to Train Convolutional Networks for Disparity,Optical Flow,and Scene Flow Estimation[J]2016IEEE Conference on Computer Vision and Pattern Recognition (CVPR),2016: 4040-. The document "Chang Jianren, Chen Yongsheng]Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),2018: 5410-. Document "F.Zhang, V.Prisacariu, R.Yang and P.Torr.GA-Net: Guided Aggregation Net for End-to-End Stereo Matching [ J.]The Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),2019: 185-. The method comprises a semi-global aggregation layer (SGA) and a local guided aggregation Layer (LGA), wherein the SGA is approximate micro-representable SGM, the LGA compensates detail loss of down sampling and up sampling by taking advantage of a cost filtering strategy in a traditional method, and the two layers of networks achieve high precision while reducing parameters.
In addition, the network structure of end-to-end stereo matching algorithms is generally complex. To extract features well in ill-posed regions such as occlusions, reflections, weak-texture and texture-less areas, a multi-scale, multi-layer feature extraction network is often chosen, but it carries a large number of redundant parameters, which in turn causes a huge run-time cache and high equipment requirements, so the cost of running such algorithms on small devices is large. The literature "Wang Yufeng, Wang Hongwei, et al. Stereo matching algorithm based on three-dimensional convolutional neural networks [J]. 2019, 39(11): 227-234" greatly shortens the running time while improving accuracy by constructing sparse losses in the disparity dimension; however, large errors remain for irregular surfaces, shadowed areas, or areas that are particularly dark or occluded. The literature "Liu Jian, Feng Yunjian, et al. An improved stereo matching algorithm based on PSMNet [J]. Journal of South China University of Technology (Natural Science Edition), 2020, 48(01): 60-69+83" improves on the Pyramid Stereo Matching Network (PSMNet) and proposes the SWNet stereo matching algorithm, in which an atrous spatial pyramid pooling (ASPP) structure is adopted in the feature extraction module; the resulting network model has few layers, few parameters and a small memory footprint, can process the matching task quickly, and reaches high accuracy. Related work also includes the literature "Li Tong, Ma Wei, Xu Shibiao, Zhang Xiaopeng. An end-to-end deep network adapted to the stereo matching task [J]. Journal of Computer Research and Development, 2020, 57(07): 1531-" and the literature "Cheng et al. Research on a stereo matching network based on an attention mechanism [J]. 2020, 40(14): 144-". The literature "Wang Yufeng, Wang Hongwei, Liu Yu, Yang Ming, et al. Progressively refined real-time stereo matching algorithm [J]. 2020, 40(09): 99-109" proposes a lightweight model for the stereo matching task; its main idea is to predict a dense disparity map in a coarse-to-fine progressive manner, and in the disparity refinement module a Markov Random Field (MRF) is used to first cluster multiple modes, process them separately and then fuse them, outputting a predicted disparity residual so that different regions are handled case by case, which improves both the accuracy and the real-time performance of the algorithm. However, existing feature extraction networks are too complex, have too many parameters, and extract features inefficiently.
Existing end-to-end stereo matching algorithms mainly try to improve running time in the matching cost calculation stage and the cost aggregation stage. In the matching cost calculation stage, an end-to-end stereo matching algorithm usually relies on a good feature extraction network; but in order to extract features better, existing feature extraction networks are complex and have a large number of parameters. Although good results can be obtained, the whole end-to-end network becomes more complex and occupies a huge amount of memory, which increases the burden on the equipment and prolongs the time needed to obtain the initial disparity map.
Disclosure of Invention
In order to overcome the defects that existing feature extraction networks are too complex and have too many parameters, the invention aims to provide a binocular stereo matching method which reduces the computational pressure on the equipment through improved binocular stereo matching and obtains an initial disparity map of excellent quality more quickly. The innovation of the method is to provide an end-to-end deep network suitable for stereo matching: the network structure is built with an end-to-end framework design, so that after the processed left and right images are input (the left and right images mainly refer to the left and right images obtained with a parallel binocular vision system, such as a binocular camera or two parallel cameras), the disparity map can be obtained directly from the network, which reduces manual pre-processing and post-processing, reduces running time, and reduces errors in the stereo matching process. At the same time, on the basis of AANet, a simple feature extraction network with fewer parameters is constructed to obtain better local features, and the network then obtains a high-quality disparity map efficiently through an intra-scale aggregation module and an inter-scale aggregation module, which use deformable convolution and the traditional cross-scale aggregation approach respectively.
The technical scheme of the invention is as follows:
a binocular stereo matching method comprises the following steps:
1) a feature extraction network extracts features from the left and right images to be matched, obtaining feature maps of the left image at N resolutions and corresponding feature maps of the right image at N resolutions, wherein N is an integer greater than 1;
2) a correlation operation is applied to the left- and right-image feature maps of the same resolution to form a 4D matching cost volume, so that N matching cost volumes are obtained; a local cost aggregation operation is then performed on the N matching cost volumes by an intra-scale aggregation module to obtain N new matching cost volumes with the same resolutions as the original ones;
3) the N new matching cost volumes obtained in step 2) are fused by an inter-scale aggregation module to obtain the N final matching cost volumes required for disparity calculation;
4) disparity maps corresponding to the N different resolutions are obtained from the final matching cost volumes using a soft argmin function (illustrated in the sketch below); each obtained disparity map is then up-sampled and fed into StereoDRNet to obtain the final predicted disparity map.
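The following is a minimal PyTorch-style sketch of the soft argmin disparity regression used in step 4); the tensor layout [batch, disparity, height, width] and the sign convention (lower cost means better match) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume):
    """Fully differentiable disparity regression (soft argmin, as in GC-Net).

    cost_volume: tensor of shape [B, D, H, W] holding matching costs for each
    candidate disparity d = 0 .. D-1.  Returns a sub-pixel disparity map [B, H, W].
    """
    prob = F.softmax(-cost_volume, dim=1)                       # cost -> probability
    disp_values = torch.arange(cost_volume.size(1),
                               device=cost_volume.device,
                               dtype=cost_volume.dtype).view(1, -1, 1, 1)
    return torch.sum(prob * disp_values, dim=1)                 # expectation over d
```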
Further, the feature extraction network first down-samples the input image and then processes it with a first residual network to obtain a feature map M1 whose resolution is one third of the input image; the feature map M1 is then fed into a second residual network to obtain a feature map M2 whose resolution is one sixth of the input image; finally, the feature map M2 is fed into a third residual network to obtain a feature map M3 whose resolution is one twelfth of the input image.
Further, the intra-scale aggregation module obtains the new matching cost volume through

$$\tilde{C}(d, p) = \sum_{k=1}^{K^2} w_k \cdot C(d, p + p_k + \Delta p_k) \cdot m_k$$

where $\tilde{C}(d, p)$ denotes the matching cost of disparity $d$ at pixel point $p$ after intra-scale aggregation, $K^2$ denotes the total number of sampling points, $w_k$ denotes the aggregation weight of the $k$-th sampling point, $p_k$ is the fixed offset of the $k$-th sampling point with respect to pixel point $p$, $\Delta p_k$ is a learnable additional offset, $m_k$ is the position weight of the $k$-th point used to adjust the aggregation weight $w_k$, and the function $C$ is the matching cost computation function.
Further, the inter-scale aggregation module aggregates each matching cost volume with every other matching cost volume: when the resolution of a matching cost volume being aggregated is higher than that of the current matching cost volume, a down-sampling operation is applied to it; when its resolution is lower than that of the current matching cost volume, an up-sampling operation is applied; the final matching cost volumes are obtained through the inter-scale aggregation module.
Further, the sampled results are summed through

$$\hat{C}^s = \sum_{r=1}^{S} f_r\big(\tilde{C}^r\big), \quad s = 1, \dots, S$$

to obtain the matching cost volume $\hat{C}^s$, where $\tilde{C}^r$ denotes the matching cost volume of the $r$-th scale after intra-scale aggregation. The function $f_r$ has three cases: when $r = s$ it is an identity mapping, i.e. at the same resolution the matching cost volume of the $r$-th scale is not transformed; when $r < s$, i.e. when the resolution of the intra-scale-aggregated matching cost volume of the $r$-th scale is larger than that of the $s$-th-scale matching cost volume currently being aggregated, the matching cost volume of the $r$-th scale is down-sampled so that its resolution matches that of the $s$-th scale; when $r > s$, i.e. when the resolution of the intra-scale-aggregated matching cost volume of the $r$-th scale is smaller than that of the $s$-th-scale matching cost volume currently being aggregated, the matching cost volume of the $r$-th scale is up-sampled and then passed through a 1x1 convolution; here $s = 1, \dots, S$ and $S$ equals $N$.
Further, each of the first residual network, the second residual network and the third residual network is formed by sequentially connecting one residual block whose two convolution layers both use 1x1 kernels, one residual block whose first layer uses a 3x3 kernel and whose second layer uses a 1x1 kernel, and finally one residual block whose two convolution layers both use 1x1 kernels.
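The following sketch illustrates one way such a residual stage could be written; the channel count, the placement of batch normalization, and the handling of down-sampling between stages are assumptions of the sketch rather than details given above.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-layer residual block; kernel sizes follow the structure described above."""
    def __init__(self, channels, k1, k2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, k1, padding=k1 // 2, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, k2, padding=k2 // 2, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

def make_residual_stage(channels=32):
    """One stage: a 1x1/1x1 block, then a 3x3/1x1 block, then a 1x1/1x1 block."""
    return nn.Sequential(
        ResidualBlock(channels, 1, 1),
        ResidualBlock(channels, 3, 1),
        ResidualBlock(channels, 1, 1),
    )
```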
The technical solution adopted by the invention to solve the above technical problem is a deep network, AEDNet.
1 Network architecture design
The network structure of the invention follows the four steps of a stereo matching algorithm: matching cost calculation, matching cost aggregation, disparity calculation and disparity optimization. In the matching cost calculation stage, based on the difference between the stereo matching task and other computer vision tasks, an improved feature extraction network designed specifically for stereo matching is used to extract features from the input left and right images, yielding feature maps at 1/3, 1/6 and 1/12 of the original image size; the left and right features at these three resolutions are then built into multi-scale matching cost volumes (cost volumes) through correlation operations. In the cost aggregation stage, in order to replace cumbersome 3D convolution operations, intra-scale and inter-scale aggregation are adopted, comprising 3 Intra-Scale Aggregation modules (ISA) and 1 Cross-Scale Aggregation module (CSA). In the disparity calculation stage, a soft argmin function, i.e. a fully differentiable flexible argmin function, is used; it was first proposed in the end-to-end stereo matching network GC-Net (A. Kendall, H. Martirosyan, S. Dasgupta, et al. End-to-End Learning of Geometry and Context for Deep Stereo Regression [C]. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 66-75) and regresses the initial disparity maps from the final matching cost volumes. In the disparity optimization stage, the refinement module of StereoDRNet (Rohan Chabra, Julian Straub, Christopher Sweeney, Richard Newcombe and Henry Fuchs. StereoDRNet: Dilated Residual StereoNet [J]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 11786-11795) is used on the up-sampled initial disparity maps to obtain the final predicted disparity map. The designed network structure is shown in Fig. 1, and the network data processing flow is shown in Fig. 2.
2 Feature extraction module
A good matching cost calculation method is the basis of an accurate stereo matching algorithm; its aim is to measure the correlation between left and right pixels, mainly with a similarity function evaluated under different disparities. In common end-to-end stereo matching algorithms, a ResNet-like network structure is generally used in the matching cost stage: GC-Net uses 2D convolutions to extract deep features and, after dimensionality reduction, follows them with an 8-layer residual network to extract unary features; PSMNet first uses basic residual blocks to learn unary features and then uses dilated convolutions to enlarge the receptive field for feature extraction; GA-Net uses a stacked hourglass network (Stacked Hourglass Networks) to extract features. The ResNet network, apart from solving the degradation problem of deep networks, borrows from the VGG (Visual Geometry Group) network and is modified on its basis. ResNet and VGG were champions of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), whose task is mainly image classification. Image classification and stereo matching are two different image tasks; the main difference is that image classification requires translation invariance, i.e. the output class does not change no matter how much the input image is translated. In addition, the stacked hourglass network is mainly used for human pose estimation tasks, which require insensitivity to viewpoint and position, i.e. to translation and rotation. The stereo matching task, by contrast, is sensitive to image translation: once the input image is translated, the resulting output should change correspondingly.
Furthermore, the stereo matching task does not require semantic-level features at all. The image classification task needs highly abstract semantic-level features in order to judge the image content category more accurately. The stacked hourglass network is characterized by multi-scale features for pose recognition: if recognition relied only on the features extracted by the last convolution layer, information would likely be lost. In fact, human pose estimation is a correlation-type task: adjacent body joints are correlated to a certain extent, but different joints have different characteristic points, which means that not every body joint is best recognized from the output of the same convolutional layer; this requires the network to use the outputs of every layer simultaneously. The stereo matching task, however, mainly needs to find the correspondence between pixels in the left and right images so as to find the corresponding points and then calculate the disparity. Therefore, the stereo matching task should pay more attention to the local detail description capability, at different spatial locations, of the features extracted by the feature extraction network.
In summary, the invention uses an improved SimpleResNet, namely SesNet, designed specifically for the stereo matching task, to perform feature extraction. Since the purpose of the invention is to reduce the number of parameters in the feature extraction stage and thereby reduce running time, three feature maps of different resolutions are obtained by passing through three successive SesNet stages; after the three feature maps of different resolutions are obtained, a Feature Pyramid Network (FPN) is used. The algorithm flow of the feature extraction module is shown in Fig. 3, where M2 and M3 respectively denote the output layers after SesNet and dimensionality reduction, and P1, P2 and P3 denote the output layers after SesNet and the FPN.
The main idea of the SimpleResNet design is to limit the convolution kernel size in order to obtain features with a lower degree of abstraction, considering that the larger the receptive field, the more the overlap of regions reduces the differences between the obtained features. At the same time, the invention reduces the number of corresponding convolution layers, so that the network can extract more positional (i.e. detail) information while the parameter count is reduced, and the extracted features have stronger local detail expressiveness and discriminability. The feature network improved from SimpleResNet is called SesNet, and its specific structure is shown in Table 1, where the structure of each residual block is indicated in square brackets; for example, the conv1_x layer consists of one residual block with two 1x1 convolution layers and outputs a feature map with 32 channels. The usage of Batch Normalization and ReLU (Rectified Linear Units) is consistent with ResNet, and H and W denote the height and width of the input image respectively. The invention feeds the processed left and right images into the network. First, following ResNet, a 7x7 convolution kernel is used to down-sample the input image in the first layer of the whole feature extraction network; in order to retain as much information of the original image as possible, this kernel is used only in the layer closest to the input, i.e. the first layer. The result then enters the first SesNet stage (the conv1_x layer of Table 1) to obtain a feature map M1 whose resolution is one third of the input image; M1 is passed through the second SesNet stage (the conv2_x layer of Table 1) to obtain a feature map M2 whose resolution is one sixth of the input image; finally, M2 is passed through the last SesNet stage to obtain a feature map M3 whose resolution is one twelfth of the input image. The three obtained feature maps are then passed through a feature pyramid network, as shown in Fig. 3. In order to reduce memory usage, M1 is not included in the pyramid and only a 1x1 convolution layer is used to generate the high-resolution feature map. The feature pyramid up-samples M3 by a factor of 2 to obtain image features with the same spatial size as M2 and adds M2 to them element-wise; the 1x1 convolutions reduce the number of convolution kernels so that feature maps of the same size can be fused better in subsequent iterations, and M3 likewise goes through a 1x1 convolution layer to generate the low-resolution feature map. Finally, in order to reduce the aliasing effect of up-sampling, a 3x3 convolution is applied to each feature map to form the final results of the feature extraction stage: P1, P2 and P3, i.e. the left- and right-image features at three resolutions, six feature maps in total.
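The top-down fusion just described can be sketched as follows; the channel numbers, interpolation mode and module names are assumptions of the sketch.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Sketch of the pyramid fusion of M1, M2, M3 described above."""
    def __init__(self, c1, c2, c3, out_ch=32):
        super().__init__()
        self.lat1 = nn.Conv2d(c1, out_ch, 1)   # 1x1 convs align channel counts
        self.lat2 = nn.Conv2d(c2, out_ch, 1)
        self.lat3 = nn.Conv2d(c3, out_ch, 1)
        # 3x3 convs applied to each map to reduce up-sampling aliasing
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(3)])

    def forward(self, m1, m2, m3):
        p3 = self.lat3(m3)                                   # low-resolution branch
        up = F.interpolate(p3, size=m2.shape[-2:],           # 2x up-sampling of the M3 branch
                           mode='bilinear', align_corners=False)
        p2 = self.lat2(m2) + up                              # element-wise addition with M2
        p1 = self.lat1(m1)                                   # M1 stays out of the pyramid: 1x1 conv only
        p1, p2, p3 = (conv(p) for conv, p in zip(self.smooth, (p1, p2, p3)))
        return p1, p2, p3
```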
Table 1 SesNet network architecture parameters
3 Cost aggregation module
The main purpose of cost aggregation is to make the computed matching cost values reflect the correlation between adjacent pixels more accurately. Since pixel-wise matching cost values are susceptible to noise, a new disparity space image can be obtained by establishing relationships between adjacent pixels through a cost aggregation operation. End-to-end stereo matching algorithms usually combine the left and right feature maps extracted in the previous step into a 4D matching cost volume (cost volume) whose dimensions are height, width, disparity and feature size, and then perform cost aggregation with 3D convolution operations to obtain the final disparity map. However, the matching cost volume is a 4D tensor with a high parameter count; to reduce erroneous outlier costs, make the computed disparity smoother and keep high accuracy, a larger neighbourhood of pixels is usually used as a constraint, i.e. a larger convolution kernel, but 3D convolution is itself computationally complex and enlarging the 3D kernel only increases the parameter count. Therefore, although cost aggregation with 3D convolution achieves high accuracy, it has the drawbacks of a large parameter count and large memory usage. Based on the idea of reducing the number of network parameters and making the network small and light, the invention uses two modules in the cost aggregation stage, an intra-scale aggregation module (Adaptive Intra-Scale Aggregation) and an inter-scale aggregation module (Adaptive Cross-Scale Aggregation), which replace the traditional 3D convolutions applied to the matching cost volume; this increases running speed and reduces running cost while keeping a high accuracy. The matching cost volume used in the cost aggregation stage is obtained by applying a correlation operation to the left and right feature maps extracted in the feature extraction stage, yielding a 4D matching cost volume on which cost aggregation is then performed: specifically, the cost of the matching cost volume at each scale is made more effective by the intra-scale aggregation module, and the matching cost volume information of different scales is then fused by the inter-scale aggregation module.
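A minimal sketch of the correlation operation used to build one cost volume from a pair of same-resolution feature maps is given below; the normalization by the channel dimension and the handling of out-of-range disparities are assumptions of the sketch.

```python
import torch

def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a matching cost volume by correlating left/right feature maps.

    feat_left, feat_right: [B, C, H, W] features at one resolution.
    Returns a volume of shape [B, max_disp, H, W]; entry (d, y, x) is the
    channel-averaged inner product of the left feature at x with the right
    feature at x - d.
    """
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_left[:, :, :, d:] *
                                   feat_right[:, :, :, :-d]).mean(dim=1)
    return volume
```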
3.1 Intra-scale aggregation module
The intra-scale aggregation module performs the cost aggregation operation on the matching cost volume constructed from left- and right-image features at the same resolution, and mainly performs local cost aggregation. In the traditional stereo matching method, local cost aggregation applies a regular convolution with a fixed window to the matching cost volume, i.e. the sampling points are distributed on a grid within a window of a certain size; this traditional local cost aggregation is given by formula (1).
$$\tilde{C}(d, p) = \sum_{q \in N(p)} w(p, q)\, C(d, q) \qquad (1)$$

where $p$ and $q$ denote two pixel points, the pixel point $q$ belonging to the neighbourhood of $p$; $\tilde{C}(d, p)$ denotes the cost-aggregated matching cost of disparity $d$ at pixel point $p$; $C(d, q)$ denotes the original matching cost of disparity $d$ at pixel point $q$, i.e. the cost before aggregation; and $w(p, q)$ is the aggregation weight of pixel point $q$ with respect to pixel point $p$.
Disparity discontinuity generally refers to two situations: cost dispersion caused by the matching cost calculation, and disparity discontinuity caused by the difference in disparity values between different objects or between an object and the background. Local cost aggregation in the traditional stereo matching method treats the two situations indiscriminately; although a convolutional neural network can learn the weights w automatically, it cannot sample adaptively, and the common edge-fattening problem easily occurs at object edges and fine structures. Aiming at the disparity discontinuity phenomenon at object edges and fine structures, the invention adopts an adaptive intra-scale aggregation module (ISA), which is similar to local cost aggregation in the traditional stereo matching algorithm. In the cost aggregation stage, the three matching cost volumes of different scales are first fed separately into the ISA module. The module adopts a sparse, point-based feature representation and uses deformable convolution, trained continuously, for adaptive sampling; each original cost value is aggregated from 9 surrounding adaptively sampled points, so that three new matching cost volumes with the same scales as before are obtained, the new matching cost values being computed by the cost aggregation formula (2). In this way the problem of abnormal matching cost values caused by mismatching is reduced, the influence of foreground and background on cost aggregation is reduced, and cost aggregation is performed more efficiently.
$$\tilde{C}(d, p) = \sum_{k=1}^{K^2} w_k \cdot C(d, p + p_k + \Delta p_k) \cdot m_k \qquad (2)$$

where $\tilde{C}(d, p)$ denotes the matching cost of disparity $d$ at pixel point $p$ after intra-scale aggregation; $K^2$ denotes the total number of sampling points (the convolution kernel used in the invention is a 3x3 deformable convolution, so $K = 3$); $w_k$ denotes the aggregation weight of the $k$-th point; $p_k$ is the fixed offset of the $k$-th sampling point with respect to pixel point $p$; $\Delta p_k$ is a learnable additional offset; $m_k$ is the position weight of the $k$-th point, reflected in the distance relationship between the sampling point and the pixel point and mainly used to adjust the aggregation weight $w_k$; and the function $C$ is the matching cost computation function.
3.2 Inter-scale aggregation module
The inter-scale aggregation module aggregates features between matching cost volumes of different scales. For texture-less or weak-texture regions such as sky, glass and walls, discriminative features are better extracted at the coarse scales obtained by down-sampling, while some detailed features need feature representations at higher resolution; multi-scale aggregation is therefore a common aggregation approach, and in the invention this module compensates, to a certain extent, for the global information missing in the matching cost calculation. The invention draws on the traditional cross-scale aggregation method: the three matching cost volumes of different scales produced by the intra-scale aggregation module are fed into the inter-scale aggregation module, and each of the three matching cost volumes is aggregated with the other two. When the resolution of a matching cost volume being aggregated is higher than that of the current matching cost volume, a down-sampling operation is applied to it; when its resolution is lower than that of the current matching cost volume, an up-sampling operation is applied. Concretely, the processing is split into a left-image branch and a right-image branch, and the inter-scale aggregation module produces, for the left and right images respectively, final matching cost volumes at the three different resolutions, six final matching cost volumes in total. The aggregation formulas of the inter-scale aggregation module are formula (3) and formula (4).
$$\hat{C}^s = \sum_{r=1}^{S} f_r\big(\tilde{C}^r\big), \quad s = 1, \dots, S \qquad (3)$$

$$f_r\big(\tilde{C}^r\big) = \begin{cases} \tilde{C}^r, & r = s \\ (s-r) \text{ stride-2 } 3 \times 3 \text{ convolutions of } \tilde{C}^r, & r < s \\ 1 \times 1 \text{ convolution of the bilinearly up-sampled } \tilde{C}^r, & r > s \end{cases} \qquad (4)$$

where $\hat{C}^s$ denotes the matching cost volume of the $s$-th scale after inter-scale aggregation, with $S$ scales in total ($S = 3$ in the invention), and $\tilde{C}^r$ denotes the matching cost volume of the $r$-th scale after intra-scale aggregation; $f_r$ is a general functional representation of the multi-scale aggregation, i.e. the aggregation manner is determined by the function $f_r$. The function $f_r$ has three cases: when $r = s$ it is an identity mapping, i.e. at the same resolution the matching cost volume of the $r$-th scale is not transformed; when $r < s$, the matching cost volume of the $r$-th scale is down-sampled to match the resolution of the $s$-th scale, specifically by passing through $s - r$ 3x3 convolutions with stride 2; when $r > s$, the matching cost volume of the $r$-th scale is up-sampled by bilinear interpolation and then passed through a 1x1 convolution, which enhances the learning ability of the network and ensures cost smoothness.
The invention has the beneficial effects that:
the invention provides an end-to-end depth network suitable for stereo matching, namely AEDNet, which is improved based on AANet, and the parameters of the whole network are reduced by adopting SesNet in a characteristic extraction stage, wherein local information used for a stereo matching task is extracted by simple convolution, characteristic information of different scales is fused through FPN, a matching cost volume with three resolutions is constructed, and a scale aggregation module and an inter-scale aggregation module without 3D convolution are adopted.
The invention first evaluates the model under different settings, where FPN denotes the Feature Pyramid Network. ResNet + FPN denotes the feature extraction network in the baseline network AANet, in which ResNet has a structure similar to ResNet-50 and feature extraction uses a deformable convolution with kernel size 3. SesNet + FPN refers to the network AEDNet proposed by the invention; SimpleResNet denotes the feature extraction network designed specifically for the stereo matching task, and SesNet denotes the SimpleResNet improved by the invention in the feature extraction stage. The comparison of each setting on the SceneFlow test set is given in Table 2, where the running time refers to the time taken to process a binocular image pair with a resolution of 576x960, and the end-point error (EPE) refers to the average disparity error, in pixels, between the predicted disparity and the ground truth. FLOPs (floating point operations) are commonly used to measure the complexity of a network model; MACC (multiply-accumulate operations), also written MAdd, is commonly used to measure the computational load of the model.
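For reference, the EPE metric used here and the 3 px error criterion used in the KITTI comparisons below can be computed as in the following sketch; the handling of the validity mask is an assumption.

```python
import torch

def end_point_error(pred_disp, gt_disp, valid_mask=None):
    """Average absolute disparity error in pixels (EPE)."""
    err = (pred_disp - gt_disp).abs()
    if valid_mask is not None:          # e.g. keep only pixels with ground-truth disparity
        err = err[valid_mask]
    return err.mean()

def bad_pixel_rate(pred_disp, gt_disp, thresh=3.0, valid_mask=None):
    """Fraction of pixels whose disparity error exceeds `thresh` pixels
    (the 3 px criterion used in the KITTI evaluations)."""
    err = (pred_disp - gt_disp).abs()
    if valid_mask is not None:
        err = err[valid_mask]
    return (err > thresh).float().mean()
```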
Table 2 evaluation model using different settings
As can be seen from Table 2, the SesNet model improves on the SimpleResNet model in both precision and running time, and the experimental results confirm that SesNet has certain advantages. The parameter count of the proposed network AEDNet is reduced by nearly 25% compared with the baseline network AANet, from 4×10^6 down to 3×10^6, which makes it possible to port the stereo matching network model to small computing devices such as mobile platforms. The matching rate decreases to a certain extent and the running time decreases correspondingly, but the reduction is small; there are presumably two reasons. One is that the improvement over AANet is mainly in the feature extraction stage, and since the end-to-end network is already superior to other networks in running time, the influence of improving a single part is relatively small; the other is that the invention replaces complex convolutions with simple convolution operations, but the corresponding network structure becomes more complex and the precision of the improved network decreases. The invention compares the proposed AEDNet with the baseline network AANet through disparity estimation on the SceneFlow dataset. The proposed network outperforms AANet in some regions of the SceneFlow dataset, including reflective objects: the continuity of a tube made of reflective material is preserved in the disparity map estimated by the proposed network. For objects with repeated textures, such as rectangular wooden blocks, there is a certain difference from the true disparity map, but the proposed network still handles repeated textures better than AANet.
Fine-tuning training is carried out with the best model obtained from training on the SceneFlow dataset, using the training sets of the KITTI2012 and KITTI2015 datasets respectively, and the network performance is verified with the corresponding test sets. The KITTI2012 dataset contains 194 binocular training image pairs and 195 binocular test image pairs, and the KITTI2015 dataset contains 200 binocular training image pairs and 200 binocular test image pairs. During training, the original images are randomly cropped to 336x960 as input; each training run lasts 1000 epochs with batch size 1; the learning rate starts from 0.0001, is halved every 200 epochs after the model has been trained for 400 epochs, and is halved again after the model has been trained for 900 epochs.
The fine-tuned network model is evaluated according to the evaluation standard of the KITTI evaluation website, and the evaluation results of each algorithm on the KITTI2012 and KITTI2015 datasets are compared in Tables 3 and 4. The data in the tables come from the KITTI evaluation website or from publicly published papers and therefore have a certain credibility; the adopted criterion is that a pixel is correctly estimated when the end-point error of the matching point in the image is less than 3 px. In addition, networks with better performance are compared. The proposed AEDNet and its baseline network AANet are trained and run under the same experimental conditions.
TABLE 3 evaluation results of data sets in KITTI2012 for each network
Network name Out-Noc/% Out-All/% Avg-Noc/px Avg-All/px Run time/s
AcfNet 1.17 1.54 0.5 0.5 0.48
GwcNet 1.32 1.80 0.5 0.5 0.32
GA-Net 1.36 1.70 0.5 0.5 0.36
SGNet 1.38 1.85 0.5 0.5 0.6
PSMNet 1.49 1.89 0.5 0.6 0.41
SegStereo 1.68 2.03 0.5 0.6 0.6
GC-Net 1.77 2.06 0.6 0.7 0.9
AANet 2.54 3.18 0.6 0.7 0.09
DispNetC 4.11 4.65 0.9 1.0 0.06
Flow2Stereo 4.58 5.11 1.0 1.1 0.05
DispSegNet 4.68 5.66 0.9 1.0 0.9
pSGM 4.68 6.13 1.0 1.4 7.92
Inventive network 3.40 4.11 0.7 0.8 0.08
Note: out represents the proportion of mismatching points in the disparity map, Avg represents the average error of matching points in the disparity map, Noc represents a non-occlusion region, and All represents the entire region.
TABLE 4 evaluation results of KITTI2015 data set for each network
Network name D1-bg/% D1-fg/% D1-all/% Run time/s
GANet 1.48 3.46 1.81 1.8
AcfNet 1.51 3.80 1.89 0.48
EdgeStereo 1.84 3.30 2.08 0.32
SegStereo 1.88 4.07 2.25 0.6
GC-Net 2.21 6.16 2.87 0.9
AANet 2.54 4.58 2.88 0.09
DWARF 3.20 3.94 3.33 0.14
DispNetC 4.32 4.41 4.34 0.06
MADnet 3.75 9.20 4.66 0.02
DeepCostAggr 5.34 11.35 6.34 0.03
Inventive network 2.93 5.11 3.29 0.08
Note: d1 represents the proportion of mismatching points in the disparity map, bg represents the background area, fg represents the foreground area, and all represents the entire area.
It can be seen from Tables 3 and 4 that, since the size of the convolution kernel is limited and ordinary convolutions are used in the feature extraction stage, the parameter count of the network is reduced and the running time of the network decreases to a certain extent. Although the accuracy of the network is not as good as that of the baseline network, the network has a certain advantage in running time compared with other high-accuracy stereo matching networks.
In Table 3, AcfNet, GwcNet, PSMNet, SegStereo and GC-Net are end-to-end stereo matching algorithms that all use 3D convolution in the cost aggregation stage. Compared with the proposed network, the fastest running time among these models is 0.32 s, while the processing time of the proposed network can be kept within 0.1 s; this verifies that the intra-scale and inter-scale aggregation modules improve real-time performance after removing a large number of redundant parameters, although the accuracy of the network is sacrificed at the same time. DispNetC and Flow2Stereo combine flow estimation and stereo matching and finally regress the disparity map directly with 2D convolutions, which results in a poorer final disparity map; moreover, the models of Flow2Stereo and DispSegNet are obtained through unsupervised training, so compared with the supervised network proposed by the invention they extract fewer effective features, which affects the final result.
In Table 4, DWARF and MADnet are lightweight designs; MADnet in particular aims at improving and optimizing disparity at low resolution, but such a method ignores the local detail information of the original input image and loses precision to a certain extent.
Comparing the proposed network AEDNet with the baseline network AANet through the disparity estimation results on the KITTI2012 and KITTI2015 test sets: on the KITTI2012 test set, the proposed network computes disparity values more accurately on vehicles, especially on vehicle glass, and represents the disparity information of parts of slender objects, such as branches, more accurately; on the KITTI2015 test set, the proposed network performs better at object edges and in weak-texture areas, such as round road signs, tree trunks, telegraph poles and clumps of grass.
Drawings
Fig. 1 is a structural diagram of the adaptive end-to-end deep network suitable for stereo matching.
Fig. 2 is a network data processing flow.
Fig. 3 is an algorithm flow of the feature extraction module.
Detailed Description
The invention provides an end-to-end network suitable for stereo matching. First, the datasets and training process used for network training are introduced. Then, by evaluating models under different settings, the advantages of the proposed network in terms of time and memory usage are confirmed through its performance on the SceneFlow test set. Finally, the algorithm is fine-tuned and its performance verified on the KITTI2012 and KITTI2015 datasets.
1. Data set and training process
The invention first trains the network with the SceneFlow dataset. The datasets commonly used by conventional stereo matching algorithms, such as KITTI and Middlebury, contain few training images, whereas researchers need a large amount of data to train stereo matching networks, especially end-to-end ones. In 2016, work at CVPR (Computer Vision and Pattern Recognition) proposed constructing a large-scale synthetic dataset, the SceneFlow dataset, which has about 39,000 training image pairs, each with a resolution of 576x960 pixels; it is called a synthetic dataset because it is composed of three-dimensional scenes and is obtained by rendering them. The SceneFlow dataset contains fully dense and accurate label information such as disparity, optical flow and scene flow. It is divided into three subsets: FlyingThings3D, Monkaa and Driving. The content of FlyingThings3D is a background of random geometry with foreground dynamic objects retrieved from ShapeNet; Monkaa is a large and diverse dataset obtained by modelling scenes from an animated film; Driving is similar to Monkaa in the way its images are generated, but mainly provides data of driving scenes. Such a huge dataset allows the deep network to be fully trained.
The KITTI dataset is currently the largest international computer vision algorithm evaluation dataset for autonomous driving scenes. It contains real binocular image data and LIDAR point clouds acquired from 61 scenes including urban areas, rural areas and highways. The KITTI2012 and KITTI2015 datasets are common binocular stereo matching datasets with an image resolution of 384x1248 pixels. KITTI2012 was the first outdoor static stereo matching dataset; it consists of 194 training image pairs with depth labels and 195 test image pairs without depth labels, stored in lossless png format, and the acquired depth maps are semi-dense, covering about one third of each image. The subsequent KITTI2015 dataset was expanded to 200 training scenes and 200 test scenes, with 4 color images per scene, again saved in lossless png format. It was collected in a similar manner to KITTI2012 but, unlike the earlier dataset, in dynamic scenes, mainly containing moving objects such as cars.
The experimental hardware configuration of the invention is an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz x8; a single NVIDIA RTX 2080 GPU is used during network training, the memory is 15.6 GiB and the hard disk space is 1.0 TB. An Ubuntu 20.04 system is used, with GNOME version 3.36.3 and the X11 window system. The software configuration uses CUDA 10.0 as the general-purpose parallel computing architecture together with the corresponding GPU acceleration library cuDNN, and an Anaconda virtual environment is additionally built; the important dependency packages of the environment are shown in Table 5.
Table 5 Important dependency package versions
Adam is used as the optimizer in network training, with parameters beta1 = 0.9 and beta2 = 0.999. The SceneFlow dataset is used for training on all 35454 training image pairs and evaluated on the 4370 image pairs of its standard test set. The original images are randomly cropped to 288x576 as input, and the input images are normalized; random color augmentation and vertical flipping are used, and the maximum disparity is set to 192 pixels. The model is trained for 64 epochs with batch size 4; the learning rate starts from 0.001 and, to prevent the model from failing to converge, is halved every 10 epochs after the model has been trained for 20 epochs.
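The optimizer and learning-rate schedule described above can be sketched as follows; the exact epoch at which the first halving takes effect, and the use of LambdaLR, are interpretations of the schedule rather than details given above, and the training function named in the comment is hypothetical.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=0.001):
    """Adam with beta1=0.9, beta2=0.999; lr halved every 10 epochs after epoch 20."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.999))

    def lr_lambda(epoch):
        if epoch < 20:
            return 1.0
        return 0.5 ** ((epoch - 20) // 10 + 1)   # halve at epoch 20, 30, 40, ...

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Typical usage over 64 epochs:
# for epoch in range(64):
#     train_one_epoch(model, optimizer)   # hypothetical training function
#     scheduler.step()
```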
The specific embodiments of the present invention and the accompanying drawings disclosed above are intended to aid in understanding the contents and spirit of the present invention, and are not intended to limit the present invention. Any modification, replacement, or improvement made within the spirit and principle scope of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A binocular stereo matching method, comprising the following steps:
1) a feature extraction network extracts features from the left and right images to be matched, obtaining feature maps of the left image at N resolutions and corresponding feature maps of the right image at N resolutions, wherein N is an integer greater than 1;
2) a correlation operation is applied to the left- and right-image feature maps of the same resolution to form a 4D matching cost volume; a local cost aggregation operation is then performed on each 4D matching cost volume by an intra-scale aggregation module to obtain a new matching cost volume with the same resolution as the original matching cost volume;
3) the N new matching cost volumes obtained in step 2) are fused by an inter-scale aggregation module to obtain the final matching cost volumes;
4) disparity maps corresponding to the N different resolutions are obtained from the final matching cost volumes; each obtained disparity map is then up-sampled and fed into StereoDRNet to obtain the final predicted disparity map.
2. The method of claim 1, wherein the feature extraction network first down-samples the input image and then processes it with a first residual network to obtain a feature map M1 whose resolution is one third of the input image; the feature map M1 is then fed into a second residual network to obtain a feature map M2 whose resolution is one sixth of the input image; and finally the feature map M2 is fed into a third residual network to obtain a feature map M3 whose resolution is one twelfth of the input image.
3. The method of claim 1 or 2, wherein the intra-scale aggregation module obtains the new matching cost volume through

$$\tilde{C}(d, p) = \sum_{k=1}^{K^2} w_k \cdot C(d, p + p_k + \Delta p_k) \cdot m_k$$

where $\tilde{C}(d, p)$ denotes the matching cost of disparity $d$ at pixel point $p$ after intra-scale aggregation, $K^2$ denotes the total number of sampling points, $w_k$ denotes the aggregation weight of the $k$-th sampling point, $p_k$ is the fixed offset of the $k$-th sampling point with respect to pixel point $p$, $\Delta p_k$ is a learnable additional offset, $m_k$ is the position weight of the $k$-th point used to adjust the aggregation weight $w_k$, and the function $C$ is the matching cost computation function.
4. The method of claim 1 or 2, wherein the inter-scale aggregation module aggregates each matching cost volume with every other matching cost volume; when the resolution of a matching cost volume being aggregated is higher than that of the current matching cost volume, a down-sampling operation is applied to it, and when its resolution is lower than that of the current matching cost volume, an up-sampling operation is applied; the final matching cost volumes are obtained through the inter-scale aggregation module.
5. The method of claim 4, wherein the sampled results are summed through

$$\hat{C}^s = \sum_{r=1}^{S} f_r\big(\tilde{C}^r\big), \quad s = 1, \dots, S$$

to obtain the matching cost volume $\hat{C}^s$, where $\tilde{C}^r$ denotes the matching cost volume of the $r$-th scale after intra-scale aggregation; the function $f_r$ has three cases: when $r = s$ it is an identity mapping, i.e. at the same resolution the matching cost volume of the $r$-th scale is not transformed; when $r < s$, i.e. when the resolution of the intra-scale-aggregated matching cost volume of the $r$-th scale is larger than that of the $s$-th-scale matching cost volume currently being aggregated, the matching cost volume of the $r$-th scale is down-sampled so that its resolution matches that of the $s$-th scale; when $r > s$, i.e. when the resolution of the intra-scale-aggregated matching cost volume of the $r$-th scale is smaller than that of the $s$-th-scale matching cost volume currently being aggregated, the matching cost volume of the $r$-th scale is up-sampled and then passed through a 1x1 convolution, where $s = 1, \dots, S$ and $S$ equals $N$.
6. The method of claim 2, wherein each of the first residual network, the second residual network and the third residual network is formed by sequentially connecting one residual block whose two convolution layers use 1x1 kernels, one residual block whose first layer uses a 3x3 kernel and whose second layer uses a 1x1 kernel, and finally two residual blocks whose two convolution layers use 1x1 kernels.
CN202111479987.6A 2021-12-06 2021-12-06 Binocular stereo matching method Pending CN114170311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111479987.6A CN114170311A (en) 2021-12-06 2021-12-06 Binocular stereo matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111479987.6A CN114170311A (en) 2021-12-06 2021-12-06 Binocular stereo matching method

Publications (1)

Publication Number Publication Date
CN114170311A true CN114170311A (en) 2022-03-11

Family

ID=80483589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111479987.6A Pending CN114170311A (en) 2021-12-06 2021-12-06 Binocular stereo matching method

Country Status (1)

Country Link
CN (1) CN114170311A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023240764A1 (en) * 2022-06-17 2023-12-21 五邑大学 Hybrid cost body binocular stereo matching method, device and storage medium
CN115063619A (en) * 2022-08-18 2022-09-16 北京中科慧眼科技有限公司 Cost aggregation method and system based on binocular stereo matching algorithm
CN116740162A (en) * 2023-08-14 2023-09-12 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium
CN116740162B (en) * 2023-08-14 2023-11-14 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium
CN117726948A (en) * 2024-02-07 2024-03-19 成都白泽智汇科技有限公司 Binocular image processing method and system based on neural network model

Similar Documents

Publication Publication Date Title
CN114170311A (en) Binocular stereo matching method
CN109389667B (en) High-efficiency global illumination drawing method based on deep learning
CN113592026B (en) Binocular vision stereo matching method based on cavity volume and cascade cost volume
CN113505792B (en) Multi-scale semantic segmentation method and model for unbalanced remote sensing image
Panek et al. Meshloc: Mesh-based visual localization
CN112396645A (en) Monocular image depth estimation method and system based on convolution residual learning
Song et al. Deep novel view synthesis from colored 3d point clouds
CN113762267B (en) Semantic association-based multi-scale binocular stereo matching method and device
CN115984494A (en) Deep learning-based three-dimensional terrain reconstruction method for lunar navigation image
CN111310821A (en) Multi-view feature fusion method, system, computer device and storage medium
Liu et al. Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images
Kam et al. Costdcnet: Cost volume based depth completion for a single rgb-d image
CN115546442A (en) Multi-view stereo matching reconstruction method and system based on perception consistency loss
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN116310098A (en) Multi-view three-dimensional reconstruction method based on attention mechanism and variable convolution depth network
Brandt et al. Efficient binocular stereo correspondence matching with 1-D max-trees
Xu et al. Learning inverse depth regression for pixelwise visibility-aware multi-view stereo networks
Yang et al. An improved binocular stereo matching algorithm based on AANet
Ren et al. A lightweight object detection network in low-light conditions based on depthwise separable pyramid network and attention mechanism on embedded platforms
CN113034666A (en) Stereo matching method based on pyramid parallax optimization cost calculation
Shahbazi et al. High-density stereo image matching using intrinsic curves
Ehret et al. Regularization of NeRFs using differential geometry
Morreale et al. Dense 3D visual mapping via semantic simplification
Lin et al. A-SATMVSNet: An attention-aware multi-view stereo matching network based on satellite imagery
Li et al. NeRF-MS: Neural Radiance Fields with Multi-Sequence

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination