CN111583313A - Improved binocular stereo matching method based on PSMNet - Google Patents

Improved binocular stereo matching method based on PSMNet

Info

Publication number
CN111583313A
CN111583313A
Authority
CN
China
Prior art keywords
network
psmnet
stereo matching
convolution
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010217365.5A
Other languages
Chinese (zh)
Inventor
罗炬锋
蒋煜华
李丹
曹永长
偰超
张力
崔笛扬
郑春雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Internet Of Things Co ltd
Original Assignee
Shanghai Internet Of Things Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Internet Of Things Co ltd filed Critical Shanghai Internet Of Things Co ltd
Priority to CN202010217365.5A priority Critical patent/CN111583313A/en
Publication of CN111583313A publication Critical patent/CN111583313A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an improved binocular stereo matching method based on PSMNet, which comprises the following steps: acquiring binocular images and constructing a PSMNet-based backbone network. The network comprises: a deep convolutional network for extracting left and right feature maps from the binocular images; a pyramid pooling structure for extracting multi-scale target features from the left and right feature maps; a matching cost volume for performing cost aggregation on the multi-scale target features to obtain a 3D feature module; and a 3D convolution structure for performing subsequent cost computation on the 3D feature module. The structure of the matching cost volume is improved by introducing a channel attention mechanism that assigns different weights to different feature points, and a network structure based on an encoding process and a decoding process is designed to improve the 3D convolution structure, yielding an improved PSMNet-based backbone network, which is then used to perform stereo matching on the binocular images. The method achieves faster training and higher disparity accuracy, and has good practicability.

Description

Improved binocular stereo matching method based on PSMNet
Technical Field
The invention relates to the technical field of computer vision, in particular to an improved binocular stereo matching method based on PSMNet.
Background
Stereoscopic vision is an important subject in the field of computer vision; its aim is to reconstruct the three-dimensional geometric information of a scene. A binocular stereo camera can be used to obtain left and right views of the current scene, after which a stereo matching algorithm computes the scene's depth information. Stereo matching is the key step for acquiring target depth information in stereo vision: its goal is to match corresponding pixels across two or more viewpoints and to compute disparity and depth, thereby recovering the three-dimensional information of the scene.
A complete stereo matching algorithm usually comprises four steps: matching cost computation, cost aggregation, disparity computation, and disparity refinement. Traditional stereo matching methods mainly rely on techniques such as local-region methods and dynamic programming to obtain a disparity map, but the resulting map usually contains many holes, so a series of post-processing steps is required to refine the disparity information and fill them. With the rapid development of deep learning, disparity map acquisition is no longer limited to traditional stereo matching algorithms: a deep convolutional network can directly predict a dense disparity map of the scene. Owing to the strong representational capability of convolutional neural networks, deep-learning-based stereo matching algorithms can estimate the disparity of scene targets more accurately than traditional methods.
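To make the four steps concrete, the toy matcher below is a plain NumPy sketch of the traditional local-method pipeline (it is not part of the patent): a SAD matching cost, box-window cost aggregation, and winner-take-all disparity selection; the refinement step is omitted.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=5, win=3):
    """Toy local stereo matcher illustrating the classic pipeline:
    SAD matching cost, box-window cost aggregation, and
    winner-take-all disparity selection (no refinement step)."""
    h, w = left.shape
    pad = win // 2
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # 1) Matching cost: |left(x) - right(x - d)| per pixel.
        diff = np.full((h, w), np.inf)
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        # 2) Cost aggregation: sum the cost over a win x win window.
        for y in range(pad, h - pad):
            for x in range(pad, w - pad):
                cost[d, y, x] = diff[y - pad:y + pad + 1,
                                     x - pad:x + pad + 1].sum()
    # 3) Disparity computation: winner-take-all over the cost volume.
    return cost.argmin(axis=0)
```

On textured images whose right view is the left view shifted by a constant amount, interior pixels recover that shift exactly; the holes near borders and in ambiguous regions are what the post-processing steps mentioned above are meant to fix.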
Zbontar and LeCun first proposed using a Siamese (twin) network to compute the matching cost of the left and right views: a pair of 9×9 image patches serves as the network input, the output is defined as the similarity of the two patches, and the network is trained on this objective. Luo et al. proposed a faster Siamese network for matching cost computation in 2016, treating the computation as a multi-label classification problem to increase inference speed. Shaked and Wolf proposed in 2017 a highway network for matching cost computation together with a global disparity network that predicts disparity confidence scores, which help further refine the disparity map. However, such CNN-based stereo matching algorithms typically use the Siamese network only for matching cost computation, and the computed disparity values still require post-processing to improve accuracy.
In recent years, researchers' main focus in stereo matching has been on generating scene disparity maps directly, end to end, with a convolutional neural network, abandoning the series of post-processing optimization steps. Mayer et al. proposed the end-to-end disparity computation network DispNet in 2016 and provided a large stereo matching dataset, Scene Flow, for model training. Pang et al. introduced a two-stage network called Cascade Residual Learning (CRL) based on DispNet, whose first and second stages compute the disparity map and its multi-scale residuals respectively; the outputs of the two stages are then summed to form the final disparity map. Kendall et al. proposed GCNet in 2017, which builds a 3D cost volume of size D×H×W×C by densely comparing each pixel in the left feature map with all possible matching pixels on the same epipolar line in the right feature map, extracts information from the cost volume with 3D convolutions, and finally obtains the best matching disparity through a soft-argmin operation. Jia-Ren Chang et al. proposed PSMNet based on GCNet, introducing a pyramid pooling structure into the stereo matching network for multi-scale feature extraction and using an hourglass structure in the 3D CNN to capture context information, ultimately achieving better disparity results than GCNet.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an improved binocular stereo matching method based on PSMNet that can compute high-precision disparity maps for binocular images and videos.
The technical solution adopted by the invention to solve the above problem is as follows. The improved binocular stereo matching method based on PSMNet comprises the following steps:
step (1): acquiring binocular images and constructing a PSMNet-based backbone network, which includes:
a deep convolutional network for extracting features from the binocular images to obtain left and right feature maps;
a pyramid pooling structure for extracting multi-scale target features from the left and right feature maps;
a matching cost volume for performing cost aggregation on the multi-scale target features of the left and right feature maps to obtain a 3D feature module;
a 3D convolution structure for performing subsequent cost computation on the 3D feature module;
step (2): assigning different weights to the different feature points extracted by the network model by introducing a channel attention mechanism, so as to improve the structure of the matching cost volume;
step (3): designing a network structure based on an encoding process and a decoding process to improve the 3D convolution structure, thereby obtaining an improved PSMNet-based backbone network;
step (4): performing stereo matching on the binocular images using the improved PSMNet-based backbone network.
Step (2) further comprises: taking the similarity values of the left and right feature maps as the weights of the depth dimension, where a greater weight for a depth dimension means it is more important to the disparity computation.
The similarity values of the left and right feature maps are evaluated by a metric function such as the Euclidean distance, the cosine distance, or the Pearson correlation coefficient.
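The patent names the three metrics but does not fix their exact formulas; as a hedged illustration, they could be computed on flattened feature vectors like this (the mapping of the Euclidean distance into (0, 1] is one possible choice, not prescribed by the patent):

```python
import numpy as np

def euclidean_similarity(f_l, f_r):
    """Euclidean-distance-based similarity, mapped into (0, 1] so that
    identical feature vectors receive the maximum weight 1."""
    return 1.0 / (1.0 + np.linalg.norm(f_l - f_r))

def cosine_similarity(f_l, f_r):
    """Cosine of the angle between the two feature vectors."""
    return float(np.dot(f_l, f_r) /
                 (np.linalg.norm(f_l) * np.linalg.norm(f_r)))

def pearson_similarity(f_l, f_r):
    """Pearson correlation coefficient of the two feature vectors."""
    return float(np.corrcoef(f_l, f_r)[0, 1])
```

All three return their maximum value for identical left/right features, which is what lets the similarity serve directly as a depth-dimension weight.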
Step (3) is specifically as follows: an encoding-decoding network is constructed by adding an encoding process and a decoding process, where the encoding process refers to the resolution of the left and right feature maps changing from large to small as the convolutional layers propagate forward, and the decoding process refers to the resolution changing from small to large. The first half of the encoding-decoding network maps the data from a low dimension to a high dimension to encode it, and the second half maps the encoded data from the high dimension back to a low dimension to decode it; this encoding-decoding process realizes the information conversion.
A 1x1 convolution is also introduced into the encoding and decoding processes; it fuses deep and shallow features and performs feature reuse, automatically adjusting for the differences between the deep and shallow features.
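The fusion role of the 1x1 convolution can be sketched as follows (a NumPy illustration under the assumption that the shallow and deep feature maps have equal spatial size; the kernel `w` stands in for learned weights):

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution is just a per-position linear map over channels.
    x: feature map of shape (c_in, h, w); w: kernel of shape (c_out, c_in)."""
    return np.tensordot(w, x, axes=([1], [0]))  # shape (c_out, h, w)

def fuse_features(shallow, deep, w):
    """Fuse shallow and deep feature maps by channel concatenation followed
    by a 1x1 convolution, letting the (learned) kernel w decide how much of
    each source to keep - i.e. feature reuse."""
    stacked = np.concatenate([shallow, deep], axis=0)
    return conv1x1(stacked, w)
```

Because the 1x1 kernel mixes only channels and touches no spatial neighbourhood, it is a cheap bridge between features of different depths, which is why it can "automatically adjust" their differences once trained.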
The deep convolutional network in step (1) includes deep convolutional networks such as ResNet and PVANet.
Advantageous effects
Due to the adoption of the above technical solution, the invention has the following advantages and positive effects compared with the prior art. An attention mechanism is introduced into the matching cost volume, and the similarity value of the left and right feature maps is used as the weight of the depth dimension; this evaluates the importance of each depth dimension and helps the network, during training, to focus on the information that carries more weight and is more useful for disparity computation, thereby improving disparity accuracy. The 3D convolution structure is improved over the traditional method with an encoding-decoding structure that fuses shallow and deep features and performs feature reuse, accelerating model convergence during training. The overall network achieves faster training and higher disparity accuracy, and has good practicability.
Drawings
FIG. 1 is a schematic structural flow diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the weight assignment process in left-right feature cost matching in an embodiment of the present invention;
FIG. 3 is a diagram of the encoding-decoding network in an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to an improved binocular stereo matching method based on PSMNet. A PSMNet-based backbone network is constructed, comprising: a deep convolutional network (CNN) for extracting features from the binocular images to obtain left and right feature maps; a pyramid pooling structure (SPP module) for extracting multi-scale target features from the left and right feature maps; a matching cost volume for performing cost aggregation on the multi-scale target features of the left and right feature maps to obtain a 3D feature module; and a 3D convolution structure (3D CNN) for performing subsequent cost computation on the 3D feature module. Different weights are assigned to the different feature points extracted by the network model by introducing a channel attention mechanism, improving the structure of the matching cost volume; a network structure based on an encoding process and a decoding process is designed to improve the 3D convolution structure, yielding an improved PSMNet-based backbone network; stereo matching is then performed on the binocular images.
As shown in fig. 1, a schematic structural flow diagram of the embodiment: S101 constructs the PSMNet-based backbone network to perform stereo matching on the acquired binocular images; S102 modifies the structure of the matching cost volume by introducing an attention mechanism, placing more importance on depth-dimension information; S103 designs a network structure based on the encoding-decoding idea to improve the 3D convolution structure (3D CNN); and S104 runs the training and test to evaluate the disparity accuracy on the binocular images.
The embodiment of the invention provides an improved binocular stereo matching method based on PSMNet, comprising the following steps:
(1) constructing a PSMNet-based backbone network, mainly comprising a deep convolutional network (CNN), a pyramid pooling structure (SPP module), a matching cost volume and a 3D convolution structure (3D CNN);
(2) modifying the structure of the matching cost volume by introducing a channel attention mechanism, helping the network focus on the important depth-dimension information during training and thereby improving the accuracy of the output disparity;
(3) modifying the 3D convolution structure by designing a network structure based on the encoding-decoding idea.
In step (1) of this embodiment, the deep convolutional network (CNN) is responsible for simultaneous feature extraction of the left and right views; it is essentially a Siamese network, whose structure generally consists of a common deep convolutional network such as ResNet or PVANet. The pyramid pooling structure (SPP module) performs multi-scale pooling on the same feature map, obtaining feature information at different resolutions and thus realizing multi-scale target feature extraction. The matching cost volume is responsible for cost aggregation of the multi-scale features of the left and right feature maps to obtain a 3D feature module; compared with the original 2D feature maps, this 3D module has one extra depth-dimension channel. The 3D convolution structure (3D CNN) performs the subsequent cost computation on the 3D feature module.
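The extra depth dimension can be seen in the concatenation-style cost volume used by GCNet and PSMNet, sketched below in NumPy (an illustration of the general technique, not the patent's exact implementation): for each candidate disparity d, the left features are paired with the right features shifted by d columns.

```python
import numpy as np

def build_cost_volume(left_feat, right_feat, max_disp):
    """Concatenation-style cost volume: for each candidate disparity d,
    pair each left feature with the right feature d columns to its left,
    producing a 4D volume (max_disp, 2c, h, w) - one extra depth
    dimension compared with the 2D feature maps of shape (c, h, w)."""
    c, h, w = left_feat.shape
    vol = np.zeros((max_disp, 2 * c, h, w))
    for d in range(max_disp):
        vol[d, :c, :, d:] = left_feat[:, :, d:]
        vol[d, c:, :, d:] = right_feat[:, :, :w - d]
    return vol
```

A 3D CNN then slides over the (depth, height, width) axes of this volume to regress the matching cost at every candidate disparity.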
In step (2) of this embodiment, different weights are assigned to different feature points in the network model by introducing a channel attention mechanism. The similarity values of the left and right feature maps are used as the weights of the depth dimension, which evaluates the importance of each depth dimension and helps the network, during training, focus on the information that carries more weight and is more useful for disparity computation. The similarity of the left and right feature maps can be computed with a similarity metric function such as the Euclidean distance, the cosine distance, or the Pearson correlation coefficient.
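One way to realize this weighting (a hedged NumPy sketch; the cosine metric and the clipping to [0, 1] are illustrative choices, since the patent allows any of the three metrics) is to score each depth slice by the similarity of the left map and the correspondingly shifted right map, then scale the cost volume slice-wise:

```python
import numpy as np

def depth_dimension_weights(left_feat, right_feat, max_disp):
    """For each candidate disparity d, measure how similar the left
    feature map is to the right feature map shifted by d columns, and use
    that value (clipped to [0, 1]) as the weight of depth slice d."""
    c, h, w = left_feat.shape
    weights = np.zeros(max_disp)
    for d in range(max_disp):
        l = left_feat[:, :, d:].ravel()
        r = right_feat[:, :, :w - d].ravel()
        cos = np.dot(l, r) / (np.linalg.norm(l) * np.linalg.norm(r) + 1e-12)
        weights[d] = np.clip(cos, 0.0, 1.0)
    return weights

def reweight_cost_volume(cost, weights):
    """Scale each depth slice of the cost volume by its attention weight."""
    return cost * weights[:, None, None]
```

Slices whose shifted right features agree well with the left features keep their magnitude, while poorly matching depth slices are attenuated, which is the intended "focus on the more useful depth information".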
Preferably, the attention in this embodiment takes the channel form: a channel attention mechanism assigns different weights to the feature maps of different channels in the network model, whereas a spatial attention mechanism assigns different weights to different feature points within the same feature map. A channel attention mechanism is introduced into the matching cost volume to better support the computation of the target depth information.
FIG. 2 is a schematic diagram of the weight assignment process in left-right feature cost matching in an embodiment of the present invention. The left and right feature maps need to be weighted when performing cost matching, where F[L, R] is the similarity computation function of the left and right feature maps, and the one-dimensional weight vector [W1, W2, ..., Wn] holds the similarity value of each depth dimension, each ranging from 0 to 1.
In step (3) of this embodiment, the network structure based on the encoding-decoding idea consists of two parts, encoding and decoding: the encoding process is the resolution of the feature maps changing from large to small as the convolutional layers propagate forward, and the decoding process is the resolution changing from small back to large.
Preferably, this embodiment adopts the encoding-decoding network structure shown in fig. 3. The first half of the network maps the data from a low dimension to a high dimension to encode it, and the second half maps the encoded data from the high dimension back to a low dimension to decode it, realizing the information conversion through this process. A 1x1 convolution is introduced as a bridge between the deep and shallow features, automatically adjusting the differences between the front and rear features and accelerating model convergence during training.
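The resolution behaviour of the two halves can be sketched minimally as follows (a NumPy stand-in: average pooling and nearest-neighbour upsampling play the roles that strided 3D convolutions and deconvolutions play in the actual network):

```python
import numpy as np

def encode(x):
    """Encoding: feature-map resolution changes from large to small
    (here via 2x2 average pooling; x has shape (c, h, w) with even h, w)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def decode(x):
    """Decoding: resolution changes from small back to large
    (here via nearest-neighbour upsampling)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def encode_decode(x):
    """First half encodes (downsamples), second half decodes (upsamples),
    restoring the original spatial resolution."""
    return decode(encode(x))
```

In the real network each stage is a learned convolution, and the 1x1 bridge described above mixes the encoder's shallow features into the decoder's deep ones at matching resolutions.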
In summary, an attention mechanism is introduced into the matching cost volume, with the similarity values of the left and right feature maps used as the weights of the depth dimension, so that during training the network better attends to the information that carries more weight and is more useful for disparity computation, improving disparity accuracy; and the 3D convolution structure is improved over the traditional method with an encoding-decoding structure that fuses shallow and deep features, accelerating model convergence during training. The method therefore has good practicability.

Claims (6)

1. An improved binocular stereo matching method based on PSMNet, characterized by comprising the following steps:
step (1): acquiring binocular images and constructing a PSMNet-based backbone network, which includes:
a deep convolutional network for extracting features from the binocular images to obtain left and right feature maps;
a pyramid pooling structure for extracting multi-scale target features from the left and right feature maps;
a matching cost volume for performing cost aggregation on the multi-scale target features of the left and right feature maps to obtain a 3D feature module;
a 3D convolution structure for performing subsequent cost computation on the 3D feature module;
step (2): assigning different weights to the different feature points extracted by the network model by introducing a channel attention mechanism, so as to improve the structure of the matching cost volume;
step (3): designing a network structure based on an encoding process and a decoding process to improve the 3D convolution structure, thereby obtaining an improved PSMNet-based backbone network;
step (4): performing stereo matching on the binocular images using the improved PSMNet-based backbone network.
2. The improved PSMNet-based binocular stereo matching method of claim 1, wherein step (2) further comprises: taking the similarity values of the left and right feature maps as the weights of the depth dimension, where a greater weight for a depth dimension means it is more important to the disparity computation.
3. The improved PSMNet-based binocular stereo matching method of claim 2, wherein the similarity values of the left and right feature maps are evaluated by a metric function comprising: the Euclidean distance, the cosine distance and the Pearson correlation coefficient.
4. The improved PSMNet-based binocular stereo matching method of claim 1, wherein step (3) is specifically: constructing an encoding-decoding network by adding an encoding process and a decoding process, where the encoding process refers to the resolution of the left and right feature maps changing from large to small as the convolutional layers propagate forward, and the decoding process refers to the resolution changing from small to large; the first half of the encoding-decoding network maps the data from a low dimension to a high dimension to encode it, and the second half maps the encoded data from the high dimension back to a low dimension to decode it; the encoding-decoding process realizes the information conversion.
5. The improved PSMNet-based binocular stereo matching method of claim 4, wherein a 1x1 convolution is further introduced into the encoding process and the decoding process; the 1x1 convolution fuses deep and shallow features and performs feature reuse, and automatically adjusts for differences between the deep and shallow features.
6. The improved PSMNet-based binocular stereo matching method of claim 1, wherein the deep convolutional network in step (1) comprises ResNet and PVANet deep convolutional networks.
CN202010217365.5A 2020-03-25 2020-03-25 Improved binocular stereo matching method based on PSMNet Pending CN111583313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010217365.5A CN111583313A (en) 2020-03-25 2020-03-25 Improved binocular stereo matching method based on PSMNet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010217365.5A CN111583313A (en) 2020-03-25 2020-03-25 Improved binocular stereo matching method based on PSMNet

Publications (1)

Publication Number Publication Date
CN111583313A true CN111583313A (en) 2020-08-25

Family

ID=72116817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010217365.5A Pending CN111583313A (en) 2020-03-25 2020-03-25 Improved binocular stereo matching method based on PSMNet

Country Status (1)

Country Link
CN (1) CN111583313A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355570A (en) * 2016-10-21 2017-01-25 昆明理工大学 Binocular stereoscopic vision matching method combining depth characteristics
CN109685141A (en) * 2018-12-25 2019-04-26 哈工大机器人(合肥)国际创新研究院 A kind of robotic article sorting visible detection method based on deep neural network
CN109978936A (en) * 2019-03-28 2019-07-05 腾讯科技(深圳)有限公司 Parallax picture capturing method, device, storage medium and equipment
CN110070574A (en) * 2019-04-29 2019-07-30 优乐圈(武汉)科技有限公司 A kind of binocular vision Stereo Matching Algorithm based on improvement PSMNet
CN110197505A (en) * 2019-05-30 2019-09-03 西安电子科技大学 Remote sensing images binocular solid matching process based on depth network and semantic information
CN110287964A (en) * 2019-06-13 2019-09-27 浙江大华技术股份有限公司 A kind of solid matching method and device
CN110427968A (en) * 2019-06-28 2019-11-08 武汉大学 A kind of binocular solid matching process based on details enhancing
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method
CN110766623A (en) * 2019-10-12 2020-02-07 北京工业大学 Stereo image restoration method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Jia-Ren Chang et al.: "Pyramid Stereo Matching Network", pp. 5410-5418 *
Liu Jianguo et al.: "An improved stereo matching algorithm based on PSMNet", vol. 48, no. 1, pp. 60-71 *
Zhang Wen et al.: "An efficient and accurate stereo matching algorithm based on convolutional neural networks", vol. 32, no. 32, pp. 45-53 *
Ma Zhenhuan et al.: "A semantic segmentation algorithm based on an enhanced feature-fusion decoder", vol. 46, no. 5, pp. 224-230 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489097A (en) * 2020-12-11 2021-03-12 深圳先进技术研究院 Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
CN112489097B (en) * 2020-12-11 2024-05-17 深圳先进技术研究院 Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
CN113344869A (en) * 2021-05-31 2021-09-03 武汉理工大学 Driving environment real-time stereo matching method and device based on candidate parallax
CN113436269A (en) * 2021-06-15 2021-09-24 影石创新科技股份有限公司 Image dense stereo matching method and device and computer equipment
CN113506336A (en) * 2021-06-30 2021-10-15 上海师范大学 Light field depth prediction method based on convolutional neural network and attention mechanism
CN113506336B (en) * 2021-06-30 2024-04-26 上海师范大学 Light field depth prediction method based on convolutional neural network and attention mechanism
CN113592026A (en) * 2021-08-13 2021-11-02 大连大学 Binocular vision stereo matching method based on void volume and cascade cost volume
CN113592026B (en) * 2021-08-13 2023-10-03 大连大学 Binocular vision stereo matching method based on cavity volume and cascade cost volume
CN114155303A (en) * 2022-02-09 2022-03-08 北京中科慧眼科技有限公司 Parameter stereo matching method and system based on binocular camera

Similar Documents

Publication Publication Date Title
CN111583313A (en) Improved binocular stereo matching method based on PSMNet
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN109472819B (en) Binocular parallax estimation method based on cascade geometric context neural network
CN111259945B (en) Binocular parallax estimation method introducing attention map
CN112150521B (en) Image stereo matching method based on PSMNet optimization
US11348270B2 (en) Method for stereo matching using end-to-end convolutional neural network
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN113592026B (en) Binocular vision stereo matching method based on cavity volume and cascade cost volume
CN111402311B (en) Knowledge distillation-based lightweight stereo parallax estimation method
Perri et al. Adaptive Census Transform: A novel hardware-oriented stereovision algorithm
US8406512B2 (en) Stereo matching method based on image intensity quantization
CN111260707B (en) Depth estimation method based on light field EPI image
Chen et al. Depth completion using geometry-aware embedding
CN109859166A (en) It is a kind of based on multiple row convolutional neural networks without ginseng 3D rendering method for evaluating quality
CN113222033A (en) Monocular image estimation method based on multi-classification regression model and self-attention mechanism
Xing et al. MABNet: a lightweight stereo network based on multibranch adjustable bottleneck module
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
Vázquez‐Delgado et al. Real‐time multi‐window stereo matching algorithm with fuzzy logic
CN112116646B (en) Depth estimation method for light field image based on depth convolution neural network
CN112489097A (en) Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
Zhang et al. GFANet: Group fusion aggregation network for real time stereo matching
CN112862946A (en) Gray rock core image three-dimensional reconstruction method for generating countermeasure network based on cascade condition
CN111968168B (en) Multi-branch adjustable bottleneck convolution module and end-to-end stereo matching network
CN115375746A (en) Stereo matching method based on double-space pooling pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200825