CN117475182B - Stereo matching method based on multi-feature aggregation - Google Patents

Stereo matching method based on multi-feature aggregation

Info

Publication number
CN117475182B
Authority
CN
China
Prior art keywords
feature
features
image
aggregation
texture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311177663.6A
Other languages
Chinese (zh)
Other versions
CN117475182A (en)
Inventor
杨金龙
王刚
吴程
刘建军
王映辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202311177663.6A
Publication of CN117475182A
Application granted
Publication of CN117475182B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a stereo matching method based on multi-feature aggregation, which relates to the technical field of computer vision and comprises the following steps: shallow feature extraction is carried out on the stereoscopic image to obtain a shallow feature map, and deep feature extraction is carried out on the stereoscopic image to obtain a deep feature map; multi-scale semantic features are extracted from the deep feature map and aggregated to obtain semantic features of the stereoscopic image; multi-scale texture features are extracted from the shallow feature map and aggregated to obtain texture features of the stereoscopic image; the semantic features and texture features are aggregated to obtain image features of the stereoscopic image; the image features of the stereoscopic images in the stereoscopic image pair are connected to obtain an initial cost body; multi-scale feature aggregation is performed on the initial cost body to obtain an optimized target cost body; and parallax estimation is performed based on the target cost body to obtain a parallax image between the two stereoscopic images of the pair. The method helps improve the matching accuracy of stereo matching.

Description

Stereo matching method based on multi-feature aggregation
Technical Field
The application relates to the technical field of computer vision, and in particular to a stereo matching method based on multi-feature aggregation.
Background
Binocular stereo matching is an important research task in the field of computer vision. It calculates the parallax between pixels by comparing the left-view and right-view images, thereby determining the depth of objects in a scene. Stereo matching plays a key role in many fields, such as three-dimensional reconstruction, visual tracking, autonomous driving, and augmented reality.
With the development of deep learning, accurate stereo matching can be achieved by learning feature representations and matching models with deep neural networks. The related art "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation" proposed the first end-to-end stereo matching network, DispNet, which advanced research on deep-learning-based stereo matching; the synthetic Scene Flow dataset proposed in the same work is also widely used by subsequent algorithms.
However, stereo matching methods based on convolutional neural networks do not generalize well to unseen scenes, such as those with occlusion and illumination changes. There may be large domain differences between datasets, such as color, illumination, contrast, and texture, so a network trained on one dataset may give poor results on other real or unseen scenes. The pyramid pooling module proposed in the literature "Pyramid Stereo Matching Network" constructs a matching cost body by aggregating global context information at different scales and positions, and obtains better predictions than previous networks in ill-posed regions. The literature "A Simple and Efficient Approach for Adaptive Stereo Matching" proposes a non-adversarial color-gamut transfer method and reconstructs training with matching-cost regularization, thereby improving the domain adaptability of the network, targeting the significant difference in color distribution between synthetic and real datasets. The literature "Revisiting Stereo Depth Estimation from a Sequence-to-Sequence Perspective with Transformers" treats matching as serialized pixel matching and uses positional information and attention mechanisms to replace the cost-body strategy of conventional matching methods, enhancing the generalization capability of the network. However, these algorithms use convolutional networks to extract only deep semantic information and ignore shallow information; especially when applied to real scenes, their matching accuracy drops severely.
Disclosure of Invention
Aiming at the above problems and technical requirements, the inventor provides a stereo matching method based on multi-feature aggregation. The technical scheme of the application is as follows:
in one aspect, a stereo matching method based on multi-feature aggregation is provided, including the following steps:
acquiring a stereo image pair to be matched;
shallow feature extraction is carried out on each stereoscopic image in the stereoscopic image pair to obtain a shallow feature map, and deep feature extraction is carried out on each stereoscopic image to obtain a deep feature map;
carrying out multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the deep feature map of each stereoscopic image to obtain semantic features of each stereoscopic image;
carrying out multi-scale texture feature extraction and multi-scale texture feature aggregation on the shallow feature map of each stereoscopic image to obtain texture features of each stereoscopic image;
aggregating the semantic features and the texture features of each stereoscopic image to obtain image features of each stereoscopic image;
connecting the image features of all the stereoscopic images in the stereoscopic image pair to obtain an initial cost body;
performing multi-scale feature aggregation on the initial cost body to obtain an optimized target cost body;
and performing parallax estimation based on the target cost body to obtain a parallax image between the two stereoscopic images of the stereo image pair.
The further technical scheme is as follows:
The multi-scale semantic feature extraction and multi-scale semantic feature aggregation are carried out on the deep feature map of each stereoscopic image to obtain semantic features of each stereoscopic image, and the method comprises the following steps:
Extracting features of the deep feature map by using adaptive filters with different kernel sizes to obtain initial semantic features with different scales;
The initial semantic features with different scales are aggregated to obtain initial aggregated features;
Performing feature extraction on the initial aggregation feature by using a first extraction branch to obtain a first aggregation feature, and performing feature extraction on the initial aggregation feature by using a second extraction branch to obtain a second aggregation feature;
And cascading the first aggregation feature with the second aggregation feature to obtain the semantic feature.
The adaptive filter comprises a first branch and a second branch, wherein the first branch comprises a convolution layer, and the second branch comprises an adaptive average pooling layer and a convolution layer;
The first extraction branch consists of a CDR module and a CD module, wherein the CDR module comprises a convolution layer, a domain normalization layer and a ReLU layer, and the CD module comprises the convolution layer and the domain normalization layer;
The second extraction branch is composed of the CD module.
Multi-scale texture feature extraction and multi-scale texture feature aggregation are carried out on the shallow feature map of each stereoscopic image to obtain texture features of each stereoscopic image, and the method comprises the following steps:
Carrying out statistical feature extraction on the shallow feature map to obtain statistical texture features corresponding to the shallow feature map;
Carrying out feature extraction of different scales on the statistical texture features to obtain statistical texture local features of different scales;
And carrying out feature aggregation on different statistical texture local features to obtain the texture features.
Carrying out statistical feature extraction on the shallow feature map to obtain statistical texture features corresponding to the shallow feature map, wherein the statistical texture features comprise:
extracting features of the shallow feature map by using a channel attention mechanism to obtain an average feature vector corresponding to the shallow feature map;
Quantizing the shallow feature map based on the average feature vector to obtain one-dimensional quantized coding features, wherein the one-dimensional quantized coding features comprise texture grade information corresponding to different pixel points in the shallow feature map;
and aggregating the one-dimensional quantized coding features and the average feature vector to obtain statistical texture features corresponding to the shallow feature map.
The step of quantizing the shallow feature map based on the average feature vector to obtain one-dimensional quantized coding features comprises the following steps:
Calculating the similarity between the average feature vector and the shallow feature map to obtain a similarity matrix;
carrying out one-dimensional level quantization on the similarity matrix to obtain quantization features of different levels, wherein the different quantization features correspond to different texture intensities;
and mapping the similarity matrix based on different quantization characteristics to obtain the one-dimensional quantization coding characteristics.
The step of aggregating the one-dimensional quantized coding features and the average feature vector to obtain statistical texture features corresponding to the shallow feature map comprises the following steps:
Cascading the average value of the one-dimensional quantized coding features with the quantized features to obtain counting features;
And cascading the counting feature with the average feature vector to obtain the statistical texture feature corresponding to the shallow feature map.
The step of extracting the features of the statistical texture features with different scales to obtain the local features of the statistical texture with different scales comprises the following steps:
performing feature aggregation on the statistical texture features and the one-dimensional quantized coding features to obtain two-dimensional texture features;
And carrying out feature extraction on the two-dimensional texture features based on different quantization levels to obtain statistical texture local features with different scales.
The multi-scale feature aggregation is carried out on the initial cost body to obtain an optimized target cost body, which comprises the following steps:
preprocessing the initial cost body to obtain a combined cost body after feature combination;
And inputting the combined cost body into at least two circulating hourglass modules connected in series to sequentially perform feature extraction to obtain a final cost body output by the last circulating hourglass module, and taking the final cost body as the target cost body, wherein the circulating hourglass module is used for extracting and aggregating multi-scale features.
The method is applied to a stereo matching network, where the stereo matching network comprises a shallow feature extraction network, a deep feature extraction network, a multi-scale semantic feature aggregation module, a multi-scale texture feature aggregation module, and a cyclic hourglass aggregation network, and the method further comprises:
obtaining a sample image pair, wherein the sample image pair is marked with a parallax true value;
Carrying out shallow feature extraction on each sample three-dimensional image in the sample image pair by utilizing the shallow feature extraction network to obtain a sample shallow feature image, and carrying out deep feature extraction on each sample three-dimensional image by utilizing the deep feature extraction network to obtain a sample deep feature image;
Carrying out multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the sample deep feature map of each sample stereo image by utilizing the multi-scale semantic feature aggregation module to obtain sample semantic features of each sample stereo image;
carrying out multi-scale texture feature extraction and multi-scale texture feature aggregation on the shallow feature map of each sample stereo image by utilizing the multi-scale texture feature aggregation module to obtain sample texture features of each sample stereo image;
aggregating the sample semantic features and the sample texture features of each stereoscopic image to obtain sample image features of each sample stereoscopic image;
connecting the image features of all the sample stereo images in the sample image pair to obtain an initial sample cost body;
Performing multi-scale feature aggregation on the initial sample cost body by using the circulating hourglass aggregation network to obtain an optimized target sample cost body;
performing parallax estimation based on the target sample cost body to obtain a predicted parallax image between the sample image pairs;
the stereo matching network is trained based on differences between the predicted disparity map and the disparity truth values.
The beneficial technical effects of the application are as follows:
In the embodiment of the application, in the stereo matching process, preliminary feature extraction is first performed on the stereoscopic image to obtain preliminary semantic features and preliminary texture features; multi-scale feature extraction and aggregation are then performed on the preliminary semantic features to obtain the semantic features, and on the preliminary texture features to obtain the texture features, so that semantic information and texture information can be fully extracted. The semantic features and texture features are then used together for stereo matching, which avoids excessive dependence on specific features and reduces the impact of cross-domain conditions on stereo matching. In addition, the constructed initial cost body is further subjected to multi-scale feature aggregation, which captures more feature information, enhances the domain generalization of the network, and improves the matching accuracy of stereo matching.
Drawings
FIG. 1 is a block diagram of a multi-feature aggregation-based stereo matching network provided in accordance with an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of the architecture of a multi-scale semantic feature aggregation module provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a one-dimensional texture feature extraction module according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a multi-scale texture feature aggregation module provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a cyclic hourglass aggregation network provided by one exemplary embodiment of the present application;
FIG. 6 is a graph comparing the stereo matching results of the algorithm of the present application and the algorithm of the related art provided by an exemplary embodiment;
Fig. 7 is a diagram showing a comparison of the stereo matching result of the algorithm of the present application and the algorithm in the related art according to another exemplary embodiment.
Detailed Description
The following describes the embodiments of the present application further with reference to the drawings.
The stereo matching method based on multi-feature aggregation provided by the embodiment of the application can be executed by computer equipment. The method comprises the following steps:
Step S1, a stereo image pair to be matched is obtained.
The stereoscopic matching refers to matching images of the same object at different view angles at the same time to obtain parallax among pixels in the images at different view angles. The stereo image pair to be matched is an image pair acquired from the same scene at different view angles at the same time, and optionally, the stereo image pair can comprise a left view angle stereo image and a right view angle stereo image.
And S2, carrying out shallow feature extraction on each stereoscopic image in the stereoscopic image pair to obtain a shallow feature map, and carrying out deep feature extraction on each stereoscopic image to obtain a deep feature map.
In the matching process, the stereoscopic images in the stereoscopic image pair are first input into a feature extraction network for preliminary feature extraction, which yields preliminary semantic features (deep feature maps) and preliminary texture features (shallow feature maps). In the embodiment of the application, the feature extraction network comprises a shallow feature extraction network and a deep feature extraction network. The shallow feature extraction network extracts shallow features of the stereoscopic image for subsequent texture feature extraction; the deep feature extraction network extracts deep features of the stereoscopic image for subsequent semantic feature extraction.
Alternatively, the shallow feature extraction network may be composed of one CBR (Conv + Batch Normalization + ReLU) module, one CBL (Conv + Batch Normalization + LeakyReLU) module, and a 1×1 convolutional layer. The CBR module consists of a convolution layer with a kernel size of 1×1, a batch normalization layer, and a ReLU layer; the CBL module consists of a convolution layer with a kernel size of 3×3, a batch normalization layer, and a LeakyReLU layer. When a stereoscopic image is input into the shallow feature extraction network, a shallow feature map F t is obtained whose resolution coincides with that of the input stereoscopic image.
Alternatively, as shown in fig. 1, the deep feature extraction network may be composed of CBR modules and residual modules, where each residual module includes a convolutional layer (conv), a batch normalization layer (bn), and a ReLU layer. After the stereoscopic image is input into the deep feature extraction network, three CBR modules downsample it to 1/2 of the original resolution, and 25 consecutive residual modules then continue downsampling to 1/8 of the original resolution, yielding the deep feature map F s.
A shallow feature map and a deep feature map are thus extracted for each stereoscopic image in the stereoscopic image pair.
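For concreteness, the sketch below shows one possible PyTorch arrangement of the shallow and deep feature extraction networks described above; it is only an illustrative sketch, and the channel widths, the reduced residual-block count, and the class names are assumptions rather than the exact configuration of the embodiment.

```python
import torch
import torch.nn as nn

def cbr(in_c, out_c, k, s=1):
    # Conv + BatchNorm + ReLU
    return nn.Sequential(nn.Conv2d(in_c, out_c, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))

def cbl(in_c, out_c, k, s=1):
    # Conv + BatchNorm + LeakyReLU
    return nn.Sequential(nn.Conv2d(in_c, out_c, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(out_c), nn.LeakyReLU(0.1, inplace=True))

class ShallowFeatureNet(nn.Module):
    """CBR (1x1) + CBL (3x3) + 1x1 conv; the output F_t keeps the input resolution."""
    def __init__(self, c=32):
        super().__init__()
        self.body = nn.Sequential(cbr(3, c, 1), cbl(c, c, 3), nn.Conv2d(c, c, 1))

    def forward(self, x):
        return self.body(x)

class ResBlock(nn.Module):
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, stride, 1, bias=False), nn.BatchNorm2d(out_c),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_c, out_c, 3, 1, 1, bias=False), nn.BatchNorm2d(out_c))
        self.skip = (nn.Identity() if stride == 1 and in_c == out_c else
                     nn.Sequential(nn.Conv2d(in_c, out_c, 1, stride, bias=False),
                                   nn.BatchNorm2d(out_c)))

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

class DeepFeatureNet(nn.Module):
    """Three CBR modules downsample to 1/2; residual blocks continue to 1/8 (F_s).
    The embodiment uses 25 residual modules; only a few are shown here."""
    def __init__(self, c=32):
        super().__init__()
        self.stem = nn.Sequential(cbr(3, c, 3, 2), cbr(c, c, 3), cbr(c, c, 3))
        self.res = nn.Sequential(ResBlock(c, c), ResBlock(c, 2 * c, 2),
                                 ResBlock(2 * c, 2 * c), ResBlock(2 * c, 4 * c, 2),
                                 ResBlock(4 * c, 4 * c))

    def forward(self, x):
        return self.res(self.stem(x))
```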
And S3, carrying out multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the deep feature map of each stereoscopic image to obtain the semantic features of each stereoscopic image.
The semantic features of the stereoscopic image are used to characterize content information contained in the stereoscopic image, such as roads, buildings, trees, and the like. Depth information of the stereoscopic image can be obtained through extraction of semantic features.
In one possible implementation manner, a multi-scale semantic feature aggregation module is arranged in the stereo matching network and is used for extracting and aggregating semantic features of the stereo image. In the extraction process, multi-scale semantic feature extraction is carried out on the deep feature map to obtain multi-scale semantic features so as to obtain semantic features containing rich semantic information, and then feature aggregation is carried out on the multi-scale semantic features to obtain final semantic features of the stereoscopic image.
And corresponding semantic features can be extracted from each stereoscopic image.
And S4, carrying out multi-scale texture feature extraction and multi-scale texture feature aggregation on the shallow feature map of each stereoscopic image to obtain the texture features of each stereoscopic image.
Because the semantic features may have loss of detail information, in the embodiment of the application, the semantic features are extracted and the texture features of the shallow layer of the image are extracted at the same time so as to supplement the detail information. The texture features of the stereoscopic image are used to characterize the surface information of each object in the stereoscopic image, which contains the relationship between each pixel of the image. The surface layer information of the stereoscopic image can be obtained through extraction of texture features.
And a multi-scale texture feature aggregation module is arranged in the stereo matching network, the multi-scale texture feature aggregation module is utilized for extracting and aggregating features, in the process, firstly, multi-scale texture feature extraction is carried out on a shallow feature map to obtain multi-scale texture features, and then, feature aggregation is carried out to obtain the texture features of the stereo image. Correspondingly, each stereoscopic image can extract the corresponding texture features.
And S5, aggregating the semantic features and the texture features of each stereoscopic image to obtain the image features of each stereoscopic image.
After extracting the semantic features and the texture features of each stereoscopic image, carrying out feature aggregation on the semantic features and the texture features of each stereoscopic image to obtain the image features of each stereoscopic image. In one possible implementation, the semantic features and the texture features may be aggregated by a cascading operation to obtain the image features.
The method comprises the steps of extracting a shallow feature map and a deep feature map from each stereo image by utilizing a feature extraction network, extracting texture features based on the shallow feature map and semantic features based on the deep feature map, and obtaining image features of the stereo images by aggregation. Alternatively, when the stereoscopic image pair includes a left-view stereoscopic image and a right-view stereoscopic image, the image features of the left-view stereoscopic image and the image features of the right-view stereoscopic image may be extracted.
And S6, connecting the image features of all the stereoscopic images in the stereoscopic image pair to obtain an initial cost body.
The cost body is used for measuring the similarity between two stereo images in the stereo image pair. In particular, it may indicate the similarity between potentially matching pixels in two stereo images. For one stereoscopic image, each pixel point can correspond to a pixel point in the other stereoscopic image according to the corresponding parallax level, and the parallax level is the parallax range between the corresponding pixel points. Through each parallax level, the feature mapping of all the stereoscopic images in the stereoscopic image pair can be connected to obtain a cost body for the subsequent parallax estimation.
In the embodiment of the application, the image feature maps of all the stereoscopic images are cascaded at each parallax level to form a 4D cost body, and this cascade is taken as the initial cost body. The 4D cost body captures different visual information by using two types of features, semantic and texture; the cost body obtained by multi-feature fusion enables the stereo matching network to analyze images more comprehensively, and the depth of different regions can be estimated more accurately over a larger parallax range.
For each parallax level, a cost body can be generated, and cascading the cost bodies corresponding to all parallax levels yields a 4D cost body of shape C×D×H×W, where D denotes the maximum parallax range. Illustratively, the maximum parallax range may be 192, and the parallax levels are then 0, 1, 2, …, 191.
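As an illustration of the cost body construction described above, the following sketch builds a 4D concatenation cost body from left-view and right-view image features; the function name and the assumption that the features are at 1/4 resolution (so the maximum parallax is also divided by 4) are illustrative, not mandated by the embodiment.

```python
import torch

def build_concat_cost_volume(feat_left, feat_right, max_disp):
    """Concatenate left/right image features at every parallax level.

    feat_left, feat_right: (B, C, H, W) image features.
    Returns a 4D cost body of shape (B, 2C, D, H, W) with D = max_disp.
    """
    b, c, h, w = feat_left.shape
    cost = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = feat_left
            cost[:, c:, d] = feat_right
        else:
            # a pixel at column x in the left view matches column x - d in the right view
            cost[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            cost[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return cost
```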
And S7, performing multi-scale feature aggregation on the initial cost body to obtain a target cost body.
In the embodiment of the application, a cyclic hourglass aggregation network is also introduced to perform multi-scale feature aggregation on the initial cost body, thereby optimizing the initial cost body into the target cost body.
And S8, performing parallax estimation based on the target cost body to obtain a parallax image between the stereo image pairs.
And after the target cost body is obtained through aggregation, performing parallax estimation by utilizing a parallax regression algorithm to obtain a parallax image between the stereo image pairs.
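The embodiment does not spell out the parallax regression algorithm; a commonly used soft-argmin style regression over the aggregated cost body, given here only as an assumed example, looks as follows.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost, max_disp):
    """Soft-argmin parallax regression.

    cost: (B, D, H, W) aggregated matching cost (lower value = better match).
    Returns a dense parallax map of shape (B, H, W).
    """
    prob = F.softmax(-cost, dim=1)          # per-pixel distribution over parallax levels
    disp = torch.arange(max_disp, device=cost.device, dtype=cost.dtype).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp, dim=1)    # expectation over levels = predicted parallax
```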
As shown in fig. 1, which shows a structural block diagram of a stereo matching network based on multi-feature aggregation, for a left view in a stereo image pair, a shallow feature extraction network is used to extract a shallow feature map, and the shallow feature map is input into a multi-scale texture feature aggregation module to extract texture features of the left view; and extracting by using a deep feature extraction network to obtain a deep feature map, inputting the deep feature map into a multi-scale semantic feature aggregation module, extracting to obtain semantic features of the left view, and cascading the texture features and the semantic features to obtain image features of the left view. And adopting a similar method for the right view in the stereo image pair to obtain the image characteristics of the right view, and sharing the weight between a network for extracting the characteristics of the right view and a network for extracting the characteristics of the left view.
After the image features of the left view and the right view are aggregated into a 4D cost body, the 4D cost body is input into the cyclic hourglass aggregation network for feature extraction and aggregation to obtain the optimized target cost body, and parallax calculation is performed with the target cost body to obtain the predicted parallax map between the left and right views.
In the embodiment of the application, in the stereo matching process, preliminary feature extraction is first performed on the stereoscopic image to obtain preliminary semantic features and preliminary texture features; multi-scale feature extraction and aggregation are then performed on the preliminary semantic features to obtain the semantic features, and on the preliminary texture features to obtain the texture features, so that semantic information and texture information can be fully extracted. The semantic features and texture features are then used together for stereo matching, which avoids excessive dependence on specific features and reduces the impact of cross-domain conditions on stereo matching. In addition, the constructed initial cost body is further subjected to multi-scale feature aggregation, which captures more feature information, enhances the domain generalization of the network, and improves the matching accuracy of stereo matching.
In the embodiment of the application, the multi-scale semantic feature aggregation module is used to extract semantic features. The process of performing multi-scale semantic feature extraction and aggregation on the deep feature map of a stereoscopic image to obtain its semantic features includes the following steps:
and S31, carrying out feature extraction on the deep feature map by utilizing adaptive filters with different kernel sizes to obtain initial semantic features with different scales.
In one possible implementation, the semantic information of the image can be adaptively learned by using an adaptive filter to learn the information in the deep feature map. The adaptive filter consists of two branches, a first branch and a second branch: the first branch comprises a convolution layer, and the second branch comprises an adaptive average pooling layer and a convolution layer. After feature extraction with the first and second branches, the feature map F 1 extracted by the adaptive filter is obtained through a depthwise separable convolution operation. The depthwise separable convolution maintains an effective representation of the input data while reducing the number of parameters and the computational complexity, which improves the computational efficiency and inference speed of the model, reduces the risk of overfitting, and improves generalization.
And when the adaptive filter is used for extracting the features of the deep feature map, the adaptive filters with different kernel sizes are used for respectively extracting the features to obtain initial semantic features with different scales (namely, feature maps F 1 output by different adaptive filters). The self-adaptive filters with different kernel sizes are used for respectively learning the features in the deep feature map, so that the network can adapt to images in different fields, and the domain generalization performance of the model is improved.
Optionally, 3 adaptive filters with different kernels (k=1, 3, 5) may be used to perform feature extraction to obtain semantic features of different scales.
Optionally, the first branch is a convolution layer with a step size of 1 and a kernel size of 1×1; the second branch consists of an adaptive averaging pooling layer with an output of C x K and a convolution layer with a step size of 1 and a kernel size of 1 x 1. The adaptive filters of different kernels have different adaptive averaging pooling layers for extracting features of different receptive fields.
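One possible reading of the adaptive filter is sketched below: the pooled second branch predicts a per-channel K×K kernel that is applied to the first-branch output as a depthwise (separable) convolution. This interpretation, and the class name, are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFilter(nn.Module):
    """Branch 1: 1x1 conv on the deep feature map.
    Branch 2: adaptive average pooling to KxK followed by a 1x1 conv, interpreted here
    as predicting one KxK depthwise kernel per channel, applied to the branch-1 output."""
    def __init__(self, channels, k):
        super().__init__()
        self.k = k
        self.branch1 = nn.Conv2d(channels, channels, 1)
        self.branch2 = nn.Sequential(nn.AdaptiveAvgPool2d(k),
                                     nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        content = self.branch1(x)                                     # (B, C, H, W)
        kernels = self.branch2(x).reshape(b * c, 1, self.k, self.k)   # per-channel kernels
        out = F.conv2d(content.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)             # depthwise convolution
        return out.reshape(b, c, h, w)
```

Three such filters with K = 1, 3, 5 can be applied to F s in parallel; cascading, convolving, and upsampling their outputs then yields the initial aggregated feature F 2.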
Step S32, the initial semantic features with different scales are aggregated to obtain initial aggregated features.
After extracting the initial semantic features with different scales by using different adaptive filters, aggregating the different initial semantic features to obtain initial aggregate features. Optionally, the initial aggregated features (feature map F 2) may be obtained by rolling and upsampling the different initial semantic features.
Optionally, the initial aggregated feature is at 1/4 of the resolution of the original stereoscopic image, which facilitates the subsequent cost body construction and parallax aggregation.
And step S33, performing feature extraction on the initial aggregation feature by using a first extraction branch to obtain a first aggregation feature, and performing feature extraction on the initial aggregation feature by using a second extraction branch to obtain a second aggregation feature.
In order to further extract deep information, in this embodiment, a first extraction branch and a second extraction branch are provided for further extracting features of the initial aggregation feature.
The first extraction branch consists of a CDR (Conv+ Domain normalization +ReLU) module and a CD (Conv+ Domain normalization) module, wherein the CDR module comprises a convolution layer, a domain normalization layer and a ReLU layer, and the CD module comprises the convolution layer and the domain normalization layer. The second extraction branch contains only CD modules. The first extraction branch is used for capturing complex features in data through nonlinear operation, namely extracting complex information in a deep feature map, but the complex information is easily lost while being extracted, so that the second extraction branch is used for extracting features so as to preserve original features and avoid the loss of the excessive information.
Optionally, the CDR module consists of a convolution layer with a stride of 1 and a kernel size of 3×3, a domain normalization layer, and a ReLU layer; the CD module consists of a convolution layer with a stride of 1 and a kernel size of 3×3 and a domain normalization layer. The domain normalization (DN) operation adopted in the CDR and CD modules normalizes the features along the spatial axes, which enhances local invariance, and normalizes the features along the channel axis with an L 2 function; together, the two address sensitivity to domain shift, noise, and extreme feature values, and improve the domain generalization capability of the network.
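The sketch below illustrates the two extraction branches built from CDR and CD modules, with domain normalization rendered as spatial (instance-style) normalization followed by channel-wise L 2 normalization, as described above; the exact normalization details and the class names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainNorm(nn.Module):
    """Domain normalization: normalize along the spatial axes, then L2-normalize along channels."""
    def __init__(self, channels):
        super().__init__()
        self.inorm = nn.InstanceNorm2d(channels, affine=True)

    def forward(self, x):
        return F.normalize(self.inorm(x), p=2, dim=1)

def cdr(in_c, out_c):
    # Conv + DomainNorm + ReLU (3x3, stride 1)
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, 1, 1, bias=False),
                         DomainNorm(out_c), nn.ReLU(inplace=True))

def cd(in_c, out_c):
    # Conv + DomainNorm (3x3, stride 1)
    return nn.Sequential(nn.Conv2d(in_c, out_c, 3, 1, 1, bias=False), DomainNorm(out_c))

class SemanticExtractionBranches(nn.Module):
    """First branch (CDR + CD) captures complex non-linear information;
    second branch (CD) preserves the original features; the outputs are cascaded."""
    def __init__(self, c):
        super().__init__()
        self.branch1 = nn.Sequential(cdr(c, c), cd(c, c))
        self.branch2 = cd(c, c)

    def forward(self, f2):
        return torch.cat([self.branch1(f2), self.branch2(f2)], dim=1)
```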
Step S34, cascading the first aggregation feature with the second aggregation feature to obtain semantic features.
The first aggregated feature is extracted by the first extraction branch and the second aggregated feature by the second extraction branch; cascading the two yields the aggregated feature, i.e., the final semantic feature. The extracted semantic feature not only contains deep complex information but also retains the original features, and can therefore better express the semantic information of the stereoscopic image. Optionally, the extracted semantic feature is at 1/4 scale of the original image, which facilitates subsequent parallax estimation.
As shown in fig. 2, which illustrates the structure of the multi-scale semantic feature aggregation module proposed in this embodiment: in the semantic feature extraction process, the deep feature map F s is input into adaptive filters with kernels K of 1, 3, and 5, respectively, to extract initial semantic features of different scales; these initial semantic features are cascaded, convolved, and upsampled to obtain the initial aggregated feature F 2; F 2 is then input into the two extraction branches, and the first aggregated feature and second aggregated feature they produce are cascaded to obtain the semantic feature.
In this embodiment, semantic features of different scales are first extracted, so that the network can adapt to images from different domains and the domain generalization capability is improved; during feature aggregation, different extraction branches are used so that the original features are preserved as much as possible while complex features are extracted, which improves the accuracy of semantic feature extraction.
In the embodiment of the application, the multi-scale texture feature aggregation module is used to extract texture features. The process of performing multi-scale texture feature extraction and aggregation on the shallow feature map of a stereoscopic image to obtain its texture features includes the following steps:
and S41, carrying out statistical feature extraction on the shallow feature map to obtain statistical texture features corresponding to the shallow feature map.
In one possible implementation, the extraction of one-dimensional texture features is performed first, resulting in statistical texture features. The process of extracting the one-dimensional texture features comprises steps S411-S413, as follows:
and S411, carrying out feature extraction on the shallow feature map by using a channel attention mechanism to obtain an average feature vector corresponding to the shallow feature map.
The channel attention mechanism can filter noise or redundant information to better cope with noise or interference in the input shallow feature map. Therefore, the shallow feature map is first processed with a channel attention mechanism, which can be realized by a channel attention sub-module. The channel attention sub-module uses a three-dimensional arrangement to preserve information in three-dimensional space and amplifies cross-dimensional channel-spatial dependencies through a two-layer perceptron, so that channels with low attention weights can be suppressed to filter noise or redundant information, improving the robustness and generalization capability of the model.
As shown in fig. 3, the channel attention sub-module comprises an adaptive average pooling layer and an adaptive max pooling layer. The shallow feature map F t is input into the adaptive average pooling layer and the adaptive max pooling layer respectively for feature extraction, the extracted features are fed into a multi-layer perceptron (MLP), the MLP outputs are added, and an activation function (sigmoid) is applied to obtain the average feature vector a.
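A minimal sketch of such a channel attention sub-module is given below (the reduction ratio of the two-layer perceptron is an assumption).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Adaptive average pooling and adaptive max pooling, a shared two-layer perceptron,
    summation, and a sigmoid, producing the average feature vector a (one weight per channel)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))

    def forward(self, f_t):
        avg = self.mlp(self.avg_pool(f_t).flatten(1))   # (B, C)
        mx = self.mlp(self.max_pool(f_t).flatten(1))    # (B, C)
        return torch.sigmoid(avg + mx)                  # average feature vector a
```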
Step S412, quantize the shallow feature map based on the average feature vector to obtain a one-dimensional quantized encoded feature, where the one-dimensional quantized encoded feature includes texture level information corresponding to different pixels in the shallow feature map.
The average feature vector may be used to quantize the shallow feature map, the quantization process comprising the steps of:
and step 1, calculating the similarity between the average feature vector and the shallow feature map to obtain a similarity matrix.
Cosine similarity between the average feature vector a and the shallow feature map F t is calculated to obtain the similarity matrix S ∈ R^(1×H×W).
And 2, carrying out one-dimensional level quantization on the similarity matrix to obtain quantized features of different levels, wherein the different quantized features correspond to different texture intensities.
The similarity matrix is two-dimensional, so it is first reshaped into a one-dimensional similarity matrix S ∈ R^(1×HW). Texture levels are then quantized to obtain the different quantization features.
The quantization feature L n of the n-th level is computed from the similarity values, where n ranges over {1, 2, 3, …, N} and N is the total number of levels. Illustratively, N is 128.
And step 3, mapping the similarity matrix based on different quantization characteristics to obtain one-dimensional quantization coding characteristics.
After the quantization features are obtained, each S i (i ∈ {1, 2, 3, …, HW}) in the one-dimensional similarity matrix falls into a corresponding L n interval. The similarity matrix is then mapped based on the quantization features to obtain the one-dimensional quantized coding features, i.e., a one-dimensional quantized coding matrix E ∈ R^(N×HW).
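Since the quantization and mapping formulas are not reproduced in this text, the sketch below only illustrates the idea under stated assumptions: N uniformly spaced quantization levels over the similarity range and a soft assignment of every similarity value to its nearest level.

```python
import torch

def one_dim_quantize(similarity, num_levels=128):
    """similarity: (B, 1, H, W) similarity between a and F_t.
    Returns the quantization levels L (B, N, 1) and the encoding matrix E (B, N, HW).
    Assumes uniformly spaced levels and a soft nearest-level assignment."""
    b, _, h, w = similarity.shape
    s = similarity.reshape(b, 1, h * w)                              # S in R^(1 x HW)
    s_min = s.min(dim=2, keepdim=True).values
    s_max = s.max(dim=2, keepdim=True).values
    n = torch.arange(1, num_levels + 1, device=s.device, dtype=s.dtype)
    levels = s_min + (s_max - s_min) * n.view(1, num_levels, 1) / num_levels   # L_n
    width = (s_max - s_min) / num_levels + 1e-6
    e = torch.relu(1.0 - (s - levels).abs() / width)                 # soft one-hot encoding E
    return levels, e
```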
and S413, aggregating the one-dimensional quantized coded features and the average feature vectors to obtain statistical texture features corresponding to the shallow feature map.
After the one-dimensional quantized coding features are obtained, they are supplemented with the average feature vector; the one-dimensional quantized coding features and the average feature vector are aggregated to obtain the statistical texture features. The aggregation process comprises the following steps:
step 1, cascading the average value of the one-dimensional quantized coding features with the quantized features to obtain counting features.
First, the feature mean of the one-dimensional quantized coding features is calculated by averaging E over the HW dimension, yielding an N×1 vector.
This mean is then cascaded with the quantization features L ∈ R^(N×1) to obtain the count matrix (i.e., count feature) C ∈ R^(N×2).
And step 2, cascading the counting features with the average feature vectors to obtain statistical texture features corresponding to the shallow feature map.
First, the count feature C ∈ R^(N×2) is upsampled, and the average feature vector a is simultaneously reshaped and upsampled; the upsampled count feature and average feature vector are then cascaded to obtain the statistical texture feature M.
Schematically, fig. 3 shows the structure of the one-dimensional texture feature extraction module. First, the shallow feature map F t is input into the channel attention module for feature extraction to obtain the average feature vector a, and the cosine similarity between the average feature vector and the shallow feature map is calculated to obtain the similarity matrix S ∈ R^(1×H×W), which is reshaped into the one-dimensional similarity matrix S ∈ R^(1×HW). Quantizing the one-dimensional similarity matrix yields the quantization features (L 1 to L N), and mapping the one-dimensional similarity matrix based on these quantization features yields the one-dimensional quantized coding features E ∈ R^(N×HW). The mean of E is computed and cascaded with the quantization features to obtain the count feature C ∈ R^(N×2); C is upsampled and then cascaded with the vector obtained by reshaping and upsampling the average feature vector, giving the final statistical texture feature M.
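A hedged sketch of the aggregation in steps S412-S413 follows; the "upsampling" of the count feature and of the average feature vector along the feature dimension is realized here with small linear layers, which is an assumption about the patent's layers, as are the class name and feature width.

```python
import torch
import torch.nn as nn

class StatTextureAggregation(nn.Module):
    """Builds the statistical texture feature M from the quantized encoding E,
    the quantization levels L, and the average feature vector a."""
    def __init__(self, channels, feat_dim=64):
        super().__init__()
        self.up_count = nn.Linear(2, feat_dim)         # count feature (N, 2) -> (N, feat_dim)
        self.up_avg = nn.Linear(channels, feat_dim)    # reshaped a -> (N, feat_dim)

    def forward(self, e, levels, a):
        # e: (B, N, HW), levels: (B, N, 1), a: (B, C)
        b, n, _ = e.shape
        count = torch.cat([e.mean(dim=2, keepdim=True), levels], dim=2)   # count feature C: (B, N, 2)
        count = self.up_count(count)                                      # upsampled count feature
        avg = self.up_avg(a).unsqueeze(1).expand(b, n, -1)                # reshaped, upsampled a
        return torch.cat([count, avg], dim=2)                             # statistical texture feature M
```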
And S42, carrying out feature extraction on the statistical texture features in different scales to obtain the statistical texture local features in different scales.
The statistical texture features are one-dimensional features, and two-dimensional features are extracted to obtain the spatial distribution features of pixels. In the two-dimensional feature extraction process, features with different scales are also carried out, and statistical texture local features with different scales are obtained. The method for extracting the statistical texture local features comprises the following steps of:
and step 1, carrying out feature aggregation on the statistical texture features and the one-dimensional quantized coding features to obtain two-dimensional texture features.
In this process, the one-dimensional quantized coding features E ∈ R^(N×HW) and the statistical texture features M are multiplied to obtain the two-dimensional texture feature F mid ∈ R^(C*×H×W).
And 2, carrying out feature extraction on the two-dimensional texture features based on different quantization levels to obtain statistical texture local features with different scales.
In this process, a two-dimensional texture feature extraction module is used to extract features from the two-dimensional texture features.
When the two-dimensional texture feature extraction module performs feature extraction, a method similar to the one-dimensional texture feature extraction is adopted. Specifically, the one-dimensional quantized coding features are first reshaped so that each pixel position has an N-dimensional coding vector; for two adjacent pixel points (i, j) and (i, j+1), matrix multiplication of their coding vectors gives a matrix, and multiplying the coding matrices of adjacent pixel points in this way yields a two-dimensional coding matrix containing neighborhood features.
Then, the mean of the two-dimensional coding matrix is calculated and cascaded with the previous quantization features to obtain a count matrix. A feature vector is obtained by applying the channel attention mechanism to the feature map, the count (statistical) matrix is upsampled, and the two are cascaded to obtain the statistical texture local feature.
In the embodiment of the application, as for the two-dimensional texture features, two-dimensional texture feature extraction modules with different quantization levels are adopted to respectively perform feature extraction, so as to obtain statistical texture local features with different scales.
Illustratively, two-dimensional texture feature extraction modules with three different quantization levels perform feature extraction, yielding three statistical texture local features of different scales.
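The following sketch gives one possible reading of the two-dimensional texture (TDT) extraction, based on the co-occurrence of quantized codes of horizontally adjacent pixels; the choice of neighbors, the averaging, and the linear "upsampling" layer are assumptions.

```python
import torch
import torch.nn as nn

class TwoDimTexture(nn.Module):
    """Multiply the quantized codes of adjacent pixels (i, j) and (i, j+1), average the
    resulting co-occurrence matrix, cascade it with the quantization levels, and upsample."""
    def __init__(self, num_levels, feat_dim=64):
        super().__init__()
        self.up = nn.Linear(num_levels + 1, feat_dim)

    def forward(self, e, levels, h, w):
        # e: (B, N, HW) one-dimensional quantized coding, levels: (B, N, 1)
        b, n, _ = e.shape
        e2d = e.reshape(b, n, h, w)
        left, right = e2d[..., :-1], e2d[..., 1:]                            # adjacent pixel codes
        cooc = torch.einsum('bnhw,bmhw->bnm', left, right) / (h * (w - 1))   # (B, N, N)
        count = torch.cat([cooc, levels], dim=2)                             # cascade with levels
        return self.up(count)                                                # local texture feature
```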
And S43, carrying out feature aggregation on the statistical texture local features with different scales to obtain texture features.
And then, feature aggregation can be carried out on the statistical texture local features with different scales, and in the aggregation process, the features with different scales are unified by carrying out up-sampling operation, so that cascading is facilitated. And the up-sampling process can obtain the maximum resolution of the image, so that more detail information can be reserved, and the performance of extracting texture features by the model can be improved. In one possible implementation, a bilinear interpolation operation is used to upsample to the size of the original input image, which is followed by a cascade.
In combination with the above example, the three statistical texture local features are each upsampled using bilinear interpolation and then cascaded.
After cascading, two convolution layers with a stride of 2 and a kernel size of 3×3 downsample the result to obtain the aggregated texture feature F ta (at 1/4 scale of the input image).
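A brief sketch of this final aggregation step is shown below, assuming each statistical texture local feature has already been projected back to a spatial feature map; the channel counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureAggregation(nn.Module):
    """Upsample the multi-scale texture local features to the input image size with bilinear
    interpolation, cascade them, and downsample with two stride-2 3x3 convolutions (to 1/4 scale)."""
    def __init__(self, in_c, out_c=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_c, out_c, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_c, out_c, 3, stride=2, padding=1))

    def forward(self, local_feats, image_size):
        ups = [F.interpolate(f, size=image_size, mode='bilinear', align_corners=False)
               for f in local_feats]
        return self.encoder(torch.cat(ups, dim=1))      # aggregated texture feature F_ta
```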
Schematically, fig. 4 shows the structure of the multi-scale texture feature aggregation module. First, a One-Dimensional Texture feature extraction (ODT) module extracts features from the shallow feature map to obtain the statistical texture feature M and the one-dimensional quantized coding feature E; M and E are multiplied to obtain the two-dimensional texture feature F mid; Two-Dimensional Texture feature extraction (TDT) modules with different quantization levels then extract statistical texture local features of different scales; finally, the three local features are upsampled, cascaded, and fed into an encoder formed by 2 convolution layers for aggregation to obtain the final texture feature F ta.
By extracting texture features with different scales, the texture information can be fully extracted, and the accuracy of extracting the texture features is improved.
After extracting semantic features and texture features, image features are obtained through aggregation. And cascading the image features of the two stereoscopic images based on the parallax level to obtain an initial cost body. And then, carrying out multi-scale feature aggregation on the initial cost body to obtain an optimized target cost body so as to carry out parallax estimation through the target cost body.
In this embodiment, the initial cost body is subjected to multi-scale feature aggregation by a cyclic hourglass aggregation network. The cyclic hourglass aggregation network avoids the huge number of parameters brought by 3D convolution while achieving comparable performance; the gated recurrent units (GRUs) it contains can selectively update hidden states and reset past information, flexibly controlling the degree of dependence on context, making comprehensive use of semantic and texture features, and improving the robustness and generalization performance of the model.
The process for carrying out multi-scale feature aggregation on the initial cost body comprises the following steps:
and step S71, preprocessing the initial cost body to obtain a combined cost body after feature combination.
First, the features in the initial cost body are merged using cascaded GRU layers. Alternatively, a concatenation of 4 GRU layers may be employed, with a skip connection between the latter two GRU layers, to merge the features in the initial cost body.
And step S72, inputting the combined cost body into at least two circulating hourglass modules connected in series to sequentially perform feature extraction to obtain a final cost body output by the last circulating hourglass module, and taking the final cost body as a target cost body, wherein the circulating hourglass module is used for extracting and aggregating multi-scale features.
In this embodiment, at least two circulating hourglass modules are connected in series, and the plurality of circulating hourglass modules can capture more context information under the conditions of weak texture and no texture, so as to enhance generalization of the network. And the circular hourglass module can perform feature extraction and aggregation of different scales on the cost body. Each circulating hourglass module outputs a cost body, and optionally, the cost body output by the last circulating hourglass module is used as a target cost body.
Alternatively, three periodically stacked cyclic hourglass modules may be connected in series. In each cyclic hourglass module, the cost body undergoes two downsampling operations. Since the resolution of the initial cost body is 1/4, resolutions of 1/8 and 1/16 are obtained after the two downsampling operations; decoding then starts with two deconvolution operations, gradually upsampling back to 1/4 of the original image resolution. Skip connections are also used in the cyclic hourglass module, which increases opportunities for cross-scale information exchange, enriches the acquisition of context information, and improves the robustness of the network in ill-posed regions as well as its domain generalization.
After the target cost body is obtained, trilinear interpolation and parallax regression can be applied to it to obtain the final predicted parallax map.
As shown in fig. 5, the 4D cost body is input into the cascaded GRU layers for preprocessing to obtain the merged cost body, where the latter two GRU layers are connected with a skip connection. The merged cost body is then input into 3 serially connected cyclic hourglass modules for feature extraction and aggregation to obtain the final target cost body, and parallax estimation is performed based on the target cost body to obtain the parallax map (output 3).
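A possible rendering of one cyclic hourglass module is sketched below, with the GRU realized as a convolutional GRU cell that scans the cost body slice by slice along the parallax dimension; the specific ConvGRU formulation, the channel handling, and the requirement that H and W be divisible by 4 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGRUCell(nn.Module):
    """A 2D convolutional GRU cell: update/reset gates and a candidate state."""
    def __init__(self, in_c, hid_c, k=3):
        super().__init__()
        self.zr = nn.Conv2d(in_c + hid_c, 2 * hid_c, k, padding=k // 2)
        self.cand = nn.Conv2d(in_c + hid_c, hid_c, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new

class RecurrentHourglass(nn.Module):
    """One cyclic hourglass module: two downsamplings (1/4 -> 1/8 -> 1/16), recurrent
    aggregation at each scale, two deconvolutions back to 1/4, with skip connections."""
    def __init__(self, c):
        super().__init__()
        self.down1 = nn.Conv2d(c, c, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(c, c, 3, stride=2, padding=1)
        self.gru1, self.gru2 = ConvGRUCell(c, c), ConvGRUCell(c, c)
        self.up1 = nn.ConvTranspose2d(c, c, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(c, c, 4, stride=2, padding=1)

    def forward(self, cost):
        # cost: (B, C, D, H, W), H and W assumed divisible by 4
        b, c, d, h, w = cost.shape
        h1 = cost.new_zeros(b, c, h // 2, w // 2)
        h2 = cost.new_zeros(b, c, h // 4, w // 4)
        out = []
        for i in range(d):                              # scan along the parallax dimension
            x = cost[:, :, i]
            e1 = F.relu(self.down1(x))
            h1 = self.gru1(e1, h1)                      # recurrent state at 1/8
            e2 = F.relu(self.down2(h1))
            h2 = self.gru2(e2, h2)                      # recurrent state at 1/16
            d1 = F.relu(self.up1(h2)) + h1              # skip connection
            out.append(self.up2(d1) + x)                # skip connection back to 1/4
        return torch.stack(out, dim=2)                  # refined cost body
```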
In the above embodiments, a stereo matching method based on multi-feature aggregation is described, which is implemented based on a trained stereo matching network. The training method of the stereo matching network comprises the following steps of:
step one, a sample image pair is obtained, and the sample image pair is marked with a parallax true value.
The sample image pair is also a stereoscopic image of the same scene at different perspectives. And the sample image pair has parallax true value for the supervision training of the stereo matching network.
And secondly, shallow feature extraction is carried out on each sample three-dimensional image in the sample image pair by using a shallow feature extraction network to obtain a sample shallow feature image, and deep feature extraction is carried out on the sample three-dimensional image by using a deep feature extraction network to obtain a sample deep feature image.
Thirdly, carrying out multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the sample deep feature map of each sample stereo image by utilizing a multi-scale semantic feature aggregation module to obtain sample semantic features of each sample stereo image.
And step four, carrying out multi-scale texture feature extraction and multi-scale texture feature aggregation on the shallow feature map of each sample stereo image by utilizing a multi-scale texture feature aggregation module to obtain sample texture features of each sample stereo image.
And fifthly, aggregating sample semantic features and sample texture features of each three-dimensional image to obtain sample image features of each sample three-dimensional image.
The implementation manners of the second to fifth steps may refer to the feature extraction process of the stereoscopic image in the above embodiment, which is not described herein again.
And step six, connecting the image features of all the sample stereo images in the sample image pair to obtain an initial sample cost body.
The implementation of step six may refer to the process of concatenating image features to obtain the initial cost volume in the foregoing embodiments and is not repeated here.
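For reference, a minimal sketch of the concatenation-based initial cost volume is given below, assuming left and right feature maps at 1/4 resolution and a maximum disparity of 192 (48 levels at 1/4 scale); the helper name `build_cost_volume` is illustrative.

```python
import torch

def build_cost_volume(feat_left, feat_right, max_disp_quarter=48):
    """feat_left, feat_right: (B, C, H/4, W/4) image features of the two views.
    Returns a 4D cost volume (B, 2C, D/4, H/4, W/4) formed by concatenation."""
    b, c, h, w = feat_left.shape
    cost = feat_left.new_zeros(b, 2 * c, max_disp_quarter, h, w)
    for d in range(max_disp_quarter):
        if d == 0:
            cost[:, :c, d] = feat_left
            cost[:, c:, d] = feat_right
        else:
            # shift the right-view features by d pixels before concatenating
            cost[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            cost[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return cost
```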
Step seven: multi-scale feature aggregation is performed on the initial sample cost volume with the cyclic hourglass aggregation network to obtain an optimized target sample cost volume.
The cyclic hourglass aggregation network comprises at least two cyclic hourglass modules, each of which outputs one sample cost volume. The target sample cost volume comprises the sample cost volumes output by all of the cyclic hourglass modules.
Step eight: disparity estimation is performed based on the target sample cost volume to obtain predicted disparity maps between the sample image pair.
Disparity estimation is performed based on each sample cost volume in the target sample cost volume, yielding multiple predicted disparity maps between the sample image pair, and the stereo matching network is trained with the differences between these predicted disparity maps and the ground-truth disparity.
Schematically, as shown in the structure diagram of fig. 5, when feature aggregation is performed on the initial sample cost volume with the cyclic hourglass aggregation network, the three cyclic hourglass modules produce three sample cost volumes, and disparity estimation is performed on each of them, yielding three predicted disparity maps (output1, output2 and output3).
Step nine: the stereo matching network is trained based on the differences between the predicted disparity maps and the ground-truth disparity.
During training, the network is trained with the smooth $L_1$ loss, defined as:

$$L=\frac{1}{N}\sum_{i=1}^{N}\mathrm{smooth}_{L_{1}}\!\left(d_{i}^{gt}-\hat{d}_{i}\right)$$

where

$$\mathrm{smooth}_{L_{1}}(x)=\begin{cases}0.5x^{2}, & \text{if }|x|<1\\ |x|-0.5, & \text{otherwise}\end{cases}$$

and $N$ is the number of labeled pixels, $d^{gt}$ is the ground-truth disparity, and $\hat{d}$ is the predicted disparity.
In combination with the above example, the stereo matching network is trained with the three predicted disparity maps, and the total loss of the network is:

$$L_{total}=\mu_{1}L_{1}+\mu_{2}L_{2}+\mu_{3}L_{3}$$

where $\mu_{1}$, $\mu_{2}$ and $\mu_{3}$ are the weights of the losses computed from the three predicted disparity maps.
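A sketch of this supervision in PyTorch is shown below; the valid-pixel mask and the example weight values are assumptions (the actual values of μ1, μ2 and μ3 are not fixed here).

```python
import torch
import torch.nn.functional as F

def stereo_loss(pred_disps, gt_disp, max_disp=192, weights=(0.5, 0.7, 1.0)):
    """pred_disps: list of predicted disparity maps (output1, output2, output3).
    gt_disp: ground-truth disparity; only labeled pixels inside the disparity
    range contribute to the loss. The weight values are illustrative only."""
    mask = (gt_disp > 0) & (gt_disp < max_disp)   # valid (labeled) pixels
    total = 0.0
    for w, pred in zip(weights, pred_disps):
        total = total + w * F.smooth_l1_loss(pred[mask], gt_disp[mask])
    return total
```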
The stereo matching network is trained with this total loss, and training stops once the total loss converges.
The matching performance of the stereo matching method based on multi-feature aggregation provided by the embodiments of the application is verified below.
The stereo matching network is trained with the Adam optimizer (β1 = 0.9, β2 = 0.999). The datasets are first preprocessed, with images randomly cropped to H = 256, W = 512. The experiments use the Scene Flow and KITTI datasets. Scene Flow is a synthetic dataset of 39,824 RGB images at 960×540 resolution, containing 35,454 training images and 4,370 test images, and provides dense disparity maps as ground truth. KITTI 2012 is a real driving-scene dataset comprising 194 training image pairs and 195 test image pairs with sparse LIDAR-derived ground-truth disparity; its image resolution is 1226×370. KITTI 2015 likewise contains real-scene images, with 200 training image pairs and 200 test image pairs at 1242×375 resolution.
The model is trained from scratch on the Scene Flow dataset for 30 epochs with a batch size of 3 and a maximum disparity D_max of 192, using a constant learning rate of 0.001 in every epoch; the best trained model is then evaluated directly on the Scene Flow test set. For the KITTI datasets, the optimal Scene Flow model is fine-tuned on their training sets for 200 epochs with a batch size of 4, using a constant learning rate of 0.001 for the first 100 epochs and 0.0001 for the last 100. The fine-tuned models are evaluated on the KITTI test sets and submitted to the KITTI evaluation server. Because the Middlebury 2014 and ETH3D datasets contain few images, both are evaluated with the model trained only on Scene Flow.
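The optimizer and learning-rate schedule described above can be set up as sketched below; the stand-in model is a placeholder, and the `MultiStepLR` milestone simply mirrors the 100-epoch learning-rate drop used for KITTI fine-tuning.

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the stereo matching network described above.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Scene Flow pre-training: 30 epochs at a constant learning rate of 0.001.
# KITTI fine-tuning: 200 epochs, with the learning rate dropped to 0.0001 after epoch 100.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)

for epoch in range(200):
    # ... one training pass over the KITTI training pairs (random 256x512 crops) ...
    scheduler.step()
```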
The performance of the proposed algorithm is tested on the KITTI 2012 test set and compared with other algorithms, as shown in fig. 6, where the leftmost column shows the left input image, the first row shows each algorithm's predicted disparity map and the second row its error map. Qualitatively, inside the dashed rectangles the proposed method produces noticeably more accurate estimates. Quantitatively, at the 3-pixel threshold the error-pixel percentage is 1.38% in non-occluded regions and 1.77% over the whole image, both lower than the other stereo matching networks.
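For clarity, the 3-pixel error-pixel percentage quoted above is the standard bad-pixel rate, which can be computed as in the following sketch (the relative-error component used in some KITTI metric variants is omitted as a simplification).

```python
import torch

def bad_pixel_rate(pred_disp, gt_disp, threshold=3.0):
    """Percentage of valid pixels whose absolute disparity error exceeds `threshold`."""
    valid = gt_disp > 0                       # sparse ground truth: 0 marks unlabeled pixels
    err = (pred_disp - gt_disp).abs()
    bad = (err > threshold) & valid
    return 100.0 * bad.sum().float() / valid.sum().clamp(min=1).float()
```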
The performance of the algorithm is also tested on the Middlebury 2014 training set and compared with the PSMNet algorithm, as shown in fig. 7. Qualitatively, the disparity estimates of the proposed method in real scenes are more accurate and its domain generalization is better. Quantitatively, the threshold error rates of the algorithm on the half and quarter subsets of Middlebury 2014 are 13.9% and 8.8% respectively, lower than other algorithms, and compared with the latest domain-invariant algorithm FC-PSMNet they are 24% lower on the half subset and 27% lower on the quarter subset.
The above is only a preferred embodiment of the present application, and the present application is not limited to the above examples. It is to be understood that other modifications and variations which may be directly derived or contemplated by those skilled in the art without departing from the spirit and concepts of the present application are deemed to be included within the scope of the present application.

Claims (8)

1. A stereo matching method based on multi-feature aggregation, the method comprising:
acquiring a stereo image pair to be matched;
carrying out shallow feature extraction on each stereoscopic image in the stereoscopic image pair to obtain a shallow feature map, and carrying out deep feature extraction on each stereoscopic image to obtain a deep feature map;
carrying out multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the deep feature map of each stereoscopic image to obtain semantic features of each stereoscopic image;
Multi-scale texture feature extraction and multi-scale texture feature aggregation are carried out on the shallow feature map of each stereoscopic image to obtain texture features of each stereoscopic image, and the method comprises the following steps: extracting features of the shallow feature map by using a channel attention mechanism to obtain an average feature vector corresponding to the shallow feature map, quantizing the shallow feature map based on the average feature vector to obtain one-dimensional quantized coding features, wherein the one-dimensional quantized coding features comprise texture grade information corresponding to different pixel points in the shallow feature map, and aggregating the one-dimensional quantized coding features and the average feature vector to obtain statistical texture features corresponding to the shallow feature map; carrying out feature extraction of different scales on the statistical texture features to obtain statistical texture local features of different scales; performing feature aggregation on different statistical texture local features to obtain the texture features;
Aggregating the semantic features and the texture features of each stereoscopic image to obtain image features of each stereoscopic image;
concatenating the image features of all the stereoscopic images in the stereoscopic image pair to obtain an initial cost volume;
performing multi-scale feature aggregation on the initial cost volume to obtain an optimized target cost volume;
and performing disparity estimation based on the target cost volume to obtain a disparity map between the stereoscopic image pair.
2. The method according to claim 1, wherein the performing multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the deep feature map of each stereoscopic image to obtain semantic features of each stereoscopic image includes:
Extracting features of the deep feature map by using adaptive filters with different kernel sizes to obtain initial semantic features with different scales;
The initial semantic features with different scales are aggregated to obtain initial aggregated features;
Performing feature extraction on the initial aggregation feature by using a first extraction branch to obtain a first aggregation feature, and performing feature extraction on the initial aggregation feature by using a second extraction branch to obtain a second aggregation feature;
And cascading the first aggregation feature with the second aggregation feature to obtain the semantic feature.
3. The method according to claim 2, wherein:
The adaptive filter comprises a first branch and a second branch, wherein the first branch comprises a convolution layer, and the second branch comprises an adaptive average pooling layer and a convolution layer;
The first extraction branch consists of a CDR module and a CD module, wherein the CDR module comprises a convolution layer, a domain normalization layer and a ReLU layer, and the CD module comprises the convolution layer and the domain normalization layer;
The second extraction branch is composed of the CD module.
4. The method of claim 1, wherein quantizing the shallow feature map based on the average feature vector results in a one-dimensional quantized encoded feature, comprising:
Calculating the similarity between the average feature vector and the shallow feature map to obtain a similarity matrix;
carrying out one-dimensional level quantization on the similarity matrix to obtain quantization features of different levels, wherein the different quantization features correspond to different texture intensities;
and mapping the similarity matrix based on different quantization characteristics to obtain the one-dimensional quantization coding characteristics.
5. The method of claim 4, wherein the aggregating the one-dimensional quantized coded features with the average feature vector to obtain statistical texture features corresponding to the shallow feature map comprises:
Cascading the average value of the one-dimensional quantized coding features with the quantized features to obtain counting features;
And cascading the counting feature with the average feature vector to obtain the statistical texture feature corresponding to the shallow feature map.
6. The method according to claim 1, wherein the performing feature extraction on the statistical texture features with different scales to obtain statistical texture local features with different scales includes:
performing feature aggregation on the statistical texture features and the one-dimensional quantized coding features to obtain two-dimensional texture features;
And carrying out feature extraction on the two-dimensional texture features based on different quantization levels to obtain statistical texture local features with different scales.
7. The method according to any one of claims 1 to 6, wherein the performing multi-scale feature aggregation on the initial cost volume to obtain an optimized target cost volume comprises:
preprocessing the initial cost volume to obtain a merged cost volume after feature merging;
and inputting the merged cost volume into at least two cyclic hourglass modules connected in series for sequential feature extraction, obtaining a final cost volume output by the last cyclic hourglass module, and taking the final cost volume as the target cost volume, wherein the cyclic hourglass modules are used for extracting and aggregating multi-scale features.
8. The method of any one of claims 1 to 6, wherein the method is for a stereo matching network comprising a shallow feature extraction network, a deep feature extraction network, a multi-scale semantic feature aggregation module, a multi-scale texture feature aggregation module, and a cyclic hourglass aggregation network, the method further comprising:
obtaining a sample image pair, wherein the sample image pair is annotated with a ground-truth disparity;
carrying out shallow feature extraction on each sample stereo image in the sample image pair by utilizing the shallow feature extraction network to obtain a sample shallow feature map, and carrying out deep feature extraction on each sample stereo image by utilizing the deep feature extraction network to obtain a sample deep feature map;
Carrying out multi-scale semantic feature extraction and multi-scale semantic feature aggregation on the sample deep feature map of each sample stereo image by utilizing the multi-scale semantic feature aggregation module to obtain sample semantic features of each sample stereo image;
carrying out multi-scale texture feature extraction and multi-scale texture feature aggregation on the sample shallow feature map of each sample stereo image by utilizing the multi-scale texture feature aggregation module to obtain sample texture features of each sample stereo image;
aggregating the sample semantic features and the sample texture features of each sample stereo image to obtain sample image features of each sample stereo image;
concatenating the image features of all the sample stereo images in the sample image pair to obtain an initial sample cost volume;
performing multi-scale feature aggregation on the initial sample cost volume by using the cyclic hourglass aggregation network to obtain an optimized target sample cost volume;
performing disparity estimation based on the target sample cost volume to obtain a predicted disparity map between the sample image pair;
and training the stereo matching network based on a difference between the predicted disparity map and the ground-truth disparity.
CN202311177663.6A 2023-09-13 2023-09-13 Stereo matching method based on multi-feature aggregation Active CN117475182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311177663.6A CN117475182B (en) 2023-09-13 2023-09-13 Stereo matching method based on multi-feature aggregation

Publications (2)

Publication Number Publication Date
CN117475182A CN117475182A (en) 2024-01-30
CN117475182B true CN117475182B (en) 2024-06-04

Family

ID=89633794


Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104962A (en) * 2019-11-05 2020-05-05 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN111402129A (en) * 2020-02-21 2020-07-10 西安交通大学 Binocular stereo matching method based on joint up-sampling convolutional neural network
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network
WO2021035067A1 (en) * 2019-08-20 2021-02-25 The Trustees Of Columbia University In The City Of New York Measuring language proficiency from electroencephelography data
CN112766083A (en) * 2020-12-30 2021-05-07 中南民族大学 Remote sensing scene classification method and system based on multi-scale feature fusion
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN113486956A (en) * 2021-07-07 2021-10-08 山东力聚机器人科技股份有限公司 Target segmentation system and training method thereof, and target segmentation method and device
CN114821069A (en) * 2022-05-27 2022-07-29 昆明理工大学 Building semantic segmentation method for double-branch network remote sensing image fused with rich scale features
CN115375746A (en) * 2022-03-31 2022-11-22 西安电子科技大学 Stereo matching method based on double-space pooling pyramid
CN115578615A (en) * 2022-10-31 2023-01-06 成都信息工程大学 Night traffic sign image detection model establishing method based on deep learning
WO2023085624A1 (en) * 2021-11-15 2023-05-19 Samsung Electronics Co., Ltd. Method and apparatus for three-dimensional reconstruction of a human head for rendering a human image
CN116311083A (en) * 2023-05-19 2023-06-23 华东交通大学 Crowd counting model training method and system
CN116740161A (en) * 2023-08-14 2023-09-12 东莞市爱培科技术有限公司 Binocular stereo matching aggregation method
CN116740162A (en) * 2023-08-14 2023-09-12 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368684A (en) * 2020-02-27 2020-07-03 北华航天工业学院 Winter wheat automatic interpretation method based on deformable full-convolution neural network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hierarchical Feature Fusion and Multi-scale Cost Aggregation for Stereo Matching;Jiaquan Zhang等;《IEEE》;20220821;1-6 *
Research and Implementation of Complex Scene Perception Algorithms Based on Stereo Vision; Yang Huitong; 《China Excellent Master's Theses Electronic Journal》; 20230115; pages 18-26 *


Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
Shin et al. Epinet: A fully-convolutional neural network using epipolar geometry for depth from light field images
CN110533712B (en) Binocular stereo matching method based on convolutional neural network
Aich et al. Bidirectional attention network for monocular depth estimation
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN111582483B (en) Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN112150521A (en) PSmNet optimization-based image stereo matching method
CN113160375A (en) Three-dimensional reconstruction and camera pose estimation method based on multi-task learning algorithm
CN111508013B (en) Stereo matching method
CN113962858B (en) Multi-view depth acquisition method
CN108416751A (en) A kind of new viewpoint image combining method assisting full resolution network based on depth
CN114972976B (en) Night target detection and training method and device based on frequency domain self-attention mechanism
Memisevic et al. Stereopsis via deep learning
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN115170746B (en) Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN115830406A (en) Rapid light field depth estimation method based on multiple parallax scales
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN113724308B (en) Cross-waveband stereo matching algorithm based on mutual attention of luminosity and contrast
Liu et al. Multi-Scale Underwater Image Enhancement in RGB and HSV Color Spaces
CN117314990A (en) Non-supervision binocular depth estimation method and system based on shielding decoupling network
CN117475182B (en) Stereo matching method based on multi-feature aggregation
Zuo et al. Accurate depth estimation from a hybrid event-RGB stereo setup
CN110766732A (en) Robust single-camera depth map estimation method
CN116091793A (en) Light field significance detection method based on optical flow fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant