CN113344869A - Driving environment real-time stereo matching method and device based on candidate parallax - Google Patents

Driving environment real-time stereo matching method and device based on candidate parallax

Info

Publication number
CN113344869A
Authority
CN
China
Prior art keywords
parallax
disparity
image
candidate
driving environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110597405.8A
Other languages
Chinese (zh)
Inventor
熊盛武
王晓楠
刘江梁
余涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202110597405.8A priority Critical patent/CN113344869A/en
Publication of CN113344869A publication Critical patent/CN113344869A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/97 - Determining parameters from multiple pictures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10004 - Still image; Photographic image
    • G06T 2207/10012 - Stereo images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20228 - Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a real-time stereo matching method and device for the vehicle driving environment based on candidate disparities. A lightweight convolution unit extracts deep features; a PatchMatch-based algorithm generates a small, discrete set of high-likelihood candidate disparities, which are used in constructing the matching cost volume; cost aggregation is performed on the cost volume with more efficient depthwise separable convolutions whose weights are not shared along the disparity dimension; and finally the binocular disparity is regressed. The stereo matching network designed by the invention makes effective use of computing resources, meets the real-time requirements of the driving environment, obtains a disparity map of higher precision, and has better algorithmic robustness and stronger generalization ability.

Description

Driving environment real-time stereo matching method and device based on candidate parallax
Technical Field
The invention relates to the technical field of machine vision and automatic driving, in particular to a driving environment real-time stereo matching method and device based on candidate parallax.
Background
Autonomous vehicles must be able to perceive the 3D structure of the surrounding environment well, and dense depth maps are required to provide cues for higher-level tasks. Depth data can usually be acquired with a depth camera, lidar, or binocular stereo vision. Considering the requirements and cost of acquiring depth data in a driving environment, the binocular stereo vision approach is more practical. The main method applies the binocular ranging principle: for two epipolar-rectified left and right views, and combining the camera's intrinsic and extrinsic parameters, the depth estimation problem is converted into the problem of matching pixels along the epipolar direction of the image, i.e., estimating their disparity.
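For rectified cameras, the binocular ranging principle above reduces to the triangulation relation Z = f·B/d (depth = focal length × baseline / disparity). A minimal arithmetic sketch, with illustrative camera parameters that are not taken from the invention:

```python
# Depth from disparity for a rectified stereo pair: Z = f * B / d.
# The focal length, baseline, and disparity values below are illustrative,
# not parameters from this patent.
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Return depth in meters for one matched pixel pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A nearby obstacle produces a large disparity, a distant one a small disparity.
near = depth_from_disparity(focal_px=720.0, baseline_m=0.54, disparity_px=96.0)
far = depth_from_disparity(focal_px=720.0, baseline_m=0.54, disparity_px=4.0)
```

This also shows why a bounded candidate set works: beyond a maximum disparity (192 in this document), depth changes only marginally.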
In recent years, the development of deep learning has also triggered a wave of applying deep neural networks to the stereo matching problem. Researchers design end-to-end architectures that handle image depth feature extraction, matching cost computation, cost aggregation, and disparity computation. To improve accuracy, PSMNet constructs a 4D matching cost volume (C × H × W × D) when matching left and right view pixels and processes it with expensive 3D convolutions (over H × W × D), which leads to high computational complexity and memory occupation and wastes most of the computation where it is least needed. Another idea adopts multi-stage disparity map prediction: AnyNet learns a coarse disparity map at low resolution in an initial stage, then in subsequent stages constructs a residual cost volume within a local disparity search range and learns a disparity residual map to correct the initial prediction, gradually restoring the disparity map to the original image resolution.
In the process of implementing the invention, the inventor of the application finds that the following technical problems exist in the prior art:
Although multi-stage prediction methods can obtain disparity maps in real time, the accuracy of multi-stage stereo matching depends on the matching precision of the initial stage, and in practice this prediction mode is limited by that initial precision. Multi-task networks combined with semantic segmentation usually either train a separate model for each task or build an end-to-end architecture that processes them jointly so the two tasks guide each other's predictions, which greatly improves accuracy; however, independently trained multi-task models require two forward passes, end-to-end multi-task architectures have complex networks and depend on more label information, and driving-environment datasets with both segmentation labels and disparity labels are scarce. Therefore, current real-time stereo matching methods have many shortcomings when used in real driving scenes.
Disclosure of Invention
The invention provides a driving environment real-time stereo matching method and device based on candidate parallax, which are used for solving or at least partially solving the technical problem of low parallax prediction precision in the prior art.
In order to solve the above technical problem, a first aspect of the present invention provides a driving environment real-time stereo matching method based on candidate disparity, including:
s1: acquiring an original data set, preprocessing the original data set, taking a left image in the original data set as a reference image, taking a right image in the original data set as a corresponding target image, and forming a group of stereo image pairs by the reference image and the target image;
s2: the method comprises the steps of constructing a driving environment real-time stereo matching network, wherein the driving environment real-time stereo matching network comprises a feature extraction module, a candidate parallax calculation module based on a Patch match algorithm, an initial parallax prediction module and a hierarchical parallax optimization module, wherein the feature extraction module is a weight-sharing light-weight twin network and is used for performing depth feature extraction on an input stereo image pair to obtain a left feature image and a right feature image; the candidate parallax calculation module is used for randomly generating parallax values by uniformly dividing each pixel into parallax subspaces based on a Patch match algorithm, and obtaining target candidate parallaxes through a transmission and evaluation strategy, the initial parallax prediction module is used for calculating matching cost, regularizing a matching cost body and performing parallax regression on the basis of parallax sampling vectors to obtain a roughly estimated low-resolution parallax image, and the layering parallax optimization module is used for restoring the low-resolution parallax image into an original resolution parallax image;
s3: inputting the stereo image pair in the preprocessed data set into a constructed driving environment real-time stereo matching network for forward propagation training; then inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method; finally, updating the learning parameters of the iterative model for multiple times according to the gradient to obtain an optimal driving environment real-time stereo matching network model, wherein the learning parameters of the model comprise weight and bias;
s4: and carrying out binocular stereo matching by using the trained real-time stereo matching network model of the driving environment.
In one embodiment, the feature extraction module is a weight-sharing lightweight twin network comprising standard convolution layers and phantom (ghost) convolution feature extraction units. The unit with stride 1 has the structure ghost conv-BN-ghost conv-BN-Relu, and the unit with stride 2 has the structure ghost conv-BN-depthwise conv-ghost conv-BN-Relu, where ghost conv denotes a convolution design in which a core convolution layer plus simple linear operations compute the redundant feature maps, BN is batch normalization, depthwise conv is depthwise convolution, and Relu is the linear rectification function. The input and output of each phantom convolution feature extraction unit are added via a skip connection, and the final outputs are two unary feature maps f_l and f_r of size H/4 × W/4 × C, where H, W denote the height and width of the original input image and C denotes the feature dimension.
In an embodiment, the candidate disparity prediction module is specifically configured to uniformly divide the disparity space of each pixel into N = 10 disparity subspaces, randomly generate a disparity value in each subspace to construct a disparity sampling vector, filter the sampling vector with one-hot convolution kernels so that the N sample values of each pixel propagate to its four neighbors, score each candidate disparity of each pixel, and rebuild an N × H/4 × W/4 disparity sampling vector from the score-weighted average of the candidate disparities for the next iteration, thereby obtaining the target candidate disparities.
In one embodiment, the formula for the candidate disparity evaluation score is defined as follows:

S_{i,j} = ⟨f_l(i), f_r(i + d_{i,j})⟩

where S_{i,j} is the score of the j-th candidate disparity of pixel i, ⟨·,·⟩ denotes the inner product, d_{i,j} is the candidate disparity value obtained by the PatchMatch algorithm, f_l and f_r are the extracted left and right feature maps, and f_r(i + d_{i,j}) denotes warping the extracted right feature map with the acquired disparity sample value.
In one embodiment, the matching cost calculation process includes: concatenating the reference feature map f_l with the corresponding target feature map f_r under each of the obtained candidate disparities, then packing the result into a 4D matching cost volume, where the dimension of the final cost volume is H/4 × W/4 × N × F, H, W denote the height and width of the original input image, N denotes the number of candidate disparities per pixel, and F denotes the feature dimension;
the regularization process includes: performing cost aggregation on the matching cost volume with depthwise separable convolutions whose weights are not shared along the disparity dimension, obtaining a regularized matching cost volume;
the disparity regression uses a differentiable soft argmin operation over the disparity dimension of the regularized matching cost volume to regress a smooth, continuous initial disparity map; its input is the regularized matching cost volume and its output is the initial predicted disparity map of dimension H/4 × W/4 × 1.
In one embodiment, the formula for the N planes of the 4D matching cost volume is defined as follows:

plane_n = plane(f_l, T_{d_n}(f_r)),  n = 1, …, N

where T_{d_n}(·) is the function that translates the right feature map along the disparity dimension, d_n is the disparity dimension information, and plane(·) is the joint feature constructed from the left feature map and the right feature map translated along the disparity dimension.
In one embodiment, the disparity regression uses a soft argmin function to compute an initial disparity from the cost aggregation result, where the soft argmin function converts the predicted cost of each candidate disparity into a normalized probability at each pixel of the reference image, so that an initial disparity map can be computed and back-propagated through to assist network training. The soft argmin calculation formula is defined as follows:

d̂ = Σ_{i=1}^{N} d_i · p(d_i)

p(d_i) = softmax(−c_{d_i})

where d_i (i = 1, …, N) are the N candidate disparities, c_{d_i} is the aggregated matching cost of the pixel at disparity d_i, and p(d_i) is the probability of disparity d_i, obtained by applying the softmax function to the negated regularized matching cost volume. The predicted disparity d̂ is obtained by cumulatively summing the product of each candidate disparity and its probability value.
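The soft argmin regression described above can be checked numerically. The sketch below uses pure Python with made-up candidate disparities and aggregated costs; it illustrates the formula, not the invention's implementation:

```python
import math

# Soft argmin over N candidate disparities: d_hat = sum_i d_i * softmax(-c_i).
# Candidate disparities and costs below are illustrative stand-ins.
def soft_argmin(disparities, costs):
    exps = [math.exp(-c) for c in costs]   # negate costs: low cost -> high score
    total = sum(exps)
    probs = [e / total for e in exps]      # softmax normalization to probabilities
    return sum(d * p for d, p in zip(disparities, probs))

candidates = [10.0, 20.0, 30.0, 40.0]
costs = [5.0, 0.1, 4.0, 6.0]               # disparity 20 has the lowest cost
d_hat = soft_argmin(candidates, costs)     # regressed value lies near 20
```

Because every candidate contributes with a smooth weight, the result is sub-pixel continuous and differentiable, which is what lets the loss gradient flow back through the regression.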
In one embodiment, the hierarchical disparity optimization module is specifically configured to: upsample the initial disparity map from H/4 × W/4 resolution to H/2 × W/2 resolution by bilinear interpolation; concatenate the H/2 × W/2 left image I_l with the bilinearly interpolated disparity map and apply a 3 × 3 convolution to obtain a 32-channel representation tensor; apply 6 residual blocks using dilated convolutions to obtain an H/2 × W/2 disparity residual map; combine the residual map with the interpolated disparity map to obtain the optimized disparity map at the current scale; and recover the original H × W resolution disparity map by iterating this loop.
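One refinement step of this module can be sketched in NumPy as below. The hand-rolled bilinear kernel, the ×2 scaling of pixel-unit disparity values, and the all-zero stand-in for the learned residual are assumptions of this sketch, not details taken from the invention:

```python
import numpy as np

# One hierarchical refinement step: bilinearly upsample the coarse disparity
# map by 2x, then add a residual predicted at the finer scale (here a
# hypothetical all-zero array stands in for the learned residual map).
# Disparity values are scaled by 2 as well, since they are measured in
# pixels of the current resolution -- an assumption of this sketch.
def upsample_disparity_2x(disp: np.ndarray) -> np.ndarray:
    h, w = disp.shape
    ys = (np.arange(2 * h) + 0.5) / 2 - 0.5   # target pixel centers in source coords
    xs = (np.arange(2 * w) + 0.5) / 2 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]
    wx = np.clip(xs - x0, 0, 1)[None, :]
    top = disp[y0][:, x0] * (1 - wx) + disp[y0][:, x1] * wx
    bot = disp[y1][:, x0] * (1 - wx) + disp[y1][:, x1] * wx
    return 2.0 * (top * (1 - wy) + bot * wy)  # 2x value scale for pixel-unit disparity

coarse = np.full((4, 8), 3.0)                 # constant H/4-scale disparity map
residual = np.zeros((8, 16))                  # stand-in for the learned residual
refined = upsample_disparity_2x(coarse) + residual
```

Repeating this step twice takes an H/4 map through H/2 to the full H × W resolution, as the module describes.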
Based on the same inventive concept, the second aspect of the present invention provides a driving environment real-time stereo matching device based on candidate disparity, comprising:
the data set acquisition module is used for acquiring an original data set, preprocessing the original data set, taking a left image in the original data set as a reference image, taking a right image as a corresponding target image, and forming a group of stereo image pairs by the reference image and the target image;
the matching network construction module is used for constructing a driving environment real-time stereo matching network, wherein the driving environment real-time stereo matching network comprises a feature extraction module, a candidate disparity calculation module based on the PatchMatch algorithm, an initial disparity prediction module, and a hierarchical disparity optimization module, wherein the feature extraction module is a weight-sharing lightweight twin network used to extract deep features from an input stereo image pair to obtain left and right feature maps; the candidate disparity calculation module uniformly divides the disparity space of each pixel into subspaces, randomly generates disparity values based on the PatchMatch algorithm, and obtains the target candidate disparities through a propagation-and-evaluation strategy; the initial disparity prediction module computes matching costs, regularizes the matching cost volume, and performs disparity regression on the basis of the disparity sampling vectors to obtain a coarsely estimated low-resolution disparity map; and the hierarchical disparity optimization module restores the low-resolution disparity map to an original-resolution disparity map;
the model training module is used for inputting the stereo image pair in the preprocessed data set into the constructed driving environment real-time stereo matching network for forward propagation training; then inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method; finally, updating the learning parameters of the iterative model for multiple times according to the gradient to obtain an optimal driving environment real-time stereo matching network model, wherein the learning parameters of the model comprise weight and bias;
and the stereo matching module is used for performing binocular stereo matching by using the trained real-time stereo matching network model of the driving environment.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect when executing the program.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
according to the driving environment real-time stereo matching method based on the candidate parallaxes, the light-weight twin network sharing the weight is adopted in the feature extraction process, and the overall calculation amount of the feature extraction network is reduced and the effectiveness of feature extraction is ensured in a mode of a small amount of traditional convolution calculation and a light-weight redundant feature generator. The candidate parallax method based on the Patchmatch reduces the redundancy of full parallax search space calculation, adopts an algorithm based on the Patchmatch to generate discrete, small and high-possibility candidate parallax values, applies the discrete, small and high-possibility candidate parallax values to the construction process of the matching cost body, carries out cost aggregation on the matching cost body, and finally regresses the binocular parallax. The stereo matching network designed by the invention can effectively utilize computing resources, meet the real-time requirement under the driving environment, obtain a disparity map with higher precision, and has better algorithm robustness and stronger generalization capability.
Furthermore, since values outside the candidate set of the disparity space need not be treated as equally probable, the regularization module uses depthwise separable convolutions whose weights are not shared along the disparity dimension, avoiding the excessive computing-resource and memory consumption of 3D convolutions; this meets the real-time requirement and is better suited to the driving environment. The precision of disparity prediction is further improved, and mismatches are reduced, by a ground-truth unimodal supervision loss function.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of calculating binocular disparity for dense pixel level depth information perception in a driving environment according to the present invention;
FIG. 2 is a network flow chart of a driving environment real-time stereo matching method based on candidate parallax provided by the invention;
FIG. 3 is a network structure diagram of a driving environment real-time stereo matching method based on candidate parallax according to the present invention;
FIG. 4 is a ghost conv principle of the lightweight feature extraction process provided by the present invention;
FIG. 5 is a block diagram of a ghost conv phantom convolution feature extraction unit module with step sizes of 1 and 2 according to the present invention;
FIG. 6 is a one-hot convolution kernel of the design of the micro Patch match module provided by the present invention;
FIG. 7 illustrates the principle of translating the right image along the disparity dimension according to the present invention;
FIG. 8 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
fig. 9 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention aims to provide an effective candidate disparity generation method based on the PatchMatch algorithm, which reduces the disparity search range of stereo matching for disparity calculation and hierarchical disparity map optimization, so as to meet the real-time and accuracy requirements of the driving environment.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment provides a driving environment real-time stereo matching method based on candidate parallaxes, which comprises the following steps:
s1: acquiring an original data set, preprocessing the original data set, taking a left image in the original data set as a reference image, taking a right image in the original data set as a corresponding target image, and forming a group of stereo image pairs by the reference image and the target image;
s2: the method comprises the steps of constructing a driving environment real-time stereo matching network, wherein the driving environment real-time stereo matching network comprises a feature extraction module, a candidate parallax calculation module based on a Patch match algorithm, an initial parallax prediction module and a hierarchical parallax optimization module, wherein the feature extraction module is a weight-sharing light-weight twin network and is used for performing depth feature extraction on an input stereo image pair to obtain a left feature image and a right feature image; the candidate parallax calculation module is used for randomly generating parallax values by uniformly dividing each pixel into parallax subspaces based on a Patch match algorithm, and obtaining target candidate parallaxes through a transmission and evaluation strategy, the initial parallax prediction module is used for calculating matching cost, regularizing a matching cost body and performing parallax regression on the basis of parallax sampling vectors to obtain a roughly estimated low-resolution parallax image, and the layering parallax optimization module is used for restoring the low-resolution parallax image into an original resolution parallax image;
s3: inputting the stereo image pair in the preprocessed data set into a constructed driving environment real-time stereo matching network for forward propagation training; then inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method; finally, updating the learning parameters of the iterative model for multiple times according to the gradient to obtain an optimal driving environment real-time stereo matching network model, wherein the learning parameters of the model comprise weight and bias;
s4: and carrying out binocular stereo matching by using the trained real-time stereo matching network model of the driving environment.
Specifically, the data set in S1 includes binocular image pairs composed of the reference images and the corresponding target images, together with the real disparity maps, and all stereo image pairs have undergone horizontal epipolar rectification, that is, corresponding pixels are shifted only in the horizontal direction. In this implementation, all stereo images are preprocessed by subtracting the mean of the pixel intensity values and dividing by their standard deviation. During training, pictures are randomly cropped into blocks of H = 256 and W = 512, with the maximum disparity set to 192.
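The preprocessing just described can be sketched as follows. Per-image normalization and an aligned random crop are one plausible reading of the description; all array sizes besides the 256 × 512 crop are illustrative:

```python
import numpy as np

# Preprocessing sketch for a stereo pair: normalize pixel intensities to zero
# mean / unit std (per image -- the text is ambiguous between per-image and
# dataset statistics), then take an aligned random crop of H=256, W=512.
# The SAME crop window must be used for both views so that horizontal
# epipolar correspondences are preserved.
def preprocess_pair(left, right, crop_h=256, crop_w=512, rng=None):
    rng = rng or np.random.default_rng(0)
    normed = []
    for img in (left, right):
        img = img.astype(np.float64)
        normed.append((img - img.mean()) / (img.std() + 1e-8))
    h, w = normed[0].shape
    y = rng.integers(0, h - crop_h + 1)   # shared crop origin for both views
    x = rng.integers(0, w - crop_w + 1)
    return (normed[0][y:y + crop_h, x:x + crop_w],
            normed[1][y:y + crop_h, x:x + crop_w])

left = np.random.default_rng(1).uniform(0, 255, (370, 1224))   # KITTI-like size
right = np.random.default_rng(2).uniform(0, 255, (370, 1224))
l_crop, r_crop = preprocess_pair(left, right)
```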
In one embodiment, the feature extraction module is a weight-sharing lightweight twin network comprising standard convolution layers and phantom (ghost) convolution feature extraction units. The unit with stride 1 has the structure ghost conv-BN-ghost conv-BN-Relu, and the unit with stride 2 has the structure ghost conv-BN-depthwise conv-ghost conv-BN-Relu, where ghost conv denotes a convolution design in which a core convolution layer plus simple linear operations compute the redundant feature maps, BN is batch normalization, depthwise conv is depthwise convolution, and Relu is the linear rectification function. The input and output of each phantom convolution feature extraction unit are added via a skip connection, and the final outputs are two unary feature maps f_l and f_r of size H/4 × W/4 × C, where H, W denote the height and width of the original input image and C denotes the feature dimension.
Specifically, a weight-sharing lightweight twin network is constructed: the input is first processed by a standard convolution layer with 3 × 3 kernels and 16 channels, and feature extraction then proceeds through a series of ghost phantom convolution feature extraction units with gradually increasing channel counts.
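The ghost convolution idea referenced above (a core convolution plus cheap linear operations generating redundant "ghost" feature maps) can be sketched in NumPy. The per-channel scaling used here as the cheap operation, and all shapes and weights, are illustrative assumptions rather than the invention's exact design:

```python
import numpy as np

# Ghost convolution sketch (GhostNet-style): a core 1x1 convolution produces a
# few intrinsic feature maps, then a cheap per-channel linear operation (here a
# simple scaling; real implementations typically use 3x3 depthwise convolutions)
# produces one ghost map per intrinsic map. The cost is roughly half that of a
# full convolution producing all output channels directly.
def ghost_conv(x, core_w, cheap_scale):
    # x: (H, W, C_in); core_w: (C_in, intrinsic); cheap_scale: (intrinsic,)
    intrinsic = x @ core_w             # 1x1 convolution == per-pixel matmul
    ghosts = intrinsic * cheap_scale   # cheap linear op on each intrinsic map
    return np.concatenate([intrinsic, ghosts], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
core_w = rng.standard_normal((8, 4))   # 8 input -> 4 intrinsic channels
scale = np.array([0.5, 1.0, 2.0, -1.0])
out = ghost_conv(x, core_w, scale)     # 8 output channels: 4 intrinsic + 4 ghost
```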
In an embodiment, the candidate disparity prediction module is specifically configured to uniformly divide the disparity space of each pixel into N = 10 disparity subspaces, randomly generate a disparity value in each subspace to construct a disparity sampling vector, filter the sampling vector with one-hot convolution kernels so that the N sample values of each pixel propagate to its four neighbors, score each candidate disparity of each pixel, and rebuild an N × H/4 × W/4 disparity sampling vector from the score-weighted average of the candidate disparities for the next iteration, thereby obtaining the target candidate disparities.
In one embodiment, the formula for the candidate disparity evaluation score is defined as follows:

S_{i,j} = ⟨f_l(i), f_r(i + d_{i,j})⟩

where S_{i,j} is the score of the j-th candidate disparity of pixel i, ⟨·,·⟩ denotes the inner product, d_{i,j} is the candidate disparity value obtained by the PatchMatch algorithm, f_l and f_r are the extracted left and right feature maps, and f_r(i + d_{i,j}) denotes warping the extracted right feature map with the acquired disparity sample value.
Specifically, the candidate disparity prediction module based on the patchmatch algorithm specifically includes:
a random sampling layer, executing step S301: for each independent pixel, the whole continuous parallax space is uniformly divided into N-10 parallax subspaces, a parallax value is randomly generated in each parallax subspace, and a parallax sampling vector with the size of N multiplied by H/4 multiplied by W/4 is obtained through combination.
The neighborhood propagation layer executes step S302: the disparity sampling vector is filtered with pre-designed one-hot convolution kernels so that the N sample values of each pixel obtained in step S301 are propagated to its four-neighborhood; the horizontal and vertical directions are processed separately, and after propagation each pixel acquires the disparity candidates of its adjacent pixels. In the horizontal direction, each pixel of the input image obtains 3N candidate disparity values, since each of its N uniformly divided disparity subspaces now holds three propagated candidates; the vertical direction is processed in the same way.
The evaluation layer executes step S303: the candidate disparities of the N subspaces of each pixel are scored in order to find more accurate disparity values. Each disparity subspace holds three candidate values, each with a corresponding score. Taking the evaluation scores as weights, a weighted average of the candidates yields a more accurate disparity value within each subspace; the N × H/4 × W/4 disparity sampling vector is then recombined and the next iteration begins. The N × H/4 × W/4 disparity sampling vector obtained after iteration is closer to the accurate disparity values than the initial candidates, and constitutes the target candidate disparities.
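Steps S301-S303 can be sketched as one simplified horizontal propagation/evaluation pass. The stand-in random scores (the real module scores candidates with feature inner products), the single pass, and the wrap-around neighbor shift are assumptions of this sketch:

```python
import numpy as np

# Sketch of PatchMatch-style candidate generation (steps S301-S303),
# simplified to one horizontal pass with made-up per-candidate scores.
rng = np.random.default_rng(0)
N, H, W, d_max = 10, 6, 8, 192.0
width = d_max / N                            # each disparity subspace is 19.2 px wide

# S301: one random sample per subspace per pixel -> (N, H, W) sampling vector.
samples = width * (np.arange(N)[:, None, None] + rng.random((N, H, W)))

# S302: propagate each pixel's samples to its horizontal neighbors (the one-hot
# shift kernels), giving 3 candidates per subspace: left neighbor, self, right.
left = np.roll(samples, 1, axis=2)
right = np.roll(samples, -1, axis=2)
cands = np.stack([left, samples, right])     # (3, N, H, W)

# S303: score each candidate (random stand-in scores here) and take the
# score-weighted average inside each subspace, rebuilding the (N, H, W) vector.
scores = rng.random(cands.shape)
weights = scores / scores.sum(axis=0, keepdims=True)
samples = (weights * cands).sum(axis=0)
```

Because neighbors contribute samples from the same subspace index, every refined value stays inside its own 19.2 px band, keeping the candidate set discrete and small.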
In one embodiment, the matching cost calculation process includes: the reference feature map f_l and the corresponding target feature map f_r are spliced under the obtained candidate disparities and packed into a 4D matching cost volume, whose final output dimension is H/4 × W/4 × N × F, where H and W respectively denote the height and width of the original input image, N denotes the number of candidate disparities per pixel, and F denotes the feature dimension;
the regularization process includes: performing cost aggregation on the matching cost bodies by adopting depth separable convolution without parallax dimension weight sharing to obtain a regularized matching cost body;
the disparity regression uses a differentiable soft argmin operation to perform regression prediction of a smooth, continuous initial disparity map along the disparity dimension of the regularized matching cost volume; its input is the regularized matching cost volume and its output is the initial predicted disparity map of dimension H/4 × W/4 × 1.
In one embodiment, the formula for the N planes of the 4D matching cost volume is defined as follows:

plane_i = plane( f_l, W_{d_i}(f_r) ),  i = 1, …, N

where W_{d_i}(·) is the function translating the right image along the disparity dimension, d_i is the disparity-dimension information, and plane(·) is the joint feature constructed from the left-image features and the right-image features translated along the disparity dimension.
In one embodiment, the disparity regression uses a soft argmin function to calculate an initial disparity from the cost aggregation result; the soft argmin function converts each disparity prediction value into a normalized probability for each pixel point of the reference image to calculate the initial disparity map, and can be back-propagated to assist network training. The soft argmin calculation formula is defined as follows:

p(d_i) = σ(−c_{d_i})

d̂ = Σ_{i=1}^{N} d_i × σ(−c_{d_i})

where d_i, with i taking N different values, denotes the N candidate disparities, c_{d_i} is the matching cost of the pixel at disparity d_i after cost aggregation, the probability p(d_i) of each disparity is calculated from the regularized matching cost volume through the softmax function σ(·), and the predicted disparity d̂ is obtained by cumulatively summing the product of each possible disparity and its probability value.
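The soft argmin regression described above can be sketched directly (a NumPy sketch; the candidate costs and disparity values below are toy numbers, not from the patent):

```python
import numpy as np

def soft_argmin(costs, disparities):
    """Differentiable soft-argmin regression over N candidate disparities.

    costs       : (n, h, w) aggregated matching costs c_d.
    disparities : (n, h, w) candidate disparity values d_i.

    The negated costs pass through a softmax over the candidate dimension,
    and the prediction is the probability-weighted sum of the candidates."""
    e = np.exp(-costs - (-costs).max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)   # sigma(-c_d), normalized probability
    return (p * disparities).sum(axis=0)   # sub-pixel disparity estimate

costs = np.array([[[5.0]], [[0.1]], [[4.0]]])    # middle candidate: lowest cost
disps = np.array([[[10.0]], [[20.0]], [[30.0]]])
d_hat = soft_argmin(costs, disps)                # lands close to 20
```

Because the result is a probability-weighted sum rather than a hard argmin, it is differentiable and can reach sub-pixel precision.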
In a specific implementation process, the initial disparity prediction module specifically includes the following sub-steps:
step S401: for the extracted left and right characteristic maps flAnd frAccording to the N target candidate parallaxes obtained based on the patch match algorithm in the step S303, the left image feature is fixed, the right image feature is translated according to the parallax dimension, the left image feature and the right image feature are spliced along the channel dimension to construct a joint feature when each unit is translated, and finally a 4D matching cost body with the size of 2C multiplied by N multiplied by H/4 multiplied by W/4 is generated.
Step S402: cost aggregation is performed on the compact 4D matching cost volume using depthwise separable convolutions whose disparity-dimension weights are not shared. Specifically, convolution kernels of size 3 × 3 × N regularize the spatial and disparity dimensions of the compact 4D matching cost volume, with parameters learned without sharing weights across the N disparity candidates: each kernel combines the information of all N candidate disparities, while the unshared-weight scheme learns an independent representation for each candidate. The result of cost aggregation is a regularized matching cost volume with output dimension 1 × N × H/4 × W/4.
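The disparity-weight-unshared aggregation of step S402 can be illustrated with a minimal sketch in which the feature dimension has already been collapsed to 1 and each output disparity index owns a private 3 × 3 × N kernel (an illustrative simplification of the full 2C-channel aggregation, not the patented layer):

```python
import numpy as np

def aggregate_costs(cost, kernels):
    """Cost aggregation with disparity-dimension weights NOT shared.

    cost    : (n, h, w) matching costs (feature dim collapsed to 1 here).
    kernels : (n, n, 3, 3) filters -- output disparity index o owns its
              private 3 x 3 x n kernel, so no weight is shared across the
              disparity dimension, yet every kernel mixes all n candidates."""
    n, h, w = cost.shape
    pad = np.pad(cost, ((0, 0), (1, 1), (1, 1)))   # zero-pad spatial dims
    out = np.zeros_like(cost)
    for o in range(n):                             # independent kernel per output
        for dy in range(3):
            for dx in range(3):
                out[o] += np.tensordot(kernels[o, :, dy, dx],
                                       pad[:, dy:dy + h, dx:dx + w],
                                       axes=(0, 0))
    return out

rng = np.random.default_rng(2)
cost = rng.random((4, 6, 8))
ident = np.zeros((4, 4, 3, 3))
ident[np.arange(4), np.arange(4), 1, 1] = 1.0      # identity kernels
out = aggregate_costs(cost, ident)                 # leaves the costs unchanged
```

Compared with a full 3D convolution over a dense disparity range, the filter only ever touches the N sampled candidates, which is where the memory and runtime savings come from.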
S403: disparity regression. A differentiable soft argmin function, which can reach sub-pixel precision, calculates the initial disparity from the cost aggregation result; for each pixel of the reference image it converts every disparity prediction value into a normalized probability to compute the initial disparity map, and it can be back-propagated to assist network training.
In one embodiment, the hierarchical disparity optimization module is specifically configured to: upsample the initial disparity map from H/4 × W/4 resolution to H/2 × W/2 resolution using bilinear interpolation; the H/2 × W/2 left image I_l and the bilinearly interpolated disparity map then pass together through a 3 × 3 convolution operation to obtain a 32-channel representation tensor, and 6 residual blocks using dilated convolution produce an H/2 × W/2 disparity residual map, which is combined with the interpolated disparity map to obtain the optimized disparity map at the current scale; the disparity map at the original H × W resolution is obtained through loop iteration.
In the specific implementation process, a high-resolution disparity map is obtained with bilinear interpolation. First, the initial disparity map is upsampled to a higher resolution by bilinear interpolation and merged with the reference RGB left image of the corresponding resolution; a convolution layer with a 3 × 3 kernel and 32 channels is then applied, and its output is further processed by 6 residual blocks with dilation rates of 1, 2, 4, 8, 1 and 1. The output of the residual blocks is fed into a convolution layer with a 3 × 3 kernel and output dimension 1, without BN or ReLU, and the result is added to the interpolated disparity map of the corresponding resolution. This process is iterated until the disparity map of the original resolution is obtained. The final disparity map output by the hierarchical disparity optimization module has dimension H × W × 1.
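The bilinear upsampling that begins each refinement stage can be sketched as follows (a plain NumPy sketch using pixel-center alignment, which is one common convention and an assumption here; the residual blocks are omitted):

```python
import numpy as np

def bilinear_upsample(img, out_h, out_w):
    """Bilinearly upsample a single-channel map (pixel-center alignment)."""
    h, w = img.shape
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5   # source y coordinates
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]          # interpolation weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

disp = np.full((4, 8), 7.5)                # toy H/4 x W/4 initial disparity map
up = bilinear_upsample(disp, 8, 16)        # doubled to H/2 x W/2
```

The interpolated map only provides a smooth starting point; the learned residual blocks described above are what restore fine structure at each scale.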
In one embodiment, the training of the network model specifically includes the following sub-steps:
s501: and inputting the training data set stereo image pair into a real-time stereo matching network model for forward propagation training, wherein the learning parameters of the model comprise weight and bias, and randomly initializing the parameters to train the network model from the beginning.
S502: introducing a cross entropy loss function L of parallax truth peaking supervision:
Figure BDA0003091683270000111
wherein p isgt(. cndot.) is a constructed Gaussian distribution,
Figure BDA0003091683270000112
for each pixel, the normalized probability for each disparity possible.
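The peaked cross-entropy supervision can be sketched as follows (a NumPy sketch; the Gaussian width sigma is an assumed hyperparameter not specified in the text):

```python
import numpy as np

def peaked_cross_entropy(probs, disparities, d_gt, sigma=1.0):
    """Cross-entropy loss against a Gaussian-peaked ground-truth distribution.

    probs       : (n, h, w) predicted normalized probabilities per candidate.
    disparities : (n, h, w) candidate disparity values.
    d_gt        : (h, w) ground-truth disparities.

    The true disparity is softened into a normalized Gaussian p_gt centred
    on d_gt; the loss is L = -sum_d p_gt(d) * log p(d), averaged over pixels."""
    g = np.exp(-0.5 * ((disparities - d_gt[None]) / sigma) ** 2)
    p_gt = g / g.sum(axis=0, keepdims=True)
    return float(-(p_gt * np.log(probs + 1e-9)).sum(axis=0).mean())

disps = np.array([18.0, 20.0, 22.0])[:, None, None]
d_gt = np.full((1, 1), 20.0)
good = np.array([0.1, 0.8, 0.1])[:, None, None]   # mass on the true disparity
bad = np.array([0.8, 0.1, 0.1])[:, None, None]    # mass on a wrong candidate
l_good = peaked_cross_entropy(good, disps, d_gt)
l_bad = peaked_cross_entropy(bad, disps, d_gt)
```

Supervising the probability distribution directly, rather than only the regressed disparity, penalizes multi-modal predictions even when their weighted average happens to be correct.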
S503: and repeating the step S501 and the step S502, and continuously iterating and training the parameters of the network model to obtain the optimal real-time stereo matching network model of the driving environment.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a light-weight stereo matching method based on candidate parallax in a driving environment. The candidate parallax method based on the Patch match reduces the redundancy of full parallax search space calculation, and in view of the probability equality of parallax non-full range space values, the regularization module uses the depth separable convolution which is not shared by parallax dimension weight to avoid excessive consumption of calculation resources and memory by 3D convolution, so that the real-time requirement is met, the parallax prediction precision is further improved through the loss function of ground route single peak supervision, and mismatching is reduced.
The following describes a method for implementing stereo matching according to a specific embodiment. FIG. 1 is a flow chart of calculating binocular disparity in a driving environment and combining camera parameters to perform dense pixel-level depth information perception; fig. 2 is a network flowchart for performing real-time stereo matching of a driving environment based on an effective candidate disparity, and fig. 3 is a network structure diagram for generating a candidate disparity based on a Patch match algorithm to perform real-time stereo matching.
The real-time stereo matching method specifically comprises the following steps:
step 1: data set preprocessing: and randomly cutting the left and right images containing the real parallax value, wherein the cutting size is 512 multiplied by 256, and normalizing the cut images to enable the range of the image pixel value to be between-1 and 1. The left image is used as a reference image, and the right image is used as a target image, so that a group of stereo image pairs are formed together. Taking the training sample as a Sceneflow data set and the migration learning binocular stereo image pair as a KITTI2015 data set as an example, the migration learning binocular image pair may also be a Driving stereo data set and an Apollo scape data set.
Step 2: construct the real-time stereo matching network for the driving environment. First, a depth feature representation for computing the stereo matching cost is learned: the input stereo image pair passes through a conventional convolution layer and 4 ghost convolution unit layers to extract depth features, please see fig. 4 and fig. 5. Disparity values are then randomly generated for the N = 10 disparity subspaces into which the range of each pixel is uniformly divided, constructing a disparity sampling vector. Referring to fig. 6, a one-hot convolution kernel filters the vector so that the N sampled values spread to the four-neighborhood; each candidate disparity of a pixel is scored, and an N × H/4 × W/4 disparity sampling vector is reconstructed by weighted averaging according to the evaluation scores for iteration. Next, each reference unary feature is connected with the corresponding target unary feature under each candidate disparity to form a 4D cost volume that captures the matching relationship between the pixels of the two input stereo images, see fig. 7. The matching cost calculation provides an initial correlation between the stereo image pair, and cost aggregation obtains a more robust prediction through regularization. For this, a depthwise separable convolution with unshared disparity-dimension weights is proposed to regularize the cost volume while significantly reducing memory usage and running time during training and inference. A differentiable soft argmin strategy is then used to regress a smooth, continuous disparity map along the disparity dimension of the cost volume.
Specifically, the probability of each candidate disparity is calculated by utilizing softmax operation on the cost value, the predicted disparity can be obtained by multiplying each disparity by the probability value of the disparity, an initial predicted disparity map is obtained, and finally, a disparity map with the original resolution is obtained by a hierarchical up-sampling disparity optimization module.
And step 3: and training the network model. Firstly, inputting a preprocessed training data set Sceneflow stereo image pair into a stereo matching network for forward propagation training, wherein the learning parameters of the model comprise weight and bias. And then inputting the output disparity map and the real disparity map of the data set into an L loss function, and performing back propagation by using a batch gradient descent method. And finally, updating the learning parameters of the iterative model for multiple times according to the gradient to obtain the optimal stereo matching network model.
And 4, step 4: and (4) transfer learning. Through the stereo matching model obtained in step 3, the actual driving environment is now tested by using the stereo image pair of the KITTI2015 data set in a transfer learning manner.
Step 4.1: the binocular image pairs of the KITTI 2015 data set used in this implementation are randomly cropped into image blocks of size 512 × 256 and normalized so that pixel values lie in [−1, 1]; after the pre-training stage on the SceneFlow data set is completed, the binocular pairs are input into the trained real-time stereo matching network.
Step 4.2: referring to fig. 2, light-weight feature extraction is performed on the input real driving environment image pair of the implementation example. Firstly, feature extraction is carried out on a stereo image pair by utilizing a conventional convolution layer, then depth feature extraction is carried out by utilizing four ghost convolution units, the initial feature dimension is 32, and the output feature map dimension is 128 multiplied by 64 multiplied by 128 at the moment. And then, dimension reduction is carried out by utilizing a convolution layer with a convolution kernel of 1 multiplied by 1 and a characteristic dimension of 32 so as to conveniently construct matching cost.
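The ghost convolution units referenced here follow the GhostNet idea of a primary convolution plus cheap linear operations; a minimal single-module sketch is given below (the 1 × 1 primary kernels and 3 × 3 depthwise cheap kernels are assumed shapes, not the patent's exact configuration):

```python
import numpy as np

def ghost_module(x, w_primary, w_cheap):
    """Single ghost convolution module (GhostNet idea, sketch).

    x         : (c_in, h, w) input feature map.
    w_primary : (m, c_in) 1 x 1 kernels producing m intrinsic maps.
    w_cheap   : (m, 3, 3) one cheap depthwise kernel per intrinsic map.

    Cheap depthwise operations generate the redundant 'ghost' maps, which
    are concatenated with the intrinsic maps (2m output channels)."""
    primary = np.einsum('oc,chw->ohw', w_primary, x)   # intrinsic maps
    m, h, w = primary.shape
    pad = np.pad(primary, ((0, 0), (1, 1), (1, 1)))
    ghost = np.zeros_like(primary)
    for dy in range(3):
        for dx in range(3):
            ghost += w_cheap[:, dy, dx, None, None] * pad[:, dy:dy + h, dx:dx + w]
    return np.concatenate([primary, ghost], axis=0)

rng = np.random.default_rng(4)
x = rng.random((4, 8, 8))
w_primary = rng.random((8, 4))
w_cheap = np.zeros((8, 3, 3))
w_cheap[:, 1, 1] = 1.0                  # identity cheap op for this demo
out = ghost_module(x, w_primary, w_cheap)
```

Half of the output channels cost only a depthwise pass each, which is what makes the feature extractor lightweight.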
Step 4.3: for the N = 10 disparity subspaces into which the range of each pixel is uniformly divided, disparity values are randomly generated to construct a disparity sampling vector; one-hot convolution kernels filter the vector so that the N sampled values propagate to the four-neighborhood, processing the horizontal and vertical directions separately. Each pixel of the input image obtains 3N candidate disparity values in the horizontal direction; the three candidate values of each pixel in each subspace are scored, and a 10 × 128 × 64 disparity sampling vector is formed by weighted averaging according to the evaluation scores for iteration.
Step 4.4: and (3) cascading the output stereo image pairs to form a 4-dimensional tensor to construct a cost amount, wherein the dimension of the output feature image is 128 multiplied by 64 multiplied by 10 multiplied by 64, the tensor carries out aggregation of matching cost through depth separable convolution which is not shared by parallax dimension weight, and the dimension of the output feature image is 128 multiplied by 64 multiplied by 10 multiplied by 1.
Step 4.5: initial disparity calculation. With the aggregated cost c_d, the probability of each disparity d is calculated using the softmax operation σ(·); the predicted disparity d̂ is obtained by summing the product of each possible disparity and its probability value. The formula is as follows:

d̂ = Σ_{i=1}^{N} d_i × σ(−c_{d_i})
This operation performs regression prediction along the disparity dimension of the cost volume to obtain a smooth, continuous disparity map; the dimension of the output feature map is 128 × 64 × 1.
Step 4.6: hierarchical disparity optimization. First, the initial disparity map is upsampled by bilinear interpolation to a disparity map of dimension 256 × 128 × 1; the RGB reference left image at the current scale and the interpolated disparity map are then convolved together with a 3 × 3 kernel to obtain a 32-channel representation tensor, and the output is passed through 6 residual blocks with dilation rates of 1, 2, 4, 8, 1 and 1 to obtain the disparity residual of the current scale, which is combined with the interpolated disparity map to obtain the optimized disparity map at that scale; finally, this process is iterated until the disparity map of the original resolution is obtained.
Step 4.7: and inputting the output disparity map and the real disparity map into an L loss function, and performing backward propagation by using a batch gradient descent method. And finally, updating learning parameters of the iterative model for multiple times according to the gradient, including weight and bias, so as to obtain a real-time stereo matching network model for training the optimal driving environment.
And after the transfer learning is finished, the real-time stereo matching of the real driving scene can be carried out by utilizing the network obtained by training.
Example two
Based on the same inventive concept, the embodiment provides a driving environment real-time stereo matching device based on candidate parallax, which includes:
the data set acquisition module is used for acquiring an original data set, preprocessing the original data set, taking a left image in the original data set as a reference image, taking a right image as a corresponding target image, and forming a group of stereo image pairs by the reference image and the target image;
the matching network construction module is used for constructing a driving environment real-time stereo matching network, wherein the driving environment real-time stereo matching network comprises a feature extraction module, a candidate parallax calculation module based on a Patch match algorithm, an initial parallax prediction module and a hierarchical parallax optimization module, the feature extraction module is a light-weight twin network sharing weight, and is used for performing depth feature extraction on an input stereo image pair to obtain a left feature image and a right feature image; the candidate parallax calculation module is used for randomly generating parallax values by uniformly dividing each pixel into parallax subspaces based on a Patch match algorithm, and obtaining target candidate parallaxes through a transmission and evaluation strategy, the initial parallax prediction module is used for calculating matching cost, regularizing a matching cost body and performing parallax regression on the basis of parallax sampling vectors to obtain a roughly estimated low-resolution parallax image, and the layering parallax optimization module is used for restoring the low-resolution parallax image into an original resolution parallax image;
the model training module is used for inputting the stereo image pair in the preprocessed data set into the constructed driving environment real-time stereo matching network for forward propagation training; then inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method; finally, updating the learning parameters of the iterative model for multiple times according to the gradient to obtain an optimal driving environment real-time stereo matching network model, wherein the learning parameters of the model comprise weight and bias;
and the stereo matching module is used for performing binocular stereo matching by using the trained real-time stereo matching network model of the driving environment.
Since the apparatus described in the second embodiment of the present invention is an apparatus used for implementing the driving environment real-time stereo matching method based on the candidate parallax in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the apparatus based on the method described in the first embodiment of the present invention, and thus, the details are not described herein. All the devices adopted in the method of the first embodiment of the present invention belong to the protection scope of the present invention.
EXAMPLE III
Referring to fig. 8, based on the same inventive concept, the present application further provides a computer-readable storage medium 300, on which a computer program 311 is stored, which when executed implements a method according to one embodiment.
Since the computer-readable storage medium introduced in the third embodiment of the present invention is a computer-readable storage medium used for implementing the driving environment real-time stereo matching method based on the candidate disparity in the first embodiment of the present invention, based on the method introduced in the first embodiment of the present invention, persons skilled in the art can understand the specific structure and deformation of the computer-readable storage medium, and therefore, no further description is given here. Any computer readable storage medium used in the method of the first embodiment of the present invention is within the scope of the present invention.
Example four
Based on the same inventive concept, the present application further provides a computer apparatus, please refer to fig. 9, which includes a storage 401, a processor 402, and a computer program 403 stored in the storage and running on the processor, and when the processor 402 executes the above program, the method in the first embodiment is implemented.
Since the computer device introduced in the fourth embodiment of the present invention is a computer device used for implementing the driving environment real-time stereo matching method based on the candidate parallax in the first embodiment of the present invention, based on the method introduced in the first embodiment of the present invention, persons skilled in the art can understand the specific structure and deformation of the computer device, and thus details are not described herein. All the computer devices used in the method of the embodiment of the present invention are within the scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A driving environment real-time stereo matching method based on candidate parallax is characterized by comprising the following steps:
s1: acquiring an original data set, preprocessing the original data set, taking a left image in the original data set as a reference image, taking a right image in the original data set as a corresponding target image, and forming a group of stereo image pairs by the reference image and the target image;
s2: the method comprises the steps of constructing a driving environment real-time stereo matching network, wherein the driving environment real-time stereo matching network comprises a feature extraction module, a candidate parallax calculation module based on a Patch match algorithm, an initial parallax prediction module and a hierarchical parallax optimization module, wherein the feature extraction module is a weight-sharing light-weight twin network and is used for performing depth feature extraction on an input stereo image pair to obtain a left feature image and a right feature image; the candidate parallax calculation module is used for randomly generating parallax values by uniformly dividing each pixel into parallax subspaces based on a Patch match algorithm, and obtaining target candidate parallaxes through a transmission and evaluation strategy, the initial parallax prediction module is used for calculating matching cost, regularizing a matching cost body and performing parallax regression on the basis of parallax sampling vectors to obtain a roughly estimated low-resolution parallax image, and the layering parallax optimization module is used for restoring the low-resolution parallax image into an original resolution parallax image;
s3: inputting the stereo image pair in the preprocessed data set into a constructed driving environment real-time stereo matching network for forward propagation training; then inputting the output final disparity map and the real disparity map into a loss function, and performing backward propagation by using a batch gradient descent method; finally, updating the learning parameters of the iterative model for multiple times according to the gradient to obtain an optimal driving environment real-time stereo matching network model, wherein the learning parameters of the model comprise weight and bias;
s4: and carrying out binocular stereo matching by using the trained real-time stereo matching network model of the driving environment.
2. The real-time stereo matching method for vehicle driving environment according to claim 1, wherein the feature extraction module is a lightweight weight-sharing twin network comprising a standard convolution layer and phantom (ghost) convolution feature extraction units; the unit with stride 1 has the structure ghost conv-BN-ghost conv-BN-Relu, and the unit with stride 2 has the structure ghost conv-BN-depthwise conv-ghost conv-BN-Relu, where ghost conv denotes a convolution design in which redundant feature maps are computed by a core convolution layer plus simple linear operations, BN is the batch normalization operation, depthwise conv is depthwise convolution, and Relu is the linear rectification function; the input and output of the phantom convolution feature extraction unit are added through a skip connection, and the final outputs are two unary features f_l and f_r of size H/4 × W/4 × C, where H and W respectively denote the height and width of the original input image, and C denotes the feature dimension.
3. The real-time stereo matching method for vehicle driving environment according to claim 1, wherein the candidate disparity prediction module is specifically configured to: uniformly divide the disparity range of each pixel into N = 10 disparity subspaces, randomly generate a disparity value in each subspace to construct a disparity sampling vector, filter the vector with a one-hot convolution kernel so that the N corresponding sampled values are propagated to the four-neighborhood, score each candidate disparity of each pixel, and reconstruct an N × H/4 × W/4 disparity sampling vector for iteration by weighted averaging according to the candidate disparity evaluation scores, obtaining the target candidate disparities.
4. The real-time stereo matching method for vehicle driving environment according to claim 3, wherein the formula of the candidate disparity evaluation score is defined as follows:

S_{i,j} = ⟨ f_l(i), f_r(i + d_{i,j}) ⟩

where S_{i,j} is the score of each candidate disparity value of each pixel, ⟨·⟩ denotes the inner product calculation, d_{i,j} is a candidate disparity value obtained by the PatchMatch algorithm, f_l and f_r are the extracted left and right feature maps, and f_r(i + d_{i,j}) denotes warping the extracted right-image feature values with the acquired disparity samples.
5. The real-time stereo matching method for the driving environment according to claim 1, wherein the process of calculating the matching cost comprises: the reference feature map f_l and the corresponding target feature map f_r are spliced under the obtained candidate disparities and packed into a 4D matching cost volume, whose final output dimension is H/4 × W/4 × N × F, where H and W respectively denote the height and width of the original input image, N denotes the number of candidate disparities per pixel, and F denotes the feature dimension;
the regularization process includes: performing cost aggregation on the matching cost bodies by adopting depth separable convolution without parallax dimension weight sharing to obtain a regularized matching cost body;
the disparity regression uses a differentiable soft argmin operation to perform regression prediction of a smooth, continuous initial disparity map along the disparity dimension of the regularized matching cost volume; its input is the regularized matching cost volume and its output is the initial predicted disparity map of dimension H/4 × W/4 × 1.
6. The real-time stereo matching method for the driving environment according to claim 5, wherein the formula of the N planes of the 4D matching cost volume is defined as follows:

plane_i = plane( f_l, W_{d_i}(f_r) ),  i = 1, …, N

where W_{d_i}(·) is the function translating the right image along the disparity dimension, d_i is the disparity-dimension information, and plane(·) is the joint feature constructed from the left-image features and the right-image features translated along the disparity dimension.
7. The method of claim 5, wherein the disparity regression employs a soft argmin function to calculate the initial disparity from the cost-aggregated result; the soft argmin function converts each disparity prediction value into a normalized probability for each pixel point of the reference image to calculate the initial disparity map, and can be back-propagated to assist network training, wherein the soft argmin calculation formula is defined as follows:

p(d_i) = σ(−c_{d_i})

d̂ = Σ_{i=1}^{N} d_i × σ(−c_{d_i})

where d_i, with i taking N different values, denotes the N candidate disparities, c_{d_i} is the matching cost of the pixel at disparity d_i after cost aggregation, the probability p(d_i) of each disparity is calculated from the regularized matching cost volume through the softmax function σ(·), and the predicted disparity d̂ is obtained by cumulatively summing the product of each possible disparity and its probability value.
8. The method according to claim 1, wherein the hierarchical disparity optimization module is specifically configured to: upsample the initial disparity map from H/4 × W/4 resolution to H/2 × W/2 resolution by bilinear interpolation; apply a 3 × 3 convolution to the H/2 × W/2 left image I_l together with the bilinearly interpolated disparity map to obtain a 32-channel representation tensor; apply 6 residual blocks using dilated convolutions to obtain an H/2 × W/2 disparity residual map; combine the disparity residual map with the interpolated disparity map to obtain the optimized disparity map at the current scale; and iterate this loop to obtain the disparity map at the original H × W resolution.
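For illustration (outside the claims), the upsample-and-refine loop of claim 8 can be sketched as follows. The residual predictor, which in the claim is a 3 × 3 convolution plus 6 dilated residual blocks over the left image and disparity, is abstracted here as a callable; all names are assumptions.

```python
import numpy as np

def bilinear_upsample_2x(disp):
    """Double the resolution of a disparity map by bilinear interpolation,
    scaling the disparity values by 2 to match the finer pixel grid."""
    h, w = disp.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2.0 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2.0 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    return 2.0 * (disp[np.ix_(y0, x0)] * (1 - wy) * (1 - wx)
                  + disp[np.ix_(y0, x1)] * (1 - wy) * wx
                  + disp[np.ix_(y1, x0)] * wy * (1 - wx)
                  + disp[np.ix_(y1, x1)] * wy * wx)

def refine(disp_low, predict_residual, n_levels):
    """Upsample, predict a residual from the current disparity (and, in the
    claim, the left image), add it back; repeat until full resolution."""
    d = disp_low
    for _ in range(n_levels):
        d = bilinear_upsample_2x(d)
        d = d + predict_residual(d)
    return d

d0 = np.ones((4, 6))                             # H/4 x W/4 initial disparity
out = refine(d0, lambda d: np.zeros_like(d), 2)  # zero residual for illustration
print(out.shape)  # (16, 24)
```

Note the factor of 2 on upsampled disparities: when image width doubles, the pixel offset between matching points doubles too.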
9. A driving environment real-time stereo matching device based on candidate disparity, characterized by comprising:
a data set acquisition module, configured to acquire an original data set and preprocess it, taking the left image in the original data set as a reference image and the right image as the corresponding target image, the reference image and the target image forming a stereo image pair;
a matching network construction module, configured to construct the driving environment real-time stereo matching network, which comprises a feature extraction module, a PatchMatch-based candidate disparity calculation module, an initial disparity prediction module and a hierarchical disparity optimization module; the feature extraction module is a lightweight weight-sharing Siamese network that performs deep feature extraction on the input stereo image pair to obtain a left feature map and a right feature map; the candidate disparity calculation module randomly generates disparity values by uniformly dividing the disparity space of each pixel into subspaces based on the PatchMatch algorithm, and obtains the target candidate disparities through a propagation and evaluation strategy; the initial disparity prediction module computes matching costs from the disparity sampling vectors, regularizes the matching cost volume, and performs disparity regression to obtain a coarsely estimated low-resolution disparity map; and the hierarchical disparity optimization module restores the low-resolution disparity map to a disparity map at the original resolution;
a model training module, configured to input the stereo image pairs of the preprocessed data set into the constructed driving environment real-time stereo matching network for forward-propagation training; then input the output final disparity map and the ground-truth disparity map into a loss function and back-propagate using batch gradient descent; and finally update the learning parameters of the model over multiple iterations according to the gradients to obtain the optimal driving environment real-time stereo matching network model, wherein the learning parameters of the model comprise weights and biases;
and a stereo matching module, configured to perform binocular stereo matching using the trained driving environment real-time stereo matching network model.
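For illustration only (outside the claims), the random-generation / propagation / evaluation loop of the candidate disparity calculation module might be sketched as below. The toy cost function and all names are assumptions; a real implementation would score candidates with learned feature costs and also resample within subspaces.

```python
import numpy as np

def candidate_disparities(cost_fn, h, w, d_max, n_sub, n_iters, rng):
    """PatchMatch-style candidate search: one random disparity per uniform
    subspace of [0, d_max] per pixel, then alternate propagation from the
    left neighbour with evaluation, keeping the lowest-cost candidates."""
    edges = np.linspace(0.0, d_max, n_sub + 1)
    # random initialisation: one candidate per subspace, per pixel
    cand = rng.uniform(edges[:-1], edges[1:], size=(h, w, n_sub))
    for _ in range(n_iters):
        # propagation: offer each pixel its left neighbour's candidates
        prop = np.concatenate([cand[:, :1], cand[:, :-1]], axis=1)
        both = np.concatenate([cand, prop], axis=2)
        costs = cost_fn(both)                          # evaluation step
        keep = np.argsort(costs, axis=2)[..., :n_sub]  # n_sub best per pixel
        cand = np.take_along_axis(both, keep, axis=2)
    return cand

rng = np.random.default_rng(0)
cost_fn = lambda d: np.abs(d - 12.0)  # toy cost: true disparity 12 everywhere
out = candidate_disparities(cost_fn, 3, 5, 64.0, 4, 3, rng)
print(out.shape)  # (3, 5, 4)
```

Sampling per subspace keeps candidates spread over the disparity range, while propagation lets good disparities found at one pixel spread to its neighbours, which is the core PatchMatch idea.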
10. A computer-readable storage medium, having stored thereon a computer program which, when executed, implements the method of any one of claims 1 to 8.
CN202110597405.8A 2021-05-31 2021-05-31 Driving environment real-time stereo matching method and device based on candidate parallax Pending CN113344869A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597405.8A CN113344869A (en) 2021-05-31 2021-05-31 Driving environment real-time stereo matching method and device based on candidate parallax

Publications (1)

Publication Number Publication Date
CN113344869A true CN113344869A (en) 2021-09-03

Family

ID=77472598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597405.8A Pending CN113344869A (en) 2021-05-31 2021-05-31 Driving environment real-time stereo matching method and device based on candidate parallax

Country Status (1)

Country Link
CN (1) CN113344869A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107146248A (en) * 2017-04-27 2017-09-08 杭州电子科技大学 A kind of solid matching method based on double-current convolutional neural networks
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
US20200099920A1 (en) * 2018-09-26 2020-03-26 Google Llc Active stereo depth prediction based on coarse matching
CN111583313A (en) * 2020-03-25 2020-08-25 上海物联网有限公司 Improved binocular stereo matching method based on PSmNet
CN112734863A (en) * 2021-03-31 2021-04-30 武汉理工大学 Crossed binocular camera calibration method based on automatic positioning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KAI HAN ET AL.: "GhostNet: More Features from Cheap Operations", arXiv.org *
SAMEH KHAMIS ET AL.: "StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction", arXiv.org *
SHIVAM DUGGAL ET AL.: "DeepPruner: Learning Efficient Stereo Matching via Differentiable PatchMatch", arXiv.org *
VLADIMIR TANKOVICH ET AL.: "HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching", arXiv.org *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155303A (en) * 2022-02-09 2022-03-08 北京中科慧眼科技有限公司 Parameter stereo matching method and system based on binocular camera
CN114782507A (en) * 2022-06-20 2022-07-22 中国科学技术大学 Asymmetric binocular stereo matching method and system based on unsupervised learning
CN114782507B (en) * 2022-06-20 2022-09-30 中国科学技术大学 Asymmetric binocular stereo matching method and system based on unsupervised learning
CN115100267A (en) * 2022-08-29 2022-09-23 北京中科慧眼科技有限公司 Stereo matching method and system based on deep learning operator
CN116740162A (en) * 2023-08-14 2023-09-12 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium
CN116740162B (en) * 2023-08-14 2023-11-14 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium

Similar Documents

Publication Publication Date Title
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
CN110533712B (en) Binocular stereo matching method based on convolutional neural network
CN107358626B (en) Method for generating confrontation network calculation parallax by using conditions
CN113344869A (en) Driving environment real-time stereo matching method and device based on candidate parallax
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
US20230196801A1 (en) Method and device for 3d object detection
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN111508013A (en) Stereo matching method
CN113762267B (en) Semantic association-based multi-scale binocular stereo matching method and device
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN115641285A (en) Binocular vision stereo matching method based on dense multi-scale information fusion
CN116109689A (en) Edge-preserving stereo matching method based on guide optimization aggregation
CN116912405A (en) Three-dimensional reconstruction method and system based on improved MVSNet
CN115830094A (en) Unsupervised stereo matching method
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
CN114462486A (en) Training method of image processing model, image processing method and related device
CN117173655B (en) Multi-mode 3D target detection method based on semantic propagation and cross-attention mechanism
CN116977651B (en) Image denoising method based on double-branch and multi-scale feature extraction
Ruan et al. Dual‐Path Residual “Shrinkage” Network for Side‐Scan Sonar Image Classification
CN110766732A (en) Robust single-camera depth map estimation method
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
Jia et al. Self-supervised depth estimation leveraging global perception and geometric smoothness using on-board videos
Xu et al. Weakly-Supervised Monocular Depth Estimationwith Resolution-Mismatched Data
CN115457101B (en) Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210903