CN114004754A

CN114004754A - Scene depth completion system and method based on deep learning

Info

Publication number: CN114004754A
Application number: CN202111070656.7A
Authority: CN
Inventors: 岳昊嵩; 刘强; 刘中; 王薇; 王磊; 陈伟海
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2021-09-13
Filing date: 2021-09-13
Publication date: 2022-02-01
Anticipated expiration: 2041-09-13
Also published as: CN114004754B

Abstract

The invention discloses a scene depth completion system and method based on deep learning, and relates to the technical field of computer vision. The method comprises the steps of obtaining characteristics of scene depth maps of different modes in a KITTI data set, wherein the scene depth maps of the different modes comprise a sparse depth map and an RGB picture; respectively extracting the characteristics of a sparse depth map and an RGB picture by adopting a coding and decoding network based on a UNet network architecture; establishing and adopting an attention-based image convolution network and an attention-based self convolution fusion network to respectively carry out image restoration according to the characteristics of the sparse depth map and the RGB image so as to obtain a low-frequency depth map and a high-frequency depth map; and generating a dense depth map by adopting a pixel-level addition method according to the low-frequency depth map and the high-frequency depth map, and completing scene depth completion. The invention can greatly improve the scene depth completion precision.

Description

Scene depth completion system and method based on deep learning

Technical Field

The invention relates to the technical field of computer vision, in particular to a scene depth completion system and method based on deep learning.

Background

In recent years, with the development of computer vision technology, the fields of automatic driving, virtual reality, pose estimation, target detection and the like have remarkable achievements, and the depth is gradually a research hotspot. However, due to the limitations of RGB-D cameras, lidar, and other sensors, it is still difficult to obtain precise and dense depth information. Although the precision of the laser radar is high, the laser radar is expensive, and the obtained depth information is sparse and irregular. Although the RGB-D camera can obtain dense depth information, the perception range of the RGB-D camera is limited, and the effect is seriously influenced by the environment. In order to obtain dense, accurate depth maps, and to overcome the drawbacks of the sensors themselves, there has been a lot of work starting with the method of completing a given sparse depth map to obtain a dense depth map, which is called depth completion.

Because a large amount of environmental information, especially object geometric information, is lost in sparse depth measurement, the early depth completion effect based on a sparse depth map is poor, the object boundary is fuzzy, and the depth aliasing is serious. In order to compensate for the information lost in the sparse depth sampling process, it becomes a necessary means to introduce additional information. The RGB image contains rich color texture information of a scene and is low in acquisition cost, so in recent years, the RGB image is mainly used as a guide for depth completion, and a sparse depth image from a laser radar is completed.

With the development of deep learning, the data-driven method obtains the achievement far beyond the traditional method in the deep completion field. The existing deep completion based on the deep neural network can be basically divided into two strategies, the first strategy is a single branch structure, and in a simple way, a sparse depth map and an RGB map are sent into a network together, and a dense depth map is regressed through a coding and decoding network. The method has the defects that the problem of abnormal data fusion cannot be well handled, and the geometric information in the RGB picture is not fully utilized. Another strategy is a "multi-branch structure", which is essentially multi-model integration, i.e., each branch can be regarded as a single depth completion model, and by integrating dense depth maps from different branches, a more effective result can be obtained. Since only RGB picture additional data is introduced in the general depth completion, "multi-branch structure" is often designed as "dual-branch structure", where each branch focuses on processing information of one modality.

Although the existing methods achieve good results, the existing methods focus attention on fusion of different modal features or construction of differential branches, and influence of specificity of different modal data on results is ignored. In fact, as the depth value of the scene is down-sampled by the sparse depth information, its high frequency information is missing and mixed with a lot of noise, while the low frequency information is much more accurate. In contrast, color maps contain a large amount of scene geometry information related to high frequency information, but in low frequency regions, the rich texture and color variations of color pictures can introduce noise in the depth estimation results of these regions. In the dense depth map, low-frequency components account for most of the low-frequency components, so that the learned model has data preference on sparse depth information, the model prefers to utilize the sparse depth information in the recovery process of high-frequency and low-frequency information, the RGB image information is not sufficiently utilized, and finally a suboptimal solution is obtained.

Disclosure of Invention

In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a scene depth completion system and method based on deep learning, which can greatly improve the scene depth completion accuracy.

The embodiment of the invention is realized by the following steps:

in a first aspect, an embodiment of the present invention provides a scene depth completion system based on deep learning, including an image acquisition module, a feature extraction module, an image recovery module, and a depth completion module, where:

the image acquisition module is used for acquiring scene depth maps of different modes in the KITTI data set, wherein the scene depth maps of different modes comprise a sparse depth map and an RGB (red, green and blue) picture;

the characteristic extraction module is used for respectively extracting the characteristics of the sparse depth map and the RGB picture by adopting a coding and decoding network based on a UNet network architecture;

the image recovery module is used for establishing and adopting an attention-based image convolution network and an attention-based self convolution fusion network to carry out image recovery according to the characteristics of the sparse depth map and the RGB image respectively so as to obtain a low-frequency depth map and a high-frequency depth map;

and the depth completion module is used for generating a dense depth map by adopting a pixel-level addition method according to the low-frequency depth map and the high-frequency depth map so as to complete scene depth completion.

In order to solve the technical problem that attention is focused on feature fusion or construction of difference branches of different modes in the prior art, influence of particularity of data of different modes on results is ignored, and the accuracy of a completion result is low, the system firstly selects scene depth maps of different modes in KITTI data set through an image acquisition module, then respectively extracts features of two heterogeneous data, namely a sparse depth map and an RGB (red, green and blue) picture, through a feature extraction module and an independent encoding and decoding network based on a UNet network architecture, then respectively recovers high-frequency components and low-frequency components of the dense depth map through an image recovery module and an attention-based image convolution network and an attention-based self-convolution fusion network, achieves the purposes of sparse depth information domination in the low-frequency component recovery process and picture information domination in the high-frequency component recovery process, and finally combines the two through a depth completion module, so as to obtain a dense depth map, and further effectively improve the precision of the depth completion result.

Based on the first aspect, in some embodiments of the present invention, the image restoration module includes a low frequency processing sub-module, configured to construct a graph model according to features of the sparse depth map, input the graph model into an attention-based graph convolution network, and convolve an output feature map with a standard convolution layer aggregate graph to obtain the low frequency depth map.

Based on the first aspect, in some embodiments of the present invention, the image restoration module includes a high-frequency processing sub-module, configured to adaptively select an information fusion region according to features of the sparse depth map and the RGB picture by an attention-based self-convolution fusion network, and enable the network to adaptively control a degree of fusion of the features of the sparse depth map and the RGB picture through spatial diversity convolution, so as to obtain the high-frequency depth map.

Based on the first aspect, in some embodiments of the present invention, the scene depth completion system based on deep learning further includes a completion evaluation module, configured to compare the dense depth map with a preset comparison map, and generate a comparison evaluation result.

In a second aspect, an embodiment of the present invention provides a scene depth completion method based on deep learning, including the following steps:

acquiring scene depth maps of different modes in a KITTI data set, wherein the scene depth maps of the different modes comprise a sparse depth map and an RGB picture;

respectively extracting the characteristics of a sparse depth map and an RGB picture by adopting a coding and decoding network based on a UNet network architecture;

establishing and adopting an attention-based image convolution network and an attention-based self convolution fusion network to respectively carry out image restoration according to the characteristics of the sparse depth map and the RGB image so as to obtain a low-frequency depth map and a high-frequency depth map;

and generating a dense depth map by adopting a pixel-level addition method according to the low-frequency depth map and the high-frequency depth map, and completing scene depth completion.

In order to solve the technical problem that in the prior art, attention is focused on fusion of different modal characteristics or construction of differential branches, influence of particularity of different modal data on a result is ignored, and accordingly completion result precision is low, the method firstly selects scene depth maps of different modes in a KITTI data set, then, an independent encoding and decoding network based on the UNet network architecture is adopted to respectively extract the characteristics of two heterogeneous data, namely a sparse depth map and an RGB picture, then, a graph convolution network based on attention and a self convolution fusion network based on attention are established and adopted to respectively recover the high-frequency component and the low-frequency component of the dense depth map, so that the purposes of optimizing sparse depth information in the low-frequency component recovery process and optimizing RGB picture information in the high-frequency component recovery process are achieved, and finally the two are combined to obtain the dense depth map, and the precision of the depth completion result is effectively improved.

Based on the second aspect, in some embodiments of the present invention, the method for establishing and using the attention-based convolution network and the attention-based self-convolution fusion network to perform image restoration according to the features of the sparse depth map and the RGB picture, respectively, so as to obtain the low-frequency depth map and the high-frequency depth map includes the following steps:

and constructing a graph model according to the characteristics of the sparse depth map, inputting the graph model into an attention-based graph convolution network, and convolving the output characteristic map by adopting a standard convolution layer aggregate map to obtain a low-frequency depth map.

the attention-based self-convolution fusion network adaptively selects an information fusion area according to the characteristics of the sparse depth map and the RGB picture, and the network adaptively controls the fusion degree of the characteristics of the sparse depth map and the RGB picture through space diversity convolution so as to obtain the high-frequency depth map.

Based on the second aspect, in some embodiments of the present invention, the deep learning-based scene depth completing method further includes the following steps:

and comparing the dense depth map with a preset comparison map to generate a comparison evaluation result.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory for storing one or more programs; a processor. The program or programs, when executed by a processor, implement the method of any of the second aspects as described above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method according to any one of the above second aspects.

The embodiment of the invention at least has the following advantages or beneficial effects:

the embodiment of the invention provides a scene depth completion system and method based on deep learning, aiming at solving the technical problems that attention is focused on fusion of different modal characteristics or construction of differential branches, influence of particularity of different modal data on results is ignored, and the completion result precision is low in the prior art, the method comprises the steps of firstly selecting scene depth maps of different modes in KITTI data set, then adopting an independent encoding and decoding network based on a UNet network architecture to respectively extract characteristics of two heterogeneous data, namely a sparse depth map and an RGB picture, then establishing and adopting an attention-based image convolution network and an attention-based self convolution fusion network to respectively recover high-frequency components and low-frequency components of the dense RGB depth map, and achieving the purposes of occupying sparse depth information in the low-frequency component recovery process and occupying high-frequency component picture information in the high-frequency component recovery process, and finally, combining the two images to obtain a dense depth map, thereby effectively improving the precision of the depth completion result.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a flowchart of a scene depth completion method based on deep learning according to an embodiment of the present invention;

FIG. 2 is a schematic block diagram of a scene depth completion system based on deep learning according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an encoding and decoding network according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a low frequency branch structure according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a network according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a high frequency branch structure according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of attention-based self-convolution fusion in accordance with an embodiment of the present invention;

fig. 8 is a block diagram of an electronic device according to an embodiment of the present invention.

Icon: 100. an image acquisition module; 200. a feature extraction module; 300. an image restoration module; 310. a low frequency processing sub-module; 320. a high-frequency processing submodule; 400. a depth completion module; 500. a completion evaluation module; 101. a memory; 102. a processor; 103. a communication interface.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Examples

As shown in fig. 1, in a first aspect, an embodiment of the present invention provides a scene depth completion system based on deep learning, including an image acquisition module 100, a feature extraction module 200, an image restoration module 300, and a depth completion module 400, where:

an image obtaining module 100, configured to obtain scene depth maps of different modalities in a KITTI dataset, where the scene depth maps of different modalities include a sparse depth map and an RGB picture;

the feature extraction module 200 is configured to respectively extract features of the sparse depth map and the RGB picture by using a UNet network architecture-based coding and decoding network;

the image restoration module 300 is configured to establish and use an attention-based image convolution network and an attention-based self-convolution fusion network to perform image restoration according to the features of the sparse depth map and the RGB image, respectively, so as to obtain a low-frequency depth map and a high-frequency depth map;

and the depth completion module 400 is configured to generate a dense depth map by adopting a pixel-level addition method according to the low-frequency depth map and the high-frequency depth map, and complete scene depth completion.

In order to solve the technical problem that attention is focused on feature fusion of different modes or construction of difference branches and influence of particularity of data of different modes on a result is ignored to cause low precision of a completed result in the prior art, the system firstly selects scene depth maps of different modes in KITTI data set through an image acquisition module 100, then respectively extracts features of two heterogeneous data, namely a sparse depth map and an RGB picture, through a feature extraction module 200 by adopting an independent encoding and decoding network based on a UNet network architecture, then respectively recovers high-frequency components and low-frequency components of a dense depth map through an image recovery module 300 by adopting an attention-based image convolution network and an attention-based self-convolution fusion network, and achieves the purposes of sparse depth information domination in a low-frequency component recovery process and RGB picture information domination in a high-frequency component recovery process, and finally, the depth completion module 400 is used for combining the depth completion module and the depth completion module to obtain a dense depth map, so that the precision of the depth completion result is effectively improved.

Based on the first aspect, in some embodiments of the present invention, as shown in fig. 1, the image restoration module 300 includes a low frequency processing sub-module 310, configured to construct a graph model according to features of a sparse depth map, input the graph model into an attention-based graph convolution network, and convolve an output feature map with a standard convolution layer aggregation map to obtain a low frequency depth map.

In image restoration processing, a double-branch structure is adopted to restore the high-frequency component and the low-frequency component of the dense depth map respectively, and the double-branch structure comprises a low-frequency branch and a high-frequency branch.

In the low-frequency branch, an attention-based graph convolution network for self-adaptive graph construction is provided as a low-pass filter, and low-frequency information contained in sparse depth information is fully extracted. And recovering the low-frequency component of the dense depth map by using the attention-based graph convolution network through the low-frequency processing sub-module 310 by using the low-frequency information in the sparse depth data to obtain the low-frequency depth map.

Based on the first aspect, in some embodiments of the present invention, as shown in fig. 1, the image restoration module 300 includes a high-frequency processing sub-module 320, configured to adaptively select an information fusion region according to the features of the sparse depth map and the RGB picture by an attention-based self-convolution fusion network, and allow the network to adaptively control the degree of fusion of the features of the sparse depth map and the RGB picture through spatial diversity convolution, so as to obtain a high-frequency depth map.

The high-frequency processing sub-module 320 utilizes high-frequency information in RGB image data by adopting a self-convolution fusion network based on attention, and the problem of insufficient utilization of color information caused by excessive preference of a model to sparse depth information is solved, so that a high-frequency depth map is obtained. In the high-frequency branch, a self-convolution fusion model based on attention is provided to selectively and adaptively fuse different modal data in a space variation mode, so that the method is more effective compared with the prior data fusion mode.

As shown in fig. 1, in some embodiments of the present invention, based on the first aspect, the scene depth completion system based on deep learning further includes a completion evaluation module 500, configured to compare the dense depth map with a preset comparison map, and generate a comparison evaluation result.

In order to effectively control the depth completion effect, the completion evaluation module 500 compares the dense depth map with a preset contrast map, and evaluates the depth completion effect of the image by using a loss function to generate a contrast evaluation result.

As shown in fig. 2 to 7, in a second aspect, an embodiment of the present invention provides a scene depth completion method based on deep learning, including the following steps:

s1, obtaining scene depth maps of different modes in the KITTI data set, wherein the scene depth maps of different modes comprise a sparse depth map and an RGB picture;

s2, respectively extracting the characteristics of the sparse depth map and the RGB picture by adopting a coding and decoding network based on a UNet network architecture;

in some embodiments of the present invention, the sparse depth map and the RGB picture belong to different modal information, and different from the strategy of extracting features after the anomaly information is directly fused at a low level, the method uses two independent "encoding and decoding networks" ED_sAnd ED_rgbThe feature extraction is performed on the different modal data respectively, which helps to avoid the different modal data from interfering with each other in the feature extraction process. The "codec network" uses UNet network architecture, as shown in fig. 3, in which (a) shows a specific structure of the "codec network" for feature extraction, and (b) shows a network ED for extracting RGB picture features_rgbAnd (c) represents a method for extractingNetwork ED for taking sparse depth map features_s，ED_sAnd ED_rgbParameters are not shared.

S3, establishing and adopting an attention-based image convolution network and an attention-based self convolution fusion network to respectively carry out image restoration according to the characteristics of the sparse depth map and the RGB image so as to obtain a low-frequency depth map and a high-frequency depth map;

further, a graph model is constructed according to the characteristics of the sparse depth map, the graph model is input into an attention-based graph convolution network, and the characteristic map output by convolution of a standard convolution layer aggregate map is adopted to obtain a low-frequency depth map.

Furthermore, the attention-based self-convolution fusion network adaptively selects an information fusion area according to the characteristics of the sparse depth map and the RGB picture, and the network adaptively controls the fusion degree of the characteristics of the sparse depth map and the RGB picture through space metamorphic convolution so as to obtain the high-frequency depth map.

In some embodiments of the invention, a dual-branch structure is employed to recover the high-frequency component and the low-frequency component of the dense depth map, respectively, the dual-branch structure including a low-frequency branch and a high-frequency branch. The purpose of the low-frequency branch is to fully utilize the low-frequency information in the sparse depth data and recover the low-frequency component of the dense depth map, and the structure of the low-frequency branch is shown in fig. 4. The low frequency branch is essentially a low pass filter. Although the standard convolution can be regarded as a filter, its parameters are the result of network learning under data driving, and it is difficult to restrict it to a specific filter, such as a low-pass filter. To achieve this, the present invention designs an "attention-based graph convolution network" to be equivalent to a "low-pass filter", and proposes a "graph generation network" that adaptively constructs edges of a graph model in a learning method, as shown in fig. 5. "graph generation network": the most direct way to construct the edge is to calculate the three-dimensional space coordinates of the point cloud and generate the edge according to the nearest neighbor relation. However, this method is affected by the accuracy of the position (depth value) in the point cloud space, which is the target for optimization. Therefore, we propose a strategy of adaptively learning edges, namely, the construction of the edges is not simply determined by inaccurate position relations, but is obtained by neural network learning.

The graph model may be generally represented as G ═ V, E, where V represents nodes and E represents edges. Each node V e V has connection relation with partial nodes in the neighborhood, and the set of the nodes connected with the node V is called N_vThe connection relationship between them is called edge e_v. In addition, the initial state of node v and its state at time t are referred to as

The node state will change dynamically as the graph convolution progresses.

In the invention, each pixel point on the original resolution of the picture is regarded as a node, and each node V ∈ V is in the initial state

Is an "encoding-decoding network" ED_sOutput feature map F_sCorresponding N-dimensional feature vector, its adjacent node N_vAnd its corresponding edge e_vAre obtained through a "graph generation network". It should be noted that all edges mentioned here are directed edges, i.e. the graph model we construct is a directed acyclic graph.

Specifically, the feature map F_sAs input, a coarse depth map D is output by a standard convolution_coarseThen, calculating the three-dimensional space position of each node according to the camera internal parameter K: z ═ D_coarse、

Let P denote a position map representing XYZ_XYZ. Finally, P is added_XYZAnd characteristic diagram F_rgbConnecting, obtaining G representing node connection relation after standard convolution and rounding_E. Setting each node to have n edges, then G_EThere are 2n channels. E.g. having e_i,j∈G_EThen e_i,jIs a 1 × 2n vector:

the vector representing the n-edge of the node at (i, j) has components in both xy directions.

In the invention, in order to accelerate the graph convolution process by using GPU parallel operation, the number of each node edge is set to be 8, namely G_EThe number of channels is 16.

After obtaining the graph model, we send it as input to the attention-based graph convolution network. The graph convolution process can be formulated as:

wherein, P_XYZIs a position diagram representing the three-dimensional spatial coordinates of the nodes; f_sThe feature graph is extracted from the sparse depth map and represents the initial state of the node; | represents a connection feature graph; MLP denotes a multilayer perceptron; alpha is alpha^i,jRepresenting the attention value between node i and node j;

respectively, the aggregation of information and the update of the state at the time of the t +1 th graph convolution.

Finally, the feature map output by convolution of the standard convolution layer aggregate map of 1 multiplied by 1 is used to obtain the low-frequency component map D of the dense depth map_LF。

The purpose of the high-frequency branch is to fully utilize high-frequency information in RGB image data and relieve the problem of insufficient utilization of color information caused by excessive preference of a model for sparse depth information. To achieve this, we model the high frequency components as the residuals of the low frequency depth map and the dense depth map. In this way, on one hand, when the model estimates the low-frequency component in the low-frequency branch, the high-frequency branch can be constrained to be a high-pass filter, and on the other hand, the optimization target of the high-frequency branch is changed from the original dense depth map to residual, which is helpful for relieving the preference of the model for sparse depth data. The high frequency branch structure is shown in fig. 6.

The feature maps from different modalities are subjected to multi-scale fusion, and it should be noted that, because the distribution of high-frequency information in the RGB picture and the sparse depth map is regional and is concentrated on the geometric boundary and the occlusion region, the fusion of the information of the two modalities is a problem of spatial variation, that is, different spatial regions need to adopt different data fusion modes. General information fusion strategies are as follows: and adding or connecting the feature maps extracted from the different modal data, and then performing standard convolution processing. The data fusion method is space-invariant, and different fusion methods are difficult to adopt for different areas. The invention proposes an attention-based Self-convolution Fusion network, namely a Self-Fusion module in figure 6. Specifically, the model can adaptively select an information fusion region through an attention mechanism, and meanwhile, the spatial diversity convolution enables the network to adaptively control the fusion degree of different modal data. Therefore, the RGB picture information is effectively utilized as far as possible while the sparse depth information is effectively utilized, and a better result is obtained.

Attention-based self-convolution fusion can be represented as shown in FIG. 7, where feature maps F from different modalities are first convolved to generate a spatial attention map F_attThen, the attention map is multiplied by the feature map, and the information fusion area is adaptively selected with attention, resulting in F'. F ' on one hand, a feature map F ' with the abnormal state information uniformly distributed on the channels is obtained through channel mixing '_CS(ii) a On the other hand, a series of spatially varying convolution kernels W are obtained by convolution, and the convolution kernels are used for carrying out grouped convolution on the multi-modal information, and the graph is divided into 3 groups. Result of packet convolution F_fuseNamely a feature map after multi-modal information fusion, and the feature map is used for continuing to fuse with multi-modal information on other scales.

And S4, generating a dense depth map by adopting a pixel-level addition method according to the low-frequency depth map and the high-frequency depth map, and completing scene depth completion.

In order to solve the technical problem that in the prior art, attention is focused on fusion of different modal characteristics or construction of differential branches, influence of particularity of different modal data on a result is ignored, and accordingly completion result precision is low, the method firstly selects scene depth maps of different modes in a KITTI data set, then, an independent encoding and decoding network based on the UNet network architecture is adopted to respectively extract the characteristics of two heterogeneous data, namely a sparse depth map and an RGB picture, then, a graph convolution network based on attention and a self convolution fusion network based on attention are established and adopted to respectively recover the high-frequency component and the low-frequency component of the dense depth map, so that the purposes of optimizing sparse depth information in the low-frequency component recovery process and optimizing RGB picture information in the high-frequency component recovery process are achieved, and finally the two are combined to obtain the dense depth map, and the precision of the depth completion result is effectively improved. The invention adopts a double-branch structure to respectively recover the high-frequency component and the low-frequency component of the dense depth map, wherein the double-branch structure comprises a low-frequency branch and a high-frequency branch. In the low-frequency branch, an attention-based graph convolution network for self-adaptive graph construction is provided as a low-pass filter, and low-frequency information contained in sparse depth information is fully extracted. In the high-frequency branch, a self-convolution fusion model based on attention is provided to selectively and adaptively fuse different modal data in a space variation mode, so that the method is more effective compared with the prior data fusion mode.

In order to effectively control the depth completion effect, the dense depth map is compared with a preset comparison map, the loss function is adopted to evaluate the image depth completion effect, and a comparison evaluation result is generated.

The loss function employed by the present invention is:

where D represents a semi-dense real depth map,

representing an estimated dense depth map, H_dAnd W_dRespectively representing picturesLength and width of (2); if the pixel value is available in D, take

Is 1, otherwise is 0; v denotes all

The number of non-0 pixels.

The invention uses the outdoor data set KITTI as our experimental data to carry out experimental verification on the depth completion method for respectively estimating high and low frequencies. The KITTI data set is a large outdoor automatic driving data set and is also a main reference for deep completion. It consists of more than 85000 color images and corresponding sparse depth maps, and semi-dense depth true value maps, 79000 for training, 6000 for verification, and 1000 for testing. We evaluated all images during training using the officially recommended 100 validation images and cropped to 256 x 1216 dimensions. Our model was implemented using the PyTorch0 framework and trained using 4 titarntxgpus. To maintain versatility, we used an ADAM optimizer, β₁＝0.9,β₂Weight attenuation of 10 at 0.99^-6. The model trained 30 batches, with a batch size of 8, a learning rate starting from 1e-3, multiplying every 5 batches by an attenuation factor of 0.2.

Our experimental platform includes pytorch1.0, python3.8, ubuntu16.04 and NVIDIA TITAN RTX of 4 24GB memories, ADAM optimizer was used during training, learning rate was 0.001, half of each 5 epochs was attenuated, and 30 epochs were trained in total.

For the evaluation index of the experiment, we adopted the following common indexes:

and evaluating the KITTI data set by using the most standard indexes, wherein the evaluation indexes comprise: the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the root mean square error (iRMSE) of the inverse depth and the mean absolute error (iMAE) of the inverse depth are reasonably and comprehensively evaluated through the indexes to obtain a comprehensive evaluation result, and then the effect of depth completion is intuitively and clearly understood.

The present invention also performs a number of ablation experiments to demonstrate the effectiveness of the various elements presented in the present method.

Effectiveness analysis of low and high frequency branches: the effectiveness of balancing the high and low frequency branches of the data preferences is first demonstrated. The key to constructing the high and low frequency model is to perform low pass filtering in the low frequency branch and model the high frequency branch as a residual. Therefore, we constructed this model step by step, resulting in three experiments (a), (b), (c). As shown in table 1, the single-branch network (a) contains only one high-frequency branch, which fuses the features of different modes by pixel-level addition. Then, (b) model the high frequency branch as residual by adding a simple "low frequency branch" where Conv is only a standard convolution layer, i.e. (b) the "low frequency branch" in the experiment has no low pass filtering function. Experiments have shown that (b) this "residual learning" structure eases data preference and improves results. In addition, (c) the experiment adds the low-pass filtering module AGGAN we propose, which converts the low-frequency branch into a branch with true low-pass filtering function and significantly improves the performance. By comparing the three experiments (a), (b) and (c), we can prove the effectiveness of the low-pass filtering module AGGAN and the effectiveness of modeling the high-frequency branch as a residual.

Effectiveness analysis of attention-based graph convolution network (AGGAN): in (c), we use the nearest neighbor (kNN) method to generate edges of the graph model from the spatial locations of the point clouds. In contrast, (d) the edges of the graph model are generated adaptively by our proposed graph generation network (AGN), which utilizes spatial information from sparse depths and geometric information from color images. As shown in Table I, the error of (d) was 4.6mm smaller than that of (c). This indicates that our proposed "graph generation network" is a more efficient edge generation strategy. In addition, the number of layers of the Graph Attention Convolution (GAC) is a hyper-parameter to be determined. From the errors of (g), (h), and (i), we find that the depth values are more accurate as the number of layers increases. However, the improvement of the deep network comes at the cost of significantly increasing the computational cost of the model. We set the final number of layers of the graph attention convolution to 3 for accuracy and efficiency.

Attention-based self convolution fusion (ASF) module analysis: previous work has typically fused features extracted from different modalities using a concatenation or pixel-by-pixel addition method. In table 1, experiments (g), (e), (d), (f) compare the effect of the proposed ASF with other simple fusion strategies, where (e) denotes connectivity, (d) denotes pixel-level addition, and (f) denotes ASF without attention mechanism. The ASF can be found to achieve the lowest error on RMSE through experiments, which proves the effectiveness of the ASF strategy. When we weighted the features with spatial attention, the error of (g) is significantly lower than (f), indicating the effectiveness of the attention mechanism. In addition, flexibility of sharing convolution kernels in channel dimensions is analyzed, and it is appropriate to obtain that each 16 channels share one convolution kernel based on experimental data, so that performance can be maintained, and calculation cost can be reduced.

Table 1:

as shown in fig. 3, in a third aspect, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs; a processor 102. The one or more programs, when executed by the processor 102, implement the method of any of the second aspects as described above.

Also included is a communication interface 103, and the memory 101, processor 102 and communication interface 103 are electrically connected to each other, directly or indirectly, to enable transfer or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, and the processor 102 executes the software programs and modules stored in the memory 101 to thereby execute various functional applications and data processing. The communication interface 103 may be used for communicating signaling or data with other node devices.

The Memory 101 may be, but is not limited to, a Random Access Memory 101 (RAM), a Read Only Memory 101 (ROM), a Programmable Read Only Memory 101 (PROM), an Erasable Read Only Memory 101 (EPROM), an electrically Erasable Read Only Memory 101 (EEPROM), and the like.

The processor 102 may be an integrated circuit chip having signal processing capabilities. The Processor 102 may be a general-purpose Processor 102, including a Central Processing Unit (CPU) 102, a Network Processor 102 (NP), and the like; but may also be a Digital Signal processor 102 (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware components.

In the embodiments provided in the present application, it should be understood that the disclosed method and system and method can be implemented in other ways. The method and system embodiments described above are merely illustrative, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by the processor 102, implements the method according to any one of the second aspects described above. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory 101 (ROM), a Random Access Memory 101 (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. The utility model provides a scene degree of depth completion system based on degree of deep learning, its characterized in that includes image acquisition module, feature extraction module, image recovery module and degree of depth completion module, wherein:

2. The deep learning-based scene depth completion system according to claim 1, wherein the image restoration module includes a low-frequency processing sub-module, configured to construct a graph model according to the features of the sparse depth map, input the graph model into an attention-based graph convolution network, and convolve the output feature map with a standard convolution layer aggregation map to obtain a low-frequency depth map.

3. The deep learning-based scene depth completion system according to claim 1, wherein the image restoration module comprises a high-frequency processing sub-module, configured to select an information fusion region adaptively according to the features of the sparse depth map and the RGB picture by the attention-based self-convolution fusion network, and enable the network to adaptively control the degree of fusion of the features of the sparse depth map and the RGB picture through spatial diversity convolution, so as to obtain the high-frequency depth map.

4. The scene depth completion system based on deep learning of claim 1, further comprising a completion evaluation module for comparing the dense depth map with a preset comparison map to generate a comparison evaluation result.

5. A scene depth completion method based on deep learning is characterized by comprising the following steps:

6. The scene depth completion method based on deep learning of claim 5, wherein the method for establishing and using the attention-based convolution network and the attention-based self-convolution fusion network to perform image restoration according to the features of the sparse depth map and the RGB picture respectively to obtain the low-frequency depth map and the high-frequency depth map comprises the following steps:

7. The scene depth completion method based on deep learning of claim 5, wherein the method for establishing and using the attention-based convolution network and the attention-based self-convolution fusion network to perform image restoration according to the features of the sparse depth map and the RGB picture respectively to obtain the low-frequency depth map and the high-frequency depth map comprises the following steps:

8. The scene depth completion method based on deep learning of claim 5, further comprising the following steps:

9. An electronic device, comprising:

a memory for storing one or more programs;

a processor;

the one or more programs, when executed by the processor, implement the method of any of claims 5-8.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 5-8.