CN115170746B - Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Info

Publication number
CN115170746B
Authority
CN
China
Prior art keywords
point cloud
semantic
representing
scales
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211087276.9A
Other languages
Chinese (zh)
Other versions
CN115170746A (en)
Inventor
任胜兵
彭泽文
陈旭洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202211087276.9A priority Critical patent/CN115170746B/en
Publication of CN115170746A publication Critical patent/CN115170746A/en
Application granted granted Critical
Publication of CN115170746B publication Critical patent/CN115170746B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-view three-dimensional reconstruction method, system and equipment based on deep learning. Multiple multi-view images are acquired, and multi-scale semantic features are extracted from them to obtain feature maps of multiple scales; multi-scale semantic segmentation is performed on the feature maps to obtain semantic segmentation sets of multiple scales; the multi-view images are reconstructed by a supervised three-dimensional reconstruction method to obtain an initial depth map; depth maps of multiple scales are obtained from the multi-scale semantic segmentation sets and the initial depth map; point cloud sets of multiple scales are constructed; the point cloud sets of the various scales are optimized with radius filtering of different radii to obtain optimized point cloud sets; reconstruction at the different scales is performed from the optimized point cloud sets to obtain three-dimensional reconstruction results of different scales; and the three-dimensional reconstruction results of each scale are spliced and fused. The invention makes full use of semantic information at each scale and improves the accuracy of three-dimensional reconstruction.

Description

Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-view three-dimensional reconstruction method, a multi-view three-dimensional reconstruction system and multi-view three-dimensional reconstruction equipment based on deep learning.
Background
In deep learning three-dimensional reconstruction, a computer builds a neural network, trains it on a large amount of image data and three-dimensional model data, and learns the mapping relation between images and three-dimensional models, so that three-dimensional reconstruction of a new image target is realized. Compared with traditional methods such as the 3D Morphable Model (3DMM) and Structure from Motion (SfM), deep learning three-dimensional reconstruction can introduce learned global semantic information into image reconstruction, which to a certain extent overcomes the limitation that traditional reconstruction methods perform poorly in weakly lit and weakly textured areas.
Most existing deep learning three-dimensional reconstruction methods are single-scale: objects of different sizes in the image are reconstructed in the same way. Single-scale reconstruction maintains good accuracy and speed in environments of low scene complexity with few small objects, but in environments with complex scenes and many objects of various scales, reconstruction accuracy for small-scale objects is often insufficient. Moreover, only high-level features are used, and the low-level detail information of the image is not fully exploited.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method, a system and equipment for multi-view three-dimensional reconstruction based on deep learning, which can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
In a first aspect, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:
acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of multiple scales;
performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
obtaining depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
constructing a point cloud set with multiple scales based on the depth maps with multiple scales;
according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;
reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;
and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
the method can extract the features of different scales by extracting the multi-scale semantic features of a plurality of multi-view images, can obtain the feature maps of various scales, can perform multi-scale semantic segmentation on the feature maps of various scales, and can aggregate the semantic information of each scale, thereby enriching the semantic information of each scale; semantic guidance is respectively carried out on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained; the method comprises the steps of constructing a point cloud set with various scales by using the obtained depth maps with various scales, optimizing by adopting different radius filtering according to the scales of the point cloud set, using the optimized point cloud set for reconstruction with different scales, and fusing three-dimensional reconstruction results to obtain more accurate three-dimensional reconstruction results. Therefore, the method can fully utilize semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
According to some embodiments of the present invention, the performing multi-scale semantic feature extraction on a plurality of the multi-view images to obtain feature maps of multiple scales includes:
performing multilayer feature extraction on the multiple multi-view images through a ResNet network to obtain original feature maps with multiple scales;
and respectively connecting the original characteristic map of each scale with channel attention so as to carry out importance weighting on the original characteristic map of each scale through a channel attention mechanism and obtain characteristic maps of various scales.
According to some embodiments of the present invention, the importance weighting is performed on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, including:
compressing the original characteristic diagram of each scale through a compression network to obtain a one-dimensional characteristic diagram corresponding to the original characteristic diagram of each scale;
inputting the one-dimensional characteristic diagram into a full-connection layer through an excitation network to perform importance prediction, and obtaining the importance of each channel;
and exciting the importance of each channel to the one-dimensional characteristic diagram of the original characteristic diagram of each scale through an excitation function to obtain characteristic diagrams of various scales.
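For illustration, the following is a minimal PyTorch sketch of such a squeeze-and-excitation style channel attention module; the module name, the reduction ratio and the example sizes are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention.

    Squeeze: global average pooling compresses each H x W map to one
    value per channel (the one-dimensional feature map). Excitation:
    fully connected layers predict a per-channel importance in (0, 1),
    which then reweights the original feature map.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # H x W -> 1 x 1 per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # channel importance in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)       # one-dimensional feature
        w = self.excite(w).view(b, c, 1, 1)  # predicted importance
        return x * w                         # excite the original map

# Example: reweight a batch of 64-channel feature maps from the backbone
weighted = ChannelAttention(64)(torch.randn(2, 64, 32, 32))
```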
According to some embodiments of the present invention, the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain a semantic segmentation set of multiple scales includes:
clustering the feature maps of multiple scales through non-negative matrix factorization to obtain semantic segmentation sets of multiple scales; wherein the non-negative matrix factorization solves

$$\min_{P,\,Q \ge 0} \; \lVert V - PQ \rVert_F^2$$

where the feature maps of the multiple scales are concatenated and reshaped into a matrix V with HW rows and C columns, P represents the coefficient matrix with HW rows and K columns, Q represents the basis matrix with K rows and C columns, H represents the image height, W represents the image width, K represents the non-negative matrix factorization factor giving the number of semantic clusters, C represents the feature dimension of each pixel, and the subscript F denotes the Frobenius (non-induced) norm.
According to some embodiments of the present invention, the obtaining the depth maps of the plurality of scales based on the semantic segmentation sets of the plurality of scales and the initial depth map comprises:
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting reference points from the reference image, acquiring the semantic category corresponding to each reference point in the semantic segmentation set, and acquiring the depth value corresponding to each reference point on the initial depth map;

the number of reference points is chosen by the following formula:

$$N_j = \frac{HW}{t} \cdot \frac{c_j}{\sum_i c_i}$$

where N_j represents the number of reference points selected for the jth segmentation set, H represents the height of the multi-view image, W represents its width, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, and the sum over i runs over the category counts of all the semantic segmentation sets;

based on each reference point, the matching point of each reference point on the image to be matched is obtained through the following formula:

$$P_i' = K\,T\,\big(D(P_i)\,K^{-1}P_i\big)$$

where P_i' represents the matching point of the ith reference point on the image to be matched, K represents the camera intrinsics, T represents the camera extrinsics, and D(P_i) represents the depth value of reference point P_i in the reference image on the initial depth map;

the semantic category corresponding to each matching point is obtained, and the initial depth map is corrected at each scale by minimizing a semantic loss function to obtain the depth maps of multiple scales, the semantic loss function L_sem being calculated as:

$$L_{sem} = \frac{1}{N}\sum_{i=1}^{N} M_i \cdot \Delta S_i$$

where \Delta S_i represents the difference between the semantic information of the ith reference point and that of the ith matching point, M_i represents a mask, and N represents the number of reference points.
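The following NumPy sketch puts the two reconstructed formulas above together; the 0/1 category-disagreement used for the difference ΔS_i and all helper names are assumptions made for illustration.

```python
import numpy as np

def reproject(p_ref, depth, K, T):
    """Warp a reference pixel p_ref = (u, v) into the image to be
    matched: back-project with its depth value, apply the extrinsics
    T (4x4, reference -> matched view), project with intrinsics K."""
    ray = np.linalg.inv(K) @ np.array([p_ref[0], p_ref[1], 1.0])
    X = np.append(depth * ray, 1.0)  # homogeneous 3D point
    x = K @ (T @ X)[:3]              # project into the matched view
    return x[:2] / x[2]              # matching point P_i'

def semantic_loss(ref_cats, match_cats, mask):
    """Masked mean disagreement between the semantic category of each
    reference point and that of its matching point (one simple choice
    of dS_i; the patent does not fix the exact distance)."""
    delta = (ref_cats != match_cats).astype(float)
    return float(np.mean(mask * delta))
```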
According to some embodiments of the invention, the constructing a point cloud set of multiple scales based on the depth maps of multiple scales comprises:
constructing the point cloud set of each scale from the depth map of each scale according to the following expressions:

$$x = \frac{u \cdot z}{f_x}, \qquad y = \frac{v \cdot z}{f_y}, \qquad z = D(u, v)$$

where u represents the abscissa of the depth map, v represents its ordinate, f_x and f_y represent the camera focal lengths obtained from the camera parameters, D(u, v) represents the depth value, and x, y and z represent the coordinates of the transformed point cloud (with u and v measured relative to the principal point).
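As a sketch of this back-projection, assuming a standard pinhole model in which the principal point (cx, cy) is made explicit (the reconstructed expression above leaves it implicit):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a point cloud:
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a flat 4x4 depth map with focal lengths from the intrinsics
pts = depth_to_points(np.full((4, 4), 2.0), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```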
According to some embodiments of the present invention, the optimizing the point cloud sets of multiple scales by using different radius filtering according to the scales of the point cloud sets to obtain an optimized point cloud set includes:
acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in the point cloud set according to the scale of the point cloud set by the following formula:

$$r_l = a \cdot t^{\,l}$$

where r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales, a represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud set.
In a second aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction system, where the deep learning-based multi-view three-dimensional reconstruction system includes:
the characteristic diagram acquisition unit is used for acquiring multi-view images, and performing multi-scale semantic feature extraction on the multi-view images to acquire characteristic diagrams of multiple scales;
the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to acquire a semantic segmentation set with various scales;
the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
the depth map acquisition unit is used for acquiring depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
the point cloud set acquisition unit is used for constructing point cloud sets with various scales based on the depth maps with various scales;
the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;
a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Compared with the prior art, the second aspect of the invention has the following beneficial effects:
the feature map acquisition unit of the system can extract deep features and acquire feature maps of multiple scales by performing multi-scale semantic feature extraction on multiple multi-view images, performs multi-scale semantic segmentation on the feature maps of multiple scales by the semantic segmentation set acquisition unit, aggregates semantic information of each scale, and enriches the semantic information of each scale; the depth map acquisition unit is used for respectively carrying out semantic guidance on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and the accurate depth map of multiple scales is obtained; a point cloud set acquisition unit of the system constructs a point cloud set with multiple scales by using the acquired depth maps with multiple scales, different radius filtering is adopted for optimization according to the scales of the point cloud set through a radius filtering unit, reconstruction with different scales is carried out on the basis of the optimized point cloud set through a reconstruction result acquisition unit, and then a three-dimensional reconstruction result is fused through a reconstruction result fusion unit to obtain a more accurate three-dimensional reconstruction result. Therefore, the system can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
In a third aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction apparatus, including at least one control processor and a memory, which is in communication connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform a method for deep learning based multi-view three-dimensional reconstruction as described above.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute a deep learning-based multi-view three-dimensional reconstruction method as described above.
It is to be understood that the advantageous effects of the third aspect to the fourth aspect compared to the related art are the same as the advantageous effects of the first aspect compared to the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a deep learning-based multi-view three-dimensional reconstruction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a depth residual network in accordance with one embodiment of the present invention;
FIG. 3 is a schematic diagram of a non-negative matrix factorization of an embodiment of the present invention;
FIG. 4 is a block diagram of multi-scale semantic segmentation in accordance with one embodiment of the present invention;
fig. 5 is a structural diagram of a deep learning-based multi-view three-dimensional reconstruction system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention.
In the description of the present invention, terms such as first and second are used only to distinguish technical features, and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to, for example, the upper, lower, etc., is indicated based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that unless otherwise explicitly defined, terms such as arrangement, installation, connection and the like should be broadly understood, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
For the convenience of understanding of those skilled in the art, the terms in the present embodiment are explained:
the deep learning three-dimensional reconstruction method comprises the following steps: the three-dimensional reconstruction method for deep learning is that a neural network is built by using a computer, training is carried out through a large amount of image data and three-dimensional model data, and the mapping relation from an image to a three-dimensional model is learned, so that the three-dimensional reconstruction of a new image target is realized. Compared with the traditional method for reconstructing three-dimensional information such as 3DMM and the method for reconstructing three-dimensional information by SFM, the three-dimensional reconstruction method for deep learning can introduce some global semantic information into image reconstruction, thereby overcoming the limitation that the traditional reconstruction method is poor in reconstruction in weak illumination and weak texture areas to a certain extent, wherein the SFM algorithm is an off-line algorithm for three-dimensional reconstruction based on various collected disordered pictures; the 3DMM, a three-dimensional deformable face model, is a general three-dimensional face model, and represents a face by using fixed points.
Current deep learning three-dimensional reconstruction methods can be broadly divided into supervised three-dimensional reconstruction methods (for example, MVSNet, CVP-MVSNet and PatchmatchNet in the prior art) and self-supervised three-dimensional reconstruction methods (for example, JDACS-MS in the prior art). Supervised methods need ground-truth values for training and achieve high precision, but are difficult to apply in scenes where ground truth is hard to acquire. Self-supervised methods need no ground-truth training data and apply widely, but their precision is comparatively low.
Semantic segmentation: semantic segmentation is classification at the pixel level; pixels belonging to the same class are grouped into one class, so semantic segmentation understands an image from the pixel level. For example, pixels with different semantics are marked with different colors, and pixels belonging to animals are all classified into the same class. The segmented semantic information can guide image reconstruction and improve reconstruction precision. Here, semantic segmentation is performed by clustering, grouping pixels that belong to the same class into the same cluster.
Depth map: also called a distance image, an image in which the distance (depth) from the image pickup device to each point in the scene is taken as the pixel value.
Point cloud: the set of point data on the appearance surface of an object; it contains the three-dimensional coordinate information, color and other information of the object, and image reconstruction can be realized from point cloud data.
Non-negative Matrix Factorization (NMF): a matrix factorization method under the constraint that all elements of the matrices are non-negative. Many analysis methods solve practical problems by matrix decomposition, such as PCA (principal component analysis), ICA (independent component analysis), SVD (singular value decomposition) and VQ (vector quantization). In all of these, the original large matrix V is approximately decomposed into a low-rank form V = WH. Their common feature is that the elements of the factors W and H may be positive or negative; even if the input matrix is entirely positive, traditional rank-reduction algorithms cannot guarantee the non-negativity of the factors. Mathematically, negative values in the decomposition results are perfectly correct from a computational point of view, but negative elements often make no sense in practical problems.
The deep learning three-dimensional reconstruction method builds a neural network with a computer, trains it on a large amount of image data and three-dimensional model data, and learns the mapping relation between images and three-dimensional models, so that three-dimensional reconstruction of a new image target is realized. Compared with traditional methods such as 3DMM and SfM, deep learning three-dimensional reconstruction can introduce learned global semantic information into image reconstruction, which to a certain extent overcomes the limitation that traditional methods reconstruct poorly in weakly lit and weakly textured areas.
Most existing deep learning three-dimensional reconstruction methods are single-scale: objects of different sizes in the image are reconstructed in the same way. Single-scale reconstruction maintains good accuracy and speed in environments of low scene complexity with few fine objects, but in environments with complex scenes and many objects of various scales, reconstruction accuracy for small-scale objects is likely to be insufficient. Moreover, only high-level features are used, and the low-level detail information of the image is not fully exploited.
To solve these problems, the invention performs multi-scale semantic feature extraction on multiple multi-view images, extracting features of different scales and obtaining feature maps of multiple scales; multi-scale semantic segmentation of those feature maps aggregates and enriches the semantic information of each scale. Semantic information of each scale in the multi-scale semantic segmentation sets guides the initial depth map, which is continuously corrected to give accurate depth maps of multiple scales. From these depth maps, point cloud sets of multiple scales are constructed and optimized with radius filtering whose radius depends on the scale of the point cloud set; the optimized point cloud sets are used for reconstruction at the different scales, and the three-dimensional reconstruction results are fused into a more accurate result. Semantic information of every scale can thus be fully utilized, and the accuracy of three-dimensional reconstruction improved.
Referring to fig. 1, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:
and S100, acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales.
Specifically, a plurality of multi-view images are acquired, and the object to be recognized can be subjected to image acquisition at various angles in all directions through image acquisition equipment such as a camera and an image scanner, so that the plurality of multi-view images are obtained. For example, when the multi-scale semantic feature extraction needs to be performed on a plurality of multi-view images, an image acquisition device such as a camera may be used to obtain the plurality of multi-view images.
In the embodiment, multilayer feature extraction is performed on a plurality of multi-view images through a ResNet network to obtain original feature maps with various scales;
respectively connecting the original feature map of each scale with channel attention, and performing importance weighting on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, specifically:
compressing the original characteristic diagram of each scale through a compression network to obtain a one-dimensional characteristic diagram corresponding to the original characteristic diagram of each scale;
inputting the one-dimensional characteristic diagram into a full-connection layer through an excitation network to perform importance prediction, and obtaining the importance of each channel;
and exciting the importance of each channel to the one-dimensional characteristic diagram of the original characteristic diagram of each scale through an excitation function to obtain characteristic diagrams of various scales.
In this embodiment, the ResNet network is adopted to extract image features. Theoretically, the deeper a deep learning network is, the stronger its expressive power; but once a CNN reaches a certain depth, deepening it further degrades classification performance, slows network convergence, and lowers accuracy, and even enlarging the data set to combat overfitting does not restore classification performance and accuracy. The ResNet network adopts residual learning. Referring to FIG. 2, when the input is x, the feature to be learned is denoted H(x); the network is instead made to learn the residual F(x) = H(x) - x, so that the actually learned feature is H(x) = F(x) + x. This is because learning the residual is easier than learning the original feature directly. When the residual is 0, the stacked layers only perform an identity mapping, so network performance at least does not degrade; in practice the residual is not 0, so the stacked layers learn new features on top of the input features and perform better. The residual function is easier to optimize, and the number of network layers can be greatly deepened, so that deeper semantic information can be extracted. ResNet is clearly superior to networks such as VGG in efficiency, resource consumption and deep semantic feature extraction.
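A minimal PyTorch sketch of such a residual unit follows; the layer sizes are illustrative, not ResNet's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual unit: the stacked layers learn the residual
    F(x) = H(x) - x, and the identity shortcut restores
    H(x) = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # F(x) + x

# Example: a residual block leaves spatial size and channel count intact
y = ResidualBlock(32)(torch.randn(1, 32, 16, 16))
```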
After multi-layer feature extraction is carried out on a plurality of multi-view images through a ResNet network to obtain original feature maps of various scales, the original feature maps of each scale are respectively connected with channel attention, and importance weighting is carried out on the original feature maps of each scale through a channel attention mechanism to obtain the feature maps of various scales. The channel attention mechanism mainly comprises a compression network and an excitation network, and comprises the following specific processes:
let the dimension of the original feature map be H × W × C, where H is Height (Height), W is width (width), and C is channel number (channel). The compression network does the same thing as compressing H W into one-dimensional features by global averaging pooling. After H × W is compressed into one dimension, the corresponding one-dimensional parameter obtains the view of the whole H × W, and the sensing area is wider. And transmitting the one-dimensional characteristics obtained by the compression network to an excitation network, transmitting the one-dimensional characteristics to a full connection layer by the excitation network, predicting the importance of each channel to obtain the importance of different channels, and exciting the importance of different channels to the channels corresponding to the previous characteristic diagrams by a Sigmoid excitation function. The channel attention mechanism enables the network to pay attention to more effective semantic features, the weight of the semantic features is improved in an iterative mode, the feature extraction network extracts rich semantic features, and different semantic features are different in importance for semantic segmentation. The introduction of the channel attention mechanism can enable the network to pay attention to more effective features, inhibit inefficient features and improve the effectiveness of feature extraction.
In the prior art, feature extraction with a convolutional neural network such as VGG is limited by the number of extraction layers: deep-level feature extraction capability is insufficient and feature validity is not high. As the number of convolution layers increases, problems such as slow network convergence and low accuracy appear, the feature extraction capability remains insufficient, the extracted features differ in importance for image reconstruction, and the extraction of highly effective features is hard to guarantee. Therefore, in this embodiment, multi-scale semantic feature extraction on multiple multi-view images extracts deep features and obtains feature maps of multiple scales, and the introduction of the channel attention mechanism lets the network attend to more effective features, suppressing inefficient features and improving the effectiveness of feature extraction.
And S200, performing multi-scale semantic segmentation on the feature maps of various scales to obtain a semantic segmentation set of various scales.
Specifically, clustering is carried out on the characteristic graphs of multiple scales through nonnegative matrix factorization to obtain a semantic segmentation set of multiple scales; wherein the expression of the non-negative matrix factorization is as follows:
Figure 93964DEST_PATH_IMAGE001
the method comprises the following steps of mapping, connecting and remolding feature maps of various scales into a matrix V with HW rows and C columns, wherein the P represents a matrix with HW rows and K columns, the Q represents a matrix with K rows and C columns, the H represents a coefficient matrix, the W represents a base matrix, the K represents a non-negative matrix decomposition factor of a semantic cluster number, the C represents the dimension of each pixel, and the F represents the adoption of a non-inducible norm.
A typical matrix decomposition breaks a large matrix into several smaller matrices, but the elements of those factors can be positive or negative, whereas in the real world negative entries in a matrix formed from images, text and the like are meaningless; a decomposition whose elements are all non-negative is therefore meaningful. NMF requires the original matrix V to be non-negative; the matrix V can then be decomposed into the product of two smaller non-negative matrices, and such a decomposition exists, although it is not in general unique. For example, given a matrix V ∈ R^{m×n}, one looks for a non-negative matrix W ∈ R^{m×k} and a non-negative matrix H ∈ R^{k×n} such that V ≈ WH. The decomposition can be understood as follows: each column vector of the original matrix V is a weighted sum of all the column vectors of the left matrix W, the weighting coefficients being the elements of the corresponding column vector of the right matrix H; W is therefore called the basis matrix and H the coefficient matrix.

Referring to fig. 3, the N multi-scale feature maps are first concatenated and reshaped into an (HW, C) matrix V. The NMF is solved with the multiplicative update rules

$$Q \leftarrow Q \odot \frac{P^{\top}V}{P^{\top}PQ}, \qquad P \leftarrow P \odot \frac{VQ^{\top}}{PQQ^{\top}}$$

and V is decomposed by NMF into an (HW, K) matrix P and a (K, C) matrix Q, where K is the NMF factor representing the number of semantic clusters. Owing to the orthogonality constraint of the NMF (QQ^T = I), each row of the (K, C) matrix Q may be considered a C-dimensional cluster centre, corresponding to one of several objects in the view. The rows of the (HW, K) matrix P correspond to the positions of all pixels from the N multi-scale feature maps. In general, the factorization forces the product of each row of P with the matrix Q to better approximate the C-dimensional feature of the corresponding pixel in V. Thus, the semantic category of each position in the image is obtained from the P matrix.
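In practice such a factorization can also be obtained from an off-the-shelf solver. A sketch using scikit-learn's NMF (the initialisation and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(np.random.default_rng(0).standard_normal((64 * 64, 32)))  # (HW, C)
model = NMF(n_components=5, init="nndsvd", max_iter=400)
P = model.fit_transform(V)  # (HW, K): per-pixel cluster coefficients
Q = model.components_       # (K, C): rows act as cluster centres
labels = P.argmax(axis=1)   # semantic category of each pixel position
```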
Referring to FIG. 4, assume the extracted feature maps are F_1, F_2, ..., F_N. Each feature matrix F_i is semantically segmented by clustering (i.e., NMF non-negative matrix factorization) and decomposed into a segmentation set S_i. Because the receptive field of a high-level feature layer is large, its features are more abstract and attend more to the global picture, while a low-level feature layer has a small receptive field and attends more to details. The segmentation sets S_1, ..., S_N obtained by multi-scale semantic segmentation therefore span multiple levels from coarse to fine; the segmentation sets S1 to S3 in fig. 4 contain increasingly more detailed information. Each segmentation set S contains the semantic segmentation result of an input group of images (a reference image and the images to be matched); for example, different colors represent different semantic categories, and a segmentation set containing more detailed information (e.g., the segmentation set S3) will contain more semantic categories.
Most current deep learning three-dimensional reconstruction methods are single-scale, reconstructing objects of different sizes in the image in the same way. Single-scale reconstruction maintains good accuracy and speed in environments of low scene complexity with few small objects, but in environments with complex scenes and many objects of various scales, reconstruction accuracy for small-scale objects is easily insufficient; and using only high-level features leaves the low-level detail information of the image underexploited. Therefore, this embodiment performs multi-scale semantic segmentation on the feature maps of multiple scales and aggregates the semantic information of each scale, enriching it so that the detail information of the low-level feature layers can be fully utilized.
And S300, reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map.
Specifically, in the embodiment, a plurality of multi-view images are reconstructed by a supervised three-dimensional reconstruction method, so as to obtain an initial depth map.
According to the embodiment, the initial depth map is obtained through a supervised three-dimensional reconstruction method, and the reconstruction precision can be improved. Because the supervised three-dimensional reconstruction method has high precision, but needs a large amount of training truth value data, and under certain specific scenes (for example, underwater), the training truth value is difficult to acquire and is difficult to apply. Therefore, step S400 is required to perform semantic guidance on the initial depth map of this embodiment, and the supervised three-dimensional reconstruction method is converted into an unsupervised one, so as to implement the unsupervised three-dimensional reconstruction, thereby overcoming the inherent defects of the supervised three-dimensional reconstruction method.
The supervised three-dimensional reconstruction method in the present embodiment may be any supervised three-dimensional reconstruction method in the prior art, for example MVSNet (MVSNet: Depth Inference for Unstructured Multi-view Stereo), CVP-MVSNet (Cost Volume Pyramid Based Depth Inference for Multi-View Stereo) or PatchmatchNet (PatchmatchNet: Learned Multi-View Patchmatch Stereo), and is not described in detail in this embodiment.
And S400, obtaining the depth maps of various scales based on the semantic segmentation sets and the initial depth map of various scales.
Specifically, in this embodiment, semantic information is used as a supervision signal to combine with a supervised three-dimensional reconstruction method, and the image reconstruction is guided to obtain a depth map, which specifically includes the following processes:
acquiring a plurality of multi-view images through image acquisition equipment, and taking the plurality of multi-view images as input to obtain an initial depth map through a supervised three-dimensional reconstruction method;
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

$$N_j = \frac{HW}{t} \cdot \frac{c_j}{\sum_i c_i}$$

where N_j represents the number of reference points selected for the jth segmentation set, H represents the height of the multi-view image, W represents its width, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, and the sum over i runs over the category counts of all the semantic segmentation sets;

based on each reference point, the matching point of each reference point on the image to be matched is acquired through the following formula:

$$P_i' = K\,T\,\big(D(P_i)\,K^{-1}P_i\big)$$

where P_i' represents the matching point of the ith reference point on the image to be matched, K represents the camera intrinsics, T represents the camera extrinsics, and D(P_i) represents the depth value of reference point P_i in the reference image on the initial depth map;

the semantic category corresponding to each matching point is obtained, and the initial depth map is corrected at each scale by minimizing the semantic loss function to obtain depth maps of multiple scales, the semantic loss function L_sem being calculated as:

$$L_{sem} = \frac{1}{N}\sum_{i=1}^{N} M_i \cdot \Delta S_i$$

where \Delta S_i represents the difference between the semantic information of the ith reference point and that of the ith matching point, M_i represents the mask, and N represents the number of reference points. This embodiment is illustrated by the following example:
Firstly, multiple multi-view images of the same object under different viewing angles are acquired through image acquisition equipment; taking these multi-view images as input, an initial depth map is obtained through a supervised three-dimensional reconstruction method. One of the input multi-view images is selected as the reference image, the remaining images being the images to be matched; a reference point P_i is taken on the reference image, together with its corresponding semantic category S_i in the segmentation set S and its corresponding depth value on the depth map.
For segmentation sets of different levels, the numbers of semantic categories differ: sets with more categories need finer guidance and therefore more reference points, the number of reference points being selected according to the formula

$$N_j = \frac{HW}{t} \cdot \frac{c_j}{\sum_i c_i}.$$

The matching point P_i' corresponding to each reference point on the image to be matched is obtained through the homography relation above, and the semantic category S_i' of the matching point P_i' is read off. If the depth map is accurate (i.e., the depth value at the corresponding position is correct), the semantic category of the matching point computed from the reference point should be the same as that of the reference point; the following semantic loss is therefore computed and minimized:

$$L_{sem} = \frac{1}{N}\sum_{i=1}^{N} M_i \cdot \Delta S_i$$

By minimizing the semantic loss function, the initial depth map is continuously corrected, and an accurate depth map is finally obtained. Semantic information can thus replace ground truth for guidance, converting a supervised three-dimensional reconstruction method into an unsupervised one and realizing unsupervised three-dimensional reconstruction, thereby overcoming the inherent defects of supervised methods.
The semantics of an image can be divided into three layers: a visual layer, an object layer and a concept layer. Visual-layer semantics include colors, lines, contours and the like; object-layer semantics cover the various objects; concept-layer semantics concern understanding of the scene. In the prior art, some three-dimensional reconstruction methods also use semantic information for guidance, but single-scale high-level abstract semantic information (object layer) gives good precision only on reconstruction tasks involving large-scale objects; on small-scale reconstruction tasks it is relatively coarse and its reconstruction precision is poor.
Therefore, in the embodiment, a plurality of multi-view images are used as input, and an initial depth map is obtained by a supervised three-dimensional reconstruction method; obtaining depth maps of various scales based on semantic segmentation sets and initial depth maps of various scales; in the embodiment, semantic guidance is respectively performed on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained.
And S500, constructing a point cloud set with various scales based on the depth maps with various scales.
Specifically, the point cloud set of each scale is constructed from the depth map of that scale by the following expressions:

$$x = \frac{u \cdot z}{f_x}, \qquad y = \frac{v \cdot z}{f_y}, \qquad z = D(u, v)$$

where u represents the abscissa of the depth map, v represents its ordinate, f_x and f_y represent the camera focal lengths obtained from the camera parameters, and x, y and z represent the point cloud coordinates of the transformed points (with u and v measured relative to the principal point).
And S600, according to the scale of the point cloud set, optimizing the point cloud sets with various scales by adopting different radius filtering to obtain the optimized point cloud set.
Specifically, a point cloud set of multiple scales is obtained, and the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in the point cloud set according to the scale of the point cloud set by the following formula:

$$r_l = a \cdot t^{\,l}$$

where r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales, a represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius size corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud sets.
In this embodiment, for point cloud sets of different scales, radius filtering is required after the depth maps are converted, filtering out noise points and optimizing the point cloud data. Because the aggregation degree of the point clouds differs across scales, different filter radii are adopted for point cloud sets of different scales. In radius filtering, the radius corresponding to each point cloud and a preset number of neighboring points are first obtained; only points that have a sufficient number of neighboring points within the radius are retained, and the remaining points are filtered out. For the multi-scale point cloud sets of this embodiment, the semantic category of the points in the segmentation set must also be considered: a point is retained only if it has n neighboring points of the same semantic category within the radius.
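A minimal sketch of this semantics-aware radius filtering, using a k-d tree for the neighbourhood query; the radius and neighbour threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def semantic_radius_filter(points, labels, radius, n_required):
    """Keep a point only if at least n_required neighbours within
    `radius` share its semantic category; all other points are
    filtered out as noise."""
    tree = cKDTree(points)
    keep = np.zeros(len(points), dtype=bool)
    for i, nbrs in enumerate(tree.query_ball_point(points, r=radius)):
        same = sum(1 for j in nbrs if j != i and labels[j] == labels[i])
        keep[i] = same >= n_required
    return points[keep], labels[keep]

# Example: filter a random cloud with 5 semantic categories
rng = np.random.default_rng(0)
pts, labs = rng.random((1000, 3)), rng.integers(0, 5, size=1000)
kept_pts, kept_labs = semantic_radius_filter(pts, labs, radius=0.1, n_required=3)
```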
And S700, reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales.
Specifically, in step S600, point cloud sets of different scales are optimized to obtain point cloud sets optimized in different scales, and the point cloud sets optimized in each scale are reconstructed to obtain three-dimensional reconstruction results in different scales.
And step S800, splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Specifically, the three-dimensional reconstruction results of each scale are spliced and fused to obtain the final three-dimensional reconstruction result. In this embodiment, through the step S700, reconstruction of different scales is performed based on the optimized point cloud set, and the optimized point cloud set is more accurate, so that the final three-dimensional reconstruction result obtained in this embodiment is also more accurate.
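The patent does not detail the splicing-and-fusing step, so the following is only a simple illustrative sketch: per-scale point clouds are concatenated and near-duplicate points are merged by snapping to a voxel grid.

```python
import numpy as np

def fuse_point_clouds(clouds, voxel=0.01):
    """Concatenate per-scale point clouds and collapse points that
    fall into the same voxel, keeping the first occurrence."""
    merged = np.concatenate(clouds, axis=0)
    keys = np.round(merged / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(first)]

# Example: fuse three per-scale reconstructions into one cloud
rng = np.random.default_rng(0)
fused = fuse_point_clouds([rng.random((500, 3)) for _ in range(3)])
```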
In this embodiment, multiple multi-view images are acquired, and multi-scale semantic features are extracted from them to obtain feature maps of multiple scales; multi-scale semantic segmentation of those feature maps yields semantic segmentation sets of multiple scales. Multi-scale semantic feature extraction draws out deep features, and multi-scale semantic segmentation aggregates the semantic information of each scale, enriching it. Taking the multiple multi-view images as input, an initial depth map is obtained by a supervised three-dimensional reconstruction method; depth maps of multiple scales are then obtained from the multi-scale semantic segmentation sets and the initial depth map, the semantic information of each scale guiding and continuously correcting the initial depth map so that accurate depth maps of multiple scales result. Point cloud sets of multiple scales are constructed from these depth maps and optimized with radius filtering whose radius depends on the scale of the point cloud set; reconstruction at the different scales is performed from the optimized point cloud sets, and the per-scale reconstruction results are spliced and fused into the final result. Because the optimized point cloud sets are more accurate, the fused reconstruction result is also more accurate; semantic information of every scale is fully utilized, and the accuracy of three-dimensional reconstruction is improved.
Referring to fig. 5, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction system, which includes a feature map obtaining unit 100, a semantic segmentation set obtaining unit 200, an initial depth map obtaining unit 300, a depth map obtaining unit 400, a point cloud set obtaining unit 500, a radius filtering unit 600, a reconstruction result obtaining unit 700, and a reconstruction result fusion unit 800, where:
the feature map obtaining unit 100 is configured to acquire a plurality of multi-view images, perform multi-scale semantic feature extraction on the multi-view images, and obtain feature maps of multiple scales;
a semantic segmentation set acquisition unit 200, configured to perform multi-scale semantic segmentation on feature maps of multiple scales to obtain a semantic segmentation set of multiple scales;
an initial depth map obtaining unit 300, configured to reconstruct the multiple multi-view images by using a supervised three-dimensional reconstruction method to obtain an initial depth map;
a depth map obtaining unit 400, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map;
a point cloud set obtaining unit 500, configured to construct a point cloud set with multiple scales based on depth maps with multiple scales;
the radius filtering unit 600 is configured to optimize point cloud sets of multiple scales by using different radius filtering according to the scale of the point cloud set, so as to obtain an optimized point cloud set;
a reconstruction result obtaining unit 700, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit 800 is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
It should be noted that, since the deep learning-based multi-view three-dimensional reconstruction system of this embodiment is based on the same inventive concept as the above deep learning-based multi-view three-dimensional reconstruction method, the corresponding contents of the method embodiment are also applicable to this system embodiment and are not described in detail here.
The embodiment of the invention also provides a deep learning-based multi-view three-dimensional reconstruction device, comprising: at least one control processor and a memory communicatively connected to the at least one control processor.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the deep learning-based multi-view three-dimensional reconstruction method of the above embodiments are stored in the memory; when executed by the processor, they perform the deep learning-based multi-view three-dimensional reconstruction method of the above embodiments, for example, method steps S100 to S800 in fig. 1.
The above-described system embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions which, when executed by one or more control processors, cause the one or more control processors to perform the deep learning-based multi-view three-dimensional reconstruction method of the above method embodiments, for example, the functions of method steps S100 to S800 in fig. 1.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software together with a general-purpose hardware platform. Those skilled in the art will also appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (9)

1. A multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:
acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales;
performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
obtaining the depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map, specifically:
selecting any one of the multiple multi-view images as a reference image, and taking the others as images to be matched;
selecting a reference point from the reference image, acquiring the semantic category corresponding to the reference point in the semantic segmentation set, and acquiring the depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

N_j = t \cdot HW \cdot \frac{c_j}{\sum_{i=1}^{n} c_i}

wherein N_j represents the number of reference points selected from the jth semantic segmentation set, H represents the height of the multi-view image, W represents the width of the multi-view image, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, c_i represents the number of semantic categories contained in the ith semantic segmentation set, and n represents the total number of semantic segmentation sets;
based on each reference point, acquiring the matching point of each reference point on the image to be matched through the following formula:

p_i' = K \, T \left( D(P_i) \cdot K^{-1} P_i \right)

wherein p_i' represents the matching point of the ith reference point on the image to be matched, K represents the intrinsic parameters of the camera, T represents the extrinsic parameters of the camera, and D(P_i) represents the depth value of the reference point P_i in the reference image on the initial depth map;
obtaining the semantic category corresponding to each matching point, and correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of the multiple scales, wherein the semantic loss function L_{sem} is calculated as follows:

L_{sem} = \frac{1}{N} \sum_{i=1}^{N} M_i \cdot \Delta s_i

wherein \Delta s_i represents the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M_i represents a mask, and N represents the number of reference points;
constructing a point cloud set with multiple scales based on the depth maps with multiple scales;
according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;
reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;
and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
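For illustration only (outside the claim language), a minimal numpy sketch of the matching-point projection and semantic loss recited in claim 1 follows, assuming a pinhole model in which the extrinsics T are given as a rotation R and translation t; all names are illustrative.

```python
import numpy as np

def project_points(K, R, t, pix, depths):
    """Matching-point formula of claim 1: lift reference pixels to 3D with
    their initial depth values, transform into the view to be matched, and
    project back through the intrinsics K."""
    homo = np.hstack([pix, np.ones((pix.shape[0], 1))])  # homogeneous pixels (N, 3)
    cam = (np.linalg.inv(K) @ homo.T) * depths           # 3D points in the reference frame
    cam2 = R @ cam + t.reshape(3, 1)                     # extrinsic transform T = [R | t]
    proj = K @ cam2
    return (proj[:2] / proj[2]).T                        # (N, 2) matching points

def semantic_loss(sem_ref, sem_match, mask):
    """Masked mean disagreement between the semantics of reference points
    and their matching points (L_sem of claim 1, up to the exact form of
    the per-point difference)."""
    diff = (np.asarray(sem_ref) != np.asarray(sem_match)).astype(float)
    return float((np.asarray(mask) * diff).mean())
```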
2. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic feature extraction on the multiple multi-view images to obtain feature maps of multiple scales comprises:
performing multi-layer feature extraction on the multi-view images through a ResNet network to obtain original feature maps with various scales;
and respectively connecting the original feature map of each scale with channel attention so as to carry out importance weighting on the original feature map of each scale through a channel attention mechanism and obtain feature maps of various scales.
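For illustration only, a minimal sketch of the multi-layer extraction in claim 2, using torchvision's feature-extraction utility on a ResNet backbone; the choice of ResNet-18 and of the tapped layers is illustrative.

```python
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Tap the four ResNet stages as original feature maps of four scales.
backbone = resnet18(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "s1", "layer2": "s2", "layer3": "s3", "layer4": "s4"},
)
feats = extractor(torch.randn(1, 3, 256, 256))  # strides 4, 8, 16, 32
```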
3. The deep learning-based multi-view three-dimensional reconstruction method according to claim 2, wherein the weighting of importance of the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales comprises:
compressing the original feature map of each scale through a compression network to obtain a one-dimensional feature map corresponding to the original feature map of each scale;

inputting the one-dimensional feature map into a fully connected layer through an excitation network to perform importance prediction and obtain the importance of each channel;

and weighting the original feature map of each scale by the importance of each channel through an excitation function to obtain the feature maps of multiple scales.
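A minimal PyTorch sketch of the compression/excitation mechanism of claims 2 and 3 follows; the squeeze-and-excitation structure and the reduction ratio are illustrative assumptions consistent with the steps recited above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: compress each scale's
    feature map to a one-dimensional descriptor, predict per-channel
    importance with fully connected layers, then reweight the map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # compression network
        self.excite = nn.Sequential(                # excitation network
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # excitation function
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)              # one-dimensional feature map
        w = self.excite(w).view(b, c, 1, 1)         # importance of each channel
        return x * w                                # importance-weighted feature map
```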
4. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales includes:
clustering the feature maps of the multiple scales through non-negative matrix factorization to obtain the semantic segmentation sets of the multiple scales, wherein the non-negative matrix factorization is expressed as:

\min_{P \ge 0,\ Q \ge 0} \| V - PQ \|_F^2

wherein the feature maps of the multiple scales are mapped, concatenated and reshaped into a matrix V with HW rows and C columns; P represents the coefficient matrix with HW rows and K columns; Q represents the basis matrix with K rows and C columns; K represents the number of semantic clusters serving as the factorization rank; C represents the feature dimension of each pixel; and F denotes the Frobenius norm.
5. The method for multi-view three-dimensional reconstruction based on deep learning of claim 1, wherein the constructing the point cloud sets of multiple scales based on the depth maps of multiple scales comprises:
constructing the point cloud set of each scale from the depth map of each scale according to the following expressions:

x = \frac{u \cdot z}{f_x}, \quad y = \frac{v \cdot z}{f_y}, \quad z = d(u, v)

wherein u represents the abscissa of the depth map, v represents the ordinate of the depth map, f_x and f_y represent the camera focal lengths obtained from the camera parameters, d(u, v) represents the depth value at pixel (u, v), and x, y and z represent the coordinates of the converted point cloud.
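A minimal numpy sketch of the back-projection in claim 5 follows; the principal point (cx, cy) is an assumption the expressions above leave implicit, and with cx = cy = 0 the code reduces exactly to them.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx=0.0, cy=0.0):
    """Back-project an (H, W) depth map to an (H*W, 3) point cloud; with
    cx = cy = 0 this reduces to x = u*z/fx, y = v*z/fy, z = d(u, v)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # ordinate v, abscissa u
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```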
6. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the optimization of the point cloud sets of multiple scales by using different radius filters according to the scales of the point cloud sets to obtain an optimized point cloud set comprises:
acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in the point cloud set according to the scale of the point cloud set by the following formula:

r_l = \alpha \cdot t^{l}

wherein r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales, \alpha represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain an optimized point cloud set.
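For illustration only, the following sketch ties claim 6 together, deriving the per-scale radius from the preset scale grade l (reading the formula as r_l = alpha * t**l) and reusing the hypothetical semantic_radius_filter helper sketched in the description above; alpha, t, and n_min are illustrative constants.

```python
def filter_point_cloud_set(points, labels, grade, alpha=0.01, t=2.0, n_min=8):
    """Pick a radius from the preset scale grade l, then apply the
    semantic-aware radius filter (see semantic_radius_filter above)."""
    radius = alpha * (t ** grade)   # coarser scales (larger l) get a larger radius
    return semantic_radius_filter(points, labels, radius, n_min)
```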
7. A deep learning based multi-view three-dimensional reconstruction system, comprising:
the feature map acquisition unit is used for acquiring a plurality of multi-view images, performing multi-scale semantic feature extraction on the multi-view images, and obtaining feature maps of multiple scales;
the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to obtain a semantic segmentation set with various scales;
the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
a depth map obtaining unit, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map, specifically:
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting a reference point from the reference image, acquiring the semantic category corresponding to the reference point in the semantic segmentation set, and acquiring the depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

N_j = t \cdot HW \cdot \frac{c_j}{\sum_{i=1}^{n} c_i}

wherein N_j represents the number of reference points selected from the jth semantic segmentation set, H represents the height of the multi-view image, W represents the width of the multi-view image, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, c_i represents the number of semantic categories contained in the ith semantic segmentation set, and n represents the total number of semantic segmentation sets;
based on each reference point, obtaining the matching point of each reference point on the image to be matched through the following formula:

p_i' = K \, T \left( D(P_i) \cdot K^{-1} P_i \right)

wherein p_i' represents the matching point of the ith reference point on the image to be matched, K represents the intrinsic parameters of the camera, T represents the extrinsic parameters of the camera, and D(P_i) represents the depth value of the reference point P_i in the reference image on the initial depth map;
obtaining the semantic category corresponding to each matching point, and correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of the multiple scales, wherein the semantic loss function L_{sem} is calculated as follows:

L_{sem} = \frac{1}{N} \sum_{i=1}^{N} M_i \cdot \Delta s_i

wherein \Delta s_i represents the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M_i represents a mask, and N represents the number of reference points;
the point cloud set acquisition unit is used for constructing point cloud sets with various scales based on the depth maps with various scales;
the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;
a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
8. A deep learning based multi-view three-dimensional reconstruction device comprising at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 6.
9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method for deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 6.
CN202211087276.9A 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning Active CN115170746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211087276.9A CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Publications (2)

Publication Number Publication Date
CN115170746A CN115170746A (en) 2022-10-11
CN115170746B true CN115170746B (en) 2022-11-22

Family

ID=83481918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211087276.9A Active CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN115170746B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457101B (en) * 2022-11-10 2023-03-24 武汉图科智能科技有限公司 Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform
CN118096995A (en) * 2022-11-21 2024-05-28 华为云计算技术有限公司 Three-dimensional twin method and device
CN117876397B (en) * 2024-01-12 2024-06-18 浙江大学 Bridge member three-dimensional point cloud segmentation method based on multi-view data fusion

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715504A (en) * 2015-02-12 2015-06-17 四川大学 Robust large-scene dense three-dimensional reconstruction method
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
US11004202B2 (en) * 2017-10-09 2021-05-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for semantic segmentation of 3D point clouds
CN108388639B (en) * 2018-02-26 2022-02-15 武汉科技大学 Cross-media retrieval method based on subspace learning and semi-supervised regularization
EP3970114A4 (en) * 2019-05-17 2022-07-13 Magic Leap, Inc. Methods and apparatuses for corner detection using neural network and corner detector
US11645756B2 (en) * 2019-11-14 2023-05-09 Samsung Electronics Co., Ltd. Image processing apparatus and method
CN111340186B (en) * 2020-02-17 2022-10-21 之江实验室 Compressed representation learning method based on tensor decomposition
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN113066168B (en) * 2021-04-08 2022-08-26 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN113673400A (en) * 2021-08-12 2021-11-19 土豆数据科技集团有限公司 Real scene three-dimensional semantic reconstruction method and device based on deep learning and storage medium
CN114881867A (en) * 2022-03-24 2022-08-09 山西三友和智慧信息技术股份有限公司 Image denoising method based on deep learning
CN114677479A (en) * 2022-04-13 2022-06-28 温州大学大数据与信息技术研究院 Natural landscape multi-view three-dimensional reconstruction method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant