CN115170746B - Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Info

Publication number
CN115170746B
Authority
CN
China
Prior art keywords
point cloud
semantic
representing
scales
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211087276.9A
Other languages
Chinese (zh)
Other versions
CN115170746A (en)
Inventor
任胜兵
彭泽文
陈旭洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202211087276.9A priority Critical patent/CN115170746B/en
Publication of CN115170746A publication Critical patent/CN115170746A/en
Application granted granted Critical
Publication of CN115170746B publication Critical patent/CN115170746B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-view three-dimensional reconstruction method, system and equipment based on deep learning. Multiple multi-view images are acquired, and multi-scale semantic features are extracted from them to obtain feature maps of multiple scales; multi-scale semantic segmentation is performed on the feature maps to obtain semantic segmentation sets of multiple scales; the multi-view images are reconstructed by a supervised three-dimensional reconstruction method to obtain an initial depth map; depth maps of multiple scales are obtained from the multi-scale semantic segmentation sets and the initial depth map; point cloud sets of multiple scales are constructed; the point cloud sets of the various scales are optimized with radius filtering of different radii to obtain optimized point cloud sets; reconstruction at the different scales is performed from the optimized point cloud sets to obtain three-dimensional reconstruction results of different scales; and the three-dimensional reconstruction results of each scale are spliced and fused. The invention makes full use of semantic information at each scale and improves the accuracy of three-dimensional reconstruction.

Description

Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-view three-dimensional reconstruction method, a multi-view three-dimensional reconstruction system and multi-view three-dimensional reconstruction equipment based on deep learning.
Background
In deep learning three-dimensional reconstruction, a computer builds a neural network, trains it on a large amount of image data and three-dimensional model data, and learns the mapping relation between images and three-dimensional models, so that three-dimensional reconstruction of a new image target is realized. Compared with traditional methods such as the 3D Morphable Model (3DMM) and Structure from Motion (SfM), deep learning three-dimensional reconstruction can introduce learned global semantic information into image reconstruction, which to a certain extent overcomes the limitation that traditional reconstruction methods perform poorly in weakly lit and weakly textured areas.
Most existing deep learning three-dimensional reconstruction methods are single-scale: objects of different sizes in the image are reconstructed in the same way. Single-scale reconstruction maintains good accuracy and speed in environments of low scene complexity with few small objects, but in environments with complex scenes and many objects of various scales, reconstruction accuracy for small-scale objects is often insufficient. Moreover, only high-level features are used, and the low-level detail information of the image is not fully exploited.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method, a system and equipment for multi-view three-dimensional reconstruction based on deep learning, which can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
In a first aspect, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:
acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of multiple scales;
performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
obtaining depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
constructing a point cloud set with multiple scales based on the depth maps with multiple scales;
according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;
reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;
and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
the method can extract the features of different scales by extracting the multi-scale semantic features of a plurality of multi-view images, can obtain the feature maps of various scales, can perform multi-scale semantic segmentation on the feature maps of various scales, and can aggregate the semantic information of each scale, thereby enriching the semantic information of each scale; semantic guidance is respectively carried out on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained; the method comprises the steps of constructing a point cloud set with various scales by using the obtained depth maps with various scales, optimizing by adopting different radius filtering according to the scales of the point cloud set, using the optimized point cloud set for reconstruction with different scales, and fusing three-dimensional reconstruction results to obtain more accurate three-dimensional reconstruction results. Therefore, the method can fully utilize semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
According to some embodiments of the present invention, the performing multi-scale semantic feature extraction on a plurality of the multi-view images to obtain feature maps of multiple scales includes:
performing multilayer feature extraction on the multiple multi-view images through a ResNet network to obtain original feature maps with multiple scales;
and respectively connecting the original characteristic map of each scale with channel attention so as to carry out importance weighting on the original characteristic map of each scale through a channel attention mechanism and obtain characteristic maps of various scales.
According to some embodiments of the present invention, the importance weighting is performed on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, including:
compressing the original characteristic diagram of each scale through a compression network to obtain a one-dimensional characteristic diagram corresponding to the original characteristic diagram of each scale;
inputting the one-dimensional characteristic diagram into a full-connection layer through an excitation network to perform importance prediction, and obtaining the importance of each channel;
and exciting the importance of each channel to the one-dimensional characteristic diagram of the original characteristic diagram of each scale through an excitation function to obtain characteristic diagrams of various scales.
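For illustration, the following is a minimal PyTorch sketch of such a squeeze-and-excitation style channel attention module; the module name, the reduction ratio and the example sizes are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention.

    Squeeze: global average pooling compresses each H x W map to one
    value per channel (the one-dimensional feature map). Excitation:
    fully connected layers predict a per-channel importance in (0, 1),
    which then reweights the original feature map.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # H x W -> 1 x 1 per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # channel importance in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)       # one-dimensional feature
        w = self.excite(w).view(b, c, 1, 1)  # predicted importance
        return x * w                         # excite the original map

# Example: reweight a batch of 64-channel feature maps from the backbone
weighted = ChannelAttention(64)(torch.randn(2, 64, 32, 32))
```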
According to some embodiments of the present invention, the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain a semantic segmentation set of multiple scales includes:
clustering the feature maps of multiple scales through non-negative matrix factorization to obtain semantic segmentation sets of multiple scales; wherein the non-negative matrix factorization solves

$$\min_{P,\,Q \ge 0} \; \lVert V - PQ \rVert_F^2$$

where the feature maps of the multiple scales are concatenated and reshaped into a matrix V with HW rows and C columns, P represents the coefficient matrix with HW rows and K columns, Q represents the basis matrix with K rows and C columns, H represents the image height, W represents the image width, K represents the non-negative matrix factorization factor giving the number of semantic clusters, C represents the feature dimension of each pixel, and the subscript F denotes the Frobenius (non-induced) norm.
According to some embodiments of the present invention, the obtaining the depth maps of the plurality of scales based on the semantic segmentation sets of the plurality of scales and the initial depth map comprises:
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting reference points from the reference image, acquiring the semantic category corresponding to each reference point in the semantic segmentation set, and acquiring the depth value corresponding to each reference point on the initial depth map;

the number of reference points is chosen by the following formula:

$$N_j = \frac{HW}{t} \cdot \frac{c_j}{\sum_i c_i}$$

where N_j represents the number of reference points selected for the jth segmentation set, H represents the height of the multi-view image, W represents its width, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, and the sum over i runs over the category counts of all the semantic segmentation sets;

based on each reference point, the matching point of each reference point on the image to be matched is obtained through the following formula:

$$P_i' = K\,T\,\big(D(P_i)\,K^{-1}P_i\big)$$

where P_i' represents the matching point of the ith reference point on the image to be matched, K represents the camera intrinsics, T represents the camera extrinsics, and D(P_i) represents the depth value of reference point P_i in the reference image on the initial depth map;

the semantic category corresponding to each matching point is obtained, and the initial depth map is corrected at each scale by minimizing a semantic loss function to obtain the depth maps of multiple scales, the semantic loss function L_sem being calculated as:

$$L_{sem} = \frac{1}{N}\sum_{i=1}^{N} M_i \cdot \Delta S_i$$

where \Delta S_i represents the difference between the semantic information of the ith reference point and that of the ith matching point, M_i represents a mask, and N represents the number of reference points.
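The following NumPy sketch puts the two reconstructed formulas above together; the 0/1 category-disagreement used for the difference ΔS_i and all helper names are assumptions made for illustration.

```python
import numpy as np

def reproject(p_ref, depth, K, T):
    """Warp a reference pixel p_ref = (u, v) into the image to be
    matched: back-project with its depth value, apply the extrinsics
    T (4x4, reference -> matched view), project with intrinsics K."""
    ray = np.linalg.inv(K) @ np.array([p_ref[0], p_ref[1], 1.0])
    X = np.append(depth * ray, 1.0)  # homogeneous 3D point
    x = K @ (T @ X)[:3]              # project into the matched view
    return x[:2] / x[2]              # matching point P_i'

def semantic_loss(ref_cats, match_cats, mask):
    """Masked mean disagreement between the semantic category of each
    reference point and that of its matching point (one simple choice
    of dS_i; the patent does not fix the exact distance)."""
    delta = (ref_cats != match_cats).astype(float)
    return float(np.mean(mask * delta))
```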
According to some embodiments of the invention, the constructing a point cloud set of multiple scales based on the depth maps of multiple scales comprises:
constructing the point cloud set of each scale from the depth map of each scale according to the following expressions:

$$x = \frac{u \cdot z}{f_x}, \qquad y = \frac{v \cdot z}{f_y}, \qquad z = D(u, v)$$

where u represents the abscissa of the depth map, v represents its ordinate, f_x and f_y represent the camera focal lengths obtained from the camera parameters, D(u, v) represents the depth value, and x, y and z represent the coordinates of the transformed point cloud (with u and v measured relative to the principal point).
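As a sketch of this back-projection, assuming a standard pinhole model in which the principal point (cx, cy) is made explicit (the reconstructed expression above leaves it implicit):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a point cloud:
    x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a flat 4x4 depth map with focal lengths from the intrinsics
pts = depth_to_points(np.full((4, 4), 2.0), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```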
According to some embodiments of the present invention, the optimizing the point cloud sets of multiple scales by using different radius filtering according to the scales of the point cloud sets to obtain an optimized point cloud set includes:
acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in the point cloud set according to the scale of the point cloud set by the following formula:

$$r_l = a \cdot t^{\,l}$$

where r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales, a represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud set.
In a second aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction system, where the deep learning-based multi-view three-dimensional reconstruction system includes:
the characteristic diagram acquisition unit is used for acquiring multi-view images, and performing multi-scale semantic feature extraction on the multi-view images to acquire characteristic diagrams of multiple scales;
the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to acquire a semantic segmentation set with various scales;
the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
the depth map acquisition unit is used for acquiring depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
the point cloud set acquisition unit is used for constructing point cloud sets with various scales based on the depth maps with various scales;
the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;
a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Compared with the prior art, the second aspect of the invention has the following beneficial effects:
the feature map acquisition unit of the system can extract deep features and acquire feature maps of multiple scales by performing multi-scale semantic feature extraction on multiple multi-view images, performs multi-scale semantic segmentation on the feature maps of multiple scales by the semantic segmentation set acquisition unit, aggregates semantic information of each scale, and enriches the semantic information of each scale; the depth map acquisition unit is used for respectively carrying out semantic guidance on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and the accurate depth map of multiple scales is obtained; a point cloud set acquisition unit of the system constructs a point cloud set with multiple scales by using the acquired depth maps with multiple scales, different radius filtering is adopted for optimization according to the scales of the point cloud set through a radius filtering unit, reconstruction with different scales is carried out on the basis of the optimized point cloud set through a reconstruction result acquisition unit, and then a three-dimensional reconstruction result is fused through a reconstruction result fusion unit to obtain a more accurate three-dimensional reconstruction result. Therefore, the system can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
In a third aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction apparatus, including at least one control processor and a memory, which is in communication connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform a method for deep learning based multi-view three-dimensional reconstruction as described above.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute a deep learning-based multi-view three-dimensional reconstruction method as described above.
It is to be understood that the advantageous effects of the third aspect to the fourth aspect compared to the related art are the same as the advantageous effects of the first aspect compared to the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a deep learning-based multi-view three-dimensional reconstruction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a depth residual network in accordance with one embodiment of the present invention;
FIG. 3 is a schematic diagram of a non-negative matrix factorization of an embodiment of the present invention;
FIG. 4 is a block diagram of multi-scale semantic segmentation in accordance with one embodiment of the present invention;
fig. 5 is a structural diagram of a deep learning-based multi-view three-dimensional reconstruction system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention.
In the description of the present invention, terms such as first and second are used only to distinguish technical features, and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to, for example, the upper, lower, etc., is indicated based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that unless otherwise explicitly defined, terms such as arrangement, installation, connection and the like should be broadly understood, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
For the convenience of understanding of those skilled in the art, the terms in the present embodiment are explained:
the deep learning three-dimensional reconstruction method comprises the following steps: the three-dimensional reconstruction method for deep learning is that a neural network is built by using a computer, training is carried out through a large amount of image data and three-dimensional model data, and the mapping relation from an image to a three-dimensional model is learned, so that the three-dimensional reconstruction of a new image target is realized. Compared with the traditional method for reconstructing three-dimensional information such as 3DMM and the method for reconstructing three-dimensional information by SFM, the three-dimensional reconstruction method for deep learning can introduce some global semantic information into image reconstruction, thereby overcoming the limitation that the traditional reconstruction method is poor in reconstruction in weak illumination and weak texture areas to a certain extent, wherein the SFM algorithm is an off-line algorithm for three-dimensional reconstruction based on various collected disordered pictures; the 3DMM, a three-dimensional deformable face model, is a general three-dimensional face model, and represents a face by using fixed points.
Current deep learning three-dimensional reconstruction methods can be broadly divided into supervised three-dimensional reconstruction methods (for example, MVSNet, CVP-MVSNet and PatchmatchNet in the prior art) and self-supervised three-dimensional reconstruction methods (for example, JDACS-MS in the prior art). Supervised methods need ground-truth values for training and achieve high precision, but are difficult to apply in scenes where ground truth is hard to acquire. Self-supervised methods need no ground-truth training data and apply widely, but their precision is comparatively low.
Semantic segmentation: semantic segmentation is classification at the pixel level; pixels belonging to the same class are grouped into one class, so semantic segmentation understands an image from the pixel level. For example, pixels with different semantics are marked with different colors, and pixels belonging to animals are all classified into the same class. The segmented semantic information can guide image reconstruction and improve reconstruction precision. Here, semantic segmentation is performed by clustering, grouping pixels that belong to the same class into the same cluster.
Depth map: also called a distance image, an image in which the distance (depth) from the image pickup device to each point in the scene is taken as the pixel value.
Point cloud: the set of point data on the appearance surface of an object; it contains the three-dimensional coordinate information, color and other information of the object, and image reconstruction can be realized from point cloud data.
Non-negative Matrix Factorization (NMF): a matrix factorization method under the constraint that all elements of the matrices are non-negative. Many analysis methods solve practical problems by matrix decomposition, such as PCA (principal component analysis), ICA (independent component analysis), SVD (singular value decomposition) and VQ (vector quantization). In all of these, the original large matrix V is approximately decomposed into a low-rank form V = WH. Their common feature is that the elements of the factors W and H may be positive or negative; even if the input matrix is entirely positive, traditional rank-reduction algorithms cannot guarantee the non-negativity of the factors. Mathematically, negative values in the decomposition results are perfectly correct from a computational point of view, but negative elements often make no sense in practical problems.
The deep learning three-dimensional reconstruction method builds a neural network with a computer, trains it on a large amount of image data and three-dimensional model data, and learns the mapping relation between images and three-dimensional models, so that three-dimensional reconstruction of a new image target is realized. Compared with traditional methods such as 3DMM and SfM, deep learning three-dimensional reconstruction can introduce learned global semantic information into image reconstruction, which to a certain extent overcomes the limitation that traditional methods reconstruct poorly in weakly lit and weakly textured areas.
Most existing deep learning three-dimensional reconstruction methods are single-scale: objects of different sizes in the image are reconstructed in the same way. Single-scale reconstruction maintains good accuracy and speed in environments of low scene complexity with few fine objects, but in environments with complex scenes and many objects of various scales, reconstruction accuracy for small-scale objects is likely to be insufficient. Moreover, only high-level features are used, and the low-level detail information of the image is not fully exploited.
To solve these problems, the invention performs multi-scale semantic feature extraction on multiple multi-view images, extracting features of different scales and obtaining feature maps of multiple scales; multi-scale semantic segmentation of those feature maps aggregates and enriches the semantic information of each scale. Semantic information of each scale in the multi-scale semantic segmentation sets guides the initial depth map, which is continuously corrected to give accurate depth maps of multiple scales. From these depth maps, point cloud sets of multiple scales are constructed and optimized with radius filtering whose radius depends on the scale of the point cloud set; the optimized point cloud sets are used for reconstruction at the different scales, and the three-dimensional reconstruction results are fused into a more accurate result. Semantic information of every scale can thus be fully utilized, and the accuracy of three-dimensional reconstruction improved.
Referring to fig. 1, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:
and S100, acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales.
Specifically, a plurality of multi-view images are acquired, and the object to be recognized can be subjected to image acquisition at various angles in all directions through image acquisition equipment such as a camera and an image scanner, so that the plurality of multi-view images are obtained. For example, when the multi-scale semantic feature extraction needs to be performed on a plurality of multi-view images, an image acquisition device such as a camera may be used to obtain the plurality of multi-view images.
In the embodiment, multilayer feature extraction is performed on a plurality of multi-view images through a ResNet network to obtain original feature maps with various scales;
respectively connecting the original feature map of each scale with channel attention, and performing importance weighting on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, specifically:
compressing the original characteristic diagram of each scale through a compression network to obtain a one-dimensional characteristic diagram corresponding to the original characteristic diagram of each scale;
inputting the one-dimensional characteristic diagram into a full-connection layer through an excitation network to perform importance prediction, and obtaining the importance of each channel;
and exciting the importance of each channel to the one-dimensional characteristic diagram of the original characteristic diagram of each scale through an excitation function to obtain characteristic diagrams of various scales.
In this embodiment, the ResNet network is adopted to extract image features. Theoretically, the deeper a deep learning network is, the stronger its expressive power; but once a CNN reaches a certain depth, deepening it further degrades classification performance, slows network convergence, and lowers accuracy, and even enlarging the data set to combat overfitting does not restore classification performance and accuracy. The ResNet network adopts residual learning. Referring to FIG. 2, when the input is x, the feature to be learned is denoted H(x); the network is instead made to learn the residual F(x) = H(x) - x, so that the actually learned feature is H(x) = F(x) + x. This is because learning the residual is easier than learning the original feature directly. When the residual is 0, the stacked layers only perform an identity mapping, so network performance at least does not degrade; in practice the residual is not 0, so the stacked layers learn new features on top of the input features and perform better. The residual function is easier to optimize, and the number of network layers can be greatly deepened, so that deeper semantic information can be extracted. ResNet is clearly superior to networks such as VGG in efficiency, resource consumption and deep semantic feature extraction.
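A minimal PyTorch sketch of such a residual unit follows; the layer sizes are illustrative, not ResNet's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual unit: the stacked layers learn the residual
    F(x) = H(x) - x, and the identity shortcut restores
    H(x) = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # F(x) + x

# Example: a residual block leaves spatial size and channel count intact
y = ResidualBlock(32)(torch.randn(1, 32, 16, 16))
```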
After multi-layer feature extraction is carried out on a plurality of multi-view images through a ResNet network to obtain original feature maps of various scales, the original feature maps of each scale are respectively connected with channel attention, and importance weighting is carried out on the original feature maps of each scale through a channel attention mechanism to obtain the feature maps of various scales. The channel attention mechanism mainly comprises a compression network and an excitation network, and comprises the following specific processes:
let the dimension of the original feature map be H × W × C, where H is Height (Height), W is width (width), and C is channel number (channel). The compression network does the same thing as compressing H W into one-dimensional features by global averaging pooling. After H × W is compressed into one dimension, the corresponding one-dimensional parameter obtains the view of the whole H × W, and the sensing area is wider. And transmitting the one-dimensional characteristics obtained by the compression network to an excitation network, transmitting the one-dimensional characteristics to a full connection layer by the excitation network, predicting the importance of each channel to obtain the importance of different channels, and exciting the importance of different channels to the channels corresponding to the previous characteristic diagrams by a Sigmoid excitation function. The channel attention mechanism enables the network to pay attention to more effective semantic features, the weight of the semantic features is improved in an iterative mode, the feature extraction network extracts rich semantic features, and different semantic features are different in importance for semantic segmentation. The introduction of the channel attention mechanism can enable the network to pay attention to more effective features, inhibit inefficient features and improve the effectiveness of feature extraction.
In the prior art, feature extraction with a convolutional neural network such as VGG is limited by the number of extraction layers: deep-level feature extraction capability is insufficient and feature validity is not high. As the number of convolution layers increases, problems such as slow network convergence and low accuracy appear, the feature extraction capability remains insufficient, the extracted features differ in importance for image reconstruction, and the extraction of highly effective features is hard to guarantee. Therefore, in this embodiment, multi-scale semantic feature extraction on multiple multi-view images extracts deep features and obtains feature maps of multiple scales, and the introduction of the channel attention mechanism lets the network attend to more effective features, suppressing inefficient features and improving the effectiveness of feature extraction.
And S200, performing multi-scale semantic segmentation on the feature maps of various scales to obtain a semantic segmentation set of various scales.
Specifically, clustering is carried out on the characteristic graphs of multiple scales through nonnegative matrix factorization to obtain a semantic segmentation set of multiple scales; wherein the expression of the non-negative matrix factorization is as follows:
Figure 93964DEST_PATH_IMAGE001
the method comprises the following steps of mapping, connecting and remolding feature maps of various scales into a matrix V with HW rows and C columns, wherein the P represents a matrix with HW rows and K columns, the Q represents a matrix with K rows and C columns, the H represents a coefficient matrix, the W represents a base matrix, the K represents a non-negative matrix decomposition factor of a semantic cluster number, the C represents the dimension of each pixel, and the F represents the adoption of a non-inducible norm.
A typical matrix decomposition breaks a large matrix into several smaller matrices, but the elements of those factors can be positive or negative, whereas in the real world negative entries in a matrix formed from images, text and the like are meaningless; a decomposition whose elements are all non-negative is therefore meaningful. NMF requires the original matrix V to be non-negative; the matrix V can then be decomposed into the product of two smaller non-negative matrices, and such a decomposition exists, although it is not in general unique. For example, given a matrix V ∈ R^{m×n}, one looks for a non-negative matrix W ∈ R^{m×k} and a non-negative matrix H ∈ R^{k×n} such that V ≈ WH. The decomposition can be understood as follows: each column vector of the original matrix V is a weighted sum of all the column vectors of the left matrix W, the weighting coefficients being the elements of the corresponding column vector of the right matrix H; W is therefore called the basis matrix and H the coefficient matrix.

Referring to fig. 3, the N multi-scale feature maps are first concatenated and reshaped into an (HW, C) matrix V. The NMF is solved with the multiplicative update rules

$$Q \leftarrow Q \odot \frac{P^{\top}V}{P^{\top}PQ}, \qquad P \leftarrow P \odot \frac{VQ^{\top}}{PQQ^{\top}}$$

and V is decomposed by NMF into an (HW, K) matrix P and a (K, C) matrix Q, where K is the NMF factor representing the number of semantic clusters. Owing to the orthogonality constraint of the NMF (QQ^T = I), each row of the (K, C) matrix Q may be considered a C-dimensional cluster centre, corresponding to one of several objects in the view. The rows of the (HW, K) matrix P correspond to the positions of all pixels from the N multi-scale feature maps. In general, the factorization forces the product of each row of P with the matrix Q to better approximate the C-dimensional feature of the corresponding pixel in V. Thus, the semantic category of each position in the image is obtained from the P matrix.
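In practice such a factorization can also be obtained from an off-the-shelf solver. A sketch using scikit-learn's NMF (the initialisation and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(np.random.default_rng(0).standard_normal((64 * 64, 32)))  # (HW, C)
model = NMF(n_components=5, init="nndsvd", max_iter=400)
P = model.fit_transform(V)  # (HW, K): per-pixel cluster coefficients
Q = model.components_       # (K, C): rows act as cluster centres
labels = P.argmax(axis=1)   # semantic category of each pixel position
```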
Referring to FIG. 4, assume the extracted feature maps are F_1, F_2, ..., F_N. Each feature matrix F_i is semantically segmented by clustering (i.e., NMF non-negative matrix factorization) and decomposed into a segmentation set S_i. Because the receptive field of a high-level feature layer is large, its features are more abstract and attend more to the global picture, while a low-level feature layer has a small receptive field and attends more to details. The segmentation sets S_1, ..., S_N obtained by multi-scale semantic segmentation therefore span multiple levels from coarse to fine; the segmentation sets S1 to S3 in fig. 4 contain increasingly more detailed information. Each segmentation set S contains the semantic segmentation result of an input group of images (a reference image and the images to be matched); for example, different colors represent different semantic categories, and a segmentation set containing more detailed information (e.g., the segmentation set S3) will contain more semantic categories.
Most current deep learning three-dimensional reconstruction methods are single-scale, reconstructing objects of different sizes in the image in the same way. Single-scale reconstruction maintains good accuracy and speed in environments of low scene complexity with few small objects, but in environments with complex scenes and many objects of various scales, reconstruction accuracy for small-scale objects is easily insufficient; and using only high-level features leaves the low-level detail information of the image underexploited. Therefore, this embodiment performs multi-scale semantic segmentation on the feature maps of multiple scales and aggregates the semantic information of each scale, enriching it so that the detail information of the low-level feature layers can be fully utilized.
And S300, reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map.
Specifically, in the embodiment, a plurality of multi-view images are reconstructed by a supervised three-dimensional reconstruction method, so as to obtain an initial depth map.
According to the embodiment, the initial depth map is obtained through a supervised three-dimensional reconstruction method, and the reconstruction precision can be improved. Because the supervised three-dimensional reconstruction method has high precision, but needs a large amount of training truth value data, and under certain specific scenes (for example, underwater), the training truth value is difficult to acquire and is difficult to apply. Therefore, step S400 is required to perform semantic guidance on the initial depth map of this embodiment, and the supervised three-dimensional reconstruction method is converted into an unsupervised one, so as to implement the unsupervised three-dimensional reconstruction, thereby overcoming the inherent defects of the supervised three-dimensional reconstruction method.
The supervised three-dimensional reconstruction method in the present embodiment may be any supervised three-dimensional reconstruction method in the prior art, for example MVSNet (MVSNet: Depth Inference for Unstructured Multi-view Stereo), CVP-MVSNet (Cost Volume Pyramid Based Depth Inference for Multi-View Stereo) or PatchmatchNet (PatchmatchNet: Learned Multi-View Patchmatch Stereo), and is not described in detail in this embodiment.
And S400, obtaining the depth maps of various scales based on the semantic segmentation sets and the initial depth map of various scales.
Specifically, in this embodiment, semantic information is used as a supervision signal to combine with a supervised three-dimensional reconstruction method, and the image reconstruction is guided to obtain a depth map, which specifically includes the following processes:
acquiring a plurality of multi-view images through image acquisition equipment, and taking the plurality of multi-view images as input to obtain an initial depth map through a supervised three-dimensional reconstruction method;
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

$$N_j = \frac{HW}{t} \cdot \frac{c_j}{\sum_i c_i}$$

where N_j represents the number of reference points selected for the jth segmentation set, H represents the height of the multi-view image, W represents its width, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, and the sum over i runs over the category counts of all the semantic segmentation sets;

based on each reference point, the matching point of each reference point on the image to be matched is acquired through the following formula:

$$P_i' = K\,T\,\big(D(P_i)\,K^{-1}P_i\big)$$

where P_i' represents the matching point of the ith reference point on the image to be matched, K represents the camera intrinsics, T represents the camera extrinsics, and D(P_i) represents the depth value of reference point P_i in the reference image on the initial depth map;

the semantic category corresponding to each matching point is obtained, and the initial depth map is corrected at each scale by minimizing the semantic loss function to obtain depth maps of multiple scales, the semantic loss function L_sem being calculated as:

$$L_{sem} = \frac{1}{N}\sum_{i=1}^{N} M_i \cdot \Delta S_i$$

where \Delta S_i represents the difference between the semantic information of the ith reference point and that of the ith matching point, M_i represents the mask, and N represents the number of reference points. This embodiment is illustrated by the following example:
Firstly, multiple multi-view images of the same object under different viewing angles are acquired through image acquisition equipment; taking these multi-view images as input, an initial depth map is obtained through a supervised three-dimensional reconstruction method. One of the input multi-view images is selected as the reference image, the remaining images being the images to be matched; a reference point P_i is taken on the reference image, together with its corresponding semantic category S_i in the segmentation set S and its corresponding depth value on the depth map.
For segmentation sets of different levels, the numbers of semantic categories differ: sets with more categories need finer guidance and therefore more reference points, the number of reference points being selected according to the formula

$$N_j = \frac{HW}{t} \cdot \frac{c_j}{\sum_i c_i}.$$

The matching point P_i' corresponding to each reference point on the image to be matched is obtained through the homography relation above, and the semantic category S_i' of the matching point P_i' is read off. If the depth map is accurate (i.e., the depth value at the corresponding position is correct), the semantic category of the matching point computed from the reference point should be the same as that of the reference point; the following semantic loss is therefore computed and minimized:

$$L_{sem} = \frac{1}{N}\sum_{i=1}^{N} M_i \cdot \Delta S_i$$

By minimizing the semantic loss function, the initial depth map is continuously corrected, and an accurate depth map is finally obtained. Semantic information can thus replace ground truth for guidance, converting a supervised three-dimensional reconstruction method into an unsupervised one and realizing unsupervised three-dimensional reconstruction, thereby overcoming the inherent defects of supervised methods.
The semantics of an image can be divided into three layers: a visual layer, an object layer and a concept layer. Visual-layer semantics include colors, lines, contours and the like; object-layer semantics cover the various objects; concept-layer semantics concern understanding of the scene. In the prior art, some three-dimensional reconstruction methods also use semantic information for guidance, but single-scale high-level abstract semantic information (object layer) gives good precision only on reconstruction tasks involving large-scale objects; on small-scale reconstruction tasks it is relatively coarse and its reconstruction precision is poor.
Therefore, in the embodiment, a plurality of multi-view images are used as input, and an initial depth map is obtained by a supervised three-dimensional reconstruction method; obtaining depth maps of various scales based on semantic segmentation sets and initial depth maps of various scales; in the embodiment, semantic guidance is respectively performed on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained.
And S500, constructing a point cloud set with various scales based on the depth maps with various scales.
Specifically, the point cloud set of each scale is constructed from the depth map of that scale by the following expressions:

$$x = \frac{u \cdot z}{f_x}, \qquad y = \frac{v \cdot z}{f_y}, \qquad z = D(u, v)$$

where u represents the abscissa of the depth map, v represents its ordinate, f_x and f_y represent the camera focal lengths obtained from the camera parameters, and x, y and z represent the point cloud coordinates of the transformed points (with u and v measured relative to the principal point).
And S600, according to the scale of the point cloud set, optimizing the point cloud sets with various scales by adopting different radius filtering to obtain the optimized point cloud set.
Specifically, a point cloud set of multiple scales is obtained, and the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in the point cloud set according to the scale of the point cloud set by the following formula:

$$r_l = a \cdot t^{\,l}$$

where r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales, a represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius size corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud sets.
In this embodiment, for point cloud sets of different scales, radius filtering is required after the depth maps are converted, filtering out noise points and optimizing the point cloud data. Because the aggregation degree of the point clouds differs across scales, different filter radii are adopted for point cloud sets of different scales. In radius filtering, the radius corresponding to each point cloud and a preset number of neighboring points are first obtained; only points that have a sufficient number of neighboring points within the radius are retained, and the remaining points are filtered out. For the multi-scale point cloud sets of this embodiment, the semantic category of the points in the segmentation set must also be considered: a point is retained only if it has n neighboring points of the same semantic category within the radius.
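A minimal sketch of this semantics-aware radius filtering, using a k-d tree for the neighbourhood query; the radius and neighbour threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def semantic_radius_filter(points, labels, radius, n_required):
    """Keep a point only if at least n_required neighbours within
    `radius` share its semantic category; all other points are
    filtered out as noise."""
    tree = cKDTree(points)
    keep = np.zeros(len(points), dtype=bool)
    for i, nbrs in enumerate(tree.query_ball_point(points, r=radius)):
        same = sum(1 for j in nbrs if j != i and labels[j] == labels[i])
        keep[i] = same >= n_required
    return points[keep], labels[keep]

# Example: filter a random cloud with 5 semantic categories
rng = np.random.default_rng(0)
pts, labs = rng.random((1000, 3)), rng.integers(0, 5, size=1000)
kept_pts, kept_labs = semantic_radius_filter(pts, labs, radius=0.1, n_required=3)
```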
And S700, reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales.
Specifically, in step S600, point cloud sets of different scales are optimized to obtain point cloud sets optimized in different scales, and the point cloud sets optimized in each scale are reconstructed to obtain three-dimensional reconstruction results in different scales.
And step S800, splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Specifically, the three-dimensional reconstruction results of each scale are spliced and fused to obtain the final three-dimensional reconstruction result. In this embodiment, through the step S700, reconstruction of different scales is performed based on the optimized point cloud set, and the optimized point cloud set is more accurate, so that the final three-dimensional reconstruction result obtained in this embodiment is also more accurate.
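The patent does not detail the splicing-and-fusing step, so the following is only a simple illustrative sketch: per-scale point clouds are concatenated and near-duplicate points are merged by snapping to a voxel grid.

```python
import numpy as np

def fuse_point_clouds(clouds, voxel=0.01):
    """Concatenate per-scale point clouds and collapse points that
    fall into the same voxel, keeping the first occurrence."""
    merged = np.concatenate(clouds, axis=0)
    keys = np.round(merged / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(first)]

# Example: fuse three per-scale reconstructions into one cloud
rng = np.random.default_rng(0)
fused = fuse_point_clouds([rng.random((500, 3)) for _ in range(3)])
```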
In this embodiment, multiple multi-view images are acquired, and multi-scale semantic features are extracted from them to obtain feature maps of multiple scales; multi-scale semantic segmentation of those feature maps yields semantic segmentation sets of multiple scales. Multi-scale semantic feature extraction draws out deep features, and multi-scale semantic segmentation aggregates the semantic information of each scale, enriching it. Taking the multiple multi-view images as input, an initial depth map is obtained by a supervised three-dimensional reconstruction method; depth maps of multiple scales are then obtained from the multi-scale semantic segmentation sets and the initial depth map, the semantic information of each scale guiding and continuously correcting the initial depth map so that accurate depth maps of multiple scales result. Point cloud sets of multiple scales are constructed from these depth maps and optimized with radius filtering whose radius depends on the scale of the point cloud set; reconstruction at the different scales is performed from the optimized point cloud sets, and the per-scale reconstruction results are spliced and fused into the final result. Because the optimized point cloud sets are more accurate, the fused reconstruction result is also more accurate; semantic information of every scale is fully utilized, and the accuracy of three-dimensional reconstruction is improved.
Referring to fig. 5, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction system, which includes a feature map obtaining unit 100, a semantic segmentation set obtaining unit 200, an initial depth map obtaining unit 300, a depth map obtaining unit 400, a point cloud set obtaining unit 500, a radius filtering unit 600, a reconstruction result obtaining unit 700, and a reconstruction result fusion unit 800, where:
the feature map obtaining unit 100 is configured to acquire a plurality of multi-view images, perform multi-scale semantic feature extraction on the multi-view images, and obtain feature maps of multiple scales;
a semantic segmentation set acquisition unit 200, configured to perform multi-scale semantic segmentation on feature maps of multiple scales to obtain a semantic segmentation set of multiple scales;
an initial depth map obtaining unit 300, configured to reconstruct the multiple multi-view images by using a supervised three-dimensional reconstruction method to obtain an initial depth map;
a depth map obtaining unit 400, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map;
a point cloud set obtaining unit 500, configured to construct a point cloud set with multiple scales based on depth maps with multiple scales;
the radius filtering unit 600 is configured to optimize point cloud sets of multiple scales by using different radius filtering according to the scale of the point cloud set, so as to obtain an optimized point cloud set;
a reconstruction result obtaining unit 700, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit 800 is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
It should be noted that, since the deep learning-based multi-view three-dimensional reconstruction system of this embodiment is based on the same inventive concept as the above deep learning-based multi-view three-dimensional reconstruction method, the corresponding contents of the method embodiment are also applicable to this system embodiment and are not described in detail here.
The embodiment of the invention also provides a deep learning-based multi-view three-dimensional reconstruction device, comprising: at least one control processor and a memory communicatively connected to the at least one control processor.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the deep learning-based multi-view three-dimensional reconstruction method of the above embodiments are stored in the memory; when executed by the processor, they perform the deep learning-based multi-view three-dimensional reconstruction method of the above embodiments, for example, method steps S100 to S800 in fig. 1.
The above-described system embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions which, when executed by one or more control processors, cause the one or more control processors to perform the deep learning-based multi-view three-dimensional reconstruction method of the above method embodiments, for example, the functions of method steps S100 to S800 in fig. 1.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software together with a general-purpose hardware platform. Those skilled in the art will also appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (9)

1. A multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:
acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales;
performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
obtaining the depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map, specifically:
selecting any one of the multiple multi-view images as a reference image, and taking the others as images to be matched;
selecting a reference point from the reference image, acquiring the semantic category corresponding to the reference point in the semantic segmentation set, and acquiring the depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

N_j = t \cdot HW \cdot \frac{c_j}{\sum_{i=1}^{n} c_i}

wherein N_j represents the number of reference points selected from the jth semantic segmentation set, H represents the height of the multi-view image, W represents the width of the multi-view image, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, c_i represents the number of semantic categories contained in the ith semantic segmentation set, and n represents the total number of semantic segmentation sets;
based on each reference point, acquiring the matching point of each reference point on the image to be matched through the following formula:

p_i' = K \, T \left( D(P_i) \cdot K^{-1} P_i \right)

wherein p_i' represents the matching point of the ith reference point on the image to be matched, K represents the intrinsic parameters of the camera, T represents the extrinsic parameters of the camera, and D(P_i) represents the depth value of the reference point P_i in the reference image on the initial depth map;
obtaining the semantic category corresponding to each matching point, and correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of the multiple scales, wherein the semantic loss function L_{sem} is calculated as follows:

L_{sem} = \frac{1}{N} \sum_{i=1}^{N} M_i \cdot \Delta s_i

wherein \Delta s_i represents the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M_i represents a mask, and N represents the number of reference points;
constructing a point cloud set with multiple scales based on the depth maps with multiple scales;
according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;
reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;
and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
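For illustration only (outside the claim language), a minimal numpy sketch of the matching-point projection and semantic loss recited in claim 1 follows, assuming a pinhole model in which the extrinsics T are given as a rotation R and translation t; all names are illustrative.

```python
import numpy as np

def project_points(K, R, t, pix, depths):
    """Matching-point formula of claim 1: lift reference pixels to 3D with
    their initial depth values, transform into the view to be matched, and
    project back through the intrinsics K."""
    homo = np.hstack([pix, np.ones((pix.shape[0], 1))])  # homogeneous pixels (N, 3)
    cam = (np.linalg.inv(K) @ homo.T) * depths           # 3D points in the reference frame
    cam2 = R @ cam + t.reshape(3, 1)                     # extrinsic transform T = [R | t]
    proj = K @ cam2
    return (proj[:2] / proj[2]).T                        # (N, 2) matching points

def semantic_loss(sem_ref, sem_match, mask):
    """Masked mean disagreement between the semantics of reference points
    and their matching points (L_sem of claim 1, up to the exact form of
    the per-point difference)."""
    diff = (np.asarray(sem_ref) != np.asarray(sem_match)).astype(float)
    return float((np.asarray(mask) * diff).mean())
```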
2. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic feature extraction on the multiple multi-view images to obtain feature maps of multiple scales comprises:
performing multi-layer feature extraction on the multi-view images through a ResNet network to obtain original feature maps with various scales;
and respectively connecting the original feature map of each scale with channel attention so as to carry out importance weighting on the original feature map of each scale through a channel attention mechanism and obtain feature maps of various scales.
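For illustration only, a minimal sketch of the multi-layer extraction in claim 2, using torchvision's feature-extraction utility on a ResNet backbone; the choice of ResNet-18 and of the tapped layers is illustrative.

```python
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Tap the four ResNet stages as original feature maps of four scales.
backbone = resnet18(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "s1", "layer2": "s2", "layer3": "s3", "layer4": "s4"},
)
feats = extractor(torch.randn(1, 3, 256, 256))  # strides 4, 8, 16, 32
```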
3. The deep learning-based multi-view three-dimensional reconstruction method according to claim 2, wherein the weighting of importance of the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales comprises:
compressing the original feature map of each scale through a compression network to obtain a one-dimensional feature map corresponding to the original feature map of each scale;

inputting the one-dimensional feature map into a fully connected layer through an excitation network to perform importance prediction and obtain the importance of each channel;

and weighting the original feature map of each scale by the importance of each channel through an excitation function to obtain the feature maps of multiple scales.
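A minimal PyTorch sketch of the compression/excitation mechanism of claims 2 and 3 follows; the squeeze-and-excitation structure and the reduction ratio are illustrative assumptions consistent with the steps recited above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: compress each scale's
    feature map to a one-dimensional descriptor, predict per-channel
    importance with fully connected layers, then reweight the map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # compression network
        self.excite = nn.Sequential(                # excitation network
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # excitation function
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)              # one-dimensional feature map
        w = self.excite(w).view(b, c, 1, 1)         # importance of each channel
        return x * w                                # importance-weighted feature map
```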
4. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales includes:
clustering the feature maps of the multiple scales through non-negative matrix factorization to obtain the semantic segmentation sets of the multiple scales, wherein the non-negative matrix factorization is expressed as:

\min_{P \ge 0,\ Q \ge 0} \| V - PQ \|_F^2

wherein the feature maps of the multiple scales are mapped, concatenated and reshaped into a matrix V with HW rows and C columns; P represents the coefficient matrix with HW rows and K columns; Q represents the basis matrix with K rows and C columns; K represents the number of semantic clusters serving as the factorization rank; C represents the feature dimension of each pixel; and F denotes the Frobenius norm.
5. The method for multi-view three-dimensional reconstruction based on deep learning of claim 1, wherein the constructing the point cloud sets of multiple scales based on the depth maps of multiple scales comprises:
constructing the point cloud set of each scale from the depth map of each scale according to the following expressions:

x = \frac{u \cdot z}{f_x}, \quad y = \frac{v \cdot z}{f_y}, \quad z = d(u, v)

wherein u represents the abscissa of the depth map, v represents the ordinate of the depth map, f_x and f_y represent the camera focal lengths obtained from the camera parameters, d(u, v) represents the depth value at pixel (u, v), and x, y and z represent the coordinates of the converted point cloud.
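A minimal numpy sketch of the back-projection in claim 5 follows; the principal point (cx, cy) is an assumption the expressions above leave implicit, and with cx = cy = 0 the code reduces exactly to them.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx=0.0, cy=0.0):
    """Back-project an (H, W) depth map to an (H*W, 3) point cloud; with
    cx = cy = 0 this reduces to x = u*z/fx, y = v*z/fy, z = d(u, v)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # ordinate v, abscissa u
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```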
6. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the optimization of the point cloud sets of multiple scales by using different radius filters according to the scales of the point cloud sets to obtain an optimized point cloud set comprises:
acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in the point cloud set according to the scale of the point cloud set by the following formula:

r_l = \alpha \cdot t^{l}

wherein r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales, \alpha represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain an optimized point cloud set.
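For illustration only, the following sketch ties claim 6 together, deriving the per-scale radius from the preset scale grade l (reading the formula as r_l = alpha * t**l) and reusing the hypothetical semantic_radius_filter helper sketched in the description above; alpha, t, and n_min are illustrative constants.

```python
def filter_point_cloud_set(points, labels, grade, alpha=0.01, t=2.0, n_min=8):
    """Pick a radius from the preset scale grade l, then apply the
    semantic-aware radius filter (see semantic_radius_filter above)."""
    radius = alpha * (t ** grade)   # coarser scales (larger l) get a larger radius
    return semantic_radius_filter(points, labels, radius, n_min)
```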
7. A deep learning based multi-view three-dimensional reconstruction system, comprising:
the feature map acquisition unit is used for acquiring a plurality of multi-view images, performing multi-scale semantic feature extraction on the multi-view images, and obtaining feature maps of multiple scales;
the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to obtain a semantic segmentation set with various scales;
the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
a depth map obtaining unit, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map, specifically:
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting a reference point from the reference image, acquiring the semantic category corresponding to the reference point in the semantic segmentation set, and acquiring the depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

N_j = t \cdot HW \cdot \frac{c_j}{\sum_{i=1}^{n} c_i}

wherein N_j represents the number of reference points selected from the jth semantic segmentation set, H represents the height of the multi-view image, W represents the width of the multi-view image, HW represents the number of pixel points of the multi-view image, t represents a constant parameter, c_j represents the number of semantic categories contained in the jth semantic segmentation set, c_i represents the number of semantic categories contained in the ith semantic segmentation set, and n represents the total number of semantic segmentation sets;
based on each reference point, obtaining the matching point of each reference point on the image to be matched through the following formula:

p_i' = K \, T \left( D(P_i) \cdot K^{-1} P_i \right)

wherein p_i' represents the matching point of the ith reference point on the image to be matched, K represents the intrinsic parameters of the camera, T represents the extrinsic parameters of the camera, and D(P_i) represents the depth value of the reference point P_i in the reference image on the initial depth map;
obtaining the semantic category corresponding to each matching point, and correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of the multiple scales, wherein the semantic loss function L_{sem} is calculated as follows:

L_{sem} = \frac{1}{N} \sum_{i=1}^{N} M_i \cdot \Delta s_i

wherein \Delta s_i represents the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M_i represents a mask, and N represents the number of reference points;
the point cloud set acquisition unit is used for constructing point cloud sets with various scales based on the depth maps with various scales;
the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;
a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
8. A deep learning based multi-view three-dimensional reconstruction device comprising at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 6.
9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method for deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 6.
CN202211087276.9A 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning Active CN115170746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211087276.9A CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Publications (2)

Publication Number Publication Date
CN115170746A CN115170746A (en) 2022-10-11
CN115170746B true CN115170746B (en) 2022-11-22

Family

ID=83481918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211087276.9A Active CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN115170746B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457101B (en) * 2022-11-10 2023-03-24 武汉图科智能科技有限公司 Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform
CN118096995A (en) * 2022-11-21 2024-05-28 华为云计算技术有限公司 Three-dimensional twin method and device
CN117876397B (en) * 2024-01-12 2024-06-18 浙江大学 Bridge member three-dimensional point cloud segmentation method based on multi-view data fusion

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715504A (en) * 2015-02-12 2015-06-17 四川大学 Robust large-scene dense three-dimensional reconstruction method
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
US11004202B2 (en) * 2017-10-09 2021-05-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for semantic segmentation of 3D point clouds
CN108388639B (en) * 2018-02-26 2022-02-15 武汉科技大学 Cross-media retrieval method based on subspace learning and semi-supervised regularization
EP3970114A4 (en) * 2019-05-17 2022-07-13 Magic Leap, Inc. Methods and apparatuses for corner detection using neural network and corner detector
US11645756B2 (en) * 2019-11-14 2023-05-09 Samsung Electronics Co., Ltd. Image processing apparatus and method
CN111340186B (en) * 2020-02-17 2022-10-21 之江实验室 Compressed representation learning method based on tensor decomposition
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN113066168B (en) * 2021-04-08 2022-08-26 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN113673400A (en) * 2021-08-12 2021-11-19 土豆数据科技集团有限公司 Real scene three-dimensional semantic reconstruction method and device based on deep learning and storage medium
CN114881867A (en) * 2022-03-24 2022-08-09 山西三友和智慧信息技术股份有限公司 Image denoising method based on deep learning
CN114677479A (en) * 2022-04-13 2022-06-28 温州大学大数据与信息技术研究院 Natural landscape multi-view three-dimensional reconstruction method based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant